A project in the field must adapt to harsh reality
Scott Rosenthal
February, 1997
We've all experienced failures such as cars reporting errors that don't exist, VCRs
that flash the time as 12:00 and computerized toys that don't function. The reason is
simple: compared to the benign environment of a development lab, the real world is a harsh
place for electronic systems. Hence engineers must design, develop and test products with
real-world events in mind. Ignoring "cockpit" error, I've found that the three
main causes for failures in the field are inadequate signal debouncing, electrostatic
discharge (ESD) and power. Therefore, this month I'll begin addressing how to make
embedded systems more reliable or robust by looking at these three issues, starting with
signal debouncing.
Rebounding logic levels
While many real-world conditions can obviously affect electronic systems adversely, one
of the most common problems is transient conditions on input signals. I'm not
primarily concerned with debouncing operator controls and switches; those are problems
you find in the lab. Instead, I'm talking about a lack of robustness in sensing
faulty conditions.
Many systems incorporate status sensors to indicate if a faulty condition exists.
Examples include lamp burnout, door open, high-voltage failure and air-pressure faults.
Often a system uses a signal from one of these status sensors to abort its operation. This
function normally works great in the lab where a designer can emulate the signal input
with a clean logic level. In the field, though, problems can arise with this monitoring
when contact bounce or other transient conditions fool the monitor software into declaring
a fault.
If a sensor signal can stop system operation, the system must then verify that the
problem signal really exists. One technique is to debounce all input status signals. The
assumption that makes this technique work is that most legitimate faults aren't transient
events. Consider a system in which a motor drives a fan; the system then monitors the
fan's operation with an air-pressure switch. When the motor starts, a transient event
lasting a few milliseconds can occur on the air-pressure signal. This event might result
from crosstalk between signals, a ground loop, a mechanical coupling between the fan and
sensor or the sensor's construction. The important point is that in most situations, a
glitch of a few milliseconds on an air stream is probably meaningless; a true air-pressure
fault lasts for a second or more. Using this information, you can design software to
ignore fast transients and declare faults only on stable long-term signals.
One way of debouncing a status signal in software is to set up a structure such as the
one in Listing 1a. The two general-purpose debouncing functions in Listing 1b, in turn,
can employ this structure to determine the validity of any status signal. Listing 2 shows
this code in action debouncing an air-pressure switch that the system checks every 100
msec. This code assumes that to declare a failure, the switch must return an error state
for 1 sec (ten monitor cycles).
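The arithmetic behind "ten monitor cycles" can be sketched as a small helper. HysteresisCount is a hypothetical name that does not appear in the original listings; it assumes the task runs at a fixed period and rounds up so the debounce window is never shorter than requested.

```c
/* Hypothetical helper, not part of the original listings: the number of
   consecutive bad readings needed so that the debounce window is at
   least debounce_ms long when the task runs every period_ms.
   Returns 0 for an invalid (nonpositive) period. */
static int HysteresisCount(int debounce_ms, int period_ms) {
    if (period_ms <= 0) {
        return 0;               /* invalid configuration */
    }
    if (debounce_ms <= 0) {
        return 1;               /* always require at least one bad reading */
    }
    /* Round up so a partial period still counts as a full cycle. */
    return (debounce_ms + period_ms - 1) / period_ms;
}
```

With a 1-sec debounce time and a 100-msec task period, this gives the count of 10 used for the air-pressure switch.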
The advantage of using the structure and the two support functions is that the software
module becomes a generic debouncer for status signals. Just make sure that the hysteresis
value multiplied by the repetition period of the code is long enough to truly debounce the
signal. In addition, this technique can also debounce measurement problems. For example,
consider a system monitoring an analog signal. Any analog measurement demonstrates some
degree of variability, but as a level approaches an error threshold this variability
appears as contact bounce. The code in Listing 1 debounces this input, as well, and the
only difference is that the code measures an analog and not a binary signal.
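As a rough sketch of the analog case, the same structure can debounce a reading that hovers around an error threshold. The structure and the two support functions are restated from Listing 1 (with the counter decrement assumed) so the block compiles on its own; AnalogTask, EverFaulted and the integer reading and limit values are hypothetical and only illustrate the idea.

```c
#include <stddef.h>

/* Restated from Listing 1 so this sketch is self-contained. */
typedef struct {
    int Hysteresis;   /* consecutive bad readings needed */
    int LocalCount;   /* bad readings still required */
    int Error;        /* nonzero once the fault is debounced */
} TDebounced;

static void ErrorClear(TDebounced *ptr) {
    ptr->Error = 0;
    ptr->LocalCount = ptr->Hysteresis;
}

static int ErrorSet(TDebounced *ptr) {
    if (ptr->Error == 0) {
        if (ptr->LocalCount != 0) {
            --ptr->LocalCount;
        }
        if (ptr->LocalCount == 0) {
            ptr->Error = 1;
        }
    }
    return ptr->Error;
}

/* Hypothetical task: a reading above the limit counts as a bad sample. */
static int AnalogTask(TDebounced *d, int reading, int limit) {
    if (reading > limit) {
        ErrorSet(d);
    } else {
        ErrorClear(d);
    }
    return d->Error;
}

/* Hypothetical helper: replay recorded readings and report whether a
   fault was declared on any cycle (1) or never (0). */
static int EverFaulted(const int *readings, size_t count,
                       int limit, int hysteresis) {
    TDebounced d;
    size_t i;

    d.Hysteresis = hysteresis;
    ErrorClear(&d);
    for (i = 0; i < count; i++) {
        if (AnalogTask(&d, readings[i], limit)) {
            return 1;
        }
    }
    return 0;
}
```

A reading that dithers across the limit keeps reloading the counter and never trips, while a reading that stays above the limit for the full hysteresis count does.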
Baby lightning
After debouncing, the area causing me the most problems over the past year has been
ESD. In many companies, ESD testing and protection is an engineering discipline in itself.
Obviously, I won't be able to do justice to the entire topic in just part of a column, but
I will present two examples: one common and the other fairly esoteric.
Since the mid-'70s, one of the most common chips in embedded designs has been the 8255
(both the NMOS and now the CMOS 82C55). This device goes by many names, including PIO and
PIC. For those readers not familiar with it, the chip provides three 8-bit I/O ports that
software can configure independently as input, output or a special function.
I've worked with this device many times and have learned that its Reset line is
extremely sensitive to any disturbances. The reset causes all I/O pins to return to their
input state and generally kills a system's operation. For years my solution to this
problem was to ground the Reset line and reset the chip with software, an approach
that worked well until the age of ESD testing.
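A minimal sketch of what such a software reset might look like, assuming the chip's four registers sit at consecutive addresses and that the caller supplies a port-write routine. The control word 0x9B selects mode 0 with ports A, B and C all configured as inputs, which is the configuration a hardware reset leaves. PpiSoftReset, TPortWrite and the simulated register file are hypothetical names used only for illustration.

```c
#include <stdint.h>

#define PPI_CONTROL_OFFSET   3u     /* A1:A0 = 11 selects the control register */
#define PPI_ALL_INPUTS_MODE0 0x9Bu  /* mode-set flag, mode 0, A/B/C as inputs */

/* Hypothetical hook: the caller supplies whatever writes an I/O port. */
typedef void (*TPortWrite)(uint16_t address, uint8_t value);

/* Put the device into its post-reset configuration by writing a
   mode-set control word instead of pulsing the Reset pin. */
static void PpiSoftReset(uint16_t base, TPortWrite write_port) {
    write_port((uint16_t)(base + PPI_CONTROL_OFFSET), PPI_ALL_INPUTS_MODE0);
}

/* Host-side stand-in for the hardware so the sketch can be exercised
   without a board: four registers indexed by the low address bits. */
static uint8_t SimRegs[4];

static void SimWrite(uint16_t address, uint8_t value) {
    SimRegs[address & 3u] = value;
}
```

On real hardware the write routine would be an I/O-port or memory-mapped access for the target processor; the base address is board-specific.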
The first thing I discovered is that the sensitivity of the Reset signal to a
disturbance (whether power-supply noise, ESD or whatever) varies tremendously from
vendor to vendor. For example, Intel's parts seem extremely sensitive to any type of
glitch. When I informed Intel of this problem, that supplier's reaction was to tell us to
use someone else's part! Apparently a large number of manufacturers are following this
advice because I did a quick check of ESD-certified boards and found they all use the
82C55 from NEC, which is essentially immune to the interference.
The esoteric experience happened to another engineer at my company. During one test
sequence, he had to move a module from one piece of equipment to another across the room.
When inserting the module into the second system he created an error in the device. We
traced the problem back to his wheelchair. It turns out that the hubs on his rear wheels
build up as much as 50 kV of static potential. This phenomenon also raised the potential of the module
on his lap. When he plugged it into the assembly, the module discharged and thereby caused
the failure. This problem pathway helped us uncover and solve an intermittent field
problem that occurred during transport of this module. (Incidentally, only one of his two
wheelchairs exhibited this problem.)
Power problems
Finally, it's amazing how often line power can cause problems. I remember being in
Venezuela with an instrument that required a good ground. The only one available was a
water pipe on the outside ledge on the tenth floor. I crawled out onto the ledge, sanded
the pipe and used a hose clamp to attach a ground wire from the instrument.
In addition, power coming out of the wall might not be what you think it is. Line
voltages vary between different countries, so just because a power supply says 115/230
doesn't mean it necessarily works in Japan at 100V or in the UK at 240V. Further, voltage
and frequency specs always include a tolerance, which the power company uses to define
"good" power. Remember that just because the power is OK at the line coming into
the building, there's no guarantee that it's still good at an outlet.
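To make the point concrete, here is a rough sketch that checks whether a supply's input range covers a given line voltage. The tolerance figures are assumptions chosen purely for illustration, not values from any particular supply or utility; SupplyCoversLine is a hypothetical name.

```c
/* Hypothetical check: returns 1 if the whole line-voltage range fits
   inside the supply's input range, otherwise 0. Voltages are nominal
   values in volts; tolerances are plus/minus percentages. All values
   are scaled by 100 so the arithmetic stays in integers. */
static int SupplyCoversLine(int supply_v, int supply_tol_pct,
                            int line_v, int line_tol_pct) {
    long supply_lo = (long)supply_v * (100 - supply_tol_pct);
    long supply_hi = (long)supply_v * (100 + supply_tol_pct);
    long line_lo   = (long)line_v * (100 - line_tol_pct);
    long line_hi   = (long)line_v * (100 + line_tol_pct);

    return line_lo >= supply_lo && line_hi <= supply_hi;
}
```

If the 115V setting is assumed to accept plus or minus 10%, its range is 103.5V to 126.5V, so a 100V line falls outside it even before any line tolerance is applied. Under the same assumption the 230V setting covers a nominal 240V line, but not a 240V line that itself runs 10% high.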
Beyond voltage/frequency considerations, noise on line power can also be a killer. A
company I previously worked for sold measurement instruments to grain co-ops in the
Midwest. Most of these units needed conditioning boxes to suppress power problems caused
by numerous summer thunderstorms.
In a similar vein, I recently came across a test specification for electronic systems
used in a fast-food restaurant; it simulates the power problems such a unit might
experience in deployment. For example, as the heater for the French fryer switches on and
off, the spec wants to ensure that the milkshake machine doesn't start spitting out fluid
onto the counter. All electronic devices purchased for this particular fast-food joint
must pass a test consisting of various line-cycle disturbances including missing cycles,
voltage spikes, sags and interference. If you really want to harden a device against
power-line problems, perform these kinds of tests, as well.
Listing 1a: Debounce data structure
typedef struct {
    int Hysteresis; /* number of consecutive bad readings needed */
    int LocalCount; /* bad readings still required before declaring an error */
    int Error; /* if not 0, then we have an error */
} TDebounced;
Listing 1b: Debounce code
void ErrorClear(TDebounced *ptr) {
    ptr->Error = 0; /* no error */
    ptr->LocalCount = ptr->Hysteresis; /* reinitialize counter */
}

int ErrorSet(TDebounced *ptr) {
    if (ptr->Error == 0) { /* if no error yet */
        if (ptr->LocalCount != 0) {
            --ptr->LocalCount; /* one less time period */
        }
        if (ptr->LocalCount == 0) {
            ptr->Error = 1; /* we have an error */
        }
    }
    return ptr->Error; /* return the current error code */
}
Listing 2: Debounce in action
int PressureSw(void); /* assumed to be provided elsewhere; returns 0 on failure */
static TDebounced AirError = {10, 10, 0}; /* 10 times in a row is one second; counter starts full */
int AirPressureTask(int flag) {
    if (flag == 1) { /* if 1, then initialize */
        ErrorClear(&AirError); /* this will initialize the debounce info */
    }
    if (PressureSw() == 0) { /* 0 means a failure */
        ErrorSet(&AirError); /* an error; see if fully debounced */
    } else {
        ErrorClear(&AirError); /* no error; reset debounce info */
    }
    return AirError.Error; /* return the current error status */
}
Adapted from an article that appeared in Personal Engineering & Instrumentation
News.