Watchdog Timer Techniques

When you design an embedded system, you want to make certain the hardware and software work properly. It follows that the hardware and software will always work perfectly, right? Well, maybe not. Even a perfect system can get clobbered by something as exotic as a stray cosmic ray. And most systems are not quite perfect, as we all know. So, a conservatively designed system should include provisions for error detection and recovery both in software and hardware. One common technique is to use a watchdog timer.

A watchdog timer is a hardware circuit that counts input pulses up to a certain limit. The input pulses usually come from a very reliable clock source such as the main system clock generator. If the limiting count is reached, the circuit generates an output signal that is treated as a fault and can be used to initiate recovery. The fault signal is usually tied into the system reset circuitry and causes the system hardware and software to reset. To keep the counter from reaching the limit, the software must toggle another input signal to reset the counter. This is called retriggering the watchdog timer, and periodic retriggering keeps the watchdog timer from expiring and resetting the system.
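To make the counting behavior concrete, here is a small host-side model written in C. It is only a sketch of the logic described above, not a driver for any real part. The names wdt_model, wdt_model_init, wdt_model_clock, and wdt_model_retrigger are invented for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Host-side model of a watchdog counter (all names are illustrative). */
typedef struct {
    uint32_t count;   /* clock pulses counted since the last retrigger */
    uint32_t limit;   /* count at which the fault signal is asserted */
    bool     fault;   /* latched once the limit has been reached */
} wdt_model;

static void wdt_model_init(wdt_model *w, uint32_t limit)
{
    w->count = 0;
    w->limit = limit;
    w->fault = false;
}

/* One input clock pulse. Returns true if the fault signal is asserted. */
static bool wdt_model_clock(wdt_model *w)
{
    if (!w->fault && ++w->count >= w->limit)
        w->fault = true;
    return w->fault;
}

/* Software retrigger: restart the count. A latched fault stays latched,
   since in a real system the reset would already be under way. */
static void wdt_model_retrigger(wdt_model *w)
{
    if (!w->fault)
        w->count = 0;
}
```

With a limit of three pulses, two pulses followed by a retrigger leave the model quiet, while three pulses in a row without a retrigger latch the fault.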

A watchdog timer is not a very sophisticated error detection and recovery mechanism. The only error the watchdog timer detects is that you have failed to poke it often enough. And the only recovery is to reset the entire system. So a watchdog timer is used as a final, crude, fail-safe mechanism to detect catastrophic unrecoverable errors and to resort to an equally catastrophic recovery. The hope is that resetting the system will fix whatever is broken and will at least leave the system in a better state.

However, through some extra hardware and software, you can get more mileage out of your simple watchdog timer circuit. Let's look at how to implement those extra features.

Hardware Features

To be most reliable, a watchdog timer should start automatically as soon as power is applied. It shouldn't require any software configuration, because if the software fails, the watchdog timer won't start.

The watchdog timer should use a very reliable clock as the input clock, such as the main system clock generator. Unfortunately, this is usually a very fast clock, so the watchdog timer counting circuit requires many "divide by" stages. You might be tempted to use a more convenient (slower) clock source, but beware of clocks that come from software-configured devices. If the software fails, the watchdog timer might not work. Sometimes, you can get a convenient clock from an external source, but this isn't a good idea either. If the external source fails or becomes disconnected, your watchdog timer won't work.
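The cost of a fast clock is easy to estimate. The following C sketch computes how many divide-by-two stages a counter needs for a given timeout. The function name wdt_divider_stages and its parameters are invented for this example, and it assumes a plain binary ripple counter.

```c
#include <stdint.h>

/* Number of divide-by-two stages needed so that a counter clocked at
   clock_hz takes at least timeout_us microseconds to overflow.
   Assumes a simple binary counter; the name is illustrative. */
static unsigned wdt_divider_stages(uint32_t clock_hz, uint32_t timeout_us)
{
    /* Pulses to count, rounded up. 64-bit math avoids overflow. */
    uint64_t pulses = ((uint64_t)clock_hz * timeout_us + 999999u) / 1000000u;
    unsigned stages = 0;

    while (((uint64_t)1 << stages) < pulses)
        stages++;
    return stages;
}
```

For example, a 1-MHz counting rate needs 16 stages for a timeout of about 65 milliseconds, and an 8-MHz clock needs 23 stages for a one-second timeout.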

It should be difficult to accidentally retrigger the watchdog timer. Retriggering it means "everything is OK." If the software fails but the watchdog timer keeps getting retriggered, the system might not recover. One way to make unintentional retriggering more difficult is to fully decode the memory or I/O address of the watchdog timer. This way, the software must write to a single address out of perhaps millions. A partial decode gets by with simpler hardware, but now the software can write to any one of hundreds or thousands of addresses, depending on the hardware. Only one of those addresses is valid—all the others indicate failed software, but they retrigger the watchdog timer anyway. Another reliable mechanism is to require writes to two different addresses within microseconds. You can arrange the timing so the only valid retriggering sequence is two back-to-back store or output instructions. If the second instruction doesn't immediately follow, the watchdog timer ignores the retrigger operation.
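The two-write scheme can be modeled in a few lines of C. This is a sketch of the decode logic only, under assumed values: the addresses WDT_ADDR_A and WDT_ADDR_B, the window WDT_WINDOW, and the function wdt_seq_write are all hypothetical, and real hardware would implement this in logic rather than software.

```c
#include <stdbool.h>
#include <stdint.h>

#define WDT_ADDR_A  0x7FF0u  /* hypothetical first retrigger address */
#define WDT_ADDR_B  0x7FF1u  /* hypothetical second retrigger address */
#define WDT_WINDOW  4u       /* hypothetical: max bus cycles between writes */

typedef struct {
    bool     armed;     /* first write seen, waiting for the second */
    uint32_t armed_at;  /* bus cycle number of the first write */
} wdt_seq;

/* Model of the decode logic, called for every bus write. Returns true
   only when this write completes a valid retrigger sequence. */
static bool wdt_seq_write(wdt_seq *s, uint16_t addr, uint32_t cycle)
{
    if (addr == WDT_ADDR_A) {
        s->armed = true;            /* first half of the sequence */
        s->armed_at = cycle;
        return false;
    }
    if (addr == WDT_ADDR_B && s->armed &&
        (cycle - s->armed_at) <= WDT_WINDOW) {
        s->armed = false;           /* sequence complete: retrigger */
        return true;
    }
    s->armed = false;               /* anything else breaks the sequence */
    return false;
}
```

In this model, a write to the second address alone, a second write that arrives too late, or any intervening write to another address all fail to retrigger the watchdog timer.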

It should be nearly impossible for the software to accidentally disable the watchdog timer. You could use a full address decode and back-to-back writes, or you could eliminate all software disabling mechanisms. With some internal CPU watchdog timers, however, you don't have the latter choice. In those instances, try periodically reenabling the watchdog timer, just in case you inadvertently disabled it.

You should have a way to disable the watchdog timer using an external jumper. When you debug software using an emulator and you hit a breakpoint, the software stops running and the watchdog timer won't be retriggered. If the watchdog timer expires, it will reset the system and make debugging difficult. For some internal CPU watchdog timers, the software will have to read the external jumper and disable the watchdog timer via software. Design your jumper configuration so it's difficult to ship a system with the watchdog timer disabled. You can bring the disable signal off the board via an existing connector and install an external jumper on systems used for software development.

The watchdog timer duration determines how often the software has to retrigger the watchdog timer. The duration shouldn't be too long or too short. If the duration is as long as several seconds, the system can stay impaired for several seconds before being reset. This means your system could be doing something dangerous until things are brought under control. If the duration is as short as a few milliseconds, it could be inconvenient for the software to keep retriggering the watchdog timer. It's common to retrigger the watchdog timer once every pass through the background task loop, and task loops usually don't run very fast.

Don't forget that during software initialization, you might be doing lengthy RAM tests or other diagnostic functions that also need to retrigger the watchdog timer. You might also have a problem if you're using a high-level language like C. During initialization, you have to call a vendor-provided initialization routine to initialize the C run-time support (such as setting up initialized variables). On a slow 8-bit processor with a lot of initialized variables, this could take many milliseconds. Whatever the watchdog timer duration, it's never a problem if the software retriggers the watchdog timer more frequently than necessary.
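As an illustration of retriggering during a lengthy initialization step, here is a sketch of a simple RAM test in C that retriggers the watchdog timer at regular intervals. The chunk size WDT_CHUNK and the function ram_test are assumptions made for this example. The retrigger_wdt() shown here is only a host stand-in that counts calls; on a real system it would poke the watchdog hardware.

```c
#include <stddef.h>
#include <stdint.h>

#define WDT_CHUNK 256u  /* hypothetical: bytes tested between retriggers */

static unsigned wdt_kicks;  /* host stand-in: counts retriggers */

/* Stand-in for the real routine, which would access the hardware. */
static void retrigger_wdt(void)
{
    wdt_kicks++;
}

/* Destructive pattern test of a RAM region. Retriggers the watchdog
   timer at the start of every WDT_CHUNK bytes so a long test cannot
   cause a timeout. Returns 0 on success, -1 on the first bad byte. */
static int ram_test(volatile uint8_t *base, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if ((i % WDT_CHUNK) == 0)
            retrigger_wdt();
        base[i] = 0x55;
        if (base[i] != 0x55)
            return -1;
        base[i] = 0xAA;
        if (base[i] != 0xAA)
            return -1;
    }
    return 0;
}
```

The chunk size would be chosen so the time to test one chunk on the target CPU stays comfortably below the watchdog timer duration.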

Timeout Actions

If the watchdog timer expires, it is because the software hasn't retriggered it often enough. The software might be stuck in a loop somewhere, or it might have crashed altogether. In any event, it is a problem, and you have to recover.

The most common recovery mechanism is to perform a hardware reset on the entire system. Since it's usually not possible to diagnose the failure at run-time and perform a more graceful recovery, a system reset must suffice, however drastic. When you reset the system due to a watchdog timeout, be sure to reset all the system hardware as if the power were just applied. Some internal CPU watchdog timers reset the CPU via a special restart vector. In this situation, whenever the CPU starts up, the software should toggle a reset line for the rest of the hardware, in case the CPU reset was due to a watchdog timeout. It's important to make sure the watchdog timeout resets all the hardware, so the normal system initialization software can assume that a known set of initial conditions exists (such as the "reset conditions" specified in peripheral chip data books). A full hardware reset also stops whatever the hardware was doing when the watchdog timer expired, perhaps controlling some dangerous external process. To this end, the reset condition of your external I/O hardware should place all parts of the system in a benign, inactive state.

Another recovery mechanism is to generate a non-maskable interrupt (NMI) to the software. Although you wouldn't normally want to rely on software to recover from a watchdog timeout, the NMI handler can be designed as a very simple standalone routine that doesn't require the rest of the system to work. If your system is troubled by unexplained watchdog timeouts, you can use the NMI handler to take a snapshot of CPU registers, software variables, and hardware registers before resetting the system. If you have an emulator hooked up, you can also set a breakpoint on the NMI handler entry point.

If a watchdog timeout occurs, it's helpful to set a hardware status bit somewhere. This bit should only be reset by a power-up or after the software reads it. It shouldn't be reset by the general reset that happens after a watchdog timeout. Whenever the software starts up, it can read the bit (thus clearing it) to see if the last reset was due to a watchdog timeout. Although the watchdog timer expiration resets most of the hardware, the software variables in RAM should still have their previous values. You can set a breakpoint and try to debug the watchdog timeout cause.

Hardware Examples

Some microprocessor chips have a built-in hardware watchdog timer circuit, especially those designed for embedded systems. An internal watchdog timer is configured and controlled by accessing internal CPU hardware registers that look like memory or I/O locations (depending on the CPU). The main advantage of an internal watchdog timer is that it doesn't cost anything extra or take up any more space or power. However, if it's not exactly what you want, you can't change it. Your only alternative is to build your own using external hardware.

The Motorola 68HC11 microprocessor has an internal watchdog timer. It's pretty good in that it doesn't require any software interaction to start, and it's difficult to accidentally retrigger (you need to write two consecutive special data patterns to one specific memory-mapped register). It has a fairly convenient range of timeout intervals (four values from 16 milliseconds to one second). However, it's not possible to disable the internal watchdog timer using an external jumper without some moderately complicated software interaction (you have to reprogram an internal EEPROM configuration register and reset the system). This also means it's difficult to accidentally disable the watchdog timer. If the watchdog timer expires, the CPU resets via a special vector and pulses the CPU reset line low. Other devices connected to the reset line might require a longer reset pulse, and the main hardware reset generator might not let them recognize the 68HC11 pulsing the reset line low, since there would be two devices actively driving the reset line.

The Motorola 68302 microprocessor also has an internal watchdog timer. It doesn't need any software interaction to start and has a very convenient range of timeout intervals (any value from 0.5 milliseconds to 16.67 seconds). However, it's fairly easy to accidentally retrigger via a single write of any value to one memory-mapped register, or disable using a single write with bit zero low. You can configure the CPU so the timeout signal comes out of the CPU on a separate pin (the default), and you can disable the timeout action using an external jumper. Or, you could put the jumper on a general-purpose input and have the software read the jumper and disable the watchdog timer through software. Another available timeout option is to have the CPU generate an internal interrupt, but this requires working software to service it.

The Siemens 80535 (an 8051 variation) also has an internal watchdog timer. Unlike other CPUs, this watchdog timer doesn't automatically start; you have to start it with software. Once started, however, the watchdog timer can't be disabled by software (you have to reset the system). To disable the watchdog timer using an external jumper, read the jumper during initialization via a general-purpose input, and if disabled, don't start the watchdog timer. It's quite difficult to accidentally retrigger the watchdog timer, since it requires two different back-to-back writes. The timeout interval is fixed at 65 milliseconds, assuming a standard 12-MHz clock. If the watchdog timer expires, the CPU resets but leaves an internal status bit set, so the initialization software can detect if the reset was due to a watchdog timeout.

Maxim makes several versions of a microprocessor supervisory chip, most of which include a watchdog timer. For example, the MAX705 has a reset generator (for power-up, power-down, or brownout), a power-fail warning comparator, a manual reset input, and a watchdog timer. The watchdog timer starts automatically after reset is released and can't be disabled by software. However, it's easy to disable with an external jumper. The chance of accidentally retriggering the watchdog timer depends entirely on your hardware design, since the Maxim chip just has an input lead that must be toggled. If your design requires two back-to-back writes, the watchdog timer will be fairly difficult to accidentally retrigger. For this particular chip, the watchdog timeout interval is fixed at 1.6 seconds. (Other chips are adjustable.) You can hook up the chip so that upon timeout, it automatically generates a reset (as if you pressed the reset button), or you can hook it up to generate an NMI to the CPU.

Software Example

With all this talk of hardware, you might forget that the software also has work to do. After all, one of the main purposes of a watchdog timer is to detect software faults. There are a few things the software can do to enhance the usefulness of a watchdog timer.

It's important for the software to actively try to detect faults and not just wait for the unexpected crash. When you call a function and pass in parameters, the function can check the parameters for validity and bail out if an invalid parameter is detected. You probably don't need to check every parameter every time for all functions, because some of the software interfaces are considered to be "trusted." However, whenever you first process a parameter that comes from an external interface (such as a communication line, a disk file, or a hardware port), you should check it for validity. Also, you might not always want to reset if an invalid parameter is detected. In some cases, the system design might allow a more graceful recovery, such as replacing the invalid parameter with a reasonable value, notifying the calling routine, or just ignoring the invalid message. But if you decide to reset the system, you can do this by disabling interrupts and deliberately hanging up in a loop, letting the watchdog timer expire. If you make a function out of this, you can pass a parameter that gives the reason for resetting (such as out of buffers, invalid task ID, invalid interrupt, invalid mode, and so on). The function can save the reset reason in a special place in RAM that isn't cleared by the initialization software, so you can use a resident debugger or emulator to look at the value. For example, in 68HC11 assembly language:

; This routine is called when the software detects a fatal error. The
; routine resets the system by waiting for the watchdog timer to expire.
; The caller supplies the 8-bit restart reason in register B.

_fatal_error:
      sei                  ; disable interrupts
      stab  _reset_reason  ; save reset reason for emulator use

reset_loop:                ; wait for watchdog timer to expire
      bra   reset_loop

; We never get here. Instead, the watchdog timer expires and the system resets.

As an aside, notice that _fatal_error and _reset_reason have leading underscores. This makes them accessible to C routines (as well as assembly language), since many C compilers put a leading underscore on all function and variable name references. Check your compiler's documentation to confirm its naming convention.
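Returning to the checks on external input described earlier, the two graceful recoveries (substituting a reasonable value and ignoring an invalid message) might look like this in C. This is a sketch under invented assumptions: the speed range, the message format (a length byte first and an 8-bit sum last), and the names sanitize_speed and message_is_valid are all hypothetical. A check with no sensible recovery would call fatal_error() instead.

```c
#include <stdbool.h>
#include <stdint.h>

#define SPEED_MIN      0    /* hypothetical valid range for a speed command */
#define SPEED_MAX      200
#define SPEED_DEFAULT  0    /* safe value substituted for a bad request */
#define MSG_MAX_LEN    16u  /* hypothetical longest valid message */

/* Graceful recovery: replace an out-of-range value with a safe one. */
static int sanitize_speed(int requested)
{
    if (requested < SPEED_MIN || requested > SPEED_MAX)
        return SPEED_DEFAULT;
    return requested;
}

/* Graceful recovery: tell the caller to ignore a malformed message.
   Assumed format: the first byte holds the total length and the last
   byte is an 8-bit sum of all the bytes before it. */
static bool message_is_valid(const uint8_t *msg, unsigned len)
{
    uint8_t sum = 0;
    unsigned i;

    if (len < 2u || len > MSG_MAX_LEN || msg[0] != len)
        return false;
    for (i = 0; i + 1u < len; i++)
        sum = (uint8_t)(sum + msg[i]);
    return sum == msg[len - 1u];
}
```

Routines deeper in the system that only ever receive values already passed through checks like these can treat their callers as trusted.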

You need to be careful where the software retriggers the watchdog timer. You might retrigger it after every pass through the background task loop. However, one type of software fault accidentally leaves interrupts disabled. The non-interrupt background software keeps running, but all of a sudden, no more interrupts get serviced. If you retrigger the watchdog timer from the background software, it would get retriggered in spite of a major problem with the system. If you retrigger the watchdog timer from an interrupt routine, it's possible for the background software to fail and the interrupt software to run fine, retriggering the watchdog timer. To be on the safe side, the watchdog timer must only be retriggered when both the background and foreground software are working properly.
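One simple way to express "retrigger only when both are working" is with a pair of flags, as in the C sketch below. It is a different arrangement from the cross-checking in Listing 1, and every name in it (fg_alive, bg_alive, service_watchdog) is invented for illustration. The retrigger_wdt() shown is a host stand-in that counts calls rather than touching hardware.

```c
#include <stdbool.h>

static volatile bool fg_alive;  /* set by the timer interrupt routine */
static volatile bool bg_alive;  /* set by the background task loop */

static unsigned wdt_kicks;      /* host stand-in: counts retriggers */

/* Stand-in for the real routine, which would access the hardware. */
static void retrigger_wdt(void)
{
    wdt_kicks++;
}

/* Call this from exactly one place. The watchdog timer is retriggered
   only after both the foreground and the background have reported in
   since the previous retrigger. */
static void service_watchdog(void)
{
    if (fg_alive && bg_alive) {
        fg_alive = false;
        bg_alive = false;
        retrigger_wdt();
    }
}
```

With this arrangement, the watchdog timer duration has to exceed the worst-case time for both flags to be set again. If either side stops setting its flag, the retriggering stops and the watchdog timer expires.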

If you have a fairly sophisticated real-time operating system, you could create a watchdog timer background task and assign it a low priority. Then, you could have the watchdog timer task start a software timer and only retrigger the watchdog timer when the software timer expires. Assuming your software timers require a real-time clock interrupt to work, this method verifies that both the background and foreground software are working. Setting the watchdog timer task to the lowest priority also catches a system that runs out of CPU power for long periods of time. If higher-priority work starves the watchdog timer task, the watchdog timer won't be retriggered, and the system will reset.

If you don't have a sophisticated real-time operating system, you can design your software so the background checks on the foreground, and the foreground checks on the background. If either one detects that the other has failed, the software calls the fatal_error() routine to reset the system. For example, the background periodically checks to see if a foreground OK flag gets set (by the foreground). If so, the background just clears the flag and resumes checking. If the foreground OK flag isn't set after a reasonable amount of time, the background assumes that there's a problem and calls fatal_error(). The foreground can check up on the background using a similar mechanism, except checking and clearing the background OK flag. The "OK" flags are set by the software at some convenient point.

One problem with this software-based mechanism is that you don't really know how often the software is going to set the "OK" flag, so you don't know how long to wait before declaring a fault. Say the foreground software executes every 10 milliseconds and checks the background OK flag every time, and assume the background sets the background OK flag every pass through the background task loop. After the foreground clears the background OK flag, how long should the foreground wait for the flag to be set again? You don't really know how long the background takes to execute all its tasks, especially in the worst case, so you should come up with a conservative value (let's say one second) that you're sure will always work. So the foreground shouldn't worry about background OK remaining clear until at least one second has gone by.

The problem is somewhat more difficult for the background checking the foreground OK flag, because the background checking isn't necessarily done at a precise rate. When the background software doesn't have a lot to do, the background task loop may execute very rapidly, racing through all the tasks as quickly as possible. If you check the foreground OK flag once per background task loop iteration, you really should allow many iterations before declaring the foreground "failed" since each iteration can be very short. If the background is very busy, each task loop iteration can take a much longer time. However, the background software still waits for the same number of iterations, so it can take much longer to detect a foreground failure. It is important to make sure the software-based checking mechanism tolerates worst-case software timing without making a mistake and declaring a fault when there isn't a problem.
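The arithmetic behind an iteration-count limit is worth writing down. The C sketch below computes the number of iterations to wait and the resulting worst-case detection time. The function names are invented, and the 5-millisecond slowest iteration used in the note after the code is only an assumed figure.

```c
#include <stdint.h>

/* Iterations to wait so that at least min_wait_us elapses even when
   every iteration runs at its fastest. fastest_iter_us must be nonzero. */
static uint32_t wait_iterations(uint32_t min_wait_us, uint32_t fastest_iter_us)
{
    return (min_wait_us + fastest_iter_us - 1u) / fastest_iter_us;
}

/* Longest time before a failure is declared, reached when every
   iteration runs at its slowest. */
static uint32_t worst_case_detect_us(uint32_t iterations,
                                     uint32_t slowest_iter_us)
{
    return iterations * slowest_iter_us;
}
```

With a fastest iteration of 50 microseconds and a minimum wait of 50 milliseconds, the limit is 1,000 iterations. If a busy iteration were to take 5 milliseconds, those same 1,000 iterations would stretch to 5 seconds before a foreground failure is declared.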

The listing at the end of this article shows an example of background and foreground software checking on each other.

Error Codes

Suppose you're debugging your software, and all of a sudden it resets. How can you tell what went wrong? Well, if the software called fatal_error(), it will have passed in a software error code that you can examine with an emulator or debugger. But how can you tell if the watchdog timer expired? By its nature, a watchdog timer reset occurs suddenly and without warning. One way is to store the error code for "watchdog timeout" in reset_reason during software initialization. If the software calls fatal_error(), it will pass in a new error code that will overwrite the code for "watchdog timeout." However, if the hardware watchdog timer suddenly expires and resets the system, the watchdog timeout reason will still be present in reset_reason. For this to work, the initialization software must not clear or initialize reset_reason unless the initialization software knows that this is a power-up. If it is a power-up, reset_reason just contains garbage. In all other cases, reset_reason contains a meaningful error code that should not arbitrarily be wiped out. By the way, you can detect power-up by looking for a special pattern in a couple of bytes (like 0x55, 0xaa) that is not likely to occur randomly upon power-up. If you don't detect the special pattern, it's a power-up, so set the special pattern and then initialize reset_reason.
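The power-up check and the reset_reason handling might be sketched in C as follows. The structure noinit_area, the function startup_check, and the code RR_WATCHDOG_TIMEOUT are invented for this example. The sketch also makes one small variation on the scheme above: it copies out the previous reason before storing the "watchdog timeout" code again, so an old error code can't mask a later watchdog timeout. Placing the structure in RAM that the startup code doesn't clear is toolchain specific and isn't shown.

```c
#include <stdint.h>

#define SIG0                 0x55u  /* power-up signature bytes */
#define SIG1                 0xAAu
#define RR_WATCHDOG_TIMEOUT  0xFFu  /* hypothetical error code */

/* Must live in RAM that the startup code does not clear. */
typedef struct {
    uint8_t sig[2];        /* special pattern, garbage at power-up */
    uint8_t reset_reason;  /* written by fatal_error() before a reset */
} noinit_area;

/* Call early in initialization. Returns the reason for the previous
   reset, or 0 if this is a power-up. Either way, reset_reason is then
   set to "watchdog timeout" so that a reset that arrives without a
   call to fatal_error() is attributed to the watchdog timer. */
static uint8_t startup_check(noinit_area *ram)
{
    uint8_t last = 0;

    if (ram->sig[0] == SIG0 && ram->sig[1] == SIG1) {
        last = ram->reset_reason;  /* pattern found: not a power-up */
    } else {
        ram->sig[0] = SIG0;        /* power-up: contents were garbage */
        ram->sig[1] = SIG1;
    }
    ram->reset_reason = RR_WATCHDOG_TIMEOUT;
    return last;
}
```

The returned value can be left in a variable for the emulator to inspect or reported through whatever diagnostic channel the system has.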

By using these hardware and software techniques, we can get a lot of extra mileage out of a simple watchdog timer. With the right hardware and software, it is possible to simplify debugging and increase system reliability. In our quest to make embedded systems more reliable, we need all the help we can get.

Listing 1 - An example of background and foreground software checking on each other.

/* Variables used by the background and foreground to check on each other. */

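/* The OK flags are shared between the interrupt routine and the background,
so they are declared volatile. */

volatile unsigned int fg_ok, bg_ok;  /* set by fg and bg when all is OK */
unsigned int fg_wait, bg_wait;       /* used when waiting for flag to be set */

/* Defined elsewhere (prototypes assumed here): fatal_error() resets the
system and never returns; retrigger_wdt() retriggers the hardware watchdog. */

void fatal_error (unsigned char reason);
void retrigger_wdt (void);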

#define TRUE    1
#define FALSE   0

/* The maximum number of background task loop iterations we'll wait for fg_ok
to be set. Assume fastest iteration time is 50 usec and longest time to wait is 50 msec. */

#define MAX_FG_WAIT     1000
#define FG_FAILED       1       /* internal software error code */

/* The maximum number of foreground interrupt ticks we'll wait for bg_ok to be set. Assume 
tick interval is 10 msec and longest time to wait is 500 msec. */

#define MAX_BG_WAIT     50
#define BG_FAILED       2       /* internal software error code */

/* Call this routine from the background, every pass through the background
task loop. Check to make sure the foreground is still OK. */

void bg_checks_fg (void)
{
    if (fg_ok) {
        fg_ok = FALSE;                  /* foreground OK, clear OK flag */
        fg_wait = 0;                    /* start waiting for next OK */
    } else {                            /* else, can we wait any longer? */
        if (++fg_wait > MAX_FG_WAIT)
            fatal_error (FG_FAILED);    /* if no, reset the system */
    }
}

/* Call this routine from the foreground interrupt service routine, we'll assume it executes 
every 10 msec. Check to make sure the background is still OK. */

void fg_checks_bg (void)
{
    /* In this example, we'll wait up to 500 msec for the background to declare it's OK.
    But supposing the hardware watchdog timer expires in 65 msec, like on the 80535?
    We need to retrigger the hardware watchdog timer here regardless of the background
    status, just to keep the hardware watchdog timer from expiring while we wait for the
    background to declare "OK". */

    retrigger_wdt ();

    if (bg_ok) {
        bg_ok = FALSE;                  /* background OK, clear OK flag */
        bg_wait = 0;                    /* start waiting for next OK */
    } else {                            /* else, can we wait any longer? */
        if (++bg_wait > MAX_BG_WAIT)
            fatal_error (BG_FAILED);    /* if no, reset the system */
    }
}

/* Put this statement somewhere in the background task loop. */

bg_ok = TRUE;        /* indicate background is still OK */

/* Put this statement somewhere in the real-time clock interrupt handler. */

fg_ok = TRUE;        /* indicate foreground is still OK */
