Arthur Zatarain


 

 

Title: Testing Redundant and Backup Systems

Deck: Locate missed opportunities to validate redundant and backup control systems

= = = = =

Many of us are inclined to ignore the time-honored adage, "If it ain't broke, don't fix it." Sometimes our handyman instincts can't leave well enough alone. Yet that same intuition also encourages us to believe, "If it ain't broke, don't test it." This reluctance is especially evident when it comes to testing the redundant and backup features of our critical control systems. At best, failures resulting from inadequate testing will only cost you lost production. At worst, insufficient testing can cost you your job.

Get hip to R&B

Although the terms "redundant" and "backup" (R&B) are often interchanged, each represents a different aspect of reliability design. A redundant system uses multiple similar components in a configuration that permits simultaneous performance of the same (or similar) function. The failure of a single redundant component causes no reduction of system operation or capability. Simple examples include parallel power supplies and series shutdown valves. A more sophisticated example is a redundant PLC system: a microchip fails, a warning light comes on, and production continues normally. A key aspect of redundant systems is that multiple components do the same job at the same time.
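
The arithmetic behind redundancy can be sketched in a few lines of Python. This is only a rough illustration: the function name and the failure probabilities are hypothetical, and it assumes that components fail independently of one another, which real installations rarely guarantee.

```python
def prob_all_fail(p_single: float, n: int) -> float:
    """Chance that all n redundant components fail in the same interval.

    Assumes each component fails independently with probability p_single.
    Common-mode failures (shared power, shared software) break this assumption.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return p_single ** n


# Hypothetical numbers: one power supply with a 1% chance of failing
# during the interval, compared with two identical supplies in parallel.
single = prob_all_fail(0.01, 1)
paired = prob_all_fail(0.01, 2)
print(f"single supply: {single:.4%}   parallel pair: {paired:.4%}")
```

The same calculation applies to two shutdown valves in series, where the process fails to isolate only if both valves fail to close. The benefit shrinks quickly if one hidden fault can take out both components at once.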

A backup system takes a different approach to reliability by providing an independent means of performing all or part of the overall control function, usually in a "primary" and "standby" configuration. Manual or automatic transfer mechanisms determine which component takes the lead. For increased reliability, backup systems can use alternate configurations and technologies to improve resistance to single point and common mode failures. For example, a simple local controller that can operate without assistance from a plant-wide control system is a common instance of backup technology. The local system may lack bells and whistles, but at least it can maintain safe production should the primary system go offline.

Backups can also be found within the control system's support structure. A frequent example is an uninterruptible power supply (UPS) that delivers reliable energy to many electronic control systems. If the primary power goes away, the UPS instantly takes over to maintain essential control functions--at least as long as the batteries hold out.

Note that redundancy and backup are not mutually exclusive. Many control systems contain separate elements of both, and some even combine them into "redundant backup" systems with high levels of fault tolerance. Such systems include two or more similar control entities, each having full capability, but based on different technologies. Having two independent and diverse control systems is often considered the best protection against unanticipated failures.

In addition to improving reliability, R&B controls can simplify routine maintenance of an operating facility. R&B concepts allow portions of the control system to be repaired offline while the controlled process remains in service. Special operating modes such as manual supervision may be required, but the ability to perform online testing of items such as relief valves and meter runs is a valuable benefit of high-reliability systems.

Why test R&B?

Some industries, such as aerospace and nuclear, routinely test their redundant and backup systems because reliable technology is essential to their high-risk business. But less-risky industrial users don't always adopt a "mission critical" approach to testing R&B performance. Everyone in industry has heard war stories of redundant and backup systems that failed to do their job. "The UPS should have kept us going" or "the redundant processor had an outdated program." The subsequent diagnosis is often performed through a rear-view mirror, with perhaps some adjustment to future maintenance procedures. But in reality, proper testing of R&B systems remains on the back burner of many maintenance programs.

Of course, the less exotic elements of R&B systems, such as inputs and outputs, are often tested during routine maintenance of the control system. However, such checks are often limited to calibration and physical care. Such maintenance may test the heart of an R&B system, but not its soul. A true test requires simulation of the special transient conditions for which the R&B systems are designed. Proper R&B testing requires more than simply faking a process fault to verify that the system performs its normal role. It should also challenge the system's unique "non-stop" features to verify reliable performance even while partially disabled.

Further, the requirement to routinely verify R&B operation is becoming increasingly important because of safety-related standards such as IEC 61511 and ANSI/ISA-84.00.01-2004. These internationally accepted guidelines define Safety Integrity Levels (SIL) and Safety Instrumented Systems (SIS) that generally rely on redundant and/or backup systems. Merely designing controls to meet those standards isn't sufficient to satisfy existing and pending regulations. Proper testing and verification of specific redundant and backup features is essential in meeting both the spirit and letter of those standards.

How to test?

A proper test of redundancy and backup requires creating operating conditions that mimic failures of the control system itself, and also of its various support systems. Such tests must go beyond the manual or automatic diagnostics built into many R&B systems (e.g., the UPS "test" button). Those diagnostics are generally local to the device and may not adequately test responses to external problems. So although built-in tests help verify operation of an R&B component, they cannot verify reliable system operation for situations that involve interconnected units.

So how can the R&B functions themselves best be tested? There's no easy answer here--every redundant and backup system has its own special requirements. But a common theme is to simulate fault conditions that are unrelated to the controlled machine or process. A significant goal is to test the redundant or backup system's ability to maintain operation during and after a transient condition that interrupts normal conditions, including loss of the primary control system. Therefore, testing one part of an R&B system usually requires disabling other parts under conditions that simulate real-world failures.

Another key testing goal is to validate the R&B system's ability to alert operators to a partial failure. In addition to seamlessly maintaining operations, the R&B system must accurately indicate that it or its partner is impaired. Without such notification, corrective action may be delayed or overlooked until after the remaining portions fail.
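
The two goals above, continued operation and operator notification, can be pictured as a pair of pass/fail checks. The Python sketch below is a toy model only: Channel, RedundantPair, and partial_failure_test are hypothetical names invented for illustration, and the "control calculation" is a stand-in that simply echoes the demand value.

```python
from dataclasses import dataclass, field


@dataclass
class Channel:
    """One half of a hypothetical redundant controller pair."""
    name: str
    healthy: bool = True

    def compute(self, demand: float) -> float:
        if not self.healthy:
            raise RuntimeError(f"{self.name} is not responding")
        return demand  # stand-in for the real control calculation


@dataclass
class RedundantPair:
    a: Channel
    b: Channel
    alarms: list = field(default_factory=list)

    def output(self, demand: float) -> float:
        """Return a control output if at least one channel still works."""
        results = []
        for channel in (self.a, self.b):
            try:
                results.append(channel.compute(demand))
            except RuntimeError:
                message = f"{channel.name} impaired"
                if message not in self.alarms:
                    self.alarms.append(message)  # notify the operator once
        if not results:
            raise RuntimeError("both channels failed")
        return results[0]


def partial_failure_test(pair: RedundantPair, demand: float) -> dict:
    """Disable channel A, then check both goals: output held and alarm raised."""
    pair.a.healthy = False
    value = pair.output(demand)
    return {
        "output_held": value == demand,
        "alarm_raised": f"{pair.a.name} impaired" in pair.alarms,
    }


result = partial_failure_test(RedundantPair(Channel("A"), Channel("B")), 50.0)
print(result)
```

A test that records only the first check would pass a system that silently runs on one leg, which is exactly the condition that leads to a surprise shutdown later.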

Fortunately, functional testing of R&B systems is usually more "fun" than routine maintenance work. Rather than calibrating transmitters or greasing actuators, we get to kill half of a redundant system and suffer nothing beyond a warning light. Or we can disable a remote speed control and watch the lowly backup governor maintain operation. And then there's everyone's favorite--pulling the plug on a UPS and grinning when nothing bad happens. Can testing really be that simple? Maybe not.

The UPS example just cited may seem like a good idea, but many UPS manufacturers will disagree. An often-overlooked effect of "pulling the plug" is disconnection of important ground and neutral references that help the UPS monitor primary power. A better UPS test procedure is to remove power at the circuit breaker or other convenient point to introduce transients similar to a power outage. Only then can a true test of the UPS's ability to sense, switch, and supply be performed in the field.
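
If the UPS output is logged during such a breaker test, the record can be reduced to a single figure: the longest interval during which the output fell below an acceptable voltage. The Python sketch below shows one way to do that. The function name, the sample data, and the 100-volt threshold are all hypothetical, and the acceptable gap depends on the ride-through capability of the connected loads.

```python
def longest_dropout_ms(samples, min_volts):
    """Longest span, in ms, during which logged output stayed below min_volts.

    samples: list of (time_ms, volts) tuples sorted by time.
    A dropout is measured from the first low sample to the first recovered
    sample, so the result errs on the long (conservative) side.
    """
    longest = 0.0
    start = None
    last_time = None
    for time_ms, volts in samples:
        if volts < min_volts:
            if start is None:
                start = time_ms  # dropout begins
        elif start is not None:
            longest = max(longest, time_ms - start)  # dropout ends
            start = None
        last_time = time_ms
    if start is not None:
        # Log ended while the output was still low
        longest = max(longest, last_time - start)
    return longest


# Hypothetical log sampled every 4 ms around the moment the breaker opens
log = [(0, 120.0), (4, 120.0), (8, 0.0), (12, 0.0), (16, 118.0), (20, 120.0)]
print(longest_dropout_ms(log, min_volts=100.0), "ms")
```

The number means little by itself. It becomes a test result only when compared with how long the control equipment can tolerate a gap in supply.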

Likewise, simulations that merely "pull the plug" on an input, communications link, or processor may not represent realistic R&B failure modes. Input signals don't usually go away, but they do drift out of specification. Similarly, communications links don't always go quiet--in fact, they're more likely to get noisy when failed. And processors are rarely known to leap from their happy home in the electronics rack. A more realistic procedure will mess with the power or communications going into a processor, or to an output coming from the processor, to determine if its R&B control partner can carry on.
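
The difference between a "pulled plug" and a realistic fault is easy to show with simulated signals. The Python sketch below injects a slow drift and random noise into a steady reading, then compares the result against a healthy partner transmitter. Every name and number in it is hypothetical, and the simple tolerance comparison stands in for whatever discrepancy logic a real system uses.

```python
import random


def inject_drift(readings, drift_per_sample):
    """Simulate a transmitter drifting steadily away from the true value."""
    return [r + i * drift_per_sample for i, r in enumerate(readings)]


def inject_noise(readings, amplitude, seed=0):
    """Simulate a noisy link by adding bounded random error to each reading."""
    rng = random.Random(seed)  # fixed seed keeps the test repeatable
    return [r + rng.uniform(-amplitude, amplitude) for r in readings]


def first_discrepancy(primary, partner, tolerance):
    """Index of the first sample where the two signals disagree, else None."""
    for i, (p, q) in enumerate(zip(primary, partner)):
        if abs(p - q) > tolerance:
            return i
    return None


healthy = [100.0] * 10                     # steady reading from the partner
drifting = inject_drift(healthy, 0.5)      # gains 0.5 units per sample
noisy = inject_noise(healthy, 1.0)         # error stays within +/- 1.0

print(first_discrepancy(drifting, healthy, tolerance=2.0))  # 5
print(first_discrepancy(noisy, healthy, tolerance=2.0))     # None
```

In this example the drift goes unnoticed for the first five samples, and noise within the tolerance band is never flagged at all. A test that only disconnects the signal would reveal neither behavior.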

Establishing adequate test procedures therefore requires careful consideration and planning. The tests can't merely be convenient or arbitrary--they need to be realistic. And they need to be part of the facility's regular maintenance plan.

When to test?

In theory, we should be able to test redundant and backup systems anytime we want--if they work properly, there's nothing to fear. But in reality, R&B testing for a "non-shutdown" rarely occurs until after the system fails to perform. Perhaps the lapse is due to fear that the R&B system won't work--no one wants an unexpected shutdown noted in their permanent file. The logical solution is to combine R&B testing with other maintenance procedures in which an unexpected shutdown can be tolerated.

For example, many offline maintenance activities begin with a functional test of the emergency shutdown (ESD) system. Few maintenance tasks are more satisfying than watching an automatic control system stop a complex machine or process in a safe, organized sequence. We expect nothing less when we push the big red button, yet it's still a kick to watch the dominoes tumble toward a happy ending. Similarly, a planned shut-in is an ideal time to test the failure modes of redundant and backup systems to verify that they don't shut down a process. Therefore, functional R&B tests are usually best accomplished just prior to performing the scheduled ESD.

What's next?

If you suspect shortcomings in your R&B maintenance, consider building a multidisciplinary team to raise awareness and evaluate your needs. Proper testing will likely require input from many sources. Be sure to include the usual suspects such as Plant Utilities, Communications, Engineering, and Operations. But also include lesser players such as Safety, Training, and Administration, all of whom share your interest in seeing redundancy and backup systems perform as planned. There's little doubt that attainable goals can be set. But chances are, the path to those goals begins with you.

= = = = = = = = = = = = = = = = = = = = = = = = = = = = =

Bio information for Arthur Zatarain, PE

Arthur Zatarain, PE, consults in technology and intellectual property through Arthur Zatarain. He is also Vice President of TEST Automation & Controls, a provider of industrial systems worldwide. He can be reached through www.artzat.com.

 

 


Copyright © 2026 Arthur Zatarain, all rights reserved. Some images are modified for confidentiality
or illustration clarity. This site should not be used as a technical or legal reference.
