Table of Contents
- What “Redundant Flip-Flops” Really Means
- Voting Logic 101: Who Gets to Decide “Reality”?
- TMR: Triple Modular Redundancy in Plain English
- Redundant Flip-Flop Patterns You’ll Actually See
- Voting Logic Design: Details That Separate “Works” From “Works in the Lab”
- Concrete Examples: Where Redundant Flip-Flops and Voting Logic Show Up
- Choosing the Right Approach: A Quick Decision Guide
- Implementation Pitfalls (a.k.a. How Redundancy Can Betray You)
- Practical Checklist: Designing with Redundant Flip-Flops and Voters
- Field Notes: Real-World “Circuit VR” Experiences
Flip-flops are the tiny memory cells that quietly run your digital life. They also have a bad habit of being
the first thing to “get creative” when the world gets harsh: radiation in space, electrical noise in industrial
gear, or the kind of timing glitch that shows up only after you ship 10,000 units.
That’s where redundant flip-flops and voting logic come in. The idea is simple:
don’t trust a single bit of state with your whole system. Give that state backups, then add a referee (a voter)
to decide what the state “really” is when one copy misbehaves. It’s like asking three friends what time the movie
starts instead of trusting the one who still thinks daylight saving time is a government conspiracy.
This article breaks down the most common circuit patterns: DMR (dual modular redundancy),
TMR (triple modular redundancy), hardened storage cells like DICE, and the
voter designs that make redundancy actually work. We’ll also cover the trade-offs that separate
“robust” from “robust-ish,” plus real-world lessons that designers learn the fun way (which is to say: the hard way).
What “Redundant Flip-Flops” Really Means
A standard flip-flop stores one bit. If that bit flips unexpectedly, whether from a single event upset (SEU),
power noise, metastability, or a design error, you don’t just lose a bit. You might lose a state machine transition,
corrupt a safety check, or derail a control loop.
A redundant flip-flop stores the same logical bit in multiple physical elements. The system then:
(1) compares those copies to detect disagreement, and/or (2) uses a voter to mask an error by
choosing the “majority” value.
The two goals: detection vs masking
- Detect errors (fail-safe): notice something is wrong and transition to a safe state.
  This is common in functional safety and control systems.
- Mask errors (fail-operational): keep operating correctly even when one copy is wrong.
  This is common in high-reliability systems where stopping is worse than continuing (think spacecraft,
  remote infrastructure, or mission-critical controls).
Voting Logic 101: Who Gets to Decide “Reality”?
Voting logic is the circuit that takes multiple copies of a signal and decides what downstream logic will see.
The classic is majority voting: if at least two out of three agree, that’s the output.
Majority voters: the bread-and-butter of TMR
In a 3-way majority voter, the output is 1 if at least two inputs are 1. Hardware designers love it because
it’s fast, compact, and predictable. Software folks love it because it feels like democracy, until they realize
democracy still needs a functioning ballot counter.
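As a minimal sketch of the 2-of-3 rule above: the usual Boolean form is (a AND b) OR (a AND c) OR (b AND c). The Python below models that expression on integers. The name `majority3` is an illustrative choice for this article, not an API from any particular library or tool.

```python
def majority3(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: an output bit is 1 when at least two input bits are 1."""
    return (a & b) | (a & c) | (b & c)


# Single-bit case: one dissenting input is outvoted.
print(majority3(1, 1, 0))  # 1
print(majority3(0, 1, 0))  # 0

# The same expression votes every bit position of a word independently.
print(bin(majority3(0b1010, 0b1000, 0b1011)))  # 0b1010
```

In hardware this is a handful of gates per bit, which is where the “fast, compact, and predictable” reputation comes from.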
Comparators: the backbone of DMR and lockstep
With two copies (DMR), you can’t do majority voting. If A ≠ B, you know something is wrong, but you don’t
know which one is correct. The usual move is to raise a fault flag and either:
(a) switch to a safe state, (b) retry, (c) use a third “tie-breaker” source, or (d) rely on time redundancy
(compute twice and compare).
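To make the DMR limitation concrete, here is a small sketch of compare-and-flag with response (a), the safe-state fallback. The names `DmrResult`, `dmr_compare`, and `respond`, and the convention of forwarding copy A, are assumptions made for illustration only.

```python
from typing import NamedTuple


class DmrResult(NamedTuple):
    value: int   # value forwarded downstream (copy A, by convention in this sketch)
    fault: bool  # True when the two copies disagree


def dmr_compare(copy_a: int, copy_b: int) -> DmrResult:
    """Compare two redundant copies. A mismatch is detectable but not correctable:
    nothing here can tell which copy is the wrong one."""
    return DmrResult(value=copy_a, fault=(copy_a != copy_b))


def respond(result: DmrResult, safe_value: int = 0) -> int:
    """Option (a) from the text: fall back to a safe value when a fault is flagged."""
    return safe_value if result.fault else result.value


print(respond(dmr_compare(5, 5)))  # 5
print(respond(dmr_compare(5, 4)))  # 0
```

Options (b) through (d) would replace the body of `respond` with a retry, a tie-breaker lookup, or a recompute.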
Don’t forget the “voter problem”
Redundancy can accidentally create a new single point of failure: the voter itself. If one voter output feeds
the whole design and that voter is wrong, congratulations: you built a very reliable way to fail consistently.
In high-reliability designs, designers may harden the voter, duplicate it, or structure voting so that no single
small block can take the system down.
TMR: Triple Modular Redundancy in Plain English
Triple Modular Redundancy (TMR) means you implement three copies of a function (combinational logic,
sequential logic, or an entire module), then vote their outputs. If one copy is wrong, the other two outvote it.
Where TMR shines
- Radiation environments where bit flips are expected, not “rare.”
- Long-lived systems where reliability matters more than area or power.
- Safety-critical controls where detection alone isn’t enough, or where uptime is essential.
Local vs distributed TMR: it’s about where you place the voters
Designers often distinguish between flavors of TMR based on how much logic gets triplicated and where voting occurs.
A common split is:
- Local TMR: triplicate the flip-flops, and vote near the register boundaries so state is protected.
  This reduces overhead but can leave parts of combinational logic unprotected.
- Distributed (or full) TMR: triplicate both flip-flops and combinational logic paths, then vote at
  strategic points (often at or after registers). More robust, more expensive.
In FPGA designs, these choices are not just philosophical; they affect routing, timing closure, and whether a
single radiation-induced transient can sneak through before the next clock edge.
Redundant Flip-Flop Patterns You’ll Actually See
1) DMR (Dual Modular Redundancy) flip-flops + compare
Two identical flip-flops store the same bit. A comparator checks if they match. If they disagree, you set an
error flag and trigger a response. This pattern is popular when “detect and recover” is acceptable and when
you can’t afford the extra area/power of a third copy.
Practical twist: the comparator can be external (a separate logic block) or integrated into the register slice
for tighter control and faster fault signaling. In lockstep CPU designs, a similar concept compares two cores’
outputs cycle-by-cycle.
2) TMR registers (three flip-flops) + majority voter
The canonical “redundant flip-flop” is really three flip-flops storing one logical bit, with voting
logic deciding the output. When implemented carefully, a single fault in one storage element is masked by the
other two.
A key design choice is whether you:
- Vote only on read (masking): output is voted, but internal copies may diverge until repaired.
- Vote and refresh (correcting): feed the voted result back so the “bad” copy gets overwritten
  on the next update cycle (or through explicit scrubbing logic).
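The difference between the two choices is easier to see in a behavioral model. The sketch below is a software stand-in for a TMR register, not RTL; `TmrRegister`, `inject_upset`, and `refresh` are hypothetical names used only for this illustration.

```python
class TmrRegister:
    """Behavioral model of one logical register stored as three copies."""

    def __init__(self, value: int = 0) -> None:
        self.copies = [value, value, value]

    def read(self) -> int:
        """Vote only on read: mask a bad copy without repairing it."""
        a, b, c = self.copies
        return (a & b) | (a & c) | (b & c)

    def refresh(self) -> None:
        """Vote and refresh: write the voted value back into all three copies."""
        voted = self.read()
        self.copies = [voted, voted, voted]

    def inject_upset(self, replica: int, bit: int) -> None:
        """Flip one bit in one copy to emulate a single upset."""
        self.copies[replica] ^= 1 << bit


reg = TmrRegister(0b0110)
reg.inject_upset(replica=1, bit=0)
print(reg.read())    # 6  (the upset is masked on read)
print(reg.copies)    # [6, 7, 6]  (but the copies have diverged)
reg.refresh()
print(reg.copies)    # [6, 6, 6]  (refresh pulls them back into agreement)
```

Without the refresh, a second upset in a different copy at the same bit position would outvote the one remaining good copy.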
3) Hardened storage cells (DICE and friends)
Instead of triplicating full flip-flops, you can harden the storage element itself. One well-known approach is the
Dual Interlocked Storage Cell (DICE), which uses interlocked nodes so that a single node upset is less
likely to flip the stored value permanently.
Important nuance: “hardened” doesn’t mean “invincible.” At modern process nodes, charge sharing and layout effects can
change how well a given cell resists upsets. That’s why you’ll see continued research and new variants that target
specific technologies and fault models.
4) Time redundancy: “do it twice”
Sometimes you don’t duplicate hardware; you duplicate time. Compute a result twice (or three times) and compare.
If results disagree, you retry, vote, or flag an error. This is common in software safety patterns and in hardware
where spare cycles are cheaper than spare gates.
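A minimal sketch of the compare-and-retry variant, assuming the computation can be re-run without side effects. `run_twice_and_compare`, the `compute` callable, and the retry limit are placeholders for illustration.

```python
from typing import Callable, Optional


def run_twice_and_compare(compute: Callable[[], int], max_retries: int = 2) -> Optional[int]:
    """Run `compute` twice per attempt and accept the result only if both runs agree.

    Returns None when no attempt produced agreement, so the caller can flag an error.
    """
    for _ in range(max_retries + 1):
        first = compute()
        second = compute()
        if first == second:
            return first
    return None


# Emulate a transient fault: the second run of the first attempt returns a wrong value.
results = iter([42, 43, 42, 42])
print(run_twice_and_compare(lambda: next(results)))  # 42
```

Note what this does not cover: a permanent fault that corrupts both runs identically passes the comparison unnoticed.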
Voting Logic Design: Details That Separate “Works” From “Works in the Lab”
Bitwise vs word-level voting
In many systems, voting happens bitwise (each bit of a bus is majority-voted independently). That’s
great for random independent bit flips. But it can be awkward if faults become correlated (for example, a whole
module output is wrong in a consistent way).
Word-level voting (treating an entire word as one “vote”) can be useful when outputs represent encoded values,
state IDs, or structured messages, but it can also be harder to implement quickly and cleanly.
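Here is one hedged illustration of where the two styles part ways. It assumes one-hot state IDs and a case where all three replicas disagree; that scenario is chosen for the example, not taken from a specific design. `vote_bitwise` and `vote_word` are illustrative names.

```python
from collections import Counter
from typing import Optional


def vote_bitwise(a: int, b: int, c: int) -> int:
    """Majority-vote each bit position independently."""
    return (a & b) | (a & c) | (b & c)


def vote_word(a: int, b: int, c: int) -> Optional[int]:
    """Return the word at least two replicas agree on, or None if all three differ."""
    word, count = Counter([a, b, c]).most_common(1)[0]
    return word if count >= 2 else None


# One replica wrong: both styles recover the same word.
print(vote_bitwise(0b010, 0b010, 0b100))  # 2
print(vote_word(0b010, 0b010, 0b100))     # 2

# Three different one-hot state IDs: bitwise voting emits 0b000,
# a value no replica produced and not a valid one-hot state.
print(vote_bitwise(0b001, 0b010, 0b100))  # 0
print(vote_word(0b001, 0b010, 0b100))     # None
```

The word-level voter can at least report “no majority,” at the cost of a full-width comparison instead of a few gates per bit.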
Metastability and timing: the uninvited guest
Voting logic doesn’t magically remove metastability. If redundant signals cross clock domains or arrive with skew,
a voter can become a metastability amplifier (which is not a product category anyone wants).
Good practice includes: synchronizing inputs before voting, constraining timing paths, minimizing skew between
redundant routes, and treating voters like critical timing elements, not “just a few gates.”
Single point of failure: protect the voter (or design around it)
If the voter is a single block, it can fail. Common mitigation strategies include:
- Triplicate the voter and vote the voters (yes, really).
- Use self-checking voter logic that can detect internal inconsistencies.
- Place voters at boundaries so faults are contained and easier to diagnose.
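One way to picture the first bullet is a feedback arrangement with one voter per replica, where each voter reads all three copies but writes only its own. The sketch below assumes that arrangement and models a broken voter by corrupting its output; `step` and `faulty_voter` are hypothetical names.

```python
from typing import List, Optional


def majority3(a: int, b: int, c: int) -> int:
    return (a & b) | (a & c) | (b & c)


def step(copies: List[int], faulty_voter: Optional[int] = None) -> List[int]:
    """One refresh cycle with three voters.

    Voter i reads all three copies and drives only replica i, so a bad voter
    can corrupt at most one replica per cycle.
    """
    next_copies = []
    for i in range(3):
        voted = majority3(*copies)
        if i == faulty_voter:
            voted ^= 1  # assumption: the broken voter flips bit 0 of its output
        next_copies.append(voted)
    return next_copies


state = [1, 1, 1]
state = step(state, faulty_voter=2)
print(state)  # [1, 1, 0]  (the bad voter damaged only its own replica)
state = step(state)
print(state)  # [1, 1, 1]  (the healthy voters outvote the damage next cycle)
```

A voter fault then looks like any other single-replica fault, which is exactly the case TMR is built to handle.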
Placement and “where errors can hide”
A voter placed only at the end of a pipeline can hide internal divergence for many cycles. A voter placed at
register boundaries can stop an error from becoming state, but can increase overhead and complicate timing.
The “right” answer depends on your fault model, uptime goals, and how expensive recovery is.
Concrete Examples: Where Redundant Flip-Flops and Voting Logic Show Up
Space and high-radiation electronics
In radiation-heavy environments, single event effects can flip storage bits or create transient pulses that get captured
at clock edges. One reason TMR is so common in space-grade designs is that it can mask a single error without
immediate system interruption, especially when paired with strategies like scrubbing, endpoint protection, and careful
partitioning of state.
Functional safety microcontrollers and lockstep processing
In automotive and industrial safety contexts, you often see dual-core lockstep approaches: two cores
execute the same instructions, and comparator logic flags mismatches. This usually targets high diagnostic coverage,
allowing the system to detect faults and transition to a safe state. It’s redundancy with an attitude: “I’m not here
to keep going at any cost; I’m here to stop safely when something looks off.”
FPGAs in mission-critical control
FPGAs are powerful, and sensitive. Designers frequently apply TMR to state machines, counters, and control paths, then
selectively harden or replicate the logic that matters most. Tool flows exist to help insert TMR and verify that
triplication and voters landed where they should.
Memory systems: ECC as “voting” over time
While this article focuses on flip-flops, it’s worth noting that ECC/EDAC is a cousin of voting logic:
multiple redundant bits (parity/check bits) allow detection and correction of corrupted data words. In practice, many
systems combine ECC-protected memories with redundancy on key state registers.
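To show how check bits can locate and fix an error, here is a sketch using the textbook Hamming(7,4) code, chosen only because it is small. Real memories typically use wider codes, and nothing in this article specifies one. The function names are illustrative.

```python
from typing import List, Tuple


def hamming74_encode(data: List[int]) -> List[int]:
    """Encode 4 data bits into 7 bits; parity bits sit at positions 1, 2, and 4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_correct(code: List[int]) -> Tuple[List[int], int]:
    """Fix at most one flipped bit. Returns (data bits, syndrome); syndrome 0 means clean."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # 1-based position of the flipped bit
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], syndrome


word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1                    # flip the bit at position 5
print(hamming74_correct(word))  # ([1, 0, 1, 1], 5)
```

Where a voter compares whole copies, the syndrome compares overlapping parity groups, and the pattern of failing groups points at the bad bit.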
Choosing the Right Approach: A Quick Decision Guide
If you need to keep running through a single fault
You’re likely in TMR territory, at least for key state. Focus on:
(1) where voters sit, (2) how you prevent voter failure from becoming catastrophic, and (3) whether you “refresh”
the redundant copies back to agreement.
If your priority is detection and safe shutdown
DMR + compare (or lockstep) can be an excellent fit. It can also be easier to validate, since you’re
primarily proving “faults get detected” rather than “faults get masked under every possible timing condition.”
If you’re fighting a specific fault mechanism in storage
Consider hardened flip-flops (like DICE-based or other SEU-tolerant cells) when triplication is too
costly or when you want protection at the device level. Be realistic about technology-node effects and validate against
your actual environment.
Implementation Pitfalls (a.k.a. How Redundancy Can Betray You)
1) Correlated faults
Redundancy assumes failures aren’t identical across copies. But if the same clock glitch, EMI event, or power droop hits
all three replicas the same way, your TMR “majority” can confidently vote for the wrong answer.
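A tiny sketch of that failure, assuming a disturbance that flips the same bit in every replica (the `golden` and `glitch` values are arbitrary):

```python
def majority3(a: int, b: int, c: int) -> int:
    return (a & b) | (a & c) | (b & c)


golden = 0b1010
glitch = 0b0001  # the same bit gets flipped wherever the disturbance lands

independent = [golden ^ glitch, golden, golden]                    # one replica hit
correlated = [golden ^ glitch, golden ^ glitch, golden ^ glitch]   # all three hit

print(majority3(*independent) == golden)  # True: the single fault is masked
print(majority3(*correlated) == golden)   # False: the voter agrees with the error
```

The voter is working perfectly in both cases; it simply has no way to know the three inputs were never independent.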
2) Shared resources that aren’t redundant
A common gotcha is leaving critical resources singular: a single reset line, a single clock tree, a single configuration
controller, or a single voter feeding everything. If that shared resource fails, redundancy elsewhere won’t save you.
3) Voting too late (or too early)
Vote too late, and bad logic can become bad state. Vote too early, and you pay a big overhead tax and might introduce
timing problems. Great designs treat voter placement like an architectural decision, not a post-layout decoration.
4) Verification that stops at “it simulates”
Functional simulation won’t automatically prove fault tolerance. Designers often use targeted fault injection, formal
checks, or specialized verification flows to confirm that triplication and voting behave as intended, especially in FPGA
designs where automated TMR insertion is used.
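As a rough software analogue of a fault-injection campaign, the sketch below flips every bit of every replica, one at a time, and records any injection the voter fails to mask. The 8-bit width, `single_fault_campaign`, and the deliberately broken `missing_voter` are assumptions for illustration; a real flow would inject into RTL or hardware.

```python
from itertools import product
from typing import Callable, List, Tuple


def majority3(a: int, b: int, c: int) -> int:
    return (a & b) | (a & c) | (b & c)


def single_fault_campaign(
    golden: int,
    voter: Callable[[int, int, int], int] = majority3,
    width: int = 8,
) -> List[Tuple[int, int]]:
    """Return every (replica, bit) single-bit injection that escaped masking."""
    escapes = []
    for replica, bit in product(range(3), range(width)):
        copies = [golden, golden, golden]
        copies[replica] ^= 1 << bit
        if voter(*copies) != golden:
            escapes.append((replica, bit))
    return escapes


print(single_fault_campaign(0xA5))  # []  (every single fault is masked)


def missing_voter(a: int, b: int, c: int) -> int:
    """A deliberate bug: replica 0 is wired straight through with no vote."""
    return a


print(len(single_fault_campaign(0xA5, voter=missing_voter)))  # 8
```

The second run is the interesting one: the design “simulates” fine with no faults, and only the injection campaign exposes the unprotected path.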
Practical Checklist: Designing with Redundant Flip-Flops and Voters
- Define your fault model: SEU, transient, timing fault, stuck-at, bridging, power droop, EMI, etc.
- Decide: detect vs mask: DMR for detection; TMR for masking (generally).
- Choose voter strategy: majority vote, compare + safe state, or hybrid.
- Plan voter placement: at register boundaries, after pipelines, or at module outputs.
- Handle divergence: do you refresh/correct, scrub, or just flag?
- Protect shared resources: clocks, resets, configuration paths, and voters.
- Verify intentionally: fault injection, assertions, and coverage for “fault detected/masked.”
Field Notes: Real-World “Circuit VR” Experiences
Engineers who work with redundant flip-flops and voting logic tend to collect the same stories, even if they work in
different industries. One of the most common “first lessons” is that redundancy doesn’t remove engineering effort; it
moves it. You save yourself from a rare bit flip, then spend that saved time arguing with timing closure,
routing constraints, and the realization that your beautifully triplicated logic still shares one reset signal.
A classic experience is discovering that the voter is now the star of your timing report. In a normal design, a few
extra gates might be negligible. In a redundant design, voters can land on critical paths because they sit at module
boundaries where everything converges. Designers learn to pipeline voters, constrain them carefully, or move them so
that they don’t become the bottleneck that forces a lower clock frequency.
Another recurring “aha” moment is correlated failure. Teams often assume that three copies guarantee safety, then a
shared clock glitch or power droop knocks all three replicas in the same direction. The voter shrugs and outputs the
wrong value with absolute confidence. This is why experienced designers talk about independence as much as
redundancy: spatial separation in an FPGA floorplan, separate routing, separate power domains when possible, and careful
thinking about what a single disturbance can touch at once.
In verification, many teams start with basic functional tests, then realize they’ve only proven the design works when
nothing goes wrong, like testing a parachute by wearing it indoors. Practical redundancy verification often includes
fault injection: forcing one replica’s register bit to flip, temporarily corrupting one module’s output, or injecting
mismatches to ensure the system (a) masks correctly in TMR, or (b) detects and transitions safely in DMR/lockstep. The
first time you run fault injection, you usually uncover at least one “oops”: a voter missing on a critical path, an
error flag not latched, or a recovery path that depends on the very state that got corrupted.
There’s also a human-factors experience: redundancy changes debugging culture. When a bug appears, the first question
becomes “Is it real logic, or the fault-tolerance layer doing its job?” Engineers learn to add observability: status
registers that count voter disagreements, logs of compare faults, and telemetry that shows which replica diverged. These
signals turn “it failed once in the field” into “replica B disagreed for 3 cycles before correction,” which is the
difference between guesswork and engineering.
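A minimal sketch of that kind of observability, assuming per-replica disagreement counters attached to the voter. `VoterTelemetry` and its fields are hypothetical names, and the three-cycle scenario is made up to mirror the example above.

```python
def majority3(a: int, b: int, c: int) -> int:
    return (a & b) | (a & c) | (b & c)


class VoterTelemetry:
    """Majority voter that also counts how often each replica was outvoted."""

    def __init__(self) -> None:
        self.disagreements = [0, 0, 0]  # index 0 = replica A, 1 = B, 2 = C

    def vote(self, a: int, b: int, c: int) -> int:
        voted = majority3(a, b, c)
        for index, copy in enumerate((a, b, c)):
            if copy != voted:
                self.disagreements[index] += 1
        return voted


telemetry = VoterTelemetry()
golden = 0b1100
for _ in range(3):
    telemetry.vote(golden, golden ^ 0b0100, golden)  # replica B is wrong for 3 cycles
telemetry.vote(golden, golden, golden)               # then it is corrected

print(telemetry.disagreements)  # [0, 3, 0]
```

In hardware these counters would themselves be registers, so it is worth deciding whether they need protection too or are acceptable as best-effort diagnostics.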
Finally, teams learn that redundancy is most cost-effective when it is selective. Triplicating everything can
be expensive in area and power, and it can turn simple changes into a routing nightmare. Many successful designs focus
redundancy on the places where state matters most: control FSMs, safety interlocks, configuration registers, and
boundary points where a wrong decision can propagate. In other words, “Circuit VR” thinking becomes an architectural
habit: protect the decisions, protect the memory of those decisions, and make the voter trustworthy enough that it
earns its referee shirt.