Blog: Building Reliable Systems with Redundancy in Space Computing

Building Reliable Systems from Unreliable Hardware: The Future of Space Computing

Co-Authors: Jason Cerundolo & Amanda Mork

In the unforgiving vacuum of space, reliability isn’t a luxury; it’s a lifeline. A single bad solder joint can derail an entire mission, as nearly happened with Apollo 14’s abort switch, so NASA’s obsession with component ‘quality’ makes perfect sense. The consequences of failure are extreme. When billion-dollar crewed missions and satellites the size of school buses were on the line, failure wasn't an option.

But here's the reality: everything fails eventually.

Designing for Failure

NASA's current doctrine states that no system can be more reliable than its components. High-risk missions demand premium parts. It's logical, intuitive, and fundamentally flawed.

Payload missions traditionally use hardware classified by NASA as Class A through D, with Class A being the most reliable and Class D the least. Similarly, components are graded from Class 1 (highest quality) to Class 4. For high-stakes missions, the rule has been simple: Class A missions may use only Class 1 components. The result is stringent standards, enormous costs, and long development cycles.

However, this approach assumes that system reliability is only as good as its individual components. Yet, even the best parts fail eventually, especially in space, where extreme temperatures, radiation, and physical stress accelerate degradation.

Consider the math: with five critical components in series, each 90% reliable, your system reliability plummets to about 59%.

$$P_{\text{good}} = (0.9)^5 \approx 0.59$$

But redundancy changes everything. Those same components in parallel pairs yield 95% system reliability:

$$P_{\text{good}} = \left[1 - (1 - 0.9)^2\right]^5 \approx 0.95$$
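
Both figures are easy to check with a few lines of Python. This is a minimal sketch: the function names are illustrative, and it assumes component failures are independent, which parts sharing a power rail or the same radiation environment may not satisfy.

```python
def series_reliability(r: float, n: int) -> float:
    """Probability that all n independent components in a chain work."""
    return r ** n


def redundant_reliability(r: float, n: int, copies: int = 2) -> float:
    """Same chain of n stages, but each stage has `copies` parallel parts.

    A stage fails only if every one of its copies fails.
    """
    stage = 1.0 - (1.0 - r) ** copies
    return stage ** n


if __name__ == "__main__":
    print(f"single string: {series_reliability(0.9, 5):.3f}")     # 0.590
    print(f"parallel pairs: {redundant_reliability(0.9, 5):.3f}")  # 0.951
```

Raising `copies` to 3 in the same sketch pushes the five-stage figure above 99%, which is why the independence assumption deserves scrutiny before trusting the number.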

More importantly, redundancy buys time. When one system fails, you can repair it while its twin maintains operation. The mission continues as long as you restore systems faster than they fail.
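
To see how repair speed feeds into uptime, here is a rough sketch using the standard steady-state availability formula, MTBF / (MTBF + MTTR). The function names and the hour figures are hypothetical, chosen only for illustration, and the model assumes the two units fail and are repaired independently.

```python
def unit_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time a single repairable unit is working."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def pair_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability of two redundant units; the pair is down only when both are."""
    down = 1.0 - unit_availability(mtbf_hours, mttr_hours)
    return 1.0 - down ** 2


if __name__ == "__main__":
    # Hypothetical figures: 1000 h between failures, varying repair times.
    for mttr in (100.0, 10.0, 1.0):
        single = unit_availability(1000.0, mttr)
        pair = pair_availability(1000.0, mttr)
        print(f"MTTR {mttr:>5.0f} h: single {single:.4f}, pair {pair:.6f}")
```

The shorter the repair time relative to the time between failures, the smaller the window in which both units are down at once.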

The Power of Redundancy

Google nailed the redundancy concept in their data centers. They saw that high-quality drives eventually failed anyway—and expensively. Their solution? Expect failure. Design for it. Buy cheaper drives but run them in parallel. Today's cloud infrastructure depends on this principle: multiple software copies across distributed servers, self-healing through coordinated redundancy.

The aerospace industry isn't entirely foreign to this concept. Commercial aircraft carry redundant engines, hydraulics, and flight computers. Spacecraft carry backup radios and duplicate flight software. And we now fly entire fleets of satellites assuming some of them may fail.

So why not embrace this at the component level?

At Colossus, we leverage redundancy for our computing systems in a unique way to improve reliability. Take Kestrel, our first GPU: we intentionally duplicate critical components to eliminate single points of failure. Our error-correction protocols are built with the assumption that radiation will inevitably corrupt data. By applying the first principles of redundancy, we’ve engineered a system with a remarkably low failure rate, ensuring robust performance even in the harshest conditions.
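
As a generic illustration of designing around corrupted data, one textbook technique is triple modular redundancy: keep three copies of a value and take a bitwise majority vote. The sketch below is not a description of Kestrel's actual protocols, which this post does not detail; the function name and the sample values are hypothetical.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three copies.

    Each output bit is set when at least two of the three inputs have it set,
    so a flipped bit in any single copy is outvoted by the other two.
    """
    return (a & b) | (a & c) | (b & c)


original = 0b1011_0010
corrupted = original ^ 0b0000_1000  # simulate one radiation-induced bit flip

recovered = majority_vote(original, corrupted, original)
assert recovered == original
```

Voting masks a fault in one copy at a time; if the same bit flips in two copies before the data is rewritten, the vote returns the wrong value, which is why such schemes are usually paired with periodic refresh of the stored copies.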

A Better Path Forward

Imagine constellations of affordable satellites where individual malfunctions become irrelevant. Picture modular rovers that keep exploring after non-critical component breakdowns. Envision distributed sensor networks built from commercial parts, studying distant worlds.

Systems aren’t necessarily as strong as their weakest link; they can be stronger. Specifically, Class A missions should be able to use Class 2+ components, assuming sufficient work is done at the system design level. With launch access becoming cheaper and easier to obtain, there’s plenty of mass budget for new approaches.
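
One way to frame that system-level work is to ask how many copies of a lower-grade part are needed to reach a target reliability. The sketch below is a back-of-the-envelope aid under the assumption of independent failures; the function name, the cap on copies, and every number passed to it are hypothetical and do not correspond to any NASA component grade.

```python
def copies_needed(part_reliability: float, target: float, max_copies: int = 16) -> int:
    """Smallest number of parallel copies whose combined reliability meets target.

    Assumes failures are independent. Raises ValueError if max_copies is not enough.
    """
    if not 0.0 < part_reliability < 1.0:
        raise ValueError("part_reliability must be strictly between 0 and 1")
    for copies in range(1, max_copies + 1):
        if 1.0 - (1.0 - part_reliability) ** copies >= target:
            return copies
    raise ValueError("target not reachable within max_copies")


if __name__ == "__main__":
    # Hypothetical: a 90%-reliable part and a 99.5% target for the stage.
    print(copies_needed(0.9, 0.995))  # 3
```

The answer is only as good as the independence assumption, which is where the validation work described next comes in.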

The key is rigorous validation:

Each redundant system must be thoroughly tested
Failure modes must be completely understood
Cascade effects must be prevented
Clear criteria must determine where this approach is appropriate
Comprehensive testing must validate system-level reliability

Let's be clear: this isn't about lowering standards – it's about raising them by thinking differently. As Mark Watney says in The Martian: "At some point everything is going to go south on you… you just begin, you do the math, you solve that one, and then you solve the next one." The key is knowing which problems need redundancy, which need premium components, and having the engineering rigor to tell the difference.

The future of space doesn't lie in flawless components. It lies in flawless system design.