This happens. BGP is notoriously fragile, and Facebook had integrated BGP management into their DevOps pipelines to drive agility, so the risk of failure was high.
What made this worse for Facebook is that everything is integrated and runs on their own internal stack. So we heard news of employees being unable to get into buildings because their badge readers wouldn’t work, and rumours of grinders being needed to break into server cabinets.
In the modern enterprise, everything runs on top of IP. So if you have a catastrophic IP failure and no resiliency, the impact is far and wide. This highlights an important distinction between redundancy and resiliency.
Redundancy ensures you have multiple points of coverage in case of failure. Dual power supplies. Dual fibre entrances into your building. Active/active servers. Redundant databases.
The problem with redundancy is that it is subject to failures of kind: a fault that affects one copy affects every identical copy. If there is a software flaw in your dual databases, both get hit. If there is an issue with your routing protocol, it takes down your redundant network connections. Redundancy is easy to design and typically lower cost because you only have one tech stack to manage, you need fewer spare parts for fixing failed systems, and so on.
Resiliency, on the other hand, requires diversity and heterogeneity. If your home internet goes down, odds are your cell service is not on the same network. If you can’t drive into the city, you can take the train, bike, or walk. Diversity has costs – maintaining a train system, bike paths, roadways, sidewalks. But if one goes down you can still get home.
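To make the distinction concrete, here is a minimal sketch under a deliberately simple assumption: each path to a service depends on exactly one technology stack, and a fault in that stack takes out every path built on it. The names `Path`, `survivors`, and the stack labels are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    """One way of reaching a service (toy model, not a real API)."""
    name: str
    stack: str  # the technology this path depends on, e.g. "bgp" or "lte"


def survivors(paths, failed_stack):
    """Return the paths still usable after every path on `failed_stack` fails."""
    return [p for p in paths if p.stack != failed_stack]


# Redundant: two links, both dependent on the same routing stack.
redundant = [Path("fibre-a", "bgp"), Path("fibre-b", "bgp")]

# Resilient: the second path depends on a different stack entirely.
resilient = [Path("fibre-a", "bgp"), Path("cellular", "lte")]

# A single fault in the shared stack removes every redundant path at once...
print([p.name for p in survivors(redundant, "bgp")])  # []
# ...while the diverse design keeps one way in.
print([p.name for p in survivors(resilient, "bgp")])  # ['cellular']
```

Duplicating `fibre-a` any number of times would not change the first result. Only a path on a different stack does.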
In a world where we think of things from an atomic perspective – fibre conduits, bridges, train lines – redundancy via duplication seems like a natural approach. But our modern society runs on bits just as much as it does on atoms, and vulnerabilities have a way of taking down whole systems regardless of physical duplication.
Be thoughtful in how you design, plan, and test. Design for resiliency, not just redundancy.