Designing and testing complex products can be hard. What are complex products? Products or systems with many moving parts and many subsystems operating across many layers can be thought of as complex systems. When I was studying Software Engineering, one course particularly interested me: systems theory. It describes complex systems and explains how they work from the inside out. One simple rule I like when approaching the architecture of software systems is Gall’s Law, which also grew out of systems theory. Some of the knowledge from this theory is used in machine learning as well, especially in training models with the backpropagation algorithm. However, I learned the most about SaaS architecture and design from the time when I was managing huge projects building Tier 3 data centers. How are those two connected?
Designing a data center, and what we can learn from it
You all know what cloud computing is and probably use some form of it. Every cloud service is served from one of the data centers spread across the globe. I have been in the role of solution architect and project manager in building three big data centers.
A data center is a complex system. It has many parts: the building itself, the iron cage (to protect the data center from lightning strikes and eavesdropping), powerful climate systems, a fire suppression system, a security system, a complex electrical schema, a complex network schema, battery pools, generators, and more. A typical Tier 3 data center must endure power loss, heavy storms, network outages, fire, flood, and in some cases bombs and bullets (yes, I have also delivered data centers for the military, top secret 😉). Enduring means that if something bad happens, the data center must continue to operate. If you back up your database to some cloud service, that is why the data center must keep running.
Tier 3 data centers are defined by specific standards and requirements, which make them more reliable and resilient than Tier 1 and Tier 2 facilities. Here are the key characteristics and requirements of a Tier 3 data center:
- Redundancy and Fault Tolerance: Tier 3 data centers must have N+1 redundancy in power and cooling resources. This means that for every N components needed to operate, there is at least one additional unit (+1) as backup, so a single component failure does not disrupt services.
- Concurrent Maintainability: This feature allows for any component of the data center to be maintained or replaced without interrupting operations. This is crucial for performing regular maintenance without affecting uptime.
- Uptime: A Tier 3 data center is designed to achieve 99.982% availability. This translates to a maximum of 1.6 hours of downtime per year.
- Power and Cooling: The data center must have multiple paths for power and cooling, with systems set up in a way that allows for the maintenance of power and cooling systems without a shutdown.
- Security and Monitoring: Enhanced security measures, including multi-factor access control, video surveillance, and regular security audits. Also, continuous monitoring of network and environmental parameters to ensure optimal performance and immediate incident response.
- Connectivity: Multiple diverse and redundant network connections to ensure continuous network availability.
- Backup Systems: This includes backups for critical components like generators, UPS systems, and HVAC systems.
- Fire Suppression and Environmental Controls: Advanced fire suppression systems and environmental controls to manage humidity and temperature levels are essential.
- Certification and Compliance: Tier 3 data centers often seek certifications from recognized bodies like the Uptime Institute or comply with standards like ISO 27001 for information security management.
- Support Availability: Round-the-clock on-site support and technical staff availability to address issues immediately.
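The availability and downtime figures above are related by simple arithmetic. A quick Python sketch reproduces the 1.6-hour figure from the 99.982% Tier 3 target:

```python
# Convert an availability target into the maximum allowed downtime per year.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years

def max_downtime_hours(availability_percent: float) -> float:
    """Maximum allowed downtime per year for a given availability target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)

print(round(max_downtime_hours(99.982), 1))  # Tier 3 target: about 1.6 hours
```

The same formula shows why every extra "nine" is expensive: 99.9% availability still allows almost nine hours of downtime a year, while 99.982% allows only about 1.6.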
How can we test whether a design or architecture is valid?
Each of those capabilities needs to be tested to verify that it is up to the standard and performing well. A Tier 3 data center is designed so that the failure of one system has low impact on the others. That principle comes straight from systems theory.
For me, the most interesting part was testing the fire suppression system. To extinguish a fire in the data center, FM200 or a similar gas-based system is used. To test it, a lot of subsystems must work together.
The test is simple: put a metal bucket in the data center, mix gasoline with rubber, and light it. It burns slowly but produces a lot of smoke and, over time, high temperatures, much like burning wires and hardware, and the data center is full of those.
Detectors can sense the presence of CO2, CO, NO2, and other gases produced by fire, as well as heat that is higher than normal. Then the security subsystem checks whether any people are checked in inside the data center. Backup cameras and motion sensors confirm the data from the security system. The alarm is triggered at the same time. If there are no people inside, all windows are automatically closed and the ventilation system is shut down to prevent fresh oxygen from coming in. Once all these checks pass, FM200 pumps (more like explodes) a huge volume of gas into the data center, suppressing any fire almost instantly, mainly by absorbing heat and interrupting combustion.
When the heat detectors report that temperatures are back to normal and there is no fire, the subsystems open the windows and the ventilation system pumps the smoke out, while the data center continues to operate normally. It is very impressive to watch those systems work together and suppress a fire in less than 30 seconds from the moment we started it.
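The release sequence above can be sketched as a small decision function. All subsystem names and checks here are hypothetical illustrations of the cross-checking idea, not the control logic of any real FM200 installation:

```python
from dataclasses import dataclass

@dataclass
class SensorReadings:
    smoke_detected: bool     # CO2/CO/NO2 gas detectors
    heat_above_normal: bool  # heat detectors
    people_checked_in: bool  # access-control (security) system
    motion_detected: bool    # backup cameras and motion sensors

def should_release_agent(s: SensorReadings) -> bool:
    """Release the suppressant only when fire is confirmed by two detector
    types AND two independent occupancy sources agree the room is empty."""
    fire_confirmed = s.smoke_detected and s.heat_above_normal
    room_empty = not s.people_checked_in and not s.motion_detected
    return fire_confirmed and room_empty

def run_sequence(s: SensorReadings) -> list[str]:
    """Return the ordered steps the subsystems would take."""
    steps = ["trigger alarm"]  # the alarm fires regardless of the outcome
    if should_release_agent(s):
        steps += ["close windows", "shut down ventilation", "release FM200"]
    else:
        steps.append("hold release, notify operators")
    return steps

print(run_sequence(SensorReadings(True, True, False, False)))
```

Note the design choice: no single sensor can trigger the release on its own, which mirrors how the real test cross-checked the security system against cameras and motion sensors before flooding the room.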
When architecting and delivering software and services, we can learn a lot from Tier 3 data center designs and procedures.
- Design your software as a combination of connected subsystems, where each subsystem behaves predictably.
- Isolate subsystems enough that the failure of one does not cascade into the failure of another.
- Install sensors inside each subsystem and between them; they will help you monitor the health and performance of the product.
- When an outage happens, use the sensor data to detect the cause and implement an automated procedure to fix it. This is the golden rule of monitoring huge and complex systems, like SaaS apps or social networks.
- If you cannot identify the cause, raise an alarm to the support engineer; the telemetry data will later help you run a postmortem on the outage and prepare automation for fixing it.
- Automate everything and focus on total quality; it will pay off.
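The detect-then-remediate loop in the list above can be sketched in a few lines. The cause names and fixes are hypothetical placeholders; the point is the shape of the loop: known causes get an automated fix, unknown causes page a human, and every postmortem grows the table of known causes:

```python
from typing import Callable, Optional

# Map known failure causes to automated fixes (the "golden rule" above).
# Both the causes and the remediations here are illustrative placeholders.
REMEDIATIONS: dict[str, Callable[[], str]] = {
    "db_connection_pool_exhausted": lambda: "restarted pool",
    "disk_full": lambda: "rotated logs",
}

def handle_outage(detected_cause: Optional[str]) -> str:
    """Apply a known automated fix, or escalate to a human engineer."""
    if detected_cause in REMEDIATIONS:
        return f"auto-fix: {REMEDIATIONS[detected_cause]()}"
    # Unknown cause: alert a human. The preserved telemetry feeds the
    # postmortem, which should produce a new entry in REMEDIATIONS.
    return "alert: paging on-call engineer, telemetry preserved"

print(handle_outage("disk_full"))
print(handle_outage("unknown_latency_spike"))
```

Over time this loop shifts work from humans to automation: each postmortem that adds an entry to the remediation table is one less 3 a.m. page.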