Cooling crisis at CME: A wake-up call for modern infrastructure governance

Credit: Network World

There are several lessons that IT executives can learn from the Thanksgiving outage that led to an hours-long shutdown of global derivatives trading at CME Group, an organization that describes itself as the “world’s leading derivatives marketplace.”

Bloomberg reported the following day that the problem had been in the cooling system at a data center complex 50 miles west of Chicago, operated by CyrusOne in Aurora, Illinois, which, it said, “serves as the main hub for trillions of dollars of derivatives traded each day. Inside temperatures soared past 100F despite the frigid weather, according to people familiar with the matter.”

Network World reached out to both CME and CyrusOne for comment, but only the latter replied with the following statement: “CyrusOne has restored stable and secure operations at its Chicago 1 data center in Aurora, Illinois. To further enhance continuity, we have installed additional redundancy to the cooling systems.”

Sanchit Vir Gogia, the chief analyst at Greyhound Research, said Monday that “the CME outage in Aurora is a case study in how a single physical failure inside a data center can escalate into a global market disruption when governance, failover logic, and environmental engineering are not aligned with the realities of modern infrastructure.”

The situation that occurred, he said, “was not a black swan. It was a predictable failure pattern rooted in the physics of cooling systems, the rising thermal load of contemporary computing, and the long-standing habit of treating cooling and environmental systems as peripheral rather than mission critical.”

Gogia noted that the fundamental issue was not only that a cooling plant malfunction knocked out multiple chillers simultaneously, but also that the plant’s backup systems failed. “It was the failure cascading across redundant units that should have been designed and tested to fail independently,” he said. The rapid temperature spike to unsafe levels “made it impossible for CME to keep matching engines online, and once the thermal curve moved beyond a certain point, human decision-making lagged behind physical reality.”

Organizations should reassess redundancy

However, he pointed out, “the deeper concern is that CME had a secondary data center ready to take the load, yet the failover threshold was set too high, and the activation sequence remained manually gated. The decision to wait for the cooling issue to self-correct rather than trigger the backup site immediately revealed a governance model that had not evolved to keep pace with the operational tempo of modern markets.”

Thermal failures, he said, “do not unfold on the timelines assumed in traditional disaster recovery playbooks. They escalate within minutes and demand automated responses that do not depend on human certainty about whether a facility will recover in time.”
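
Neither CME nor CyrusOne has described how its failover logic is actually implemented, but Gogia’s point about automation can be illustrated with a minimal sketch. The thresholds, names, and time-based escalation below are hypothetical; they simply show what a temperature-gated policy that does not wait for human certainty might look like.

from dataclasses import dataclass

@dataclass
class ThermalPolicy:
    warn_f: float = 85.0        # start alerting and pre-staging the secondary site
    failover_f: float = 95.0    # hard trigger: do not wait for human confirmation
    max_warn_minutes: int = 10  # escalate if the warning condition persists

def evaluate(temp_f: float, minutes_in_warn: float, policy: ThermalPolicy) -> str:
    """Return the action automation should take for the current temperature reading."""
    if temp_f >= policy.failover_f:
        return "FAILOVER"                 # past this point, physics outruns human decisions
    if temp_f >= policy.warn_f:
        if minutes_in_warn >= policy.max_warn_minutes:
            return "FAILOVER"             # trend-based trigger, not just a momentary spike
        return "PRE_STAGE_SECONDARY"      # warm up the backup site early
    return "NORMAL"

The point of the two-tier design is Gogia’s: the decision to cut over is encoded ahead of time, so the organization is not debating whether to fail over while the thermal curve is already moving.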

Matt Kimball, VP and principal analyst at Moor Insights & Strategy, said that to some degree what happened in Aurora highlights an issue that may arise on occasion: “the communications gap that can exist between IT executives and data center operators. Think of ‘rack in versus rack out’ mindsets.”

Often, he said, the operational elements of that data center environment, such as cooling, power, fire hazards, physical security, and so forth, fall outside the realm of an IT executive focused on delivering IT services to the business. “And even if they don’t fall outside the realm, these elements are certainly not a primary focus,” he noted. “This was certainly true when I was living in the IT world.”

Additionally, said Kimball, “this highlights the need for organizations to reassess redundancy and resilience in a new light. Again, in IT, we tend to focus on resilience and redundancy at the app, server, and workload layers. Maybe even cluster level. But as we continue to place more and more of a premium on data, and the terms ‘business critical’ or ‘mission critical’ have real relevance, we have to zoom out and look more at the infrastructure level.”

A lesson in risk management

Looking at data center management tools like Siemens DCIM, he said, a lot of telemetry data can be captured from the equipment that provides power and cooling to racks and servers. “[There’s] deep-down telemetry with some machine learning to predict failures before they happen. So, that chiller [failure] in the CyrusOne data center could have and should have been anticipated. Further, redundant equipment should be in operation to enable failover.”
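
Siemens has not published the internals of its DCIM analytics, so the sketch below is only a generic illustration of the kind of trend-based flagging Kimball describes: a rolling baseline of chiller supply temperature, with readings that drift too far from it raised for investigation. The class name, window size, and threshold are assumptions made for the example.

from collections import deque
from statistics import mean, stdev

class ChillerMonitor:
    """Flag chiller supply-temperature readings that drift from the recent baseline."""
    def __init__(self, window: int = 60, z_threshold: float = 3.0):
        self.readings = deque(maxlen=window)   # rolling window of supply temps (°F)
        self.z_threshold = z_threshold

    def observe(self, supply_temp_f: float) -> bool:
        """Return True if the new reading deviates enough to warrant investigation."""
        anomalous = False
        if len(self.readings) >= 10:
            mu, sigma = mean(self.readings), stdev(self.readings)
            if sigma > 0 and (supply_temp_f - mu) / sigma > self.z_threshold:
                anomalous = True               # running hot relative to recent history
        self.readings.append(supply_temp_f)
        return anomalous

A production DCIM platform would correlate many more signals, such as refrigerant pressure, pump vibration, and power draw, but the principle is the same: catch the drift before the failure, not after.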

The incident highlights a critical issue, said John Annand, senior technical counselor at Info-Tech Research Group. “[The fact that] no less than the CME Group experienced a significant business disruption because of an outsourced data center operator should remove any doubt that business continuity is a matter of when, not if.”

CyrusOne, he said, is not “[as big as] Equinix or NTT Data, but as recent outages from AWS and Cloudflare demonstrate, size is not a perfect insulator against catastrophe. This time, it wasn’t the complexities of DNS, but rather a relatively straightforward HVAC problem, but regardless of root cause, every enterprise owes it to itself to have a disaster recovery and business continuity plan.”

The lesson to be learned here “is not one of preparation; rather, it’s around risk management on plan execution,” Annand said. “At some point in time, the CME Group Incident Commander decided that rather than failing over to their secondary site, it was better to let CyrusOne continue to attempt to fix the problem primarily. That choice turned what might have been a minor (but certain) disruption of service to an open-ended (very uncertain) outage lasting some 10 hours.”

A key piece of advice for any organization “is [to] hope for the best but prepare for the worst,” he said. “This is a great example of shared responsibility and of how CME Group is still accountable for its choices despite working through a significant and professional colocation service provider.”

Failover, said Annand, “comes with its own risks (of data loss and reputational impact), and there are repatriation costs and complexity when the primary site comes back up. IT’s obligation is to contextualize the risk calculation that organizational leadership needs to make. How likely is it that the team will be able to fix the problem in production in the next hour or four hours? Balance that against the certainty of a 30-minute, one hour, or two hour-long disruption as you bring up the secondary site.”
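
Annand’s risk calculation can be made concrete with a back-of-the-envelope expected-value comparison. The probabilities and durations below are hypothetical placeholders, not figures from the CME incident; the exercise is simply to weigh an uncertain in-place fix against a certain but bounded failover disruption.

def expected_wait_hours(p_fixed_soon: float, fix_hours: float, worst_case_hours: float) -> float:
    """Expected outage length if the team stays on the primary and hopes for a fix."""
    return p_fixed_soon * fix_hours + (1 - p_fixed_soon) * worst_case_hours

# Hypothetical inputs: a 50% chance the cooling plant is fixed within an hour,
# otherwise a 10-hour outage; failing over costs a fairly certain one hour.
stay_on_primary = expected_wait_hours(p_fixed_soon=0.5, fix_hours=1.0, worst_case_hours=10.0)
fail_over_now = 1.0

print(f"Expected outage if we wait: {stay_on_primary:.1f} h vs. fail over now: {fail_over_now:.1f} h")
# -> Expected outage if we wait: 5.5 h vs. fail over now: 1.0 h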

The industry is ‘running hotter than ever’

Kimball added that as he engages with IT executives, he tries to reinforce the importance of looking at IT environments holistically, which means also focusing on that “rack out” kind of environment: power budget, power cleanliness, cooling, and the like, as well as ensuring that there is redundancy at every level of the stack. This includes ensuring that redundancy accounts for all scenarios, even at the power grid level. “We have seen examples of these power issues with increasing frequency lately,” he pointed out.

In addition, Gogia said, “the real takeaway for CIOs and other IT executives is that resilience is no longer an abstract design goal sitting in a strategy slide. It has become a day-to-day operational responsibility. The industry is running hotter than ever, both literally and figuratively.”

Servers, he said, draw more power, chips throw off more heat, and cooling plants are working at their limits. “That shrinking buffer means the old assumption that cooling failures are rare or unfold slowly no longer holds true,” he said.

“[Environmental systems] now sit on the same fault line as software bugs, power interruptions, or network failures, and they deserve the same level of investment and scrutiny,” he noted. “Teams that once viewed cooling as background infrastructure now have to treat it as part of the uptime equation. Put simply, the physical environment around the compute stack is just as capable of taking a business offline as any digital component.”
