A power outage triggered by a thermal event inside an Amazon Web Services data center in Northern Virginia disrupted Elastic Compute Cloud (EC2) instances and Elastic Block Store (EBS) volumes in the US-EAST-1 region late on Thursday, the cloud provider confirmed in updates posted to its Health Dashboard.
In an incident report timestamped 5:25 PM PDT (00:25 UTC Friday), AWS said it had spotted issues in the use1-az4 availability zone and confirmed that “EC2 instances and EBS volumes hosted on impacted hardware are affected by the loss of power during the thermal event.” Rising temperatures inside a single data center had caused the impairments, the company said in a statement.
AWS shifted traffic away from the affected zone for most services and warned of longer-than-usual provisioning times.
As the evening progressed, the company struggled to bring temperatures down. By 6:47 PM PDT, AWS warned that “Other AWS services that depend on the affected EC2 instances and EBS volumes in this Availability Zone may also experience impairments,” and at 8:06 PM PDT, it conceded that “progress is slower than originally anticipated,” recommending that customers needing immediate recovery restore from EBS snapshots or launch resources in unaffected zones.
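For customers that took that advice, the restore path can be scripted against the EC2 API. The following is a minimal sketch using boto3, with a hypothetical volume ID and target zone standing in for real values; it locates the most recent completed snapshot of an impaired volume and recreates the volume in an unaffected Availability Zone.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Hypothetical identifiers; substitute the impaired volume and a healthy zone.
impaired_volume_id = "vol-0123456789abcdef0"
target_zone = "us-east-1b"  # any zone other than the impaired one

# Find the most recent completed snapshot of the impaired volume.
snapshots = ec2.describe_snapshots(
    OwnerIds=["self"],
    Filters=[
        {"Name": "volume-id", "Values": [impaired_volume_id]},
        {"Name": "status", "Values": ["completed"]},
    ],
)["Snapshots"]
latest = max(snapshots, key=lambda s: s["StartTime"])

# Recreate the volume from that snapshot in the unaffected zone.
new_volume = ec2.create_volume(
    SnapshotId=latest["SnapshotId"],
    AvailabilityZone=target_zone,
    VolumeType="gp3",
)
print("Restored volume:", new_volume["VolumeId"])

Attaching the new volume to a replacement instance launched in the same unaffected zone completes the workaround.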
By 10:11 PM PDT, AWS reported “incremental progress to restore cooling systems” but said users were still “experiencing elevated error rates and latencies for some workflows.”
AWS was still working to resolve the problem at 6:51 AM PDT on Friday, when it offered more insight into what was behind its “thermal event” and “power outage” jargon. After cooling systems failed in the affected availability zone, it said, “servers automatically shut down when the temperatures exceeded the operating thresholds in order to protect the hardware.” The power outage was simply the servers powering down as they had been programmed to do. Before AWS could restore services, it had to get cooling systems working again, and it took until 1:50 PM PDT on Friday to bring cooling system capacity back up to the same levels as before the incident, it said.
The May 7 incident is not the first time US-EAST-1 has gone down. The region suffered two outages in October 2025, including a 15-hour disruption on October 19 and 20 caused by a race condition in DynamoDB’s automated DNS management system that affected over 70 AWS services and produced cascading failures across Slack, Atlassian, Snapchat, and other dependent services. AWS’s Ohio region has also experienced power-related outages affecting EC2 instances in past years.
Customer services go dark
As recovery progressed through the night, AWS confirmed that some services were coming back online faster than others.
“Some AWS services, such as IoT Core, ELB, NAT Gateway, and Redshift, continue to see significant improvements in the recovery of their workflows,” AWS said in a later update. “However, some customers will continue to see their affected EC2 instances and EBS volumes as impaired until we achieve full recovery.”
KoboToolbox, a data collection platform used by humanitarian and development organizations, said its global instance went offline at 00:32 UTC on May 8 because of the AWS infrastructure problem, according to a community advisory posted by Kobo staff. The platform’s EU instance was unaffected.
Physical-layer risk gets a fresh look
Such outages are not unique to AWS, said Bhuvie Chhabra, senior principal analyst at Gartner. “All major cloud providers have experienced similar incidents, highlighting the inherent complexity and challenges of operating at hyperscale,” Chhabra said.
The May 7 event raises a question CISOs should not assume away. They should assess “to what degree AZs are located in physically distinct facilities versus coexisting within the same physical data center” and whether each zone has independent power, networking, cooling, and physical security, Chhabra said. Even when virtual instances are redundant across zones, an application will fail if its database is not similarly redundant, he added.
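One practical wrinkle in that assessment is that AWS identified the impaired zone by its physical Zone ID, use1-az4, while each account sees its own shuffled mapping of logical names such as us-east-1a onto those IDs. A minimal boto3 sketch like the one below can confirm which of an account’s logical zones corresponds to the affected ID, and by extension whether supposedly redundant resources really sit in distinct physical facilities.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Each account's logical zone names map differently onto physical Zone IDs,
# so print the mapping to see which local name corresponds to use1-az4.
for zone in ec2.describe_availability_zones()["AvailabilityZones"]:
    print(zone["ZoneName"], "->", zone["ZoneId"])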
Kaustubh K, practice director at Everest Group, said physical-layer failures should push enterprises to broaden their resilience playbooks. “Physical-layer failures such as power and cooling disruptions highlight that enterprises should extend resilience planning beyond software and cyber risks, particularly for mission-critical applications,” he said. CISOs should identify critical workloads where infrastructure-level disruptions could materially impact operations and ensure appropriate redundancy, failover, and recovery mechanisms are built into the architecture, Kaustubh added.
Concentration risk back in focus
What sets US-EAST-1 apart from other AWS regions is the weight of global dependencies it carries. Many AWS global services, including Identity and Access Management authentication, CloudFront, Route 53, and DynamoDB Global Tables, depend on US-EAST-1 endpoints even for resources deployed in other regions, AWS confirmed in updates during the October 2025 incident.
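For the authentication piece specifically, one commonly recommended mitigation is to pin clients to a regional Security Token Service endpoint rather than the global one, which is served out of US-EAST-1. The snippet below is an illustrative boto3 sketch; the region and endpoint URL are examples, not a prescription.

import boto3

# The global STS endpoint (sts.amazonaws.com) resolves to US-EAST-1; a regional
# endpoint keeps credential vending working if that region is degraded.
sts = boto3.client(
    "sts",
    region_name="eu-west-1",
    endpoint_url="https://sts.eu-west-1.amazonaws.com",
)
print(sts.get_caller_identity()["Arn"])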
US-EAST-1 is a critical global dependency for AWS, and except for Oracle, all hyperscale providers carry some global dependencies, Chhabra said. AWS is unique in publicly documenting these in its Fault Isolation Boundaries white paper. “Reducing the concentration risk to zero is unattainable,” Chhabra said, adding that CISOs must instead manage it through a life cycle approach to third-party risk management, partnering with sourcing, procurement, and vendor management to track changes in the vendor footprint.
“While Availability Zone separation continues to provide an important resilience layer, enterprises running mission-critical workloads should periodically reassess regional concentration risk and validate whether their resilience posture aligns with business continuity expectations,” Kaustubh said.
This article has been updated with fresh information from AWS on the restoration of its cooling systems.