iTnews

Microsoft had three staff at Australian data centre campus when Azure went out

By Ry Crozier
Sep 4 2023 6:55AM

Cascading failures and root causes revealed.

Microsoft had “insufficient” staff levels at its data centre campus last week when a power sag knocked its chiller plant for two data halls offline, cooking portions of its storage hardware.

The company has released a preliminary post-incident report (PIR) for the large-scale failure, which saw large enterprise customers including Bank of Queensland and Jetstar completely lose service.

The PIR sheds light on why some enterprises lost service altogether: so many storage nodes were gracefully shut down - or had components fried - in the incident that data, and all replicas of it, were offline.

In addition, after storage nodes were finally recovered, a "tenant ring" hosting over 250,000 databases failed, albeit with uneven impact on customers.

Chillers offline

Microsoft said the cooling capacity for the two affected data halls “consisted of seven chillers, with five chillers in operation and two chillers in standby (N+2)”.

A power sag - voltage dip - caused the five operating chillers to fault. In addition, only one of the standby units worked.
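The arithmetic behind that shortfall can be sketched in a few lines. This is an illustrative calculation only, not taken from the PIR: it assumes every chiller has equal capacity and that five running units are needed to carry the full cooling load, and the `ChillerPlant` class and its method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ChillerPlant:
    """Hypothetical model of an N+2 chiller plant with equal-capacity units."""
    operating: int  # units normally carrying the load (N)
    standby: int    # spare units held in reserve

    def available_after(self, operating_faulted: int, standby_started: int) -> int:
        """Units still running after some operating units fault and some spares start."""
        if not 0 <= operating_faulted <= self.operating:
            raise ValueError("operating_faulted out of range")
        if not 0 <= standby_started <= self.standby:
            raise ValueError("standby_started out of range")
        return (self.operating - operating_faulted) + standby_started

    def shortfall_after(self, operating_faulted: int, standby_started: int) -> int:
        """How many units short of the required N the plant is."""
        available = self.available_after(operating_faulted, standby_started)
        return max(0, self.operating - available)


# Figures from the report: five operating, two standby; all five operating
# units faulted and only one standby unit came online.
plant = ChillerPlant(operating=5, standby=2)
running = plant.available_after(operating_faulted=5, standby_started=1)
shortfall = plant.shortfall_after(operating_faulted=5, standby_started=1)
```

Under those assumptions, N+2 redundancy covers the loss of any two units, but a single event that faults all five operating chillers leaves one unit running and the plant four short of its normal complement.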

Microsoft said the onsite staff “performed our documented emergency operational procedures (EOP) to attempt to bring the chillers back online, but were not successful.”

The company appeared to be caught out by the scale of the incident, with not enough staff onsite, and its emergency procedures not catering for the size of the issue.

“Due to the size of the data centre campus, the staffing of the team at night was insufficient to restart the chillers in a timely manner,” the company said.

“We have temporarily increased the team size from three to seven, until the underlying issues are better understood and appropriate mitigations can be put in place.”

On its EOP, Microsoft said: “The EOP for restarting chillers is slow to execute for an event with such a significant blast radius.”

“We are exploring ways to improve existing automation to be more resilient to various voltage sag event types.”

While there weren’t enough staff to execute the documented procedures, having more staff would only have reached the same result faster, as the chillers themselves have issues.

Preliminary investigations showed the chiller plant did not automatically restart “because the corresponding pumps did not get the run signal from the chillers.”

“This is important as it is integral to the successful restarting of the chiller units,” Microsoft said.

“We are partnering with our OEM vendor to investigate why the chillers did not command their respective pump to start.”

Microsoft said the faulted chillers could not be manually restarted “as the chilled water loop temperature had exceeded the threshold.”

With rising temperatures, and thermal warnings from infrastructure, Microsoft had no choice but to shut down servers.

“This successfully allowed the chilled water loop temperature to drop below the required threshold and enabled the restoration of the cooling capacity,” it said.

Storage, SQL database recovery

Still, not everything recovered smoothly.

The incident impacted seven storage tenants - five “standard”, two “premium”.

Some storage hardware was “damaged by the data hall temperatures”, Microsoft said. 

Diagnostics weren’t available for troubleshooting because the storage nodes were offline.

“As a result, our onsite data centre team needed to remove components manually, and re-seat them one by one to identify which particular component(s) were preventing each node from booting,” Microsoft said.

“Several components needed to be replaced for successful data recovery and to restore impacted nodes. 

“In order to completely recover data, some of the original/faulty components were required to be temporarily re-installed in individual servers.”

An infrastructure-as-code automation also failed, “incorrectly approving stale requests, and marking some healthy nodes as unhealthy, which slowed storage recovery efforts.”

The failure of a tenant ring hosting over 250,000 SQL databases further slowed recovery, Microsoft said.

“As we attempted to migrate databases out of the degraded ring, SQL did not have well tested tools on hand that were built to move databases when the source ring was in [a] degraded health scenario,” the company said.

“Soon this became our largest impediment to mitigating impact.”

A final PIR is expected to be completed in a few weeks.

Copyright © iTnews.com.au. All rights reserved.
Tags: azure, cloud, microsoft, outage, storage
