Intelligent Data Centres Issue 60 | Page 34

POWER PROBLEMS AND HUMAN ERROR ARE CITED AS AMONG THE MAIN DOWNTIME CAUSES.
EDITOR'S QUESTION

MATTHEW FARNELL, GLOBAL DIRECTOR, EKKOSENSE

According to research from Uptime, data centre downtime costs – and their impact – continue to be problematic for many operators, with reported US$100,000+ incidents increasing 39% since 2019 and those over US$1,000,000 by 15%. Power problems and human error are cited as among the main downtime causes.

A strategy to mitigate downtime must encompass all aspects of data centre operations: systems, people and processes. These range from ensuring redundancy in the design of failover systems and UPS power backup systems, through effective climate and environmental monitoring, data replication, Disaster Recovery, monitoring and alerting systems, planned maintenance scheduling and employee training, to documentation and process control.
For climate and environmental monitoring, and for monitoring and alerting more generally, many data centre operators continue to rely on Building Management Systems (BMS). While the BMS is an essential and overarching system management platform for day-to-day data centre operations, event-based alerting tends to happen after the event, which leads operations teams to be reactive rather than proactive.
As the industry comes to terms with the impact of hosting high-density AI systems, it will be interesting to see how capacity planning strategies adapt to high-density 60kW AI systems and the heat they generate in data halls that were originally designed to host the traditional 3–5kW per rack.
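To put rough numbers on that gap, the short sketch below compares rack loads. It assumes the 60kW figure is per rack and that essentially all IT power ends up as heat to be removed; the 1,000kW hall cooling budget and the function names are hypothetical, chosen only for illustration.

```python
# Illustrative arithmetic only. Assumptions: 60 kW is a per-rack load,
# all IT power becomes heat, and the hall has a hypothetical 1,000 kW
# cooling budget.

HALL_COOLING_KW = 1000.0  # hypothetical figure, not from the article


def density_ratio(new_rack_kw: float, legacy_rack_kw: float) -> float:
    """How many times more heat one new rack rejects than one legacy rack."""
    return new_rack_kw / legacy_rack_kw


def racks_supported(hall_cooling_kw: float, rack_kw: float) -> int:
    """Whole racks a fixed cooling budget can carry at a given rack load."""
    return int(hall_cooling_kw // rack_kw)


if __name__ == "__main__":
    print("Heat per rack vs 5 kW legacy:", density_ratio(60, 5))
    print("Heat per rack vs 3 kW legacy:", density_ratio(60, 3))
    print("Racks at 5 kW:", racks_supported(HALL_COOLING_KW, 5))
    print("Racks at 60 kW:", racks_supported(HALL_COOLING_KW, 60))
```

Under these assumptions a single 60kW rack rejects 12 to 20 times the heat of a traditional rack, and the same hypothetical cooling budget that carried 200 racks at 5kW carries only 16 at 60kW.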
Liquid cooling technologies will need to play a part, and hybrid cooling strategies will become commonplace. The other dynamic that affects human error is the shortage of skilled data centre staff across the industry, and the pressures that places on today’s operations teams and management.
Ensuring uninterrupted mission-critical operations remains the highest priority for data centre teams, and throughout 2024 we’re busy developing solutions that will help operators to maintain uptime. Applying AI and Machine Learning proves effective in analysing the very large datasets produced by M&E systems – providing real-time visibility into what’s really happening in the data hall. A key factor here is the use of gaming technology to provide a 3D user experience, making it much easier for operations teams to visualise what’s going on without the need for intensive training.
Another key innovation is the cooling anomaly advisor that uses data analytics to highlight cooling trends. If cooling unit performance trends up or down, we not only notify the operator but also highlight the underlying causes. This helps operations teams to be much more proactive and get ahead of potential issues.
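The difference between event-based alerting and trend-based advice can be illustrated with a minimal sketch. This is not EkkoSense's method; it is a generic least-squares slope over a window of readings, and the function names, thresholds and sample temperatures are all hypothetical.

```python
from typing import Sequence


def trend_slope(values: Sequence[float]) -> float:
    """Least-squares slope of the readings against their sample index."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two readings to estimate a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def classify(values: Sequence[float], slope_limit: float, alarm_limit: float) -> str:
    """Return 'alarm' once the limit is breached, otherwise describe the trend."""
    if values[-1] >= alarm_limit:
        return "alarm"  # what an event-based alert reports, after the event
    slope = trend_slope(values)
    if slope > slope_limit:
        return "trending up"  # early warning, before any limit is reached
    if slope < -slope_limit:
        return "trending down"
    return "stable"


if __name__ == "__main__":
    # Hypothetical supply-air temperatures in degrees C, one per interval.
    readings = [22.0, 22.5, 23.0, 23.5, 24.0]
    print(classify(readings, slope_limit=0.2, alarm_limit=27.0))
```

With these sample readings the latest value is still well below the hypothetical 27°C alarm limit, so a purely event-based alert would stay silent, while the slope check already reports an upward trend.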
www.intelligentdatacentres.com