Monday, 29 November 2010

Wonder what this cable does....

Interesting day today. Was supposed to go to a meeting in Wolverhampton, but it got canceled because of bad weather (it's snowing!). So, thought I'd got a free day to write lots of reports. Unfortunately it didn't turn out like that. Morning got taken up with various things, and in the middle of the afternoon we had a fairly interesting "incident". Power went off to the computer centre (which houses our main data centre). No real panic at first as we have a generator, and it looked as though it had come on (a cloud of smoke was spotted). So, not ideal, but we thought we were running on generator power. Soon it became apparent that we weren't. Generator hadn't come on and we were running the whole data centre on the UPS (big batteries if you like simple explanations like me!), which has about 20 minutes of life.

So, we quickly declared an incident and gathered to decide what to do. Getting the power back on was an obvious priority - it quickly transpired that a contractor working in the plant room had done "something" to a cable, and it had all gone up in smoke. Didn't know why the generator hadn't come on, but what we did know was that we needed our data centre manager, who was there but couldn't get into the plant room without colleagues from estates, and we needed electricians. Time was running out. Lots of discussions about whether to turn things off so they failed cleanly, but decided we had no time. Managed to get some communications out to staff and students about the risk, and then we just waited for everything to fail! We watched the UPS go down to 0% power, where it seemed to keep everything running for about 4 minutes. Then, as a message from the UPS told us it had failed, we got another message timed to the same second to say the power was back on. Phew. Quite literally we were a second away from everything failing, and yet no services had been lost.

An incident review later in the afternoon explained what had happened - let's just say it was complex, and the generator hadn't failed but had responded correctly under the circumstances. But a lot of lessons learned. Don't want that to happen again!

1 comment:

James Smith said...

I watched the UPS go down to 0% power, where it seemed to keep everything running for about 4 minutes.