Wednesday 1 September 2010

Interesting times

The Chinese allegedly use the phrase "May you live in interesting times" as a curse.  Well, for many of us today, it has been an interesting time. After yesterday's disc failure, and subsequent filesystem rebuild failure, the infrastructure team had spent most of the night cloning the discs and repairing errors to start the rebuild again - we were all fairly confident that we would be able to have email up and running again mid morning. By just after 10am we were delivering mail again, and waiting until we had delivered the backlog before releasing it to users.  The unix team office was a bit like mission control as we all gathered to make sure it was stable, when we suffered another disc failure. You could have heard a pin drop. And a few expletives. And we ran out of biscuits. Luckily we could get more biscuits.

What followed was several hours of analysis, creative thinking and problem solving, including phoning our email administrator, who was on holiday in France and dropped everything to help remotely. Unfortunately the pressure is on very few people - most of us can offer support etc., but we really rely on those with the technical ability to solve these problems.  I was feeling very out of the loop, as I had to leave at lunchtime to catch a train to Edinburgh for a UCISA meeting. My phone battery was running out, the power sockets in the carriage weren't working, and there was virtually no 3G signal! But thank goodness for Twitter, and that 3 of the team doing the fixing were using it, so I just about kept in touch.

Eventually, by moving much of the mail on to our new central filestore, we were able to get a mail service up by mid afternoon for over 80% of users, with the rest being recovered and due to be live tomorrow morning.

A huge thanks to everyone - the infrastructure team who worked almost throughout the night (and are still working as I type this), the incident management team, the communications team, the web team, the switchboard who handled the external calls, and the helpdesk staff who had to deal with many, many calls - over 800 yesterday afternoon alone.

When I arrived at my hotel, I was told I was staying in a converted asylum. Seemed rather appropriate somehow!

5 comments:

Tom said...

I'm sure I'm not the only one to ask in light of the e-mail failure, but you've mentioned in the past possibly moving to Google for staff services...any chance this is in the pipeline?

Anonymous said...

So you go to meetings and get in biscuits, but the unix people are the ones with "the technical ability to solve these problems."

I presume this is reflected in the relative gradings of the posts.

Richard said...

haha very good 'Anonymous' - I like that. So by that logic any senior member of staff should be able to do the jobs of everyone who works for them.

I'm sure the VC can do the job of everyone in the University too!

Jonathan said...

It could've been a good opportunity to promote the MyChat service as an alternative to the lack of email, at least for internal users.

Andrew Horne said...

I'd nipped out to get a sarnie so missed the biccies arriving, consequently the chocolate ones had gone by the time I spotted the tin.

May I suggest business continuity Hob Nobs for next time?