Thursday 21 February 2013

Genomics for Geeks

The opening session at RUGIT is "Human genetics and genomics: the science of the 21st century and why it needs infrastructure", delivered by Ewan Birney from the European Bioinformatics Institute (EBI).

Absolutely fascinating talk. Great for me as an IT Director with a couple of degrees in Genetics!

As usual, I've made notes during the talk, and they are reproduced here. Hope they make sense.

Starts with a crash course in genomics for geeks. Hope I understand it!
DNA is a covalently linked polymer, nearly always found in antiparallel, non-covalently bound pairs. There are only 4 monomers (bases): A, T, C and G. It is always written as a string of letters. One monomer together with its partner on the opposite strand is a base pair.
A genome is all of our DNA. Every cell has two copies of 3 × 10^9 base pairs in 24 polymers (chromosomes).
Is that clear to those of you reading this? Don't worry if it's not.
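
For the geeks, a minimal Python sketch (not from the talk) of what "antiparallel pairs written as a string of letters" means in practice. The function name and the example strand are made up for illustration; the genome figures are the ones quoted above.

```python
# Each base pairs with exactly one partner: A with T, C with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand: str) -> str:
    """Return the partner strand, read in its own direction.

    The two strands run in opposite directions (antiparallel), so the
    partner is the complement of each base taken in reverse order.
    """
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))


# Figures from the talk: two copies of roughly 3 x 10^9 base pairs per cell.
BASE_PAIRS_PER_COPY = 3 * 10**9
COPIES_PER_CELL = 2
base_pairs_per_cell = BASE_PAIRS_PER_COPY * COPIES_PER_CELL

print(reverse_complement("ATCG"))   # CGAT
print(f"{base_pairs_per_cell:,}")   # 6,000,000,000
```

Taking the reverse complement twice gives back the original string, which is a handy sanity check.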

Fred Sanger, who won 2 Nobel prizes, invented DNA sequencing in 1977 (incidentally just as I was completing my first degree). Since then costs have come down and sequencing has become much better. 2007 saw the arrival of next-generation machines and a huge drop in costs, with costs halving every 6 months.
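
To get a feel for what "halving every 6 months" does over time, here is a small sketch. It assumes the halving continues at a steady rate, which is a simplification, and it works in relative terms because no actual prices were given in the talk.

```python
def relative_cost(months: float, halving_period_months: float = 6.0) -> float:
    """Cost after `months`, as a fraction of the starting cost,
    assuming it halves once every `halving_period_months`."""
    return 0.5 ** (months / halving_period_months)


for years in (1, 2, 3):
    fraction = relative_cost(years * 12)
    print(f"after {years} year(s): 1/{round(1 / fraction)} of the starting cost")
```

Under that assumption, three years of halving every 6 months is six halvings, i.e. a 64-fold drop.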

Molecular biologists share their data. They submit it to a global database, which synchronises every night. But the amount of data is increasing hugely, so a data compression scheme is now used for DNA sequencing data.

Back in 2000, effectively 1 human had been sequenced. It was epic!
Now the same data volume is generated in 3 minutes in a current large-scale centre. Now, it's all about the analytics.

We now know about populations. Only 3 in 10,000 bases differ between any two individuals. Very inbred!
We all have a bit of Neanderthal in us, c. 2%.
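
A quick back-of-envelope calculation (mine, not the speaker's) of what 3 in 10,000 means across one copy of the genome, using the 3 × 10^9 figure quoted earlier:

```python
GENOME_BASES = 3 * 10**9   # one copy of the genome, figure from the talk
DIFFERING_PER_10K = 3      # differing bases per 10,000, figure from the talk

expected_differences = GENOME_BASES * DIFFERING_PER_10K // 10_000
print(f"{expected_differences:,}")  # 900,000
```

So "very inbred" still leaves roughly 900,000 differing positions between two people.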

DNA sequencing has 3 potential big impacts on medicine:

1 Germ line impact.
Everyone has a different risk of disease, but the shift is small.
Some carry a bigger risk. One example is FH, familial hypercholesterolaemia, or high cholesterol levels.
With FH, if spotted, there is a drug that works. Most cases are discovered by a cholesterol test in people's 30s. Some don't get picked up, so sequencing could pick them up.

2 Precision cancer diagnosis
Cancer is a genomic disease where a cell grows uncontrollably.
By sequencing a cancer you can understand its molecular form better. Sequence the normal tissue as well, for comparison.
To spot changes, sequence heavily to see errors. Very data heavy. But it is already showing results. Different molecular forms of cancer react differently to different drugs.
Sequencing of cancer will hopefully become standard practice.
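
A toy illustration of the "sequence heavily, and sequence normal as well" idea. This is a deliberately naive sketch and not how real cancer pipelines work: it assumes short reads that are already aligned and of equal length, takes a majority vote at each position to smooth out sequencing errors, and then compares the two results. The reads and function names are invented for the example.

```python
from collections import Counter
from typing import List, Tuple


def consensus(reads: List[str]) -> str:
    """Most common base at each position across equal-length, pre-aligned reads."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))


def differences(normal: str, tumour: str) -> List[Tuple[int, str, str]]:
    """Positions (0-based) where the tumour consensus differs from the normal one."""
    return [(i, n, t) for i, (n, t) in enumerate(zip(normal, tumour)) if n != t]


# Invented toy reads; each set contains one read with a single sequencing error.
normal_reads = ["ACGTAC", "ACGTAC", "ACGAAC"]
tumour_reads = ["ACCTAC", "ACCTAC", "ACCTAG"]

normal_seq = consensus(normal_reads)   # the lone error at position 3 is outvoted
tumour_seq = consensus(tumour_reads)   # the lone error at position 5 is outvoted
print(differences(normal_seq, tumour_seq))  # [(2, 'G', 'C')]
```

The more reads cover each position, the easier it is to tell a genuine change from a one-off error, which is why this is so data heavy.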

3 Hospital acquired pathogens.
Provides a clear-cut diagnosis of pathogens.
Can be used to sequence the environment, e.g. a hospital.
Can spot things before they take hold, for example asymptomatic carriers. This happened in Cambridge recently.

What can geeks do for biology?
Biology is a big data science
Not quite as big as high energy physics, but only 1 order of magnitude smaller.
Heterogeneity and diversity far larger
Always have dirty data
Need stable algorithms
Very high dimensional statistics problem
Often I/O not CPU bound.

Biology needs geeks! Can we convert physicists to 21st century science? NB Don't tell Brian Cox....

Infrastructures are critical, but we only notice them when they go wrong.

EBI's technical infrastructure:
20 PB of raw disk. They don't back up, but there is a mirror in the US, and it's cheaper to fly discs over than do tape backups!
20,000 cores in 2 major farms.
A VMware cloud allowing remote users to directly mount large data sets
4 machine rooms: 2 in London, 2 in Cambridge. Only 1 is near them.
JANET uplink at 2 Gb/sec, with permission to spike to 10 Gb/sec. Moving to 40 Gb/sec.
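
The remark about flying discs makes sense once you do the sums. A rough sketch, assuming "Gb" means gigabits, decimal units (1 PB = 10^15 bytes), and a link running flat out with no overheads, none of which was spelled out in the talk:

```python
def transfer_days(petabytes: float, gigabits_per_second: float) -> float:
    """Days needed to move `petabytes` over a link sustained at the given rate."""
    bits = petabytes * 10**15 * 8                   # decimal petabytes to bits
    seconds = bits / (gigabits_per_second * 10**9)  # gigabits/s to bits/s
    return seconds / 86_400                         # seconds in a day


for rate in (2, 10, 40):
    print(f"20 PB at {rate} Gb/sec: about {transfer_days(20, rate):.0f} days")
```

On those assumptions, 20 PB takes about 926 days at 2 Gb/sec, about 185 days at 10 Gb/sec and about 46 days at 40 Gb/sec.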

There's a big need for data to be available in multiple places, e.g. hospitals.
Need to broaden the base of infrastructure in Europe.
There are a lot of species. Each one needs to be sequenced, and the data cleaned, kept, curated and made available.
Need a robust network with a strong hub.
This is the ELIXIR project and the Hub is at EBI.

Some fun to finish:
Over a beer..... At some point all the data we store is going to be DNA. Why don't we store it as DNA?
Did it on a napkin and got a letter in Nature.
Then did it! Stored all of Shakespeare's sonnets, JPG, PDFs etc.
2 PB of information stored in less than a gram of DNA.
But it will be 600 years before it is cost effective!
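
To show the basic idea of storing data as DNA, here is a naive sketch that maps every 2 bits to one of the 4 bases. This is an illustration only and not the encoding described in the talk: the mapping and function names are assumptions, and a practical scheme also has to cope with errors in writing and reading the DNA.

```python
BASES = "ACGT"  # assumed mapping: 00 -> A, 01 -> C, 10 -> G, 11 -> T


def bytes_to_dna(data: bytes) -> str:
    """Encode each byte as 4 bases, most significant bit pair first."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases)


def dna_to_bytes(dna: str) -> bytes:
    """Decode a string produced by bytes_to_dna back into bytes."""
    out = bytearray()
    for i in range(0, len(dna), 4):
        byte = 0
        for base in dna[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


encoded = bytes_to_dna(b"Hi")
print(encoded)                # CAGACGGC
print(dna_to_bytes(encoded))  # b'Hi'
```

Each byte becomes 4 bases, so the round trip is lossless as long as every base is read back correctly.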

Great talk, I love science! Interesting Q and A at the end about how we as IT Directors can help genome researchers in our institutions, and how we can avoid self-assembled data centres springing up in labs to store their data.


