Friday 29 June 2012

Epigenome-seq state-of-the-art: but is hmC worth the effort for everyone?

A couple of recent papers have demonstrated the ability to distinguish between 5-methylcytosine (5mC) and 5-hydroxymethylcytosine (5hmC) using modified bisulfite sequencing protocols. These methods are likely to make a real impact on epigenomics when combined with Bis-seq. By sequencing each genome twice, once with standard Bis-seq for 5mC and once with oxBS-seq or TAB-seq for 5hmC, a fuller picture of methylation will emerge. However, it is not clear how biologically important the additional data will be, or how much it is worth to researchers.

Trying to complete this kind of experiment on mammalian genomes today requires quite a lot of sequencing muscle. Bis-seq depth guidelines are lacking, but most people would aim for 50x or greater coverage. Across the two libraries that suggests 100-fold genome coverage per sample, or an expensive experiment. This could leave the approaches as a niche application for people with a strong focus on epigenetics.
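As a rough back-of-envelope sketch (all inputs are illustrative assumptions: a 3.2Gb genome, 50x per library and a ~30Gb HiSeq lane), the sequencing burden looks something like this:

```python
# Rough lane arithmetic for a combined Bis-seq plus oxBS/TAB-seq
# experiment; all inputs are illustrative assumptions, not quotes.
genome_size_gb = 3.2        # human genome
coverage_per_library = 50   # 50x for each of the two libraries
lane_yield_gb = 30          # very roughly one HiSeq lane in 2012

total_gb = genome_size_gb * coverage_per_library * 2  # Bis-seq + oxBS or TAB
lanes = total_gb / lane_yield_gb
print(f"{total_gb:.0f}Gb of sequence, about {lanes:.0f} lanes per sample")
# -> 320Gb, ~11 lanes: an expensive experiment for a single sample
```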

oxBS-seq: The University of Cambridge’s method prevents 5hmC from being protected in a normal bisulfite conversion. 5hmC is chemically oxidised (to 5-formylcytosine) such that upon bisulfite treatment these bases are converted to uracil, so only 5mC is read as C. See Quantitative sequencing of 5-methylcytosine and 5-hydroxymethylcytosine at single-base resolution.

TAB-seq: The University of Chicago’s method differs by protecting 5hmC from conversion. 5hmC is glucosylated using beta-glucosyltransferase, which protects it from oxidation, whilst 5mC bases are enzymatically oxidised such that during bisulfite treatment they behave as if they were unmethylated and are converted to uracil, so only 5hmC is read as C. See Base-resolution analysis of 5-hydroxymethylcytosine in the mammalian genome.
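To make the logic concrete, here is a minimal sketch (with hypothetical read counts, not either group's published pipeline) of how calls from the two libraries could be combined at a single cytosine. In Bis-seq both 5mC and 5hmC read as C, while in oxBS-seq only 5mC does, so the difference between the two C fractions estimates 5hmC:

```python
def modification_fractions(bs_c, bs_t, oxbs_c, oxbs_t):
    """Estimate 5mC and 5hmC levels at one cytosine from read counts.

    bs_c/bs_t: C and T counts from the Bis-seq library (C = 5mC + 5hmC).
    oxbs_c/oxbs_t: C and T counts from the oxBS-seq library (C = 5mC only).
    """
    bs_frac = bs_c / (bs_c + bs_t)          # 5mC + 5hmC
    oxbs_frac = oxbs_c / (oxbs_c + oxbs_t)  # 5mC only
    hmc = max(bs_frac - oxbs_frac, 0.0)     # sampling noise can push this negative
    return oxbs_frac, hmc

# e.g. 45 C / 5 T in Bis-seq and 35 C / 15 T in oxBS-seq at 50x depth
mc, hmc = modification_fractions(45, 5, 35, 15)
print(f"5mC ~{mc:.0%}, 5hmC ~{hmc:.0%}")  # -> 5mC ~70%, 5hmC ~20%
```

For TAB-seq the subtraction runs the other way: its C calls measure 5hmC directly, and subtracting them from the Bis-seq C fraction gives 5mC.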

How to make the methods accessible to all: An area where the two methods may have an impact more quickly is capture-based studies in cell lines where material is not limiting. It would be possible to create a PCR-free library and perform capture for an exome (or regions of equivalent genomic size), then follow this capture with bisulfite conversion and sequencing. This MethCap-seq (BisCap, oxBS-Cap or TAB-Cap) approach could allow larger sample numbers to be run at the depth required without being too expensive.

We have been testing a Nextera-based capture prep in my lab that could potentially allow this to be done on very small inputs; however, the method is not yet released and requires quite a bit of amplification.

Another alternative would be to combine the methods with the recent enhanced reduced representation bisulfite sequencing (ERRBS) protocol published by Maria E. Figueroa’s group in PLoS Genetics.

Wednesday 27 June 2012

Has Qiagen bought the Aldi* or Wal-Mart* of DNA sequencing companies?

This update was added on 11 July 2012: An article on GenomeWeb made interesting reading, as it discussed the current litigation between Columbia University and Illumina. In it Thomas Theuringer (Qiagen PR director) said "We support Columbia's litigation against Illumina, and are also entitled to royalties if we prevail". Maybe the royalties will be worth the reported $50M they paid?


Qiagen has just purchased Intelligent Bio-Systems and is, for now, the new kid on the NGS block. But what does IBS offer, what are the Max-Seq and Mini-20, can they compete against the monster that is Illumina, and who owns the rights to SBS chemistry?

The Max and Mini instruments have been on my radar for a year or so now, and when I first heard about them I really did not think they would compete against Illumina. I think only one has been installed in Europe.

The instruments use technology from the noughties, with maximum read lengths of 55bp and around a third of the yield of a HiSeq. If the instrument and running costs are cheap enough users might be tempted; however, a cheap alternative to a gold standard often turns out to be a disappointment.

If the title reads a little harsh, I’ll explain some of my thinking below.

*Aldi and Wal-Mart have a reputation in the UK for cheaper, lower-quality products; they offer an alternative to supermarkets like Sainsbury's. You should not make any NGS purchasing decisions based on a personal preference for where I like to do my weekly shop!

What did Intelligent BioSystems offer Qiagen, and is there space in the NGS market for another player: Qiagen wants to get into the diagnostic sequencing space. However, it is not clear whether a Qiagen sequencer can compete against Roche, Life and Illumina in the research or clinical space. Clinically, Roche already have a strong diagnostics division, even if their 454 technology appears to be suffering from the strong competition of MiSeq and PGM. Illumina have thrown their weight behind clinical development and have a reputation for investing in R&D, so expect them to become a major player. Life have a reputation for delivering products that work (let's ignore SOLiD), and the continued development of PGM and now Proton has got to keep them in a strong position.

Can Qiagen take any market share: The instrument will have to work robustly, deliver high-quality data, be competitive on cost and include bioinformatics solutions. It is not clear what will differentiate the Qiagen instrument from others already widely adopted.

Although I think it is going to be tough for Qiagen, it is not impossible; and if there is a $25B diagnostics market (UnitedHealth Group’s personalized medicine paper suggests that the molecular diagnostics market could grow to $25B by 2021), even a 5% share equates to $1.25B!

What will the Qiagen sequencers look like:
The sequencers offered by IBS make use of sequencing-by-synthesis technology licensed from Columbia University. This technology was published in 2006 by Jingyue Ju in Nicholas Turro’s chemical engineering group at Columbia. Their PNAS paper describes a system that will be familiar to anyone using Illumina’s SBS: 3′-O-allyl-dNTPs carrying allyl-linked fluorophores are incorporated by DNA polymerase, imaged, and then cleaved by palladium-catalysed deallylation ready for the next cycle. They sequenced 13bp in the 2006 paper, at almost the same time as Illumina were buying Solexa for $600M.

Perhaps the most intriguing aspect of the technology is that the flowcells are reusable! Sounds great, but will clinical labs see this as a benefit over disposable consumables? I think not, as there is too much risk that a sample will become contaminated. The same goes for removing the need for barcoding on the Mini-20 by using a 20-sample carousel. Barcoding is useful in labs simply for sample tracking, even if you end up doing a run in a single lane or flowcell.

Max-Seq: In the brochure they claim that “thousands of genomes have been sequenced utilizing 2nd generation technologies, such as the MAX-Seq”. I would argue that tens of thousands of genomes have been sequenced, but I am not aware of a single Max-Seq genome to date.

Library prep uses emulsion-PCR bead-based or DNA nanoball (rolony) methods. Libraries are loaded onto a flowcell and sequenced with SBS chemistry. The instrument has dual flowcells, and each will generate 100M reads per lane in single- or paired-end format at 35 or 55bp read length, with >80% of bases at Q30 or higher.

The Azco website suggests that the Max-Seq is thousands of dollars cheaper than a SOLiD or HiSeq instrument, and that run costs are 35% cheaper than Illumina or ABI. But if you only get 25-50% of the data, the system costs more like twice as much per gigabase, for lower-quality data and much shorter reads.
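The arithmetic behind that claim, as a quick sketch (run costs in arbitrary units; the 35% discount and 25-50% yield figures are the only inputs taken from above):

```python
# Relative cost per gigabase: a cheaper run is a false economy
# if the yield drops further than the price does.
illumina_run_cost = 100.0                    # arbitrary units
maxseq_run_cost = illumina_run_cost * 0.65   # "35% cheaper" per run

for yield_fraction in (0.25, 0.50):
    relative = (maxseq_run_cost / yield_fraction) / illumina_run_cost
    print(f"at {yield_fraction:.0%} of the yield: {relative:.1f}x the cost per Gb")
# -> 2.6x at 25% yield and 1.3x at 50%, i.e. roughly twice as expensive
```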

Mini-20: I am not certain the Mini-20 actually exists yet. The brochure on the Azco website has a picture of the Max-Seq with a line drawing of a flowcell carousel. The carousel should allow loading of up to 20 flowcells and running of up to 10 samples per day (SE35bp). A flowcell will generate 35M reads in single- or paired-end format at 35 or 55bp read length, with >80% of bases at Q30 or higher, and 4Gb per flowcell.

Cost per run is predicted to be about $300 per flowcell. However, it is not clear what the price would be if you wished to dispose of the reusable flowcells after a single use.
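As a quick sanity check on those quoted figures (the read modes are inferred from the brochure numbers above):

```python
# 35M reads per flowcell at the advertised read lengths
reads = 35e6
for mode, bases_per_read in (("SE35", 35), ("SE55", 55), ("PE55", 2 * 55)):
    gb = reads * bases_per_read / 1e9
    print(f"{mode}: {gb:.2f}Gb per flowcell")
# -> SE35: 1.23Gb, SE55: 1.93Gb, PE55: 3.85Gb
# Only the paired-end 55bp mode gets close to the quoted 4Gb; the SE35
# mode behind the 10-samples-per-day claim yields nearer 1.2Gb.
```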

Those numbers do not add up to me but they must have made sense to Qiagen.

Friday 22 June 2012

Improving small and miRNA NGS analysis or an introduction to HDsRNA-seq

Small RNA biases have been very well interrogated in a series of papers released in the last 12 months. RNA ligation has been shown to be the major source of bias, and the articles discussed in this post offer some simple fixes to current protocols that should allow even better detection, quantification and discovery in your experiments.

Small RNA plays an important regulatory role, and this has been revealed by almost every method that can be used to measure RNA abundance: northern blotting, real-time qPCR, microarrays and, more recently, next-generation sequencing. These methods do not agree particularly well with each other, and the most likely culprit is the technical bias of the different platforms.

Even though it has its own biases, small RNA sequencing appears to be the best method available for several reasons: it does not rely on probe design and hybridisation, it can discriminate amongst members of the same microRNA family, and it allows detection, quantification and discovery in the same experiment (Linsen et al ref).

Improving small RNA sequencing: As NGS has been adopted for small RNA analysis, focus has appropriately been placed on the biases in library preparation. Nearly all library prep methods ligate RNA adapters to the 3’ and 5’ ends of small RNAs using T4 RNA ligases, before reverse transcription from the 3’ adapter and amplification by PCR. However, RNA ligase has strong sequence preferences, and unless addressed these lead to bias in the final results of sequencing experiments.

All four of the papers below show major reductions in bias for small RNA sequencing protocols.

I particularly like the experiments performed in the Silence paper using a degenerate 21-mer RNA oligonucleotide. Briefly, the theory is that a 21-mer degenerate oligo has trillions of possible sequence combinations, so in a standard sequencing run each sequence should appear no more than once, as only a few million sequences are read. The results from a standard Illumina prep showed strong biases for some sequences that were significantly different from the expected Poisson distribution, with almost 60,000 sequences found more than 10 times instead of once as expected (the red line in figure A from their paper, reproduced below). When they used adapters with four degenerate bases added to the 5′ end of the 3′ adapter and to the 3′ end of the 5′ adapter, they achieved results much closer to those expected (blue line).


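The expected duplicate rate is easy to sanity-check. A minimal sketch of the Poisson argument (the run size of five million reads is a hypothetical round number):

```python
import math

k = 4 ** 21        # ~4.4 trillion possible 21-mer sequences
n = 5_000_000      # reads in a hypothetical sequencing run
lam = n / k        # mean number of times each sequence is drawn

# P(a given sequence is seen 2+ times) under Poisson sampling
p_ge_2 = 1 - math.exp(-lam) - lam * math.exp(-lam)
print(f"expected sequences seen twice or more: {k * p_ge_2:.1f}")
# -> ~3, so the ~60,000 sequences seen more than 10 times in the
#    standard prep can only come from ligation bias, not chance
```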
I don’t think we should be too worried about differential expression studies: as long as the comparisons used the same methods for both groups, the results we have are probably true. However, we may well have missed many small RNAs because of the bias, and our understanding of the biology is likely to be enhanced by these improved protocols.


The recent papers:

Jayaprakash et al NAR 2011: showed that RNA ligases have “significant sequence specificity” and that “the profiles of small RNAs are strongly dependent on the adapters used for sample preparation”. They strongly suggest modifying current protocols for small RNA library prep to use a mix of adapters: "the pooled-adapter strategy developed here provides a means to overcome issues of bias, and generate more accurate small RNA profiles."

Sun et al RNA 2011: "adaptor pooling could be an easy work-around solution to reveal the “true” small RNAome."

Zhuang et al NAR 2012: showed that the biases of T4 RNA ligases are not simply sequence preferences but are also affected by structural features of the RNAs and adapters. They suggested that "using adapters with randomized regions results in higher ligation efficiency and reduced ligation bias".

Sorefan et al Silence 2012: demonstrated that secondary structure preferences of RNA ligase affect cloning and NGS library prep of small RNAs. They present “a high definition (HD) protocol that reduces the RNA ligase-dependent cloning bias” and suggest that “previous small RNA profiling experiments should be re-evaluated” as “new microRNAs are likely to be found, which were selected against by existing adapters”, a powerful if worrying argument.

Monday 18 June 2012

Even easier box plots and pretty easy stats help uncover a three-fold increase in Illumina PhiX error rate!

One of the things I wanted to do with this blog was share things that make my job easier. A job I often have to do is communicate numbers quickly and effectively, and a box plot can really help. I also have the same kind of trouble most people face with statistics: I find it hard! In this post I will discuss the Prism package from GraphPad, which allows you to use stats confidently and make lovely plots (although annotating them is a nightmare). Recently the statisticians in our Bioinformatics core gave a short course in using GraphPad Prism, so I thought I'd explain the box plot in a little more detail and tell you a bit about GraphPad.

Previously I showed how to create a box plot using Excel. I went down this route because I did not have time to learn a new package and Excel is available almost everywhere. However, the result is less than perfect and it is hard work; indeed, one major reason for writing the previous blog post was so I had somewhere to go next time I needed to create a plot! Statisticians like box plots as they get across a lot more than just the mean, and can also say something about the size of the population being investigated.

Explaining box plots: A box plot is a graphical representation of some descriptive statistics: generally the median and interquartile range, or the mean with standard deviation or standard error. A dot box plot is a version that shows these figures along with all the data points, which allows the size of the sample to be clearly seen. This helps enormously when comparing sample groups and deciding whether a change in mean is statistically significant or not.
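If you want to try the idea without Prism, a dot box plot is only a few lines in Python (a minimal sketch; the two groups are made-up data, sized like those in the figures below):

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
# two hypothetical groups with similar means but very different sizes
groups = [rng.normal(10, 2, 12), rng.normal(10.5, 2, 120)]

fig, ax = plt.subplots()
ax.boxplot(groups, labels=["group 1", "group 2"])
for i, g in enumerate(groups, start=1):
    x = rng.normal(i, 0.05, len(g))   # jitter so overlapping dots stay visible
    ax.plot(x, g, "o", alpha=0.4)
ax.set_ylabel("measurement")
plt.show()
```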

Dot box plots rule!

In figure 1, very similar means and standard errors are plotted, with each dot representing a sample. See how the removal of some data does not significantly affect the "results", but seeing the data points allows you to make a call on how much you are willing to trust them.


Figure 1

In figure 2 you can clearly see that there appear to be some "outliers" in group 2, but these have no effect on the results as the number of measurements is so high compared to group 1. Deciding whether any "outliers" are present in group 1 is much harder, as the number of samples is so much lower. Removing outliers is a really hard call, and our statisticians generally advise against it.

Figure 2

GraphPad Prism: The software costs about $300 for a personal license. That might be a lot when budgets are tight, but an academic license is not so expensive when shared across a department or institute. I’d certainly encourage you to take the plunge. It very quickly allows you to produce plots like the ones in this post and run standard statistical tests, plus a whole lot more I won't go into. Take a look at their product tour if you want to find out more.

PhiX error rates: I used GraphPad to investigate an issue I had suspected for a while. We have been seeing a bias in the error rate on Illumina sequencing flowcells, where lane one appears higher than the other lanes. Whilst the absolute numbers are not terrible and all lanes pass our QC, there may be a real impact on results if this is not taken into account; when calling mutations from single-lane samples and comparing tumour to normal, for instance.

I took one month's GAIIx data (8 flowcells) and plotted the error rate for each lane. Entering the data into GraphPad is the most annoying bit, and I usually copy and paste from Excel. However, generating the statistics and plots (figure 3) took about three minutes from start to finish.

A one-way ANOVA with a Bonferroni correction showed how significant the differences were, with a very significant difference between lanes 1 & 2 and the rest. In fact there appears to be more of a gradient across the flowcell, as lane 2 is also affected, but to a lesser degree than lane 1.

A two-way ANOVA allowed me to determine that in this dataset lane accounted for 80% of the variance and instrument only 2.5%.
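The same one-way test is easy to reproduce outside GraphPad. A minimal sketch in Python (the per-lane error rates are invented placeholders, not our real QC numbers):

```python
import numpy as np
from scipy import stats

# hypothetical per-flowcell PhiX error rates (%) across 8 flowcells
lane1 = np.array([0.95, 1.10, 0.90, 1.05, 0.98, 1.12, 0.93, 1.02])
lane2 = np.array([0.55, 0.60, 0.52, 0.58, 0.61, 0.57, 0.54, 0.59])
lane3 = np.array([0.32, 0.35, 0.30, 0.33, 0.31, 0.36, 0.34, 0.29])

f, p = stats.f_oneway(lane1, lane2, lane3)
print(f"one-way ANOVA: F = {f:.1f}, p = {p:.2e}")

# post-hoc pairwise t-tests with a Bonferroni correction (3 comparisons)
for name, a, b in (("1 vs 2", lane1, lane2),
                   ("1 vs 3", lane1, lane3),
                   ("2 vs 3", lane2, lane3)):
    t, pv = stats.ttest_ind(a, b)
    print(f"lanes {name}: corrected p = {min(pv * 3, 1.0):.2e}")
```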

Figure 3
The biggest headache with GraphPad is the woefully inadequate annotation of graphs. Quite simply, you will have to get an image out of the software and into Illustrator or PowerPoint. I guess if they are making the stats easy we should not complain too much.

I am using GraphPad on a weekly basis and for most reports where I have to summarise larger datasets. Why don't you give it a try?

PS: I'll let you know what Illumina say about the error rates. Please tell me if you've seen anything similar.