Saturday, March 30, 2013

Bacterial genome annotation systems

Introduction

You get a bacterial isolate. You sequence it. You manage to get some contigs after mucking around with some de novo assembly software. Now what? Annotation of course! Your FASTA file is teeming with lifeless chunks of bacterial DNA yearning to be adorned with insightfully labelled features, so it can get some more attention from you, and maybe even be reunited with some old friends in Genbank/ENA. If this sounds familiar, then this blog post is for you.

What is genome annotation?

Genome annotation is the process of identifying features of interest on a genome sequence. Some of the   features relevant to bacterial genomes are protein coding genes, non-coding RNAs, and operons. Features can have all sorts of useful information associated with them in addition to their genomic location and feature type. For example, a protein-coding gene annotation could include items such as the predicted protein product, whether it has a signal peptide, a gene abbreviation and an enzyme classification number. The accuracy and richness of a genome annotation is important, and sometimes critical, to downstream biological interpretation.

In the old days, a basic ORF finder would be run over the contigs. Then the truly dedicated curators would comb over the ORFs, trim back to good looking start codon sites, delete spurious looking ORFs, and so on. Later gene predictor software and BLASTX helped bootstrap this process further. Now there are various "automatic annotation" systems which do a reasonably good job. Manual refinement of the automatic annotation can then be done using curation applications.

Below I list the tools I am aware of for performing and curating bacterial genome annotation. If I've missed any please let me know and I will add them.

Web submission systems


Standalone systems


Curation systems

  • Manatee (web interface + database backend)
  • Artemis (local app, can be combined with Chado SQL backend)
  • Apollo (local app, can be combined with Chado SQL backend)
  • Wasabi (my old non-public awful CGI/Perl/make mess that we still use internally)

Conclusion

Beware of systems claiming to do "microbial" annotation. Most of them are only designed for annotating bacteria. They will perform poorly on viruses, fungi and other microbes.




Sunday, November 11, 2012

Tools to merge overlapping paired-end reads

Introduction

In very simple terms, current sequencing technology begins by breaking up long pieces of DNA into lots more short pieces of DNA. The resultant set of DNA is called a "library" and the short pieces are called "fragments". Each of  the fragments in the library are then sequenced individually and in parallel. There are two ways of sequencing a fragment - either just from one end, or from both ends of a fragment. If only one end is sequenced, you get a single read. If your technology can sequence both ends, you get a "pair" of reads for each fragment. These "paired-end" reads are standard practice on Illumina instruments like the GAIIx, HiSeq and MiSeq.

Now, for single-end reads, you need to make sure your read length (L) is shorter than your fragment length (F) or otherwise the sequence will run out of DNA to read! Typical Illumina fragment libraries would use F ~ 450bp but this is variable. For paired-end reads, you want to make sure that F is long enough to fit two reads. This means you need F to be at least 2L. As L=100 or 150bp these days for most people, using F~450bp is fine, there is a still a safety margin in the middle.

However, some things have changed in the Illumina ecosystem this year. Firstly, read lengths are now moving to >150bp on the HiSeq (and have already been on the GAIIx), and to >250bp on the MiSeq, with possibilities of longer ones coming soon! This means that the standard library size F~450bp has become too small, and paired end reads will overlap. Secondly, the new enyzmatic Nextera library preparation system produces a wide spread of F sizes compared to the previous TruSeq system. With Nextera, we see F ranging from 100bp to 900bp in the same library. So some reads will overlap, and others won't. It's starting to get messy.

The whole point of paired-end reads is to get the benefit of longer reads without actually being able to sequence reads that long. A paired-end read (two reads of length L) from a fragment of length F, is a bit like a single-read of length F, except a bunch of bases in the middle of it are unknown, and how many of them there are is only roughly known (as libraries are only nominally of length F, each read will vary). This gives the reads a longer context, and this particularly helps in de novo assembly and in aligning more reads unambiguously to a reference genome. However, many software tools will get confused if you give them overlapping pairs, and if we could overlap them and turn them into longer single-end reads, many tools will produce better results, and faster.


The tools

Here is a list of tools which can do the overlapping procedure. I am NOT going to review them all here. I've used one tool (FLASH) to overlap some MiSeq 2x150 PE reads, and then assembled them using Velvet, and the merged reads produced a "better" assembly than with the paired reads. But that's it. I write this post to inform people of the problem, and to collate all the tools in one place to save others effort. Enjoy!



Features to look for

  • Keeps original IDs in merged reads
  • Outputs the un-overlapped paired reads
  • Ability to strip adaptors first
  • Rescores the Phred qualities across the overlapped region
  • Parameters to control the overlap sensitivity
  • Handle .gz and .bz2 compressed files
  • Multi-threading support
  • Written in C/C++ (faster compiled) rather than Python/Perl (slower)

Tuesday, November 6, 2012

Sorting FASTQ files by sequence length


The problem


The FASTQ file format is used to store short sequence reads, most commonly those from Illumina. The reads are typically between 36 and 250 bp long, and typical data sets contain between 1M to 1B reads. Each entry in a FASTQ file contains three pieces of information: the sequence ID, the DNA sequence, and a quality string. I recently read through the documentation for the khmer package, and noticed a comment stating that digital normalisation would work better if the FASTQ reads were in order from longest to shortest.

I googled for some FASTQ sorting information. I found some posts on sorting by the sequence itself to remove duplicates (Justin's blog), and some posts on sorting on ID to help with finding displaced pairs (BioStar), but nothing on length sorting. There were some posts related to sorting FASTA files (Joe Fass) but it read the whole file into RAM and didn't use a Schwartzian transform.

Some solutions

Our first attempt was to use the paste trick I blogged about previously, where in.fq contains about 10M clipped reads with lengths between 24 and 100 bp:

cat in.fq 
| paste - - - - 
| perl -ne '@x=split m/\t/; unshift @x, length($x[1]); print join "\t",@x;'  
| sort -n 
| cut -f2- 
| tr "\t" "\n"
> out.fq

This turns a 4-line FASTQ entry into a single tab separated line, adds a column with the length of each read, passes it to Unix sort, removes the length column, and converts it back into a FASTQ file. And it works faster than I thought it would. About 2 minutes for a 10M read file. The whole process is I/O bound, so the faster your disks the better. The sort usually uses /tmp so if that is on a different disk unit  ("spindle" for my more mature readers) to your input and/or output files, you'll get better performance. There are various chunk parameters etc you can change on sort, but it probably doesn't make too much difference.

Unix sort breaks the input into chunks, sorts each of those, then merges all the chunks in the output (merge sort). Another possibility which avoids explicit sorting is to read through the input, and put reads into separate files based on their read length (a simplistic radix sort). We can do this because we have a finite number of read lengths. The final output simply concatenates all the bin files in order. Here is an implementation in Perl, which is about 2x faster on average:

#!/usr/bin/env perl
use strict;
use warnings;
use Fatal;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP=>1);
my @line;
my @fh;

while ( ! eof () ) {
  $line[$_] = scalar(<>) for (0..3);
  my $len = length($line[1]) - 1;
  if ( not defined $fh[$len] ) {
    open $fh[$len], '>', "$dir/$len";
  }
  print {$fh[$len]} @line;
}

for my $len (0 .. $#fh) {
  next unless defined $fh[$len];
  system("cat", "$dir/$len");
}

Results

Here are some timing measurements across three files:

10 million reads
Bash real 1m42.407s user 1m57.863s sys 0m52.479s
Perl real 0m51.698s user 0m35.626s sys 0m9.897s
20 million reads
Bash real 3m58.520s user 4m2.155s  sys 1m58.167s
Perl real  1m39.824s user 1m12.765s sys 0m20.409s
30 million reads
Bash real 5m53.346s user 6m16.736s sys 2m51.483s
Perl real  2m23.200s user 1m46.815s sys 0m30.754s

On face value, it appears the shell version ("Bash") is behaving in an O(N.logN) fashion, whereas the Perl implementation seems to be O(N) linear, which is somewhat expected given it reads all the data, writes all the data, then writes it all again. More testing and thinking is needed.

Conclusion

Sorting huge FASTQ files is feasible. If applications like diginorm become mainstream and sorting clipped FASTQ files make it work better, than this is handy to know. Nothing groundbreaking here, and I may have repeated other people's work, but I hope it is useful to somebody.








Saturday, October 20, 2012

The joy of academic conference spam


Once you get your names on a few papers, the influx of  academic spam increases. I probably get at least two of these spams a day in my inbox (not counting the ones that are diverted by Google's spam filter). I suspect those academics higher up the chain get even more. The most common ones are requests for conference attendance, session chairing, and reviewing; closely followed by international students and graduates looking for jobs or Ph.D. placements. They are nearly always totally unrelated to what I work on (computational genomics).

I got a classic example today:

To: torsten.seemann@monash.spam.edu
Dear Torsten Seeman: 

Umm, you spelled my surname incorrectly. Sorry for misleading you with the correct spelling in my email address, websites, blog, and Twitter account.

The four sponsoring societies of DDW invite you to submit an abstract for presentation at DDW 2013 Scientific Sessions, to be held Saturday, May 18-Tuesday, May 21, 2013 in Orlando, FL. The DDW 2013 online abstract submission site is now open and will close on Saturday, December 1, 2012 at 9 p.m. ET. 



So it's in "Orlando, FL". Because I am Australian, I do have some idea of what else is in the world, and can deduce it is in the USA. But you are ignorant if you think the whole of the world knows all your two-letter USPS state codes! eg. CA = California or Canada? And unlike you, I know Canada is a country and not just a French speaking state of the USA.


<snip>DDW 2013 Abstract Submission Site: 
http://ddw2013.abstractcentral.com 
Poster sessions and DDW programming will start on Saturday, May 18, 2013. If your abstract is accepted, it may be scheduled for presentation on Saturday, Sunday, Monday or Tuesday. 
Sincerely, 
DDW Administration 
DDWprogram@gastro.org



Hmm, you still haven't explained what the hell "DDW" is. Your email domain is giving me a hint, but that was the last line of the email. OK, I'll click on the link to find out - I can't help myself. "Digestive Disease Week" eh? Thanks for giving me the shits.

[Report Spam] clicked.



Monday, October 8, 2012

Building a bioinformatics server on a budget in 2012


Introduction

As more labs get into the "next generation sequencing" game, they are finding themselves unwittingly entering into the "high performance computing" game too. Some lucky labs will be able to access a local or national HPC service, or collaborate with those who do, or pay for access to one like Amazon EC2. But ultimately any bioinformatician will want (and need) a "general server" to experiment, develop, and test on. This is particularly true in terms of large, fast storage which is usually in short supply or costly in HPC environments.

Details


With the small lab in mind, here I outline how, for about A$2500/US$2600/2000), you can build your own server with 6/12 cores, 64GB RAM, and 12TB RAID storage from affordable commodity parts:

  • CPU
    • Intel CORE i7 3930K - LGA 2011 
    • $580
    • The Intel CPUs are about 40% faster for the same clock rate than the equivalent AMD. This is a 3.2 Ghz 6 core CPU (12 threads) and can go to 3.8 Ghz when only using some of the cores. It has good RAM bandwidth too.
  • CPU Cooler
  • Motherboard
    • Gigabyte GA-X79-UP4 Intel Mainboard - LGA 2011 
    • $269
    • This motherboard was chosen as it takes the new LGA2011 CPU, has 8 DIMM slots, and 8 SATA ports, including 6.0Gbsp SATA3 ones which we will want for the SSD boot disk described later. We will save money by using Linux software RAID.
  • RAM
  • Root disk
    • OCZ Agility 3 Series 120GB 2.5" SSD Sata 6Gbs 
    • $170 (2 drives)
    • For the boot/root disk, we will use 120GB SSD instead of mechanical hard disk, for speed. Two disks in RAID1 (mirror) configuration. This would store the operating system, and any important databases eg. BLAST indices, SQL etc.
  • Data disk
    • Seagate Barracuda 3TB 
    • $924 (6 drives)
    • In bioinformatics you can never have enough disk space! Here I suggest using six 3TB 7200rpm drives in RAID6 arrangement for a total of 12TB storage. Reading will be fast, but writing a bit slower. We need to compromise on a budget.
  • Case
    • Antec P280 
    • $135
    • To hold all this stuff, this case is great value. It has good ventilation, which is crucial for keeping all the disks and the CPU cool.
  • Power supply
    • Antec TruePower New Series 650W 
    • $120
    • The case doesn't come with a power supply. I've chosen this one because it has 9 SATA power connectors for the disks, and some unneeded cable sets can be removed in a modular fashion.
  • Operating system
    • Ubuntu Server
    • $0
    • Ubuntu Server edition is pre-tuned for server settings as opposed to interactive desktop use. Because it is Debian based, you have access to far more packaged bioinformatics applications than Centos, including those in the BioLinux project.
  • Miscellaneous
    • Extra 120mm cooling fans (for the hard disk zones)
    • Extra SATA3 cables (ones with locking clips)
    • CPU thermal paste (the grey stuff)
    • PCIe graphics card (for the screen, any old one will do)


Conclusion

For about A$2500/US$2600/2000 and a little bit of Linux know-how, you can build your own "high-end" server. It is missing some of the features of commercial servers from Dell etc (eg. hardware RAID controller, IPMI remote management, hot-swap disks) but it is much more affordable.


Friday, September 21, 2012

Problematic FASTQ output from Ion TorrentSuite 3.0

Yesterday we got an email from a Nesoni user who said that the SHRiMP aligner was failing on his FASTQ data. Then again today we got another similar report from our trusted colleague Tim Stinear with the same issue. The evidence was mounting for a bug either in Nesoni or in the FASTQ file, rather than user error. Tim had recently upgraded his PGM Sequencer to Torrent Suite v3.0 (point zero release alarm bells!), and Nesoni saved the shrimp_log.txt file and it contained this:


note: quality value format set to PHRED+33
done r/hr r/core-hr
The qv-offset might be set incorrectly! Currenty qvs are interpreted as PHRED+33 and a qv of 62 was observed. To disable this error, etiher set the offset correctly or disable this check (see README).

Wow! The PGM has improved dramatically, calling some bases at Q62, that's better than 1 in 1 million error rate... Here's a breakdown of the Q values in the file:

Symbol ASCII Q+33 Frequency
+ 43 10 1105774
, 44 11 347753
- 45 12 1167099
. 46 13 673276
/ 47 14 137220
0 48 15 225893
1 49 16 1621714
2 50 17 1731775
3 51 18 2726736
4 52 19 4280447
5 53 20 4272951
6 54 21 2556953
7 55 22 7783535
8 56 23 5153631
9 57 24 2362016
: 58 25 2406869
; 59 26 2517450
< 60 27 5762153
= 61 28 4334469
> 62 29 7109066
? 63 30 11113780
@ 64 31 13934507
A 65 32 9227417
B 66 33 12758868
C 67 34 8228985
D 68 35 9935410
E 69 36 1459950
F 70 37 2358692
G 71 38 682190
H 72 39 158322
I 73 40 168311
J 74 41 269
K 75 42 199121
L 76 43 83457
M 77 44 4971
N 78 45 464143
O 79 46 8657
P 80 47 0
Q 81 48 0
R 82 49 0
S 83 50 0
T 84 51 0
U 85 52 0
V 86 53 0
W 87 54 0
X 88 55 0
Y 89 56 0
Z 90 57 0
[ 91 58 0
\ 92 59 0
] 93 60 0
^ 94 61 0
_ 95 62 746


Ok, there really are 746 bases with Q62. Some grep work shows me they are all occuring alone in 746 reads, all in position 1 in the read, like this:

@QWRK0:02580:02076
GGGAATCAAAACGCTGATTTTTGATGAACAGAATAACGAA
+
_8,59=<<@@6@BCCDDDFFF3FEEEFAEC@@;77-7777

The table also shows quite a few >Q41 bases, which aren't typically seen in FASTQ files. These ones are probably ok (?) but the Q62 ones surely must be some artifact or bug in the version 3.0 of Torrent Suite.
In the meantime, our solution has been this:

sed 's/^_/H/' < dodgy.fastq > fixed.fastq

Be interested to see if others have encountered this problem.


Saturday, September 8, 2012

Using Velvet with mate-pair sequences

Introduction

Illumina sequencing instruments (HiSeq, MiSeq, Genome Analyzer) can produce three main types of reads when sequencing genomic DNA:
  1. Single-end
    Each "read" is a single sequence from one end of a DNA fragment. The fragment is usually 200-800bp long, with the amount being read can be chosen between 50 and 250 bp.
  2. Paired-end
    Each "read" is two sequences (a pair) from each end of the same genomic DNA fragment (more info). The distance between the reads on the original genome sequence is equal to the length of the DNA fragment that was sequenced (usually 200-800 bp).
  3. Mate-pair:
    Like paired-end reads, each "read" is two sequences from each end of the same DNA fragment, but the DNA fragment has been engineered from a circularization process (more info) such that the distance between the reads on the original genome sequence is much longer (say 3000-10000 bp) than the proxy DNA fragment (200-800 bp).

Single-end library ("SE")

When we got the original Illumina Genome Analyzer, all it could do was 36 bp single-end reads, and each lane gave us a massive 250 Mbp, and we had to walk 7 miles through snow in the dark to get it. Ok, that last bit is clearly false as we don't get snow in Australia and we speak metric here, but the point is that there is still plenty of legacy SE data around, and SE reads are still used in RNA-Seq sometimes. Let's imagine our data was provided as a standard FASTQ file called SE.fq:

velveth Dir 31 -short -fastq SE.fq
velvetg Dir -exp_cov auto -cov_cutoff auto

I strongly recommend enabling the -exp_cov auto and -cov_cutoff auto options. They will almost always improve the quality of your assemblies.

Paired-end library ("PE")

Paired-end reads are the standard output of most Illumina sequencers these days, currently 2x100bp for the HiSeq and 2x150bp for the GAIIx and MiSeq, but they are all migrating to 2x250bp soon. The two sequences per paired read are typically distributed in two separate files, the "1" file contains all the "left" reads and the "2" file contains all the corresponding "right" reads. Let's imagine our paired-end run gave us two files in standard FASTQ format, PE_1.fq and PE_2.fq:

velveth Dir 31 -shortPaired -separate -fastq PE_1.fq PE_2.fq
velvetg Dir -exp_cov auto -cov_cutoff auto

Previously you had to interleaved the left and right files for Velvet, but we recently added support to Velvet for the -separate option which we hope is now saving time and disk space throughout the Velvetsphere!

Mate-pair library ("MP")

Mate-pair reads are extremely valuable in a de novo setting as they provide long-range information about the genome, and can help link contigs together into larger scaffolds. They have been used reliably for years on the 454 FLX platform, but used less often on the Illumina platform. I think the main reasons for this are the poorer reliability of the Illumina mate-pair protocol and the larger amount of DNA required compared to a PE library.

We can consider MP reads as the same as PE reads, but with a larger distance between them ("insert size"). But there is one technical difference due to the circularization procedure used in their preparation. PE reads are oriented "opp-in" (L=>.....<=R), whereas MP reads are oriented "opp-out" (L<=.....=>R). Velvet likes its paired reads to be in opp-in orientation, so we need to reverse-complement all our MP reads first, which I do here with the EMBOSS "revseq" tool.

revseq -sequence MP_1.fq -outseq rcMP_1.fq -notag
revseq -sequence MP_2.fq -outseq rcMP_2.fq -notag
velveth Dir 31 -shortPaired -separate -fastq rcMP_1.fq rcMP_2.fq
velvetg Dir -exp_cov auto -cov_cutoff auto -shortMatePaired yes


Early Illumina MP libraries are often contaminated with PE reads (the so-called shadow library) which are the result of imperfect selection of biotin-marked fragments in the circularization process. There is a special option in velvetg (not velveth) called -shortMatePaired added by  Sylvain Foret which informs Velvet that a paired channel is MP but may contain  PE reads, which helps it to account for them and avoid mis-scaffolding. I recommend using this no matter how pure you think your MP library is.

Combining them all! (SE + PE + MP)

When de novo assembling multiple libraries in Velvet, you should order them from smallest insert size to largest insert size. For our case, this means the SE first, then the PE, then the MP. Each library must go in its own "channel" as it comes from a differently prepared DNA library. Channels are specified in Velvet with a numerical suffix on the read-type parameter (absence means channel 0):

velveth \
  Dir 31 \
  -short -fastq SE.fq \
  -shortPaired2 -separate -fastq PE_1.fq PE_2.fq \
  -shortPaired3 -separate -fastq rcMP_1.fq rcMP_2.fq 

velvetg \
  Dir \
  -exp_cov auto -cov_cutoff auto \
  -shortMatePaired3 yes

Note that the -shortMatePaired option has been allocated to channel 3 now (the -shortPaired3 library) as that is the MP channel.

Conclusions

It's relatively to get up and running with Velvet, but when your projects become more complicated, the methods in this post should help you. But if you prefer a nice GUI to take care of most of the issues discussed here, I recommend using our Velvet GUI called VAGUE (Velvet Assembler Graphical User Environment).