Sunday, 4 October 2015

ASM NGS 2015 - Meeting Notes

These notes were taken during the 1st ASM Conference on Rapid NGS Bioinformatic Pipelines for Enhanced Molecular Epidemiologic Investigation of Pathogens, held at the Omni Shoreham Hotel in Washington DC, USA from 24-27 September 2015. The conference has the shorter nickname “ASM NGS” and used the Twitter hashtag #ASMNGS.

The notes are intended to be as objective as possible. Personal opinions or speculation are prefixed by the author’s initials below:

  • PA = Phil Ashton (Public Health England, UK) = @flashton2003
  • TS = Torsten Seemann (Uni. Melbourne, Australia) = @torstenseemann
  • FB = Fiona Brinkman (Simon Fraser University, Canada) = @fionabrinkman
  • EG = Emma Griffiths (Simon Fraser University, Canada) = @griffiemma
  • RL = Robyn Lee (McGill University, Canada) = @robyn_s_lee

Steve Musser - FDA (session chair)

Welcome remarks by Joseph Campos, Gary Procop, Eric Brown

George Weinstock - Jackson Labs

Microbial Genomics and Beyond

  • Applications of clinical microbial NGS
    • eg surge of O157:H7 outbreaks in St. Louis → salad bar at grocery chain
    • core genome O157 - 3.4 million nts (Leopold, 2009; see diagram of SNPs showing evolution)
    • eg Bacteremia in NICU (from diapers) → monitoring babies in real time with WGS, “identity” of culprits of infection found in gut microbiome, 101 spp sufficiently covered from stool samples to assess SNPs depending on tissue, sample site etc produce different spectrum of AMR genes
    • eg Daptomycin resistance via mutation, in Enterococcus
    • Daptomycin used for patients with VRE
    • mutations in GdpD, Cls, LiaF produce resistance, can profile in patients
    • eg estimating bla copies (compared to single copy MLST genes), not in plasmids but massive tandem array
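The bla copy-number point above (comparing against single-copy MLST genes) can be illustrated with simple read-depth arithmetic. This is a minimal sketch, not the speaker's method: the function name and all depth values are made up for illustration, and it assumes mean read depth scales linearly with copy number.

```python
from statistics import median

def estimate_copy_number(target_depth, single_copy_depths):
    """Estimate copy number as target depth / median depth of single-copy genes."""
    if not single_copy_depths:
        raise ValueError("need at least one single-copy reference gene")
    baseline = median(single_copy_depths)
    if baseline <= 0:
        raise ValueError("single-copy baseline depth must be positive")
    return target_depth / baseline

# Hypothetical mean read depths (illustrative numbers, not from the talk)
mlst_depths = [48.0, 52.0, 50.0, 47.0, 53.0, 51.0, 49.0]  # seven MLST loci
bla_depth = 400.0                                          # depth over the bla gene
print(f"estimated bla copies: {estimate_copy_number(bla_depth, mlst_depths):.1f}")
```

Using the median of several single-copy genes rather than one gene makes the baseline less sensitive to a single locus with unusual coverage.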

Metagenomics examples
  1. virus detection by metagenomics (febrile children study)
  2. mWGS vs RNA-Seq (looking for “Golden Microbiome” in elite athletes); mWGS=1% Methanobrevibacter (Archaea), RNA-Seq (48% Methanobrevibacter) → transcription may be more important than looking at “who’s there”
  3. 16S amplicon seq, cheaper, faster, high throughput, more comprehensive than PCR, culture
  4. Pathogen Detection - hospital acquired diarrhea, pathogen abundance in clinical samples, acne associated skin microbiome (top 10 ribotypes differ by 1-2 SNPs, wouldn’t have detected these unless seq entire 16S gene)

  • Read Length distance: PacBio, full length 16S (Fichot & Norman, Microbiome, 2013, 1:10)
  • Finding: long-read full-length amplicon sequencing for microbiome analysis works. Yay! (PacBio, Nanopore)
  • 10x coverage, 5x around circle
  • single organisms, polymorphisms in multiple copies
  • benchmarking 16S with MinION
  • HMP req’s metagenomic benchmarking
  • BLAST takes a long time to perform, rate limiting step in real time investigations, few tweaks were able to increase speed 10x (Big Iron NCSA Blue Waters, TeraGrid)

Lynn Bry - Brigham and Women’s Hospital

(Late replacement for Julian Parkhill)

  • Sequencing foodborne and MDR clinical isolates at BWH.
  • 1000 micro samples per day! >100 +ve cultures across kingdoms/phyla
  • 50% diagnosis, 10% therapy, 40% screening/surveillance MRSA, VRE, GrpB Strep, Gram-
  • Lots of metadata: MIC, disk diffusion, E-TEST, drug resistance, zone diameter, R/I/S, ESBL, D-zone CLI/ERM
  • HIPAA de-identified data - Year only, no location
  • WHONET open source to generate antibiograms in surveillance (old Windows software)
  • Crimson LIMS - prospective analysis of clinical samples & real time query
  • “Honest Broker” assigns new external IDs
  • SPAdes, QUAST, ResFinder, CARD, RAST; Mauve for extrachromosomal elements, BLAST for plasmids/transposons. Where is the resistance gene: transposon, plasmid or chromosome?
  • SNPs: bowtie2, mpileup, bcftools, custom filtering.
  • Kp CRE ST258 - found many different plasmids and transposons + point mutations - WGS revealed this detail
  • E. cloacae CRE - ampC on chrom + porin mutations , multiple mobile elements Tn4401b / Tn6901
  • Serratia marcescens CRE - SRT-2 Ampc_SME-4, AmpC and KPC-3 acquire, 3 year survey 2011-2014, 2 close events
  • Timelines: MiSeq (14 days), Bioinformatics (1 - 14 days), Epi (14 days)
  • Despite 3 week turnaround they are told it IS actionable
  • Rule out just as important as rule in
  • Mobile element analysis can refine relationship analysis
  • Curating new genomes and mobile elements takes the most time
  • Desire to use more principled methods for outbreak calling - SaTScan, Bayesian, likelihoods
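The “custom filtering” step in the SNP workflow above (bowtie2, mpileup, bcftools) was not described in the talk. As a rough sketch only, a filter over VCF text lines might look like the following. The thresholds (QUAL of at least 30, DP of at least 10) and the function names are assumptions for illustration, not the settings used at BWH.

```python
def parse_info(info_field):
    """Turn a VCF INFO string such as 'DP=35;MQ=60' into a dict."""
    info = {}
    for item in info_field.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            info[key] = value
        elif item:
            info[item] = True  # bare flag such as INDEL
    return info

def keep_snp(vcf_line, min_qual=30.0, min_depth=10):
    """Return True for a single-base SNP record passing the assumed thresholds."""
    fields = vcf_line.rstrip("\n").split("\t")
    ref, alt, qual = fields[3], fields[4], fields[5]
    info = parse_info(fields[7])
    if len(ref) != 1 or len(alt) != 1 or alt == ".":
        return False  # not a simple biallelic SNP
    if "INDEL" in info:
        return False
    if qual == "." or float(qual) < min_qual:
        return False
    return int(info.get("DP", 0)) >= min_depth

# Hypothetical records: CHROM POS ID REF ALT QUAL FILTER INFO
records = [
    "chr1\t100\t.\tA\tG\t50\t.\tDP=35;MQ=60",
    "chr1\t200\t.\tA\tG\t12\t.\tDP=35",
    "chr1\t300\t.\tAT\tA\t99\t.\tINDEL;DP=40",
    "chr1\t400\t.\tC\tT\t80\t.\tDP=4",
]
print([keep_snp(r) for r in records])
```

Real pipelines usually add further filters (mapping quality, strand bias, proximity to indels), which are left out here.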

Stephen Ostroff

Priming the Innovation Pump: FDA’s Role in Advancing and Using NGS

  • Did not use any slides.
  • Was FDA employee #1
  • WGS - gamechanger for “splitters”
  • WGS Identifies previously unknown clusters, provides surveillance/warning system
  • WGS allows rapid sharing between Ag, food, vet communities → evidence-based traceback and risk factors for identifying risks in food supply for targeted interventions
  • eg Foods Program
  • WGS → routine analysis allowing for automation
  • Provides lot of data relatively quickly
  • identifying stable genetic changes used to pinpoint contamination (genome provides 3-5 million data points/isolate; statistically robust, accurate and stable)
  • WGS has allowed cases to be solved that were unresolvable by epi investigations alone
  • WGS useful for identifying medical agents, drug discovery, druggable targets, live virus vaccine database identifying most prevalent strains
  • WGS is agnostic, don’t need to know identity of organism before sequencing → although sensitivity becomes an issue
    • eg ensure safety of blood supply with HIV detection tests and variant identification
    • eg cystic fibrosis test
  • FDA collaborates with NIST (National Institute of Standards and Technology) to develop standards, need comprehensive repositories and sharing
  • GenomeTrakr (14 states and 9 regional labs)
  • NARMS - National AMR surveillance (meat, animal slaughter, human samples) → real time monitoring of drug and disinfectant resistance determinants

Marc Allard - FDA

GenomeTrakr: A Pathogen Database to Build a Global Genomic Network for Pathogen Traceback and Outbreak Detection

ref database: pathogen detection pipeline that can inform:
  1. matching food/enviro isolates to clinical
  2. track facility contamination
  3. trace source of contamination (DB contains isolates from different geo_locs)
  4. monitor AMR, virulence, pathogenicity

  • GenomeTrakr project - originally sold as PulseNet 2.0 using WGS
  • No surprise that 4.7 Mbp gives higher resolution than a few antigen genes
  • Source tracking is key application for WGS - statistically robust, high res, stable, accurate
  • FDA - genomics mapping, link between food & env & clinical
  • CDC - which clinical case inclusion / exclusion
  • FDA/CVM - antimicrobial resistance, phenotypic predictions from genotype
  • Showed some GIS phylogeograph - made by
  • Minimal pathogen metadata

  • eg spicy tuna outbreak, Salmonella Bareilly
    • common PFGE patterns worldwide, not enough resolving power for inspectors to investigate (also sushi has many ingredients→ geo_loc details could help refine which ingredient is the culprit resulting in earlier intervention)
    • SNP phylogeny identified Scrape Tuna (Indian isolates), cluster within 2-5 SNPs, phylogeographic analysis → tips of phylo trees mapped to India, 8km between location of sequenced isolate and source of food contamination
    • need for global DB for detecting leads like this

  • eg S. Braenderup: Nut butter
    • outbreak cluster → only few SNP diffs
    • tree helped inform epi questionnaire = tool for IDing matches to drill down into number of cases faster (can point to particular foods from matching isolates)

  • as price for sequencing drops, number of isolates that can be sequenced increases
  • 2014-15 was a big year for WGS (rolled out in 2014; 2015 focus on standards, training and proficiency)
  • 250 isolates/week, detecting 24 clusters/week, subset of clusters are actionable → weekly meetings to make PH decisions based on this info
  • regular epi curve shows spike in illnesses occurs 20-48 days into outbreak, WGS will help get ahead of the epi curve to avert illness
  • minimal metadata (describing who, what, when, why) provides context, key to real-time investigations, better metadata contributes to earlier interventions (industry, growers, distributors) → identify certain suppliers with contamination, also resident vs transient pathogens (require different interventions)
  1. reduced # recalls
  2. decrease sick patients
  3. preserves brand names
  4. improved farm practices (packing/processing)

  • PFGE with poorer resolution can falsely implicate industries
  • multi-ingredient products → can tease out endemic vs globally imported ingredient
  • industry needs access to data in 1-2 weeks to be effective
  • industries can use NCBI Genome Workbench, FDA analysis software themselves so the gov’t aren’t seen as “bad guys”, industry can understand the problems themselves

Validation efforts:
  1. technical performance
  2. intralab variation, seq platform
  3. interlab
  4. bioinformatics pipeline

Frank M. Aarestrup - DTU, Denmark

GMI - Global Microbial Identifier - Dream or Future?

  • Using the Battle of Austerlitz as a metaphor for WGS as a "common language" in our "war"
  • Infectious diseases is still #1 problem - 25% of global deaths
  • Increasingly they have global epidemiology
  • Real-time surveillance can’t work without real-time data sharing
  • Much easier to teach genome sequencing than teaching Salmonella serotyping! (apart from more people who know serotyping in a typical micro lab!)
  • Need to get people to trust to share.
  • FB: I think it’s key to encourage sharing first with v minimal metadata. Get everyone comfortable. Then more metadata can be added by those more comfortable and others will follow as they see the great benefit of doing so…
  • Need to engage people more widely around the world.
  • TS: This talk needs to be taken in context of Marc Allard politely imploring DTU to put all their data in GenomeTrakr, with the implication that they are holding stuff back? FB: There is a culture of some people v worried about sharing that needs to be overcome. Hopefully now that GenomeTrakr has shown you can do it without getting sued, this will change. TS: I don’t think it is legal worries, I still think it is publication novelty fear (which is reasonable given worsening academic funding in most countries where papers are key metric)

Marianne Kjeldsen - Statens Serum Institut, Copenhagen, DENMARK

  • Three Months of Surveillance of S. Typhimurium+S. 4, 5, 12:i:- (aka monophasic typhimurium) in Denmark Based on WGS and MLVA Typing
  • Salmonella 93.8M cases, 2500 serovars, 17% are serovar Typhimurium (notifiable)
  • PA: ST36 < 2000 SNPs from ST19/ST34, hmm, would be surprised, as different clonal complex
  • SNP trees included strains that were excluded based on MLVA, but broadly concurrent.
  • Higher discrimination than MLVA, especially for the monophasic ST34 strains.

  1. WGS provided higher resolution than MLVA
  2. reliable for outbreak detection, even with single ref strain
  3. need to consider max SNP difference
  4. investigations will always need to consider epi data

Amy Gargis - CDC

Assuring the Quality of Next-Generation Sequencing in Clinical and Public Health Laboratories

  • Quality assurance, sequencing, lab developed tests, optimizing library prep per organism, DNA quality
  • One major issue is assuring quality of DNA extract
  • Have to lock down bioinfx pipelines for quality control/assurance - strong difference from bioinfx community attitudes (PA)
  • clinical setting - CLIA regulates clinical labs performing tests on patient specimens → return results, CLIA ensures accurate results
  • This is not “exciting science” but an important part of public health genomics

Deborah Moine - Nestlé

Long Reads Sequencing for Better Short Reads SNP Analysis

  • Need to detect contamination
  • When ref genome very different from sample (high SNP diffs), increases non-mappable reads → risk of false positives
  • PacBio generates 20Kb libraries requiring no amplification (less bias)
  • SMRT Cell, 250 000 nanowell → 1 DNA molecule/well
  • 10% error (random) rate, de novo assembly HGAP
  • < 15Kb is short read → used to correct longest PacBio read → get 1 contig representing whole genome
  • 1 contig 4.3 Mb, 245x coverage for Salmonella (Nestle) study
  • 20Kb library, 2SMRT cells, 1.4 Gb after filtering
  • SNP analysis using new ref genome, low number of non-mapping reads
  • Need to look at tree and SNP distance matrix
  • Ref genomes generated:
    • 21 Salmonella
    • 5 Listeria
    • 25 Cronobacter
  • SNP analysis better on full length ref than draft
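The SNP distance matrix mentioned above can be computed from an alignment of isolates against the reference. The sketch below is a generic illustration, not Nestlé's pipeline. The sequences are toy data, and skipping positions with N or gap characters is an assumed convention.

```python
from itertools import combinations

def snp_distance(seq_a, seq_b, ignore="N-"):
    """Count differing positions, skipping sites where either base is ambiguous."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a not in ignore and b not in ignore
    )

def distance_matrix(alignment):
    """Return a symmetric nested dict of pairwise SNP distances."""
    names = sorted(alignment)
    matrix = {name: {name: 0} for name in names}
    for a, b in combinations(names, 2):
        d = snp_distance(alignment[a], alignment[b])
        matrix[a][b] = d
        matrix[b][a] = d
    return matrix

# Toy alignment (hypothetical isolates)
alignment = {
    "iso1": "ACGTACGTAC",
    "iso2": "ACGTACGTAT",
    "iso3": "ACCTANGTAT",
}
for name, row in distance_matrix(alignment).items():
    print(name, [row[other] for other in sorted(alignment)])
```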

Roger Barrette; Plum Island Animal Disease Center (USDA/ APHIS)

Subtractive-hybridization for Enrichment of Non-host Nucleic Acid for Improvement of Sequence-based Detection of Pathogens

  • Rapid Ion Torrent sequencing of Flu from Swine, but 87% host DNA
  • need to decrease library bias → enrichment technique
  • Capture RNA oligonucleotide, biotinylated
  • isolate host RNA, fragments, ligation of biotinylated construct at 22°C, reverse transcription
  • RNAse treatment, pull out target cDNA (w/ negative beads)
  • Goal: decrease host, increase viral reads
  • “Background subtractive hybridization method” - decreased total reads, but higher proportion of pathogen-specific reads

  • eg Proof-of-principle → Foot & Mouth
  • 454 preps enriched vs not (by subtractive method)
  • 34% genome covered with no enrichment vs 75% with enrichment
  • need a process to increase yields and automate
  • Summary: this DNA:cDNA pulldown method works!
  • decrease library bias, increasing likelihood of agent discovery → critical for testing primary tissues   
  • Costs Less than a 454 Jr run presumably :)

Catherine Yoshida; Public Health Agency of Canada, Guelph, ON, CANADA

The Salmonella in silico Typing Resource (SISTR): Rapid Analysis of Salmonella Draft Genome Sequence Data

  • Salmonella in silico typing resource -
  • Genoserotyping, 1 day turn-around-time, high throughput (96 samples/day)
  • Non-subjective interpretation
  • O antigen (rfb cluster), somatic
  • H antigens (H1=fliC, H2=fljB), flagellar
  • SISTR can predict >2000 serovars
  • Incorporates Achtman Salmonella MLST
  • Classical MLST =7-9 genes, cgMLST=100’s to 1000’s core genes
  • SISTR cgMLST=330 genes → high assignability, low levels of “missing data”, will include international scheme when finished
  • SISTR interface → batch upload, on the fly typing, genome browser, visualization (can change min span tree according to selected metadata)
  • Under epi tab can select geographical visualization (by lat_lon or GPS co-ords) → click on node and table of metadata appears
  • Also temporal distribution of strains
  • Visualization only as strong as metadata provided
  • Does not seem to be a command line version available
  • FB: Ed Taboada said in person to me after this talk he’s interested in making a command line version available. There are some good docs at
  • cgMLST cluster is 86% correlation with its serovar (from metadata)
  • Phylogenetics can be used to make it 95% correlated
  • The last 5% due to bad or missing metadata
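To make the cgMLST “missing data” point above concrete, here is a minimal sketch of comparing two allele profiles while skipping loci that were not called in either genome. It is not SISTR's implementation. The locus names and allele numbers are hypothetical, and representing a missing call as None is an assumption.

```python
def cgmlst_distance(profile_a, profile_b, missing=None):
    """Return (differing loci, loci compared), ignoring loci missing in either profile."""
    shared = [
        locus
        for locus in profile_a
        if locus in profile_b
        and profile_a[locus] is not missing
        and profile_b[locus] is not missing
    ]
    differing = sum(1 for locus in shared if profile_a[locus] != profile_b[locus])
    return differing, len(shared)

# Hypothetical allele profiles (locus -> allele number, None = not called)
isolate_1 = {"locus1": 1, "locus2": 4, "locus3": 2, "locus4": None}
isolate_2 = {"locus1": 1, "locus2": 5, "locus3": 2, "locus4": 7}
diff, compared = cgmlst_distance(isolate_1, isolate_2)
print(f"{diff} allele differences over {compared} comparable loci")
```

Reporting the number of comparable loci next to the distance shows how much of the scheme was actually assignable for that pair.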

Philip Ashton; Public Health England

Revolutionising Public Health Reference Microbiology Using Whole Genome Sequencing: A Case Study with Salmonella

  • WGS allows for digging deep into outbreaks and research trends
  • 2500 serotypes -->99% clinical Enterica (50% Typhimurium & Enteritidis, other 50% other serotypes)
  • peak of cases in 90’s (30 000 cases /year), currently 7-8000 cases/year
  • rate of decrease of incidence slowing
  • Capacity of >3000 genomes per week with 2 miseq, 2 hiseq
  • pipeline: Kmer (18mer) → ID-->99.7% accurate subspeciation → can be used to detect contamination
  • MLST to predict serotype (for backwards compatibility) - 6887 isolates with WGS and phenotypic data -->96% match between genotype and phenotype (discrepancies due to 2 serotypes assigned to single serotype or eburst group, lab error, no ST/serotype lookup)
  • Method: short read seq typing → ST (& eburst grouping) → serotype
  • SISTR (PHAC)/SeqSero (CDC), SNP typing with most common serotypes, SnapperDB, FastQ → db eburst groups
  • eg 2014 14b outbreak (international), good traceback
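The k-mer identification step in the pipeline above can be sketched as scoring a sample by the fraction of its k-mers found in each reference. This is an assumed scoring scheme for illustration, not the PHE implementation. The sequences are toy data, reverse complements are not handled, and the example uses k=4 only so the toy sequences are long enough (the notes mention 18-mers).

```python
def kmers(sequence, k=18):
    """Return the set of all k-length substrings of a sequence."""
    sequence = sequence.upper()
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def best_reference(sample_seq, references, k=18):
    """Score each reference by the fraction of sample k-mers it contains."""
    if not references:
        raise ValueError("no references supplied")
    sample = kmers(sample_seq, k)
    scores = {}
    for name, ref_seq in references.items():
        shared = sample & kmers(ref_seq, k)
        scores[name] = len(shared) / len(sample) if sample else 0.0
    best = max(scores, key=scores.get)
    return best, scores

# Toy sequences (hypothetical)
references = {"ref_a": "ACGTACGTTGCA", "ref_b": "TTTTGGGGCCCCAAAA"}
best, scores = best_reference("ACGTACGTTG", references, k=4)
print(best, scores)
```

A sample whose k-mers split between two references with no clear winner would be one possible signal of contamination.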

Greg Armstrong; CDC (incoming head of AMD division)

The Application of Genomics to Public Health—an Epidemiologist’s Point of View

  • AMD Focus
  • Polio - used seq longer than any other PH area
  • Late 2013 → seq every isolate available → world eradication program in full effect
  • Polio thought to be endemic in Afghanistan → seq showed isolates from Pakistan with sustained transmission
  • Seq in early 2014 showed isolates are all same in South Asia → intensified surveillance and immunization in southern Afghanistan

  • Ebola - little asymptomatic infection so transmission chains are more obvious
  • Guinea → consensus seq uploaded to → married to metadata (on MicroReact) → useful for epi’s

  • Listeria outbreak analysis: Normally the trees match PFGE, but showed a case where the PFGE didn’t match. Described how the key to genomic epi is having both the genomic data AND good epi (which in this case revealed that the outbreak was associated with caramel apples)

  • Showed amusing Mycobacterium tuberculosis “tree” with no branches resulting from  conventional genotyping with MIRU-VNTR (i.e. identical isolates): ….then showed tree illustrating how isolates could be differentiated by WGS

  • HIV transmission
    • contact tracing based on epi data
    • attribution table -25% social contact, integrating WGS and epi = >80% injection drug users

  • inferred HepC-V transmission

  • Pertussis incidence increasing for 30yrs (acellular vaccine since ‘90’s)
  • Refer to posters for pertussis outbreak analysis. Huge increase in pertussis lately, primarily in California (California outbreak 2010)

  • Influenza pipeline w/ NGS → faster, cheaper, more samples, more data, better data
  • impacts vaccine dev (informs what strains to build vaccine against based on typing from previous season)

  • Pneumococcal pipeline w/ NGS → more PH data, more easily exportable, less prone to human error

  • Mentions the need for bioinformaticians/bioinformaticists at CDC. FB: Good to encourage students to try work terms, scholarships/fellowships, at such public health agencies if they are interested in such positions. Many full time positions acquired after working in a public health agency temporarily as part of a work term/trainee position.

  1. data integration is an issue, usually diff data streams, need to integrate with external partners
  2. culture-independent diagnostic tests impacting ability to get isolates

Asking which pipeline is best is like asking “how big is a piece of string?”

Stefan Niemann; Research Center Borstel, Borstel, Germany

Tracing Evolution and Spread of Mycobacterium tuberculosis Strains in Times of Antibiotic Treatment

  • 90% of MDR-TB patients are not treated successfully
  • Former Soviet Union is a hot-bed of MDR-TB
  • Initially felt MDR-TB not easily transmitted due to decreased fitness associated with rpoB mutations
  • Conflicting information associated Beijing sub-lineage with MDR - did 24-locus MIRU with 4987 isolates, from 99 countries, WGS on subset of 110 isolates → the associated publication:
  • First streptomycin mutations ~1970, when first treatment given - resistance mutations appeared way before DOTS initiated
  • MDR outbreak clones in Eastern Europe due to antibiotic Tx and bottleneck selection
  • Implementation of DOTS and DOTS-Plus actually increased presence of the clones in Central Asia
  • Compensatory mutations increasing fitness, i.e. transmissibility of drug-R clones
  • One of Stefan’s older papers on this:

Ruth Timme (Hugh Rand); FDA

Benchmark Datasets for Validating Foodborne Outbreak Investigations: Integrating WGS and Phylogenomic Analyses

  • FDA/CDC/NCBI/FSIS(?) got together to develop a uniform approach for analysis comparison and standardisation of results.
  • Lots of components in the benchmark - isolate, dna, raw data, meta data, output of analyses. how to compare these analyses?
  • I think this is really valuable, would be great to see the details as to the broader reasons of how and why to use these in the github (PA) -
  • The fact that outbreak/epi related isolates are so close makes the evolutionary genetics of it much simpler. it is harder to do broader evo studies.

Madeline Galac; Univ. of North Carolina at Charlotte

Integrating Core Genome Phylogenetic Relationships and Isolate Geographic Data to Trace the 2012 Neisseria meningitidis Outbreak in New York City,

  • Neisseria meningitidis, outbreak associated with MSM. 102 isolates, 79 serogroup C, 2003-2013. 19 outbreak isolates.
  • Were all the outbreak isolates related?
  • Assembled Illumina reads with Velvet then used xBase to annotate (not Prokka or RAST, maybe an old study)
  • Found core genome with OrthoMCL for single copy ortholog groups, aligned each gene with MAFFT and concatenated ~500 genes, then RAxML.
  • Outbreak group was monophyletic, also had isolates from 2008. 2012 outbreak formed a single clone within that clade.
  • Did a betweenness centrality analysis - i.e. how much a location connects other locations to each other
  • Mapped the home location onto the tree, used PAUP to infer the ancestral states/changes, counted the changes and made them into a network visualisation. Brooklyn had highest betweenness. Vast majority coming out of Brooklyn. This was over a 10 year period.
  • Did same thing for just the 2012 outbreak. there were specific neighbourhoods that played a more important role in this one. Aided study of transmission events.
  • I would be interested to see whether there was an international aspect to this MSM outbreak as for Shigella  (PA)
  • Also, multiple SNPs between cases could also be missed steps in transmission chain? FB: Great point. Note that Neisseria are naturally competent for DNA uptake (at a 10^-3 rate, which is really high for bacteria - just spread DNA containing the 13 bp Neisseria uptake seq on a plate, spread the bacteria on top, and presto you get colonies the next day transformed with the DNA you wanted to add to them!). So these multiple SNPs between cases should really be studied further to see how they evolved…
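The betweenness centrality analysis described in this talk can be sketched as follows, assuming the location changes inferred on the tree have already been reduced to a list of location pairs. The code uses Brandes' algorithm on an unweighted, undirected network and is a generic illustration rather than the authors' code. Apart from Brooklyn, the place names are placeholders, and the edges are invented.

```python
from collections import deque

def betweenness(edges):
    """Unnormalised betweenness centrality for an undirected, unweighted graph."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    score = {node: 0.0 for node in graph}
    for source in graph:
        # Breadth-first search counting shortest paths from source
        stack = []
        preds = {node: [] for node in graph}
        sigma = {node: 0 for node in graph}
        dist = {node: -1 for node in graph}
        sigma[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path dependencies from the farthest nodes back to source
        delta = {node: 0.0 for node in graph}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each undirected pair was counted once from each end
    return {node: value / 2 for node, value in score.items()}

# Invented transitions between locations inferred on the tree
edges = [("Brooklyn", "area_1"), ("Brooklyn", "area_2"), ("Brooklyn", "area_3")]
print(betweenness(edges))
```

In this toy network every shortest path between two outlying areas runs through the hub, so the hub carries all of the betweenness.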

Maria Hoffmann; FDA

Whole Genome Sequencing Provides Rapid Traceback of Clinical to Food Sources During a Foodborne Outbreak of Salmonellosis

  • Salmonella Bareilly associated with wide host range, first isolated in India, 1928.
  • Retrospective study, 100 isolates, 41 outbreak, 57 from background, going back to 1960s. finished a genome from the outbreak with pacbio.
  • Bareilly is paraphyletic, one of the clades associated with only east coast, one with west coast.
  • Found an arsenic resistance island in Salmonella Heidelberg as a side effect of investigating outbreak.

Eija Trees; CDC, Atlanta, GA

Transforming Public Health Microbiology in the United States with Whole Genome Sequencing (WGS) - PulseNet and Beyond

  • WGS to decrease turn-around-time to 2-4 days from (as much as) months
  • $125 with MiSeq to sequence E. coli
  • did she miss some info in that cost total -RL? didn’t mention dna extraction for one (or the cost of the bionumerics license)....good point -RL.
  • MLST < rMLST < cgMLST < wgMLST    but all require manual curation
  • TS: I feel that comparison table (provided by BioNumerics)  is very misleading!

David Lipman, NCBI, Bethesda, MD

Pathogen Genomics at NCBI

  • Ultimately want to deal with 1000, 2000 isolates a day.
  • Masks the repetitive/phage/mobile parts of the genome before SNP tree-ing (est 4% of genome)
    FB: the % of genome masks varies greatly between species though.
  • TS: Density filtering of SNPs - a proxy for recombination detection? ala ClonalFrameML, Gubbins, BratNextGen
  • Use “maximum compatibility” trees - does not allow homoplastic sites - all sites must agree with tree.
    TS: I found this reference:
  • Database of AMR genes mentioned but don’t note the collection of sources used.
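The “maximum compatibility” idea above (no homoplastic sites, all sites must agree with the tree) rests on a pairwise test. For two-state sites, two sites can both fit one tree without homoplasy unless all four state combinations occur among the isolates (the four-gamete test). The sketch below shows only that pairwise check on toy data. It is not NCBI's tree-building code, and the 0/1 encoding per isolate is an assumption.

```python
from itertools import combinations

def compatible(site_a, site_b):
    """Two binary sites are compatible unless all four state pairs are observed."""
    return len(set(zip(site_a, site_b))) < 4

def incompatible_pairs(sites):
    """Return index pairs of sites that cannot both fit one tree without homoplasy."""
    return [
        (i, j)
        for i, j in combinations(range(len(sites)), 2)
        if not compatible(sites[i], sites[j])
    ]

# Toy data: each string is one SNP site, one character (0/1) per isolate
sites = ["00110", "00111", "01010"]
print(incompatible_pairs(sites))
```

A maximum compatibility approach would then look for the largest set of mutually compatible sites. That search is not shown here.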

Dag Harmsen, University of Münster, Münster, Germany

Overview of Tools for Microbial NGS Data Analysis

  • SURPI was the pipeline used for diagnosis of neuroleptospirosis in NEJM 2014 (TS: which cited our Leptospira genome paper, yay!)
  • Tablet - next gen sequence assembly visualization
  • Mapathon - used simulated bacterial data -> BWA + GATK diploid as best combination of tools for SNPs and indels (interesting, is this Unified Genotyper or HaplotypeCaller by GATK? UG has been decommissioned by Broad in favour of HC)

David Aanensen, Imperial College, London, UK

Community and Social Data / Applications for Pathogen Genomic Surveillance

Shorter (15 minute) talks

Xiangyu Deng, University of Georgia, Griffin, GA

Salmonella Serotype Determination Utilizing High-Throughput Genome Sequencing Data

  • 30K isolates serotyped/year by US PH depts
  • Retrofitting WGS to phenotyping
  • Serotyping  - including backwards compatibility
  • 46 O-antigens, 114 H-antigens => 2500+ serotypes
  • Identify correct allele by multiple rounds of mapping and BLAST
  • TS: would de novo assembly and BLAST be simpler? The two flagellar (H antigen) loci, fliC and fljB, are sometimes 60% similar to each other at the DNA level, which might confuse an assembly + BLAST approach. I think in the right hands assembly + BLAST would work.
  • one phenotype can be underlain by multiple genotypes in the H antigen determining genes

  • 98.7% accuracy for reads (what is it for assembly?), takes a few minutes (!) on 4 cores. (that’s pretty good)
  • Did he say 99% accurate for assembly+BLAST? In follow up with him afterwards, he seemed to backtrack on this (PA)

FB: I hate to say it but it depends how you define “accuracy” - sometimes actually mean precision or recall. Can ask, since having great recall/sensitivity is great, but not at expense of crappy precision/specificity.
TS: Is there a command line version of this?  
PA: Author claims “Yes” according to Kat, he said it is available on request - he emailed it to me btw.
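FB's point about what “accuracy” means can be made concrete with the standard confusion-matrix definitions. The counts below are invented purely to show how the four numbers can diverge. They are not results from this talk.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from true/false positive and negative counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),       # also called sensitivity
        "specificity": tn / (tn + fp),
    }

# Invented counts for calling one serotype among 1000 isolates
metrics = classification_metrics(tp=90, fp=30, tn=870, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these invented counts the headline accuracy is 96% while precision is only 75%, which is why it is worth asking which measure a quoted figure refers to.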

Errol Strain - CFSAN, FDA

CFSAN SNP pipeline: a whole genome sequencing data analysis pipeline for food-borne pathogens

  • Only use reference < 5000 SNPs away (0.1% divergent)
  • Some post-facto filtering of phage, manually filtered
  • Salmonella Newport quite diverse, 15 SNPs might be linked
  • How to share this kind of information, just in publications, or in some other way?
  • That scares me (TS) that snp thresholds are so different for different serovars. Need domain experts,
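The reference rule above (fewer than 5000 SNPs, about 0.1% divergent) is simple arithmetic once a genome length is fixed. The sketch assumes a genome of roughly 5 Mbp, which is my assumption (the talk did not state a length). The function names are illustrative and not part of the CFSAN SNP pipeline.

```python
def percent_divergence(snp_count, genome_length):
    """SNPs per genome position, expressed as a percentage."""
    if genome_length <= 0:
        raise ValueError("genome length must be positive")
    return 100.0 * snp_count / genome_length

def reference_is_close_enough(snp_count, max_snps=5000):
    """Apply the '< 5000 SNPs away' rule of thumb from the talk."""
    return snp_count < max_snps

genome_length = 5_000_000  # assumed ~5 Mbp genome
print(f"{percent_divergence(5000, genome_length):.2f}% divergent at 5000 SNPs")
print(reference_is_close_enough(1200), reference_is_close_enough(7500))
```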

Ivan Liachko, University of Washington, Seattle, WA

Assembling whole genomes from mixed microbial communities using Hi-C

  • taking advantage of the innovation of reconstructing chromosome conformation in human genetics
  • paper on this
  • Hi-C=chromosome conformation capture
  • cross linking occurs in cell before cell disruption. this allows you to bin contigs from the same original organism. also, within organism you get long range scaffolding information.   problem can be chimeras
  • note: some bacteria have multiple copy number chromosomes eg. Neisseria ~ 5
  • See also Dovetail technology:

Fangfang Xia, University of Chicago

PATRIC pipeline

Speaker was unable to attend and present.

Rima Khabbaz, CDC, Atlanta GA

Integrating Molecular Technologies in Public Health

  • Office  of Advanced Molecular Detection pioneering integrating WGS into PH
  • Goals: IT and lab infrastructure expansion, PH workforce (training and career paths for bioinformaticians), develop programs and projects for AMD innovation
  • AMD in Action
  • Foodborne diseases (centrepiece)
  • -culture independent diagnostics
  • PulseNet (won innovation award), changed how we identify food outbreaks → centralized national DB, creation of PulseNet has resulted in largest recalls ever for PH improvement
  • eg Listeria surveillance with WGS
    • #’s more manageable, well characterized human and food/enviro samples
    • greatly successful, created infrastructure to do WGS in PulseNet (1700 patient, ~2500 food/enviro samples seq)
    • currently comparing clusters generated by PFGE and WGS
    • WGS results in more clusters, with fewer cases/cluster
  • moving to Campy and E. coli surveillance and eventually Salmonella

  • eg Influenza
    • CDC Influenza Division important player in surveillance (1 of 5 centres)
    • monitor virus variation throughout year to inform viral strain selection for vaccine production in Sept (eg 2014 Southern Hemisphere Vaccine)
    • also monitor antiviral resistance
    • WGS has changed viral profiling pipeline → genetically profile FIRST then select subset to propagate/isolate followed by phenotypic characterization → faster, cheaper

  • eg HIV
    • million people in US living with disease (50% living in 4 states including Cali and Florida)
    • WGS improves transmission dynamics studies, allows faster PH response (needle exchange, better drug treatment)

  • eg MERS (Middle East Respiratory Syndrome)
    • automated microfluidics in barcoding pipeline
    • several genomes already submitted to Genbank
    • human seqs track with camels

  • eg Bourbon virus (emerging in Kansas), tick-borne
    • WGS for pathogen discovery  

  • eg AMR
    • WGS adds level of precision, improving knowledge of transmission (endemic in Long term care facilities/nursing homes), highlighted need for regional approach  → Centres of Excellence planned

  • Challenges:
    • Innovation
    • Lack of standardization
    • Automation of analyses for high volumes of data

Charles Chiu, Univ of California

SURPI: Deep Sequencing of Infectious Disease

  • Omni-omics for infectious disease diagnosis
  • Focus on metagenomics for clinical infectious disease diagnosis
  • Agnostic approach-->nearly all microbes can be uniquely identified by NGS
  • Factors for choosing a platform: cost, speed, volume of data, turn-over-time
  • Target clinical unmet need: pneumonia (15-25% unknown cause), meningitis/encephalitis (40-60% unknown cause), fever/sepsis (20% unknown cause)
  • Key:SPEED (mins to hours), epi studies take too long “time is of the essence”
  • Require sensitivity and accuracy, HIPAA-compliant! (EMR integration), reference databases, user-friendly (for PH workers with no bioinformatics expertise)
  • Chiu and Miller 2015 for metagenomic pipeline, wet lab part fairly standard but bioinformatics analysis NOT (req “host subtraction” and essentially throw that info away, align remaining reads to pathogen databases)
  • Computational bottleneck (days to weeks to run this analysis)
  • Kraken (fast taxonomic classifier), first “unbiased, comprehensive benchmark”
  • Many other tools NOT benchmarked
  • SNAP/Bowtie2/STAR (fast nucleotide aligners), 100’s-1000’sx faster (now clinically meaningful timeframes) than BLAST
  • DIAMOND (fast translated nucleotide/protein aligner)
  • EDGE Bioinformatics (see Patrick Chain, Los Alamos), Chris Detter
  • One Codex, best-in-class accuracy, minutes for turn-around-time, HIPAA
  • PathoScope, modular
  • Pathosphere, suite of tools
  • SURPI: Sequence-based Ultra-Rapid Pathogen Identification, for bioinformatics newbies, uses entirety of NCBI NT ref DB, clinical version of SURPI under dev
  • Cloud version (Google cloud) and laptop version able to run on resource poor settings
  • NT alignment with SNAP, fast and scalable
  • Research vs clinical versions→ clinical mods include automated filtering, metadata tagging (background vs contamination vs pathogen), taxonomic classification, pipeline optimization, visualization, server and cloud implementation

  • eg neuroleptospirosis (Josh Osborne) diagnostics, 2yrs ago misclassified because actual pathogen not in the database

  • eg male with deafness and behavioural change, plethora of diagnostic tests in hospital, under 5hrs WGS identifies astrovirus encephalitis

  • eg hemorrhagic encephalitis, extensive diagnostics were negative, NGS Dx IDs amoebic infection (Balamuthia, in under a week), couldn’t have made diagnosis earlier b/c Balamuthia poorly represented in ref DB but could have if DB more comprehensive

  • eg eosinophilic meningitis, tests for viruses, fungi and parasites negative, 2014 DB gives Malassezia (dandruff) top hit, 2015 NGS Dx Angiostrongylus (correct Dx, positive PCR from CDC), Dx had clinical impact!!

  • going forward, want everything to be in CLIA framework
  • SURPI → CLIA-certified pipelines with 24hr TAT, HIPAA compliant, data integration to get NGS Dx’s into patient EMRs
  • precision medicine consult team will access data for decision making
  • has capacity for genotyping but not automated in clinical version  
  • CNS “sterile”, easier to validate sterile sites
  • Docker container available to disseminate SURPI
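
The host-subtraction step noted above (discard host reads, then match what is left against pathogen references) can be pictured with a toy sketch. This is not the speaker's pipeline, which aligns reads with SNAP; exact k-mer matching is swapped in to keep the example short, and every sequence and function name here is invented for illustration.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def classify_reads(reads, host_ref, pathogen_refs, k=8):
    """Drop reads sharing any k-mer with the host, then assign each remaining
    read to the pathogen reference it shares the most k-mers with."""
    host_kmers = kmers(host_ref, k)
    pathogen_kmers = {name: kmers(seq, k) for name, seq in pathogen_refs.items()}
    hits = {}
    for read in reads:
        read_kmers = kmers(read, k)
        if read_kmers & host_kmers:
            continue  # host subtraction: this read is thrown away
        best, best_shared = None, 0
        for name, ref_kmers in pathogen_kmers.items():
            shared = len(read_kmers & ref_kmers)
            if shared > best_shared:
                best, best_shared = name, shared
        label = best if best else "unclassified"
        hits[label] = hits.get(label, 0) + 1
    return hits

# Toy data (invented sequences, not real genomes)
host = "ACGTACGTTTGACCAGTAGGCATCGA"
pathogens = {
    "pathogen_A": "TTTTGGGGCCCCAAAATTGGCCAA",
    "pathogen_B": "GATTACAGATTACACATCATCATG",
}
reads = [
    "ACGTACGTTTGA",  # host-derived, removed
    "GGGGCCCCAAAA",  # from pathogen_A
    "GATTACACATCA",  # from pathogen_B
    "CCCCCCCCCCCC",  # matches nothing in the toy database
]
counts = classify_reads(reads, host, pathogens)
```

The "unclassified" bucket is the toy analogue of the case studies above where the true pathogen was missing from the reference database.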

Randall Olsen, Houston Methodist, Tx

Genomics and Transcriptomics in Clinical Microbiology

  • PCR based tests, MALDI-TOF, WGS to inform and improve patient care
  • 20yrs ago H. influenzae genome seq cost >$1 million and took >1yr
  • “the $10 microorganism genome will soon be a reality”
  • “day in the life of a microbio lab” → 130 samples from 116 patients, can WGS ID unknown organisms for these?
  • 88.5% concordance with ref method, ID’d Mycobacterium 10 days before conventional culture based diagnosis
  • 10 organisms unable to ID by WGS because of deficiency in ref DB
  • Lack of fungi in ref DBs!
  • 400 genomes in validation study (bacteria, fungi and viruses) → 600 clinically ordered WGS tests now! used to supplement routine tests (particularly for fungi and Mycobacterium, also Salmonella and Influenza A to get rapid serotype)
  • Invoke WGS to improve patient care eg AMR, unusual disease presentation (B. cereus → anthrax-like, acquired anthrax toxins on plasmid, informed institutional response)
  • “Fire drill” for outbreak detection rehearsal, “mock rapid response scenario”, is mock outbreak clonal?  what actionable info in clinically relevant timeframe can be generated by WGS → seq analysis in 3 days, select subset and conduct follow up studies
  • WGS showed 5 clusters, informed transmission not initially appreciated in epi studies
  • Also examined gene expression profiles (RNA-Seq), transcriptomics show diffs in strains, has ABC capsule virulence factors overexpressed (unexpected based on genomics), non-coding regions in WGS pipeline previously not analyzed, overexpression of yesMN genes (virulence) led to discovery that there was mixed population (resulting in mods to genomics pipeline)
  • Combining Omics and animal models together enables testing of hypotheses and therapies, integrated as “disaster preparedness plan”

George Garrity, Michigan State Univ, Lansing Mi

A New Genomics-Driven Taxonomy: Are We There Yet?

  • International Code of Nomenclature of Prokaryotes (2008 ed) → anchor points, provide ref organisms
  • 2 culture collections in 2 parts of world → provides refs for changes in platforms, methods etc
  • regulates nomenclature but not taxonomic methodology!!
  • field is dynamic, in 1980 2 200 names, currently 15-16 000 (moving target)
  • 35 000 “nomenclatural acts” since 1980
  • 1980 purged 1000’s of names! only 5% names survive from 80yrs ago
  • taxon calling (“OTUs”) vs identification → need for standards (rigorously validated)
  • proposal for open experiment setting forth series of test cases to test methodologies (are questions asked correct?)
  • Analysis and Validation Methods → “Name for Life” Commercial services

  • objective: create infrastructure to support validation system for ID of Bacteria/Archaea to incorporate genomics data

  • Peter Sneath, “father of numerical taxonomy” (does calculations by hand, doesn’t trust computers), max likelihood 16S tree no longer calculable → Garrity to arrange info in Bergey’s Manual
  • Currently, thresholds for classification overlap
  • Principal components analysis of data (nucleotide identity, aa identity, kmer)
  • Latent semantic analysis against 16S data, ANI, AAI
  • Size of genome is problem
  • PCA analyses - ancillary plot should contain 85% of data
  • Need to develop distortion free data viz tool
  • Pairwise combination anywhere in the heat map
  • Nearest neighbours → move out and find boundaries of taxa
  • Classifier goes through matrices of heat maps, >2SD → flag for reclassification (sp level rearrangement)
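
A generic version of the ">2SD → flag" rule can be sketched as below. How the speaker's classifier actually walks the heat-map matrices was not described in the talk notes, so this only shows the thresholding idea; the strain names and similarity values are invented.

```python
from statistics import mean, stdev

def flag_outliers(similarities, n_sd=2.0):
    """Return names whose similarity to the rest of the taxon lies more than
    n_sd sample standard deviations from the mean."""
    values = list(similarities.values())
    mu, sd = mean(values), stdev(values)
    return sorted(name for name, s in similarities.items()
                  if abs(s - mu) > n_sd * sd)

# Invented average-identity values (%) for eight strains carrying the same species name
ani_to_taxon = {
    "strain_A": 97.5, "strain_B": 97.8, "strain_C": 98.1, "strain_D": 97.2,
    "strain_E": 97.9, "strain_F": 98.0, "strain_G": 97.6, "strain_H": 85.0,
}
flagged = flag_outliers(ani_to_taxon)
```

With very few members a single outlier can never exceed 2 SD of a sample that includes it, so a rule like this needs a reasonable number of genomes per taxon.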

  • eg Streptomyces → novel microbial products, nomenclature got “cleaner”

  • eg Eubacterium should be phylum
  • eg Mycoplasma could be more genera

  • take home: statistical use of genomics data to develop better taxonomy

Martin Maiden; Univ. of Oxford, Oxford, United Kingdom

Beyond Typing and Phylogeny: the Population and Functional Genomics of the Neisseria

  • 19yrs ago everyone developing own gel-based methods → gradual adoption of PCR and nt-based detection
  • MLST based on housekeeping genes
  • 7 loci used for 100 Neisseria, 7 loci ST summarizes 3 284 bp = 0.15% of 2.18 Mb genome (compresses 3200 bp in 7 digits =ST)
  • 11 525 STs, 35K isolates, 507-780 alleles/loci → can use “bursts” to cluster STs (stable complex)
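
The compression described above (seven allele numbers standing in for about 3.3 kb of sequence) can be sketched as a lookup table. The locus names are the standard Neisseria MLST loci; the allele numbers and ST numbers are invented, since real profiles come from PubMLST. Treating STs that share 6 of 7 loci as neighbours is one common way of forming "bursts" and is an assumption here, not something stated in the talk.

```python
LOCI = ("abcZ", "adk", "aroE", "fumC", "gdh", "pdhC", "pgm")

# Invented profile table for illustration only
PROFILES = {
    (1, 1, 2, 1, 3, 1, 2): 101,
    (1, 1, 2, 1, 3, 1, 5): 102,
}

def sequence_type(alleles):
    """Map a dict of locus -> allele number to an ST, or None if the profile is novel."""
    key = tuple(alleles[locus] for locus in LOCI)
    return PROFILES.get(key)

def shared_loci(profile_a, profile_b):
    """Count loci at which two allelic profiles carry the same allele."""
    return sum(a == b for a, b in zip(profile_a, profile_b))

isolate = dict(zip(LOCI, (1, 1, 2, 1, 3, 1, 5)))
st = sequence_type(isolate)

# Single-locus variants share 6 of 7 alleles
profile_101, profile_102 = list(PROFILES)
is_single_locus_variant = shared_loci(profile_101, profile_102) == 6

# Fraction of the genome summarised by the ST, using the figures in the notes
percent_of_genome = 3284 / 2_180_000 * 100
```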

  • Reviewing BIGSdb

  • PubMLST
    • 1300 submitters, data curated → 90 MLST scheme used for molecular typing, species ID
    • Autotagger to annotate genomes (can feed into NeighbourNet)

  • Maiden 2013, Nat Rev Microbiol (Hierarchical genome analysis)
    • 16S-->MLST-->rMLST-->wgMLST

Mentioning Alexander von Humboldt’s Three Stages of Scientific Discovery:
  • first they deny it’s true
  • then they deny it’s important
  • then they credit the wrong person

  • Neisseria spp. - studying diverse phenotypes
  • Mening carriage across the meningitis belt - study published this year:
  • “The Diversity of Meningococcal Carriage Across the African Meningitis Belt and the Impact of Vaccination With a Group A Meningococcal Conjugate Vaccine.”

  • More data re vaccinated vs unvaccinated districts - showed herd immunity occurring in the vaccinated region vs non-vaccinated regions. FB: There is a pub associated with herd immunity that Martin mentioned to me at lunch:  

  • Napoleon: “History is the version of events that people decided to agree upon.”

Abu Mustafa; Kuwait Univ., Jabriya, KUWAIT

Next Generation Sequencing of Brucella melitensis Isolates from Kuwait and Comparative Genome Analyses

  • Brucellosis - reservoirs include camels, dogs, goats, swine, sheep
  • found in milk, cheese, dairy
  • highly infectious, aerosol transmission
  • potential biological agent, painful illness
  • top 10 impactful diseases to poverty ridden humans
  • difficult to diagnose relapse vs re-infection
  • culture, biochemical characterization, serotyping used traditionally for ID of spp/biovars
  • new methods needed for surveillance (in Kuwait, all B. melitensis)
  • identified 15 B. melitensis by PCR and standard methods (16S)
  • reads trimmed and filtered with FastX tool
  • QUAST used for assembly quality
  • 2 chromosomes, 1.2 & 2.1 Mb

All unpublished:

  • going through all methods, parameters in detail at start.
  • Genomes seq’d immediately reveals one of the B. melitensis isolates was an outlier but no other information was provided
  • 10 variants/kb and 14 variants/kb in chrom 1 and 2, respectively
  • Two major variant groups, plus the one outlier seen clearly in trees.
  • Mentions isolate-specific variations identified, aiding epi studies as possible markers

Scott Federhen; NCBI, Bethesda, MD

Microbial Genomic Taxonomy at GenBank

“Taxonomy in the trenches”
  • Type vs genome from type
  • ProxyType scores vs ANI to type (ANI cutoffs change between spp)
  • Curation to correct misidentification (NCBI will just change name and add comment block instead of asking permission of author)

Planning to change names in entries that seem to be incorrectly taxonomically predicted - with a comment showing the “ANI” percentage for old name versus (higher ANI) new name as evidence. Going to notify authors of this change, but this is the first time they won’t require author acceptance to change a GenBank entry. FB: This is really notable since it is the first time GenBank is making blanket changes to original GenBank entries (rather than their curated RefSeq) in this way without author agreement. However, they had a workshop with taxonomists to consult with them on this, and got agreement on making a blanket change. Federhen says they are trying to be really careful with this one. I’m sure the authors would appreciate the notice, and these fixes are necessary, but author input may also be key to note any errors in the automated approach, and make potential improvements to taxa correction that may be even more accurate.

Kat Holt, Univ of Melbourne, Australia

What do we need from microbial genomics surveillance software?

What are the considerations for using a genomics pipeline in a PH setting?
  1. what are we looking at (bits of genome, SNPs, MLST, Kmer, core genomes, outputs, confidence values)?
  2. how do we know it’s right?
  3. who is doing the analysis? what do I need? what are the inputs? will it all fit in with what I do right now?
  4. reproducibility? robust outcomes? how will the system/pipeline change with future updates, contamination or need for troubleshooting?
  5. are results interpretable? how is metadata integrated?
  6. will the results allow us to make good PH decisions and how will we know?

Errol Strain- CFSAN, FDA

  • Datasets for the challenge:
    • Multistate Listeria outbreak (18 isolates)-need to do matching to NCBI enviro/food isolates -> “elementary”
    • Enteritidis (50 isolates), matching to known clusters -> “more difficult”

Yan Luo - CFSAN, FDA

  • Bowtie2 → SAMtools → variants → custom script for SNP list → SNP matrix
  • Listeria: 1300 SNPs b/w facility 1 and 2, 6 clinical matches to facility 1
  • Salmonella: more diverse than Listeria, more clusters
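
The last step of the workflow above (SNP list → SNP matrix) amounts to counting pairwise differences over the variable sites. A minimal sketch follows; it is not the CFSAN script. The isolate names and bases are invented, and skipping sites with an "N" (no call) is an assumption about how missing data is handled.

```python
def snp_distance(a, b):
    """Count sites where two equal-length SNP strings differ, skipping no-calls ('N')."""
    if len(a) != len(b):
        raise ValueError("SNP strings must be the same length")
    return sum(x != y for x, y in zip(a, b) if "N" not in (x, y))

def distance_matrix(alignment):
    """Return {(isolate_a, isolate_b): distance} for every ordered pair of isolates."""
    names = sorted(alignment)
    return {(p, q): snp_distance(alignment[p], alignment[q])
            for p in names for q in names}

# One column per variable site (toy data)
alignment = {
    "clinical_1": "ACGTAC",
    "facility_1": "ACGTAN",
    "facility_2": "TTGAAC",
}
matrix = distance_matrix(alignment)
```

Here the clinical isolate is 0 SNPs from facility 1 and 3 SNPs from facility 2, which is the kind of contrast the matching exercise relies on.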

Hannes Pouseele - Applied Maths

  • BioNumerics 7.5  - used wgMLST and wgSNP point and click GUI modules
  • ie. assembly based + assembly-free ; want both to agree for confidence (within caveats)
  • rough and fine cluster detection, resolving clusters req’s exposure etc (more metadata)
  • calculation engine→ “warm shoebox”
  • Option to have the engine in the cloud rather than physical machine purchase
  • 26min to run one Listeria sample (Velvet assembly took 16min alone)
  • included some QC highlights, possible contamination detected

Katja Einer-Jensen - Qiagen (CLC)

  • pipeline details at poster 6
  • dashboard includes running analyses side-by-side with metadata

David Aanensen - Imperial College / Sanger

  • population tree: ref genomes-->FastQ-->draft genomes-->core gene families
  • new metadata can be added as req’d to csv file
  • population tree looks nice (visualization), pretty slick, can select “source” metadata to overlay on tree

Jörg Rothgänger - Ridom

  • SeqSphere
  • cgMLST, cluster threshold <10
  • SRA FASTQ, epi download → assembly, allele calling → QC and EWS → Tree
  • nice metadata visualization (isolation source, collection date, geo_loc)
  • state info missing for clinical cases
  • ad hoc cgMLST for Enteritidis, more complex tree
SNP typing cluster criteria from FDA
can get SeqSphere (and solve outbreaks) from the comfort of home!

Torsten Seemann - Uni Melbourne

  • Nullarbor pipeline (unix command line): Job name/ID→ csv file input → MLST → phylogeny
  • sequencing QC → identified possible contamination/mixed population? (species ID with Kraken), assembly with MEGAHIT to “good enough quality”
  • resistome report using ABRicate software based on assembled contigs
  • core genome based on alignment to ref using Snippy
  • ML tree using FastTree
  • SNP distance matrix that epi’s get to see
  • FriPan uses Roary to determine pan genome
  • one Listeria sample seems to have 2 genomes
  • ref for Salmonella should be within 1000 SNPs
  • Message: use pan as well as core genome, combine multiple lines of evidence
  • TS: won prize for being first (only?) pipeline to detect L. innocua contaminant

Nabil-Fareed Alikhan  - Uni Warwick

  • Enterobase: analyses, curation, AMR, pan-genome, core SNPs → goal: make it useful to all people
  • simple web interface
  • Enterobase updates from SRA hourly
  • detected some QC issues
  • cgMLST
  • 2-3 clusters identified, need more metadata to resolve cluster 3
  • BioNumerics 7.5
  • 1% diff=36 alleles

Philip Ashton - Public Health England

  • SnapperDB
  • GitHub and CLIMB image (Cloud Infrastructure for Microbial Bioinformatics; Birmingham, Warwick, Swansea)
  • install often hardest part of any pipeline
  • FASTQs→ SNPdb (PostgreSQL) --> SNP alignments→ tree
  • lists variants and ignored positions
  • generates SNP address kind of like IP address
  • connect isolates within 100 SNPs of each other
  • nice SNP address tree
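
The "SNP address" idea above can be sketched as single-linkage clustering repeated at a series of SNP thresholds, with the cluster number at each level joined by dots. This is an illustration of the concept rather than SnapperDB's code; the thresholds (100, 10, 0) and the distances are assumptions chosen for a small example.

```python
def single_linkage(names, dist, threshold):
    """Cluster isolates so that any pair within `threshold` SNPs ends up together,
    directly or through a chain of intermediate isolates."""
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n

    for (a, b), d in dist.items():
        if d <= threshold:
            parent[find(a)] = find(b)

    # Number clusters in order of first appearance so labels are stable
    labels, assignment = {}, {}
    for n in names:
        root = find(n)
        labels.setdefault(root, len(labels) + 1)
        assignment[n] = labels[root]
    return assignment

def snp_addresses(names, dist, thresholds=(100, 10, 0)):
    """Join the cluster number at each threshold into a dotted address."""
    levels = [single_linkage(names, dist, t) for t in thresholds]
    return {n: ".".join(str(level[n]) for level in levels) for n in names}

names = ["iso1", "iso2", "iso3", "iso4"]
dist = {
    ("iso1", "iso2"): 0, ("iso1", "iso3"): 7, ("iso2", "iso3"): 7,
    ("iso1", "iso4"): 300, ("iso2", "iso4"): 300, ("iso3", "iso4"): 305,
}
addresses = snp_addresses(names, dist)
```

As with an IP address, isolates sharing a longer prefix are more closely related: iso1 and iso3 agree down to the 10-SNP level, while iso4 differs at the first position.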

Aaron Petkau - PHAC, Canada

  • SNVPhyl, part of Canada’s IRIDA gen epi platform
  • integrates genomics, epi, lab, clinical metadata
  • ref mapping → variant ID and filtering → wg phylogeny
  • implemented in Galaxy (web interface, API, provenance), QA/QC reporting, re-labeling of tree
  • Listeria: ref produced de novo with SPAdes, remove phage and repeats
  • matches defined as isolate within 0-4 SNVs
  • ~100 SNVs between facility 1 and 2
  • removed ASM20, too little data, didn’t meet min coverage of 10x after filtering
  • 3 clusters matching with clinical test dataset
  • One of the few methods mentioning the use of Galaxy for workflow

Martin Thomsen, Centre for Genomic Epidemiology, DTU

  • KmerFinder and assembly→ ResFinder → MLST based on results from KmerFinder → other “Finders”
  • batch upload
  • can download results in excel file
  • Pipeline is available as a Docker image here: (I know it exists somewhere)
  • CSI Phylogeny (SNP tree), BWA mapping, quality >30, depth >10, distance to nearest SNP >10
  • NDtree (Kmer based)
  • a lot of Salmonella linkage
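
The three SNP filters quoted above (quality >30, depth >10, distance to nearest SNP >10) can be sketched as follows. The record layout and function name are hypothetical, and measuring spacing only against calls that already passed the quality and depth filters is an assumption about ordering, not a description of the CSI Phylogeny implementation.

```python
def filter_snps(calls, min_qual=30, min_depth=10, min_spacing=10):
    """Return positions of calls that pass quality and depth filters and are
    more than min_spacing bp away from every other passing call."""
    passing = sorted(
        (c for c in calls if c["qual"] > min_qual and c["depth"] > min_depth),
        key=lambda c: c["pos"],
    )
    kept = []
    for i, call in enumerate(passing):
        near_prev = i > 0 and call["pos"] - passing[i - 1]["pos"] <= min_spacing
        near_next = (i + 1 < len(passing)
                     and passing[i + 1]["pos"] - call["pos"] <= min_spacing)
        if not (near_prev or near_next):
            kept.append(call["pos"])
    return kept

# Toy calls (invented values)
calls = [
    {"pos": 100, "qual": 50, "depth": 40},   # too close to position 105
    {"pos": 105, "qual": 60, "depth": 35},   # too close to position 100
    {"pos": 500, "qual": 20, "depth": 50},   # fails quality
    {"pos": 900, "qual": 45, "depth": 8},    # fails depth
    {"pos": 1500, "qual": 99, "depth": 60},  # kept
    {"pos": 1511, "qual": 80, "depth": 30},  # kept (11 bp from nearest)
]
kept = filter_snps(calls)
```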

Zamin Iqbal - Uni Oxford

  • Reference-free de Bruijn graph (DBG) - sits between de novo assembly and read alignment
  • 4000 samples took 2 days using ~16 cores
  • But can save these calculations and re-use for future analysis ie. background samples
  • FastTree for tree
  • looking only for segregating variation, matter of minutes
  • Map back coordinates to “close reference” (unclear)
  • Awesome phage sharing matrix heatmap with hierarchical clustering
  • Using phage to distinguish close samples on tree
  • AMR identification module
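
The reference-free idea above can be illustrated with a toy "coloured" k-mer table: every k-mer is labelled with the samples that contain it, and k-mers missing from at least one sample mark segregating variation. This is a much-reduced stand-in for a real de Bruijn graph variant caller (no edges, no error handling, assembled sequences instead of reads), and the sequences are invented.

```python
from collections import defaultdict

def coloured_kmers(samples, k):
    """Map each k-mer to the set of samples ("colours") whose sequence contains it."""
    colours = defaultdict(set)
    for name, seq in samples.items():
        for i in range(len(seq) - k + 1):
            colours[seq[i:i + k]].add(name)
    return colours

def segregating(colours, n_samples):
    """Keep only k-mers that are absent from at least one sample."""
    return {kmer: names for kmer, names in colours.items() if len(names) < n_samples}

# Two toy samples differing by a single base (T vs A at the fifth position)
samples = {"s1": "ACGTTGCA", "s2": "ACGTAGCA"}
colours = coloured_kmers(samples, k=3)
variant_kmers = segregating(colours, len(samples))
```

A single-base difference shows up as k sample-specific k-mers on each side (here three per sample), which is the "bubble" a graph-based caller would trace. Because the k-mer table is built per sample, it could be stored and reused when new samples arrive, in the spirit of the re-use point above.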

Nick Greenfield -  One Codex

  • assembly free
  • focused on improving ref DB (>40 000 distinct genomes, reduce false positives)
  • no Listeria typing DB
  • FASTQ→ add metadata→ metagenomic classification
  • found 2 clusters

Bill Klimke - NCBI

  • quality issues and standards for NGS
  • need to draw attention to all points where errors could arise (wet lab/computational analyses) so they can be addressed
  • samples dependent on metadata and contextual data
  • sample mixups, contamination, digital data mixups
  • need better standardized ways of integrating data
  • QA/QC is moot if upstream errors not reduced/solved

Bruce Budowle, Univ of Texas, Austin, Tx

Microbial forensics and its needs for standards and standardization

  • Excluding culprits is as important as identifying culprits
  • Info can be limited but still useful. microbial forensics is multidisciplinary
  • Bioterrorism investigations complicated by background noise of sporadic and accidental foodborne pathogens at large
  • Food and agriculture targets eg US vs Canadian BSE in cows, who has madder cows?
  • Wide number of forensic outcome scenarios, possibly retaliation such as invasion
  • Who, what, when, where to assess plausibility of bioterrorism acts
  • Need to supply standards of proof with measures of certainty
  • Quality assurance guidelines to advise community (valid, rigorous)
  • Need to define validation (spans collection, shipping and storage, extraction, analysis, interpretation), criteria and outcomes that qualify (and exceptions or alternatives during extreme circumstances)
  • Need some stability to create gold standard, if technology always changing, not good as benchmark
  • Practitioners can’t afford dynamic change
  • Standards: references (DBs and panels), quality metrics and levels; Standard Performance Methods Requirements (SPMRs)
  • “protect the country”
  • Clonality, unknown histories, abandon concept of individualization?
  • Attribution decisions require more info than just genomics (+law, policy, intelligence etc)
  • Correcting bacterial genome metadata with AutoCurE!!
  • Marker selection criteria (eg gene scoring)

  • Goal is attribution - who committed the crime, as well as who did not commit the crime

  • Science does not have to say something beyond reasonable doubt, that is the requirement of the whole case. Microbial forensics has to deal with a plethora of potential culprits.

  • high background, need epidemiology to distinguish deliberate release. Peanut guy who got arrested - microbial evidence is part of the case, not the whole shebang. was that a joke about native americans and smallpox?

  • validation  - define limitations of technique so don’t go beyond the boundaries of your method. in exigent circumstances, can accept non-’validated’ results.

  • Gold standard just means more people using it than something else
  • Don’t want to become a prisoner of QA.

  • When thinking about adoption of new techniques, need to take a new look at old techniques to make a proper comparison of pros/cons

Paul Keim, Northern Arizona Univ, Flagstaff, Az

Anthrax - Molecular epidemiology and forensics from WGS and metagenomic sampling of complex specimens

  • The anthrax FBI investigation
  • B. anthracis strictly clonal, no evidence of LGT
  • Canonical SNPs (landmarks for naming)
  • Mutations causing phenotypic diffs all within markers
  • Outbreak of anthrax could take years to develop, and perhaps decades to detect
  • NASP (“pipelines are like elbows, everybody’s gotta have at least 2”), open source, ref-dependent (single or pan-genome), supports reads or assemblies, fast, scales linearly
  • Monsoon
  • ~12 000 SNPs, use for inclusion vs exclusion
  • “A clade”, out of Africa (10 000yrs ago)
  • eg anthrax and heroin users in Scotland
    • 200 suspected cases, 100 confirmed
    • 14 deaths
    • Scotland, England, Germany
    • using canonical SNPs, 2 Turkish isolates closest to Scottish drug user isolate
    • concluded that heroin contaminated during smuggling process → feds ran with idea, which turned out to be too strong
    • expanded European screening → PCR-based assay + bigger ref populations → 2 outbreaks!
    • but injectional anthrax groups overlapped in time and space
    • req’d bilateral agreements → model collaborative project to get Germany to work with US (contracts in place)
  • Soviet Union weaponized spores in industrial complex (Sverdlovsk)
    • 1979 left most filters off production facility, rupturing remaining filters, sending out plume of spores
    • in violation of international treaties
    • US obtained fixed pathology samples from victims, PCR confirmed B. anthracis
    • how low can you go and still ID strains? normally 50-100x coverage, 20x, 10x, 1x (would result in 10 000 miscalled SNPs at 1x - only 12 000 known SNPs in species)?
    • WG-FAST (whole genome focused array SNP typing)
    • turns out 1x can be done! only genotype SNPs you already know!
  • Placement confidence landscape (with E. coli, 270 genomes x 255 000 SNPs, phylogenetic position matters, only 500 SNPs req’d to place!)
  • can examine AMR to see if Russians using AMR strains → absolutely WT
  • Based on Monte Carlo resampling, need 500 SNPs to ID/place in tree with 95% accuracy
  • No culture? no problem with GOOD REF!!
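
A back-of-envelope check on why 1x coverage can still be enough when only known SNP positions are genotyped: assuming read depth at each site follows a Poisson distribution with mean equal to the coverage (a simplification that ignores sequencing error and the uneven coverage expected from fixed tissue), the expected number of known SNP sites hit by at least one read is shown below. The 12 000 and 500 figures are from the notes above; the Poisson model is an assumption added for illustration.

```python
import math

def expected_covered(n_sites, coverage):
    """Expected number of sites with at least one read, assuming per-site depth
    is Poisson-distributed with mean `coverage`."""
    return n_sites * (1 - math.exp(-coverage))

known_snps = 12_000        # known SNPs in the species, per the notes
needed_for_placement = 500  # SNPs quoted as sufficient for placement

covered_at_1x = expected_covered(known_snps, 1.0)
enough = covered_at_1x > needed_for_placement
```

Under this model roughly 63% of known sites (about 7 600) would be observed at 1x, well above the roughly 500 SNPs quoted as sufficient for placement in the tree.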