
Network | Importing data from Next-Generation Sequencing using pibase


Users planning to reconstruct phylogenetic networks from their own NGS experiments should be aware that standard NGS software usually reports a variant only if the genotype is non-reference and there is sufficient evidence for it. Conversely, an unreported position is not necessarily a reference genotype: it may equally be a sequencing failure, a coverage gap, or a position with insufficient evidence for a non-reference variant. Networks generated from such data are likely to consist largely of artifacts.
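The distinction can be illustrated with a minimal Python sketch. It is not how pibase or any particular variant caller works: the function name, the thresholds and the assumption of a haploid marker (one expected base per position) are all hypothetical. It only shows that a position needs three possible outcomes (reference, variant, no call) rather than two.

```python
from collections import Counter

def call_genotype(bases, ref, min_depth=8, min_fraction=0.9):
    """Classify one position of a haploid marker.

    bases: string of bases observed in the reads covering the position.
    ref:   reference base at that position.
    min_depth, min_fraction: illustrative thresholds, not recommendations.
    Returns 'ref', 'alt:<base>' or 'no-call'.
    """
    counts = Counter(bases)
    depth = sum(counts.values())
    if depth < min_depth:
        return "no-call"            # coverage gap or sequencing failure
    base, n = counts.most_common(1)[0]
    if n / depth < min_fraction:
        return "no-call"            # reads disagree: insufficient evidence
    return "ref" if base == ref else f"alt:{base}"

print(call_genotype("AAAAAAAAAA", "A"))  # confident reference genotype
print(call_genotype("GGGGGGGGGG", "A"))  # confident variant
print(call_genotype("AAA", "A"))         # too few reads to decide
```

A variant list that records only the second outcome cannot distinguish the first from the third, which is the source of the artifacts described above.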

To generate more accurate rdf files, we recommend the free pibase package (or use this link if the University of Kiel web page is down).
Ask your NGS-bioinformatician to use pibase to scan the alignment files of all samples at a list of genomic coordinates (e.g. the known dbSNP positions within your genomic windows of interest) and to generate a binary rdf file. It may be a good idea to discuss the pibase "phylogenetic work-flow" with your NGS-bioinformatician, as there are Network-specific pibase options which may be more familiar to you, as well as NGS-specific options which may be more familiar to the bioinformatician.
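As a rough picture of what scanning all samples at a fixed list of coordinates yields, the following Python sketch builds one row of binary characters per sample. It does not call pibase and does not write the rdf format; the function name, the input layout, the sample names, the positions and the use of "N" for a position without a confident call are assumptions made for illustration only.

```python
def binary_matrix(calls, positions, reference):
    """Encode per-sample base calls at fixed positions as 0/1/N strings.

    calls:     {sample: {position: base or None}}, None or absent = no call.
    positions: list of genomic coordinates to scan, in output order.
    reference: {position: reference base}.
    """
    rows = {}
    for sample, per_pos in calls.items():
        row = []
        for pos in positions:
            base = per_pos.get(pos)   # None if the position was not called
            if base is None:
                row.append("N")       # missing data, not a reference genotype
            elif base == reference[pos]:
                row.append("0")       # reference genotype
            else:
                row.append("1")       # non-reference genotype
        rows[sample] = "".join(row)
    return rows

# Hypothetical coordinates and calls
positions = [100, 200, 300]
reference = {100: "A", 200: "C", 300: "A"}
calls = {
    "sample1": {100: "G", 200: "C", 300: None},
    "sample2": {100: "A", 300: "G"},
}
for sample, row in binary_matrix(calls, positions, reference).items():
    print(sample, row)
```

Every sample is examined at every coordinate, so a "0" here is a positive observation of the reference base rather than the mere absence of a reported variant.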


For those who are unfamiliar with NGS

Next-Generation Sequencing data from whole-genome shotgun sequencing and targeted sequencing (PCR amplicons or hybridisation-probe-based targeting) are becoming increasingly accurate and reliable. Currently, targeted sequencing costs about $200 or less per sample, and a whole genome about $5000 or less. The data sets are huge: tens of gigabytes or more. A complete genome comprises hundreds of millions of sequences.

Some limitations follow from the way the process works: DNA or cDNA is fragmented to a suitable length, the fragments are sequenced in a massively parallel micro-process, and finally the fragment sequences are aligned to a genome-sized reference sequence. Approximate heuristic methods are used to align these sequences because exact methods are too slow. Current NGS sequences are usually at least 100 nucleotides long. As each sequenced DNA/cDNA fragment comes from an unknown random location in the genome, high coverage is required to sample the genomic variants reliably (e.g. 50 independent fragments covering the same stretch of genomic sequence). The genomic variants are estimated from the consensus of the sequences, after filtering these sequences for artifacts. Variant lists are never complete: the greater the sequence length, the greater the fraction of the genome that can be sequenced, but there are always uncharted regions, which vary from individual to individual.
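The relationship between the number of sequences, their length and the coverage can be made concrete with a little arithmetic. The Python sketch below assumes a genome of roughly 3 billion nucleotides (a rounded figure for the human genome), 900 million reads of 100 nucleotides, and reads that fall uniformly at random, so that the depth at a position follows a Poisson distribution. Real coverage is less even than this idealised model, so the result is an optimistic estimate; the function names and the minimum depth of 15 are hypothetical.

```python
import math

def mean_coverage(n_reads, read_length, genome_size):
    """Average number of reads covering a position."""
    return n_reads * read_length / genome_size

def prob_depth_below(k, mean):
    """P(depth < k) at a position, assuming Poisson-distributed depth."""
    return sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(k))

# Assumed values: 900 million reads of 100 nt on a 3-billion-nt genome
cov = mean_coverage(900_000_000, 100, 3_000_000_000)
print(f"mean coverage: {cov:.0f}x")

# Expected fraction of positions covered by fewer than 15 reads
p_low = prob_depth_below(15, cov)
print(f"fraction of positions below depth 15: {p_low:.2e}")
```

Even a small fraction of poorly covered positions corresponds to a large absolute number of positions in a genome of this size, which is one reason why variant lists remain incomplete.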