NCGR has amassed a large portfolio of bioinformatics expertise and tools and serves researchers in academia and industry around the world as a conscientious knowledge-discovery partner for tackling complex analytical problems. The NM INBRE Sequencing and Bioinformatics Core (SBC) leverages this expertise, which includes:
Analyses of genomic variation, including variant detection, marker development, genotyping by sequencing, functional characterization, and phenotypic association studies.
Expression analyses, including gene- and isoform-level differential expression, allele-specific expression, small RNA studies, and pathway analyses.
Epigenetic analyses of single base pair resolution methylation states.
ChIP-Seq differential binding and chromatin modification studies.
Data-driven genome annotation and structural variation (CNV, translocations, novel insertion) studies.
Genome and transcriptome assembly and annotation.
NCGR offers extensive expertise in the use of proprietary, open-source, and in-house tools to support researchers’ pursuits. These include off-the-shelf tools such as the Statistical Analysis System (SAS), the gold standard in statistical software; its genomics analysis and visualization component, JMP Genomics; Mathematica for computation and visualization; and GeneGo, a powerful functional database for exploring pathway enrichment and analyzing networks. A further 30+ open-source tools focus on genome and transcriptome analysis in the areas of variant detection, visualization, differential gene expression analysis, genome structural and functional annotation, genome and transcriptome de novo assembly, and pathway/network analysis. New tools are added continually to the NCGR/SBC arsenal for our network colleagues and students to learn and use.
Presented below is a recommended Bioinformatician Tool-kit of tools that are frequently used in bioinformatics analyses of next-generation sequencing projects. NCGR scientists and analysts have compiled this list of open-source and NCGR-developed tools for the NM-INBRE community to apply to their own projects. Bioinformatics is a dynamic, constantly changing field, so please note that these are suggestions as of Sep 20, 2013. It is always good to refer to tool home pages for the most up-to-date versions and usage recommendations.
The tools we use for informatics analysis and training include proprietary tools, open-source tools, and tools we have developed in house.
Proprietary tools include the Statistical Analysis System (SAS), the gold standard in statistical software; its genomics analysis and visualization component, JMP Genomics (JMP originally stood for John’s Macintosh Project); and Mathematica for computation and visualization. We use GeneGo, a powerful functional database, for pathway enrichment and network analysis in human, mouse and rat projects.
Open-source tools include the following:
- The Basic Local Alignment Search Tool (BLAST) to find regions of similarity between nucleotide or protein sequences, infer functional and evolutionary relationships, and identify species. BLAST+ allows faster searches as well as more flexibility in output formats and in the search input.
- The Genomic Mapping and Alignment Program for mRNA and EST Sequences (GMAP) for aligning mRNA and EST sequences to a genome; it has minimal hardware requirements and provides fast batch processing of large sequence sets. The related Genomic Short-read Nucleotide Alignment Program (GSNAP) can align both single- and paired-end reads. NCGR’s Alpheus variant and expression detection pipeline uses GSNAP as its alignment algorithm.
- The Burrows-Wheeler Aligner (BWA) is an efficient program that aligns relatively short nucleotide sequences against a long reference sequence such as the human genome.
- Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences.
- For genome-wide comparison we use MUMmer, which rapidly aligns very large DNA and amino acid sequences. It is particularly useful for comparing a de novo assembly to a known reference.
- Tablet is a lightweight, high-performance graphical viewer for next-generation sequence assemblies and alignments.
- The Comparative Map and Trait Viewer (CMTV), developed by NCGR, is a comparative map and trait visualization framework that enables visual integration of genomic data from disparate data sources and allows rich client-side interactivity and manipulation. It is extensible through plugins for new data sources and algorithms.
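The BLAST+ entry above mentions its flexible output formats. As a minimal sketch, assuming hits were saved in the 12-column tabular format (`-outfmt 6`) with its default column order, the Python below parses such output and keeps the highest-scoring hit per query. The function names and the sample hits are hypothetical and exist only for illustration.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]
FLOAT_FIELDS = ("pident", "evalue", "bitscore")
INT_FIELDS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")


def parse_blast_tabular(handle):
    """Yield one dict per hit from tab-separated BLAST output."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        for key in FLOAT_FIELDS:
            hit[key] = float(hit[key])
        for key in INT_FIELDS:
            hit[key] = int(hit[key])
        yield hit


def best_hits(hits):
    """Keep the hit with the highest bit score for each query sequence."""
    best = {}
    for hit in hits:
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return best


# Made-up hits for illustration only.
EXAMPLE = (
    "read1\tgeneA\t98.50\t200\t3\t0\t1\t200\t501\t700\t1e-100\t360\n"
    "read1\tgeneB\t91.00\t180\t16\t0\t5\t184\t10\t189\t2e-60\t230\n"
    "read2\tgeneC\t100.00\t150\t0\t0\t1\t150\t1\t150\t3e-75\t278\n"
)

if __name__ == "__main__":
    for query, hit in best_hits(parse_blast_tabular(io.StringIO(EXAMPLE))).items():
        print(query, hit["sseqid"], hit["pident"], hit["evalue"])
```

With a real results file, an open file handle can be passed in place of the `io.StringIO` wrapper.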
Differential gene expression analysis
- The R project (a language and environment for statistical computing and graphics) combined with packages from the Bioconductor suite such as DESeq and edgeR.
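To make the idea of differential expression concrete, here is a minimal Python sketch that scales raw read counts to counts per million and reports a log2 fold change between two samples. It is not how DESeq or edgeR work: both packages fit negative binomial models and test for significance across replicates. The function names, the pseudocount, and the gene counts are all assumptions made for illustration.

```python
import math


def counts_per_million(counts):
    """Scale the raw read counts of one sample so that they sum to one million."""
    total = sum(counts.values())
    return {gene: 1e6 * n / total for gene, n in counts.items()}


def log2_fold_change(control, treated, pseudocount=1.0):
    """Log2 ratio of treated to control CPM; the pseudocount avoids division by zero."""
    cpm_control = counts_per_million(control)
    cpm_treated = counts_per_million(treated)
    return {
        gene: math.log2((cpm_treated[gene] + pseudocount) /
                        (cpm_control[gene] + pseudocount))
        for gene in control
    }


# Made-up read counts for three genes in two samples.
CONTROL = {"geneA": 100, "geneB": 300, "geneC": 600}
TREATED = {"geneA": 400, "geneB": 300, "geneC": 300}

if __name__ == "__main__":
    for gene, lfc in log2_fold_change(CONTROL, TREATED).items():
        print(f"{gene}\t{lfc:+.2f}")
```

A fold change alone says nothing about statistical significance, which is why replicate-aware packages such as those named above are used in practice.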
Full genome structural and functional annotation
- The Rapid Annotation using Subsystem Technology (RAST) server, and the Metagenome (MG)-RAST server for identifying species in mixed samples, such as 'contaminating' bacterial sequences.
- The Gene Locator and Interpolated Markov ModelER (GLIMMER), which typically finds 98–99% of all protein-coding genes in microbial genomes.
- AUGUSTUS, a gene prediction tool for eukaryotes that allows user-defined constraints.
Genome and transcriptome de novo assembly
- The Assembly By Short Sequences (ABySS) tool for both transcriptomic and genomic assemblies.
- The ALLPATHS-LG (Large Genomes) short-read assembler.
- The Mimicking Intelligent Read Assembly (MIRA) tool is a multi-pass DNA sequence data assembler/mapper for whole genome and EST projects. MIRA assembles reads from Sanger, 454, Solexa (Illumina), and IonTorrent sequencing technologies and, more recently, PacBio.
- The Contig Assembly Program 3 (CAP3) for transcriptome assembly uses long reads (e.g. 454 and Sanger). This improved third version of CAP also addresses assembly errors due to repeats.
- The Phragment Assembly Program (PHRAP) for genome assembly uses long reads (e.g. 454 and Sanger). It is computationally tuned to improve assembly accuracy in the presence of low-quality sequence and repeats, and it can handle very large datasets.
- Short Oligonucleotide Analysis Package (SOAPdenovo) is a novel short-read assembly method that can build a de novo draft assembly for human-sized genomes. The program is specially designed to assemble Illumina GA short reads.
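When the same reads are run through more than one of the assemblers above, the results are often compared by contig length statistics. The list above does not name a metric, so the N50 used here is an assumption on our part: it is the contig length at which contigs of that length or longer cover at least half of the assembly. A minimal Python sketch with made-up contig lengths:

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    Contigs are visited from longest to shortest; the N50 is the length of the
    contig at which the running total first reaches half of the assembly size.
    """
    if not contig_lengths:
        raise ValueError("no contigs given")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Hypothetical contig lengths (in bp) from two assemblies of the same reads.
ASSEMBLY_A = [80, 70, 50, 40, 30, 20, 10]
ASSEMBLY_B = [500, 300, 200]

if __name__ == "__main__":
    print("assembly A N50:", n50(ASSEMBLY_A))
    print("assembly B N50:", n50(ASSEMBLY_B))
```

A higher N50 indicates longer contigs but not necessarily a more correct assembly, so it is best read alongside other checks such as alignment to a known reference.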
Post Assembly Analysis
- The GapCloser program is designed to close the gaps that arise during the scaffolding process utilizing paired-end short read data.
- The Cluster Database at High Identity with Tolerance (CD-HIT) program is a database reduction program that removes repetitive or redundant sequences from the input FASTA file, producing a set of ‘non-redundant’ sequences.
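The redundancy removal described above can be sketched as a greedy clustering pass: sequences are visited from longest to shortest, and a sequence is kept as a new representative only if it is not too similar to any representative already kept. The Python below is an illustration of that idea only. It uses `difflib.SequenceMatcher` as a stand-in similarity score, which is not CD-HIT's identity definition, and it omits the word-based filtering that makes CD-HIT fast. The function names, threshold, and sequences are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in similarity score in [0, 1]; not CD-HIT's identity definition."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def remove_redundant(records, threshold=0.9):
    """Greedy pass over (name, sequence) records, longest sequence first.

    A record becomes a representative only if its similarity to every
    representative kept so far is below the threshold.
    """
    representatives = []
    for name, seq in sorted(records, key=lambda rec: len(rec[1]), reverse=True):
        if all(similarity(seq, rep_seq) < threshold for _, rep_seq in representatives):
            representatives.append((name, seq))
    return representatives


# Made-up sequences: contig2 duplicates contig1, contig3 is a truncated copy.
RECORDS = [
    ("contig1", "ATGGCGTACGTTAGCCTAGGATCCGA"),
    ("contig2", "ATGGCGTACGTTAGCCTAGGATCCGA"),
    ("contig3", "ATGGCGTACGTTAGCCTAGGATCCG"),
    ("contig4", "TTTTTCCCCCGGGGGAAAAATTTTT"),
]

if __name__ == "__main__":
    for name, seq in remove_redundant(RECORDS):
        print(f">{name}\n{seq}")
```

This all-against-representatives comparison is quadratic in the worst case, so it is only suitable for small inputs.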
Protein/Peptide Prediction & Annotation
- ESTScan is a program that detects coding regions in DNA sequences; it can work with low-quality data and correct sequencing errors that lead to frameshifts.
- The Hidden Markov Model Scan (hmmscan) tool takes a query sequence and searches it against the Pfam profile HMM library, reporting significance scores and thresholds.
Pathway and network analysis
- Cytoscape integrates biomolecular interaction networks with high-throughput expression data into a unified conceptual framework.
- The Database for Annotation, Visualization and Integrated Discovery (DAVID) is a resource for performing gene annotation, enrichment analysis, and functional annotation clustering on large gene sets to infer biological meaning.
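Enrichment analysis, as mentioned for DAVID above, asks whether an annotation term appears in a gene list more often than chance would predict. One common way to frame the question is a one-sided hypergeometric test, sketched below in Python. This is a plain textbook test under stated assumptions, and DAVID's own statistic may differ from it. The function name, parameter names, and numbers are hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, selected, overlap):
    """One-sided hypergeometric p-value, P(X >= overlap).

    population: number of genes in the background set
    annotated:  background genes that carry the annotation term
    selected:   size of the gene list being tested
    overlap:    genes in the list that carry the term
    """
    upper = min(annotated, selected)
    total = comb(population, selected)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Toy numbers: 20 background genes, 5 carry the term,
    # and 3 of the 4 genes in our list carry it.
    p = enrichment_p_value(population=20, annotated=5, selected=4, overlap=3)
    print(f"p = {p:.4f}")
```

When many terms are tested at once, the resulting p-values need a multiple-testing correction before they are interpreted.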