DarkHorse HGT Candidate Resource

A database of phylogenetically atypical microbial proteins


Download Program


Program Description: DarkHorse is an experimental program that defines phylogenetic relatedness of BLASTP hits for a set of proteins against the NCBI GenBank nr database, using a lineage probability index (LPI) score. The basic algorithm used to calculate LPI scores and its application in predicting horizontal gene transfer are described in the following publications:

Hardware requirements: A multi-processor CPU with a clock speed of at least 1.0 GHz is recommended. There must be sufficient hard disk space available to store the entire GenBank nr sequence database (currently 12 GB), plus several large MySQL tables based on this database. Total disk space required is roughly double the size of the GenBank nr database. This requirement will grow proportionally as new sequences are added to GenBank. Program performance depends primarily on MySQL database access efficiency.
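The disk-space rule of thumb above can be checked before installation. The following is a minimal sketch and is not part of the DarkHorse distribution; the variable names are illustrative, and NR_SIZE_GB is an assumption that must be set by hand to the current size of the nr database.

```shell
#!/bin/sh
# Rough disk-space estimate for a DarkHorse installation (illustrative only).
# NR_SIZE_GB is an assumption: set it to the current size of the nr database.
NR_SIZE_GB=12
REQUIRED_GB=$((NR_SIZE_GB * 2))   # "roughly double" the nr database size

# Free space, in whole GB, on the file system holding the current directory.
# df -Pk prints POSIX-format output in 1024-byte blocks; column 4 is "Available".
FREE_GB=$(df -Pk . | awk 'NR == 2 { printf "%d", $4 / 1024 / 1024 }')

echo "Estimated space required: ${REQUIRED_GB} GB"
echo "Free space available:     ${FREE_GB} GB"
if [ "$FREE_GB" -lt "$REQUIRED_GB" ]; then
    echo "Warning: free space is below the estimate" >&2
fi
```

Run the script from the directory where the database files will be stored, so that df reports on the correct file system.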

Software requirements: DarkHorse requires a Unix operating system. It has been installed, tested, and run successfully on Sun Solaris (v. 9), Apple Macintosh (OS X 10.4), and Linux (RedHat Enterprise v. 4) platforms. It cannot be run on Microsoft Windows. The following additional third-party software must be pre-installed:

  • Perl version 5.8.1 or later, with the following modules installed (http://www.cpan.org/):
        DBI
        DBD::mysql
  • MySQL version 4.1 or later (http://www.mysql.com/)
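
The prerequisites listed above can be checked from the command line. This is a hedged sketch rather than an official installation check: the function names are hypothetical, and finding the mysql client on the PATH does not prove that a MySQL server is running or that its version is 4.1 or later.

```shell
#!/bin/sh
# Report whether each DarkHorse prerequisite appears to be installed.
# Illustrative check only; not part of the DarkHorse distribution.

# Succeeds if the named Perl module can be loaded.
have_perl_module() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}

# Succeeds if the Perl interpreter is version 5.8.1 or later.
# $] holds the version as a number, e.g. 5.008001 for 5.8.1.
have_perl_581() {
    perl -e 'exit($] >= 5.008001 ? 0 : 1)' >/dev/null 2>&1
}

if have_perl_581; then
    echo "Perl >= 5.8.1: ok"
else
    echo "Perl >= 5.8.1: MISSING"
fi

for module in DBI DBD::mysql; do
    if have_perl_module "$module"; then
        echo "Perl module $module: ok"
    else
        echo "Perl module $module: MISSING"
    fi
done

# The MySQL client prints its version string with --version;
# compare the reported version against 4.1 by eye.
if command -v mysql >/dev/null 2>&1; then
    mysql --version
else
    echo "mysql client: MISSING"
fi
```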

Prior to DarkHorse program installation, a database must be created from within the MySQL program to accept DarkHorse input, e.g.

    mysql> create database darkhorse_01;

The following public database information must be downloaded, de-compressed, and locally available:

  • GenBank nr database, including NCBI BLAST-formatted, raw FASTA, and taxdb files
        ftp://ftp.ncbi.nih.gov/blast/db/nr.*
        
  • GenBank Taxonomy database files
        ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
        ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_prot.dmp.gz

Users must be able to obtain protein BLAST search data for query sequences versus the GenBank nr database, with output in the NCBI -m 8 (tab-delimited) format. Alternative BLAST engines (e.g. cluster-accelerated versions) can be used instead of the NCBI version, as long as the output ends up in the same tab-delimited format. These tab-delimited output files are used as DarkHorse program input.
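
For orientation, each line of the tab-delimited format holds 12 fields: query ID, subject ID, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, e-value, and bit score. The sketch below shows one way to sanity-check such a file before handing it to DarkHorse. The sample records are made up, the function names are hypothetical, and the e-value cutoff is an arbitrary illustration, not a DarkHorse setting.

```shell
#!/bin/sh
# Legacy NCBI BLAST can produce this format with a command along these lines
# (file names are placeholders; shown as a comment, not executed here):
#   blastall -p blastp -d nr -i queries.faa -m 8 -o queries_vs_nr.m8

# Two made-up hits in the 12-column layout, for illustration only.
SAMPLE=$(mktemp)
printf 'query1\tsubjA\t98.50\t200\t3\t0\t1\t200\t1\t200\t1e-110\t395\n' > "$SAMPLE"
printf 'query1\tsubjB\t35.20\t190\t110\t5\t4\t190\t10\t195\t2e-04\t48.1\n' >> "$SAMPLE"

# Succeeds only if every line has exactly 12 tab-separated fields.
has_twelve_columns() {
    awk -F '\t' 'NF != 12 { bad = 1 } END { exit bad + 0 }' "$1"
}

# Print query ID, subject ID, and bit score for hits with e-value <= $2.
filter_hits() {
    awk -F '\t' -v max_e="$2" \
        '($11 + 0) <= (max_e + 0) { print $1 "\t" $2 "\t" $12 }' "$1"
}

has_twelve_columns "$SAMPLE" && echo "column count ok"
filter_hits "$SAMPLE" 1e-5
```

With the sample data, only the first hit passes the 1e-5 cutoff. A file that fails the column check may have been written in a different BLAST output format.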

The stand-alone software is provided as a compressed Unix tar archive, which must be decompressed and extracted before use:

    tar -xzvf DarkHorse-1.4_rev163.tar.gz

Instructions for installation and use are provided in a README.txt file, which may be downloaded separately below.

Download Files

    Installation Instructions

    DarkHorse-1.4_rev163.tar.gz

Version History

Version 1.4 July 19, 2010
    added workaround to avoid problems caused by a bug in NCBI BLAST. This bug causes
    inconsistent, truncated ID numbers for EMBL database entries if GenBank nr sequences
    are extracted from pre-formatted BLAST files using fastacmd, instead of downloading
    unformatted FASTA files directly from NCBI (e.g. P29446.1 gets changed to P29446).

Version 1.3 December 14, 2009
    fixed incompatibility with some flavors of the Linux operating system
    (e.g. Ubuntu and CentOS) due to syntax differences in the Unix sort command.

Version 1.2 August 19, 2009
    fixed problem recognizing high-level self-id taxonomy terms from lineage string
    (e.g. kingdom, phylum, class, order)

Version 1.1 April 6, 2009
    revised program installation scripts to handle larger data sets without memory errors
    added new requirement for NCBI Taxonomy reference file gi_taxid_prot.dmp
    removed requirement for NCBI BLAST accessory program fastacmd

Version 1.0 October 24, 2008
    added error check for GenBank ID number mismatch between database and BLAST hits

Version 0.9 May 9, 2008
    fixed problem parsing extra space characters in NCBI BLAST output
    fixed problem parsing non-self matches if their number exceeds 999

Version 0.8 August 15, 2007
    initial beta distribution.


Copyright 2009 S. Podell, University of California, San Diego