DarkHorse HGT Candidate Resource

A database of phylogenetically atypical microbial proteins


Download Program


Program Description: DarkHorse is an experimental program that defines phylogenetic relatedness of BLASTP hits for a set of proteins against the NCBI Genbank nr database, using a lineage probability index (LPI) score. The basic algorithm used to calculate LPI scores and its application in predicting horizontal gene transfer are described in the following publications:

Hardware requirements: Sufficient disk space is needed to store reference protein sequences, the large MySQL relational database tables built from these sequences, and tab-delimited BLASTP sequence comparisons for query sequences. Total disk space required is roughly triple the size of the decompressed protein reference database. Program performance depends primarily on MySQL database access efficiency. A multi-processor CPU with at least 64 GB of RAM is recommended.
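
The disk space rule of thumb above can be sketched as a small shell function that triples the size of a decompressed reference file. The function name estimate_required_bytes is a hypothetical helper for illustration, not part of DarkHorse, and the result is only as precise as the "roughly triple" estimate itself.

```shell
# Hypothetical helper (not part of DarkHorse): estimate total disk space
# as three times the size of a decompressed reference fasta file.
estimate_required_bytes() {
    # wc -c reports the file size in bytes on both Linux and Mac OS X
    size_bytes=$(wc -c < "$1" | tr -d ' ')
    echo $((size_bytes * 3))
}
```

For example, calling estimate_required_bytes on the decompressed nr file prints a byte count that can be compared against the free space reported by df for the target volume.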

Software requirements: DarkHorse is a command-line-only program and requires a Unix-like operating system. It has been installed, tested, and run on multiple versions of Linux and Macintosh OS X. It cannot be run on Microsoft Windows. The following additional third-party software must be pre-installed:

  • Perl version 5.8.1 or later, with the following modules installed (http://www.cpan.org/):
        DBI
        DBD::mysql
  • MySQL Relational Database software version 5.5 or later (or MariaDB equivalent)
    (https://dev.mysql.com/downloads/mysql/ or https://downloads.mariadb.org)
  • Diamond software for performing protein sequence similarity searches:
    (https://github.com/bbuchfink/diamond)
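
One way to confirm the requirements listed above before installing is a short shell check like the sketch below. It is not part of DarkHorse. The function names are hypothetical, and it assumes the database client and Diamond are on the PATH under the names mysql and diamond, which may differ on some systems (for example, a MariaDB client installed as mariadb). It does not check the MySQL server version.

```shell
# Hypothetical pre-installation check (not part of DarkHorse).
# Reports which of the third-party requirements appear to be missing.

have_cmd() {
    # Succeeds if the named program is found on PATH
    command -v "$1" >/dev/null 2>&1
}

have_perl_module() {
    # Succeeds if Perl can load the named module
    perl -M"$1" -e 1 >/dev/null 2>&1
}

check_prereqs() {
    missing=0
    # Program names are assumptions; adjust for the local installation
    for cmd in perl mysql diamond; do
        if ! have_cmd "$cmd"; then
            echo "missing program: $cmd"
            missing=1
        fi
    done
    if have_cmd perl; then
        # 5.008001 is Perl's numeric form of version 5.8.1
        if ! perl -e 'use 5.008001;' >/dev/null 2>&1; then
            echo "perl is older than 5.8.1"
            missing=1
        fi
        for mod in DBI DBD::mysql; do
            if ! have_perl_module "$mod"; then
                echo "missing perl module: $mod"
                missing=1
            fi
        done
    fi
    return "$missing"
}
```

Running check_prereqs prints one line per missing item and returns a non-zero status if anything was not found.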

Prior to DarkHorse program installation, a database must be created from within the MySQL program to accept DarkHorse input, e.g.

     mysql -u username -p -e "create database db_name;"

The following public database information must be downloaded, de-compressed, and locally available:

  • Genbank nr database raw fasta file
         wget ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz
         gunzip nr.gz
  • Genbank Taxonomy database
        wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
        tar -xzvf taxdump.tar.gz
  • Protein accession-tax_id index file
        wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz
        gunzip prot.accession2taxid.gz
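
After the downloads above finish, a quick check that the decompressed files are present and non-empty can catch interrupted transfers early. The sketch below is not part of DarkHorse. It assumes all files were placed in a single directory, and the choice of names.dmp and nodes.dmp (two of the files extracted from taxdump.tar.gz) as the taxonomy files to check is an assumption, so consult the README for the files DarkHorse actually reads.

```shell
# Hypothetical helper (not part of DarkHorse): confirm that the decompressed
# reference files exist and are non-empty in a given directory.
check_reference_files() {
    dir=$1
    status=0
    # names.dmp and nodes.dmp are extracted from taxdump.tar.gz
    for f in nr names.dmp nodes.dmp prot.accession2taxid; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f"
            status=1
        fi
    done
    return "$status"
}
```

Calling check_reference_files with the download directory as its argument prints one line per problem file and returns a non-zero status if any file is missing or empty.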

The stand-alone software is provided as a compressed Unix tar archive, which must be decompressed and extracted before use:

    tar -xzvf DarkHorse-2.0_revXX.tar.gz
    chmod -R 755 DarkHorse-2.0_revXX

Instructions for installation and use are provided in a README.txt file, which may be downloaded separately below.

Download Current Version

    Installation Instructions

    Latest release on GitHub


Legacy Software

    DarkHorse-1.5_rev170.tar.gz


Version History

Version 2.0 January 4, 2017
    Removed reliance on NCBI Genbank gi numbers. Provides a faster, more efficient
    database installation process, and allows the use of custom reference data sets,
    including private and/or unpublished sequences. The installation software now
    furnishes users with a verified, database-matched set of informative reference
    sequences for use in subsequent BLAST searches, along with optional tools
    for subdividing these sequences into smaller, taxonomically focused subsets.

Version 1.5 October 2, 2013
    fixed bug causing LPI calculation errors for organisms whose NCBI taxonomy lineage string
    contains two identical terms, for example Actinobacteria, which is both a phylum name
    and a class name in the following example:
    Bacteria;Actinobacteria;Actinobacteria;Actinobacteridae;Bifidobacteriales;Bifidobacteriaceae;Gardnerella

Version 1.4 July 19, 2010
    added workaround to avoid problems caused by bug in NCBI BLAST. This bug causes
    inconsistent, truncated ID numbers for EMBL database entries if Genbank nr sequences
    are extracted from pre-formatted BLAST files using fastacmd, instead of downloading
    unformatted fasta files directly from NCBI (e.g. P29446.1 gets changed to P29446)

Version 1.3 December 14, 2009
    fixed incompatibility with some flavors of Linux operating system
    (e.g. Ubuntu and CentOS) due to syntax differences in unix sort command.

Version 1.2 August 19, 2009
    fixed problem recognizing high-level self-id taxonomy terms from lineage string
    (e.g. kingdom, phylum, class, order)

Version 1.1 April 6, 2009
    revised program installation scripts to handle larger data sets without memory errors
    added new requirement for NCBI Taxonomy reference file gi_taxid_prot.dmp
    removed requirement for NCBI blast accessory program fastacmd

Version 1.0 October 24, 2008
    added error check for Genbank ID number mismatch between database and blast hits

Version 0.9 May 9, 2008
    fixed problem parsing extra space characters in NCBI blast output
    fixed problem parsing non-self matches if number exceeds 999

Version 0.8 August 15, 2007
    initial beta distribution.


Copyright 2009 S. Podell, University of California, San Diego