Select recipient organisms to include in the search.
The "Simple Search" selection area can be sorted alphabetically by species name or
lineage to assist in finding desired genomes. Alternatively, whole taxonomic
groups can be selected in the "Advanced Search" section. In addition
to pre-selected groups, organisms whose lineages contain one or more specific
terms can be selected by pasting the desired terms into the search text box. This feature
can be used to choose sets of related or unrelated genomes.
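As an illustration of lineage-term selection, the following minimal Python sketch picks genomes whose lineage string contains any (or, optionally, all) of a list of terms. The function name, the `match_all` option, and the abbreviated lineage strings are assumptions made for this example; they do not describe how the DarkHorse server implements its search box.

```python
def select_by_lineage(genomes, terms, match_all=False):
    """Return names of genomes whose lineage contains the given terms.

    genomes: dict mapping organism name -> lineage string (illustrative data).
    terms: list of search terms, compared case-insensitively.
    match_all: hypothetical switch; False keeps a genome if ANY term matches.
    """
    wanted = [t.lower() for t in terms]
    combine = all if match_all else any
    return [
        name
        for name, lineage in genomes.items()
        if combine(t in lineage.lower() for t in wanted)
    ]


# Abbreviated, illustrative lineage strings (not copied from DarkHorse output).
genomes = {
    "Burkholderia cenocepacia AU 1054": "Bacteria; Proteobacteria; Burkholderia",
    "Thermotoga lettingae": "Bacteria; Thermotogae; Thermotoga",
    "Thermoplasma acidophilum": "Archaea; Euryarchaeota; Thermoplasma",
}

# Two unrelated groups selected with a single list of terms.
selected = select_by_lineage(genomes, ["Thermotogae", "Euryarchaeota"])
print(selected)
```

Matching on any term is what allows a single query to combine unrelated groups; requiring all terms narrows the selection to one lineage.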
Select LPI score range.
Phylogenetically atypical proteins, which are the most likely
potential candidates for horizontal gene transfer, can be detected by
selecting only those matches with particularly low LPI scores
(e.g. MAX < 0.6). Conversely, to find proteins that would be
phylogenetically UNLIKELY as horizontal transfer candidates, a higher LPI
score range can be selected (e.g. MIN > 0.75). LPI scores reflect phylogenetic
distance of the match sequence from the query organism.
Match organisms at similar phylogenetic distances receive
similar LPI scores, regardless of their database abundance.
This feature is helpful in compensating for database bias in
number of sequences associated with different taxonomic
groups.
As a guide to LPI score selection, LPI score frequencies for
955 microbial genomes are shown below, binned in increments of 0.05 score units.
Proteins with LPI scores
below 0.6 typically have no database matches closer than the phylum or
class level, indicating strong phylogenetic discordance. LPI scores
greater than 0.75 indicate that database matches exist in the
same phylogenetic family, suggesting horizontal gene transfer is
unlikely to be detectable by phylogenetic methods. Intermediate scores
are typically borderline cases, which may be difficult to interpret. In some cases,
available data may be insufficient to resolve whether or not HGT has actually occurred.
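The score ranges described above can be expressed as a simple three-way classification. This is only a sketch: the function name and category labels are invented for illustration, and the default cutoffs of 0.6 and 0.75 are the example values from this page, kept as parameters so they can be adjusted for an individual genome.

```python
def classify_lpi(score, low_cutoff=0.6, high_cutoff=0.75):
    """Label an LPI score using the example cutoffs from this guide.

    The labels are illustrative names, not DarkHorse output values.
    """
    if score < low_cutoff:
        return "atypical"      # possible HGT candidate
    if score > high_cutoff:
        return "typical"       # HGT unlikely to be detectable
    return "borderline"        # intermediate score, hard to interpret


for s in (0.45, 0.68, 0.92):
    print(s, classify_lpi(s))
```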
Optimal cutoff points may vary for individual genomes, depending on
branch lengths of the phylogenetic trees underlying their lineage
descriptions, as well as phylogenetic distance between available
sequenced genomes. Users may wish to adjust LPI score cutoffs after viewing the genome-specific
histogram presented on the DarkHorse results summary page for their organism of interest.
Genome-specific LPI histograms usually contain several obvious break points in the
distribution, as shown in the following examples (binned in score increments of 0.02 units).
LPI Score Histogram for Thermotoga lettingae
LPI Score Histogram for Thermoplasma acidophilum
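Users who download raw results may want to rebuild a comparable histogram locally in order to look for break points. The sketch below bins a list of LPI scores in increments of 0.02; it assumes scores lie between 0 and 1, and the function name is hypothetical.

```python
import math
from collections import Counter


def lpi_histogram(scores, bin_width=0.02):
    """Count scores per bin; each key is the lower edge of a bin.

    Assumes scores fall in the range 0 to 1 (an assumption of this sketch).
    """
    n_bins = round(1.0 / bin_width)
    counts = Counter()
    for s in scores:
        # Small epsilon guards against floating-point error at bin edges;
        # clamping puts a score of exactly 1.0 into the top bin.
        idx = min(math.floor(s / bin_width + 1e-9), n_bins - 1)
        counts[idx] += 1
    return {round(i * bin_width, 4): counts[i] for i in sorted(counts)}


hist = lpi_histogram([0.31, 0.315, 0.59, 0.81, 1.0])
for lower_edge, n in hist.items():
    print(f"{lower_edge:.2f}  {'#' * n}")
```

Runs of empty or sparsely populated bins between well-populated ones are the kind of break point that can guide cutoff selection.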
Select phylogenetic granularity.
Phylogenetic granularity describes the breadth of "self"
sequences excluded as possible matches during a DarkHorse search. HGT
events of different ages can be targeted by choosing different levels.
Relative age of
potential HGT events can be explored for a particular
protein of interest by comparing its LPI scores at different
phylogenetic granularities and looking for a point where the score
changes from high to low.
Strain level is the narrowest possible granularity choice.
This setting is most useful for identifying relatively recent HGT
events, because the low LPI proteins it finds are unique to a
particular strain, not present in other strains of the same species.
Using Burkholderia cenocepacia AU 1054 as an example, strain level
granularity excludes all matches to database entries specifically
labeled as belonging to strain AU1054, while still permitting matches
to organisms like Burkholderia cenocepacia PC184 and Burkholderia
cenocepacia MC0-3. Matches to these other strains of Burkholderia
cenocepacia will receive high LPI values, but matches unique to strain
AU1054 will receive lower LPI values.
Species level granularity widens the list of excluded
organisms. It can identify older HGT events by finding low LPI
proteins unique to a particular species, absent from all other
sequenced species of the same genus. In the previous example, this
level of granularity would exclude matches to all strains of
Burkholderia cenocepacia, but still allow matches to Burkholderia
dolosa and Burkholderia xenovorans.
Genus level granularity widens the exclusion list still
further, excluding matches from all species within the same genus. In
this case, no matches would be allowed to any members of the
Burkholderia genus. Low LPI sequences found in only one genus, absent
from all other known members of the same family, may represent more
ancient HGT events.
Users should keep in mind that predicting the age of HGT
events requires some caution, because phylogenetically atypical
proteins with low LPI scores can occur for two different reasons:
either gene gain by the query organism, or gene loss from its closest
known relatives. In cases where the query organism is the only
sequenced example at a particular taxonomic level, it may not be
possible to determine the age of gene acquisition until more data on related
organisms become available.
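The comparison across granularities described in this section can be sketched as follows. Given LPI scores for one protein obtained from separate searches at strain, species, and genus level, the function reports the narrowest level at which the score falls below a cutoff. The function name, the dictionary layout, and the 0.6 cutoff are assumptions for illustration; the caution above about gene loss applies equally to any result it returns.

```python
# Ordered from narrowest to broadest exclusion of "self" sequences.
GRANULARITIES = ["strain", "species", "genus"]


def first_low_granularity(scores_by_level, low_cutoff=0.6):
    """Return the narrowest granularity with an LPI score below the cutoff.

    scores_by_level: dict such as {"strain": 0.92, "species": 0.88, ...}.
    Returns None if the score never drops below the cutoff.
    """
    for level in GRANULARITIES:
        score = scores_by_level.get(level)
        if score is not None and score < low_cutoff:
            return level
    return None


# Hypothetical scores: the protein has close matches in other strains and
# other species of its genus, but none outside the genus.
protein_scores = {"strain": 0.92, "species": 0.88, "genus": 0.41}
print(first_low_granularity(protein_scores))
```

A drop that first appears at strain level points to a relatively recent candidate event, while a drop that appears only at genus level is consistent with an older one.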
Add additional filters (optional).
Select protein functions or families by one or more annotation keywords
(e.g. "transporter").
Select protein functions or families by sequence similarity to one or
more reference proteins by amino acid BLAST search.
Select only those matches where potential donor sequences are
phylogenetically typical within their own taxa. This reduces sensitivity
but provides more stringent selection against potential false positives.
Excluding matches with reciprocal LPI scores < 0.75 typically eliminates
HGT candidates that cannot be easily corroborated using phylogenetic
trees.
Select unusual GC content in coding sequence DNA by using z-score statistics,
or by absolute percent GC. A z-score of 1.0 means that
GC content was either higher or lower than the mean for all coding sequences in
the genome by 1.0 standard deviations.
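To make the z-score filter concrete, the sketch below computes percent GC for coding sequences and flags those whose GC content differs from the genome-wide mean by at least a chosen number of standard deviations. It uses the sample standard deviation from Python's `statistics` module; whether DarkHorse uses the sample or population form is not stated here, so that choice, along with the function names and locus ids, is an assumption.

```python
from statistics import mean, stdev


def gc_percent(seq):
    """Percent G+C among unambiguous bases (A, C, G, T) of a DNA string."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def flag_unusual_gc(gc_by_locus, z_cutoff=1.0):
    """Return locus ids whose |z-score| is at least z_cutoff.

    gc_by_locus: dict mapping locus id -> percent GC of that coding sequence.
    """
    values = list(gc_by_locus.values())
    if len(values) < 2:
        return []              # standard deviation is undefined
    mu = mean(values)
    sigma = stdev(values)      # sample standard deviation (assumption)
    if sigma == 0:
        return []              # all sequences have identical GC content
    return [
        locus
        for locus, gc in gc_by_locus.items()
        if abs((gc - mu) / sigma) >= z_cutoff
    ]


# Hypothetical locus ids and GC values.
gc_values = {"cds1": 40.0, "cds2": 50.0, "cds3": 60.0}
print(flag_unusual_gc(gc_values, z_cutoff=0.9))
```

Taking the absolute value treats unusually high and unusually low GC content alike, in line with the definition given above.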
Select output format.
Coding sequence location coordinates can indicate whether horizontal
transfer candidates are adjacent to or distant from each other, or located
on the same chromosome, scaffold, or plasmid. To facilitate cross-referencing
between protein and nucleic acid sequences, this option includes corresponding nucleic
acid scaffold id numbers and coding sequence locus ids.
DNA composition statistics include percent GC for each individual protein coding
sequence, as well as mean and standard deviation for percent GC of all
coding sequence regions in the parent genome. Percent GC is
sometimes used as a simple marker of foreign DNA within a
genome, although there may be a wide disparity between
individual genes due to other factors.
BLAST match statistics include alignment length,
percent identity, e-value, and bitscore, as well as percent of the protein covered
by the alignment. They also include the number of database matches included as potential
orthologs during DarkHorse analysis. A high number of database matches
indicates a protein that is conserved and well represented in the database.
Unusual or rapidly evolving proteins have fewer database
matches. The "best" match sequence selected by the DarkHorse algorithm represents
the closest database relative of potential donor organisms, whether by vertical or
horizontal transmission.
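One use of the location coordinates is to check whether candidates cluster together on the same scaffold. The following sketch groups candidate coding sequences that lie within a chosen distance of each other. The field names (`scaffold`, `start`, `end`), the ids, and the 5000 bp default gap are all hypothetical and would need to be mapped onto the columns of an actual results file.

```python
from collections import defaultdict


def cluster_by_location(cds_list, max_gap=5000):
    """Group coding sequences lying within max_gap bp on the same scaffold.

    cds_list: list of dicts with hypothetical keys
              "id", "scaffold", "start", "end".
    Returns a list of clusters, each a list of ids in coordinate order.
    """
    by_scaffold = defaultdict(list)
    for cds in cds_list:
        by_scaffold[cds["scaffold"]].append(cds)

    clusters = []
    for items in by_scaffold.values():
        items = sorted(items, key=lambda c: c["start"])
        current = [items[0]["id"]]
        cluster_end = items[0]["end"]
        for cds in items[1:]:
            if cds["start"] - cluster_end <= max_gap:
                current.append(cds["id"])
            else:
                clusters.append(current)
                current = [cds["id"]]
            # Track the furthest end seen so overlapping genes are handled.
            cluster_end = max(cluster_end, cds["end"]) if len(current) > 1 else cds["end"]
        clusters.append(current)
    return clusters


candidates = [
    {"id": "a", "scaffold": "s1", "start": 100, "end": 900},
    {"id": "b", "scaffold": "s1", "start": 1500, "end": 2400},
    {"id": "c", "scaffold": "s1", "start": 90000, "end": 91000},
    {"id": "d", "scaffold": "p1", "start": 10, "end": 500},
]
print(cluster_by_location(candidates))
```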
View search results.
For each individual genome, two types of output are available: a genome summary
page and a tab-delimited file of raw, unfiltered results. The tab-delimited file
can be downloaded and imported into a spreadsheet program such as Microsoft
Excel. The summary page, in html format,
includes a histogram of genome-wide LPI scores, a tally of match numbers by species,
and statistics on total number of matched and unmatched proteins. It also includes
phylogenetic lineage of the genome, as defined in the NCBI
taxonomy database, and search-specific keywords and/or NCBI taxonomy ID numbers
used by the DarkHorse program to eliminate self-matches to the query genome.
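As an alternative to a spreadsheet program, a downloaded tab-delimited file can be read with a short script. The sketch below uses Python's standard `csv` module on a small in-memory sample. The column names `query_id` and `lpi_score` are placeholders invented for this example; the actual header names should be taken from the first line of the downloaded file.

```python
import csv
import io


def read_tab_delimited(handle):
    """Read a tab-delimited file with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


# In-memory stand-in for a downloaded file; column names are hypothetical.
sample = io.StringIO(
    "query_id\tlpi_score\n"
    "protA\t0.42\n"
    "protB\t0.91\n"
)
rows = read_tab_delimited(sample)

# Because the raw file is unfiltered, any cutoff is applied after loading.
low = [r["query_id"] for r in rows if float(r["lpi_score"]) < 0.6]
print(low)
```

For a file on disk, `open(path, newline="")` can be passed in place of the `io.StringIO` object.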
Advanced results include composite data for all genomes selected,
filtered according to parameters specified by the user. These results may be
downloaded as a tab-delimited file, or viewed online in html format. Default result
fields include information on both query and match sequences.
Advanced results can be used to look for "reciprocal" LPI score relationships
(low LPI query coupled with a high LPI match partner).
Although LPI scores have only been calculated for matches belonging to publicly
available, sequenced genomes, if a reciprocal LPI relationship can be detected
it will show whether a potential
donor sequence is typical or atypical within its own taxon. This information
is useful in predicting whether or not sufficient data will be available to build a
full-scale phylogenetic tree supporting horizontal gene
transfer. Based on the LPI score distributions shown above, a combination of query
LPI score less than 0.6 with a match partner (potential donor) LPI score greater
than 0.75 is a reasonable starting place to select the best-supported HGT candidates.
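That starting filter can be sketched as below. The record keys `query_lpi` and `match_lpi` and the function name are hypothetical, and the cutoffs are the example values from this page. Matches without a calculated LPI score (represented here as `None`) are skipped, since no reciprocal relationship can be assessed for them.

```python
def best_supported_candidates(records, query_max=0.6, donor_min=0.75):
    """Keep records with a low query LPI and a high match-partner LPI.

    records: list of dicts with hypothetical keys "id", "query_lpi",
             and "match_lpi" (None when no score is available).
    """
    return [
        r["id"]
        for r in records
        if r["match_lpi"] is not None
        and r["query_lpi"] < query_max
        and r["match_lpi"] > donor_min
    ]


# Hypothetical records for illustration only.
records = [
    {"id": "p1", "query_lpi": 0.41, "match_lpi": 0.93},  # reciprocal pattern
    {"id": "p2", "query_lpi": 0.38, "match_lpi": 0.52},  # donor also atypical
    {"id": "p3", "query_lpi": 0.88, "match_lpi": 0.95},  # query is typical
    {"id": "p4", "query_lpi": 0.35, "match_lpi": None},  # no donor score
]
print(best_supported_candidates(records))
```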