Tutorial

Quick description

A more complete description of the method can be found in the Introduction.

OrphHCA is designed to detect conserved hydrophobic segments (called HCA-segments) in a multiple sequence alignment (MSA).

The input of OrphHCA is an MSA in FASTA format. OrphHCA is distributed as two scripts: the main script, orphHCA, and a utility script, filterOrphHCA.

The orphHCA script performs the external domain annotation and the HCA-segment search. It then selects the segments that correspond to domains, based on their overlaps and their conservation in the MSA. Finally, the script produces a flat file with the domain positions and an HMM database file built with hmmbuild.

The filterOrphHCA script can be used to compare the created hidden Markov models (HMMs) with models from other databases. The script uses the hhsearch tool to perform the comparison.

Getting started

First, you will need to install OrphHCA. Complete installation instructions can be found in Installation.

Warning

Because OrphHCA builds the amino-acid sequences from the sequences of the MSA, the non-amino-acid characters "*", "!", ".", "?" and "-" are removed. All other characters in the sequences are kept.
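The cleaning rule described in this warning can be illustrated with a minimal Python sketch. It is not taken from the OrphHCA source; the function name strip_alignment_chars is hypothetical and only mirrors the behaviour stated above.

```python
# Characters that the warning above lists as removed from aligned sequences.
NON_AMINO_ACID_CHARS = "*!.?-"

def strip_alignment_chars(aligned_seq):
    """Return the sequence without the listed characters (hypothetical helper).

    Every other character, including ambiguous residues such as X, is kept.
    """
    return aligned_seq.translate(str.maketrans("", "", NON_AMINO_ACID_CHARS))

print(strip_alignment_chars("MK--TAY*IA.X?"))  # prints MKTAYIAX
```

Any character outside this list (for example X) stays in the sequence, so it is worth checking the MSA for unexpected symbols before running orphHCA.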

Example file

The example file used to run orphHCA can be found in the examples directory of the git repository.

Running orphHCA

To run orphHCA with the default settings:

$ orphHCA -i examples/EOG7CPB12.fasta -o examples/EOG7CPB12 -w examples/EOG7CPB12/ -v --keep-fas

Two files are created: examples/EOG7CPB12.out and examples/EOG7CPB12.hmm.

The first file, examples/EOG7CPB12.out, contains the list of domains found in each protein. The format of the file follows the xdom syntax.

>FBgn0179134_Dsec_1 772
13 61 orph_0 Nan # 12 65
203 258 orph_1 Nan # 202 258
288 328 orph_2 Nan # 287 328
395 772 orph_3 Nan # 388 772
>FBgn0241472_Dyak_1 780
13 61 orph_0 Nan # 12 65
...

Each protein entry starts with a fasta header corresponding to the name of the protein sequence, for example FBgn0179134_Dsec_1, followed by a space character and the length of the protein sequence, here 772 for the protein FBgn0179134_Dsec_1.

The lines following the fasta header correspond to domain positions. The line 13 61 orph_0 Nan # 12 65 is made of four required columns, 13 61 orph_0 Nan, followed by two commented columns, 12 65. The numbers 13 61 in the required columns are the start and stop positions of the domain; the positions are inclusive and the first amino acid of the sequence is at position 1. The name orph_0 is the domain name and can be shared between proteins. The Nan value fills the e-value field of the xdom format and should be ignored, as no e-values are computed. The two commented columns, 12 65, correspond to the full length of the domain.

The final positions, 13 61, are computed from the conservation of the domain positions between the sequences; the original HCA-domain annotation of a protein sequence can be longer, 12 65 in this example. By way of comparison, the positions 13 61 can be seen as the alignment positions (ali columns) of an annotation produced by hmmscan, and the columns 12 65 as the envelope of the domain (env columns in hmmscan results).
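The file layout described above can be read with a short Python sketch. It assumes only what this section states (a header with name and length, four required columns, and two values after the # character); parse_xdom and the Domain tuple are hypothetical names, not part of OrphHCA.

```python
from collections import namedtuple

# start/stop: final positions; full_start/full_stop: commented columns after '#'.
Domain = namedtuple("Domain", "start stop name evalue full_start full_stop")

def parse_xdom(lines):
    """Parse xdom-style lines into {protein_name: (length, [Domain, ...])}."""
    proteins = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header: protein name, a space, then the sequence length.
            current, length = line[1:].rsplit(None, 1)
            proteins[current] = (int(length), [])
            continue
        required, _, comment = line.partition("#")
        start, stop, name, evalue = required.split()
        full = comment.split()
        if len(full) == 2:
            full_start, full_stop = int(full[0]), int(full[1])
        else:
            full_start = full_stop = None
        proteins[current][1].append(
            Domain(int(start), int(stop), name, evalue, full_start, full_stop)
        )
    return proteins

# First lines of the example output shown above.
example = """>FBgn0179134_Dsec_1 772
13 61 orph_0 Nan # 12 65
203 258 orph_1 Nan # 202 258
>FBgn0241472_Dyak_1 780
13 61 orph_0 Nan # 12 65
""".splitlines()

parsed = parse_xdom(example)
length, domains = parsed["FBgn0179134_Dsec_1"]
print(length, domains[0].start, domains[0].stop, domains[0].full_start)
```

The e-value column is kept as a string because it only ever contains Nan in this output.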

The second file, examples/EOG7CPB12.hmm, is an HMM file generated by hmmbuild. All the domain models are concatenated in this file.
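To check which models ended up in the concatenated file, the model names can be listed with a few lines of Python. This sketch assumes the file is in the HMMER3 text format, where each model carries a NAME line and ends with a // line; list_hmm_model_names is a hypothetical helper, and the mock lines below are a hand-written stand-in, not real orphHCA output.

```python
def list_hmm_model_names(lines):
    """Return the value of every NAME line found in an HMMER3 text file."""
    names = []
    for line in lines:
        if line.startswith("NAME"):
            # A NAME line holds the keyword, whitespace, then the model name.
            names.append(line.split(None, 1)[1].strip())
    return names

# Hand-written stand-in for two concatenated models (illustration only).
mock_hmm_lines = [
    "HMMER3/f",
    "NAME  model_a",
    "LENG  49",
    "//",
    "HMMER3/f",
    "NAME  model_b",
    "LENG  56",
    "//",
]
print(list_hmm_model_names(mock_hmm_lines))  # prints ['model_a', 'model_b']
```

On a real file, the same function can be called on an open file handle, for example the one returned by open("examples/EOG7CPB12.hmm").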

Running filterOrphHCA

To run filterOrphHCA on the files produced in the previous step:

$ filterOrphHCA -f examples/EOG7CPB12/kept_fasta/ -i examples/EOG7CPB12.hmm -w examples/filtering_EOG7CPB12/ -d pfamA_v27.0_22Oct13.hhm -c 50 -v -o examples/EOG7CPB12.filtered.dat

The program takes as input the directory of fasta files corresponding to the previously created HMMs (examples/EOG7CPB12/kept_fasta/), the HMM database corresponding to these fasta files (examples/EOG7CPB12.hmm), a working directory (examples/filtering_EOG7CPB12/), and an external database against which the created models are compared (pfamA_v27.0_22Oct13.hhm).

The output file, examples/EOG7CPB12.filtered.dat, is a tab-delimited flat file with four columns.

model_name_1 target_name_1 similarity database_of_the_target_1
model_name_1 target_name_2 similarity database_of_the_target_2
...
model_name_2 target_name_1 similarity database_of_the_target_1
...

All targets with a similarity score strictly above the cutoff parameter (-c 50 in the command above) are reported.
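The four-column output can be grouped per model with a short Python sketch. It assumes that the similarity column is numeric; read_filter_hits is a hypothetical helper, and the rows in the example are invented placeholders rather than real results.

```python
import csv

def read_filter_hits(lines, cutoff=None):
    """Group tab-delimited hits by model name.

    Returns {model: [(target, similarity, database), ...]}. When a cutoff is
    given, only hits strictly above it are kept.
    """
    hits = {}
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) != 4:
            continue  # skip blank or malformed lines
        model, target, similarity, database = row
        score = float(similarity)  # assumption: the similarity is a number
        if cutoff is not None and score <= cutoff:
            continue
        hits.setdefault(model, []).append((target, score, database))
    return hits

# Invented placeholder rows in the four-column layout shown above.
rows = [
    "model_name_1\ttarget_name_1\t93.5\tdatabase_a",
    "model_name_1\ttarget_name_2\t40.0\tdatabase_a",
    "model_name_2\ttarget_name_1\t75.2\tdatabase_b",
]
for model, model_hits in read_filter_hits(rows, cutoff=50).items():
    print(model, model_hits)
```

Passing a stricter cutoff than the one given to filterOrphHCA allows the reported hits to be narrowed down without rerunning hhsearch.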

Parameters of orphHCA

Required parameters

-i, --input : FILE

the MSA input file

-o, --output : FILE PREFIX

output file prefix (<output>.out: list of domains, <output>.hmm: HMM database)

-w, --workdir : DIR

working directory

Optional parameters

-d, --database

list of the domain hmm databases to use

-s, --seqdb

path to the sequence database used for enrichment

-c, --core

number of cores to use; default=1

--perc-hca

minimal percentage of sequences in the MSA that should have a domain, default=20

--nb-hca

minimal number of sequences in the MSA that should have a domain

--perc-over

minimal percentage of overlap between HCA segments for them to be considered part of the same domain, default=80

--nb-over

minimal number of overlapping amino acids between two HCA segments for them to be considered the same

--hca-size

minimal size for an HCA segment to be considered a domain, default=30

--perc-hmm

maximal percentage of overlap allowed between an HCA segment and an HMM domain, default=0

--nb-hmm

maximal number of overlapping amino acids allowed between an HCA segment and an HMM domain

--keep-fas

keep the fasta results; the fasta alignments are needed by hhsearch in the filtering program

-v, --verbose

activate verbose mode

Parameters of filterOrphHCA

Required parameters

-f, --fastadir

the directory with fasta alignments

-i, --inputfile

the hmm database corresponding to the fasta alignments

-w, --workdir

the working directory

-d, --database

the list of HMM databases against which the fasta alignments are compared

-o, --output

the output file listing the models that are similar to another model in a database

-c, --cutoff

the similarity cutoff

Optional parameters

-v, --verbose

activate verbose mode

-h, --help

show this help message and exit