# Introduction

## Protein domain

A protein domain is a conserved region of a protein sequence. Depending on the domain resource used, a domain is either defined first from structural information, followed by a search for similar sequences within the boundaries given by the structure, or defined solely from sequence conservation deduced from similarity searches.

A protein domain can occur alone on a protein sequence or can be combined with other domains to form a particular domain arrangement, i.e. a succession of identical or different domains along the protein sequence. Comparing protein domain arrangements can give deep insight into protein evolution, phylogenetic relationships between species, and protein function, among other questions.
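To make the notion of a domain arrangement concrete, here is a minimal sketch that represents an annotated protein as a list of `(name, start, end)` tuples and derives its arrangement as the ordered succession of domain names. The `arrangement` helper, the domain names, and the coordinates are illustrative assumptions, not part of OrphHCA.

```python
from typing import List, Tuple

# One annotated domain: (domain name, start, end) on the protein sequence.
# Names and coordinates below are made up for illustration.
Domain = Tuple[str, int, int]


def arrangement(domains: List[Domain]) -> Tuple[str, ...]:
    """Return the domain names ordered by start position along the sequence."""
    return tuple(name for name, _start, _end in sorted(domains, key=lambda d: d[1]))


protein_a = [("SH2", 120, 210), ("SH3", 30, 90), ("Kinase", 250, 500)]
protein_b = [("SH3", 12, 70), ("SH2", 95, 190), ("Kinase", 230, 480)]
protein_c = [("SH3", 5, 60), ("SH3", 80, 140)]

# Two proteins share an arrangement when the succession of domains is the
# same, regardless of the exact coordinates.
same_ab = arrangement(protein_a) == arrangement(protein_b)
same_ac = arrangement(protein_a) == arrangement(protein_c)
print(arrangement(protein_a), same_ab, same_ac)
```

Comparing arrangements as ordered tuples is the simplest possible choice; real comparisons often also account for insertions, losses, or repeats of domains.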

## Sequence annotation

Protein domain annotation methodologies typically use a protein domain database and scan a query proteome against all the models present in the database. The domains are represented in the database as Hidden Markov Models (HMMs). These HMMs are built from Multiple Sequence Alignments (MSAs) of protein segments that are classified as belonging to the same domain family.

One of the major difficulties lies in building the set of sequences that defines a domain family. As mentioned above, this involves searching for regions that share similarity between sequences. Families whose domains are present in a sufficiently large number of species are detected without much difficulty. However, recent domains, domains present only in a specific clade for which too few species are available, and fast-diverging protein domain families are usually missed by methods based only on sequence similarity searches.

## The HCA method

Hydrophobic Cluster Analysis (HCA) [CG1987] [IC1997] is a methodology that performs a combined physico-chemical and topological analysis of the amino acids in a protein sequence. In globular proteins, the regular secondary structures (alpha helices and beta strands) display a typical binary pattern of alternating hydrophobic and non-hydrophobic amino acids, which reflects the general tendency of hydrophobic residues to be buried inside the protein core [JH2003] [RE2007]. Representing the protein sequence on a two-dimensional support adds a further dimension to the definition of the binary pattern. Combined with a connectivity distance that separates the patterns into distinct units, this leads to the definition of constrained binary patterns, or hydrophobic clusters. The positions of hydrophobic clusters mainly correspond to those of the regular secondary structures and can be used in different ways to characterize the protein fold.
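The idea of binary patterns split by a connectivity distance can be sketched in a few lines. The code below is a simplified one-dimensional illustration, not the HCA implementation: it ignores the two-dimensional support, and the hydrophobic alphabet (`VILFMYW`), the connectivity distance of 4, and the treatment of proline as a cluster breaker are assumptions taken from common descriptions of HCA rather than from this document.

```python
from typing import List, Optional, Tuple

HYDROPHOBIC = set("VILFMYW")  # assumed hydrophobic alphabet
CONNECTIVITY = 4              # assumed connectivity distance
BREAKER = "P"                 # assumed: proline always ends a cluster


def binary_pattern(seq: str) -> str:
    """Encode a sequence as 1 (hydrophobic) / 0 (non-hydrophobic)."""
    return "".join("1" if aa in HYDROPHOBIC else "0" for aa in seq.upper())


def hydrophobic_clusters(seq: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, pattern) per cluster, 0-based inclusive coordinates.

    Two hydrophobic residues join the same cluster when they are at most
    CONNECTIVITY positions apart and no breaker lies between them.
    """
    seq = seq.upper()
    pattern = binary_pattern(seq)
    clusters: List[Tuple[int, int, str]] = []
    start: Optional[int] = None  # first hydrophobic position of the open cluster
    last: Optional[int] = None   # last hydrophobic position seen so far

    def close() -> None:
        nonlocal start, last
        if start is not None and last is not None:
            clusters.append((start, last, pattern[start:last + 1]))
        start = last = None

    for i, aa in enumerate(seq):
        if aa == BREAKER:
            close()
        elif aa in HYDROPHOBIC:
            if last is not None and i - last > CONNECTIVITY:
                close()
            if start is None:
                start = i
            last = i
    close()
    return clusters


example = "AVLIAGSKDEEFWPLAAV"  # made-up sequence
print(binary_pattern(example))
for cluster in hydrophobic_clusters(example):
    print(cluster)
```

Each cluster starts and ends on a hydrophobic residue, so its pattern string always begins and ends with `1`.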

SegHCA [FG2013] is a tool based on the HCA methodology that detects regions with a high density of hydrophobic clusters in protein sequences. These hot spots can then be used as a proxy for protein regions with a propensity to fold, i.e. protein domains. These regions are called HCA-segments.

## OrphHCA

The OrphHCA software is designed to find recent domains, fast-diverging domains, and domains in proteomes of clades for which only a few species are available. The methodology was previously tested on a set of orthologous Drosophila proteins [TBF2015], where it detected recent and fast-diverging domains.

The workflow of the methodology is presented below:

Figure: The OrphHCA workflow.

The methodology can be separated into two steps. The first step, which is mandatory, is the domain annotation. SegHCA is used to delineate HCA-segments, and optionally an annotation against other databases can be performed using hmmscan. The annotation is followed by several filtering procedures to detect the conserved HCA-segments.
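For readers unfamiliar with hmmscan, the sketch below shows roughly what a standalone annotation run against an HMM database looks like. It only builds the command line; how OrphHCA itself invokes hmmscan is not described here, and the file names, the E-value threshold, and the `hmmscan_command` helper are hypothetical. The options used (`--domtblout`, `-E`, `--cpu`) are standard HMMER options.

```python
import shlex
from typing import List


def hmmscan_command(hmm_db: str, proteome_fasta: str, domtbl_out: str,
                    evalue: float = 1e-3, cpu: int = 1) -> List[str]:
    """Build an hmmscan command line (hypothetical helper, not OrphHCA code)."""
    return [
        "hmmscan",
        "--domtblout", domtbl_out,  # per-domain tabular output file
        "-E", str(evalue),          # reporting E-value threshold
        "--cpu", str(cpu),          # number of worker threads
        hmm_db,                     # HMM database, indexed with hmmpress
        proteome_fasta,             # query protein sequences
    ]


# File names are placeholders chosen for illustration.
cmd = hmmscan_command("Pfam-A.hmm", "proteome.fasta", "proteome.domtbl")
print(shlex.join(cmd))
```

With HMMER installed and the database prepared with `hmmpress`, the list can be passed to `subprocess.run(cmd, check=True)`.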

The second step is a filtering step, during which the generated HCA-segments are compared to other databases or to each other.

## References

[CG1987] Gaboriaud C, Bissery V, Benchetrit T, Mornon JP. Hydrophobic cluster analysis: an efficient new way to compare and analyse amino acid sequences. FEBS Lett. 1987 Nov 16;224(1):149-55.
[IC1997] Callebaut I, Labesse G, Durand P, Poupon A, Canard L, Chomilier J, Henrissat B, Mornon JP. Deciphering protein sequence information through hydrophobic cluster analysis (HCA): current status and perspectives. Cell Mol Life Sci. 1997 Aug;53(8):621-45.
[JH2003] Hennetin J, Le Tuan K, Canard L, Colloc’h N, Mornon JP, Callebaut I. Non-intertwined binary patterns of hydrophobic/nonhydrophobic amino acids are considerably better markers of regular secondary structures than nonconstrained patterns. Proteins. 2003 May 1;51(2):236-44.
[RE2007] Eudes R, Le Tuan K, Delettré J, Mornon JP, Callebaut I. A generalized analysis of hydrophobic and loop clusters within globular protein sequences. BMC Struct Biol. 2007 Jan 8;7:2.
[FG2013] Faure G, Callebaut I. Comprehensive repertoire of foldable regions within whole genomes. PLoS Comput Biol. 2013 Oct;9(10):e1003280.
[TBF2015] Bitard-Feildel T, Heberlein M, Bornberg-Bauer E, Callebaut I. Detection of Orphan Domains in Drosophila using “Hydrophobic Cluster Analysis”. Biochimie. Accepted.

Read the Tutorial for a quick start on how to use OrphHCA!