Use of environmental profiles for protein structure comparison and extended sequence homology search
A new protein structure alignment procedure (SHEBA: structural homology by environment based alignment) and a new fold recognition procedure (PASSC: pair-to-pair sequence structure correlation) are described. SHEBA is a tool for finding protein structures that are similar to a known structure, and PASSC is a tool for finding structures that are similar to the unknown structure of a protein sequence. In the SHEBA procedure, an initial alignment is made by comparing primary, secondary, and tertiary structural features (environmental profiles) of the two proteins, without explicitly considering the three-dimensional geometry of the structures. The alignment is iteratively refined in a second step, in which new alignments are found by three-dimensional superposition of the structures based on the current alignment. Two new sets of scoring matrices are introduced for use in the PASSC extended homology search procedure: H2 for the protein sequence comparison and T2 for the protein sequence-structure correlation. Each element of H2 or T2 measures the frequency with which a pair of amino acid types in one protein, k residues apart in the sequence, is aligned with another pair of residues, of given amino acid types (for H2) or in given structural states (for T2), in other structurally homologous proteins. There are four types, corresponding to k-values of 1 to 4, for both H2 and T2. These matrices were set up using more than 10,000 structurally homologous protein pairs with little sequence homology between the members of each pair, which were generated by SHEBA. The two scoring matrices were incorporated into the main body of the sequence alignment program FASTA and tested in a fold recognition setting in which each of a set of 107 test sequences was aligned to each of a panel of 3539 domains that represent all known protein structures.
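As an illustration of the pair-to-pair counting that underlies H2, the following minimal Python sketch tallies, for one gapped pairwise alignment and one value of k, how often a residue pair k positions apart in the first sequence is aligned with a given residue pair in the second. The function name `pair_to_pair_counts`, the `-` gap character, and the rule that a pair is skipped when either of its residues faces a gap are assumptions made for this sketch; the abstract does not specify these details, nor the normalization that turns raw counts into scoring matrix elements.

```python
from collections import Counter


def pair_to_pair_counts(aligned_a, aligned_b, k):
    """Count residue pairs of A, k apart in A's sequence, against their aligned pairs in B.

    aligned_a, aligned_b: equal-length gapped alignment strings ('-' marks a gap).
    Returns a Counter keyed by (a_i, a_{i+k}, b_partner_of_i, b_partner_of_{i+k}).
    Hypothetical helper for illustration only, not the published procedure.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned strings must have equal length")
    if k < 1:
        raise ValueError("k must be a positive integer")

    # For every residue of A, in sequence order, record its aligned partner in B
    # (None when the residue of A faces a gap in B).
    partners = []
    for a, b in zip(aligned_a, aligned_b):
        if a == "-":
            continue  # column holds no residue of A
        partners.append((a, None if b == "-" else b))

    counts = Counter()
    for i in range(len(partners) - k):
        a1, b1 = partners[i]
        a2, b2 = partners[i + k]
        if b1 is None or b2 is None:
            continue  # assumption: both residues of the pair must be aligned
        counts[(a1, a2, b1, b2)] += 1
    return counts


if __name__ == "__main__":
    # Toy alignment; real input would be many structure-based alignments.
    for k in range(1, 5):
        print(k, dict(pair_to_pair_counts("ACD-E", "AC-FE", k)))
```

Summing such counters over many structurally homologous pairs, once for each k from 1 to 4, would give raw frequency tables of the kind described above. An analogous table for T2 could be built by replacing the residues of the second sequence with structural state labels.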
Four procedures were tested: the straight Smith-Waterman (SW) procedure; the FASTA procedure, which used the single residue-type substitution matrix H1 (BLOSUM62); PASH (pair-to-pair alignment of sequence homology), which used the H1 and H2 matrices; and PASSC, which used the H1, H2, and T2 matrices. The four procedures gave similar results when the probe and target sequences had greater than 30% sequence identity. However, both PASH and PASSC produced significantly more structurally homologous alignments than SW and FASTA when the sequence identity was below 30%. In conclusion, the procedure that ignores three-dimensional geometry altogether and considers only the sequence homology, secondary structure state, and tertiary structural environment of the residues led to structurally meaningful alignments, and the use of these alignments for protein fold recognition allowed detection of remotely related protein structures with sequences.