Protein & Peptide Letters, 2007, 14, 871-875
Digital Coding of Amino Acids Based on Hydrophobic Index
Xuan Xiao1,* and Kuo-Chen Chou2
1Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 33300, China; 2Gordon Life Science Institute, 13784 Torrey Del Mar Drive, San Diego, CA 92130, USA
Abstract: Analysis of amino acid sequences can provide useful insights into the tertiary structures of proteins and their
biological functions. One of the critical problems in amino acid analysis is how to establish a digital coding system to better reflect the properties of amino acids and their degeneracy. Based on the hydrophobic index, a one-to-one relationship
has been established between the amino acid sequence and the digital signal process. Such a “bridge” will make it possible
to apply all the existing powerful methods in the signal processing area to analysis of the amino acid sequences.
Keywords: Amino acid digital coding, hydrophobic index, sequence analysis, signal process, pseudo amino acid composition.
I. INTRODUCTION
The success of the Human Genome Project has generated a deluge of sequence information. Sequence databases, such as GenBank and EMBL, have been growing at an exponential rate [1-2]. This explosion of biological data has challenged the ability of biologists and computer scientists to analyze it quickly. In general, gene sequences are stored in computer databases as long character strings. Reading these sequences with the naked eye would proceed at a snail's pace, and it is very hard to extract key features by reading such long character strings directly. However, if the strings are converted into signals, many important features become apparent and can be studied with the existing tools of information theory [3].
Biological information can be analyzed on several levels,
such as nucleotide sequence, protein sequence, and genome
sequencing. Amino acid sequence analysis can provide important insights into the tertiary structures of proteins and
their functions.
Viewing protein synthesis as an information processing system allows amino acid sequences to be analyzed as messages without considering the physicochemical elements of the information processing [4]. Many established information processing methods can be used for the analysis of amino acid sequences. In fact, the digital signal processing approach has been used in a number of protein prediction tasks, such as prediction of subcellular location [5-6] and of structural classes, as will be described later. Digital coding of amino acids can be modeled as a communication channel with the amino acid sequence as the input and a binary (0/1) digital signal as the channel output. One of the critical problems is how to model the digital coding of amino acids so that it better reflects their properties and degeneracy.
*Address correspondence to this author at the Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 33300, China; E-mail: xiaoxuan0326@yahoo.com.cn
0929-8665/07 $50.00+.00

Many models of amino acid digital encoding have been built. Cristea (2001) proposed a representation of the genetic code that converts DNA sequences into digital signals using a base-4 representation of the nucleotides. It leads to the conversion of the codons into numbers in the range 0-63 and the conversion of the amino acids (together with the terminator) into numbers in the range 0-20 [7]. According to this model, the 20 amino acids and the terminator are coded as: F=0, L=1, S=2, Y=3, end=4, C=5, W=6, P=7, H=8, Q=9, R=10, I=11, M=12, T=13, N=14, K=15, V=16, A=17, D=18, E=19, G=20. This model better reflects amino acid structure and degeneracy, but the genetic signals built from genes on this model show low autocorrelation. Pan et al. also proposed an amino acid coding for predicting protein subcellular localization through a stochastic signal processing approach [8]. For simplicity, their model is: A=10, C=20, D=30, E=40, F=50, G=60, H=70, I=80, K=90, L=100, M=110, N=120, P=130, Q=140, R=150, S=160, T=170, V=180, W=190, Y=200. Although both procedures can encode a protein sequence as a series of digital signals, they merely distinguish the amino acids from one another and do not take their physicochemical properties into account.
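The two numeric schemes quoted above can be written down directly as lookup tables. The following is a minimal sketch, not code from [7] or [8]: the function name `encode`, the use of `*` for the terminator, and the example peptide are assumptions made for illustration, while the code values are those listed in the text.

```python
# Cristea's mapping [7]: 20 amino acids plus the terminator ("*" here) -> 0..20.
CRISTEA_ORDER = "FLSY*CWPHQRIMTNKVADEG"
CRISTEA = {aa: i for i, aa in enumerate(CRISTEA_ORDER)}

# Pan et al.'s mapping [8]: alphabetical one-letter codes -> 10, 20, ..., 200.
PAN_ORDER = "ACDEFGHIKLMNPQRSTVWY"
PAN = {aa: 10 * (i + 1) for i, aa in enumerate(PAN_ORDER)}


def encode(sequence, table):
    """Map a one-letter amino acid sequence to a list of integers."""
    return [table[aa] for aa in sequence.upper()]


example = "MKVLA"  # hypothetical peptide, for illustration only
print(encode(example, CRISTEA))  # [12, 15, 16, 1, 17]
print(encode(example, PAN))      # [110, 90, 180, 100, 10]
```

In both tables the integer assigned to a residue follows from its position in a fixed list (codon-table order or alphabetical order), which is why neither scheme carries physicochemical information.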
When Sofer predicted the secondary structure of proteins using genetic algorithms, he assigned one or two five-digit codes to each amino acid, because the rules of genetic algorithms are often encoded as binary strings [9]. Amino acids with similar properties have similar code words. A shortcoming of this model is that amino acids and digital codes are not in one-to-one correspondence: under this rule, 12 amino acids each have two possible digital codes. To overcome this shortcoming, Nikola built an encoding model based on molecular recognition theory [10].
In the current paper, a digital coding approach is proposed based on the amino acid hydrophobicity index and information theory. It not only takes into account the physicochemical properties of amino acids but also assigns each of them one, and only one, digital code.
© 2007 Bentham Science Publishers Ltd.
Protein & Peptide Letters, 2007, Vol. 14, No. 9
II. METHOD
1. Amino-Acid Index
An amino-acid index is a set of 20 numerical values representing one of the various physicochemical properties of the 20 amino acids. A total of 402 sets of amino-acid indices were collected by Tomii and Kanehisa [11]; unfortunately, none of them is universally applicable, although many of them concur with each other on the classification of a particular amino acid. They analyzed the relationships among the 402 indices by single-linkage hierarchical cluster analysis and found that the indices can be clustered into the following six groups: (A) α-helix and tight-turn [12] propensities, (B) β-strand propensity, (C) amino acid composition, (H) hydrophobicity, (P) physicochemical properties, and (O) other properties such as the frequency of left-handed helix.
Hydrophobic amino acids tend to repel the aqueous environment and therefore reside predominantly in the interior of proteins. Amino acids of this type neither ionize nor participate in the formation of H-bonds. Hydrophilic amino acids, which tend to interact with the aqueous environment, are often involved in the formation of H-bonds and are predominantly found on the exterior surfaces of proteins or in the reactive centers of enzymes. In fact, the hydrophobicity of amino acids is not only one of the major factors that influence amino acid substitution during evolution, but also reveals the periodicity of the secondary structure [13]. Using auto-correlation functions based on the profile of an amino-acid index along the primary sequence of the query protein (domain), Bu et al. [14] predicted protein structural classes and found that most of the amino-acid indices lead to considerably lower accuracy than that obtained with the Oobatake-Ooi index and the hydrophobic index of Ponnuswamy. Only nine other indices yielded predicted results similar to those of the Ponnuswamy index. These ten amino-acid indices consist of one physicochemical property, one α-helix and tight-turn propensity, one β-propensity, and seven hydrophobicity indices, indicating that the relation between amino acid hydrophobicity and protein structural class is very strong.
A formulation of the autocorrelation functions based on the hydrophobicity index of the 20 amino acids has also been used to predict membrane protein types [15], where it was reported that higher predictive accuracy could be obtained with two sets of hydrophobicity indices, the Ponnuswamy index and the Hopp index. Table 1 shows the five hydrophobicity indices of the 20 amino acids that lead to the higher overall predictive accuracy. The amino acid indices listed in Table 1 all contain decimal fractions or negative values. Therefore, they do not satisfy the information coding principle and cannot be deemed appropriate digital codes for amino acids. Nevertheless, the results did point to an intriguing approach based on the Ponnuswamy hydrophobicity index.
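To make the idea of an autocorrelation function over an index profile concrete, the following minimal sketch computes one for a short sequence. The exact formulation used in [14] and [15] is not reproduced here; the form r_n = (1/(N-n)) Σ h_i h_{i+n} is one common definition and is an assumption, as are the function names and the example sequence. The four index values are the Kyte and Doolittle entries for A, K, L and V from Table 1.

```python
# Kyte-Doolittle hydropathy values for a few residues, taken from Table 1.
# Only the residues used in the example sequence are listed.
KYTE_DOOLITTLE = {"A": 1.8, "K": -3.9, "L": 3.8, "V": 4.2}


def profile(sequence, index):
    """Replace each residue by its index value along the sequence."""
    return [index[aa] for aa in sequence]


def autocorrelation(values, lag):
    """r_lag = (1 / (N - lag)) * sum_i values[i] * values[i + lag] (assumed form)."""
    n = len(values)
    if not 0 < lag < n:
        raise ValueError("lag must satisfy 0 < lag < len(values)")
    return sum(values[i] * values[i + lag] for i in range(n - lag)) / (n - lag)


h = profile("AKLVAKLV", KYTE_DOOLITTLE)  # hypothetical sequence
features = [autocorrelation(h, lag) for lag in range(1, 4)]
print(features)
```

The profile `h` contains fractional and negative numbers, which illustrates why such index values are useful as features but cannot serve directly as digital code words.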
2. Optimal Model of Amino Acids Digital Coding
It is well known that all the proteins occurring in living organisms are composed of just 20 different chemical building blocks (amino acids). Information theory makes it possible to determine the smallest number of bits per code word that allows unambiguous identification of all amino acids. Words of four bits would contain too little information, since they can represent only 16 states. Words of six bits would be unnecessarily complicated. According to information theory, words of five bits are sufficient and are therefore the most economical method of coding. Five bits can represent at most 32 states, from which we have to select 20. According to combinatorics, this selection can be made in $C_{32}^{20}$ ways.
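The word length and the number of possible selections can be checked with a few lines of arithmetic. This is a small sketch using only the Python standard library (`math.comb` requires Python 3.8 or later); the variable names are illustrative.

```python
import math

n_amino_acids = 20

# Smallest word length (in bits) such that 2**bits >= 20.
bits = math.ceil(math.log2(n_amino_acids))
states = 2 ** bits

# Number of ways to choose which 20 of the 32 five-bit words are used.
selections = math.comb(states, n_amino_acids)

print(bits, states, selections)  # 5 32 225792840
```

Four bits give only 16 states, fewer than the 20 required, while five bits leave 12 of the 32 words unused.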
The coding principle we adopt is that the larger the Ponnuswamy hydrophobicity index of an amino acid, the greater its digital code. Thus, in ascending order of the Ponnuswamy hydrophobicity index, the 20 amino acids are arranged as follows: K, N, D, E, P, Q, R, S, T, G, A, H, W, Y, F, L, M, I, V, C. Because the Ponnuswamy hydrophobicity index of amino acid K is 5.72, we start the digital coding system at the number six. The difference in hydrophobicity index between any two adjacent amino acids in the above sequence is less than 0.45, except for K-N, H-W, Y-F, L-M, and V-C. If the difference in hydrophobicity index between two amino acids is small, the two amino acids should be placed close together in the digital coding system, and vice versa; the pairs with the larger differences are therefore separated by two code units rather than one. Based on this principle, an optimal coding system is formed as shown in Table 2.
It can be seen from Table 2 that the larger the Ponnuswamy hydrophobicity index of an amino acid, the greater its digital code. In the genetic code, only two amino acids, tryptophan and methionine, have a one codon-one amino acid (non-degenerate) mapping; of the others, nine are two-fold, one is three-fold, five are four-fold, and three are six-fold degenerate. Judging from the frequency of the amino acids in proteins, it is obvious that the genetic code presents the features of an entropic coding.
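The coding system of Table 2 can be applied to a sequence as in the following minimal sketch. The decimal codes are those of Table 2; the helper names `encode`, `to_bits` and `decode` and the example peptide are assumptions made for illustration.

```python
# Decimal codes from Table 2, listed in ascending order of the Ponnuswamy index.
TABLE2 = {
    "K": 6, "N": 8, "D": 9, "E": 10, "P": 11, "Q": 12, "R": 13, "S": 14,
    "T": 15, "G": 16, "A": 17, "H": 18, "W": 20, "Y": 21, "F": 23, "L": 24,
    "M": 26, "I": 27, "V": 28, "C": 30,
}
DECODE = {code: aa for aa, code in TABLE2.items()}


def encode(sequence):
    """Return the decimal codes of a one-letter amino acid sequence."""
    return [TABLE2[aa] for aa in sequence.upper()]


def to_bits(codes):
    """Concatenate the five-bit binary words of the given codes."""
    return "".join(format(code, "05b") for code in codes)


def decode(bits):
    """Invert to_bits: split into five-bit words and look each one up."""
    words = [bits[i:i + 5] for i in range(0, len(bits), 5)]
    return "".join(DECODE[int(word, 2)] for word in words)


peptide = "MKWC"  # hypothetical peptide
codes = encode(peptide)
signal = to_bits(codes)
print(codes)           # [26, 6, 20, 30]
print(signal)          # 11010001101010011110
print(decode(signal))  # MKWC
```

Because every amino acid has exactly one code and every code fits in five bits, the binary signal can be decoded back to the original sequence without ambiguity.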
III. APPLICATION: PREDICTION OF PROTEIN
STRUCTURAL CLASSES
Prediction of protein structural class is an important topic in protein science [20-23]. Many different methods have been proposed for this purpose. Chou et al. [24-25] demonstrated that the interaction among the components of amino acid composition is an important driving force [26] in determining the structural class of a protein during the sequence folding process, and it was observed that the success rates of the covariant discriminant algorithm in recognizing protein structural classes are significantly higher than those of other algorithms. However, in the above approaches, the sample of a protein is represented by the conventional amino acid (AA) composition. Obviously, if the AA composition is used to represent the sample of a protein, all its sequence-order effects are lost. To avoid completely losing the sequence-order information, the pseudo amino acid (PseAA) composition was introduced [27]. Since then, various kinds of PseAA composition have been proposed to improve the prediction quality of various protein attributes (see, e.g., [28-33]). Owing to its wide application, a web-server called PseAA was recently established at http://chou.med.harvard.edu/bioinf/PseAA/, by which users can generate many different types of PseAA composition as they wish. Here we shall introduce a new type of PseAA composition based on the current digital coding system, as formulated below.

Table 1. Hydrophobicity Indices of the 20 Amino Acids. One-Letter Codes are Used to Denote Amino Acids

| Amino acid | Biou et al. (a) | Kyte and Doolittle (b) | Ponnuswamy (c) | Ponnuswamy (d) | Wold (e) |
|---|---|---|---|---|---|
| A | 16 | 1.8 | 12.28 | 7.62 | 0.07 |
| C | 168 | 2.5 | 14.93 | 10.93 | 0.71 |
| D | -78 | -3.5 | 10.97 | 6.18 | 3.64 |
| E | -106 | -3.5 | 11.19 | 6.38 | 3.08 |
| F | 189 | -3.5 | 10.97 | 6.18 | 3.64 |
| G | -13 | -0.4 | 12.01 | 7.31 | 2.23 |
| H | 50 | -3.2 | 12.84 | 7.85 | 2.41 |
| I | 151 | 4.5 | 12.77 | 9.99 | -4.44 |
| K | -141 | -3.9 | 10.80 | 5.72 | 2.84 |
| L | 145 | 3.8 | 14.10 | 9.37 | -4.19 |
| M | 124 | 1.9 | 14.33 | 9.83 | -2.49 |
| N | -74 | -3.5 | 11.00 | 6.17 | 3.22 |
| P | -20 | -1.6 | 11.19 | 6.64 | -1.22 |
| Q | -73 | -3.5 | 11.28 | 6.67 | 2.18 |
| R | -70 | -4.5 | 11.49 | 6.81 | 2.88 |
| S | -70 | -0.8 | 11.26 | 6.93 | 1.96 |
| T | -38 | -0.7 | 11.65 | 7.08 | 0.92 |
| V | 123 | 4.2 | 15.07 | 10.38 | -2.69 |
| W | 145 | -0.9 | 12.95 | 8.41 | -4.75 |
| Y | 53 | -1.3 | 13.29 | 8.53 | -1.39 |

(a) Information value for accessibility; average fraction 35% [16]. (b) Hydropathy index [17]. (c) Surrounding hydrophobicity in folded form [18]. (d) Average gain in surrounding hydrophobicity [18]. (e) Principal property value [19].
Given a protein sequence, we can generate a series of
digital signals according to Table 2 and define the value of
its complexity measure factor. Complexity measure factor
has been used in predicting protein subcellular location. The
complexity of a sequence can be measured by the minimal
number of steps required for its synthesis in a certain process. The advantage by incorporating the complexity measure
factor as one of the pseudo amino acid components for a
protein is that it can more effectively reflect its overall sequence-order feature than the conventional correlation factors. Now, by following exactly the same procedure as described by Chou [27] and Xiao et al. [34], a protein P can be
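expressed by a vector or a point in a (20 + λ)D = (20 + 1)D = 21D space; i.e.,

$$P = (p_1, p_2, \ldots, p_{20}, p_{21})^{T} \qquad (1)$$

where T is the transpose operator, and

$$p_k = \begin{cases} \dfrac{f_k}{\sum_{i=1}^{20} f_i + w f_{21}}, & (1 \le k \le 20) \\[2ex] \dfrac{w f_{21}}{\sum_{i=1}^{20} f_i + w f_{21}}, & (k = 21) \end{cases} \qquad (2)$$

where $f_i$ (i = 1, 2, …, 20) are the occurrence frequencies of the 20 native amino acids in a protein, $f_{21} = C_{LS}(S)$ is the complexity measure factor that can be derived for a given protein sequence according to the procedure described in [34], and w is the weight factor. The standard vector for the subset G of class ξ is defined by

$$\bar{P}^{\xi} = \begin{bmatrix} \bar{p}_1^{\xi} \\ \bar{p}_2^{\xi} \\ \vdots \\ \bar{p}_{21}^{\xi} \end{bmatrix} \qquad (\xi = \alpha, \beta, \alpha/\beta, \alpha+\beta) \qquad (3)$$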
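Equations (1) and (2) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name, the fixed alphabetical order of the 20 components, and the example inputs are assumptions, and the complexity measure factor is passed in as a number because its computation is described in [34] and not reproduced here.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # component order is an assumption


def pseaa_vector(sequence, complexity, weight):
    """Return [p_1, ..., p_20, p_21] following Eqs. (1)-(2).

    `complexity` plays the role of f_21 and `weight` the role of w.
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    counts = Counter(sequence.upper())
    length = len(sequence)
    freqs = [counts[aa] / length for aa in AMINO_ACIDS]  # f_1 ... f_20
    denom = sum(freqs) + weight * complexity
    return [f / denom for f in freqs] + [weight * complexity / denom]


# Hypothetical peptide, complexity value and weight, for illustration only.
p = pseaa_vector("MKVLAAGK", complexity=4.0, weight=0.05)
print(len(p), sum(p))
```

By construction the 21 components share one denominator, so they sum to one; the weight w controls how much of that total is assigned to the sequence-order component p_21.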
Table 2. Digital Codes of 20 Native Amino Acids

| Character | Decimal | Binary |
|---|---|---|
| K | 6 | 00110 |
| N | 8 | 01000 |
| D | 9 | 01001 |
| E | 10 | 01010 |
| P | 11 | 01011 |
| Q | 12 | 01100 |
| R | 13 | 01101 |
| S | 14 | 01110 |
| T | 15 | 01111 |
| G | 16 | 10000 |
| A | 17 | 10001 |
| H | 18 | 10010 |
| W | 20 | 10100 |
| Y | 21 | 10101 |
| F | 23 | 10111 |
| L | 24 | 11000 |
| M | 26 | 11010 |
| I | 27 | 11011 |
| V | 28 | 11100 |
| C | 30 | 11110 |
The similarity between the standard vector $\bar{P}^{\xi}$ and the protein P is characterized by the covariant discriminant, as defined by

$$F(P, \bar{P}^{\xi}) = D^{2}(P, \bar{P}^{\xi}) + \ln(\lambda_2^{\xi} \lambda_3^{\xi} \cdots \lambda_{21}^{\xi}) \qquad (4)$$

where the first term is the squared Mahalanobis distance between P and $\bar{P}^{\xi}$, and the second term reflects the difference of covariance matrices for different subsets, in which $\lambda_i^{\xi}$ is the i-th eigenvalue of the covariance matrix $C_{\xi}$ [35]. Accordingly, the prediction rule is formulated by

$$F(P, \bar{P}^{\mu}) = \mathrm{Min}\{F(P, \bar{P}^{\alpha}), F(P, \bar{P}^{\beta}), F(P, \bar{P}^{\alpha/\beta}), F(P, \bar{P}^{\alpha+\beta})\} \qquad (5)$$

where μ can be α, β, α/β, or α+β, Min means taking the least one among those in the braces, and the superscript μ represents the structural class to which the protein P belongs. The details of the algorithm can be found in [34,35].
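The structure of the decision rule (a distance term plus a log-determinant term, minimized over classes) can be illustrated with a deliberately simplified sketch. It is not the augmented covariant discriminant algorithm of [34,35]: it assumes a diagonal covariance matrix, so that the eigenvalues reduce to per-component variances, it sums over all components rather than dropping one, and it applies a small variance floor. All names and the toy data are hypothetical.

```python
import math


def fit_class(vectors, floor=1e-6):
    """Mean and per-component variance of one class (diagonal covariance)."""
    n = len(vectors)
    dim = len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(dim)]
    var = [max(sum((v[j] - mean[j]) ** 2 for v in vectors) / n, floor)
           for j in range(dim)]
    return mean, var


def discriminant(p, mean, var):
    """Squared Mahalanobis distance plus log-determinant, diagonal case."""
    distance = sum((x - m) ** 2 / s for x, m, s in zip(p, mean, var))
    return distance + sum(math.log(s) for s in var)


def predict(p, models):
    """Return the class label whose discriminant value is smallest."""
    return min(models, key=lambda label: discriminant(p, *models[label]))


# Hypothetical two-dimensional toy data for two classes.
training = {
    "alpha": [[0.9, 0.1], [0.8, 0.2], [1.0, 0.15]],
    "beta": [[0.1, 0.9], [0.2, 0.8], [0.15, 1.0]],
}
models = {label: fit_class(vectors) for label, vectors in training.items()}
print(predict([0.85, 0.2], models))  # alpha
```

With a full covariance matrix the distance term would use the inverse of that matrix and the logarithmic term would use its eigenvalues, as in Eq. (4); the minimization over classes is the same.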
As a demonstration, let us use the same dataset studied by many previous authors. It consists of 204 proteins, of which 52 are all-α, 61 all-β, 45 α/β, and 46 α+β. Their PDB codes are given in Table 2 of Chou [26]. We used jackknife cross-validation to examine the performance of the current approach. This is because, among the independent dataset test, the sub-sampling (e.g., 5-fold sub-sampling) test, and the jackknife test, which are often used for examining the accuracy of a statistical prediction method, the jackknife test is deemed the most rigorous and objective, as analyzed in a comprehensive review [36], and it has been increasingly adopted by investigators to test the power of various prediction methods (see, e.g., [37-47]). The results thus obtained are listed in Table 3, where, to facilitate comparison, the corresponding results by the other methods are also given.
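The jackknife (leave-one-out) procedure itself is independent of the classifier. The following minimal sketch shows the loop: each sample is removed in turn, the predictor is trained on the remaining samples, and the removed sample is then classified. The nearest-centroid classifier is only a stand-in for illustration and is not the covariant discriminant algorithm; the function names and toy data are hypothetical.

```python
def nearest_centroid(train, query):
    """Stand-in classifier: label of the closest class mean (Euclidean)."""
    by_label = {}
    for vector, label in train:
        by_label.setdefault(label, []).append(vector)
    best_label, best_dist = None, float("inf")
    for label, vectors in by_label.items():
        centroid = [sum(col) / len(vectors) for col in zip(*vectors)]
        dist = sum((q - c) ** 2 for q, c in zip(query, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def jackknife_success_rate(samples, classify):
    """Leave each sample out in turn, train on the rest, and test on it."""
    correct = 0
    for i, (vector, label) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        if classify(rest, vector) == label:
            correct += 1
    return correct / len(samples)


# Hypothetical toy samples: (feature vector, class label).
samples = [
    ([0.9, 0.1], "alpha"), ([0.8, 0.2], "alpha"), ([1.0, 0.15], "alpha"),
    ([0.1, 0.9], "beta"), ([0.2, 0.8], "beta"), ([0.15, 1.0], "beta"),
]
print(jackknife_success_rate(samples, nearest_centroid))  # 1.0
```

For a dataset of 204 proteins this loop trains the predictor 204 times, each time on 203 proteins, so no protein is ever used both to train and to test the same prediction.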
It can be seen from Table 3 that the current approach yielded the best overall success rate. We attribute this to the digital coding model on which the current method is built, which better reflects the physicochemical properties of amino acids and their degeneracy.
CONCLUSIONS
This paper introduces an optimal symbolic-to-digital mapping for amino acids based on the hydrophobicity index and information theory. The model developed from the current coding system can also be used to predict a series of other features of proteins, such as protein subcellular localization [48], membrane protein type [49], protein signal peptide [50], enzyme family class [51-53], GPCR type [54-57], and protease type [58].
ACKNOWLEDGEMENTS
This study was supported by the grants from the National
Natural Science Foundation of China (No. 60661003), and
the Province National Natural Science Foundation of JiangXi
(No. 0611060). The corresponding author would like to express his gratitude to two anonymous reviewers for their
constructive comments, which were very helpful for improving the presentation of this paper.
Table 3. The Overall Predictive Accuracy in the Jackknife Test for the 3 Sets of Amino Acid Digital Codes (Method: Augmented Covariant Discriminant Algorithm)

| Digital coding | All-α | All-β | α/β | α+β | Overall |
|---|---|---|---|---|---|
| Cristea [7] | 43/52 = 82.7% | 55/61 = 90.16% | 44/45 = 97.78% | 40/46 = 86.95% | 182/204 = 89.21% |
| Xiao et al. [34] | 43/52 = 82.7% | 55/61 = 90.2% | 45/45 = 100% | 40/46 = 87.0% | 183/204 = 89.7% |
| This paper | 43/52 = 82.7% | 57/61 = 93.44% | 45/45 = 100% | 41/46 = 89.13% | 186/204 = 91.17% |
REFERENCES
[1] Venter, J. C., Smith, H. O. and Hood, L. (1996) Nature, 381, 364.
[2] Chou, K. C. (2004) Curr. Med. Chem., 11, 2105.
[3] Xiao, X., Shao, S. H., Ding, Y., Huang, Z., Chen, X. and Chou, K. C. (2005) Amino Acids, 28, 29.
[4] Ramon R. R., Pedro, B. and Jose, L. O. (1996) Pattern Rec., 29, 1187.
[5] Xiao, X., Shao, S. H., Ding, Y. and Chou, K. C. (2005) Amino Acids, 28, 57.
[6] Xiao, X., Shao, S. H. and Chou, K. C. (2006) Amino Acids, 30, 49.
[7] Cristea, P. (2001) SPIE Conference BIOS 2001-International Biomedical Optics Symposium, San Jose, USA, pp. 20-26.
[8] Pan, Y. X., Zhang, Z. Z., Guo, Z. M., Huang, Z. D. and He, L. (2003) J. Prot. Chem., 22, 395.
[9] Sofer, W. H., http://waksman.rutgers.edu/Waks/Sofer/sofer.html.
[10] Nikola, S. (1998) Croat. Chem. Acta, 71, 573.
[11] Tomii, K. and Kanehisa, M. (1996) Protein Eng., 9, 27.
[12] Chou, K. C. (2000) Analytical Biochem., 286, 1.
[13] Cornette, J. L., Cease, K. B., Margalit, H., Spouge, J. L., Berzofsky, J. A. and Delisi, C. (1987) J. Mol. Biol., 195, 659.
[14] Bu, W. S., Feng, Z. P., Zhang, Z. D. and Zhang, C. T. (1999) Eur. J. Biochem., 266, 1043.
[15] Feng, Z. P. and Zhang, Z. T. (2000) J. Prot. Chem., 19, 269.
[16] Biou, V., Gibrat, J. F., Levin, J. M., Robson, B. and Garnier, J. (1988) Protein Eng., 2, 185.
[17] Kyte, J. and Doolittle, R. F. (1982) J. Mol. Biol., 157, 105.
[18] Ponnuswamy, P. K., Prabhakaran, M. and Manavalan P. (1980) Biochem. Biophys. Acta, 623, 301.
[19] Wold, S., Eriksson, L. and Hellberg, S. (1987) Can. J. Chem., 65, 1814.
[20] Chou, K. C. (2000) Curr. Prot. Pept. Sci., 1, 171.
[21] Chou, K. C. and Zhang, C. T. (1994) J. Biol. Chem., 269, 22014.
[22] Shen, H. B. and Chou, K. C. (2006) Bioinformatics, 22, 1717.
[23] Shen, H. B., Yang, J., Liu, X. J. and Chou, K. C. (2005) Biochem. Biophys. Res. Commun., 334, 577.
[24] Chou, K. C. (1995) Prot: Struct. Funct. Gene, 21, 319.
[25] Chou, K. C. and Maggiora, G. M. (1998) Protein Eng., 11, 523.
[26] Chou, K. C. (1999) Biochem. Biophys. Res. Comm., 264, 216.
[27] Chou, K. C. (2001) PROT: Struct. Funct. Gene, 43, 246.
[28] Chen, C., Tian, Y. X., Zou, X. Y. and Mo, J. Y. (2006) J. Theor. Biol., 243, 444.
[29] Du, P. and Li, Y. (2006) BMC Bioinformatics, 7, 518.
[30] Mondal, S., Bhavna, R., Mohan Babu, R. and Ramakumar, S. (2006) J. Theor. Biol., 243, 252.
[31] Lin, H. and Li, Q. Z. (2007) Biochem. Biophys. Res. Commun., 354, 548.
[32] Pu, X., Guo, J., Leung, H. and Lin, Y. (2007) J. Theor. Biol., 247, 259-265.
[33] Chen, Y. L. and Li, Q. Z. (2007) J. Theor. Biol., 245, 775.
[34] Xiao, X., Shao, S. H., Huang, Z. D. and Chou, K. C. (2006) J. Comp. Chem., 27, 478.
[35] Chou, K. C. and Elrod, D. W. (1999) Protein Eng., 12, 107.
[36] Chou, K. C. and Zhang, C. T. (1995) Crit. Rev. Biochem. Mol. Biol., 30, 275.
[37] Zhou, G. P. (1998) J. Protein Chem., 17, 729.
[38] Zhou, G. P. and Assa-Munt, N. (2001) PROTEINS: Struct. Funct. Gene, 44, 57.
[39] Chou, K. C. and Shen, H. B. (2006) Biochem. Biophys. Res. Commun., 347, 150.
[40] Kedarisetti, K. D., Kurgan, L. A. and Dick, S. (2006) Biochem. Biophys. Res. Commun., 348, 981.
[41] Chou, K. C. and Shen, H. B. (2006) J. Proteome Res., 5, 1888.
[42] Chou, K. C. and Shen, H. B. (2007) J. Cell. Biochem., 100, 665.
[43] Shen, H. B. and Chou, K. C. (2007) Biopolymers, 85, 233.
[44] Shen, H. B. and Chou, K. C. (2007) Biochem. Biophys. Res. Commun., 355, 1006.
[45] Chou, K. C. and Shen, H. B. (2007) J. Proteome Res., 6, 1728.
[46] Chou, K. C. and Shen, H. B. (2007) Biochem. Biophys. Res. Comm., 357, 633.
[47] Chou, K. C. and Shen, H. B. (2007) Biochem. Biophys. Res. Comm., 360, 339.
[48] Shen, H. B., Yang, J. and Chou, K. C. (2007) Amino Acids, 33, 57.
[49] Chou, K. C. and Cai, Y. D. (2005) J. Chem. Inform. Model., 45, 407.
[50] Chou, K. C. and Shen, H. B. (2007) Biochem. Biophys. Res. Comm., 357, 633.
[51] Chou, K. C. (2005) Bioinformatics, 21, 10.
[52] Chou, K. C. and Cai, Y. D. (2005) Protein Sci., 13, 2857.
[53] Zhou, X. B., Chen, C., Li, Z. C. and Zou, X. Y. (2007) J. Theoret. Biol., doi:10.1016/j.jtbi.2007.1006.1001.
[54] Chou, K. C. and Elrod, D. W. (2002) J. Prot. Res., 1, 429.
[55] Chou, K. C. (2005) J. Prot. Res., 4, 1413.
[56] Gao, Q. B. and Wang, Z. Z. (2006) Prot. Eng. Des. Sel., 19, 511.
[57] Wen, Z., Li, M., Li, Y., Guo, Y. and Wang, K. (2006) Amino Acids, 32, 277.
[58] Chou, K. C. and Cai, Y. D. (2006) Biochem. Biophys. Res. Comm., 339, 1015-1020.

Received: May 30, 2007
Revised: July 02, 2007
Accepted: July 03, 2007