RE: Credit Card Fraud Detection Using Hidden Markov Models
This article is presented by: Erik L.L. Sonnhammer
Gunnar von Heijne
A hidden Markov model for predicting transmembrane helices in protein
A novel method to model and predict the location and orientation of alpha helices in membrane-spanning proteins is presented. It is based on a hidden Markov model (HMM) with an architecture that corresponds closely to the biological system. The model is cyclic with 7 types of states for helix core, helix caps on either side, loop on the cytoplasmic side, two loops for the non-cytoplasmic side, and a globular domain state in the middle of each loop. The two loop paths on the non-cytoplasmic side are used to model short and long loops separately, which corresponds biologically to the two known different membrane insertion mechanisms. The close mapping between the biological and computational states allows us to infer which parts of the model architecture are important to capture the information that encodes the membrane topology, and to gain a better understanding of the mechanisms and constraints involved. Models were estimated both by maximum likelihood and by a discriminative method, and a method for reassignment of the membrane helix boundaries was developed. In a cross-validated test on single sequences, our transmembrane HMM, TMHMM, correctly predicts the entire topology for 77% of the sequences in a standard dataset of 83 proteins with known topology. The same accuracy was achieved on a larger dataset of 160 proteins. These results compare favourably with existing methods.
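To make the cyclic layout easier to picture, here is a minimal Python sketch of the seven state types and one plausible way they could be wired together. It is only a reading of the abstract above: the state names, the split of helix core and caps into an outward and an inward copy, and the transition table are all assumptions made for illustration, and none of the real TMHMM states, loop lengths or probabilities are reproduced.

```python
# Coarse, block-level sketch of the cyclic architecture described in the abstract.
# ASSUMPTION: every name and every transition below is illustrative only;
# this is not the published TMHMM topology or its parameters.

# Each block is labelled with one of the 7 state types named in the abstract.
STATE_TYPE = {
    "cyt_loop": "cytoplasmic loop",
    "cyt_glob": "globular domain",
    "cyt_cap_out": "cytoplasmic cap",
    "core_out": "helix core",            # helix running from inside to outside
    "noncyt_cap_out": "non-cytoplasmic cap",
    "short_loop": "short non-cytoplasmic loop",
    "short_glob": "globular domain",
    "long_loop": "long non-cytoplasmic loop",
    "long_glob": "globular domain",
    "noncyt_cap_in": "non-cytoplasmic cap",
    "core_in": "helix core",             # helix running from outside to inside
    "cyt_cap_in": "cytoplasmic cap",
}

# Allowed moves between blocks (hypothetical wiring that closes the cycle).
TRANSITIONS = {
    "cyt_loop": ["cyt_glob", "cyt_cap_out"],
    "cyt_glob": ["cyt_loop"],
    "cyt_cap_out": ["core_out"],
    "core_out": ["noncyt_cap_out"],
    "noncyt_cap_out": ["short_loop", "long_loop"],
    "short_loop": ["short_glob", "noncyt_cap_in"],
    "short_glob": ["short_loop"],
    "long_loop": ["long_glob", "noncyt_cap_in"],
    "long_glob": ["long_loop"],
    "noncyt_cap_in": ["core_in"],
    "core_in": ["cyt_cap_in"],
    "cyt_cap_in": ["cyt_loop"],
}


def is_allowed(path):
    """Return True if every consecutive pair of blocks in path is a listed transition."""
    return all(b in TRANSITIONS[a] for a, b in zip(path, path[1:]))


# One full trip around the cycle: in -> out through a helix, a long outside
# loop with a globular domain, then out -> in through a second helix.
one_cycle = [
    "cyt_loop", "cyt_cap_out", "core_out", "noncyt_cap_out",
    "long_loop", "long_glob", "long_loop",
    "noncyt_cap_in", "core_in", "cyt_cap_in", "cyt_loop",
]
print(is_allowed(one_cycle))
```

In this sketch a helix can only be left on the opposite side from where it was entered, which is what forces the predicted topology to alternate between the cytoplasmic and non-cytoplasmic sides.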
Prediction of membrane-spanning alpha helices in proteins is a frequent sequence analysis objective. A large fraction of the genes in a genome encode integral membrane proteins (Himmelreich et al. 1996; Frishman & Mewes 1997; Wallin & von Heijne 1998). Knowledge of the presence and exact location of the transmembrane helices is important for functional annotation and to direct functional analysis. Transmembrane helices are substantially easier to predict than helices in globular domains. Predicting 95% of the transmembrane helices in the ‘correct’ location is not unusual (Cserzo et al. 1997; Rost et al. 1995). Here ‘correct’ means that the prediction overlaps the true location. The reason for this high accuracy is that most transmembrane alpha helices are encoded by an unusually long stretch of hydrophobic residues. This compositional bias is imposed by the constraint that residues buried in lipid membranes must be suitable for hydrophobic interactions with the lipids. The hydrophobic signal is so strong that a straightforward approach of calculating a propensity scale for residues in transmembrane helices and applying a sliding window with a cutoff already performs quite well.
In addition to knowing the location of a transmembrane helix, knowledge of its orientation, i.e. whether it runs inwards or outwards, is also important for making functional inferences for different parts of the sequence. The orientations of the transmembrane helices give the overall topology of the protein. It is known that the positively charged residues arginine and lysine play a major role in determining the orientation, as they are mainly found in non-transmembrane parts of the protein (‘loops’) on the cytoplasmic side (von Heijne 1986; Jones, Taylor, & Thornton 1994; Persson & Argos 1994; Wallin & von Heijne 1998), a pattern often referred to as the ‘positive-inside rule’. Since the rule also applies to proteins in the membrane of intracellular organelles (Gavel et al. 1991; Gavel & von Heijne 1992), we shall use the terms ‘cytoplasmic’ and ‘non-cytoplasmic’ for the two sides of a membrane. The difference in amino acid usage between cytoplasmic and non-cytoplasmic loops can be exploited to improve the prediction of transmembrane helices by validating potential transmembrane helices by the charge bias they would produce (von Heijne 1992).
Despite this relatively consistent topogenic signal, correct prediction of the location and orientation of all transmembrane segments has proved to be a difficult problem. On a reasonably large dataset of single sequences, a topology accuracy of 77% has been reported (Jones, Taylor, & Thornton 1994), and 86% when aided by multiple alignments (Rost, Fariselli, & Casadio 1996). The difficulty in predicting the topology seems to be partly caused by the fact that the positive-inside rule can be blurred by globular domains in loops on the non-cytoplasmic side that contain a substantial number of positively charged residues.
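The simple baseline mentioned above (a propensity scale, a sliding window and a cutoff) and the positive-inside rule can both be sketched in a few lines of Python. This is an illustration under stated assumptions, not the method of any of the cited papers: the text does not name a scale, so the Kyte-Doolittle hydropathy values are used here as one common choice, the window of 19 residues and cutoff of 1.6 are conventional settings for that scale, and the function names are hypothetical.

```python
# ASSUMPTION: Kyte-Doolittle hydropathy values, used as a stand-in propensity scale.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_scores(seq, window=19):
    """Mean hydropathy of every full window, indexed by window start."""
    values = [KD[res] for res in seq]  # raises KeyError on non-standard residues
    return [sum(values[i:i + window]) / window
            for i in range(len(seq) - window + 1)]


def predict_segments(seq, window=19, cutoff=1.6):
    """Return half-open (start, end) runs of window centres scoring above cutoff."""
    half = window // 2
    segments, start = [], None
    for i, score in enumerate(window_scores(seq, window)):
        centre = i + half
        if score > cutoff and start is None:
            start = centre                    # a hydrophobic run begins
        elif score <= cutoff and start is not None:
            segments.append((start, centre))  # the run ended before this centre
            start = None
    if start is not None:                     # run reaches the last window
        segments.append((start, len(seq) - window + 1 + half))
    return segments


def n_terminus_side(seq, segments):
    """Guess the side of the N-terminus from the K+R counts of alternating loops."""
    bounds = [0] + [b for seg in segments for b in seg] + [len(seq)]
    loops = [seq[bounds[i]:bounds[i + 1]] for i in range(0, len(bounds), 2)]
    charge = [loop.count("K") + loop.count("R") for loop in loops]
    same_side = sum(charge[0::2])   # loops on the N-terminal side of the membrane
    other_side = sum(charge[1::2])  # loops on the opposite side
    if same_side > other_side:
        return "cytoplasmic"
    if other_side > same_side:
        return "non-cytoplasmic"
    return "undetermined"


# Toy sequence: a positively charged N-terminal flank, a 22-residue
# leucine stretch, and an uncharged or acidic C-terminal flank.
toy = "MKRKRSG" + "L" * 22 + "DSGTENS"
segments = predict_segments(toy)
print(segments, n_terminus_side(toy, segments))
```

A counting rule like this is exactly what a positively charged globular domain in a non-cytoplasmic loop would mislead, which is the weakness described in the last paragraph above.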