ABSTRACT
PROTEIN SECONDARY STRUCTURE PREDICTION USING DEEP LEARNING
Protein structure prediction is an essential but challenging research area in bioinformatics. The goal of protein structure prediction is to identify the three-dimensional structure of a protein from its amino acid sequence; that is, the primary structure of a protein is used to predict its secondary and tertiary structure. However, predicting the 3D structure of a protein from its amino acid sequence alone is still an unsolved problem in computational biology. A protein is a sequence of amino acids that folds into different shapes and structures, resulting in diverse biological functions, so it is difficult to know how a protein folds into a 3D structure from its amino acid sequence alone. Understanding the relationship between the amino acid sequence and the protein structure is one of the most significant challenges in bioinformatics. Protein secondary structure prediction plays a vital role in linking the primary sequence to the functional tertiary structure of a protein. Predicting the secondary structure from the primary structure has a high computational cost, and the accuracy of the prediction must be maintained.
Machine learning has had great success in many fields and has been proven to predict outcomes accurately from many different data sets. Deep learning, a framework built on the Artificial Neural Network (ANN), has been used in many fields such as computer vision and natural language processing. In this project, we use several approaches to improve prediction performance on protein secondary structure. In general, two deep learning models, the Recurrent Neural Network (RNN) and the Convolutional Neural Network (CNN), are used. Compared to existing research on protein secondary structure prediction that uses amino acid sequences only, we apply two additional protein properties as extra features in the deep learning models. The first property is water solvent accessibility, which contains two attributes, Absolute Solvent Accessibility (ASA) and Relative Solvent Accessibility (RSA). The second property is the protein charge, which has two attributes, positive charge (N-terminal) and negative charge (C-terminal). Third, we combine the two properties and use them together in the deep learning models.
Copyright by
Yan Wang
2022
ACKNOWLEDGEMENT
This project would not have been possible without the support of many people. Many thanks to my advisor, Dr. Jiling Zhong, for your patience and guidance. I have significantly benefited from your wealth of knowledge.
Thank you to my committee members, Dr. Huan and Dr. Ma. Your encouraging words and thoughtful, detailed feedback have been essential to me.
Thank you to my parents for your endless support. You have always stood behind me, and this was no exception.
Most importantly, I am grateful for my family's unconditional, unequivocal, and loving support.
HUMAN OR ANIMAL SUBJECTS REVIEW FORM
for
________Yan Wang________
Name of Student
PROTEIN SECONDARY STRUCTURE PREDICTION USING DEEP LEARNING
Title of Research Project
This research project has been reviewed by the Institutional Review Board and approved as follows (the appropriate block must be checked by either the Thesis chair or the Chair of the Institutional Review Board):
Neither humans nor animals will be used, and this research is certified exempt from Institutional Review Board review by the thesis committee chair.
Human participants will be used, and this research is certified exempt from Institutional Review Board review by the Chair of the Institutional Review Board.
Human participants will be used, and this research was reviewed and is approved by the Institutional Review Board.
Animal participants will be used, and this research was reviewed and is approved by the Animal Research Review Board.
_________________________________________
Signature of Thesis Committee Chair Date
_________________________________________
Signature of Chair of Institutional Review Board Date
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
CHAPTER ONE: INTRODUCTION
CHAPTER TWO: PROTEIN STRUCTURE
2.1 Primary Structure
2.2 Secondary Structure
2.3 Tertiary Structure
2.4 Quaternary Structure
CHAPTER THREE: PROBLEM FORMULATIONS AND RELATED WORK
3.1 Problem Formulation
3.2 Related Work
CHAPTER FOUR: NEURAL NETWORK AND DEEP LEARNING
4.1 Machine Learning
4.2 Neural Network
4.3 Deep Learning
4.3.1 Convolutional Neural Network (CNN)
4.3.2 Recurrent Neural Network (RNN)
CHAPTER FIVE: PREDICT PROTEIN SECONDARY STRUCTURE
5.1 Data Sets
5.2 CNN Model
5.2.1 CNN with Water Solvent Accessibility
5.2.2 CNN with Charge Property
5.2.3 CNN with Two Properties
5.3 RNN Model
5.3.1 RNN with Water Solvent Accessibility
5.3.2 RNN with Charge Property
5.3.3 RNN with Two Properties
CHAPTER SIX: EXPERIMENT AND RESULTS DISCUSSION
6.1 Experiment Environment
6.2 Experiment on CNN
6.2.1 Results with Water Solvent Accessibility
6.2.2 Results with Charge Property
6.2.3 Results with Two Properties
6.3 Experiment on RNN
6.3.1 Results with Water Solvent Accessibility
6.3.2 Results with Charge Property
6.3.3 Results with Two Properties
CHAPTER SEVEN: CONCLUSION AND FUTURE WORK
7.1 Conclusion
7.2 Future Work
LIST OF REFERENCES
APPENDICES
LIST OF TABLES
Table 1: Eight States in the Secondary Structure
Table 2: DSSP
Table 3: Data Description
Table 4: Overall Performance on All Models
LIST OF FIGURES
Figure 1: Protein primary structure
Figure 2: Protein secondary structure
Figure 3: Protein tertiary structure
Figure 4: Protein quaternary structure
Figure 5: Neural network structure
Figure 6: CNN model
Figure 7: Working flow in RNN
Figure 8: Unrolled RNN structure
Figure 9: Cell structure in LSTM
Figure 10: Information passing in LSTM
Figure 11: The inner structure of GRU
Figure 12: CNN architecture with sequence only
Figure 13: CNN architecture using water solvent accessibility
Figure 14: CNN architecture using charge property
Figure 15: CNN architecture using two properties together
Figure 16: RNN architecture with sequence only
Figure 17: RNN architecture using water solvent accessibility
Figure 18: RNN architecture using charge property
Figure 19: RNN architecture using two properties together
Figure 20: Training performance of CNN without using any properties
Figure 21: Training performance of CNN using water solvent accessibility
Figure 22: Training performance of CNN using charge property
Figure 23: Training performance of CNN using two properties together
Figure 24: Training performance of RNN without using any properties
Figure 25: Training performance of RNN using water solvent accessibility
Figure 26: Training performance of RNN using charge property
Figure 27: Training performance of RNN using two properties together
CHAPTER ONE: INTRODUCTION
Proteins are made of amino acid residues joined into one or more chains. In general, a protein consists of 20 standard amino acids, but some specific protein sequences also include selenocysteine and pyrrolysine, usually called the 21st and 22nd amino acids (Zhang & Gladyshev, 2017); they are not used in our project. The different combinations of these 20 amino acids in a protein sequence cause proteins to differ from one another. The amino acid sequence determines how a protein folds into a 3D structure, which in turn determines the function of the protein (Sillitoe, Lewis & Cuff, 2015; Yang, Gao, Wang, Heffernan, Hanson, Paliwal & Zhou, 2016), such as balancing fluids, transporting and storing nutrients, providing energy, and so forth. Protein structure is one of the essential research topics in computational biology. Understanding protein structure helps people understand how a protein functions and form hypotheses about how to control or modify it.
The amino acid sequence conveys valuable information for understanding a protein's structure and its functions. In nature, a protein is made up of 20 amino acids: Alanine (Ala, A), Arginine (Arg, R), Asparagine (Asn, N), Aspartic acid (Asp, D), Cysteine (Cys, C), Glutamic acid (Glu, E), Glutamine (Gln, Q), Glycine (Gly, G), Histidine (His, H), Isoleucine (Ile, I), Leucine (Leu, L), Lysine (Lys, K), Methionine (Met, M), Phenylalanine (Phe, F), Proline (Pro, P), Serine (Ser, S), Threonine (Thr, T), Tryptophan (Trp, W), Tyrosine (Tyr, Y), and Valine (Val, V). Proteins can be hundreds of amino acids long or only a few, and not every protein uses all 20 amino acids, so the combinations of amino acids that can form a protein are nearly countless. DNA is a genetic molecule that contains the genetic code of an organism. The nucleotide bases of DNA are Adenine (A), Cytosine (C), Guanine (G), and Thymine (T). Each gene combines these nucleotides in groups of three, and every three-nucleotide codon specifies which amino acid is added to the protein. Hence, a sequence of 300 nucleotides encodes a protein 100 amino acids long. The order of the amino acids is vital because it determines the protein's structure, which in turn determines the protein's functions.
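To make the codon-to-amino-acid mapping concrete, the short Python sketch below translates a nucleotide string three bases at a time. The tiny codon table and the example sequence are hypothetical excerpts for illustration only; they are not data used in this project.

```python
# Minimal sketch of translating a DNA sequence codon by codon.
# The codon table below is a small excerpt of the standard genetic code;
# the example sequence is made up for illustration.
CODON_TABLE = {
    "ATG": "M",  # Methionine (start)
    "GCT": "A",  # Alanine
    "AAA": "K",  # Lysine
    "TGG": "W",  # Tryptophan
}

def translate(dna: str) -> str:
    """Translate a DNA string into a one-letter amino acid string."""
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        codon = dna[i:i + 3]
        protein.append(CODON_TABLE.get(codon, "X"))  # 'X' marks unknown codons
    return "".join(protein)

print(translate("ATGGCTAAATGG"))  # -> "MAKW": 12 nucleotides encode 4 residues
```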
The Nucleotide database is a collection of sequences from multiple resources, such as GenBank, TPA, and RefSeq. For example, GenBank[1] contained 16.1 trillion nucleotide bases in more than 2 billion sequences as of November 2021. The Protein Data Bank (PDB) is a database built for storing the 3D structures of biological molecules such as proteins. By 2021, over 160K protein structures had been deposited into the RCSB[2]. Comparing the number of sequences in GenBank with the number of deposited structures in the RCSB Protein Data Bank, known sequences outnumber solved structures by orders of magnitude. After a new protein sequence is detected, the biggest obstacles to determining its structure experimentally are cost and time; it usually costs around $100,000 to determine a new structure. Predicting a protein structure from the protein sequence at lower cost therefore becomes very important. In our project, we apply machine learning algorithms to predict the secondary structure of a protein from its sequence.
A protein can be described at four essential structural levels: primary, secondary, tertiary, and quaternary. The secondary structure of a protein refers to the local conformation of the protein's polypeptide backbone. There are two regular secondary structure states, the α-helix (H) and the β-strand (E), and one irregular type, the coil region (C).
Accurately predicting protein 3D structures from protein sequences alone is one of the most challenging tasks in computational biology. Predicting the protein secondary structure therefore provides an alternative path toward tertiary structure prediction; it links the primary sequence and the tertiary structure (Plaxco, Simons & Baker, 1998; Zhou & Karplus, 1999; Ozkan, Wu & Chodera, 2007). Accurate protein structure prediction depends on the accuracy of secondary structure prediction (Fischer & Eisenberg, 1996; Wu, Skolnick & Zhang, 2007). However, protein secondary structure prediction is still a classical open problem in bioinformatics (Yang, Gao, Wang, Heffernan, Hanson, Paliwal & Zhou, 2016). In this paper, we attempt to improve the performance of predicting a protein's secondary structure.
Protein secondary structure prediction dates back to 1951, when Pauling and Corey predicted helical and sheet conformations for the protein polypeptide backbone before the first protein structure had been determined. In the sixty-five years since, many new methods have been introduced to this field. The most widely used assignment schemes are the three-class Secondary Structure of Proteins (SSP) and the Dictionary of Secondary Structure of Proteins (DSSP), which automatically assigns the secondary structure to one of eight states according to hydrogen-bonding patterns (Kabsch & Sander, 1983). Secondary structure prediction accuracy has improved steadily ever since.
Machine learning algorithms have been proven to perform well in many classification and prediction tasks. Given these advantages, machine learning algorithms, especially artificial neural networks, have already been applied to bioinformatics. The first neural network applied to secondary structure prediction achieved 69.7% accuracy (Rost & Sander, 1993). 80% accuracy was achieved by Structural Property prediction with an Integrated Neural network (SPINE) in 2007 (Dor & Zhou, 2007). Accuracy was then boosted to 82% by Structural Property prediction with Integrated deep neural network 2 (SPIDER2) in 2017 (Yang, Heffernan, Paliwal, Lyons & Zhou, 2017). Moreover, a Multilayer Shift-and-Stitch Deep Convolutional Architecture further improved performance slightly, and the Deep Convolutional Neural Fields network (DeepCNF) increased the accuracy to 84% in 2016 (Wang, Peng, Ma & Xu, 2016). As a machine learning framework, deep learning works similarly to the artificial neural network and has an outstanding reputation in classification and prediction tasks, and much existing research uses deep learning for protein secondary structure prediction. Our project applies both the Convolutional Neural Network (CNN) and the Recurrent Neural Network (RNN) to protein secondary structure prediction. Compared to current research on the same task, we attempt to improve prediction performance by using the amino acid sequence together with different protein properties.
CHAPTER TWO: PROTEIN STRUCTURE
Proteins are macromolecules that play many crucial roles in living organisms. They work within cells, forming structures and carrying out functions in the cell. Proteins act as catalysts for most of the biochemical reactions that occur in a body. They also help protect the body, assist with forming new molecules, support cells, and transport and store oxygen.
A protein is composed of many amino acids joined together to form a long chain. In general, proteins are built from 20 amino acids, but not every protein uses all 20; some use only a few. Proteins with identical amino acid sequences perform the same functions in a body, and because there are nearly infinite combinations of amino acid sequences, proteins perform many different functions.
To fully understand a protein, it is critical to explore both its structure and its functions. A protein folds into a 3D structure that is determined by its amino acid sequence. The complete structure can be described at four different levels: primary, secondary, tertiary, and quaternary structure.
2.1 Primary Structure
The primary structure is described as the linear amino acid sequence of a protein's polypeptide chain; the term protein sequence often refers to the primary structure. The primary structure is built from the 20 amino acids, joined by peptide bonds. Figure 1 shows the primary structure of a protein. In 1973, Chris Anfinsen demonstrated that the higher levels of structure are uniquely determined by a protein's amino acid sequence (Anfinsen, 1973). The amino acid sequence regulates a protein's 3D structure, which determines the functions of the protein. For example, Karp and Geer (2005) describe how a single change in the sequence of hemoglobin causes sickle cell anemia. Proteins differ from one another in the composition and order of the amino acids in the chain. So far, the primary structures of more than a thousand proteins have been determined and studied, such as pancreatic ribonuclease, insulin, and trypsin.
Figure 1 Protein primary structure[3]
2.2 Secondary Structure
[1] https://ncbiinsights.ncbi.nlm.nih.gov/2021/11/15/genbank-release-246/#more-7199
[2] https://www.rcsb.org/structure/3P7U
[3] https://content.byui.edu/file/a236934c-3c60-4fe9-90aad343b3e3a640/1/module3/readings/proteins.html
The secondary protein structure depends on the local spatial interactions of the backbone between parts of a protein chain, excluding the side chains. The secondary structure refers to the polypeptide folds that shape the 3D structure of the protein. As shown in Figure 2, there are two common types of secondary structure, the α-helix and the β-pleated sheet.
- α-helix: resembles a coiled spring and is secured by hydrogen bonding within the polypeptide chain
- β-pleated sheet: folded or pleated and joined together by hydrogen bonding between adjacent polypeptide units of the folded chain
Many proteins contain both α-helices and β-pleated sheets, but some include only one type of secondary structure. The 3D structure at this level reveals more information about the construction of a protein.
Figure 2 Protein secondary structure[1]
The secondary structure of a protein has eight states, shown in Table 1. Our project uses these eight states as the classes in the deep learning models for secondary structure prediction.
Table 1. Eight States in the secondary structure
Three states | α-helix | β-pleated sheet | Coil
Abbreviation | H | E | C
Eight states | H, G, I | E, B | T, S, -
2.3 Tertiary Structure
The protein polypeptide chains form different secondary structures, which can be further coiled and folded into the complex 3D shape of a protein's tertiary structure. The tertiary structure refers to the comprehensive 3D shape of a polypeptide beyond its secondary structure and is mainly formed by side-chain conformations. The tertiary structure may be necessary for a protein's functions, such as forming an active site in an enzyme. This structure is displayed in Figure 3.
Figure 3. Protein tertiary structure[2]
2.4 Quaternary Structure
Some proteins comprise more than one independent polypeptide chain; each individual polypeptide chain unit, with its own tertiary structure, is called a subunit. Multiple subunits, possibly of the same type, may interact to form a more complex and more extensive 3D structure, the protein quaternary structure. Because of the complexity of quaternary structure conformations, it is extremely difficult to predict a protein's fully folded structure from its amino acid sequence alone. The quaternary structure is illustrated in Figure 4.
Figure 4. Protein quaternary structure[3]
CHAPTER THREE: PROBLEM FORMULATIONS AND RELATED WORK
3.1 Problem Formulation
Accurately predicting a protein's 3D structure is still an unsolved problem in computational biology. Although scientists have characterized the four structural levels of proteins, it is difficult to predict a 3D structure from the amino acid sequence alone because of the nearly countless combinations of folded amino acid sequences. Newly discovered amino acid sequences and protein structures are stored in many databases. For instance, GenBank holds more than 2 billion sequences so far, while over 160K protein structures have been deposited into the RCSB; known sequences vastly outnumber deposited structures. After a new protein sequence is discovered, the main difficulties in solving its structure experimentally are cost and time; it usually costs around $100,000.
Existing research addresses protein structure prediction along the following lines.
- Predict the secondary or tertiary structure from the primary structure
- Predict the tertiary structure from the secondary structure
- Predict the quaternary structure from the lower levels, the tertiary and secondary structures
Among these research directions, predicting the secondary structure from the primary structure is the most popular subproblem. Some studies have already shown that a protein's secondary structure can be predicted with relatively high accuracy from amino acid sequences alone. The secondary structure is an elementary level that bridges the primary structure and the tertiary structure. On the other hand, the secondary structure of a protein determines how proteins fold and how fast they fold (Plaxco, Simons & Baker, 1998). The accuracy of secondary structure prediction also depends on how correctly the secondary structure is defined.
Overall, two annotation schemes are used in protein secondary structure prediction. One is the Dictionary of Secondary Structure of Proteins (DSSP); the other is the three-state Secondary Structure of Proteins (SSP). DSSP was designed by Wolfgang Kabsch and Chris Sander in 1983 (Kabsch & Sander, 1983). DSSP is a collection of secondary structure annotations for protein structures; each residue is assigned one of eight possible states based on its hydrogen-bonding patterns. These eight states and their letter representations are shown in Table 2, and they are used in our deep learning classification models as the data labels/classes. Furthermore, the eight DSSP states can be grouped into three categories, helix, sheet, and coil, which make up the three-state secondary structure (SSP); a small mapping sketch follows Table 2.
Table 2. DSSP

Eight State Names | Letter Representation
3-helix | G
alpha-helix | H
pi-helix | I
helix-turn | T
extended beta-sheet | E
beta bridge | B
bend | S
loop | L
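To make the relationship between the eight DSSP states and the three coarse SSP classes concrete, the snippet below collapses an eight-state label string into helix/sheet/coil labels. The grouping follows the common convention (H, G, I as helix; E, B as sheet; everything else as coil); the exact grouping used in any particular study can vary, so treat this as an illustrative sketch rather than the project's definitive mapping.

```python
# Collapse DSSP eight-state labels (Q8) into three-state labels (Q3).
# Grouping convention assumed here: helix = {H, G, I}, sheet = {E, B},
# coil = everything else (T, S, L/-).
Q8_TO_Q3 = {
    "H": "H", "G": "H", "I": "H",            # helices
    "E": "E", "B": "E",                       # sheets / bridges
    "T": "C", "S": "C", "L": "C", "-": "C",   # coil-like states
}

def q8_to_q3(labels: str) -> str:
    """Map each eight-state character to its three-state class."""
    return "".join(Q8_TO_Q3.get(s, "C") for s in labels)

print(q8_to_q3("HHHEETTSGG"))  # -> "HHHEECCCHH"
```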
Our project does not use the amino acid sequence alone to predict the protein secondary structure; instead, we apply amino acid properties as additional features in our machine learning models. Amino acids are divided into three primary classes: charged, hydrophobic, and hydrophilic residues.
- Amino acid charge properties include positive and negative charge. The N-terminal amino acid is positively charged; the C-terminal amino acid is negatively charged. All other charges are neutralized.
- Hydrophobic amino acids have little or no polarity in the side chain.
- Hydrophilic amino acids are characterized by two essential measures. The first is the Accessible Surface Area (ASA), or Solvent-Accessible Surface Area (SASA), first described in 1971 by Lee & Richards and calculated by an algorithm that rolls a sphere of a specific radius over the structure to trace the accessible surface. The second is the Relative Accessible Surface Area (RASA), or Relative Solvent Accessibility (RSA), a measurement of a residue's solvent exposure.
Our project uses water solvent accessibility and the protein charge property as extra features in deep learning training; a small sketch of how RSA relates to ASA follows.
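Since the datasets provide both absolute and relative solvent accessibility, the relationship between the two is worth spelling out: RSA is typically computed by dividing a residue's ASA by a reference maximum ASA for that amino acid type. The maximum values in the sketch below are rounded, illustrative placeholders, not the scale used to build the datasets in this project.

```python
# Sketch: derive Relative Solvent Accessibility (RSA) from Absolute Solvent
# Accessibility (ASA). The max-ASA reference values are illustrative
# placeholders, not the exact scale used to prepare CB6133/CB513.
MAX_ASA = {"A": 106.0, "G": 84.0, "K": 200.0, "W": 255.0}  # approximate values in square angstroms

def relative_accessibility(residue: str, asa: float) -> float:
    """Return RSA in [0, 1], clipped so extreme values stay in range."""
    max_asa = MAX_ASA[residue]
    return min(max(asa / max_asa, 0.0), 1.0)

print(relative_accessibility("K", 90.0))  # ~0.45: this lysine is about half exposed
```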
3.2 Related Work
A variety of state-of-the-art methods have already improved prediction accuracy for the secondary structure of a protein without using homologous sequences in training, such as SPIDER2, DeepCNF, MUST-CNN, and so forth. Most of these methods use machine learning, more specifically deep learning frameworks such as CNNs.
SPIDER2 (Yang, 2017) applied deep neural networks in three iterations to simultaneously improve the accuracy of several structural properties. SPIDER2 predicted the secondary structure and the accessible surface area, representing local and nonlocal structural properties, and achieved 82% accuracy on the test dataset.
DeepCNF, short for Deep Convolutional Neural Fields, improved secondary structure prediction by capturing long-range sequential information and the interdependencies between adjacent labels. To address the imbalance among labels, DeepCNF assigned different weights to each label during training. The method was evaluated on CASP9 and CASP10 and achieved AUC values of 0.855 and 0.898 (Wang, Peng, Ma & Xu, 2016).
MUST-CNN improved secondary structure prediction using a multilayer shift-and-stitch technique to produce fully dense per-position predictions on the protein's primary sequence. MUST-CNN also obtained comparable accuracy with a faster training process on the 4prot and CullPDB datasets (Lin, Lanchantin & Qi, 2016).
The Deep Convolutional and Recurrent Neural Network (DCRNN) improves secondary structure prediction by utilizing a stacked deep learning framework. The model consists of four parts: a feature embedding layer, CNN layers, bidirectional gated recurrent unit (BGRU) layers, and two fully connected layers. DCRNN achieved 73.2% accuracy on the CB6133 dataset (Li, Yu, Shahabi & Liu, 2018).
CHAPTER FOUR: NEURAL NETWORK AND DEEP LEARNING
[1] https://www.khanacademy.org/science/biology/macromolecules/proteins-and-amino-acids/a/orders-of-protein-structure
[2] https://www.stereoelectronics.org/webDD/DD_02.html
[3] https://ib.bioninja.com.au/higher-level/topic-7-nucleic-acids/73-translation/protein-structure.htm
4.1 Machine Learning
Machine learning means getting a computer to automatically build a model from historical data without being explicitly programmed. As a branch of artificial intelligence, machine learning is used in many fields, such as text classification, image processing, healthcare, and so forth. Machine learning is close to our daily lives, even if people often don't realize it.
Machine learning includes two main types of learning: supervised learning and unsupervised learning. Supervised learning trains a model with labeled data, that is, data samples that have been manually annotated with one or more class labels. Such a dataset trains a machine learning algorithm to predict or classify outcomes accurately on unseen data. Several classical supervised algorithms, such as linear regression, are designed for regression tasks, while the Support Vector Machine (SVM), Decision Tree, and Artificial Neural Network (ANN) are used for classification and prediction tasks. Unsupervised learning, on the other hand, infers patterns from unlabeled data and is designed to discover structure and patterns in the input. K-means is a popular unsupervised learning algorithm.
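To make the labeled/unlabeled distinction concrete, the sketch below fits one supervised classifier and one unsupervised clustering model on a tiny toy dataset with scikit-learn (one of the libraries listed later in Chapter Six). The data points here are made up for illustration.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.cluster import KMeans

X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])  # toy features
y = np.array([0, 0, 1, 1])                                       # labels, used by supervised learning only

clf = SVC().fit(X, y)               # supervised: learns from features AND labels
print(clf.predict([[0.15, 0.15]]))  # -> [0]

km = KMeans(n_clusters=2, n_init=10).fit(X)  # unsupervised: sees only the features
print(km.labels_)                            # cluster assignments discovered from X alone
```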
4.2 Neural Network
As mentioned in the previous section, the Artificial Neural Network (ANN), or Neural Network (NN), is one of the machine learning algorithms used for classification and prediction tasks. The neural network was originally an attempt to mimic the human brain; it was widely used in the 80s and early 90s, and it has again become a state-of-the-art machine learning method for many applications. A neural network contains an input layer, one or more hidden layers, and one output layer. Each layer has a different number of neurons, or nodes. The number of nodes in the input layer is determined by the number of features in the training data. The hidden layers can have any number of nodes, while the size of the output layer is decided by the number of classes or labels in the data. Hence, the neural network works well for both binary and multi-class classification. Figure 5 displays an example structure of a neural network.
Figure 5. Neural network structure[1]
In a simple neural network, the layers are assumed to be fully connected: each node in the current layer is connected to all the nodes in the previous layer. Neural network training contains two phases, forward propagation and backward propagation. In forward propagation, the network calculates the output by propagating the input data through the layers; the output of one layer is used as the input to the next. Several activation functions are used in forward propagation. The first is the sigmoid, or logistic, function, an S-shaped function with output between 0 and 1; a sigmoid in the output layer suits binary classification, and tanh is another activation function that can be used for binary classification. The second is softmax, a generalized logistic activation function used for multiclass classification. Third, ReLU is another popular activation function, commonly used in deep learning models. Backward propagation computes the gradient of the cost function and then adjusts the weights to reduce the loss.
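The activation functions named above are simple enough to write out directly. The NumPy sketch below shows sigmoid, softmax, and ReLU as they are commonly defined (the softmax subtracts the maximum before exponentiating for numerical stability); it is an illustration, not code from this project.

```python
import numpy as np

def sigmoid(x):
    """S-shaped squashing function with outputs in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    """Generalized logistic function: outputs are positive and sum to 1."""
    shifted = x - np.max(x)      # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)

def relu(x):
    """Rectified linear unit: passes positives, zeroes out negatives."""
    return np.maximum(0.0, x)

logits = np.array([2.0, 1.0, 0.1])
print(sigmoid(logits))
print(softmax(logits))              # sums to 1 across the three classes
print(relu(np.array([-1.0, 3.0])))  # -> [0. 3.]
```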
4.3 Deep Learning
Deep learning differs from traditional machine learning algorithms in that it learns features automatically. This reduces the need for feature engineering but requires longer training than traditional algorithms. Deep learning generally performs well in most artificial intelligence applications, primarily when a large dataset is available. Two main types of deep learning models are used here: the convolutional neural network and the recurrent neural network.
4.3.1 Convolutional Neural Network (CNN)
CNNs are mainly used to build models for image data, for example image classification, handwriting recognition, object detection, and so forth. Taking image data as input, the CNN model detects and extracts spatial and temporal dependencies from images through several convolutional layers with filters (kernels), pooling layers, a flattening layer, and fully connected layers. The output layer works the same as in a traditional neural network; activation functions such as softmax can be applied to implement multiclass classification in a CNN. A sample CNN model is shown in Figure 6.
Figure 6. CNN model[2]
The first layer in a CNN is the convolutional layer, which extracts features from the input images. Each image is represented as a 3D matrix of width × height × depth, where the depth is the number of image channels (e.g., three for RGB); after convolution, the output depth equals the number of filters. A filter is represented by a matrix whose weights are shared across the input, so different filters (a.k.a. kernels, producing feature maps) can perform operations such as edge detection, image blurring, and sharpening. Another parameter in the convolutional layer is the stride, the number of pixels the filter shifts over the input matrix. If the filter does not fit the input image exactly, a strategy called zero-padding can be applied to pad the image with zeros.
Another layer in a CNN is the pooling layer, which reduces the image size to help avoid overfitting: the overall dimensionality is decreased while the dominant features are retained. Three pooling methods are common: max pooling, average pooling, and sum pooling. Max pooling returns the largest element from the portion of the image covered by the filter, average pooling returns the average value of that portion, and sum pooling sums up all the values from the covered portion.
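The effect of filter size, stride, and zero-padding on the output dimension can be summarized with the standard size formula, out = (in − filter + 2 × padding) / stride + 1. The short sketch below evaluates it for a few illustrative layer sizes; the numbers are arbitrary examples, not layers from this project.

```python
# Sketch: output width of a convolution or pooling layer along one dimension.
# Standard formula: out = (in - filter + 2 * padding) // stride + 1.
def conv_output_size(in_size: int, filter_size: int, stride: int = 1, padding: int = 0) -> int:
    return (in_size - filter_size + 2 * padding) // stride + 1

# A 32-pixel-wide input with a 3-wide filter, stride 1, no padding -> 30.
print(conv_output_size(32, 3))             # 30
# Zero-padding of 1 keeps the width unchanged ("same" padding for a 3-wide filter).
print(conv_output_size(32, 3, padding=1))  # 32
# A 2-wide max pooling with stride 2 halves the width.
print(conv_output_size(32, 2, stride=2))   # 16
```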
The flattening layer reshapes the matrix into a flat vector and feeds it into a fully connected layer, which works like a traditional neural network.
4.3.2 Recurrent Neural Network (RNN)
Compared to the CNN, which performs well on image data, the recurrent neural network works with sequential and time-series data, in tasks such as machine translation, speech recognition, text classification, etc. An RNN has a notion of memory, storing the state of prior inputs to generate the subsequent output, so it can capture contextual information from the sequence. Figure 7 shows the structure of an RNN: information persists through loops in cell A, which allows information to be passed from one state of a cell to the next. In Figure 8, network A's loop is unrolled to pass a message to the following copy of the network. Hence, an RNN can retain short-term dependencies, but it struggles to persist information over long spans. A famous RNN architecture, Long Short-Term Memory (LSTM), is designed to remedy this short-term memory issue.
Figure 7. Working flow in RNN[3]
Figure 8. Unrolled RNN structure
Long Short-Term Memory (LSTM) has a working flow similar to the plain recurrent neural network; the differences lie in the design of the LSTM's cells. Figure 9 illustrates the structure of an LSTM cell, whose gate operations help the LSTM retain or discard information, and Figure 10 shows how information passes through the cell.
Figure 9. Cell structure in LSTM[4]
Figure 10. Information passing in LSTM
Another RNN variant is the Gated Recurrent Unit (GRU). It is similar to the LSTM but uses hidden states rather than a separate cell state to transfer information. In addition, a GRU has two gates, a reset gate and an update gate, instead of the three gates in the LSTM. Its cell structure is displayed in Figure 11, and a minimal Keras comparison of the two variants follows the figure.
Figure 11. The inner structure of GRU
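As a concrete point of comparison between the two recurrent variants, the sketch below builds one LSTM layer and one GRU layer in Keras on a made-up sequence shape. The input shape and unit counts are illustrative placeholders, not the configuration used later in this project.

```python
# Sketch: an LSTM layer versus a GRU layer in Keras on dummy sequence data.
# Input shape (timesteps=10, features=4) and unit counts are illustrative only.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(10, 4))
lstm_out = layers.LSTM(8, return_sequences=True)(inputs)  # keeps one output per timestep
gru_out = layers.GRU(8)(lstm_out)                         # returns only the final hidden state
model = keras.Model(inputs, gru_out)

print(model.predict(np.random.rand(2, 10, 4)).shape)  # (2, 8)
```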
CHAPTER FIVE: PROPOSED METHODS TO PREDICT THE SECONDARY STRUCTURE OF A PROTEIN
[1] https://www.w3schools.com/ai/ai_neural_networks.asp
[2] https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a5
[3] https://colah.github.io/posts/2015-08-Understanding-LSTMs/?source=post_page---------------------------
[4] https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21
5.1 Data Sets
Two datasets are used for protein secondary structure prediction: CB6133 and CB513. The first is the training dataset, and the second is our test dataset.
- CB6133: generated with PISCES CullPDB (Wang & Dunbrack, 2003). CB6133 contains 6133 non-homologous protein sequences with their structures. Sequences with more than 25% identity to CB513 were removed by filtering; the filtered CB6133 dataset has 5534 proteins, which are used as training data. Each protein is described by up to 700 amino acid positions with 57 features per position, so the input is reshaped to (5534 proteins × 700 amino acids × 57 features) to fit the deep learning models.
- CB513: a public benchmark dataset used for testing, produced by Zhou and Troyanskaya in 2014. One protein sequence in this dataset is longer than 700 amino acids and is separated into two overlapping sequences, so 514 rather than 513 protein sequences are used in testing. The test data dimension is (514 proteins × 700 amino acids × 57 features).
The 57 features in the CB6133 and CB513 datasets are described in Table 3. The letter X in the amino acid sequence denotes an unknown amino acid, and NoSeq marks the end of the protein sequence in both the amino acid residues and the secondary structure labels. Our project uses water solvent accessibility and the protein charge property as additional features in training and testing; a short loading and slicing sketch follows Table 3.
Table 3. Data Description
Index | Description
[0, 22) | Amino acid residues, in the order 'A', 'C', 'E', 'D', 'G', 'F', 'I', 'H', 'K', 'M', 'L', 'N', 'Q', 'P', 'S', 'R', 'T', 'W', 'V', 'Y', 'X', 'NoSeq'
[22, 31) | Secondary structure labels, in the order 'L', 'B', 'E', 'G', 'I', 'H', 'S', 'T', 'NoSeq'
[31, 33) | N-terminal and C-terminal
[33, 35) | Relative and absolute solvent accessibility
[35, 57) | Sequence profile
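Given the feature layout in Table 3, loading and slicing the CullPDB-style arrays reduces to NumPy indexing. The file name below is a placeholder and the reshape follows the (proteins × 700 × 57) description in this section; treat it as an assumed sketch of the preprocessing, not the project's exact script.

```python
import numpy as np

# Placeholder file name for the filtered CullPDB-style array described above.
data = np.load("cullpdb+profile_6133_filtered.npy")
data = data.reshape(-1, 700, 57)   # (proteins, residues, features)

residues  = data[:, :, 0:22]    # one-hot amino acids ('A'..'Y', 'X', 'NoSeq')
labels    = data[:, :, 22:31]   # one-hot eight-state labels plus 'NoSeq'
terminals = data[:, :, 31:33]   # N-terminal / C-terminal flags (charge property)
solvent   = data[:, :, 33:35]   # relative and absolute solvent accessibility
profile   = data[:, :, 35:57]   # sequence profile

print(residues.shape, labels.shape, terminals.shape, solvent.shape)
```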
5.2 CNN model
Some existing research uses the protein sequence alone to predict the secondary structure, with varying accuracy. In our CNN models, we utilize the amino acid sequences plus two additional protein properties as training features: water solvent accessibility and the amino acid charge property. Because the protein secondary structure has eight states, these eight states are used as the classes/labels in the multiclass classification. The following sections discuss the different CNN models built by adding extra features in the training phase. Figure 12 displays the CNN architecture with the protein sequence only.
Figure 12. CNN architecture with sequence only
5.2.1 CNN with Water Solvent Accessibility
In this model, we first add water solvent accessibility as an additional feature; its two values are Relative Solvent Accessibility (RSA) and Absolute Solvent Accessibility (ASA). Figure 13 shows the structure of this first extended CNN model, and a small concatenation sketch follows the figure.
Figure 13. CNN architecture using water solvent accessibility
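One way to read "adding water solvent accessibility as an extra feature" is to concatenate the two accessibility columns onto the 21-dimensional residue encoding along the feature axis, growing the per-residue input from 21 to 23 values. The sketch below shows that concatenation with NumPy; the shapes mirror Section 5.1, but the arrays themselves are random placeholders, not the real data pipeline.

```python
import numpy as np

# Dummy stand-ins shaped like the real inputs: 5534 proteins, 700 residues each.
seq_features = np.random.rand(5534, 700, 21)   # amino acid encoding
solvent      = np.random.rand(5534, 700, 2)    # RSA and ASA columns

# The extra property is appended along the feature axis: 21 -> 23 features per residue.
cnn_input = np.concatenate([seq_features, solvent], axis=-1)
print(cnn_input.shape)  # (5534, 700, 23)
```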
5.2.2 CNN with Charge Property
Another additional feature used in the CNN model is the protein charge property. The N/C terminals represent the positive and negative charge properties of the amino acids. The structure of the CNN model with the charge property is displayed in Figure 14.
Figure 14. CNN architecture using charge property
5.2.3 CNN with Two Properties
We also combined two properties described in the above sections into another CNN model. This new CNN model contains two additional features. The structure is presented in Figure 15.
Figure 15. CNN architecture using two properties together
5.3 RNN model
Similar to the CNN models, we also apply the protein properties, water solvent accessibility and protein charge, to the RNN model, specifically an LSTM. Figure 16 presents the plain LSTM model without any protein properties.
Figure 16. RNN architecture with sequence only
5.3.1 RNN with Water Solvent Accessibility
The water solvent accessibility is used as an extra feature in the LSTM. The model is shown in Figure 17.
Figure 17. RNN architecture using water solvent accessibility
5.3.2 RNN with Charge Property
The protein charge property can likewise be used as another feature along with the amino acid sequence. This model is displayed in Figure 18.
Figure 18. RNN architecture using charge property
5.3.3 RNN with Two Properties
The last RNN model uses two protein properties together. Figure 19 illustrates this model.
Figure 19. RNN architecture using two properties together
CHAPTER SIX: EXPERIMENT AND RESULTS DISCUSSION
6.1 Experiment Environment
Our project works with two models, a CNN and an RNN. Both models use the CB6133 dataset for training and the CB513 dataset for testing.
Python provides various libraries/packages for machine learning and data science, and the project is built with Python 3.8 in the PyCharm development environment. The following libraries are used in this project.
- Tensorflow-2.6.0
- Keras-2.6.
- Scikit-learn-0.24.2
- Numpy-1.19.5
- Scipy-1.7.1
- Matplotlib-3.4.3
All the experiments run on a computer with the following hardware:
- Operating system: macOS Big Sur 11.6.1
- Memory: 8 GB
- Processor: 2.4 GHz Dual-core Intel i5 CPU
6.2 Experiment on CNN
For the CNN model, the training parameters are listed below; a minimal Keras sketch of this configuration follows the list.
- Sequence length: 700
- Number of features: 21
- Learning rate: 0.0009
- Dropout: 0.38
- Batch size: 64
- Epochs: 30
- Loss function: categorical_crossentropy
- Activation function: ReLU
- Pooling: max_pooling
- Metrics: accuracy
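The exact layer stack is shown in Figure 12; as a rough, assumed approximation of that configuration, the Keras sketch below wires a one-dimensional convolutional model with the hyperparameters listed above. The filter counts and kernel sizes are guesses for illustration, not the architecture used in the actual experiments.

```python
# Rough Keras sketch of a 1D CNN for per-residue secondary structure
# classification, using the hyperparameters listed above. Filter counts and
# kernel sizes are illustrative assumptions, not the exact model in Figure 12.
from tensorflow import keras
from tensorflow.keras import layers

def build_cnn(seq_len=700, n_features=21, n_classes=9):
    # n_classes = eight secondary-structure states plus the 'NoSeq' padding label
    inputs = keras.Input(shape=(seq_len, n_features))
    x = layers.Conv1D(64, 11, padding="same", activation="relu")(inputs)
    x = layers.MaxPooling1D(pool_size=2, padding="same")(x)
    x = layers.Dropout(0.38)(x)
    x = layers.Conv1D(64, 7, padding="same", activation="relu")(x)
    x = layers.UpSampling1D(size=2)(x)  # restore the 700-step length for per-residue labels
    outputs = layers.Conv1D(n_classes, 1, activation="softmax")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.0009),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

model = build_cnn()
model.summary()
# Training would then use the listed batch size and epochs, e.g.:
# model.fit(x_train, y_train, batch_size=64, epochs=30, validation_split=0.1)
```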
Each training epoch for the CNN model takes around 22 minutes; in total, training one CNN model takes approximately 11 hours.
The model accuracy and model loss for training and validation are presented in Figure 20.
Figure 20. Training performance of CNN without using any properties
6.2.1 Results with Water Solvent Accessibility
This model is the same as the first except that the number of features is 23, because RSA and ASA are added. The training accuracy and loss are displayed in Figure 21.
Figure 21. Training performance of CNN using water solvent accessibility
6.2.2 Results with Charge Property
This model is the same as the first except that the number of features is 23, because the N-terminal and C-terminal flags are added. Figure 22 shows the training performance for this model.
Figure 22. Training performance of CNN using charge property
6.2.3 Results with Two Properties
This model is the same as the first except that the number of features is 25, because the N/C terminals and ASA/RSA are all added. The training performance is presented in Figure 23.
Figure 23. Training performance of CNN using two properties together
6.3 Experiment on RNN
For the RNN model, the training parameters are listed below; a minimal Keras sketch of this configuration follows the list.
- Sequence length: 700
- Number of features: 21
- Learning rate: 0.0009
- Batch size: 64
- Epochs: 30
- Optimizer: adam
- Loss function: categorical_crossentropy
- Activation function: softmax
- Metrics: accuracy
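As with the CNN, the actual layer stack is given in Figure 16; the sketch below is an assumed approximation using the parameters above. The hidden size and the use of a bidirectional wrapper are illustrative choices, not necessarily the configuration in the real experiments.

```python
# Rough Keras sketch of an LSTM model for per-residue secondary structure
# classification, using the hyperparameters listed above. Hidden size and the
# bidirectional wrapper are illustrative assumptions, not Figure 16 exactly.
from tensorflow import keras
from tensorflow.keras import layers

def build_rnn(seq_len=700, n_features=21, n_classes=9):
    inputs = keras.Input(shape=(seq_len, n_features))
    x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(inputs)
    outputs = layers.TimeDistributed(layers.Dense(n_classes, activation="softmax"))(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model

model = build_rnn()
model.summary()
# Training would then use the listed batch size and epochs, e.g.:
# model.fit(x_train, y_train, batch_size=64, epochs=30, validation_split=0.1)
```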
Each training epoch for the RNN model usually takes 8 minutes; in total, training one RNN model takes around 3.5 hours.
The model accuracy and model loss for training and validation are presented in Figure 24.
Figure 24. Training performance of RNN without using any properties
6.3.1 Results with Water Solvent Accessibility
This model is the same as the first except that the number of features is 23, because RSA and ASA are added. The training accuracy and loss are displayed in Figure 25.
Figure 25. Training performance of RNN using water solvent accessibility
6.3.2 Results with Charge Property
This model is the same as the first except that the number of features is 23, because the N-terminal and C-terminal flags are added. Figure 26 shows the training performance for this model.
Figure 26. Training performance of RNN using charge property
6.3.3 Results with Two Properties
This model is the same as the first except that the number of features is 25, because the N/C terminals and ASA/RSA are all added. The training performance is presented in Figure 27.
Figure 27. Training performance of RNN using two properties together
The accuracy achieved by the CNN and RNN models on the test data, CB513, is shown in Table 4.
Table 4. Overall performance on all models
Features used | CNN | RNN
21 amino acid sequence | 0.0670 | 0.904
21 amino acid sequence + water solvent accessibility | 0.071 | 0.895
21 amino acid sequence + N/C terminals | 0.073 | 0.906
21 amino acid sequence + water solvent accessibility + N/C terminals | 0.074 | 0.898
Based on the results obtained from both models, the RNN required less training time than the CNN, and every RNN model's accuracy is better than that of any CNN model.
CHAPTER SEVEN: CONCLUSION AND FUTURE WORK
7.1 Conclusion
Proteins are essential to living bodies. Understanding protein structure helps explain how proteins function in the human body, such as storing nutrients. Gene databases hold enormous numbers of sequences; GenBank, for example, has more than 2 billion sequences, while only about 160K protein structures have currently been deposited into the RCSB. The biggest challenges in determining protein structures experimentally are cost and time: solving a new and unique structure usually costs around $100,000. So after a new protein sequence is discovered, obtaining an experimental structure at a reasonable cost is difficult. Therefore, accurately predicting the 3D structure of a protein is a popular research area in computational biology.
A protein structure can be divided into four levels: primary, secondary, tertiary, and quaternary. The secondary structure works like a bridge linking the primary structure with the higher-level structures. Therefore, most current research focuses on predicting the secondary structure from the primary structure, that is, from the protein's amino acid sequence. Our project also predicts the protein secondary structure from the amino acid sequence using deep learning frameworks. Compared to most existing work, we do not use the amino acid sequence alone to build the deep learning models; instead, we add two other protein properties during training, water solvent accessibility and the protein charge property.
We have tried several deep learning models, and their performances differ. Overall, the CNN models have low accuracy, but it is notable that using the extra features along with the amino acid sequence achieved about 1% higher accuracy than using the sequences alone. In contrast, all the RNN models reach accuracy comparable to most current deep learning models; among them, the RNN models that include the extra protein properties performed best on the test data. More importantly, the RNN models need less training time than the CNN models.
7.2 Future Work
Given the poor CNN performance on the test data, there are a few possible ways to improve it. First, we will try a simpler CNN structure. Second, we will try different loss functions and a lower learning rate. We may also combine CNN and RNN. Last but most important, we will collect more training data if it is available.