1 Introduction

Protein subcellular localization is a central topic in proteomics research. Rapid advances in molecular biology and computer science in recent years have made it possible to determine the subcellular locations of proteins by computational methods. This chapter introduces background knowledge about proteins and their subcellular locations, and explains the significance of protein subcellular localization prediction.

1.1 Proteins and their subcellular locations

Proteins, which are essential biological macromolecules for organisms, consist of one or more chains of amino acid residues encoded by genes. Proteins occur in great variety and exist in all cells and in all parts of cells. Moreover, they exhibit a tremendous diversity of biological functions and participate in virtually every process within cells. For example, enzymes are proteins that catalyze most of the reactions involved in metabolism; membrane proteins act as receptors for cell signaling, i.e. they bind a signaling molecule and induce a biochemical response in the cell [23]; antibodies are proteins that are mainly responsible for identifying and neutralizing foreign objects such as bacteria or viruses in the immune system; cell adhesion proteins such as selectins, cadherins, or integrins [81] bind a cell to a surface or substrate, which is essential for the pathogenesis of infectious organisms; and digestive enzymes play important roles in chemical digestion, breaking down food into small molecules that the body can use.

Most of the biological activities performed by proteins occur in organelles. An organelle is a cellular component or subcellular location within a cell that has specific functions. Figure 1.1 illustrates some organelles in a typical eukaryotic cell. In eukaryotic cells, major organelles include cytoplasm, mitochondria, chloroplast, nucleus, extracellular space, endoplasmic reticulum (ER), Golgi apparatus, and plasma membrane. The cytoplasm takes up most of the cell volume and is where most cellular activities, such as cell division and metabolic pathways, occur. A mitochondrion is a membrane-bound organelle found in most eukaryotic cells. It is mainly responsible for supplying energy for cellular activities. A chloroplast is an organelle that exists in plant or algal cells. Its role is to conduct photosynthesis, storing energy from sunlight. The nucleus is a membrane-enclosed organelle containing most of the genetic material of a cell. Its main function is to control the activities of the cell by regulating gene expression. Extracellular space refers to the space outside the plasma membrane, which is occupied by fluid. The ER is an organelle that forms an interconnected membranous network of cisternae; it folds protein molecules within the cisternae and transports synthesized proteins to the Golgi apparatus. The Golgi apparatus is an organelle that is particularly important for cell secretion. The plasma membrane, or cell membrane, is a biological membrane that separates the intracellular environment from extracellular space. Its basic function is to protect the cell from its surroundings.

Fig. 1.1: Organelles or subcellular locations in a typical eukaryotic cell. Major eukaryotic organelles include cytoplasm, mitochondria, chloroplast, nucleus, extracellular space, endoplasmic reticulum, Golgi apparatus, and plasma membrane.

Proteins are also located in the peroxisome, vacuole, cytoskeleton, nucleoplasm, lysosome, acrosome, cell wall, centrosome, cyanelle, endosome, hydrogenosome, melanosome, microsome, spindle pole body, synapse, etc. Viral proteins are usually located within host cells and are distributed across subcellular locations such as host cytoplasm, host cell membrane, host ER, host nucleus, and viral capsid.

1.2 Why computationally predict protein subcellular localization?

As a central topic in proteomics research and molecular cell biology, protein subcellular localization is critically important for protein function annotation, drug target discovery, and drug design [4, 9, 176]. To cope with the exponentially growing number of newly found protein sequences in the postgenomic era, computational methods were developed to assist biologists in dealing with large-scale protein subcellular localization.

1.2.1 Significance of the subcellular localization of proteins

Proteins must be located in the appropriate physiological context within a cell to exert their biological functions. Subcellular localization is essential to protein function and has been suggested as a means to maximize functional diversity and economize on protein design and synthesis [25]. Aberrant protein subcellular localization is closely correlated with a broad range of human diseases, such as Alzheimer’s disease [106], kidney stones [102], primary human liver tumors [111], breast cancer [30], preeclampsia [117], and Bartter syndrome [87]. Knowing where a protein resides within a cell can give insights for drug target identification and design [42, 130].

1.2.2 Conventional wet-lab techniques

Although many proteins are synthesized in the cytoplasm, how proteins are transported to specific cellular organelles often remains unclear. Conventional wet-lab methods use genetic engineering techniques to assess the subcellular locations of proteins. There are three main wet-lab techniques:

  1. Fluorescent microscopy imaging. This technique creates a fusion protein consisting of the natural protein of interest linked to a “reporter”, such as green fluorescent proteins [196]. The subcellular position of the fused protein can be clearly and efficiently visualized using microscopy [239].
  2. Immunoelectron microscopy. This technique is regarded as a gold standard and uses antibodies conjugated with colloidal gold particles to provide high-resolution localization of proteins [139].
  3. Fluorescent tagging with biomarkers. This technique requires the use of known compartmental markers for regions such as mitochondria, chloroplasts, plasma membrane, Golgi apparatus, ER, etc. It uses fluorescently tagged versions of these markers, or antibodies to known markers, to identify the localization of a protein of interest [135].

Wet-lab experiments are the gold standard for validating subcellular localization and are essential for building high-quality localization databases such as The Human Protein Atlas.

1.2.3 Computational prediction of protein subcellular localization

Although the wet-lab experiments described in Section 1.2.2 can be used to determine the subcellular localization of a protein, relying solely on them is costly, time-consuming, and laborious. With the avalanche of newly discovered protein sequences in the postgenomic era, large-scale localization of proteins within cells by conventional wet-lab techniques is neither practical nor tractable. Table 1.1 shows the growth of protein sequences in the UniProt Database over the ten years from 2004 to 2014. The UniProt Database comprises two databases: Swiss-Prot, whose protein sequences are reviewed, and TrEMBL, whose protein sequences are not reviewed. The number of entries in Swiss-Prot was only 137,916 in 2004 and rose to 542,503 in 2014, which means that the number of reviewed protein sequences nearly quadrupled in just ten years. More importantly, during the same period the number of unreviewed protein sequences increased almost 59-fold, from 895,002 in 2004 to 52,707,211 in 2014. Unreviewed protein sequences are thus accumulating significantly faster than reviewed ones. Moreover, the ratio of reviewed to unreviewed protein sequences has widened remarkably, from about 1 : 6 to 1 : 97, so the gap between reviewed sequences and discovered but unreviewed sequences keeps growing. Therefore, using wet-lab experiments alone to determine the subcellular localization of such a huge number of protein sequences amounts to a “mission impossible”.

Table 1.1: Growth of protein sequences in the UniProt Database. The UniProt Database includes the Swiss-Prot Database, whose protein sequences have been reviewed, and the TrEMBL Database, whose protein sequences have not been reviewed. Note that as of March 23, 2010 the UniProt release numbers have been changed to the “year/month” format.

Date          UniProt release   No. of sequence entries
                                Swiss-Prot   TrEMBL       UniProt
02/Feb/2004   1.2               137,916      895,002      1,032,918
15/Feb/2005   4.1               166,613      1,389,215    1,555,828
07/Feb/2006   7.0               204,930      2,042,049    2,246,979
06/Feb/2007   9.6               255,667      3,078,259    3,333,926
05/Feb/2008   12.8              347,458      4,776,500    5,123,958
10/Feb/2009   14.8              408,238      6,592,465    7,000,703
09/Feb/2010   15.14             512,824      9,749,524    10,262,348
08/Feb/2011   2011_02           523,646      12,857,824   13,381,470
22/Feb/2012   2012_02           534,395      19,547,369   20,081,764
06/Feb/2013   2013_02           539,045      29,468,959   30,008,004
19/Feb/2014   2014_02           542,503      52,707,211   53,249,714
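
The growth figures drawn from Table 1.1 can be checked with a few lines of arithmetic. The Python sketch below is illustrative only: the counts are copied from the 2004 and 2014 rows of the table, while the variable and function names are hypothetical and do not come from the original text.

```python
# Sequence counts copied from the first and last rows of Table 1.1.
# The names below are illustrative assumptions, not part of the source.
swiss_prot = {2004: 137_916, 2014: 542_503}    # reviewed entries
trembl = {2004: 895_002, 2014: 52_707_211}     # unreviewed entries


def growth_factor(counts, start=2004, end=2014):
    """Return how many times larger the count is at `end` than at `start`."""
    return counts[end] / counts[start]


def unreviewed_per_reviewed(year):
    """Return the number of TrEMBL entries per Swiss-Prot entry in `year`."""
    return trembl[year] / swiss_prot[year]


swiss_growth = growth_factor(swiss_prot)
trembl_growth = growth_factor(trembl)

print(f"Swiss-Prot growth 2004-2014: {swiss_growth:.2f}x")
print(f"TrEMBL growth 2004-2014:     {trembl_growth:.2f}x")
for year in (2004, 2014):
    print(f"{year} reviewed : unreviewed = 1 : {unreviewed_per_reviewed(year):.1f}")
```

The two growth factors come out at roughly 3.9 and 58.9, and the ratios at roughly 1 : 6.5 and 1 : 97, which is consistent with the rounded figures quoted in the text.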

Under such circumstances, computational methods are required to assist biologists in dealing with large-scale proteomic data for determining the subcellular localization of proteins. With the rapid progress of machine learning, coupled with an increasing number of proteins with experimentally determined localization, accurate prediction of protein subcellular localization by computational methods has become achievable and promising.

A protein has four hierarchical levels of structure: (1) primary structure, or the amino acid sequence; (2) secondary structure, or regularly repeating local structures, such as α-helix, β-sheet, and turns; (3) tertiary structure, or the overall shape of a single protein molecule; and (4) quaternary structure, or the structure formed by several protein molecules. Since the primary structure, namely the amino acid sequence, is the easiest to obtain with high-throughput sequencing technologies, protein subcellular localization prediction usually refers to the problem of determining in which part of a cell a protein resides, given the amino acid sequence of the protein. In other words, a computational method for protein subcellular localization amounts to designing a model, or predictor, that takes the amino acid sequence of a query protein as input and produces the subcellular location(s) of the protein as output.
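
The input/output view of a predictor described above can be sketched in Python. This is a minimal illustration under stated assumptions, not any predictor from this book: amino acid composition stands in for a real feature extractor, `dummy_scorer` is a placeholder that returns fixed, meaningless scores instead of a trained model, the query string is arbitrary, and all names and the threshold rule are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List, Set

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard residues


def amino_acid_composition(sequence: str) -> List[float]:
    """Return the fraction of each standard amino acid in the sequence."""
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# A scorer stands in for any trained model: features in, one score per location out.
Scorer = Callable[[List[float]], Dict[str, float]]


def predict_locations(sequence: str, scorer: Scorer, threshold: float = 0.5) -> Set[str]:
    """Map an amino acid sequence to one or more subcellular location labels."""
    scores = scorer(amino_acid_composition(sequence))
    if not scores:
        raise ValueError("scorer returned no location scores")
    chosen = {loc for loc, score in scores.items() if score >= threshold}
    # Fall back to the single best-scoring location if none passes the threshold.
    return chosen or {max(scores, key=scores.get)}


def dummy_scorer(features: List[float]) -> Dict[str, float]:
    """Hypothetical placeholder for a trained model; the scores are made up."""
    return {"nucleus": 0.7, "cytoplasm": 0.6, "mitochondrion": 0.1}


# Arbitrary query string used only to exercise the interface.
print(sorted(predict_locations("MKTAYIAKQR", dummy_scorer)))
```

Returning a set rather than a single label lets the same interface cover both single-location and multi-location proteins.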

1.3 Organization of this book

The next chapter reviews the computational methods for protein subcellular localization prediction proposed in the past decades and points out their limitations. Chapter 3 details the legitimacy of using gene ontology (GO) information for predicting the subcellular localization of proteins. Chapter 4 then proposes two predictors, GOASVM and FusionSVM, both based on GO information, for single-location protein subcellular localization. Chapter 5 turns to multi-location protein subcellular localization and introduces several multi-label predictors, including mGOASVM, AD-SVM, and mPLR-Loc, which were developed on different classifiers for accurate prediction of the subcellular localization of both single- and multi-location proteins. Next, Chapter 6 presents two predictors, SS-Loc and HybridGO-Loc, that exploit the deep information embedded in the hierarchical structure of the GO database by incorporating semantic similarity over GO terms. Chapter 7 introduces ensemble random projection for large-scale protein subcellular localization and uses it to construct two dimension-reduced multi-label predictors, RP-SVM and R3P-Loc. In addition, two compact databases (ProSeq and ProSeq-GO) are proposed to replace the conventional databases (Swiss-Prot and GOA) for fast and efficient feature extraction. Chapter 8 details the experimental setup, including dataset construction and performance metrics. Extensive experimental results and analyses for all the proposed predictors are given in Chapter 9. Further discussions are provided in Chapter 10. The book ends with a conclusion in Chapter 11.

To allow other researchers to use the proposed predictors, several online web servers have been developed and are detailed in Appendix A. An introduction to support vector machines is provided in Appendix B. A complementary proof that no bias arises during performance measurement with leave-one-out cross-validation is provided in Appendix C. The derivation for penalized logistic regression is provided in Appendix D.
