Chapter 28

Knowledge Discovery in Proteomic Mass Spectrometry Data

Michael Netzer¹; Michael Handler¹; Bernhard Pfeifer¹; Andreas Dander²; Christian Baumgartner¹,³

¹ Institute of Electrical and Biomedical Engineering, UMIT, Hall in Tirol, Austria
² Division for Bioinformatics, Biocenter, Innsbruck Medical University, Innsbruck, Austria
³ Institute of Health Care Engineering with European Notified Body of Medical Devices, Graz University of Technology, Graz, Austria

Abstract

High-throughput technologies such as mass spectrometry produce large amounts of data that require sophisticated computational methods to preprocess and identify highly discriminatory features (biomarker candidates) from these data. In this chapter, a computational workflow for the search and identification of biomarker candidates using mass spectrometry data is presented. First, preprocessing steps necessary to transform raw spectra into comparable data sets are described, followed by a novel three-step feature selection approach that combines the advantages of efficient filter and effective wrapper techniques. The proposed workflow has been integrated into the Knowledge Discovery in Databases (KD³) Designer tool, our self-designed and cost-free software package. One of the main advantages of this tool is its straightforward design, which visualizes processing steps that can be easily connected to workflows. Due to its modular software architecture, new algorithms can be readily implemented into the system. This analysis strategy was evaluated using an example mass spectrometry data set.

Keywords

Proteomics; mass spectrometry; data preprocessing; feature selection; biomarker identification; computational workflow

1 Introduction

In proteomics, mass spectrometry (MS) allows the identification of hundreds to thousands of proteins in cells, tissues, and biofluids (Gerszten et al., 2011). Changes in the concentration of proteins may indicate pathologic processes. Very recent examples include new biomarkers in coronary artery disease (Lee et al., 2015), breast (Suh et al., 2012; Bouchal et al., 2013), bladder (Lindén et al., 2012), liver (Poté et al., 2013), or prostate cancer (Pallua et al., 2013; Mantini et al., 2007), pancreatic beta cell injury (Brackeva et al., 2015), as well as in neurodegenerative diseases such as Alzheimer’s disease (Ringman et al., 2012).

In general, the biomarker discovery process includes several steps: experimental study design and execution, sample collection, preparation and separation, MS analysis, biomarker identification, and biological interpretation and validation (Handler et al., 2011).

After the raw MS data is available, a technical review is necessary due to background signals caused by electronic and chemical noise or ions from unknown fragments in the sample (Cerqueira et al., 2009). To treat background signals, sophisticated computational methods for preprocessing raw data are necessary to provide quality-assured data for further analysis. Such methods include baseline correction, normalization, and quality assessment.

In the next analysis step, highly discriminatory biomarker candidates—at this stage presented by mass-to-charge (m/z) ratio values in the spectra—are identified from the preprocessed spectra. However, the bioinformatic-driven search for relevant markers is challenging, as the spectra are characterized by a huge number of features (hundreds or thousands of m/z values; Osl et al., 2008).

Basically, features can be identified either by calculating a specific score indicating the predictive value of features and selecting those that have a score beyond a certain threshold (filter approaches; John et al., 1994) or by searching the space of feature sets in combination with a learning strategy (e.g., a classifier) to select a set of highly discriminating candidates (the wrapper approach; Kohavi and John, 1998). Because of the huge number of possible feature sets, wrapper approaches have higher computational costs, but they typically yield a higher performance than filter approaches. In addition, the combination (pooling) of feature selection methods has been proposed, as described in Netzer et al. (2009) or Saeys et al. (2007). Recently, network-based methods have also been suggested for the identification of metabolic biomarker candidates (Netzer et al., 2011), taking into account the kinetics of circulating analytes.

In this chapter, we present a comprehensive computational workflow for the identification of biomarker candidates using proteomic mass spectrometry data. The first step of the workflow is comprised of complex data preprocessing modalities necessary to transform “noisy” raw spectra into adjusted spectra, followed by a three-step feature selection approach for biomarker identification that combines the advantages of the efficient filter and effective wrapper techniques. This analysis strategy was integrated into the software package, which has been published as the Knowledge Discovery in Databases (KD³) Designer by our group (Dander et al., 2011).

The chapter is structured as follows: Section 2, “Technical background,” gives a brief survey of MS profiling technologies. In Section 3, “Computational workflow,” we describe the proposed bioinformatic approach for proteomic biomarker identification. In Section 4, “Analysis tool,” we present KD³ as a software tool for data analysis facilitating high usability of all processing steps. Section 5 presents a brief conclusion.

2 Technical background

This section delineates the technical background of MS-based profiling technologies, which have become a key analysis platform in proteomics. In general, a mass spectrometer consists of three components (Aebersold and Mann, 2003): (i) an ion source, (ii) a mass analyzer that measures the mass-to-charge (m/z) ratio of the ionized molecules, and (iii) a detector that counts the ions at each m/z value.

Matrix-assisted laser desorption/ionization (MALDI; Karas and Hillenkamp, 1988) and electrospray ionization (ESI; Fenn et al., 1989) are two commonly used methods to produce ions from macromolecules such as proteins (Aebersold and Mann, 2003).

In this chapter, we focus on MALDI technology and data. MALDI is generally used to ionize dry samples, whereas ESI is also coupled with liquid-based separation platforms (Aebersold and Mann, 2003). In MALDI, a laser and an ultraviolet (UV)–absorbing chemical compound are used to vaporize and ionize the analytes (Parker et al., 2010). MALDI is usually coupled with time-of-flight (TOF) analyzers that measure the mass of intact peptides based on the TOF of molecules in an electric field. Finally, a detector amplifies and counts the arriving ions (Aebersold and Mann, 2003). The resulting spectrum presents mass intensities over m/z values; however, mass signals in the spectrum (“true” peaks) can be contaminated by diverse chemical and physical noise (Satten et al., 2004; Mantini et al., 2007). These disturbances in the signal may lead to baseline drift, which is the trend of the signal generated if no material was introduced, and to background noise caused by electronic disturbances and fragments varying randomly over small mass ranges (Mantini et al., 2007).

In the next section, we present a coupled computational workflow to denoise raw mass spectra and to identify putative biomarker candidates in the adjusted data.

3 Computational workflow

In this section, a workflow for data preprocessing and proteomic biomarker identification is presented, which has been recently introduced to identify highly discriminatory masses when comparing diseased and harmless forms of samples (Handler et al., 2011; Pallua et al., 2013). All steps of this workflow were implemented as plug-ins for the software package KD³ (Dander et al., 2011), providing a user-friendly assembly of the different processing steps to configurable workflows.

This discussion aims to provide additional information about our previously published workflow (Handler et al., 2011; Pallua et al., 2013), together with a description of the parameters that need to be defined for the modules implemented in KD³.

3.1 Preprocessing

In order to compare mass spectra for the identification of proteomic biomarkers, multiple steps need to be performed to standardize all given spectra to a common format. Therefore, a preprocessing pipeline is used for the transformation of data into a proper format for analysis (shown in Figure 28.1) (Handler et al., 2011).

Figure 28.1 Preprocessing and evaluation steps. Gray boxes indicate the direct use of algorithms of the R/Bioconductor (R Development Core Team, 2009) library PROcess (Li, 2005). Note that binning, adjustment, and normalization need to be repeated after the alignment step for equally spaced, consistent, and normalized spectra.

Some of the modules in this workflow use the PROcess library (Li, 2005) of the Bioconductor R package (R Development Core Team, 2009). For the integration of processing steps into KD³ submodules (functional objects), a parallel execution of the algorithms was made possible, which can efficiently compute numerous MALDI spectra on systems with multiple cores, for example, on a computational cluster or workstation. The figures demonstrating the preprocessing results were created using the JFreeChart library (JFreeChart, 2012).

Binning of m/z values

Due to unequally distributed m/z values in the compared spectra, a direct comparison between intensities of different spectra is difficult. In our approach, therefore, we included a binning step to map the intensities of the given spectra onto equally distributed m/z bins. In particular, the dimension (number of m/z values) of high-resolution spectra can be reduced using this step (Ressom et al., 2005).

Using the Binning module, the width w of the resulting bins and the method for calculating the bin representative can be selected. The intensity value I(x) of the bin representative at m/z value x is calculated as the mean or maximum of all intensities in the range $[x - \frac{w}{2}; x + \frac{w}{2})$. Furthermore, the user can choose whether the intensities of empty bins should be linearly interpolated from neighboring intensities or set to 0.

After setting these parameters, spectra can be assigned to this module. As a result, the Binning module delivers spectra with equally distributed m/z bins (see Figure 28.2).
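The binning rule can be illustrated with a minimal Python sketch. It is not the KD³ Binning module; the function name `bin_spectrum`, its arguments, and the fallback to 0 for empty bins without a neighbor on both sides are assumptions made for illustration only.

```python
import math

def bin_spectrum(mz, intensity, width, mz_start, mz_end, method="mean", fill="zero"):
    """Map a spectrum onto bins centred at mz_start, mz_start + width, ...

    Each bin with centre x collects intensities in [x - w/2, x + w/2).
    method: "mean" or "max" (bin representative).
    fill:   "zero" or "interpolate" (treatment of empty bins).
    """
    n_bins = int(math.floor((mz_end - mz_start) / width)) + 1
    buckets = [[] for _ in range(n_bins)]
    for x, y in zip(mz, intensity):
        idx = int(math.floor((x - mz_start) / width + 0.5))
        if 0 <= idx < n_bins:
            buckets[idx].append(y)

    # Bin representatives; None marks an empty bin.
    reps = []
    for bucket in buckets:
        if not bucket:
            reps.append(None)
        elif method == "max":
            reps.append(max(bucket))
        else:
            reps.append(sum(bucket) / len(bucket))

    filled = list(reps)
    for k, value in enumerate(reps):
        if value is not None:
            continue
        if fill == "zero":
            filled[k] = 0.0
            continue
        # Nearest non-empty neighbours to the left and right.
        left = next((i for i in range(k - 1, -1, -1) if reps[i] is not None), None)
        right = next((i for i in range(k + 1, n_bins) if reps[i] is not None), None)
        if left is None or right is None:
            filled[k] = 0.0  # assumption: no neighbour on one side, use 0
        else:
            t = (k - left) / (right - left)
            filled[k] = reps[left] + t * (reps[right] - reps[left])

    centres = [mz_start + k * width for k in range(n_bins)]
    return centres, filled
```

Calling the function with the same `width`, `mz_start`, and `mz_end` for every spectrum yields intensity vectors that can be compared index by index.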

Figure 28.2 Binning: The m/z values of mass intensities differ between the two plots (gray lines). After binning, the mass intensities of both spectra are located on equally distributed m/z bins (black lines).

Adjustment of m/z ranges

Spectra with different m/z ranges cannot be matched with each other over the whole m/z domain. To use most of the spectra for further preprocessing, the adjustment step is applied, which removes all values from the spectra outside the boundary of the highest common minimal m/z value x_min and the lowest common maximum m/z value x_max of all spectra.

The Adjustment module allows the user to set the values x_min and x_max manually. The user can also decide whether spectra that do not cover the defined boundaries are kept for further preprocessing. If, for example, a spectrum starts at a higher m/z value than x_min, it can be filtered out in this step. The lower and upper m/z bounds of the remaining spectra are equalized by removing all m/z bins outside the defined boundaries. After this step, the m/z range of all spectra is therefore equal and consistent (see Figure 28.3).
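A possible reading of the adjustment step is sketched below in Python. The function name `adjust_ranges` and the data layout (a list of `(mz, intensity)` pairs with ascending m/z values) are hypothetical, and the sketch always drops spectra that do not cover the boundaries, whereas the module leaves this choice to the user.

```python
def adjust_ranges(spectra, x_min=None, x_max=None):
    """Trim all spectra to a common m/z range.

    spectra: list of (mz_list, intensity_list), m/z values ascending.
    x_min / x_max: manual boundaries; if omitted, the highest common
    minimum and the lowest common maximum of all spectra are used.
    """
    if x_min is None:
        x_min = max(mz[0] for mz, _ in spectra)
    if x_max is None:
        x_max = min(mz[-1] for mz, _ in spectra)

    adjusted = []
    for mz, intensity in spectra:
        # Assumption: a spectrum that does not span [x_min, x_max] is dropped.
        if mz[0] > x_min or mz[-1] < x_max:
            continue
        kept = [(x, y) for x, y in zip(mz, intensity) if x_min <= x <= x_max]
        adjusted.append(([x for x, _ in kept], [y for _, y in kept]))
    return adjusted, x_min, x_max
```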

Figure 28.3 Adjustment: The dashed lines represent the position of the common minimal and common maximal m/z values of the three given spectra. Adjusted spectra with m/z values within the two boundaries are used for further processing.

Baseline subtraction

Chemical noise in the energy-absorbing molecule solution and ion overload can lead to an elevated baseline within the spectra (Li et al., 2005). Different algorithms are available for effective removal of baselines from spectra. For example, Ressom et al. (2005) used a spline approximation on local minima of a spectrum to estimate its baseline. In this work, the baseline subtraction algorithm described by Li et al. (2005) was implemented, which estimates the baseline by local regression to the points below a certain quantile or to local minima of a moving window. After the baseline is estimated, it is subtracted from the original spectrum. The Baseline subtraction module allows the configuration of parameters, which are directly passed to the bslnoff operation of the PROcess package of R (Li, 2005). The user can set parameters such as the number of breaks on the log m/z scale for finding the local minima or the intensity values below a defined quantile that are needed for the local regression. The user can choose between local regression and linear interpolation for smoothing the estimated baseline, and can also specify the bandwidth for the local regression method. Figure 28.4 depicts a baseline-corrected spectrum returned by this module.
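The idea of estimating a baseline from local minima and subtracting it can be sketched in a few lines of Python. This is a simplified stand-in, not the bslnoff operation of PROcess: it uses fixed-size index windows instead of breaks on the log m/z scale, supports only the linear interpolation variant, and clips negative results to 0. All names are hypothetical.

```python
def estimate_baseline(intensity, window):
    """Baseline from the minimum of each consecutive window of `window`
    points, linearly interpolated between the positions of these minima."""
    anchors = []  # (index, value) of the minimum of each window
    for start in range(0, len(intensity), window):
        chunk = intensity[start:start + window]
        offset = min(range(len(chunk)), key=chunk.__getitem__)
        anchors.append((start + offset, chunk[offset]))

    baseline = []
    for i in range(len(intensity)):
        if i <= anchors[0][0]:
            baseline.append(anchors[0][1])      # before the first minimum
        elif i >= anchors[-1][0]:
            baseline.append(anchors[-1][1])     # after the last minimum
        else:
            for (i0, v0), (i1, v1) in zip(anchors, anchors[1:]):
                if i0 <= i <= i1:
                    t = (i - i0) / (i1 - i0)
                    baseline.append(v0 + t * (v1 - v0))
                    break
    return baseline


def subtract_baseline(intensity, window):
    """Subtract the estimated baseline; negative values are clipped to 0."""
    base = estimate_baseline(intensity, window)
    return [max(y - b, 0.0) for y, b in zip(intensity, base)]
```

Larger windows give a smoother but less adaptive baseline; the local regression option described above would replace the piecewise-linear interpolation used here.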

Figure 28.4 Baseline subtraction: A spectrum before baseline subtraction (gray) and after baseline subtraction (black) is shown.

Normalization

Effects of experimental noise can cause variations in the amplitude of spectra. To remove this artifact from the spectra, the total ion normalization procedure (Li et al., 2005) was used to rescale the intensity values. In this step, the area under the curve (AUC) is calculated for each spectrum above a user-defined cutoff. This cutoff is relevant because of high noise signals at low m/z values. The AUC of a spectrum is calculated as the sum of all intensities of a spectrum (considering the given cutoff value) if the intensities are represented by equally distributed m/z bins (Li et al., 2005). A normalized spectrum N_i based on the spectrum U_i is calculated as

$N_{i}(x) = U_{i}(x) \cdot \frac{AUC_{M}}{AUC(U_{i})},$ (28.1)

where

$AUC(U_{i}) = \sum_{j = 1}^{n} U_{i}(j)$ (28.2)

defines the AUC of the spectrum U_i, n is the number of m/z values in the spectrum U_i, and AUC_M defines the median AUC of all spectra (see Figure 28.5). Alternatively, the user has the possibility to define the AUC as a parameter directly.
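Equations 28.1 and 28.2 translate directly into code. The following Python sketch assumes equally binned spectra given as `(mz, intensity)` pairs; the function names and the choice to include m/z values equal to the cutoff are illustrative assumptions.

```python
import statistics

def auc(mz, intensity, cutoff=0.0):
    """Eq. 28.2 with cutoff: sum of intensities at m/z >= cutoff."""
    return sum(y for x, y in zip(mz, intensity) if x >= cutoff)


def normalize(spectra, cutoff=0.0, target_auc=None):
    """Eq. 28.1: rescale every spectrum to a common AUC.

    target_auc defaults to the median AUC of all spectra (AUC_M);
    passing a number corresponds to defining the AUC directly.
    A spectrum with an AUC of 0 cannot be rescaled and raises an error.
    """
    aucs = [auc(mz, intensity, cutoff) for mz, intensity in spectra]
    if target_auc is None:
        target_auc = statistics.median(aucs)
    return [
        (mz, [y * target_auc / a for y in intensity])
        for (mz, intensity), a in zip(spectra, aucs)
    ]
```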

Figure 28.5 Normalization: Gray spectra indicate spectra before normalization. The normalized spectra are depicted in black. The AUC of the upper spectrum is higher than the AUC of the lower spectrum before normalization. After normalization, the AUC of both spectra is equal. Note that only a cutout of the entire m/z range of the spectra is presented here.

Peak detection

Peaks in the spectra represent specific, abundant polypeptides in the sample (Li et al., 2005). These peaks have distinct intensities compared to the intensities at other m/z values, which appear as noise (Ressom et al., 2005). To identify peaks in the spectra, the function isPeak of the PROcess package is used (Li, 2005). Three different parameters are considered for the selection of the peaks:

• The signal-to-noise ratio (SNR), defined as the local smooth divided by the local estimate of variation

• A detection threshold below which the intensities of the spectrum are considered as zero

• The shape ratio, which is defined as the ratio between the area under the curve within a small distance of a peak candidate, as already identified by the first two criteria, and the maximum of all such peak areas of a spectrum

In addition to these parameters, the user can define window widths for the estimation of local variance, for smoothing the spectrum before peak detection, and for the calculation of the area under the peak, which are passed to the isPeak operation of the PROcess library.

Quality assessment

Spectra of poor quality may reduce the statistical significance of selected masses. Therefore, a quality assessment step on preprocessed spectra is recommended. In our workflow, we used the quality operation of the PROcess package (Li, 2005).

The quality of a spectrum is evaluated using the following three measures: Quality, Retain, and Peak. For computation of the Quality and Retain measures, a noise envelope is used, which is computed as follows: First, the noise is estimated as the difference between the spectrum and its moving average with a window size of 5 points starting from a user-defined cutoff value. Afterward, the noise envelope is computed as three times the standard deviation of the previously calculated noise in a 250-point window.

For the computation of the Peak measure, the peak information retrieved by the previous step is required.

The three measures for the quality estimation are defined as follows (Li et al., 2005):

• Quality: The measure of the separation of signal from noise, defined as the ratio of the AUC before and after subtraction of the noise envelope.

• Retain: The proportion of high peaks in a spectrum, quantified as the number of intensities exceeding five times the noise envelope divided by the total number of points in the spectrum.

• Peak: The number of peaks of the current spectrum is compared to the average number of peaks in all spectra.

The boundary parameters for the three measurements can be defined by the user. If a spectrum does not exceed any of the defined boundaries, the spectrum is removed.
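The noise envelope and the three measures can be sketched as follows. This is not the quality operation of PROcess: the handling of window edges (truncated windows), the omission of the m/z cutoff, and the direction of the Quality ratio (area after subtraction divided by area before) are assumptions, and the peak counts are taken as given from the peak detection step.

```python
import statistics

def moving_average(values, window):
    """Centred moving average; windows are truncated at the edges."""
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def noise_envelope(intensity, avg_window=5, sd_window=250):
    """Three times the local standard deviation of (spectrum - smooth)."""
    smooth = moving_average(intensity, avg_window)
    noise = [y - s for y, s in zip(intensity, smooth)]
    half = sd_window // 2
    envelope = []
    for i in range(len(noise)):
        chunk = noise[max(0, i - half): i + half + 1]
        envelope.append(3.0 * statistics.pstdev(chunk))
    return envelope


def quality_measures(intensity, n_peaks, mean_n_peaks, avg_window=5, sd_window=250):
    """Return (Quality, Retain, Peak) for one spectrum."""
    env = noise_envelope(intensity, avg_window, sd_window)
    area_before = sum(intensity)
    area_after = sum(max(y - e, 0.0) for y, e in zip(intensity, env))
    quality = area_after / area_before            # assumed direction of ratio
    retain = sum(1 for y, e in zip(intensity, env) if y > 5.0 * e) / len(intensity)
    peak = n_peaks / mean_n_peaks
    return quality, retain, peak
```

Each returned value would then be compared with its user-defined boundary to decide whether the spectrum is kept.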

Alignment to internal standards

Because of measurement variations, peaks in different spectra that correspond to the same protein may be located at different m/z values (Li et al., 2005). In the peak alignment of spectra, peaks that are likely to represent the same protein are identified across spectra. For the alignment, different algorithms are available (e.g., Li et al., 2005; Ressom et al., 2005). In our workflow, a method for peak alignment was integrated that uses internal standards as reference points in the spectrum. By definition, an internal standard is a compound added to a sample in known concentration to facilitate the qualitative identification and/or quantitative determination of the sample components (Ettre, 1993). In this approach, two internal standards are used. To align a spectrum by using this method, corresponding peaks have to be found close to the defined m/z values of the internal standards. Either the closest peak or the peak with maximal intensity in a user-defined window is chosen. The m/z values of the aligned spectrum are subsequently calculated by linear interpolation:

$\hat{x} = (x - IS_{l}) \cdot \frac{\hat{IS}_{u} - \hat{IS}_{l}}{IS_{u} - IS_{l}} + \hat{IS}_{l}$ (28.3)

where x is the original m/z value, $\hat{x}$ is the aligned m/z value, IS_l and IS_u denote the m/z values of the corresponding lower and upper internal standard peaks found in the processed spectrum, and $\hat{IS}_{l}$ and $\hat{IS}_{u}$ denote the defined m/z values of the lower and upper internal standards (see Figure 28.6).

Figure 28.6 Alignment: In the upper panel, an unaligned spectrum and the areas around the internal standards (dashed lines) are depicted. The lower panel shows the aligned peaks of this spectrum.

The peak information and the m/z values of two internal standards, including a defined window size around the standards, are required for running the alignment. If no peak is found within half the window width of an internal standard, the spectrum is removed and is not considered for further analysis. Note that the maximum peak or the closest peak within a window can be chosen as an internal standard for the alignment.
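Equation 28.3 and the selection of the internal standard peak can be written as a short Python sketch. The function names and argument layout are hypothetical; `observed_*` corresponds to IS_l and IS_u, and `nominal_*` to the defined values $\hat{IS}_{l}$ and $\hat{IS}_{u}$.

```python
def pick_standard_peak(peak_mz, peak_intensity, nominal, window, mode="closest"):
    """Return the m/z of the peak used as internal standard, or None if no
    peak lies within half the window width of the nominal m/z value."""
    candidates = [
        (x, y) for x, y in zip(peak_mz, peak_intensity)
        if abs(x - nominal) <= window / 2
    ]
    if not candidates:
        return None  # the spectrum would be removed from the analysis
    if mode == "max":
        return max(candidates, key=lambda p: p[1])[0]
    return min(candidates, key=lambda p: abs(p[0] - nominal))[0]


def align(mz, observed_l, observed_u, nominal_l, nominal_u):
    """Eq. 28.3: map m/z values linearly so that the observed standard
    peaks land on the nominal m/z values of the internal standards."""
    scale = (nominal_u - nominal_l) / (observed_u - observed_l)
    return [(x - observed_l) * scale + nominal_l for x in mz]
```

As a worked example, with standards defined at m/z 1000 and 2000 and observed at 1001 and 2003, an original value of 1502 is mapped to (1502 − 1001) · 1000/1002 + 1000 = 1500.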

Recalibration of preprocessed spectra

Due to the alignment to internal standards, parts of the initial preprocessing modality (i.e., binning, adjustment, and normalization) need to be repeated to ensure that the remaining spectra are (i) equally binned, (ii) within the same m/z bounds, and (iii) have the same AUC.

Generation of mean spectra

If multiple spectra per sample are available, these spectra can be averaged to a mean spectrum by calculating the mean intensities over the entire m/z value range. By this action, the noise can be reduced and the SNR significantly improved. Note that spectra of the same sample are identified by the file names of the spectra, which include the sample IDs.
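Averaging replicate spectra per sample can be sketched as follows in Python. The file naming scheme used in the usage note is purely hypothetical; the chapter only states that the sample ID is part of the file name, so the extraction rule is passed in as a function.

```python
from collections import defaultdict

def mean_spectra(spectra_by_file, sample_id_of):
    """Average replicate spectra that belong to the same sample.

    spectra_by_file: dict mapping file name -> intensity list (all spectra
                     are assumed to share the same m/z bins).
    sample_id_of:    function extracting the sample ID from a file name.
    """
    groups = defaultdict(list)
    for name, intensity in spectra_by_file.items():
        groups[sample_id_of(name)].append(intensity)
    return {
        sample_id: [sum(column) / len(column) for column in zip(*replicates)]
        for sample_id, replicates in groups.items()
    }
```

For file names such as `S01_rep1.txt` (an invented pattern), `sample_id_of` could be `lambda name: name.split("_")[0]`.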

Formatting of spectra for feature selection approach

For the biomarker identification, some algorithms from the Weka software package are used (Hall et al., 2009). To assign the spectra to different classes and transform the spectra information into ARFF (Attribute-Relation File Format), two additional modules were implemented (class assignment by file name and conversion to ARFF).

3.2 Identification of biomarker candidates

After preprocessing, the spectra have comparable m/z and intensity values and can be defined as a set of tuples, $T = \{(c_{j}, m) \mid c_{j} \in C, m \in M\}$ with $C = \{case, control\}$, where C is the set of class labels and M is the set of features (m/z values in the spectrum). In order to identify those masses in M that show the highest discriminatory ability according to class C, we adjusted an algorithm previously published by our group (Plant et al., 2006).

In the first step, a filter approach is used to select relevant features (m/z values). The resulting features from step 1, however, contain regions of adjacent features that are highly correlated, as most of them are redundant, representing the same information of the spectra. Consequently, in step 2, we identify a representative for every region in the spectra. Finally, in step 3, a wrapper approach further reduces the dimensionality of the result set of step 2 by optimizing the discriminatory ability using a classifier.

Note that for evaluating these analysis steps, synthetic spectra were created using the spectrum generator of mMass 5.0 (Strohalm et al., 2010). Overall, we generated 50 case and 50 control spectra and inserted randomly two artificial, well-discriminating m/z regions.

Step 1: Selecting relevant features

In this work, we used a Student t-test as a filter approach. In addition to the resulting P-value, we calculated a second parameter Δ representing the absolute difference of the mean intensities ${\bar{x}}_{c_{i}}$ of the two classes relative to the maximum intensity I_max in all spectra. The parameter Δ was defined by

$\Delta = \frac{|{\bar{x}}_{c_{1}} - {\bar{x}}_{c_{2}}|}{I_{\max}}$ (28.4)

This parameter is important to ensure that differences in the intensity can also be technically detected. A feature is defined as relevant if the following two conditions are fulfilled:

$P\text{-value} < \alpha$

$\Delta > \Gamma.$

The parameters α and Γ are set by the user.
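The two relevance conditions can be illustrated with a small Python sketch. It computes Δ according to Equation 28.4 and applies both thresholds; the P-values are assumed to be computed beforehand by a t-test implementation of the reader's choice (for example `scipy.stats.ttest_ind`). The function names and the data layout are hypothetical.

```python
def delta(case_values, control_values, i_max):
    """Eq. 28.4: absolute difference of class means relative to I_max."""
    mean_case = sum(case_values) / len(case_values)
    mean_control = sum(control_values) / len(control_values)
    return abs(mean_case - mean_control) / i_max


def relevant_features(cases, controls, p_values, alpha, gamma):
    """Return the indices of features with P-value < alpha and Delta > gamma.

    cases / controls: lists of spectra, each a list of intensities on the
                      same m/z bins.
    p_values:         one precomputed t-test P-value per m/z bin.
    """
    i_max = max(max(spectrum) for spectrum in cases + controls)
    selected = []
    for j, p in enumerate(p_values):
        d = delta([s[j] for s in cases], [s[j] for s in controls], i_max)
        if p < alpha and d > gamma:
            selected.append(j)
    return selected
```

The Δ condition discards features whose class means differ significantly in the statistical sense but by an amount too small to be measured reliably.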

Step 2: Selecting region representatives

We used a forward selection strategy to identify a representative for every region in the spectra. The representative feature is the feature with the highest quality (i.e., the discriminatory ability according to the filter method applied in step 1) within the region. The size s of the region depends on the index of the feature representing the m/z value, which is due to technical reasons, as different fragments of peptides with low molecular weight cause many narrow peaks in the spectral region of low m/z values.
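One plausible reading of this step is a greedy selection: features are visited in order of decreasing quality, and a feature becomes a representative only if it lies outside the region of every representative chosen so far. The Python sketch below follows this interpretation; it is not the published algorithm, and the function `region_size`, which returns the size s for a given feature index, is a hypothetical placeholder.

```python
def region_representatives(indices, quality, region_size):
    """Greedy selection of one representative per region.

    indices:     indices of the relevant features from step 1.
    quality:     dict mapping index -> quality score (higher is better).
    region_size: function mapping a feature index to the region size s
                 (in bins) around a representative at that index.
    """
    by_quality = sorted(indices, key=lambda i: quality[i], reverse=True)
    representatives = []
    for i in by_quality:
        # Keep i only if it is outside the region of every representative.
        if all(abs(i - r) > region_size(r) for r in representatives):
            representatives.append(i)
    return sorted(representatives)
```

An index-dependent `region_size`, for instance one that returns smaller values for low indices, would reflect the narrower peaks at low m/z values.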

Step 3: Selecting the best features

To further reduce the number of features, a wrapper-based approach is used, including a classifier and a search strategy to find a smaller feature subset while keeping the discriminatory ability at least constant. We apply logistic regression (Le Cessie and Van Houwelingen, 1992) as the classifier and a modified binary search (MBS) (Plant et al., 2006) as the search strategy. The area under the receiver operating characteristic (ROC) curve is selected as the measure for assessing the discriminatory ability. As introduced, selected features using this approach represent highly discriminating m/z values in the spectra. Finally, the local maximum in the neighborhood of a selected m/z value is determined to ensure that the selected mass represents a real peak in the spectra. Figure 28.7 demonstrates a snapshot of two matched spectra when comparing two different groups of mass spectrometry data (e.g., cases versus controls). This three-step strategy is able to identify highly discriminating mass peaks between the spectra (in this example, with superior sensitivity and specificity). In particular, in this example, we were able to identify our two predefined artificial biomarker candidates.
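The evaluation measure of the wrapper, the area under the ROC curve, can be computed from classifier scores without drawing the curve: it equals the probability that a randomly chosen case receives a higher score than a randomly chosen control, with ties counted as one half. The Python sketch below shows only this measure; the logistic regression classifier and the MBS search strategy are not reproduced, and the function name is an assumption.

```python
def roc_auc(scores_case, scores_control):
    """Area under the ROC curve via pairwise comparison of scores.

    scores_case / scores_control: classifier outputs (e.g., predicted
    probabilities of class 'case') for the samples of each class.
    """
    wins = 0.0
    for a in scores_case:
        for b in scores_control:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(scores_case) * len(scores_control))
```

A wrapper would call such a function for every candidate feature subset and keep a smaller subset only if the resulting value does not decrease.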

Figure 28.7 Identified mass peaks when comparing two different groups of mass spectrometry data using our synthetic data. These two mass peaks were classified as highly discriminatory features.

After applying this computational strategy, identified well-discriminating mass peaks need to be verified and validated as biomarkers by subsequent database verification, lab experiments, and clinical trials before selected biomarker candidates can go into clinical application.

4 Analysis tool

As the analysis of MALDI data comprises several steps, and a variety of methods are available for each of those steps, the analysis of such data is challenging. Knowledge Discovery in Databases (KDD) is a process to manage and analyze huge amounts of data. Fayyad et al. (1996) split the KDD process into several steps, from storing data via selection, preprocessing, and transformation to data mining methods. An interpretation and evaluation step follows this series of tasks, which is performed by experts in the individual field. One major point here is that the entire KDD process is iterative, which results in reanalyzing the raw data set or intermediate results with small or large changes in the analytical workflow. To overcome the hurdle of using different applications for those steps, KD³ has been developed by our group (Dander et al., 2011).

The following sections describe the KD³ Composition, KD³ Functional Object, and KD³ Workflow based on Pfeifer et al. (2008).

4.1 KD³ composition

The implemented KD³ application consists of four main parts; a screenshot of the application is depicted in Figure 28.8. The first part shows the available functional objects, which are loaded using the Java reflection application programming interface (API) and are grouped using a hierarchical structure. In a workspace window, the user can drop functional objects and parameterize them by setting up annotated constructors and methods. In the center of the window, the workflow is visualized.

Figure 28.8 A screenshot of the complete preprocessing and biomarker identification workflow modeled in KD³.

4.2 KD³ functional object

In general, a functional object is composed as follows: From the user’s perspective, a task is a functional object, which consists of in- and out-ports that have to be assembled and parameterized to fulfill their purpose. In order to extend the functionality of KD³, the software engineer has to create a subclass of the superclass FunctionalObject containing the required algorithms. The abovementioned in- and out-ports get the data from another object and send the processed data to another object. For instance, the BinningStep functional object in Figure 28.8 receives the data from ReadSpectra and finally sends the preprocessed data to the AdjustmentStep. For performing the computation, an abstract method named execute() is available, which must be overridden in derived FunctionalObject classes in order to solve a specified problem.
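The pattern of a functional object with ports and an overridable execute() method can be illustrated with a small analog. KD³ itself is written in Java; the Python sketch below only mirrors the described design and does not reproduce the KD³ API. Apart from the names FunctionalObject and execute(), every class, attribute, and method here (`in_port`, `out_port`, `connect`, `run`, `ScaleStep`, `ClipStep`) is hypothetical.

```python
from abc import ABC, abstractmethod

class FunctionalObject(ABC):
    """Analog of the superclass: one in-port, one out-port, one successor."""

    def __init__(self):
        self.in_port = None
        self.out_port = None
        self.successor = None

    def connect(self, successor):
        """Wire this object's out-port to the successor's in-port."""
        self.successor = successor
        return successor

    @abstractmethod
    def execute(self):
        """Read self.in_port, write the result to self.out_port."""

    def run(self, data):
        """Receive data, compute, and pass the result down the chain."""
        self.in_port = data
        self.execute()
        if self.successor is not None:
            return self.successor.run(self.out_port)
        return self.out_port


class ScaleStep(FunctionalObject):
    """Toy processing step: multiply all intensities by a factor."""

    def __init__(self, factor):
        super().__init__()
        self.factor = factor

    def execute(self):
        self.out_port = [y * self.factor for y in self.in_port]


class ClipStep(FunctionalObject):
    """Toy processing step: limit all intensities to an upper bound."""

    def __init__(self, limit):
        super().__init__()
        self.limit = limit

    def execute(self):
        self.out_port = [min(y, self.limit) for y in self.in_port]
```

Chaining two steps then resembles connecting BinningStep to AdjustmentStep in the workflow: `first = ScaleStep(2.0); first.connect(ClipStep(5.0)); first.run([1.0, 2.0, 4.0])`.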

4.3 KD³ workflow

A workflow is defined as a repeatable pattern of activity, which is used to solve a defined process using different parameters and data sets. The user, therefore, can drop the functional objects into the workspace window and can specify the parameters. After the workflow is designed, it can be executed directly from KD³. Finally, if the workflow is properly designed, it can be saved as a GraphML (http://graphml.graphdrawing.org/) file and deployed (e.g., to biomedical researchers to analyze newly available data). In this work, we extended KD³ to allow the preprocessing and biomarker identification of MS data (see also Figure 28.8). Our proposed workflow-oriented architecture results in high usability. KD³ allows the user to exchange different methods for all the aforementioned steps and supports the straightforward implementation and integration of newly developed algorithms.

To keep the integration of new algorithms as simple as possible, new algorithms are developed in the programming language Java. Therefore, the only requirement for integrating new algorithms into KD³ is an elementary understanding of Java. KD³ automatically generates the graphical user interface (GUI) for each of the implemented methods, but programmers can also develop a specific GUI for each of those methods. The extended version of KD³ for preprocessing and biomarker identification of MS data is available at http://www.umit.at/kd3ms.

KD³ can be used in the analyses of other data types as well. Therefore, more than 100 different methods and algorithms have been integrated into this application. As our group mainly deals with research questions in computational biomarker discovery, we applied KD³ successfully in several research projects, such as addressing the search for metabolic biomarkers in cardiovascular disease or the identification of human breath gas markers in liver disease using ion-molecule reaction-mass spectrometry (Pfeifer et al., 2007; Netzer et al., 2009, 2011; Baumgartner et al., 2010; Millonig et al., 2010).

5 Conclusion

In this chapter, we have presented a comprehensive computational workflow for the identification of proteomic biomarker candidates using mass spectrometry data. The proposed approach includes the preprocessing of raw spectra coupled with a three-step feature selection approach that combines the advantages of filter- and wrapper-based methods. We illustrated the power of our approach by identifying predefined “biomarker candidates” in a synthetic MS data set used for demonstration purposes. It is important to note that the selected filter and wrapper methods can generically be replaced by alternative feature selection methods. Only small parts of the entire workflow need to be modified to treat other types of MS data generated in proteomic and metabolomic experiments. The integration of the presented methods into KD³ results in high usability and accessibility, allowing a targeted search for highly predictive biomarker candidates in MS data.
