Chapter 10

CLAST

Clustering Biological Sequences

Vicente Molieri; Lina J. Karam; Zoé Lacroix    School of Electrical, Computer, and Energy Engineering, Arizona State University, Tempe, AZ, USA

Abstract

Clustering sequences is important in a variety of applications, including development of nonredundant databases, function prediction, and identifying patterns of gene expression. Currently, clustering methods rely on a prealignment as supplementary information to guide the construction of clusters. This chapter introduces a novel algorithm to cluster nucleotide and peptide sequences. The algorithm is a no-reference approach that utilizes only the sequences as input. We also introduce a novel metric that is used to describe the relationship between biological sequences, and serves as the distance measurement for clustering. Results are presented for real biological sequences, comparing the proposed algorithm to other similar tools available.

Keywords

Clustering

hashing

graph cuts

biological sequences

databases

nucleotide

peptide

Acknowledgments

Funding: This research was partially supported by the National Science Foundation1 (Grants IIS 0431174, IIS 0551444, IIS 0612273, IIS 0738906, IIS 0832551, and CNS 0849980, and several REU grants). We thank Naji Mounsef for sharing his hash-based approach for sequence assembly and its prototype implemented in MATLAB (Mounsef et al., 2008). We thank Christophe Legendre and Ruben Acuña for helping to supervise the undergraduate students Matthew Land, Ben J. Piorkowski, and Christopfer Watson, who put the tool to the test. We also acknowledge Louiqa Raschid and Ben Snyder for their contribution to BIPASS.

1 Introduction

Sequence clustering is an important method in bioinformatics. Applications of clustering include the development of nonredundant databases (Holm and Sander, 1998; Holm et al., 2000; Li and Godzik, 2006), discovery of motifs (Kim and Lee, 2006), identification of patterns of gene expression (Quackenbush, 2001), function prediction and automatic annotation (Fleischmann et al., 2001), and genome assembly (Pertea et al., 2003). Clustering methods typically deal with sequences relating to messenger RNA (mRNA), complementary DNA (cDNA), proteins, or other special types of sequences such as expressed sequence tags (ESTs) or genome survey sequences (GSSs). In existing bioinformatics clustering methods, alignment is typically used, either as a preprocessing step or as a refinement technique on filtered sets of data. For species where a reference genome is unavailable, a similar species may be used as a guide for clustering. A reference genome is necessary for most clustering methods because sequences are first aligned to the genome to determine their approximate locations. This global alignment is used to determine which sets of sequences are close to one another on the genome so that they can then be clustered. Clustering can also be used as part of a workflow for alternative splicing analysis (Lacroix et al., 2007). When clustering is used in conjunction with alignment, the robustness of assembly calculations can be improved as well.

Sequence clustering typically follows one of two approaches: (i) partial-order-based graph models or (ii) equivalence-based graph models. Partial-order graphs are those that can be viewed as a hierarchical model, wherein one of the nodes should always precede another in the graph, although the relationship may not apply to all pairs of nodes in the graph. Equivalence-based graphs differ from partial-order graphs by creating symmetric, bidirectional relationships between nodes.

This chapter presents a new algorithm, which we refer to as the CLustering Any Sequence Tool (CLAST), that does not require any reference to generate clusters. The proposed CLAST algorithm does not use precomputed alignment information or any other information beyond the sequence data to guide the computation of the clusters.

1.1 Related work

Clustering sequences is not a new concept in bioinformatics. Some existing tools require an alignment, either with or without a reference genome. Several tools have been created to cluster sequences for the purpose of eliminating redundancy in data sets. These tools typically work in a partial-order graph-based manner (Holm and Sander, 1998; Holm et al., 2000; Li and Godzik, 2006). Other approaches are equivalence-order graph-based and consist of an all-against-all comparison. Equivalence-order methods generally have applications different from the partial-order methods, including discovery of motifs (Kim and Lee, 2006), function prediction, automatic annotation, and phylogenetics (Fleischmann et al., 2001). In equivalence-order methods, thresholds are developed to “cut” the data sets into clusters based on the weight of the links, using a variety of methods. In partial-order-based methods, a set of tests is used to generate subgroups consisting of similar sequences, and then the resultant subgroups are aligned and clustered. In Holm and Sander (1998), a tool is proposed that iteratively filters an input data set of sequences by first grouping all sequences that share a decapeptide, then creating subgroups that share at least 50% pentapeptide composition. Finally, a local alignment using the Smith-Waterman algorithm is computed, and a relatively small set of representative sequences is chosen such that all other sequences share at least 90% of their sequence with the representative set. In Holm et al. (2000), E-values (which correspond to the probability that a given sequence would have a greater alignment score in an alternate alignment due purely to chance) from aligned sequences are used to determine familial relationships. In CD-hit (Li and Godzik, 2006), a filtering hierarchy similar to that of Holm and Sander (1998) is used; in CD-hit, however, instead of generating a representative sequence for a cluster, the longest existing sequence is used. Metrics in these algorithms can vary, although the use of Z-scores (which correspond to a normalized matching score, measured from the mean) is popular. In Kim and Lee (2006), Z-scores from Monte Carlo simulations based on Smith-Waterman scores are used, while in Fleischmann et al. (2001), Z-scores from pairwise match scoring are used. In Blastclust (Dondoshanky, 2002), a method based on E-values is used to create single-linkage clusters using BLAST and MegaBLAST with default parameters.

In Mounsef et al. (2008), a method of hashing is introduced for use in an assembly tool. The hashing method is used to reduce the computational complexity of the assembly process. The method proposed in Mounsef et al. (2008) generates hashes by first padding sequences of variable lengths so that all are the length of the longest sequence in the data set, and then using a random number vector to generate a single number per sequence in the data set.

In Tarjan et al. (2004), a method of clustering using graph cuts is proposed for clustering topics on Internet websites. The approach presented in that paper is unique in that it introduces the concept of a sink node, in contrast to the source node (representing the data set), which is linked to all sequences in the data set with a certain weight. Minimum cuts are then calculated, and all links whose weights fall under a certain threshold (which is related to the weight of the introduced links to the sink node) are cut (taken to be zero). This edge-weight-based approach is used to generate a set of clusters that are more related to one another than they are to the sink node (which represents random chance; i.e., the resultant clusters are linked by more than just chance). In Quackenbush (2001), an overview of clustering approaches for analyzing gene expression data is presented. In addition, a hierarchical clustering scheme is presented that uses a pairwise distance matrix to merge the two “closest” clusters together (initially, each sequence in the data set belongs to its own single-sequence cluster) based on a distance metric, and this process is repeated until clusters can no longer be merged. A k-means clustering algorithm is also presented; because the final number of clusters is not usually known beforehand, a principal component analysis (PCA) is used as prior information to estimate the number of clusters to be generated. The final algorithm presented in Quackenbush (2001) is a self-organizing map, an iterative approach similar to k-means clustering that converges based on intercluster and intracluster distance metrics. In Diaz-Uriarte and de Andres (2006), random forests are used on gene expression data to identify sets of genes that can be useful in clinical diagnostics.

2 Methods

The methodology presented in this chapter is designed to work independently of any data besides the sequences themselves. Independence from precomputed or supplemental information (such as annotations, alignments, or a reference genome) is important because, although such information is often useful, it may be incomplete or incorrect. The algorithm presented in this chapter works in three steps: (i) hashing, (ii) matching, and (iii) clustering. Details about each of these three steps are provided next.

2.1 Hashing

The hashing step of the algorithm generates a set of hash IDs for all sequences in the data set. The number of hash IDs generated for a given sequence is equal to the length of the sequence minus the length of the hashing window, plus one. Methods using a similar concept of hashing exist in computer science–based applications (Mounsef et al., 2008). Hashing in this work is meant to encompass the idea of representing multiple data points (in this case, nucleobases) with a single point (a hash ID). In this work, a block denotes a subsequence of nucleobases to be condensed into a single hash ID. Hashing is then done by generating a vector of random numbers equal in length to the block, performing a pointwise multiplication, and summing the results to generate a single hash ID as follows:

$$\mathrm{ID}_j \;=\; \sum_{i=0}^{\ell-1} X_{i+j}\, h_i \qquad (10.1)$$

where j is the start position of the rectangular window within a sequence, ℓ is the length of the window and of the hashing vector, X is the sequence, and h is the hashing vector of random numbers. This operation is akin to the common signal processing operation known as multiply-accumulate. Once the hash ID is computed, the window is slid forward by one nucleobase and the process is repeated until all possible hash IDs have been computed (when the end of the sequence X is reached). This sliding-window operation is analogous to what signal processing refers to as convolution. Consolidating information into a single point enables faster searches for matches than the pairwise nucleobase comparisons used in typical biological sequence alignment algorithms.

Given the sparse nature of the data (with only four unique characters in nucleotide sequences: A, C, G, T), it is essential to evaluate the data across multiple points in order to reduce the number of biologically meaningless matches. Additionally, false positives in the system (instances where a matching hash ID exists when a matching sequence does not) are reduced with longer hash windows. In our implementation, 32-bit floating-point numbers are used in the hashing vector. Using a window of size 31 with four unique nucleobases and 32-bit floating-point numbers yields a false-positive probability of approximately 10⁻⁹. It should also be noted that a false-positive hash ID will not necessarily yield an error in the final clustering result. On the other hand, sequenced data contain numerous atypical points, such as single-nucleotide polymorphisms (SNPs), deletions, insertions, or gaps, which make it essential to choose a hash length that is sufficiently small that these atypical data points do not mask biologically significant matches between sequences. With a properly chosen hash size, this algorithm design is robust to low quantities of SNPs, insertions, deletions, gaps, and other such atypical data points (that is, characters other than A, C, G, and T). A sequence with a high proportion of atypical data points, SNPs, or indels may not be clustered correctly by this algorithm. The choice of a proper hash size is empirical and can depend on the data set; the choice of the default length in the algorithm is discussed in Section 3 of this chapter. The length is the only adaptable parameter of the hashing portion of the algorithm. The hashing vector must be the same for all data analyzed in a data set for the hash IDs to be comparable. The pseudocode for the algorithmic creation of hash IDs is shown in Algorithm 10.1.
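As a back-of-the-envelope check on this figure (under the simplifying assumption, not stated above, that the hash IDs of two non-matching windows behave like independent, uniformly distributed 32-bit values):

$$P(\text{false positive per ID comparison}) \;\approx\; 2^{-32} \;\approx\; 2.3 \times 10^{-10},$$

which is of the same order of magnitude as the quoted probability of approximately 10⁻⁹.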

Algorithm 10.1 Hashing.
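Because the pseudocode figure itself is not reproduced here, the following minimal C++ sketch illustrates the hashing step described above. All names are illustrative rather than taken from the CLAST source, and the direct use of each character's code as its numeric value is a simplifying assumption standing in for whatever per-base encoding the implementation actually uses.

```cpp
#include <cstddef>
#include <random>
#include <string>
#include <vector>

// Illustrative sketch of the hashing step (not the CLAST source code).
// Slides a window of length ell over the sequence and computes a
// multiply-accumulate against a fixed random hashing vector h (Eq. 10.1).
std::vector<float> hashSequence(const std::string& seq,
                                const std::vector<float>& h) {
    const std::size_t ell = h.size();      // hash window length (default 31)
    std::vector<float> ids;
    if (seq.size() < ell) return ids;      // sequence shorter than the window
    for (std::size_t j = 0; j + ell <= seq.size(); ++j) {
        float id = 0.0f;
        for (std::size_t i = 0; i < ell; ++i)
            id += static_cast<float>(seq[j + i]) * h[i];  // X[i+j] * h[i]
        ids.push_back(id);                 // one hash ID per window position
    }
    return ids;
}

// One shared hashing vector of random numbers for the whole data set.
std::vector<float> makeHashingVector(std::size_t ell, unsigned seed = 42) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    std::vector<float> h(ell);
    for (float& x : h) x = dist(gen);
    return h;
}
```

Reusing one hashing vector across the entire data set is what makes hash IDs from different sequences comparable, as noted in the text.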

2.2 Matching

Matching is done by comparing all hash IDs from a given sequence with all hash IDs from all other sequences. Comparing all elements of an unsorted vector to all other elements is an O(N²) problem, with N equal to the number of hash IDs generated, which approaches the number of nucleobases in the data set as the size of the data set increases. The concept of matching hash IDs at the sequence level is demonstrated in Figure 10.1. In this figure, the black boxes correspond to matching hash IDs, while the thinner lines correspond to mismatches. In the example shown in Figure 10.1, the second nucleobase of Sequence B represents a SNP (relative to Sequence A), and the seventh and eighth nucleobases in Sequence A have been deleted when forming Sequence B. While the deletion has been shown explicitly for illustration purposes, this algorithm does not require prealignment; instead, two fewer hash IDs would simply be generated for the second sequence than for the first. As shown in Figure 10.1, the matching hash IDs can be used to develop a concept of alignment in the algorithm, as matching hash IDs correspond to 100% matching of the corresponding sequence blocks, but this process is different from the pairwise distance measurements typically used in alignment algorithms. The algorithm has been designed to be much faster than O(N²), approaching O(N log₂ N) operations. This is achieved by first sorting the hash IDs by value and then finding ranges of positions for which the hash IDs are identical. For ranges of data that share the same hash ID, information about the sequences is stored, creating equivalent symmetric links between the two sequences (i.e., A↔B rather than A→B and B→A). The C++ standard library sort function is used to sort the hash IDs by value. On average, the sort is O(N log₂ N), while the determination of ranges of consecutive matches requires a single exhaustive pass over the sorted array, which is O(N), ignoring operations that take place once the ranges of indices are determined. Thus, the final algorithm is O(N(1 + log₂ N)), which, as N becomes very large, is approximately O(N log₂ N).

Figure 10.1 Example of a matching process. Hash IDs represented by a black box denote cases where a match was found between the two sequences. Hash IDs represented by a thin line find no match between the two sequences. Arrows are used to show the two linked hash IDs.

The matching algorithm is shown in Algorithm 10.2. This process is used to identify sets of matches between two sequences. Matching yields a two-dimensional (2D) matrix of relationships between all sequences in the data set; that matrix is diagonally symmetric due to the creation of symmetric links. The resulting matrix represents a graph of nodes and edges, in which nodes represent sequences and edges represent a metric of relatedness, which, at this step of the algorithm, is the number of matching hash IDs between the two considered sequences. Figure 10.2 shows two groups of 10 sequences each, synthesized from data that originated from two separate genes (one gene per group). The link between the two groups likely represents structural information shared between two mRNA sequences, and it illustrates why the third step, clustering, is needed to establish meaningful clusters.

Algorithm 10.2 Matching.
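Since the pseudocode figure is likewise not reproduced, here is a minimal C++ sketch of the sort-and-scan matching step described above; the names and container choices are illustrative, not the CLAST implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Illustrative sketch of the matching step (not the CLAST source code).
// Each entry pairs a hash ID with the index of the sequence it came from.
// Sorting by hash ID is O(N log N); a single linear scan then finds runs
// of identical IDs, and every pair of sequences inside a run receives a
// symmetric link whose weight counts their matching hash IDs.
using Entry = std::pair<float, std::size_t>;  // (hash ID, sequence index)

std::map<std::pair<std::size_t, std::size_t>, std::size_t>
matchHashIDs(std::vector<Entry> entries) {
    std::sort(entries.begin(), entries.end());  // sort by hash ID value
    std::map<std::pair<std::size_t, std::size_t>, std::size_t> links;
    std::size_t lo = 0;
    while (lo < entries.size()) {
        std::size_t hi = lo + 1;
        while (hi < entries.size() && entries[hi].first == entries[lo].first)
            ++hi;                               // [lo, hi) share one hash ID
        for (std::size_t a = lo; a < hi; ++a)   // link all pairs in the run
            for (std::size_t b = a + 1; b < hi; ++b) {
                std::size_t s1 = entries[a].second, s2 = entries[b].second;
                if (s1 == s2) continue;         // ignore self-matches
                if (s1 > s2) std::swap(s1, s2); // one symmetric link (A<->B)
                ++links[{s1, s2}];
            }
        lo = hi;
    }
    return links;  // (i, j) -> number of matching hash IDs
}
```

Note that linking every pair within a run of identical hash IDs is quadratic in the run length, so the O(N log₂ N) estimate above implicitly assumes such runs stay short, which the low false-positive rate of the hashing step encourages.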
Figure 10.2 Graph representation. Nodes represent sequences and edges represent relations. The weight on an edge corresponds to the number of matching nucleotides between the sequences that are linked by that edge. The red cross (X) denotes the link to be cut.

2.3 Clustering

We adopt the concept of a sink node based on the ideas presented in Tarjan et al. (2004). Due to the computational complexity of minimum-cut graph algorithms, our algorithm was simplified to increase efficiency (which is necessary when dealing with large data sets, as in bioinformatics applications) by removing the notion of paths and instead focusing exclusively on intersequence linkages. A relationship metric has been introduced with the goal of discerning biologically significant links from biologically insignificant links. In this case, biologically insignificant links are taken to be links that arise from random chance, structural information, repeats, and other similar phenomena that do not necessarily imply that sequences belong to the same gene, while biologically significant links are defined as links indicating that two sequences belong to the same cluster (gene). The proposed relation metric (RM) is the number of matching nucleobases divided by the smaller of the total numbers of nucleobases in the two sequences. This metric is a matching percentage that denotes how related any two sequences are, and it can be expressed as:

$$\mathrm{RM}_{ij} \;=\; \frac{\#\ \mathrm{matching\ nucleobases}}{\min\!\left(|S_i|,\, |S_j|\right)} \qquad (10.2)$$

where i and j are the indices of the two considered sequences Si and Sj, and |Si| and |Sj| are the lengths of Si and Sj, respectively. Given the result of the matching algorithm, this metric can be calculated for every link between two sequences. The calculation of these matching percentages again yields a diagonally symmetric matrix representing a graph of nodes and edges, where the edges are now weighted according to the new metric of relatedness. The practical utility of this metric is illustrated by an example in which Sequence A consists of 120 nucleobases, Sequence B consists of 900 nucleobases, and Sequences C and D consist of 10,000 nucleobases each. Sequence A shares 100 of its 120 nucleobases with Sequence B, while Sequence C shares 200 nucleobases with Sequence D. In this case, a metric using only the raw number of matching nucleobases is problematic; the true meaning lies in how much of a sequence is shared between two sequences (i.e., a proportion). Between any two sequences of differing lengths, however, there are two such proportions, one relative to the length of each sequence. In this example, the important information is that 100 of the 120 nucleobases in Sequence A are shared with Sequence B, rather than that 100 of the 900 nucleobases of Sequence B are shared with Sequence A, because this likely indicates that Sequence A is a subsequence of Sequence B.
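A small worked example using the numbers from the preceding paragraph makes the metric concrete; this is an illustrative C++ snippet, not part of the CLAST code.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>

// Relation metric from Eq. (10.2): matching nucleobases divided by the
// length of the shorter of the two sequences.
double relationMetric(std::size_t matching, std::size_t lenI, std::size_t lenJ) {
    return static_cast<double>(matching)
         / static_cast<double>(std::min(lenI, lenJ));
}

int main() {
    // The example from the text: Sequence A (120 nt) shares 100 nt with
    // Sequence B (900 nt); Sequences C and D (10,000 nt each) share 200 nt.
    std::printf("RM(A,B) = %.3f\n", relationMetric(100, 120, 900));      // 0.833
    std::printf("RM(C,D) = %.3f\n", relationMetric(200, 10000, 10000));  // 0.020
    return 0;
}
```

Normalizing by the shorter sequence keeps the A–B link strong (0.83) while the C–D link (0.02) falls far below the default threshold of 0.41 used later in the chapter, even though C and D share twice as many raw matching nucleobases.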

To develop clusters from the matrix, the first sequence in the data set is used as a seed cluster, and sequences are then added to a cluster if they have a match (weighted edge) above a specified threshold with any sequence already in the cluster. If the computed metric is below that threshold, it is treated as zero and the sequence is not added to the cluster. If no matches exist for a given sequence, a new cluster is created for that sequence. Because some matches are not found until later in the algorithm, clusters are also merged as necessary. The clustering procedure is shown in Algorithms 10.3 and 10.4. These algorithms prepare a matrix containing the weighted edges between all nodes in the data set (with weighted edges below the specified clustering threshold set to zero); this matrix can be used to generate the final graph. It is worth noting that if Sequence A belongs to a cluster and Sequence B shares a biologically significant link with Sequences A and C, then all three sequences will be clustered together even if Sequences A and C do not share a biologically significant link, since B shares significant links with both (a reasonable behavior given the intended application of alternative splicing). Thus, any two sequences within a cluster may share few nucleobases directly, but there must be a path between them along which all edge weights are above the specified clustering threshold.

Algorithm 10.3 Clustering, Part 1.
Algorithm 10.4 Clustering, Part 2.
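Since the clustering pseudocode figures are not reproduced either, the following C++ sketch shows one standard way to realize the behavior just described, using a union-find (disjoint-set) structure in place of the seed-and-merge bookkeeping of Algorithms 10.3 and 10.4; both formulations should yield the same partition, since each clusters exactly the sequences connected by paths of above-threshold links.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Illustrative union-find sketch of the clustering step (not the CLAST
// source code). Every sequence starts in its own cluster; each link whose
// relation metric meets the threshold merges the two clusters, realizing
// the transitive behavior described in the text (A and C end up together
// whenever a path of above-threshold links connects them).
struct UnionFind {
    std::vector<std::size_t> parent;
    explicit UnionFind(std::size_t n) : parent(n) {
        std::iota(parent.begin(), parent.end(), 0);  // each node is its own root
    }
    std::size_t find(std::size_t x) {
        while (parent[x] != x) x = parent[x] = parent[parent[x]];  // path halving
        return x;
    }
    void merge(std::size_t a, std::size_t b) { parent[find(a)] = find(b); }
};

struct Link { std::size_t i, j; double rm; };  // weighted edge between sequences

std::vector<std::size_t> cluster(std::size_t numSeqs,
                                 const std::vector<Link>& links,
                                 double threshold = 0.41) {  // default from Table 10.3
    UnionFind uf(numSeqs);
    for (const Link& e : links)
        if (e.rm >= threshold)   // below-threshold edges are treated as zero (cut)
            uf.merge(e.i, e.j);
    std::vector<std::size_t> label(numSeqs);  // cluster ID per sequence
    for (std::size_t s = 0; s < numSeqs; ++s) label[s] = uf.find(s);
    return label;
}
```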

3 Evaluation and discussion

Transcript sequences were retrieved from BIPASS (Lacroix et al., 2007). While the method in BIPASS relies on the use of a published genome to perform the mapping and clustering of transcripts, the no-reference clustering approach proposed in this chapter clusters transcripts without the need for any additional genomic information. CLAST was implemented in C++ and compiled into an executable for testing on a standard desktop PC running Linux. The threshold value used was 0.41, and the hash size (hash window length) was 31 in all tested cases (the choice of these values is discussed in detail later in this chapter). For all tested data sets, the clustering execution time ranged from less than 1 s to 50 s.

The transcript sequences of 44 human (Homo sapiens, Hs) genes were retrieved from BIPASS with the queries kinase, collagenase, and transcription factor, as shown in Table 10.1. These data sets of single genes are referred to as single-gene sets (SGS). From those transcript sequences, we generated 25 data sets for testing, each containing a subset of the 44 SGS, denoted as multigene sets (MGS). Table 10.2 presents the SGS composition of the MGS data sets. As shown in Table 10.2, each MGS contains transcripts from more than one gene. Data sets MGS1 to MGS17 were formed by selecting genes from Table 10.1. In order to test the overlapping-gene case, we created data sets MGS18 to MGS25, which were formed by merging transcripts from clusters corresponding to overlapping genes. For example, MGS18 is composed of the genes DLG4 and ACADVL, which overlap head-to-head (Zhou and Blumberg, 2003). Similarly, the gene TOE1 has overlapping exons with MUTYH at the 5′-end and with TESK2 at the 3′-end (Veeramachaneni et al., 2004). This was done to evaluate the robustness of the proposed CLAST algorithm when there is an overlap between genes. In Table 10.2, SGS data sets that CLAST was unable to cluster into a single cluster are highlighted in pink, and the SGS data sets that were added to address the overlapping-genes question are highlighted in blue. Some of the genes added to address the overlapping-genes question could not be grouped into a single cluster by CLAST and are thus highlighted in both applicable colors.

Table 10.1

Gene Sequences Retrieved from BIPASS with Associated Statistics


Note: Each of the 44 genes represents one SGS. Genes highlighted in pink (light gray in print version) were not clustered by CLAST into a single cluster.

Table 10.2

MGS Formed by Combining Transcripts from Two or More BIPASS Clusters


Note: The 44 genes listed in the top row each represent one SGS. Genes highlighted in pink (light gray in print version) were not clustered by CLAST into a single cluster. The gene composition of each MGS listed in the first column is shown in the corresponding row. MGS highlighted in blue (dark gray in print version) contain genes that overlap.

To determine proper default values for the variable parameters of the algorithm (hash size and threshold) during development, training data sets were evaluated over a range of values for both hash size and threshold until the algorithm returned the expected clusters (both in terms of the number of clusters and their content). To develop the data sets used in this evaluation, sets of clusters were extracted from BIPASS and, using the exonic information, sequences were randomly constructed by splicing together exons in a biologically realistic fashion. These data sets represent a set of genes, each with a set of sequences that have undergone simple alternative splicing events. Although the values were determined empirically during development using a different training data set, testing was done during evaluation of the algorithm to assess the validity of the previously chosen values. It should be noted that, in determining the expected number of clusters for any complex data set, it was necessary to account for the presence of the simpler data sets that the algorithm was unable to cluster correctly individually, so the expected number of clusters in this evaluation was sometimes greater than the number of genes in the data set. Figure 10.3 shows the variation in the number of clusters found by the algorithm using MGS25, which contains all genes used in the evaluation of this algorithm. In Figure 10.3, the impact of both the hash size and the threshold is clearly demonstrated. It is important to note the situation where the number of clusters remained steady at the expected number over a short range of hash sizes at a given threshold (57 clusters, in this case). We refer to this condition as a steady-state solution, and it shows empirically that, for the data sets evaluated, the default values chosen during development were optimal and provided a margin in both hash size and threshold choice. The values chosen during development and used in testing were a default hash size of 31 and a default threshold of 0.41 when performing nucleotide clustering.

Figure 10.3 Number of clusters found by the algorithm for a given hash size and threshold. The boxed area is the point at which the expected number of clusters is found. When examined in greater detail, the data within the box yield the expected number of clusters (57) at a threshold of 0.4 and a hash size of 30–33.

We first ran CLAST against each of the data sets described in Table 10.1. CLAST produced a single cluster for 35 of the 44 SGS data sets from Table 10.1, while the other nine genes produced more than the single expected cluster. One such scenario is presented in Figure 10.4: two distinct clusters of several highly similar sequences exist, with two additional sequences (indicated by arrows) that would optimally link the two clusters together. The two sequences that have been clustered separately contain a high proportion of DNA characters other than A, C, G, and T, which the algorithm treats as SNPs, and the sequences are thus treated as dissimilar. In all nine cases in which more than one cluster was found per gene, the failure to create one cluster stems from a high degree of sequence mismatch relative to the threshold being used. This can be caused by too few large shared subsequences or by a large number of SNPs or DNA characters other than A, C, G, and T in the transcript. Though CLAST is robust to some SNPs in a transcript, the proportions in these test sequences were too high for CLAST (a conservative approach) to cluster the sequences correctly. Using postprocessing, these clusters could likely be corrected by either applying a lower threshold to small clusters or using consensus voting to combine smaller clusters with a high degree of matching that did not meet the initial threshold.

Figure 10.4 The PALM2-AKAP2 gene cluster as visualized in BIPASS (Lacroix et al., 2007). Only the vertical bars are part of the actual sequence, and the gaps exist as a result of alignment to the genome. The boxed sequences form a separate cluster from the rest of the data set due to the high proportion of SNPs in the two sequences.

The second testing phase was conducted against data sets MGS1 to MGS25 and resulted in the expected clusters, matching the original BIPASS clusters both in number and in clustered sequences, except for the data sets that include transcripts from the genes highlighted in pink in Table 10.1 (SGS 6, 11, 18, 21, 25, 27, 34, 35, and 41), for which CLAST gives extra clusters as previously discussed. Apart from the extra clusters produced by these nine genes, the remaining clusters match the annotations as well as the clustering expected by BIPASS. Note that CLAST produced the expected results for MGS21 to MGS25, which contain the transcripts of 11 overlapping genes.

The third testing phase compared CLAST to Blastclust (Dondoshanky, 2002) and CD-hit (Li and Godzik, 2006). Blastclust (version 2.2.21) and CD-hit-EST (a version of the algorithm presented in Li and Godzik, 2006) are standard algorithms used in the field to cluster nucleotide and/or amino acid sequences. Like CLAST, these algorithms do not use a reference genome to perform clustering. Blastclust and CD-hit often provided poor results at their default values, so these algorithms were also tested across a range of values seen as most analogous to those used in CLAST, though this did not significantly improve their results (results not shown). The default values used for each tool during the evaluation are given in Table 10.3, and the results of the computations using each tool's default values are given in Tables 10.4 and 10.5. As shown in Table 10.4, CLAST is much more effective at clustering the SGS data sets into fewer clusters (for every SGS data set, ideally only one cluster would be discovered). On the complex MGS data sets, CLAST continues to cluster sequences correctly, accounting for the inaccuracies of the SGS clustering results. Blastclust does perform some clustering but fails to link many sequences from the same gene; CD-hit seems more successful at linking sequences from the same gene, but it is still not as successful as CLAST. For MGS25, Blastclust and CD-hit generated hundreds of clusters when only 44 were expected (with sequences per cluster ranging from 2 to 155), whereas CLAST generated the expected 57 clusters (44 plus the 13 excess clusters known from the SGS data-set tests). Results for the Fugu sequences were more nearly equal across tools. Since Fugu contains fewer sequences per cluster than the MGS data sets do, this suggests that Blastclust and CD-hit decrease in accuracy as data are added, while CLAST increases in accuracy as data sets become more populated. As shown in Table 10.5, CLAST is computationally more expensive in these tests, requiring 27.89 s to run the Fugu sequences, as opposed to 1.17 s for Blastclust and 3.06 s for CD-hit-EST. However, it is important to note that the CLAST algorithm evaluated in this chapter is an unoptimized developmental version.

Table 10.3

Default Values for Tools Used in Evaluation

Evaluation Tools' Default Parameters

Parameter                       | CLAST    | Blastclust | CD-hit-est | uicluster2
Default word size/window size   | H = 31   | W = 32     | n = 8      | H = 8
Default matching-like threshold | T = 0.41 | L = 0.9    | c = 0.9    | M = 40

Table 10.4

Cluster Results

Number of Experimental Clusters Obtained

Data set                                 | CLAST        | Blastclust | CD-hit-est  | uicluster2 | Expected # of clusters in reference BIPASS
44 independent data sets (= 44 clusters) | 1 < # < 4    | 1 < # < 37 | 1 < # < 160 | —          | 1 per data set
MGS25                                    | 57 (44 + 13) | 402        | 1790        | —          | 44
Fugu CoreNucleotide (1094 sequences)     | 471          | 518        | 568         | 440        | N.A.
Fugu Core + dbEST (27,163 sequences)     | N.A.         | 15,254     | 13,742      | 10,728     | 2094

Table 10.5

Execution Time Results

Running Time in Seconds

Data set                                 | CLAST        | Blastclust     | CD-hit-est         | uicluster2
44 independent data sets (= 44 clusters) | 0.5 < t < 10 | 0.1 < t < 1.02 | 0.0005 < t < 0.317 | < 1
MGS25                                    | 50           | 1.82           | 3.34               | < 1
Fugu CoreNucleotide (1094 sequences)     | 27.89        | 1.17           | 3.06               | 11
Fugu Core + dbEST (27,163 sequences)     | N.A.         | 62.5           | 250                | 219

4 Conclusions

This chapter presents a no-reference, hash-based clustering algorithm that outperforms existing no-reference techniques. The proposed CLAST algorithm incorporates a novel relation metric to assess the degree of matching between pairs of sequences. For this purpose, CLAST represents the input sequences as nodes of a graph, with weighted edges corresponding to links between sequences with matching nucleotides. Edges with relatively low weights are removed (cut) from the graph, resulting in the final clusters. CLAST has been tested on real sequences and demonstrated its ability to produce biologically significant clusters. CLAST also generates information in a manner that makes it easy to update clusters with minimal processing time (clusters do not need to be recomputed), including merging clusters, adding to existing clusters, and creating new clusters. Future work includes optimizing CLAST to improve time and memory usage, adopting CLAST in building efficient genomic applications (including querying and updating genomic databases), and expanding into alternative splicing applications. Indeed, the population of sequence databases such as BIPASS (Lacroix et al., 2007) relies on generating clusters, and any update requires executing the clustering workflow (Lacroix and Legendre, 2008), recomputing all the clusters, and populating a new database. A method such as CLAST could be used to populate the database and update it without having to regenerate the whole database from scratch, as is currently done. Moreover, it could support online submission of new transcripts and identification of the cluster to which they belong.

References

Diaz-Uriarte R, de Andres SA. Gene selection and classification of microarray data using random forest. BMC Bioinformatics. 2006;7(3).

Dondoshanky I. Blastclust (NCBI Software Development Toolkit). Bethesda, MD; 2002.

Fleischmann W, Zdobnov EM, Kriventseva EV, Apweiler R. CluSTr: a database of clusters of SWISS-PROT + TrEMBL proteins. Nucleic Acids Res. 2001;29(1):33–36.

Holm L, Sander C. Removing near-neighbor redundancy from large protein sequence collections. Bioinformatics. 1998;14(5):423–429.

Holm L, Heger A, Park J, Chothia C. RSDB: representative protein sequence databases have high information content. Bioinformatics. 2000;16(5):458–464.

Kim S, Lee J. BAG: a fast program for clustering and sequencing large sets of protein or nucleotide sequences. Int. J. Data Min. Bioinformatics. 2006;1(2):178–200.

Lacroix Z, Legendre C. BIPASS: Design of Alternative Splicing Services. Int. J. Comput. Biol. Drug Des. 2008;1(2):200–217.

Lacroix Z, Legendre C, Raschid L, Snyder B. BIPASS: Bioinformatics Pipelines Alternative Splicing Services. Nucleic Acids Res. 2007;35(Suppl. 2):W292–W296 (Web Server issue).

Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22(13):1658–1659.

Mounsef N, Karam LJ, Lacroix Z, Legendre C. A low-complexity probabilistic genome assembly based on hashing. In: IEEE International Workshop on Genomic Signal Processing and Statistics (GENSiPS); 2008:1–4.

Pertea G, et al. TIGR gene indices clustering tools (TGICL): a software system for fast clustering of large EST datasets. Bioinformatics. 2003;19(5):651–652.

Quackenbush J. Computational analysis of microarray data. Nat. Rev. Genet. 2001;2(6):418–427.

Tarjan RE, Flake GW, Tsioutsiouliklis K. Graph clustering and minimum cut trees. Internet Math. 2004;1(4):385–408.

Veeramachaneni V, Makalowski W, Galdzicki M, Sood R, Makalowska I. Mammalian Overlapping Genes: The Comparative Perspective. Genome Research. 2004;14(2):280–286.

Zhou C, Blumberg B. Overlapping gene structure of human VLCAD and DLG4. Gene. 2003;305(2):161–166.


1 Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
