Ray R. Hashemi1 [email protected]; Azita A. Bahrami2 [email protected]; Mahmood Bahar3 [email protected]; Nicholas R. Tyler4 [email protected]; Daniel Swain, Jr.1 [email protected] 1 Department of Computer Science, Armstrong State University, Savannah, GA, USA
2 IT Consultation, Savannah, GA, USA
3 Department of Physics, Islamic Azad University, Tehran North Branch, Tehran, Iran
4 School of Pharmacy, University of Georgia, Athens, GA, USA
Region growing is a well-known concept in image processing that, among other things, effectively contributes to the mining of pictorial data. The goal of this research effort is to (1) investigate region growing in nonpictorial data and (2) determine the effectiveness of the regions in mining such data. Part (1) is met by introducing a new version of the self-organizing map (SOM), the Neighborly SOM, capable of delivering such regions. Part (2) is met by introducing a new prediction methodology that uses the delivered regions and measuring its effectiveness by (a) applying the method to 10 pairs of training and test sets [repeated random sub-sampling (RRSS) cross-validation] to predict chemical agents’ liver toxicity and (b) comparing the liver toxicity prediction accuracy with the predictions produced by C4.5 and the traditional SOM using leave-one-out (LOO) and RRSS cross-validations. The results revealed that the proposed methodology performs better.
Region growing is a well-established concept in the field of image processing. It is used to identify a group of neighboring pixels within an image (pictorial data) that have the same features. To accomplish region growing, a pixel is selected as the “seed,” and every pixel that satisfies the two criteria of being a neighbor of the seed and having a common set of features with the seed is added to the region. Each added pixel, in turn, acts as a new seed, and the process continues until no more pixels can be added to the region (Hashemi, 2001). Selection of the features of interest, along with the neighborhood radius, is subjective and defined by the region grower. An image may have several regions. The regions, among other things, can be used effectively in mining pictorial data. Examples are numerous, including mining of abnormalities in medical images, image content discovery, image understanding, and vision (Sahu et al., 2012; Antonie et al., 2001; Megalooikonomou et al., 2000; Smith et al., 2013).
The intriguing question is whether one can grow regions for nonpictorial data and whether the grown regions can contribute effectively to the mining of such data. By nonpictorial data, we mean a data set comprising a set of objects (records) free of pixel-based data. There are four major challenges in exploring this question: establishing the foundation of region growing, creating a vehicle for delivering regions, building a mining methodology that uses the regions, and comparing the effect of such a methodology with known existing ones.
The goal of this research effort is to investigate and provide for all these challenges. To meet the goal, after defining the basic foundation, we specifically introduce a new version of the self-organizing map (SOM) neural network (Kohonen, 1982), named Neighborly SOM, which is able to deliver the regions for a nonpictorial data set. The data set of interest includes the structure activity relationships (SAR) of a set of chemical agents and their features. A new prediction methodology is introduced based on the delivered regions that will be used to examine the prediction of the chemical agents’ liver toxicity.
The remaining organization of this chapter is as follows. Related research in this area is covered in section 2. The basic foundation is presented in section 3. The methodology of the study is introduced in section 4, and the empirical results are discussed in section 5. Our conclusions and implications for future research are covered in section 6.
There is a large body of research about the concepts of neighborhood and closeness. When it comes to numerical data, Euclidean distance is often the choice for measuring the closeness of objects. However, two objects that are close together, due to their small Euclidean distance, are not necessarily neighbors. To explain this concept further, the sum of the Euclidean distances among the n attributes of the two objects (which is only one value) is the deciding factor for determining closeness. In other words, the n attributes have a collective effect on the Euclidean distance, whereas the deciding factor for having two objects as neighbors is every individual attribute distance (i.e., n values) (Lin, 1997). To deliver the neighborhood for a given object, we introduce a modified version of the SOM neural network. To the best of our knowledge, using a neural network to deliver the neighbors of an object (and not its close objects) has not been explored before.
The neighborhood concept as a tool for defining the granularity of data was explored by the soft computing community (Lin, 1997; Hashemi et al., 1998; Yao and Zhong, 2002). Recently, there are some attempts to use the neighborhood system as a foundation for prediction (Hashemi et al., 2013, 2014; Yao, 2000). We try to build such a tool using our previous experiments (Hashemi et al., 2013, 2014).
An object with n attributes (A1,..., An) may be represented as a point in an n-dimensional space. In such a space, object Oi is the neighbor of object Oj if Oi is within the n-sphere neighborhood of Oj. This n-sphere is centered on Oj, and its radius (r) is the neighborhood radius of Oj. Therefore, Oi is the neighbor of Oj if for every attribute Ak the distance |Oj.Ak – Oi.Ak| ≤ r (for k = 1 to n) (Hashemi et al., 2004, 2013; Yao, 2000). Two objects within the neighborhood n-sphere of Oj are not necessarily neighbors of each other. For example, the data set in Figure 15.1(a) has three objects, and the objects are described by two attributes, A1 and A2. For a neighborhood radius of r = 1, the objects O2 and O3 are neighbors of O1, but they cannot be neighbors of each other because the distance between O2 and O3, for at least one attribute, is greater than r [see Figure 15.1(b)]. If one creates the two-sphere of object O3, it does not include O2.
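The per-attribute neighborhood test can be contrasted with Euclidean closeness in a few lines of Python. The coordinates below are illustrative, not the exact values of Figure 15.1:

```python
def is_neighbor(o_i, o_j, r):
    # Oi is a neighbor of Oj iff |Oj.Ak - Oi.Ak| <= r for EVERY attribute Ak
    return all(abs(a - b) <= r for a, b in zip(o_i, o_j))

def euclidean(o_i, o_j):
    # the collective (single-valued) closeness measure, for contrast
    return sum((a - b) ** 2 for a, b in zip(o_i, o_j)) ** 0.5

# With r = 1, O2 and O3 are both neighbors of O1, yet they are not
# neighbors of each other: they differ by more than r on attribute A1.
O1, O2, O3 = (0.0, 0.0), (1.0, 0.3), (-1.0, 0.3)
print(is_neighbor(O2, O1, 1))   # True
print(is_neighbor(O3, O1, 1))   # True
print(is_neighbor(O2, O3, 1))   # False

# O4 is as close to O1 in Euclidean terms as O2 is, yet only O2 is a
# neighbor of O1: closeness and neighborhood are distinct notions.
O4 = (0.0, 1.04)
print(round(euclidean(O2, O1), 2), round(euclidean(O4, O1), 2))  # 1.04 1.04
print(is_neighbor(O4, O1, 1))   # False
```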
In an image, region growing starts with a pixel as a seed, pj, and any pixel, pi, that satisfies the following two criteria is added to the region of pj (Hashemi et al., 2002):
Criterion I: pi is the neighbor of pj and
Criterion II: pi satisfies some features
The same steps are followed to grow a region for nonpictorial data. However, due to the physical boundaries of the pixels (manifested as their coordinates), a pixel has a maximum of eight immediate neighbors (8-Neighbor), whereas an object may have many immediate neighbors.
In other words, the first condition of region growing from a given pixel (seed) offers a maximum of eight immediate candidates for further expansion, but an object may have a large number of candidates. Eventually, a pixel may participate in at most one region. In contrast, an object has the potential of participating in several regions. Such a potential may dilute those features of a given region that contribute to effective data mining. Consequently, the multiple-region inclusion of the same object needs to be controlled.
As part of the basic foundation, two issues need to be addressed. The first issue concerns the second condition of region growing. In pictorial data, the features that a pixel can satisfy are determined by applying a function to the pixel value, the pixel’s coordinates, the values of the neighboring pixels, the coordinates of the neighboring pixels, or any combination of these. One may ask what can be used to determine features for the objects of a nonpictorial data set. The response to this question comes from the nature of the data itself. To explain it further, let us divide nonpictorial data sets into two large groups. Every object in a data set of the first group is composed of n attributes (descriptive attributes); every object in a data set of the second group is composed of n + 1 attributes (n descriptive attributes and one decision attribute). Only the members of the second group are fit for region growing. As a result, the features that an object can satisfy are determined by applying a function to all or a subset of these attributes. The subset always includes the decision attribute. The descriptive and decision attributes are also known as the independent and dependent variables.
The second issue is about the differences between a cluster and a region. By definition, objects of a cluster have great similarity in comparison with each other but are very dissimilar to the objects in other clusters (Han and Kamber, 2006). This is not true for a region. An object in one region may also be a member of another region. In fact, the objects within a region are only similar to the seed and not necessarily to each other. Considering Figure 15.1 again, the objects of O2 and O3 are in the region of O1. However, O2 and O3 are not neighbors of each other, and they are dissimilar in comparison to each other, which contradicts the definition of a cluster. In other words, a cluster includes the objects that are close to the centroid of the cluster, whereas a region includes all the neighbors of a given object.
First, a new version of the SOM neural network, Neighborly SOM, is presented as the methodology for delivering regions of nonpictorial data. Second, a new prediction methodology is introduced based on the delivered regions. The details of the methodologies are covered in the following two subsections.
The traditional SOM is a two-layer neural network, introduced by Kohonen (1982), which clusters the records of an input data set. The weight vector for each output node, representing a cluster, is generated randomly, and each input vector (Oi) fires the output node (Bj) for which the Euclidean distance between Oi and the weight vector (Wj) of Bj is the minimum:

Dist(Oi, Bj) = √(Σk=1..n (Oi.Ak – Wj.Ak)²)   (15.1)
The weight vectors of the fired node and those nodes close to the fired one (within a local radius) are modified by a portion of the difference between the input and the weight vectors. The portion is decided by the learning rate:

Wj(new) = Wj(old) + η(Oi – Wj(old))   (15.2)

where Oi is an input vector, Wj is the weight vector of a node within the locale of the winner node, and η is the learning rate (0 < η < 1).
After a predefined number of epochs, both the local radius and the learning rate are reduced. The process repeats until the input vectors cannot change the size of the clusters.
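The traditional training step just described can be sketched in a few lines. This is a minimal sketch: `winner` implements the Euclidean firing rule and `update` the learning-rate modification, with the locale reduced to the winner itself for brevity:

```python
import math

def winner(o, weights):
    # fire the output node whose weight vector is Euclidean-closest to the input
    return min(range(len(weights)), key=lambda j: math.dist(o, weights[j]))

def update(w, o, eta):
    # move the weight vector toward the input by a fraction eta of the difference
    return [wk + eta * (ok - wk) for wk, ok in zip(w, o)]

weights = [[0.0, 0.0], [5.0, 5.0]]
o = [4.0, 4.5]
j = winner(o, weights)                      # node 1 is the closer one
weights[j] = update(weights[j], o, eta=0.5)
print(j, weights[j])                        # 1 [4.5, 4.75]
```

After each predefined number of epochs, a full implementation would also shrink the local radius and the learning rate, as described above.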
Three elements shape the architecture and behavior of the Neighborly SOM: the nature of the input data, the shortcomings of the traditional SOM (Hashemi et al., 2005), and the influence of the neighborhood concept. The nature of the data was addressed in section 2 of this chapter. In building the Neighborly SOM, remedies for the shortcomings of the traditional SOM are pointed out as needed. The influence of the neighborhood concept changes the SOM from a cluster builder into a region grower.
The traditional SOM performs clustering of the objects of a given data set. However, we are not interested in the clusters for a given set of seeds; instead, we are interested in growing regions for them. As a result, we introduce changes to the traditional SOM to support the growing of regions by adopting the same two criteria used for region growing in images (introduced earlier, in section 3 of this chapter).
The first departure from the traditional SOM is in the number of output nodes. The number of output nodes is initially equal to 1, and the first object of the data set is considered as the weight vector of the output node. Any object of the data set that is not able to fire the existing output nodes will serve as the weight vector of a new output node. By doing so, the number of output nodes is dictated by the objects of the data set and their weight vectors are not generated randomly. Clearly, this modification is in response to some of the SOM shortcomings. (For details about these shortcomings, consult Hashemi et al., 2005.)
The second departure from the traditional SOM is caused by providing for the first criterion of the region growing. Eq. (15.1) is replaced by Eqs. (15.3) and (15.4). We assume that both input and weight vectors have n elements:
where
and Q(Bj) is the number of terms of |Oi.Ak – Wj.Ak| in Eq. (15.4) that are not zero.
The winner node is the one for which
where r is the neighborhood radius and δ is a small value (usually equal to 1/10 of r). The purpose of having δ is to inject some small flexibility into the size of the radius. This flexibility is needed due to changes in the weight vector.
Just as a reminder, using Eq. (15.1) will put both O1 and O3 of the nonpictorial data set of Figure 15.1(a) in the same cluster, whereas Eq. (15.5) puts O1 and O3 in different neighborhoods, correctly.
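Assuming the winner condition amounts to requiring every per-attribute distance to fall within r + δ (a hedged reading of Eqs. (15.3)–(15.5); the exact forms may differ), the firing test could look like this:

```python
def fires(o, w, r, delta=None):
    # Hedged reading of the winner condition (Eq. 15.5): the node fires only
    # when every per-attribute distance between the object and the weight
    # vector stays within the slightly relaxed neighborhood radius r + delta.
    if delta is None:
        delta = r / 10.0        # the text suggests delta of about r/10
    return all(abs(ok - wk) <= r + delta for ok, wk in zip(o, w))

# With r = 1, an object at (-1.0, 0.3) fires a node whose weight vector is
# (0, 0), but objects at (1.0, 0.3) and (-1.0, 0.3) do not fire each other.
print(fires((-1.0, 0.3), (0.0, 0.0), 1))   # True
print(fires((1.0, 0.3), (-1.0, 0.3), 1))   # False
```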
The last departure from the traditional SOM is caused by providing for the second criterion of region growing. Having the same decision value for the overwhelming majority of the objects of a region is considered the common feature among the objects. The dilution of the common feature is measured and represented by the dilution factor, DF, of the region. Let R be a region with k objects and weight vector W. Let us also assume that k1 objects in R have the same decision value as W, k2 objects have decisions different from W, and k = k1 + k2. The dilution factor for R is defined as follows:
It is obvious that as k2 increases, the feature of region R becomes more diluted. When an object can be a part of more than one region, it decreases the dilution factor of those regions whose weight vectors’ decision values differ from the object’s decision value. However, if the increase in k2 is small enough, the dilution can be tolerated. Both prevented dilution and tolerated dilution are used to handle the case of multiple-region membership for a given object. Of course, the enforcement of prevented dilution has priority over tolerated dilution.
Consequently, when more than one output node can be fired, the node whose weight vector’s decision attribute matches the decision attribute of the object is chosen to be fired (prevented dilution). If this does not resolve the tie, then the output node whose region has the largest cardinality is fired. If the tie persists, a winner node is chosen randomly. The cardinality of a region used in tie breaking may be calculated using recent memory or past memory. The recent-memory cardinality is the number of objects added to a given region during the current epoch, whereas the past-memory cardinality is the number of unique objects that were part of the region during at least one of the epochs, from the first epoch through the current one.
As an example, let us concentrate on two output nodes, B1 and B2, and the first two consecutive epochs, 1 and 2. The objects that fire B1 and B2 during each epoch, which make up their regions, are shown in Table 15.1.
Table 15.1
Use of Recent and Past Memories in Calculating the Cardinality of a Region
Epochs and Cardinality | Objects in B1 Region | Objects in B2 Region |
Epoch 1 | {O5, O2, O9, O6} | {O8, O1, O4} |
Epoch 2 | {O5, O1, O9, O6, O4} | {O8, O2} |
Card(Region)-Recent Memory | 5 | 2 |
Card(Region)-Past Memory | 6 | 4 |
Usually, the use of recent memory is preferred when the training set is large; otherwise, use of past memory is preferred.
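The two cardinalities of Table 15.1 follow directly from set operations over the per-epoch region contents; for node B1:

```python
# Objects that fired B1 in each of the first two epochs (Table 15.1)
b1_epochs = [{"O5", "O2", "O9", "O6"},
             {"O5", "O1", "O9", "O6", "O4"}]

def card_recent(epochs):
    # recent memory: objects added to the region during the current epoch
    return len(epochs[-1])

def card_past(epochs):
    # past memory: unique objects that belonged to the region in any epoch
    # from the first epoch through the current one
    return len(set().union(*epochs))

print(card_recent(b1_epochs))   # 5
print(card_past(b1_epochs))     # 6
```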
The locale radius in Neighborly SOM is always zero to discourage firing of multiple output nodes. These changes in the traditional SOM are encapsulated in the following algorithm:
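Pulling the pieces together, the training loop can be sketched as follows. This is a hedged sketch under the assumptions already noted (per-attribute firing with radius r + δ, recent-memory cardinality for tie breaking); the function and variable names are illustrative, not the authors’:

```python
def per_attribute_fires(attrs, w, r, delta):
    # assumed firing test: every per-attribute distance within r + delta
    return all(abs(a - b) <= r + delta for a, b in zip(attrs, w))

def train_neighborly_som(data, r, n_epochs, eta=0.1):
    # data: list of (attributes, decision) pairs
    delta = r / 10.0
    weights, decisions, regions = [], [], []
    for _ in range(n_epochs):
        regions = [set() for _ in weights]      # recent-memory regions
        for i, (attrs, dec) in enumerate(data):
            fired = [j for j, w in enumerate(weights)
                     if per_attribute_fires(attrs, w, r, delta)]
            if not fired:
                # the object seeds a new output node; weights are never random
                weights.append(list(attrs))
                decisions.append(dec)
                regions.append({i})
                continue
            # prevented dilution: prefer nodes whose decision matches the object's
            matching = [j for j in fired if decisions[j] == dec]
            pool = matching or fired
            # remaining ties broken by (recent-memory) region cardinality
            j = max(pool, key=lambda k: len(regions[k]))
            regions[j].add(i)
            # locale radius is zero: only the winner's weight vector is updated
            weights[j] = [wk + eta * (a - wk)
                          for wk, a in zip(weights[j], attrs)]
    return weights, decisions, regions
```

On a toy data set with two well-separated decision groups, two output nodes emerge, one seeded by the first object of each group.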
After the training of the Neighborly SOM is completed, it identifies the final set of regions (R1, . . . , Rn), their corresponding output nodes (B1, . . . , Bn), and their corresponding weight vectors (W1, . . . , Wn). To turn the Neighborly SOM into a prediction tool, some adjustments are necessary. The adjustment rules are listed below based on their precedence, from highest to lowest:
1. If v ≠ v’, then set v ← v’, where v and v’ are the value of the decision attribute of Wi and the dominant decision value of Ri, respectively.
2. If Wi = Wj, then replace Bi and Bj with a new output node By such that Wy = Wi = Wj and Ry = Ri ∪ Rj. Ry may have a dilution factor that is different from those of both Ri and Rj.
3. If DFRi < Tdf, then remove output node Bi, its region, and its weight vector, where Tdf is a threshold value.
4. If Card(Ri) = 1, then remove output node Bi, its region, and its weight vector.
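The adjustment rules can be sketched in the same vein. The dilution factor is read here as the share of region members whose decision matches the weight vector’s; this is an assumption, since Eq. (15.6) is not reproduced above:

```python
def adjust(weights, decisions, regions, data, t_df=0.5):
    # data: list of (attributes, decision) pairs; regions hold object indices.
    # Rule 1: relabel each node with the dominant decision of its region.
    for i, region in enumerate(regions):
        decs = [data[j][1] for j in region]
        decisions[i] = max(set(decs), key=decs.count)
    # Rule 2: merge output nodes that ended up with identical weight vectors.
    seen = {}
    for i, w in enumerate(weights):
        key = tuple(w)
        if key in seen:
            regions[seen[key]] |= regions[i]
            regions[i] = set()              # mark node i for removal
        else:
            seen[key] = i
    # Rules 3 and 4: drop overly diluted regions and singleton regions.
    def dilution_factor(i):                 # assumed form: k1 / (k1 + k2)
        k1 = sum(1 for j in regions[i] if data[j][1] == decisions[i])
        return k1 / len(regions[i])
    keep = [i for i in range(len(weights))
            if len(regions[i]) > 1 and dilution_factor(i) >= t_df]
    return ([weights[i] for i in keep],
            [decisions[i] for i in keep],
            [regions[i] for i in keep])
```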
The Neighborly SOM is now ready to act as a predictive tool. The features of the regions are their weight vectors. A test object that has not been previously seen by the system is fed to the system only once. A degree of neighborhood between the test object and each weight vector (say, Wj) is calculated using Eq. (15.7):
where n = Card(Wj) and
The winner node is the one that has the largest degree of neighborhood. The value of the decision attribute of the winner node’s weight vector is the predicted decision for the test object. If more than one node is fired, then the dominant decision value among the weight vectors of the fired nodes is the predicted decision for the test object. If there is no dominant decision, then the test object cannot be predicted.
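A hedged sketch of the prediction step follows, assuming for illustration that the degree of neighborhood amounts to the fraction of the n attributes on which the test object falls within radius r of the weight vector; Eq. (15.7) may define it differently:

```python
def degree_of_neighborhood(obj, w, r):
    # assumed form: share of attributes within the neighborhood radius
    return sum(1 for a, b in zip(obj, w) if abs(a - b) <= r) / len(w)

def predict(obj, weights, decisions, r):
    degrees = [degree_of_neighborhood(obj, w, r) for w in weights]
    best = max(degrees)
    fired = [decisions[j] for j, d in enumerate(degrees) if d == best]
    # dominant decision among the fired nodes; None means "cannot be predicted"
    counts = {d: fired.count(d) for d in set(fired)}
    top = max(counts.values())
    dominant = [d for d, c in counts.items() if c == top]
    return dominant[0] if len(dominant) == 1 else None

print(predict((0.2, 0.1), [[0.0, 0.0], [5.0, 5.0]], [1, 0], 1))  # 1
```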
John Young and his research team at the National Center for Toxicological Research (NCTR) have investigated the properties of more than a thousand chemical agents (Young et al., 2004). The purpose of the investigation was to provide a means for rapidly predicting the organ-specific toxicity of new chemicals. As a first step, there was a need to build a data set based on the structure activity relationships (SAR) of the chemical agents. The two main sources that contributed to the building of the SAR data set were the Carcinogenic Potency Database (CPDB) (Gold et al., 1984) and research findings reported in the literature. With regard to the first source, the CPDB, a chemical agent may have several records representing studies for different genders, species, routes of administration, and organ-specific toxicities. Each such group of records was mapped onto a single record for the SAR data set. A single value for the agent’s toxicity was also derived from the agent’s different toxicity levels. In addition, elements that could cause either inaccuracy or structural inconsistency were removed. With regard to the second source, the findings reported in the literature were verified by Young’s research team, who duplicated the reported experiments.
After structural cleanup of the agents, the SAR data set included 999 chemical agents (objects). Each object had 140 attributes, including a decision attribute whose value indicated whether or not the agent causes cancer in the liver. The value of the decision attribute was either 1 (toxin) or 0 (nontoxin). We used the proposed methodology to build a system to predict the organ-specific toxicity of an agent using its descriptive and decision attributes.
Building the predictive system started with identifying and removing the redundant descriptive attributes. A redundant attribute does not strongly contribute to the decision. An entropy approach (Natt et al., 2012) was used to complete the task, which resulted in the identification of only seven descriptive attributes as nonredundant. Removal of the redundant attributes generated many duplicate and contradictory objects. Two contradictory objects have the same values for their descriptive attributes but different values for their decision attributes. As part of the preprocessing of the SAR data set, one copy of each duplicated object was kept, and the contradictory objects were removed from the data set. After the preprocessing of the SAR data set was completed, it contained 426 objects.
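An entropy-based filter of this kind can be sketched generically as an information-gain screen against the decision attribute. The exact procedure of Natt et al. (2012) may differ, and the threshold here is illustrative:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(objects, labels, k):
    # H(decision) - H(decision | attribute k)
    groups = {}
    for obj, lab in zip(objects, labels):
        groups.setdefault(obj[k], []).append(lab)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def nonredundant_attributes(objects, labels, threshold=0.01):
    # keep only attributes whose gain on the decision exceeds the threshold
    return [k for k in range(len(objects[0]))
            if information_gain(objects, labels, k) > threshold]

# attribute 0 determines the decision; attribute 1 carries no information
objects = [(1, 0), (1, 1), (0, 0), (0, 1)]
labels  = [1, 1, 0, 0]
print(nonredundant_attributes(objects, labels))  # [0]
```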
To measure the prediction accuracy of the Neighborly SOM, repeated random sub-sampling (RRSS) cross-validation was used. For this validation, 10 pairs of training and test sets were carved from the preprocessed SAR data set as follows. A total of 10% of the SAR objects were randomly selected as a test set. From the remaining records in SAR, the maximum number of objects was selected as a training set such that the numbers of objects with decision 1 and with decision 0 in the training set were the same. The reason for having the same number of objects for each decision value was to make sure the trained Neighborly SOM was not biased toward one of the decisions. Since only 10% of the SAR objects were chosen as a test set each time, 10 pairs of training and test sets were created such that each test object appeared in only one test set.
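Carving one such RRSS pair (a 10% random test set, then a maximal class-balanced training set from the remainder) can be sketched as follows; the function name and seed are illustrative:

```python
import random

def rrss_pair(labels, test_frac=0.10, seed=0):
    # returns (train_indices, test_indices) over the preprocessed data set
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    rng.shuffle(idx)
    n_test = int(len(idx) * test_frac)
    test, rest = idx[:n_test], idx[n_test:]
    # maximal balanced training set: equal counts of decisions 1 and 0
    ones = [i for i in rest if labels[i] == 1]
    zeros = [i for i in rest if labels[i] == 0]
    m = min(len(ones), len(zeros))
    train = ones[:m] + zeros[:m]
    return train, test

labels = [1] * 30 + [0] * 10
train, test = rrss_pair(labels)
# the training set always holds as many decision-1 as decision-0 objects
```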
For each pair of training and test sets, the Neighborly SOM was trained on the training set, and its prediction accuracy was tested on the corresponding test set. The average results for the 10 test sets, along with the specificity and sensitivity, are shown in Table 15.2.
Table 15.2
Prediction Results for Neighborly SOM, Traditional SOM, and C4.5 Using 10 Pairs
Methodology | % of Correct Predictions | % False Negatives | % False Positives | % Sensitivity | % Specificity | % of Failed Predictions |
Neighborly SOM | 77 | 7.62 | 4.3 | 78.3 | 88.7 | 13 |
Traditional SOM | 63.57 | 29.8 | 6.67 | 72 | 60.94 | 0 |
C4.5 | 36.2 | 8.3 | 25.5 | 26.7 | 76.7 | 30 |
It is logical to compare the prediction accuracy of the Neighborly SOM with the prediction accuracy of the traditional SOM. To complete the task, we repeated the following process for each pair of the training and test set using the traditional SOM:
1. The initial number of output nodes was selected to be 20, for which 20 weight vectors were randomly generated (one per output node). The number of epochs remained the same, as did the initial values of η and β. After the clusters of the training set were produced, the dominant decision for each cluster was calculated.
2. During the test process, the decision value of the test object was compared with the dominant decision of the cluster of the fired node.
The average results for all 10 pairs are shown in Table 15.2.
The prediction accuracy of the Neighborly SOM was also compared with the prediction accuracy of the well-known approach of C4.5 (Quinlan, 1993) using the same 10 pairs of training and test sets. These results are also shown in Table 15.2.
In addition, we compared the performance of the three predictive methods of the Neighborly SOM, traditional SOM, and C4.5, using leave-one-out (LOO) cross-validation (Kohavi, 1995). In this cross-validation, one object of the data set was set aside as a test set, and from the remaining objects, a training set was carved such that the number of objects in the training set was the same for all the decision values. The process was repeated for every object of the data set and then the prediction accuracy results for each model were averaged. The results are shown in Table 15.3.
Table 15.3
Prediction Results for Neighborly SOM, Traditional SOM, and C4.5 Using LOO Cross-Validation
Methodology | % of Correct Predictions | % False Negatives | % False Positives | % Sensitivity | % Specificity | % of Failed Predictions |
Neighborly SOM | 78.2 | 8.5 | 5.9 | 72.5 | 88.1 | 7.5 |
Traditional SOM | 69 | 25.4 | 6.6 | 70.1 | 67.9 | 0 |
C4.5 | 67.8 | 0.24 | 17.2 | 57.8 | 99.5 | 14.8 |
Neighborly SOM generates regions of the input data set, whereas the traditional SOM clusters the input data set. It is important that the differences between the region and cluster be pointed out one more time. Similar objects make a new cluster; however, neighbor objects make a new region. In clustering, the similarity function produces one value as a base for identifying similar objects. In region growing, the neighborhood function produces several values as a base for identifying neighbors of an object. As a result, two similar objects are not necessarily neighbors of each other, but two neighbor objects are similar.
The results of this study revealed that the regions delivered by the Neighborly SOM for a nonpictorial data set were effective in mining such data, and the sensitivity of the prediction methods showed that the Neighborly SOM is more robust than the other two methods. Results in Tables 15.2 and 15.3 also disclose that Neighborly SOM is less sensitive to the size of the training set.
For the sake of observation, the training and test process for the traditional SOM was repeated under the condition that the number of output nodes and the initial weight vectors were exactly the same as those for the Neighborly SOM. In fact, the set of output nodes and their initial weight vectors generated in the first epoch of the Neighborly SOM were used in the traditional SOM. This condition improved the overall performance of the traditional SOM to some degree. However, the sensitivity decreased and the specificity increased.
Currently, investigations of several variations of the Neighborly SOM are in progress for development of nested regions. The ultimate goal is to use nested regions for subpartitioning a given problem space.