Ray R. Hashemi1 [email protected]; Azita A. Bahrami2 [email protected]; Mahmood Bahar3 [email protected]; Nicholas R. Tyler4 [email protected]; Daniel Swain, Jr.1 [email protected] 1 Department of Computer Science, Armstrong State University, Savannah, GA, USA
2 IT Consultation, Savannah, GA, USA
3 Department of Physics, Islamic Azad University, Tehran North Branch, Tehran, Iran
4 School of Pharmacy, University of Georgia, Athens, GA, USA
Region growing is a well-known concept in image processing that, among other things, effectively contributes to the mining of pictorial data. The goal of this research effort is to (1) investigate region growing in nonpictorial data and (2) determine the effectiveness of the regions in mining such data. Part (1) is met by introducing a new version of the self-organizing map (SOM), the Neighborly SOM, capable of delivering such regions. Part (2) is met by introducing a new prediction methodology that uses the delivered regions and measuring its effectiveness by (a) applying the method to 10 pairs of training and test sets [repeated random sub-sampling (RRSS) cross-validation] to predict chemical agents’ liver toxicity and (b) comparing the liver toxicity prediction accuracy with the predictions produced by C4.5 and the traditional SOM using leave-one-out (LOO) and RRSS cross-validations. The results revealed that the proposed methodology performs better.
Region growing is a well-established concept in the field of image processing. It is used to identify a group of neighboring pixels within an image (pictorial data) that have the same features. To accomplish region growing, a pixel is selected as the “seed,” and every pixel that satisfies the two criteria of being a neighbor of the seed and having a common set of features with the seed is added to the region. Each added pixel, in turn, acts as a new seed, and the process continues until no more pixels can be added to the region (Hashemi, 2001). Selection of the features of interest, along with the neighborhood radius, is subjective and defined by the region grower. An image may have several regions. The regions, among other things, can be used effectively in mining pictorial data. Examples are numerous, including mining of abnormalities in medical images, image content discovery, image understanding, and vision (Sahu et al., 2012; Antonie et al., 2001; Megalooikonomou et al., 2000; Smith et al., 2013).
The intriguing question is whether one can grow regions for nonpictorial data and whether the grown regions can contribute effectively to the mining of such data. By nonpictorial data, we mean a data set comprising a set of objects (records) free of pixel-based data. There are four major challenges in exploring this question: establishing the foundation of region growing, creating a vehicle for delivering regions, building a mining methodology that uses the regions, and comparing the effect of such a methodology with known existing ones.
The goal of this research effort is to investigate and provide for all these challenges. To meet the goal, after defining the basic foundation, we specifically introduce a new version of the self-organizing map (SOM) neural network (Kohonen, 1982), named Neighborly SOM, which is able to deliver the regions for a nonpictorial data set. The data set of interest includes the structure activity relationships (SAR) of a set of chemical agents and their features. A new prediction methodology is introduced based on the delivered regions that will be used to examine the prediction of the chemical agents’ liver toxicity.
The remaining organization of this chapter is as follows. Related research in this area is covered in section 2. The basic foundation is presented in section 3. The methodology of the study is introduced in section 4, and the empirical results are discussed in section 5. Our conclusions and implications for future research are covered in section 6.
There is a large body of research about the concepts of neighborhood and closeness. When it comes to numerical data, Euclidean distance is often the choice for measuring the closeness of objects. However, two objects that are close together, due to their small Euclidean distance, are not necessarily neighbors. To explain this concept further, the sum of the Euclidean distances among the n attributes of the two objects (which is only one value) is the deciding factor for determining closeness. In other words, the n attributes have a collective effect on the Euclidean distance, whereas the deciding factor for having two objects as neighbors is every individual attribute distance (i.e., n values) (Lin, 1997). To deliver the neighborhood for a given object, we introduce a modified version of the SOM neural network. To the best of our knowledge, using a neural network to deliver the neighbors of an object (and not its close objects) has not been explored before.
The neighborhood concept as a tool for defining the granularity of data was explored by the soft computing community (Lin, 1997; Hashemi et al., 1998; Yao and Zhong, 2002). Recently, there are some attempts to use the neighborhood system as a foundation for prediction (Hashemi et al., 2013, 2014; Yao, 2000). We try to build such a tool using our previous experiments (Hashemi et al., 2013, 2014).
An object with n attributes (A1,..., An) may be represented as a point in an n-dimensional space. In such a space, object Oi is the neighbor of object Oj if Oi is within the n-sphere neighborhood of Oj. This n-sphere is centered on Oj, and its radius (r) is the neighborhood radius of Oj. Therefore, Oi is the neighbor of Oj if for every attribute Ak the distance |Oj.Ak – Oi.Ak| ≤ r (for k = 1 to n) (Hashemi et al., 2004, 2013; Yao, 2000). Two objects within the neighborhood n-sphere of Oj are not necessarily neighbors of each other. For example, the data set in Figure 15.1(a) has three objects, and the objects are described by two attributes, A1 and A2. For a neighborhood radius of r = 1, the objects O2 and O3 are neighbors of O1, but they cannot be neighbors of each other because the distance between O2 and O3, for at least one attribute, is greater than r [see Figure 15.1(b)]. If one creates the two-sphere of object O3, it does not include O2.
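The per-attribute neighborhood test can be contrasted with Euclidean closeness in a few lines of Python. The coordinates below are illustrative, not the exact values of Figure 15.1:

```python
def is_neighbor(o_i, o_j, r):
    # Oi is a neighbor of Oj iff |Oj.Ak - Oi.Ak| <= r for EVERY attribute Ak
    return all(abs(a - b) <= r for a, b in zip(o_i, o_j))

def euclidean(o_i, o_j):
    # the collective (single-valued) closeness measure, for contrast
    return sum((a - b) ** 2 for a, b in zip(o_i, o_j)) ** 0.5

# With r = 1, O2 and O3 are both neighbors of O1, yet they are not
# neighbors of each other: they differ by more than r on attribute A1.
O1, O2, O3 = (0.0, 0.0), (1.0, 0.3), (-1.0, 0.3)
print(is_neighbor(O2, O1, 1))   # True
print(is_neighbor(O3, O1, 1))   # True
print(is_neighbor(O2, O3, 1))   # False

# O4 is as close to O1 in Euclidean terms as O2 is, yet only O2 is a
# neighbor of O1: closeness and neighborhood are distinct notions.
O4 = (0.0, 1.04)
print(round(euclidean(O2, O1), 2), round(euclidean(O4, O1), 2))  # 1.04 1.04
print(is_neighbor(O4, O1, 1))   # False
```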
In an image, region growing starts with a pixel as a seed, pj, and any pixel, pi, that satisfies the following two criteria is added to the region of pj (Hashemi et al., 2002):
Criterion I: pi is the neighbor of pj and
Criterion II: pi satisfies some features
The same steps are followed to grow a region for nonpictorial data. However, due to the physical boundaries of the pixels (manifested as their coordinates), a pixel has a maximum of eight immediate neighbors (8-Neighbor), whereas an object may have many immediate neighbors.
In other words, the first condition of region growing from a given pixel (seed) offers a maximum of eight immediate candidates for further expansion, but an object may have a large number of candidates. Eventually, a pixel may participate in at most one region. In contrast, an object has the potential of participating in several regions. Such a potential may dilute those features of a given region that contribute to effective data mining. Consequently, the multiple-region inclusion of the same object needs to be controlled.
As part of the basic foundation, two issues need to be addressed. The first issue concerns the second condition of region growing. In pictorial data, the features that a pixel can satisfy are determined by applying a function to the pixel value, the pixel’s coordinates, the values of the neighboring pixels, the coordinates of the neighboring pixels, or any combination of these. One may ask what can be used to determine features for the objects of a nonpictorial data set. The response to this question comes from the nature of the data itself. To explain it further, let us divide nonpictorial data sets into two large groups. Every object in a data set of the first group is composed of n attributes (descriptive attributes); every object in a data set of the second group is composed of n + 1 attributes (n descriptive attributes and one decision attribute). Only the members of the second group are fit for region growing. As a result, the features that an object can satisfy are determined by applying a function to all or a subset of these attributes. The subset always includes the decision attribute. The descriptive and decision attributes are also known as the independent and dependent variables.
The second issue is about the differences between a cluster and a region. By definition, objects of a cluster have great similarity in comparison with each other but are very dissimilar to the objects in other clusters (Han and Kamber, 2006). This is not true for a region. An object in one region may also be a member of another region. In fact, the objects within a region are only similar to the seed and not necessarily to each other. Considering Figure 15.1 again, the objects of O2 and O3 are in the region of O1. However, O2 and O3 are not neighbors of each other, and they are dissimilar in comparison to each other, which contradicts the definition of a cluster. In other words, a cluster includes the objects that are close to the centroid of the cluster, whereas a region includes all the neighbors of a given object.
First, a new version of the SOM neural network, Neighborly SOM, is presented as the methodology for delivering regions of nonpictorial data. Second, a new prediction methodology is introduced based on the delivered regions. The details of the methodologies are covered in the following two subsections.
The traditional SOM is a two-layer neural network, introduced by Kohonen (1982), which clusters the records of an input data set. The weight vector for each output node, representing a cluster, is generated randomly, and each input vector (Oi) fires the output node (Bj) for which the Euclidean distance between Oi and the weight vector (Wj) of Bj is the minimum:

Dist(Oi, Bj) = √(Σk=1..n (Oi.Ak – Wj.Ak)²)   (15.1)
The weight vectors of the fired node and those nodes close to the fired one (within a local radius) are modified by a portion of the difference between the input and the weight vectors. The portion is decided by the learning rate:

Wj(new) = Wj(old) + η(Oi – Wj(old))   (15.2)

where Oi is an input vector, Wj is the weight vector of a node within the locale of the winner node, and η is the learning rate (0 < η < 1).
After a predefined number of epochs, both the local radius and the learning rate are reduced. The process repeats until the input vectors cannot change the size of the clusters.
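The traditional training step just described can be sketched in a few lines. This is a minimal sketch: `winner` implements the Euclidean firing rule and `update` the learning-rate modification, with the locale reduced to the winner itself for brevity:

```python
import math

def winner(o, weights):
    # fire the output node whose weight vector is Euclidean-closest to the input
    return min(range(len(weights)), key=lambda j: math.dist(o, weights[j]))

def update(w, o, eta):
    # move the weight vector toward the input by a fraction eta of the difference
    return [wk + eta * (ok - wk) for wk, ok in zip(w, o)]

weights = [[0.0, 0.0], [5.0, 5.0]]
o = [4.0, 4.5]
j = winner(o, weights)                      # node 1 is the closer one
weights[j] = update(weights[j], o, eta=0.5)
print(j, weights[j])                        # 1 [4.5, 4.75]
```

After each predefined number of epochs, a full implementation would also shrink the local radius and the learning rate, as described above.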
Three elements shape the architecture and behavior of the Neighborly SOM: the nature of the input data, the shortcomings of the traditional SOM (Hashemi et al., 2005), and the influence of the neighborhood concept. The nature of the data was addressed in section 2 of this chapter. In building the Neighborly SOM, remedies for the shortcomings of the traditional SOM are pointed out as needed. The influence of the neighborhood concept changes the SOM from a cluster builder into a region grower.
The traditional SOM performs clustering of the objects of a given data set. However, we are not interested in the clusters for a given set of seeds; instead, we are interested in growing regions for them. As a result, we introduce changes to the traditional SOM to support the growing of regions by adopting the same two criteria used for region growing in images (introduced earlier, in section 3 of this chapter).
The first departure from the traditional SOM is in the number of output nodes. The number of output nodes is initially equal to 1, and the first object of the data set is considered as the weight vector of the output node. Any object of the data set that is not able to fire the existing output nodes will serve as the weight vector of a new output node. By doing so, the number of output nodes is dictated by the objects of the data set and their weight vectors are not generated randomly. Clearly, this modification is in response to some of the SOM shortcomings. (For details about these shortcomings, consult Hashemi et al., 2005.)
The second departure from the traditional SOM is caused by providing for the first criterion of the region growing. Eq. (15.1) is replaced by Eqs. (15.3) and (15.4). We assume that both input and weight vectors have n elements:
where
and Q(Bj) is the number of terms of |Oi.Ak – Wj.Ak| in Eq. (15.4) that are not zero.
The winner node is the one for which
where r is the neighborhood radius and δ is a small value (usually equal to 1/10 of r). The purpose of having δ is to inject some small flexibility into the size of the radius. This flexibility is needed due to changes in the weight vector.
Just as a reminder, using Eq. (15.1) will put both O1 and O3 of the nonpictorial data set of Figure 15.1(a) in the same cluster, whereas Eq. (15.5) puts O1 and O3 in different neighborhoods, correctly.
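Assuming the winner condition amounts to requiring every per-attribute distance to fall within r + δ (a hedged reading of Eqs. (15.3)–(15.5); the exact forms may differ), the firing test could look like this:

```python
def fires(o, w, r, delta=None):
    # Hedged reading of the winner condition (Eq. 15.5): the node fires only
    # when every per-attribute distance between the object and the weight
    # vector stays within the slightly relaxed neighborhood radius r + delta.
    if delta is None:
        delta = r / 10.0        # the text suggests delta of about r/10
    return all(abs(ok - wk) <= r + delta for ok, wk in zip(o, w))

# With r = 1, an object at (-1.0, 0.3) fires a node whose weight vector is
# (0, 0), but objects at (1.0, 0.3) and (-1.0, 0.3) do not fire each other.
print(fires((-1.0, 0.3), (0.0, 0.0), 1))   # True
print(fires((1.0, 0.3), (-1.0, 0.3), 1))   # False
```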
The last departure from the traditional SOM is caused by providing for the second criterion of region growing. Having the same decision value for the overwhelming majority of the objects of a region is considered the common feature among the objects. The dilution of the common feature is measured and represented by the dilution factor, DF, of the region. Let R be a region with k objects and weight vector W. Let us also assume that k1 objects in R have the same decision value as W, k2 objects have decisions different from W, and k = k1 + k2. The dilution factor for R is defined as follows:
It is obvious that as k2 increases, the feature of region R becomes more diluted. When an object can be a part of more than one region, it decreases the dilution factor of those regions whose weight vectors’ decision values differ from the object’s decision value. However, if the increase in k2 is small enough, the dilution can be tolerated. Both prevented dilution and tolerated dilution are used to handle the case of multiple-region membership for a given object. Of course, the enforcement of prevented dilution has priority over tolerated dilution.
Consequently, when more than one output node can be fired, the node whose weight vector’s decision attribute matches the decision attribute of the object is chosen to be fired (prevented dilution). If this does not resolve the tie, then the output node whose region has the largest cardinality is fired. If the tie persists, a winner node is chosen randomly. The cardinality of a region used in tie breaking may be calculated using recent memory or past memory. The recent-memory cardinality is the number of objects added to a given region during the current epoch, whereas the past-memory cardinality is the number of unique objects that were part of the region during at least one of the epochs, from the first epoch through the current one.
As an example, let us concentrate on two output nodes, B1 and B2, and the first two consecutive epochs, 1 and 2. The objects that fire B1 and B2 during each epoch, which make up their regions, are shown in Table 15.1.
Table 15.1
Use of Recent and Past Memories in Calculating the Cardinality of a Region
Epochs and Cardinality | Objects in B1 Region | Objects in B2 Region |
Epoch 1 | {O5, O2, O9, O6} | {O8, O1, O4} |
Epoch 2 | {O5, O1, O9, O6, O4} | {O8, O2} |
Card(Region)-Recent Memory | 5 | 2 |
Card(Region)-Past Memory | 6 | 4 |
Usually, the use of recent memory is preferred when the training set is large; otherwise, use of past memory is preferred.
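The two cardinalities of Table 15.1 follow directly from set operations over the per-epoch region contents; for node B1:

```python
# Objects that fired B1 in each of the first two epochs (Table 15.1)
b1_epochs = [{"O5", "O2", "O9", "O6"},
             {"O5", "O1", "O9", "O6", "O4"}]

def card_recent(epochs):
    # recent memory: objects added to the region during the current epoch
    return len(epochs[-1])

def card_past(epochs):
    # past memory: unique objects that belonged to the region in any epoch
    # from the first epoch through the current one
    return len(set().union(*epochs))

print(card_recent(b1_epochs))   # 5
print(card_past(b1_epochs))     # 6
```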
The locale radius in Neighborly SOM is always zero to discourage firing of multiple output nodes. These changes in the traditional SOM are encapsulated in the following algorithm:
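Pulling the pieces together, the training loop can be sketched as follows. This is a hedged sketch under the assumptions already noted (per-attribute firing with radius r + δ, recent-memory cardinality for tie breaking); the function and variable names are illustrative, not the authors’:

```python
def per_attribute_fires(attrs, w, r, delta):
    # assumed firing test: every per-attribute distance within r + delta
    return all(abs(a - b) <= r + delta for a, b in zip(attrs, w))

def train_neighborly_som(data, r, n_epochs, eta=0.1):
    # data: list of (attributes, decision) pairs
    delta = r / 10.0
    weights, decisions, regions = [], [], []
    for _ in range(n_epochs):
        regions = [set() for _ in weights]      # recent-memory regions
        for i, (attrs, dec) in enumerate(data):
            fired = [j for j, w in enumerate(weights)
                     if per_attribute_fires(attrs, w, r, delta)]
            if not fired:
                # the object seeds a new output node; weights are never random
                weights.append(list(attrs))
                decisions.append(dec)
                regions.append({i})
                continue
            # prevented dilution: prefer nodes whose decision matches the object's
            matching = [j for j in fired if decisions[j] == dec]
            pool = matching or fired
            # remaining ties broken by (recent-memory) region cardinality
            j = max(pool, key=lambda k: len(regions[k]))
            regions[j].add(i)
            # locale radius is zero: only the winner's weight vector is updated
            weights[j] = [wk + eta * (a - wk)
                          for wk, a in zip(weights[j], attrs)]
    return weights, decisions, regions
```

On a toy data set with two well-separated decision groups, two output nodes emerge, one seeded by the first object of each group.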
After the training of the Neighborly SOM is completed, it identifies the final set of regions (R1, . . . , Rn), their corresponding output nodes (B1, . . . , Bn), and their corresponding weight vectors (W1, . . . , Wn). To turn the Neighborly SOM into a prediction tool, some adjustments are necessary. The adjustment rules are listed below based on their precedence, from highest to lowest:
1. If v ≠ v’, then set v ← v’, where v and v’ are the value of the decision attribute of Wi and the dominant decision value of Ri, respectively.
2. If Wi = Wj, then replace Bi and Bj with a new output node By such that Wy = Wi = Wj and Ry = Ri ∪ Rj. Ry may have a dilution factor that is different from those of both Ri and Rj.
3. If DFRi < Tdf, then remove output node Bi, its region, and its weight vector, where Tdf is a threshold value.
4. If Card(Ri) = 1, then remove output node Bi, its region, and its weight vector.
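The adjustment rules can be sketched in the same vein. The dilution factor is read here as the share of region members whose decision matches the weight vector’s; this is an assumption, since Eq. (15.6) is not reproduced above:

```python
def adjust(weights, decisions, regions, data, t_df=0.5):
    # data: list of (attributes, decision) pairs; regions hold object indices.
    # Rule 1: relabel each node with the dominant decision of its region.
    for i, region in enumerate(regions):
        decs = [data[j][1] for j in region]
        decisions[i] = max(set(decs), key=decs.count)
    # Rule 2: merge output nodes that ended up with identical weight vectors.
    seen = {}
    for i, w in enumerate(weights):
        key = tuple(w)
        if key in seen:
            regions[seen[key]] |= regions[i]
            regions[i] = set()              # mark node i for removal
        else:
            seen[key] = i
    # Rules 3 and 4: drop overly diluted regions and singleton regions.
    def dilution_factor(i):                 # assumed form: k1 / (k1 + k2)
        k1 = sum(1 for j in regions[i] if data[j][1] == decisions[i])
        return k1 / len(regions[i])
    keep = [i for i in range(len(weights))
            if len(regions[i]) > 1 and dilution_factor(i) >= t_df]
    return ([weights[i] for i in keep],
            [decisions[i] for i in keep],
            [regions[i] for i in keep])
```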
The Neighborly SOM is now ready to act as a predictive tool. The features of the regions are their weight vectors. A test object that has not been previously seen by the system is fed to the system only once. A degree of neighborhood between the test object and each weight vector (say, Wj) is calculated using Eq. (15.7):
where n = Card(Wj) and
The winner node is the one that has the largest degree of neighborhood. The value of the decision attribute of the winner node’s weight vector is the predicted decision for the test object. If more than one node is fired, then the dominant decision value among the weight vectors of the fired nodes is the predicted decision for the test object. If there is no dominant decision, then the test object cannot be predicted.
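A hedged sketch of the prediction step follows, assuming for illustration that the degree of neighborhood amounts to the fraction of the n attributes on which the test object falls within radius r of the weight vector; Eq. (15.7) may define it differently:

```python
def degree_of_neighborhood(obj, w, r):
    # assumed form: share of attributes within the neighborhood radius
    return sum(1 for a, b in zip(obj, w) if abs(a - b) <= r) / len(w)

def predict(obj, weights, decisions, r):
    degrees = [degree_of_neighborhood(obj, w, r) for w in weights]
    best = max(degrees)
    fired = [decisions[j] for j, d in enumerate(degrees) if d == best]
    # dominant decision among the fired nodes; None means "cannot be predicted"
    counts = {d: fired.count(d) for d in set(fired)}
    top = max(counts.values())
    dominant = [d for d, c in counts.items() if c == top]
    return dominant[0] if len(dominant) == 1 else None

print(predict((0.2, 0.1), [[0.0, 0.0], [5.0, 5.0]], [1, 0], 1))  # 1
```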
John Young and his research team at the National Center for Toxicological Research (NCTR) have investigated the properties of more than a thousand chemical agents (Young et al., 2004). The purpose of the investigation was to provide a means for rapidly predicting the organ-specific toxicity of new chemicals. As a first step, there was a need to build a data set based on the structure activity relationships (SAR) of the chemical agents. The two main sources that contributed to the building of the SAR data set were the Carcinogenic Potency Database (CPDB) (Gold et al., 1984) and research findings reported in the literature. With regard to the first source, the CPDB, a chemical agent may have several records representing studies for different genders, species, routes of administration, and organ-specific toxicities. Each such group of records was mapped onto a single record for the SAR data set. A single value for the agent’s toxicity was also derived from the agent’s different toxicity levels. In addition, elements that could cause either inaccuracy or structural inconsistency were removed. With regard to the second source, the findings reported in the literature were verified by Young’s research team, who duplicated the reported experiments.
After structural cleanup of the agents, the SAR data set included 999 chemical agents (objects). Each object had 140 attributes, including a decision attribute whose value indicated whether or not the agent causes cancer in the liver. The value of the decision attribute was either 1 (toxin) or 0 (nontoxin). We used the proposed methodology to build a system to predict the organ-specific toxicity of an agent using its descriptive and decision attributes.
Building the predictive system started with identifying and removing the redundant descriptive attributes. A redundant attribute does not strongly contribute to the decision. An entropy approach (Natt et al., 2012) was used to complete the task, which resulted in the identification of only seven descriptive attributes as nonredundant. Removal of the redundant attributes generated many duplicate and contradictory objects. Two contradictory objects have the same values for their descriptive attributes but different values for their decision attributes. As part of the preprocessing of the SAR data set, one copy of each duplicated object was kept, and the contradictory objects were removed from the data set. After the preprocessing of the SAR data set was completed, it contained 426 objects.
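An entropy-based filter of this kind can be sketched generically as an information-gain screen against the decision attribute. The exact procedure of Natt et al. (2012) may differ, and the threshold here is illustrative:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(objects, labels, k):
    # H(decision) - H(decision | attribute k)
    groups = {}
    for obj, lab in zip(objects, labels):
        groups.setdefault(obj[k], []).append(lab)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def nonredundant_attributes(objects, labels, threshold=0.01):
    # keep only attributes whose gain on the decision exceeds the threshold
    return [k for k in range(len(objects[0]))
            if information_gain(objects, labels, k) > threshold]

# attribute 0 determines the decision; attribute 1 carries no information
objects = [(1, 0), (1, 1), (0, 0), (0, 1)]
labels  = [1, 1, 0, 0]
print(nonredundant_attributes(objects, labels))  # [0]
```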
To measure the prediction accuracy of the Neighborly SOM, repeated random sub-sampling (RRSS) cross-validation was used. For this validation, 10 pairs of training and test sets were carved from the preprocessed SAR data set as follows. A total of 10% of the SAR objects were randomly selected as a test set. From the remaining records in SAR, the maximum number of objects was selected as a training set such that the numbers of objects with decision 1 and with decision 0 in the training set were the same. The reason for having the same number of objects for each decision value was to make sure the trained Neighborly SOM was not biased toward one of the decisions. Since only 10% of the SAR objects were chosen as a test set each time, 10 pairs of training and test sets were created such that each test object appeared in only one test set.
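Carving one such RRSS pair (a 10% random test set, then a maximal class-balanced training set from the remainder) can be sketched as follows; the function name and seed are illustrative:

```python
import random

def rrss_pair(labels, test_frac=0.10, seed=0):
    # returns (train_indices, test_indices) over the preprocessed data set
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    rng.shuffle(idx)
    n_test = int(len(idx) * test_frac)
    test, rest = idx[:n_test], idx[n_test:]
    # maximal balanced training set: equal counts of decisions 1 and 0
    ones = [i for i in rest if labels[i] == 1]
    zeros = [i for i in rest if labels[i] == 0]
    m = min(len(ones), len(zeros))
    train = ones[:m] + zeros[:m]
    return train, test

labels = [1] * 30 + [0] * 10
train, test = rrss_pair(labels)
# the training set always holds as many decision-1 as decision-0 objects
```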
For each pair of training and test sets, the Neighborly SOM was trained on the training set, and its prediction accuracy was tested on the corresponding test set. The average results for the 10 test sets, along with the specificity and sensitivity, are shown in Table 15.2.
Table 15.2
Prediction Results for Neighborly SOM, Traditional SOM, and C4.5 Using 10 Pairs
Methodology | % of Correct Predictions | % False Negatives | % False Positives | % Sensitivity | % Specificity | % of Failed Predictions |
Neighborly SOM | 77 | 7.62 | 4.3 | 78.3 | 88.7 | 13 |
Traditional SOM | 63.57 | 29.8 | 6.67 | 72 | 60.94 | 0 |
C4.5 | 36.2 | 8.3 | 25.5 | 26.7 | 76.7 | 30 |
It is logical to compare the prediction accuracy of the Neighborly SOM with the prediction accuracy of the traditional SOM. To complete the task, we repeated the following process for each pair of the training and test set using the traditional SOM:
1. The initial number of output nodes was selected to be 20, for which 20 weight vectors were randomly generated (one per output node). The number of epochs remained the same, as did the initial values of η and β. After the clusters of the training set were produced, the dominant decision for each cluster was calculated.
2. During the test process, the decision value of the test object was compared with the dominant decision of the cluster of the fired node.
The average results for all 10 pairs are shown in Table 15.2.
The prediction accuracy of the Neighborly SOM was also compared with the prediction accuracy of the well-known approach of C4.5 (Quinlan, 1993) using the same 10 pairs of training and test sets. These results are also shown in Table 15.2.
In addition, we compared the performance of the three predictive methods of the Neighborly SOM, traditional SOM, and C4.5, using leave-one-out (LOO) cross-validation (Kohavi, 1995). In this cross-validation, one object of the data set was set aside as a test set, and from the remaining objects, a training set was carved such that the number of objects in the training set was the same for all the decision values. The process was repeated for every object of the data set and then the prediction accuracy results for each model were averaged. The results are shown in Table 15.3.
Table 15.3
Prediction Results for Neighborly SOM, Traditional SOM, and C4.5 Using LOO Cross-Validation
Methodology | % of Correct Predictions | % False Negatives | % False Positives | % Sensitivity | % Specificity | % of Failed Predictions |
Neighborly SOM | 78.2 | 8.5 | 5.9 | 72.5 | 88.1 | 7.5 |
Traditional SOM | 69 | 25.4 | 6.6 | 70.1 | 67.9 | 0 |
C4.5 | 67.8 | 0.24 | 17.2 | 57.8 | 99.5 | 14.8 |
Neighborly SOM generates regions of the input data set, whereas the traditional SOM clusters the input data set. It is important that the differences between the region and cluster be pointed out one more time. Similar objects make a new cluster; however, neighbor objects make a new region. In clustering, the similarity function produces one value as a base for identifying similar objects. In region growing, the neighborhood function produces several values as a base for identifying neighbors of an object. As a result, two similar objects are not necessarily neighbors of each other, but two neighbor objects are similar.
The results of this study revealed that the regions delivered by the Neighborly SOM for a nonpictorial data set were effective in mining such data, and the sensitivity of the prediction methods showed that the Neighborly SOM is more robust than the other two methods. Results in Tables 15.2 and 15.3 also disclose that Neighborly SOM is less sensitive to the size of the training set.
For the sake of observation, the training and test process for the traditional SOM was repeated under the condition that the number of output nodes and the initial weight vectors were exactly the same as those for the Neighborly SOM. In fact, the set of output nodes and their initial weight vectors generated in the first epoch of the Neighborly SOM were used in the traditional SOM. This condition improved the overall performance of the traditional SOM to some degree. However, the sensitivity decreased and the specificity increased.
Currently, investigations of several variations of the Neighborly SOM are in progress for development of nested regions. The ultimate goal is to use nested regions for subpartitioning a given problem space.