The calculation of the Gini(VIV) value for j: 7 (see Section 6.5; eq. (6.8)) is as follows:

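$$\mathrm{Gini}_{\mathrm{left}}(\mathrm{VIV}) = 1 - \left[(3/5)^2 + (2/5)^2\right] = 0.48$$
$$\mathrm{Gini}_{\mathrm{right}}(\mathrm{VIV}) = 1 - \left[(3/5)^2 + (2/5)^2\right] = 0.48$$
$$\mathrm{Gini}(\mathrm{VIV}) = \frac{5(0.48) + 5(0.48)}{10} = 0.48$$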

The calculation of Gini(VIP) value for j: 8 (see Section 6.5; eq. (6.8)) is as follows:

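$$\mathrm{Gini}_{\mathrm{left}}(\mathrm{VIP}) = 1 - \left[(3/5)^2 + (2/5)^2\right] = 0.48$$
$$\mathrm{Gini}_{\mathrm{right}}(\mathrm{VIP}) = 1 - \left[(3/5)^2 + (2/5)^2\right] = 0.48$$
$$\mathrm{Gini}(\mathrm{VIP}) = \frac{5(0.48) + 5(0.48)}{10} = 0.48$$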

The calculation of Gini(VIT) value for j: 9 (see Section 6.5; eq. (6.8)) is as follows:

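$$\mathrm{Gini}_{\mathrm{right}}(\mathrm{VIT}) = 1 - \left[(5/5)^2 + (0/5)^2\right] = 0$$
$$\mathrm{Gini}_{\mathrm{left}}(\mathrm{VIT}) = 1 - \left[(1/5)^2 + (4/5)^2\right] = 0.32$$
$$\mathrm{Gini}(\mathrm{VIT}) = \frac{5(0.32) + 5(0)}{10} = 0.16$$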

The calculation of Gini(DM) value for j: 10 (see Section 6.5; eq. (6.8)) is as follows:

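$$\mathrm{Gini}_{\mathrm{left}}(\mathrm{DM}) = 1 - \left[(2/3)^2 + (1/3)^2\right] \approx 0.44$$
$$\mathrm{Gini}_{\mathrm{right}}(\mathrm{DM}) = 1 - \left[(4/7)^2 + (3/7)^2\right] \approx 0.49$$
$$\mathrm{Gini}(\mathrm{DM}) = \frac{3(0.44) + 7(0.49)}{10} \approx 0.48$$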
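The Gini calculations above all follow one pattern: compute the impurity of each branch from its class counts, then weight the two branches by their sizes. The following is a minimal Python sketch of that pattern, assuming a binary split; the helper names `gini` and `gini_split` are illustrative and do not come from the text.

```python
def gini(counts):
    """Gini impurity of one branch, given its per-class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def gini_split(left_counts, right_counts):
    """Size-weighted Gini value of a binary split."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    return (n_left * gini(left_counts) + n_right * gini(right_counts)) / n


# Class counts read off the VIT calculation: left branch (1, 4), right branch (5, 0).
print(round(gini_split([1, 4], [5, 0]), 2))
```

A pure branch such as (5, 0) contributes an impurity of 0, which is why splits that isolate one class score lowest.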

The lowest Gini value is Gini(Information) = 0. The attribute with the lowest Gini value becomes the root of the decision tree. The decision tree obtained from the sample WAIS-R dataset is provided in Figure 2.14.

Figure 2.14: The decision tree obtained from sample WAIS-R dataset D(10 × 10).

The decision rules obtained from the sample WAIS-R dataset in Figure 2.14 are given as Rules 1 and 2:

Rule 1: If Information ≥ 11, class is “healthy.”

Rule 2: If Information < 11, class is “patient.”
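Rules 1 and 2 amount to a single threshold test on the Information attribute. A minimal sketch follows; the function name `classify_by_information` and the example scores are hypothetical and are not taken from the dataset.

```python
def classify_by_information(information_score, threshold=11):
    """Apply Rule 1 and Rule 2: scores at or above the threshold are 'healthy'."""
    return "healthy" if information_score >= threshold else "patient"


# Illustrative scores only; they are not entries of the WAIS-R dataset.
for score in (9, 11, 14):
    print(score, classify_by_information(score))
```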

The sample synthetic WAIS-R dataset D(3 × 10) in Table 2.19 is obtained by applying Rule 1 (Information ≥ 11) and Rule 2 (Information < 11) (see Figure 2.14).

In brief, a sample WAIS-R dataset is randomly chosen from the WAIS-R dataset, and the CART algorithm is then applied. Applying the resulting rules yields the sample synthetic dataset (see Table 2.22). The sample synthetic dataset is composed of two patient entries and one healthy entry (the number of entries in the synthetic WAIS-R dataset can be increased or reduced) (see Table 2.23).

Table 2.22: Forming the sample synthetic WAIS-R dataset for Information ≥ 11 and Information < 11 as per Rule 1 and Rule 2.

Table 2.23: Forming the sample synthetic WAIS-R dataset for Rule 1: Information ≥ 11, then class is “healthy” and Rule 2: Information < 11, then class is “patient,” as per Table 2.22.

Note that in the following sections, synthetic datasets are used for the applications and examples in the medical area, and real datasets are used for those in economy.

2.5 Basic statistical descriptions of data

Successful preprocessing requires an overall picture of the data. Such a picture can be obtained from descriptive measures computed with basic statistics. The mean, median, mode and midrange are measures of central tendency, which locate the middle or center of a data distribution. Beyond central tendency, it is also important to have an idea of the dispersion of the data, that is, how the data are spread out. The most frequently used measures of dispersion are the range, quartiles and interquartile range, boxplots, and the variance and standard deviation. Graphic displays of data summaries and distributions include bar charts, quantile plots, quantile–quantile plots, histograms and scatter plots [33–39].

2.5.1 Central tendency: Mean, median and mode

Let x1, x2, . . . , xN be the set of N observed values or observations for an attribute X. These values may also be referred to as the dataset (D). If the observations were plotted, where would most of the values fall? The answer gives an idea of the central tendency of the data. Measures of central tendency comprise the mean, median, mode and midrange [33–39].

The most effective and common numeric measure of the “center” of a set of data is the (arithmetic) mean.

Definition 2.5.1.1. The arithmetic mean of this set of values is given as follows:

$$\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N} \tag{2.1}$$

Example 2.4 Suppose that we have the following 10 values: X = {2.5, 3.5, 3.5, 4, 4, 4.5, 5, 5, 6, 6.5}. Using eq. (2.1), the mean is given as:

$$\bar{x} = \frac{2.5 + 3.5 + 3.5 + 4 + 4 + 4.5 + 5 + 5 + 6 + 6.5}{10} = \frac{44.5}{10} = 4.45$$

Sometimes each value xi in a set may be associated with a weight wi for i = 1, . . . , N. Here the weights denote the significance or frequency of occurrence attached to the respective values. In this case, we use the following definition.

Definition 2.5.1.2. Weighted arithmetic mean or weighted average

$$\bar{x} = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_N x_N}{w_1 + w_2 + \cdots + w_N} \tag{2.2}$$
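As an illustration of eq. (2.2), the sketch below uses the distinct values of Example 2.4 with their frequencies of occurrence as weights, which reproduces the ordinary mean of eq. (2.1). The function name `weighted_mean` is an illustrative choice, not part of the text.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Distinct values of Example 2.4 and how often each occurs.
distinct_values = [2.5, 3.5, 4, 4.5, 5, 6, 6.5]
frequencies = [1, 2, 2, 1, 2, 1, 1]
print(weighted_mean(distinct_values, frequencies))
```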

The mean is a very useful quantity for the description of a dataset, but it does not always offer the best way of measuring the center of the data. A significant problem with the mean is its sensitivity to outliers, also known as extreme values. Even a small number of extreme values can distort the mean. For instance, the mean income of bank employees may be raised significantly by a manager whose salary is extremely high. Similarly, the mean test score of a class in a college course could be pulled down by a few very low scores. To offset the effect of a small number of outliers, the trimmed mean is sometimes used.

Definition 2.5.1.3. The trimmed mean is defined as the mean obtained after one chops off the values that lie at the high and low extremes. For example, it is possible to sort the values observed for credits and remove the top and bottom 3% prior to calculating the mean. One should avoid trimming a big portion (such as 30%) at both ends, because such a procedure may bring about a loss of valuable information.
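A minimal sketch of the trimmed mean follows, assuming that the same proportion is removed from each end and that the number of removed values is rounded down. The function name `trimmed_mean` and the 10% proportion are illustrative choices.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping the given proportion of values from each end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values removed per end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


data = [2.5, 3.5, 3.5, 4, 4, 4.5, 5, 5, 6, 6.5]  # values of Example 2.4
print(trimmed_mean(data, 0.10))  # drops the smallest and the largest value
```

With `proportion=0` the function reduces to the ordinary mean, which makes the effect of trimming easy to compare.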

For skewed or asymmetric data, median is known to be a better measure of the center of data.

Definition 2.5.1.4. The median is defined as the middle value in a set of ordered data values. It is the value that separates the higher half of a dataset from the lower half. In statistics and probability, the median generally applies to numeric data, but the concept can be extended to ordinal data. Assume that a given dataset of N values for an attribute X is arranged in increasing order. If N is odd, the median is the value lying at the middle position of the ordered set. If N is even, the median is not unique: it can be either of the two values that lie in the middle, or any value between them. If X is a numeric attribute, the median is by convention taken as the mean of the two middle values [33–39].

Example 2.5 Median. The observations of Example 2.4 are already ordered, but because their number is even the median is not unique. The median may be any value between the two middle values, 4 and 4.5, which are the fifth and sixth values in the list. By convention, the average of the two middle values is assigned as the median, which gives $\frac{4 + 4.5}{2} = \frac{8.5}{2} = 4.25$. Hence, the median is 4.25.

In another scenario, let us say that only the first 9 values are in the list. With an odd number of values, the median is the middlemost value. In the list, it is the fifth value, which is 4.

The median is computationally expensive to calculate when there is a large number of observations. For numeric attributes, however, an approximate value can be calculated easily. Data may be grouped in intervals based on their xi values, and the frequency (number of data values) of each interval may be known; an example is the group of individuals whose EDSS score is between 2 and 8.5. The interval that contains the median frequency is the median interval. In this case, we can estimate the median of the whole dataset (e.g., the median personal credit) by interpolation, using the following formula:

$$\mathrm{median} = L_1 + \left(\frac{N/2 - \left(\sum \mathrm{freq}\right)_l}{\mathrm{freq}_{\mathrm{median}}}\right) \times \mathrm{width} \tag{2.3}$$

where $L_1$ is the lower boundary of the median interval, $N$ is the number of values in the whole dataset, $(\sum \mathrm{freq})_l$ is the sum of the frequencies of all the intervals that are lower than the median interval, $\mathrm{freq}_{\mathrm{median}}$ is the frequency of the median interval and width refers to the width of the median interval.
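The interpolation of eq. (2.3) can be sketched as follows. The function name `grouped_median` and the interval figures in the usage lines are hypothetical; they are chosen only to show how the terms fit together.

```python
def grouped_median(lower_boundary, n_total, freq_below, freq_median, width):
    """Approximate median of grouped data by interpolation, as in eq. (2.3).

    lower_boundary: lower boundary of the median interval
    n_total:        number of values in the whole dataset
    freq_below:     summed frequency of all intervals below the median interval
    freq_median:    frequency of the median interval
    width:          width of the median interval
    """
    return lower_boundary + (n_total / 2 - freq_below) / freq_median * width


# Hypothetical grouped data: 100 values, median interval [20, 30) holding 25 values,
# with 40 values lying in the intervals below it.
print(grouped_median(lower_boundary=20, n_total=100, freq_below=40,
                     freq_median=25, width=10))
```

Only the interval frequencies are needed, so the estimate avoids sorting the individual observations.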

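For ungrouped data, the median is the value that falls in the middle when the values are ranked in order. It is the value that splits the array into two halves.

Equations (2.4a) and (2.4b) are applied depending on whether the number of elements N in the array is even or odd, where $x_{(k)}$ denotes the kth value of the ordered array.

$$x_{\mathrm{median}} = \frac{x_{(N/2)} + x_{(N/2 + 1)}}{2}, \quad N \text{ is even} \tag{2.4a}$$

$$x_{\mathrm{median}} = x_{\left(\frac{N+1}{2}\right)}, \quad N \text{ is odd} \tag{2.4b}$$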
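The even and odd cases can be sketched in a few lines, assuming the convention that the median of an even-length array is the average of its two middle values. The function name `median` is an illustrative choice; Python's standard library offers `statistics.median` with the same behavior.

```python
def median(values):
    """Median of ungrouped data: middle value (odd N) or mean of the two middle values (even N)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty array is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # the ((N + 1) / 2)-th value, counting from 1
    return (ordered[mid - 1] + ordered[mid]) / 2


data = [2.5, 3.5, 3.5, 4, 4, 4.5, 5, 5, 6, 6.5]  # values of Example 2.4
print(median(data))      # even number of values
print(median(data[:9]))  # odd-length prefix of the same data
```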

The mode is another measure of central tendency.

Definition 2.5.1.5. The mode is the value that occurs most frequently in the set. For this reason, it can be identified for both qualitative and quantitative attributes. The greatest frequency may correspond to a number of different values, which brings about more than one mode. Datasets with one, two and three modes are called unimodal, bimodal and trimodal, respectively. Overall, a dataset that has two or more modes is multimodal. At the other extreme, if each data value occurs only once, there is no mode at all [7, 33–39].

Example 2.6 Mode. The data from Example 2.4 are trimodal. The three modes are 3.5, 4 and 5, each of which occurs twice.
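A short sketch that returns every value tied for the highest frequency, and an empty list when each value occurs only once, is given below. The function name `modes` is an illustrative choice.

```python
from collections import Counter


def modes(values):
    """All values sharing the highest frequency; empty if every value occurs once."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1 and len(counts) > 1:
        return []  # each value is seen once only: no mode
    return sorted(v for v, c in counts.items() if c == top)


data = [2.5, 3.5, 3.5, 4.0, 4.0, 4.5, 5.0, 5.0, 6.0, 6.5]  # values of Example 2.4
print(modes(data))
```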

The following empirical relation is obtained for unimodal numeric data that are skewed moderately or happen to be asymmetrical [40].

$$\mathrm{mean} - \mathrm{mode} \approx 3 \times (\mathrm{mean} - \mathrm{median}) \tag{2.5}$$

This suggests that the mode for unimodal frequency curves, moderately skewed, can be estimated easily if we know the mean and median value.
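Solving eq. (2.5) for the mode makes the estimate explicit. This is only a rearrangement of the empirical relation, so it carries the same restriction to unimodal, moderately skewed data:

```latex
\mathrm{mode} \approx \mathrm{mean} - 3\,(\mathrm{mean} - \mathrm{median})
             = 3\,\mathrm{median} - 2\,\mathrm{mean}
```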

We can also use the midrange to assess the central tendency of a numeric dataset. The midrange is the average of the largest and smallest values in the set. This measure is easy to compute with the SQL aggregate functions max() and min() [34].

Example 2.7: Midrange. The midrange of the data provided in Example 2.4 is $\frac{2.5 + 6.5}{2} = 4.5$.

In a unimodal frequency curve with a perfectly symmetric data distribution, the mean, median and mode are all at the same center value (Figure 2.15(a)). Data in most real-life applications are not symmetric. Instead, they may be positively skewed, where the mode occurs at a value smaller than the median (see Figure 2.15(b)), or negatively skewed, where the mode occurs at a value greater than the median (see Figure 2.15(c)) [35–36].

2.5.2 Spread of data

The spread or dispersion of numeric data is an important property to assess in statistics. The most common measures of dispersion are the range, quantiles, quartiles, percentiles and the interquartile range. The boxplot is another useful tool, which also helps to identify outliers. Variance and standard deviation are further measures that show the spread of a data distribution.

Figure 2.15: Mean, median and mode of symmetric versus positively and negatively skewed data: (a) symmetric data, (b) positively skewed data and (c) negatively skewed data.

Let x1, x2, . . . , xN be a set of observations for a numeric attribute X. The range is the difference between the largest (max()) and smallest (min()) values.

$$\mathrm{range} = \max() - \min() \tag{2.6}$$

Quantiles are points taken at regular intervals of a data distribution. They divide the distribution into consecutive sets of essentially equal size.

The quartiles indicate a distribution’s center, spread and shape. The first quartile, denoted by Q1, is the 25th percentile. It cuts off the lowest 25% of the data. The third quartile, denoted by Q3, is the 75th percentile. It cuts off the lowest 75% (or the highest 25%) of the data. The second quartile, Q2, is the 50th percentile. It is also the median, which gives the center of the data distribution [7, 33–39].

The distance between the first and third quartiles is a simple measure of spread that gives the range covered by the middle half of the data. This distance is called the interquartile range (IQR) and is defined as follows:

$$\mathrm{IQR} = Q_3 - Q_1 \tag{2.7}$$

Boxplot is a common technique used for the visualization of a distribution (see Figure 2.16). The ends of the box are at the quartiles. The box length is the interquartile range. A line within the box marks the median. Two lines are called whiskers, and these are outside the box extending to the minimum and maximum observations.

Figure 2.16: A plot of the data distribution for some attribute X. The quantiles that are plotted are quartiles. The three quartiles split the distribution into four consecutive subsets (each having an equal size). The second quartile corresponds to the median.

Example 2.8: Let us calculate the mean, median, range and min() and max() values of the numerical array X = [1, 2, −3, 4, 5, −6, 2, −3, 5]. Let us also draw its boxplot.

The mean calculation of the numerical array X is based on eq. (2.1):

$$\bar{X} = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{1 + 2 + (-3) + 4 + 5 + (-6) + 2 + (-3) + 5}{9} = \frac{7}{9} \approx 0.78$$

Let us rank the numerical array X from small to large values:

$$X = [-6, -3, -3, 1, 2, 2, 4, 5, 5]$$

Median calculation is based on eq. (2.4(b)), since N = 9 is odd, and is applied to the ranked array:

$$x_{\mathrm{median}} = x_{\left(\frac{N+1}{2}\right)} = x_{\left(\frac{9+1}{2}\right)} = x_{(5)} = 2$$

Range calculation is based on eq. (2.6):

range = max() − min() = 5 − (−6) = 11

max() value: 5

min() value: −6
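The quantities of Example 2.8 can be checked with Python's standard `statistics` module. The sketch assumes that the array contains the negative entries −3, −6 and −3, consistent with the stated minimum of −6. It also computes the interquartile range of eq. (2.7); quartile conventions differ between textbooks and libraries, so the IQR shown here is specific to the "exclusive" method of `statistics.quantiles` and is not a value given in the text.

```python
import statistics

# Array of Example 2.8, assuming the negative entries implied by min() = -6.
x = [1, 2, -3, 4, 5, -6, 2, -3, 5]

mean = statistics.mean(x)
median = statistics.median(x)
value_range = max(x) - min(x)

# Quartiles under one specific convention (the "exclusive" method).
q1, q2, q3 = statistics.quantiles(x, n=4, method="exclusive")
iqr = q3 - q1

print("mean  :", round(mean, 2))
print("median:", median)
print("range :", value_range)
print("Q1, Q3, IQR:", q1, q3, iqr)
```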

Because a boxplot is a plot of the distribution, a simple figure (see Figure 2.17) is given to show the boxplot of this distribution.

Boxplots can be used to compare several sets of compatible data. They can be computed in O(n log n) time.

Figure 2.17: Boxplot of X vector.

The ends of the box are usually at the quartiles, so that the box length is the interquartile range.

A line marks the median within the box.

Two lines are called whiskers. These lines are outside the box extending to the smallest (minimum) and largest (maximum) observations.

Example 2.9 Boxplot. Figure 2.18 presents the boxplots for the EDSS score and the lesion counts that belong to the MR images of the patients within a given time interval, as given in Table 2.10. For EDSS, the initial value is 0 and the third quartile is 5.

The figure presents the boxplot for the same data with the maximum whisker length specified as 5.0 times the interquartile range. Data points beyond the whiskers are represented using +.

2.5.3 Measures of data dispersion

These measures indicate how spread out a data distribution is.

Let x1, x2, . . . , xN be the values of a numeric attribute X. The variance of the N observations is given by the following equation:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})^2 = \left(\frac{1}{N}\sum_{i=1}^{N} x_i^2\right) - \bar{x}^2 \tag{2.8}$$

Here, $\bar{x}$ is the mean value of the observations. The standard deviation σ of the observations is the square root of the variance σ². A low standard deviation indicates that the data observations are very close to the mean, and a high standard deviation shows that the data are spread out over a large range of values [40–42].

Figure 2.18: Boxplot depiction of EDSS and MRI lesion diameters of MS dataset based on a certain time interval.

Example 2.10 By using eq. (2.1), we found in Example 2.4 that the mean is $\bar{x} = 4.45$. In order to determine the variance and standard deviation of the data from that example, we set N = 10 and use eq. (2.8).

$$\sigma^2 = \frac{1}{10}\left(2.5^2 + 3.5^2 + 3.5^2 + 4^2 + 4^2 + 4.5^2 + 5^2 + 5^2 + 6^2 + 6.5^2\right) - 4.45^2 = 21.125 - 19.8025 \approx 1.32$$
$$\sigma = \sqrt{1.3225} = 1.15$$
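Both forms of eq. (2.8) can be evaluated directly as a check on the arithmetic. The sketch below uses the ten values of Example 2.4; the function names are illustrative, and the standard library's `statistics.pvariance` computes the same population variance.

```python
import math


def population_variance(values):
    """Variance from the definition: mean squared deviation from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def population_variance_shortcut(values):
    """Equivalent form: mean of the squares minus the square of the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum(x * x for x in values) / n - mean ** 2


data = [2.5, 3.5, 3.5, 4, 4, 4.5, 5, 5, 6, 6.5]  # values of Example 2.4
variance = population_variance(data)
print(round(variance, 4), round(math.sqrt(variance), 4))
```

The two forms agree up to floating-point rounding; the definition form is usually the numerically safer of the two when the mean is large relative to the spread.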

The basic properties of the standard deviation are as follows:

It measures how the values spread around the mean.

σ = 0 when there is no dispersion, which means that all observations share the same value.

For this reason, the standard deviation is considered to be a sound indicator of the dataset dispersion [43].

2.5.4 Graphic displays

The graphic displays of basic statistical descriptions include histograms, scatter plots, quantile plots and quantile–quantile plots.

Histograms are extensively used in all areas where statistics is used. A histogram displays the distribution of a given attribute X. If X is nominal, a vertical bar is drawn for each known value of X, and the height of the bar shows the frequency (how many times it occurs) of that value. In this case, the graph is also known as a bar chart. Histograms are the preferred option when X is numeric. The range of X is then partitioned into disjoint consecutive subranges, called buckets or bins. The width of each bar corresponds to the range of its bucket. The buckets have equal width.
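Equal-width bucketing for a numeric attribute can be sketched as follows. The function name `equal_width_histogram`, the choice of four buckets and the rule that the maximum value falls into the last bucket are assumptions made for illustration.

```python
def equal_width_histogram(values, num_bins):
    """Return (bucket edges, counts) for equal-width buckets over [min, max]."""
    if num_bins < 1:
        raise ValueError("num_bins must be at least 1")
    low, high = min(values), max(values)
    width = (high - low) / num_bins
    counts = [0] * num_bins
    for v in values:
        # The maximum value is placed in the last bucket rather than a new one.
        index = 0 if width == 0 else min(int((v - low) / width), num_bins - 1)
        counts[index] += 1
    edges = [low + k * width for k in range(num_bins + 1)]
    return edges, counts


data = [2.5, 3.5, 3.5, 4, 4, 4.5, 5, 5, 6, 6.5]  # values of Example 2.4
edges, counts = equal_width_histogram(data, 4)
for left, right, count in zip(edges, edges[1:], counts):
    print(f"[{left}, {right}): {'#' * count}")
```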

The scatter plot is another effective graphical method to determine whether there is a relationship or trend between two numeric attributes. To form a scatter plot, each pair of values is treated as a pair of coordinates and plotted as a point in the plane. Figure 2.20 illustrates a scatter plot for the set of data in Table 2.10.

The scatter plot is a suitable method to see clusters of points and outliers for bivariate data. It is also useful to visualize some correlation.

Figure 2.19: Histogram for Table 2.11.

Definition 2.5.4.1. Two attributes, X and Y, are said to be correlated (more information in Chapter 3) if the variability of one attribute has some influence on the variability of the other.

Correlations can be positive, negative or null; in the last case, the two attributes are said to be uncorrelated. Figure 2.21 provides some examples of positive and negative correlations. If the pattern of the plotted points runs from lower left to upper right, the values of X go up when the values of Y go up as well. This kind of pattern suggests a positive correlation (Figure 2.21 (a)). If the pattern of plotted points runs from upper left to lower right, the values of Y go down as the values of X increase. Such a negative correlation is seen in Figure 2.21 (b). Figure 2.22 indicates some cases in which no correlation between the two attributes exists in the given datasets [42, 43].

Quantile plot: A quantile plot is an effective and straightforward way to examine a univariate data distribution. It displays all the data for the given attribute and also plots quantile information. Let xi, for i = 1 to N, be the data arranged in increasing order, so that x1 is the smallest observation and xN is the largest for the numeric attribute X. Each observation, xi, is paired with a percentage, fi, which shows that on average fi × 100% of the data are lower than the value xi. The phrase “on average” is used here since a value with exactly a fraction, fi, of the data below xi may not be found. As explained earlier, the 0.25 quantile corresponds to quartile Q1, the 0.50 quantile is the median and the 0.75 quantile is Q3.

Figure 2.20: Scatter plot for Table 2.10: time to import–time to export for Italy.
Figure 2.21: Scatter plot can show (a) positive or (b) negative correlations between attributes.
Figure 2.22: Three cases where there is no observed correlation between the two plotted attributes in each of the datasets.

Let

$$f_i = \frac{i - 0.5}{N} \tag{2.9}$$

These numbers go up in equal steps of 1/N, ranging from 1/(2N) (slightly above 0) to 1 − 1/(2N) (slightly below 1). On a quantile plot, xi is plotted against fi. This enables us to compare different distributions based on their quantiles. For instance, given the quantile plots of sales data for two different time periods, we can compare their Q1, median, Q3 and other fi values [44].
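The points of a quantile plot follow directly from eq. (2.9). The sketch below pairs each ordered observation with its fi value; the function name `quantile_points` is an illustrative choice, and the data are the ten values of Example 2.4 rather than the sales data mentioned in the text.

```python
def quantile_points(values):
    """Pair each ordered observation x_i with f_i = (i - 0.5) / N."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


data = [2.5, 3.5, 3.5, 4, 4, 4.5, 5, 5, 6, 6.5]  # values of Example 2.4
for f, x in quantile_points(data):
    print(f"f = {f:.2f}  x = {x}")
```

Plotting x against f with any charting tool then gives the quantile plot.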

Example 2.11 Quantile plot. Figure 2.23 shows a quantile plot for the time to import (days) and time to exports (days) data for Table 2.8.

A quantile–quantile plot, also known as a q–q plot, plots the quantiles of one univariate distribution against the corresponding quantiles of another. It is a powerful visualization tool because it enables the user to see whether there is a shift in going from one distribution to the other.

Let us assume that we have two sets of observations for the attribute or variable of interest (income and deduction) that are obtained from two different branch locations. x1, . . . , xN is the data from the first branch, and y1, . . . , yM is the data from the second one. Each dataset is arranged in increasing order. If M = N, the number of points in each set is the same. In this case, yi is plotted against xi, where yi and xi are both (i − 0.5)/N quantiles of their respective datasets. If M < N (for example, the second branch has fewer observations than the first), there can be only M points on the q–q plot. In this case, yi is the (i − 0.5)/M quantile of the y data, which is plotted against the (i − 0.5)/M quantile of the x data; computing the latter typically involves interpolation [45].
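The pairing of quantiles for a q–q plot can be sketched as follows, assuming linear interpolation between neighboring ordered observations when M < N. The function names and the small example arrays are hypothetical; other interpolation schemes are equally possible.

```python
def interpolated_quantile(ordered, f):
    """Value at fraction f of an ordered sample, where x_i sits at f_i = (i - 0.5) / N."""
    n = len(ordered)
    position = f * n - 0.5                       # fractional 0-based index
    position = min(max(position, 0.0), n - 1.0)  # clamp to the valid index range
    lower = int(position)
    upper = min(lower + 1, n - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def qq_pairs(x_values, y_values):
    """(x quantile, y quantile) pairs for a q-q plot; expects len(y) <= len(x)."""
    xs, ys = sorted(x_values), sorted(y_values)
    if len(ys) > len(xs):
        raise ValueError("this sketch expects M <= N; swap the arguments otherwise")
    m = len(ys)
    return [(interpolated_quantile(xs, (i - 0.5) / m), y)
            for i, y in enumerate(ys, start=1)]


# Hypothetical data: four observations from one branch, two from the other.
print(qq_pairs([1, 2, 3, 4], [10, 20]))
```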

Example 2.12 Table 2.15 is based on the economy dataset. The histogram, scatter plot and q–q plot are derived from the time to import (days) and time to exports (days) attributes that belong to the Italy dataset (economy data).

Histograms are extensively used. Yet, they might not prove to be as effective as the quantile plot, q–q plot and boxplot methods when comparing groups of univariate observations.

In Table 2.24, Italy dataset (economy data) sets an example of time to import (days) and time to exports (days), normalized with max–min normalization.

Based on the data derived from Italy economy, q–q plot was drawn for the time to import (days) and time to export (days) values as shown in Figure 2.23 for Table 2.24.

Table 2.24: Italy dataset (economy data) presents an example of time to import (days) and time to exports (days).

| Time to import (days) | Time to exports (days) |
|---|---|
| 0 | 0 |
| 0 | 12,602,581,252 |
| 0 | 13,677,328,059 |
| 10.82 | 15,404,876,386 |
| 10.59 | 17,741,843,854 |
| 12.7 | 20,268,351,005 |
| 14.31 | 23,114,028,519 |
| 15.28 | 26,378,552,578 |
| 12.9116 | 30,442,035,460 |
| 11.7475 | 35,450,634,467 |
| 10.9975 | 42,550,883,916 |
| 8.885 | 51,776,353,504 |
| . . . | . . . |
| . . . | . . . |
| . . . | . . . |

Figure 2.23: A q–q plot for time to import (days)–time to export (days) data from Table 2.24.

Figure 2.23 shows the q–q plot of the time to import (days) and time to export (days) attributes listed in Table 2.8. Based on Figure 2.23, when x = 0 for the time to import (days) attribute, y = 0 over the period 1960–2015. When x = 0 and y = 0.5, an increase is observed in time to export (days). While the time to import (days) attribute lies within this interval, the time to export (days) value remains constant at 0.

2.6 Data matrix versus dissimilarity matrix

When the objects in question are one-dimensional, they can be described by a single attribute. The previous section covered objects described by a single attribute. In this section, we cover multiple attributes, which requires a change in notation. Suppose we have n objects (e.g., MR images) described by p attributes, also referred to as measurements or features (such as age, income or marital status). The objects are x1 = (x11, x12, . . . , x1p), x2 = (x21, x22, . . . , x2p), and so on. Here, xij is the value of the jth attribute for object xi. For simpler notation, object xi can be referred to as object i. The objects may be records in a relational dataset, and are also referred to as feature vectors or data samples [46, 47].

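A data matrix is an object-by-attribute structure that stores n data objects either in a relational table format or as an n × p matrix (n objects × p attributes). An example of a data matrix is as follows:

$$\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix} \tag{2.10}$$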

In this matrix, each row relates to an object. f is used as the index through p attributes.

A dissimilarity matrix is an object-by-object structure that stores the proximities available for all pairs of n objects, often represented by an n × n table as shown:

$$\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots & \ddots &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix} \tag{2.11}$$

where d(i, j) refers to the dissimilarity between objects i and j. In general, d(i, j) is a nonnegative number that is close to 0 when objects i and j are similar to each other, and becomes larger the more they differ. d(i, i) = 0 shows that the difference between an object and itself is 0. In addition, d(i, j) = d(j, i): the matrix is symmetric, and for convenience the d(j, i) entries are not shown [46].

Measures of similarity are usually described as a function of measures of dissimilarity.

For nominal data, for instance,

$$\mathrm{sim}(i, j) = 1 - d(i, j) \tag{2.12}$$

Here, sim(i, j) denotes the similarity between objects i and j. A data matrix consists of two kinds of entities: rows for objects and columns for attributes. This is why a data matrix is often called a two-mode matrix. The dissimilarity matrix contains only one kind of entity, dissimilarities, and is therefore called a one-mode matrix. Most clustering and nearest neighbor algorithms operate on a dissimilarity matrix. Before such algorithms are applied, data in the form of a data matrix can be converted into a dissimilarity matrix [47].
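Converting a data matrix into a dissimilarity matrix can be sketched as follows. The text does not fix a dissimilarity measure at this point, so the sketch assumes Euclidean distance between numeric rows; the function name `dissimilarity_matrix` and the three example objects are hypothetical.

```python
import math


def dissimilarity_matrix(data_matrix):
    """Build the n x n matrix of pairwise Euclidean distances between rows."""
    n = len(data_matrix)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):  # lower triangle only; the matrix is symmetric
            distance = math.sqrt(sum((a - b) ** 2
                                     for a, b in zip(data_matrix[i], data_matrix[j])))
            d[i][j] = d[j][i] = distance
    return d


# Three hypothetical objects described by two numeric attributes.
objects = [[0, 0], [3, 4], [6, 8]]
for row in dissimilarity_matrix(objects):
    print(row)
```

Note that eq. (2.12) presupposes dissimilarities scaled to the interval [0, 1]; the raw Euclidean distances above would need to be normalized before being turned into similarities that way.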

Example 2.13 These matrices have eigenvalues 0, 0, 0, 0. They have two eigenvectors, one from each block. However, their block sizes do not match and they are not similar.

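$$U = \begin{bmatrix}
0 & 1.3 & 0 & 0 \\
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1.3 \\
0 & 0 & 0 & 0
\end{bmatrix}, \quad
I = \begin{bmatrix}
0 & 1.3 & 0 & 0 \\
0 & 0 & 1.3 & 0 \\
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0
\end{bmatrix}$$

For any matrix X, if UX = XI, then X is not invertible and so U is not similar to I.

Let X = [x_ij]. Then

$$UX = \begin{bmatrix}
1.3x_{21} & 1.3x_{22} & 1.3x_{23} & 1.3x_{24} \\
0 & 0 & 0 & 0 \\
1.3x_{41} & 1.3x_{42} & 1.3x_{43} & 1.3x_{44} \\
0 & 0 & 0 & 0
\end{bmatrix}
\quad \text{and} \quad
XI = \begin{bmatrix}
0 & 1.3x_{11} & 1.3x_{12} & 0 \\
0 & 1.3x_{21} & 1.3x_{22} & 0 \\
0 & 1.3x_{31} & 1.3x_{32} & 0 \\
0 & 1.3x_{41} & 1.3x_{42} & 0
\end{bmatrix}$$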