2 Dataset

This chapter covers the following items:

Defining dataset, big data and data science

Recognizing different types of data and attributes

Working through examples

In this chapter, we will discuss the main properties of datasets as collections of data. In a general, abstract sense, by the word “data” we mean a large homogeneous collection of quantitative or qualitative objects that can be recorded and stored in a digital file, thus converted into a large collection of bits. Examples of data include time series, numerical sequences, images, DNA sequences, large sets of primes, big datasets and so on. Time series and signals are the most popular examples of data, and they can be found in all kinds of human and natural activities, such as neural activity, stock exchange rates, temperature changes over a year, blood pressure during the day and so on. Such data can be processed with the aim of discovering correlations, patterns, trends and whatever else might be useful, for example, to predict future values of the data. When data are impossible or very difficult to process, they are more appropriately called big data. In the following section, we will discuss how to classify data, giving the main attributes and providing the basic elements of descriptive statistics for analyzing data.

The main task with data is to discover information from them. It is tempting to jump directly into data mining. Before taking such a leap, however, we first have to get the data ready for processing.

We can roughly distinguish two fundamental steps in data manipulation as in Figure 2.1:

  1. Data preprocessing, which consists of one or several actions such as cleaning, integration, reduction and transformation, carried out to prepare the data for the next step (a small illustration follows this list).
  2. Data processing, also known as data analysis or data mining, where the data are analyzed by statistical or numerical algorithms in order to single out features, patterns, singular values and so on.
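As a concrete, if simplified, illustration of the two steps, the following Python sketch runs a few typical preprocessing actions and then a first pass of analysis. It assumes a hypothetical file measurements.csv with a numeric value column; the file name and column are placeholders, not part of the datasets used in this book.

    import pandas as pd

    # Step 1: data preprocessing (cleaning, reduction, transformation).
    df = pd.read_csv("measurements.csv")                    # hypothetical input file
    df = df.drop_duplicates()                               # cleaning: remove duplicate records
    df["value"] = df["value"].fillna(df["value"].median())  # cleaning: fill missing values
    df["value_z"] = (df["value"] - df["value"].mean()) / df["value"].std()  # transformation: z-score

    # Step 2: data processing (data analysis): a first statistical summary.
    print(df["value"].describe())                           # count, mean, std, quartiles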

Each step involves further models and algorithms. In this book, we limit ourselves to some general definitions on preprocessing and, to handle and analyze data at the data analysis step, to methods from statistical modeling (linear and multilinear models), classification methods (decision trees (ID3, C4.5 and Classification and Regression Trees [CART]), Naïve Bayes, support vector machines (linear, nonlinear), k-nearest neighbor and artificial neural networks (feedforward backpropagation, learning vector quantization)), as well as fractal and multifractal methods combined with ANN (LVQ).

Figure 2.1: Data analysis process.

This process involves taking a deeper look at the attributes and the data values. Data in the real world tend to be noisy or enormous in volume (often several gigabytes or more). In some cases, data may also stem from a mass of heterogeneous sources.

This chapter deals with the main algorithms used to organize data. Knowledge regarding your data is beneficial for data preprocessing (see Chapter 3), a preliminary task that is a fundamental step for the data mining process that follows. You may need to know the following for the process:

What are the sorts of attributes that constitute the data?

What types of values does each attribute have?

Which attributes are continuous valued and which are discrete?

How do the data appear?

How are the values distributed?

Are there ways to visualize the data in order to get a better overall sense of them?

Is it possible to detect any outliers?

Can we measure the similarity of some data objects with regard to others?

Gaining such insight into the data will help us with further analysis. On this basis, we can learn things about our data that will prove beneficial for data preprocessing. We begin in Section 2.1 with this premise: if our data are of enormous volume, we speak of big data. We also have to study the various types of attributes, including nominal, binary, ordinal and numeric attributes. Basic statistical descriptions, described in Section 2.5, can be used to learn more about the values of attributes. For a temperature attribute, for instance, we can determine its mean (average value), median (middle value) and mode (most common value), which are measures of central tendency. This gives us an idea of the “middle” or center of the distribution.
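As a small illustration, the following Python sketch computes these three measures for a made-up list of daily temperature readings:

    import statistics

    temperatures = [18, 21, 21, 23, 24, 26, 21]  # made-up values, in degrees Celsius

    print(statistics.mean(temperatures))    # mean (average value): 22
    print(statistics.median(temperatures))  # median (middle value): 21
    print(statistics.mode(temperatures))    # mode (most common value): 21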

Knowing these basic statistics for each attribute facilitates filling in missing values, spotting outliers and smoothing noisy values during data preprocessing. Knowledge of attributes and attribute values can also assist in settling inconsistencies in data integration. By plotting measures of central tendency, we can see whether the data are skewed or symmetric. Quantile plots, histograms and scatter plots are other graphic displays of basic statistical descriptions. All of these displays prove useful in data preprocessing.

Data visualization offers many additional techniques for viewing data through graphical means. These can help us identify relations, trends and biases “hidden” in unstructured datasets. Data visualization techniques are described in Section 2.5.

We may also need to examine how similar or dissimilar data objects are. Suppose that we have a database in which the data objects are patients classified into different subgroups. Such information can allow us to find clusters of like patients within the dataset. Similarly, economy data involve the classification of countries according to different attributes.

In summary, by the end of this chapter, you will have a better understanding as to the different attribute types and basic statistical measures, which help you with the central tendency and dispersion of attribute data. In addition, the techniques to visualize attribute distributions and related computations will be clearer through explanations and examples.

2.1 Big data and data science

Definition 2.1.1. Big data is a broad term for any collection of large datasets, usually billions of values, so large or complex that it becomes a challenge to process them by employing conventional data management techniques, by which we mean relational database management systems (RDBMS). An RDBMS is essentially a collection of tables, each of which is assigned a unique name. Consisting of a set of attributes (columns or fields), each table normally stores a large set of tuples (records or rows). Each tuple in a relational table represents an object that is characterized by a unique key and described by a set of attribute values. When mining relational databases, one can proceed by searching for trends or data patterns [1]. RDBMS, an extensively adopted technique, has long been considered a one-size-fits-all solution for data management.

The data are stored in relations (tables); a relation in a relational database is similar to a table in which the columns represent attributes and the data are stored in rows (Table 2.1). ID is a unique key, which is underlined in Table 2.1.

Table 2.1: RDBMS table example regarding New York Stock Exchange (NYSE) fundamentals data.

Data science involves employing methods to analyze massive amounts of data and extract the knowledge they encompass. We can imagine that the relationship between big data and data science is similar to that of raw material and a manufacturing plant. Data science and big data originated from the field of statistics and traditional data management, but for the time being they are seen as distinct disciplines.

The characteristics of big data are often referred to as the three Vs as follows [2]:

Volume – This refers to how much data there are.

Variety – This refers to how diverse different types of data are.

Velocity – This refers to how fast new data are generated.

These characteristics are often supplemented with a fourth V:

Veracity – This refers to how accurate the data are.

These four properties make big data different from the data found in conventional data management tools. As a result, the challenges brought by big data are seen in almost every aspect, from data capture to curation, and from storage to search, as well as other aspects including but not limited to sharing, transfer and visualization. Moreover, big data requires specialized techniques for the extraction of insights [3, 4].

Following a brief definition of big data and what it requires, let us now have a look at the benefits and uses of big data and data science.

Data science and big data are terms used extensively in both commercial and noncommercial settings, and there are numerous use cases for them. The examples we offer in this book will elucidate such possibilities. Commercial companies operating in different areas use data science and big data to learn more about their staff, customers, processes, goods and services. Many companies use data science to offer a better user experience to their customers, and they also use big data for customization.

Let us have a look at an example relating to the aforementioned big data: the NYSE dataset (https://www.nyse.com/data/transactions-statistics-data-library) [5]. This big dataset is a basis for fundamental and technical analysis. The dataset at this link includes the fundamentals.csv file, which contains metrics extracted from annual SEC 10-K filings (from 2012 to 2016). These should suffice to derive the most popular fundamental indicators: which company has the biggest chance of going out of business, which company is undervalued, what the return on investment is and so on, as can be seen in Table 2.2.

Let us have a look at the 4V stage for the NYSE data: investments of companies are continuously tracked by the NYSE, which enables related parties to identify the best deals using fundamental indicators of going out of business. The fundamentals involve getting to grips with the following properties:

Volume: Thousands of companies are involved.

Velocity: Company information and prices are updated continuously.

Variety: Investments: futures, forwards, swaps and so on.

Veracity: Going out of business (refers to the biases, noise and abnormality in data).

In the NYSE data, there is big data with a dimension of 1,781 × 79 (1,781 companies, 79 attributes). The 1,781 firms are evaluated as to their status of bankruptcy (going out of business) based on 79 attributes (ticker symbol, period ending, accounts payable, add income/expense items, after-tax ROE (return on equity), capital expenditures, capital surplus, cash ratio and so on). Now, let us analyze the example in the Excel table (Table 2.2) regarding the NYSE case. The data have been selected from the NYSE (https://www.nyse.com/data/transactions-statistics-data-library).

If we are to work with big data as in Table 2.2, the first thing we have to do is to understand the dataset. In order to understand the dataset, it is important that we show the relationship between the attributes (ticker symbol, period ending, accounts payable, accounts receivable, etc.) and records (0th row, 1st row, 2nd row, …, 1,781st row) in the dataset.
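Such a first inspection can be scripted. The following Python sketch assumes the fundamentals.csv file mentioned above has been downloaded locally; the exact column names depend on the file version, so they are inspected rather than hard-coded:

    import pandas as pd

    # Load the NYSE fundamentals file (assumed to be saved locally).
    df = pd.read_csv("fundamentals.csv")

    print(df.shape)        # expected: (1781, 79) - records x attributes
    print(df.columns[:5])  # the first few attribute names
    print(df.head(3))      # the 0th, 1st and 2nd records

    # A histogram of a single attribute could then be drawn, for example with
    # df[df.columns[2]].hist() (column chosen by position, not by name).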

Table 2.2: NYSE fundamentals data as depicted in RDBMS table (sample data breakdown).

Histograms are one way to depict such a relationship. In Figure 2.2, the histogram provides an analysis of the attributes that determine the bankruptcy statuses of the firms in the NYSE fundamentals big data. The analyses of selected attributes are provided separately in individual histogram graphs (Figure 2.3).

Figure 2.2: NYSE fundamentals data (1,781 × 79) histogram (mesh) graph 10K.

In Figure 2.2, the dataset that includes the firm information (for the firms on the verge of bankruptcy) has been obtained through histogram (mesh) graph with a dimension of 1,781 × 79.

The histogram graphs pertaining to four attributes (accounts payable, accounts receivable, capital surplus and after-tax ROE), chosen from the 79 attributes that determine the bankruptcy status of a total of 1,781 firms, are provided in Figure 2.3.

2.2 Data objects, attributes and types of attributes

Definition 2.2.1. Data are defined as a collection of facts. These facts can be words, figures, observations, measurements or even merely descriptions of things.

In some instances, it is most appropriate to express data values in purely numerical or quantitative terms, for example, in currencies such as pounds or dollars, or in measurement units such as inches or percentages (measurements whose values are essentially numerical).

Figure 2.3: NYSE fundamentals attributes histogram graph: (a) accounts payable companies histogram; (b) accounts receivable companies histogram; (c) capital surplus companies histogram and (d) after tax ROE histogram.

In other cases, an observation may signify only the category that an item belongs to. Categorical data are referred to as qualitative data (data whose measurement scale is essentially categorical) [6]. For instance, a study might address the class standing (freshman, sophomore, junior, senior or graduate) of college students. A study may also ask students to assess the quality of their education as very poor, poor, fair, good or very good.

Note that even if students are asked to record a number (1–5) to indicate the quality level, where each number corresponds to a category, the data will still be regarded as qualitative because the numbers are merely codes for the relevant (qualitative) categories.

A dataset is a numerable, but not necessarily ordered, collection of data objects. A data object represents an entity; for instance, in the USA, New Zealand, Italy, Sweden (U.N.I.S.) economy dataset, entities include GDP growth, tax payments and deposit interest rate. In medicine, all kinds of records about human activity can be represented by datasets of suitable data objects; for instance, the heart rate measured by an electrocardiograph has a wave shape with isolated peaks, and each R–R interval can be considered a data object. Another example of a dataset in medicine is the multiple sclerosis (MS) dataset, where the objects may include subgroups of MS patients and a subgroup of healthy individuals, magnetic resonance images (MRI), expanded disability status scale (EDSS) scores and other relevant medical findings of the patients. As another example, a clinical psychology dataset may consist of the subjects’ test results and demographic features such as age, marital status and education level. Data objects, which may also be referred to as samples, examples, objects or data points, are characteristically defined by attributes.

Definition 2.2.2. Attributes are data fields that denote features of a data object. In the literature, other terms are used for attributes depending on the relevant field of study, such as feature, variable or dimension. While statisticians prefer the term variable, professionals in data mining and processing use the term attribute.

Definition 2.2.3. Data mining is a branch of computer science and of data analysis, aiming at discovering patterns in datasets. The investigation methods of data mining usually are based on statistics, artificial intelligence and machine learning.

Definition 2.2.4. Data processing, also called information processing, is a way to discover meaningful information from data.

Definition 2.2.5. A collection of data depending on a given parameter is called a distribution. If the distribution involves only one attribute, it is called univariate. A bivariate distribution, on the other hand, involves two attributes. The set of possible values determines the type of an attribute: nominal, numeric, binary, ordinal, discrete or continuous [7, 8].

Let us see some examples of univariate and bivariate distributions. Figure 2.4(a) shows the graph of the univariate distribution for the net domestic credit attribute, one of the attributes in the Italy economy dataset. Figure 2.4(b) shows, for the same dataset, the bivariate distribution graph of this attribute together with the deposit interest rate attribute by year (for the period between 1960 and 2015).

Figure 2.4: (a) Univariate and (b) bivariate distribution graph for Italy Economy Dataset.
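Graphs like those in Figure 2.4 are straightforward to produce. The following sketch uses matplotlib with made-up yearly values standing in for the World Bank series:

    import matplotlib.pyplot as plt

    years = [1960, 1970, 1980, 1990, 2000, 2010]    # sample years
    credit = [2.1, 3.4, 5.9, 8.2, 9.7, 11.3]        # stand-in: net domestic credit
    deposit_rate = [3.0, 5.5, 12.0, 9.0, 3.5, 1.0]  # stand-in: deposit interest rate

    # (a) Univariate: a single attribute over the years.
    plt.plot(years, credit, marker="o")
    plt.xlabel("Year")
    plt.ylabel("Net domestic credit")
    plt.show()

    # (b) Bivariate: two attributes plotted over the same years.
    plt.plot(years, credit, label="Net domestic credit")
    plt.plot(years, deposit_rate, label="Deposit interest rate")
    plt.xlabel("Year")
    plt.legend()
    plt.show()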

The following sections provide further explanations and basic examples for these attributes.

2.2.1 Nominal attributes and numeric attributes

Nominal refers to names. A nominal attribute, therefore, can be names of things or symbols. Each value in a nominal attribute represents a code, state or category. For instance, the healthy and patient categories in medical research are examples of nominal attributes. Education level categories such as graduate of primary school, high school or university are other examples. In studies regarding MS, subgroups such as relapsing remitting multiple sclerosis (RRMS), secondary progressive multiple sclerosis (SPMS) or primary progressive multiple sclerosis (PPMS) can be listed among nominal attributes. It is important to note that nominal attributes are to be handled distinctly from numeric attributes: they are incommensurable but interrelated, and mathematical operations on a nominal attribute may not be meaningful. In a medical dataset, for example, the schools from which the individuals administered the Wechsler Adult Intelligence Scale – Revised (WAIS-R) test graduated (primary school, secondary school, vocational school, bachelor’s degree, master’s degree, PhD) are examples of a nominal attribute. In an economy dataset, for example, attributes such as consumer prices, tax revenue, inflation, interest payments on external debt, deposit interest rate and net domestic credit are further examples.

Definition 2.2.1.1. Numeric attribute is a measurable quantity; thus, it is quantitative, represented in real or integer values [7, 9, 10]. Numeric attributes can either be interval- or ratio-scaled.

Interval-scaled attribute is measured on a scale that has equal-size units, such as

Celsius temperature (C): (−40°C, −30°C, −20°C, 0°C, 10°C, 20°C, 30°C, 40°C, 50°C, 60°C, 70°C, 80°C).

We can see that the distance from 30°C to 40°C is the same as the distance from 70°C to 80°C.

The same set of values converted into Fahrenheit temperature scale is given in the following list:

Fahrenheit temperature (F): (−40°F, −22°F, −4°F, 32°F, 50°F, 68°F, 86°F, 104°F, 122°F, 140°F, 158°F, 176°F).

Thus, the distance from 30°C to 40°C [86°F to 104°F] is the same as the distance from 60°C to 70°C [140°F to 158°F].
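Because the two scales are related by a linear map, equal intervals on one scale remain equal on the other. A short Python check of the distances quoted above:

    # Interval scale: equal-size Celsius intervals map to equal Fahrenheit intervals.
    def c_to_f(c):
        return c * 9 / 5 + 32

    print(c_to_f(40) - c_to_f(30))  # 104 - 86  = 18
    print(c_to_f(70) - c_to_f(60))  # 158 - 140 = 18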

A ratio-scaled attribute has an inherent zero point. Consider the example of age: the clock starts ticking forward once you are born, and an age of “0” technically means you do not yet exist. If a given measurement is ratio-scaled, a value can be considered a multiple (or ratio) of another value. Moreover, the values are ordered, so the differences between values, as well as the mean, median and mode, can be calculated.

Definition 2.2.1.2. Binary attribute is a nominal attribute that has only two states or categories. Here, 0 typically shows that the attribute is absent and 1 means that the attribute is present. States such as true and false fall into this category. In medical cases, healthy and unhealthy states pertain to binary attribute type [11–13].

Definition 2.2.1.3. Ordinal attributes are those that yield possible values that enable meaningful order or ranking. A basic example is the one used commonly in clothes size: x-small, small, medium, large and x-large. In psychology, mild and severe refer to ranking a disorder [11–14].

Discrete attributes have either a finite set of values or an infinite but countable set of values, which may or may not be represented as integers [7, 9, 10]. For example, age, marital status and education level each have a finite number of values; therefore, they are discrete. Discrete attributes may have numeric values; for example, age may be between 0 and 120. If an attribute is not discrete, it is continuous. Numeric attribute and continuous attribute are often used interchangeably in the literature. However, continuous values are real numbers, while numeric values can be either real numbers or integers. The income level of a person or the time it takes a computer to complete a task can be continuous [15–20].

Definition 2.2.1.4. A data set (or dataset) is a collection of data. For i = 0, 1, …, K rows x: i0, i1, …, iK and j = 0, 1, …, L columns y: j0, j1, …, jL, D(K × L) represents a dataset. Row x: i is an entry (a line of the dataset) and column y: j represents an attribute.

Figure 2.5 shows the dataset D(3 × 4) as depicted with row (x) and column (y) numbers.

Figure 2.5: An example dataset D depicted with x rows and y columns.
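In code, such a dataset is naturally represented as a two-dimensional array. A minimal sketch with made-up values for a D(3 × 4) dataset, where rows are entries and columns are attributes:

    import numpy as np

    # D(3 x 4): 3 entries (rows, x) and 4 attributes (columns, y); values are made up.
    D = np.array([[1.0, 2.0, 3.0, 4.0],
                  [5.0, 6.0, 7.0, 8.0],
                  [9.0, 10.0, 11.0, 12.0]])

    print(D.shape)  # (3, 4)
    print(D[0, :])  # the first entry (row x: 0)
    print(D[:, 1])  # one attribute across all entries (column y: 1)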

2.3 Basic definitions of real and synthetic data

In this book, two main types of datasets are identified: real datasets and synthetic datasets.

2.3.1 Real dataset

A real dataset contains data on a specific topic collected for a particular purpose, together with supporting documentation, definitions, licensing statements and so on. A dataset is a collection of discrete items made up of related data, which may be accessed individually, in combination or managed as a whole. Most frequently, a real dataset corresponds to the contents of a single statistical data matrix in which each column represents a particular attribute and each row corresponds to a given member of the dataset. A dataset is arranged into some sort of data structure. In a database, for instance, a dataset might encompass a collection of business data (sales figures, salaries, names, contact information, etc.). The database can be considered a dataset per se, as can bodies of data within it pertaining to a particular kind of information, such as sales data for a particular corporate department.

In the applications and examples addressed in the following sections, a dataset from the field of economy is used as the real data. This economy dataset has been selected from the World Bank site (http://data.worldbank.org/country/) [21].

2.3.2 Synthetic dataset

The idea of fully synthetic data was originated by Rubin in 1993 [16]. Rubin originally designed it with the aim of synthesizing the Decennial Census long-form responses for the short-form households. A year later, in 1994, Fienberg introduced a critical refinement, in which he utilized a parametric posterior predictive distribution (rather than a Bayes bootstrap) to perform the sampling. Together, these ideas provided a solution for how to treat partially synthetic data with missing data [14–17].

Creating synthetic data is a process of data anonymization; that is, synthetic data are an anonymized derivative of real data. Synthetic data can be used in different areas as a filter for information that would otherwise compromise the confidentiality of particular aspects of the data. These particular aspects generally emerge in the form of individual patient information (i.e., individual ID, age, gender, etc.).

Using synthetic data has been recommended for data analysis applications for the following conveniences it offers:

it is quick and affordable to produce as much data as needed;

it can yield accurate labels (labeling that may be impossible or expensive to obtain by hand);

it is open to modification for the improvement of the model and training;

it can be used as a substitute for certain real data segments.

In the applications addressed in the following sections, datasets from medicine have been used to form synthetic data via the CART algorithm (see Section 6.5) in Matlab. There are two ways of generating a synthetic dataset from a real dataset: one is to develop an algorithm that generates the synthetic dataset from the real dataset; the other is to use available software (SynTReN, R-package, ToXene, etc.) [18–20].

The synthetic datasets in this book have been obtained by applying the CART algorithm (for more details see Section 6.5) to the real datasets.

For a real dataset: for i = 0, 1, …, R rows x: i0, i1, …, iR and j = 0, 1, …, L columns y: j0, j1, …, jL, D(R × L) represents the dataset. Row x: i is an entry and column y: j represents an attribute.

For a synthetic dataset: for i = 0, 1, …, K rows x: i0, i1, …, iK and j = 0, 1, …, L columns y: j0, j1, …, jL, D(K × L) represents the dataset. Row x: i is an entry and column y: j represents an attribute.

A brief description of the contents in synthetic dataset is shown in Figure 2.7.

The steps pertaining to the application of the CART algorithm (Figure 2.6) to the real dataset D(R × L) for obtaining the synthetic dataset D(K × L) are provided as follows.

Figure 2.6: Obtaining the synthetic dataset.

The steps of getting the synthetic dataset through the application of general CART algorithm are provided in Figure 2.7.

Steps (1–6) The data belonging to each attribute in the real dataset are ranked from lowest to highest (when attribute values are the same, only one value is included in the ranking). For each attribute in the real dataset, the average of the median and the value following the median is calculated; xmedian represents the median value of the attribute and xmedian+1 represents the value after the median (see eq. (2.4(a))).

The average of xmedian and xmedian+1 is calculated as (xmedian + xmedian+1)/2 (see eq. (2.1)).

Figure 2.7: Obtaining the synthetic dataset from the application of general CART algorithm on the real dataset.

Steps (7–10) The Gini value of each attribute is calculated for the formation of the decision tree from the real dataset (see Section 6.5). The attribute with the lowest calculated Gini value is the root of the decision tree. Synthetic data are formed based on the decision tree obtained from the real dataset and the rules derived from that tree.

Let us now explain how the synthetic dataset is generated from real dataset through a smaller scale of sample dataset.

Example 2.1 Dataset D(10 × 2) consists of two classes (gender as female or male) and two attributes (weight in kilograms and height in centimeters; see Table 2.3). The CART algorithm is applied to it and synthetic data are to be generated.

Now we can apply the CART algorithm to the sample dataset D(10 × 2) (see Table 2.3).

Steps (1–6) The data that belong to each attribute in the dataset D(10 × 2) are ranked from lowest to highest (when attribute values are the same, only one value is incorporated in the ranking; see Table 2.3). For each attribute in the dataset, the average of the median and the value following the median is computed; xmedian represents the median value of the attribute and xmedian+1 represents the value after the median (see eq. (2.4(a))).

Table 2.3: Dataset D(10 × 2).

Class     The attributes of the dataset D(10 × 2)
          Weight    Height
Female    70        170
Female    60        165
Female    55        160
Female    47        167
Female    65        170
Male      90        190
Male      82        180
Male      83        182
Male      85        186
Male      100       185

The average of xmedian and xmedian+1 is calculated as (xmedian + xmedian+1)/2 (see eq. (2.1)).

The threshold value calculation for each attribute (column; j: 1, 2) in the dataset D(10 × 2) in Table 2.3 is applied based on the following steps.

In the following, we discuss the steps of the threshold value calculation for the j: 1: {Weight} attribute.

In Table 2.3, the {Weight} values (lines i: 1, …, 10) in column j: 1 are ranked from lowest to highest (when attribute values are the same, only one value is incorporated in the ranking).

Weight_sort = {47, 55, 60, 65, 70, 82, 83, 85, 90, 100}
Weight_median = 70, Weight_median+1 = 82
mean = (Weight_median + Weight_median+1)/2 = (70 + 82)/2 = 76

Two groups are formed: “Weight ≥ 76” and “Weight < 76”.

The following are the steps for the calculation of threshold value for the j: 2: {Height} attribute.

In Table 2.3, the {Height} values (lines i: 1, …, 10) in column j: 2 are ranked from lowest to highest (when attribute values are the same, only one value is incorporated in the ranking).

Height_sort = {160, 165, 167, 170, 180, 182, 185, 186, 190}
Height_median = 180, Height_median+1 = 182
mean = (Height_median + Height_median+1)/2 = (180 + 182)/2 = 181

Two groups are formed: “Height ≥ 181” and “Height < 181”.
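These threshold computations are easy to script. A minimal Python sketch that reproduces the Weight and Height thresholds above from the values in Table 2.3:

    # Threshold = average of the median and the value following it, computed over
    # the deduplicated, sorted attribute values (equal values enter the ranking once).
    def threshold(values):
        s = sorted(set(values))
        m = (len(s) - 1) // 2  # position of the median in this ranking
        return (s[m] + s[m + 1]) / 2

    weights = [70, 60, 55, 47, 65, 90, 82, 83, 85, 100]
    heights = [170, 165, 160, 167, 170, 190, 180, 182, 186, 185]

    print(threshold(weights))  # 76.0
    print(threshold(heights))  # 181.0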

Steps (7–10) The Gini value of each attribute is calculated for the formation of the decision tree of the dataset D(10 × 2). For each attribute, the Gini value is calculated after the split into two groups, and the root of the tree is found (more details in Section 6.5).

In Table 2.4, you can see the results derived from the splitting of the binary groups for the attributes (j = 1, 2) that belong to the sample real data (female, male class).

Table 2.4: Results derived from the splitting of the binary groups for the attributes that belong to dataset (female, male class).

The results provided in Table 2.4 for each attribute in sample dataset D(10 × 2) will be used for the Gini calculation.

The calculation of Gini (Weight) for j: 1 (see Section 6.5, eq. (6.8)):

Gini_left(Weight) = 1 − [(0/5)² + (5/5)²] = 0
Gini_right(Weight) = 1 − [(5/5)² + (0/5)²] = 0
Gini(Weight) = (5(0) + 5(0))/10 = 0

The calculation of the Gini(Height) value for j: 2 (see Section 6.5, eq. (6.8)):

Gini_left(Height) = 1 − [(0/4)² + (4/4)²] = 0
Gini_right(Height) = 1 − [(5/6)² + (1/6)²] = 0.278

The lowest Gini value is Gini(Weight) = 0, and the attribute with the lowest Gini value is the root of the decision tree. The decision tree obtained from the dataset is shown in Figure 2.8.

Figure 2.8: The decision tree obtained from dataset D(10 × 2).

The decision rules obtained from the dataset in Figure 2.8 are given as Rules 1 and 2:

Rule 1: If Weight ≥76, class is male.

Rule 2: If Weight <76, class is female.
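The Gini computations above can be scripted as well. The following Python sketch, with the group memberships taken from the split counts quoted in the text, reproduces Gini(Weight) = 0 and Gini_right(Height) ≈ 0.278:

    # Gini impurity of one group of class labels (M = male, F = female).
    def gini(group):
        n = len(group)
        return 1.0 - sum((group.count(c) / n) ** 2 for c in set(group))

    # Split on Weight >= 76: one group of 5 females, one group of 5 males.
    left, right = ["F"] * 5, ["M"] * 5
    print((len(left) * gini(left) + len(right) * gini(right)) / 10)  # 0.0 -> root

    # Split on Height >= 181: 4 males on one side, 5 females + 1 male on the other.
    print(gini(["M"] * 4))          # 0.0
    print(gini(["F"] * 5 + ["M"]))  # ~0.278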

In Table 2.5, the synthetic dataset D(3 × 2) is obtained by applying Rule 1 (Weight ≥ 76) and Rule 2 (Weight < 76) (see Figure 2.8).

Table 2.5: Forming the sample synthetic dataset for Weight ≥ 76 and Weight < 76.

Weight    Height
76        180
70        162
110       185

In Table 2.6, synthetic dataset D(3 × 2) is obtained by applying Rules 1 and 2 (see Figure 2.8).

Table 2.6: Forming the sample synthetic dataset by Rule 1 (if Weight ≥ 76, the class is male) and Rule 2 (if Weight < 76, the class is female).

Weight    Height    Class
76        180       Male
70        162       Female
110       185       Male

The CART algorithm was applied to the real dataset, and from this application we obtained the synthetic dataset. The synthetic dataset is composed of two entries (rows) from the male class and one entry from the female class (it is possible to increase or reduce the number of entries in the synthetic dataset).

2.4 Real and synthetic data from the fields of economy and medicine

In this book, to apply and explain the models described in the following sections, we will refer to only three different types of datasets: economy data (USA, New Zealand, Italy, Sweden), medical data (the multiple sclerosis dataset) and clinical psychology data (the WAIS-R dataset). Sample datasets from these diverse fields are discussed throughout the book in the examples and applications.

2.4.1 Economy data: Economy (U.N.I.S.) dataset

As the first real dataset, we will use in the following sections data related to the economies of the USA (abbreviated by the letter U), New Zealand (N), Italy (I) and Sweden (S), the U.N.I.S. countries. Data belonging to their economies from 1960 to 2015 are defined based on the attributes given in Table 2.7 (economy dataset; http://data.worldbank.org) [21].

The matrix dimension of economy (U.N.I.S.) dataset is defined as follows (see Figure 2.9).

For i = 0, 1, …, 228 rows x: i0, i1, …, i228 and j = 0, 1, …, 18 columns y: j0, j1, …, j18, D(228 × 18) represents the economy dataset.

In Table 2.7, economic growth and its parameters are explained for the U.N.I.S. countries for the period between 1960 and 2015.

The sample data values that belong to these economies can be presented as in Table 2.8 (see the World Bank website [21]).

2.4.2 MS data and content (neurology and radiology data): MS dataset

MS is a demyelinating disease of the central nervous system (CNS). The disease is brought about by an autoimmune reaction that results from a complex interaction of genetic and environmental factors [22–25]. MS may affect any site in the CNS; nevertheless, it is possible to recognize six types of the disease (Table 2.9).

Table 2.7: U.N.I.S. economies dataset explanation.

  1. RRMS: This is the typical form of MS, usually with onset in the late teens or twenties, characterized by a severe attack that is followed by complete or incomplete recovery. About 70% of patients with MS experience a relapsing–remitting course [22, 23]. Further attacks may occur at unpredictable intervals, and each such attack is followed by increasing disability. The relapsing–remitting pattern tends to turn into the secondary progressive form of the disease in the late thirties.
  2. PPMS: The disease displays a steady deteriorating course that may be interrupted by periods of quiescence without improvement. The rate of progression may vary: This form can end up with death within a few years in the most severe cases. In contrast, the more chronic form of progressive MS resembles the benign form of the disease [22–24].
  3. SPMS: The relapsing–remitting form of the disease usually develops into SPMS following a changeable period of time [22] (usually this corresponds to the late thirties).
  4. Relapsing progressive MS: Occasionally it has been observed that patients with progressive form of MS have superimposed relapses with no significant recovery [22].
  5. Benign MS: The benign form of MS is seen in approximately 20% of the cases. It is particularly improbable that these patients will ever be debilitated by the disease, and they can go on to lead a full life span with only occasional minor symptoms. The existence of a benign form of MS emphasizes the significance of recording the date of the first symptoms for patients who develop few residual abnormal signs several years after the onset of the disease. It is important that these patients be informed about the existence of a benign form of MS 10 years after the first record of symptoms, and also about the fact that the benign course will continue in the upcoming years of their lives [22].
  6. Spinal form of MS: This form of MS manifests symptoms of predominantly spinal cord involvement from the beginning and keeps to this pattern. There may be a clear-cut pattern of relapse and remission initially, followed by the secondary progressive form of the disease after several years, or the manifestation may be one of steady progression [22].

Table 2.8: An excel table to provide an example of economy (U.N.I.S.) data.

Table 2.9: Six types of MS [23].

1 – Relapsing-Remitting MS (RRMS)
2 – Primary Progressive MS (PPMS)
3 – Secondary Progressive MS (SPMS)
4 – Relapsing Progressive MS (RPMS)
5 – Benign MS
6 – Spinal form of MS

One of the most important tools in the diagnosis of MS is MRI. Using this tool, specialists examine the formation of the plaques that occur in MS.

MRI: This is a very important tool in the diagnosis of MS, capable of revealing inflamed or damaged tissue sites. This study makes use of the MRIs of patients who have been diagnosed with MS based on the McDonald criteria [24]. The dataset has been formed by measuring the lesions at three sites on the MRI [25]: the brain stem, the corpus callosum-periventricular region and the upper cervical region (see Figure 2.9).

Figure 2.9: Sample MRI images belonging to a healthy subject and an MS patient (obtained from Hacettepe University). (a) An MRI image of a healthy subject and (b) an MRI image with lesions indicated.

EDSS: The EDSS is a scale that measures eight functional systems (FS), i.e., regions of the CNS. The scale first measures freedom of movement, using indicators such as temporary numbness in the face and fingers, visual impairments and walking distance [26]. Each item from these systems is ranked based on disability status, and to this functional system score, mobility as well as limitations in daily life are added. The twenty ranks within the EDSS are provided with their corresponding definitions in Table 2.10 [26, 27].

Table 2.10: Description of EDSS [26, 27].

Score    Description
1.0      No disability, minimal signs in one FS
1.5      No disability, minimal signs in more than one FS
2.0      Minimal disability in one FS
2.5      Mild disability in one FS or minimal disability in two FS
3.0      Moderate disability in one FS, or mild disability in three or four FS. No impairment to walking
3.5      Moderate disability in one FS and more than minimal disability in several others. No impairment to walking
4.0      Significant disability but self-sufficient and up and about some 12 h a day. Able to walk without aid or rest for 500 m
4.5      Significant disability but up and about much of the day, able to work a full day; may otherwise have some limitation of full activity or require minimal assistance. Able to walk without aid or rest for 300 m
5.0      Disability severe enough to impair full daily activities and ability to work a full day without special provisions. Able to walk without aid or rest for 200 m
5.5      Disability severe enough to preclude full daily activities. Able to walk without aid or rest for 100 m
6.0      Requires a walking aid (cane, crutch and so on) to walk about 100 m with or without resting
6.5      Requires two walking aids (pair of canes, crutches and so on) to walk about 20 m without resting
7.0      Unable to walk beyond approximately 5 m even with aid. Essentially restricted to wheelchair; though wheels self in standard wheelchair and transfers alone. Up and about in wheelchair some 12 h a day
7.5      Unable to take more than a few steps. Restricted to wheelchair and may need aid in transferring. Can wheel self but cannot carry on in standard wheelchair for a full day and may require a motorized wheelchair
8.0      Essentially restricted to bed or chair or pushed in wheelchair. May be out of bed much of the day. Retains many self-care functions. Generally has effective use of arms
8.5      Essentially restricted to bed much of the day. Has some effective use of arms; retains some self-care functions
9.0      Confined to bed. Can still communicate and eat
9.5      Confined to bed and totally dependent. Unable to communicate effectively or eat/swallow
10.0     Death due to MS

Now let us provide definitions for the MS dataset (synthetic MS data) with details, which are used in the applications of the algorithms in the following sections.

2.4.2.1 Obtaining synthetic MS dataset

The MS dataset (139 × 112) consists of EDSS and MRI data for 120 MS patients (with RRMS, SPMS and PPMS subgroups) and 19 healthy individuals. The level of disability in the MS patients was determined using the EDSS. Besides the EDSS, MRIs of the patients were also used. The magnetic resonance images were obtained with a 1.5 Tesla device (Magnetom, Siemens Medical Systems, Erlangen, Germany). Lesions in three regions (the brain stem, the corpus callosum-periventricular region and the upper cervical region) were included in the information, and the MRIs were evaluated by radiologists. MRIs of the patients were obtained from Hacettepe University Radiology Department and Primer Magnetic Resonance Imaging Center. The MS data have been obtained from Rana Karabudak, Professor (MD), the Director of the Neuroimmunology Unit in the Neurology Department, Ankara, Turkey.

A brief description of the contents in synthetic MS dataset is shown in Figure 2.10.

Figure 2.10: Obtaining the synthetic MS dataset.

Synthetic MS dataset (304 × 112) is obtained by applying CART algorithm to the MS dataset (139 × 112) as shown in Figure 2.10.

For the synthetic MS dataset D(304 × 112) addressed in the following application sections regarding neurology and radiology data, there are three clinically definite MS subgroups (RRMS, PPMS and SPMS) and a healthy group [25]. From the synthetic data, subgroups of 76 patients with RRMS, 76 with PPMS, 76 with SPMS and 76 healthy individuals have been formed homogeneously. The dataset is composed of the MRI lesion sizes and lesion counts for the three sites of the brain and the EDSS scores of the MS patients.

Table 2.11 lists the relevant data: the MRI lesion sizes and lesion counts for the three sites of the brain and the EDSS scores belonging to the MS patients.

The matrix dimension of the MS dataset is defined as follows (see Figure 2.10).

For i = 0, 1, …, 304 rows x: i0, i1, …, i304 and j = 0, 1, …, 112 columns y: j0, j1, …, j112, D(304 × 112) represents the synthetic MS dataset.

Table 2.11: Synthetic MS dataset explanation.

The sample data values of synthetic MS dataset belonging to the individuals are listed in Table 2.12.

The first region is the brain stem, with its lesion count (MRI 1) and lesion sizes; the second region is the corpus callosum, with its lesion count (MRI 2) and lesion sizes; and the third region is the upper cervical region, with its lesion count (MRI 3) and lesion sizes. The lesion sizes for the brain stem are (0 mm, 1 mm, …, 16 mm, …), for the corpus callosum (0 mm, 2.5 mm, …, 8 mm, …, 27 mm, …) and for the upper cervical region (0 mm, 2.5 mm, …, 8 mm, …, 15 mm, …).

The CART algorithm has been used to obtain the synthetic MS dataset D(304 × 112) from the MS dataset D(139 × 112).

Let us now explain how the synthetic MS dataset D(304 × 112) is generated from the MS dataset through a smaller-scale sample dataset by using the CART algorithm shown in Figure 2.7 (for more details see Section 6.5).

Example 2.2 A sample MS dataset D(20 × 5) (see Table 2.13) is chosen from the MS dataset D(139 × 112). The CART algorithm is applied to it and a sample synthetic dataset is derived. The steps for this process are provided in detail below.

Let us apply the CART algorithm to the sample MS dataset D(20 × 5) below (see Table 2.13).

The steps of obtaining the sample synthetic MS dataset through the application of the CART algorithm to the sample MS dataset D(20 × 5) are shown in Figure 2.7 (for more details see Section 6.5). The steps of the algorithm are given in detail below.

Steps (1–6) The data belonging to each attribute in the sample MS dataset D(20 × 5) (see Table 2.13) are ranked from lowest to highest (when attribute values are the same, only one value is included in the ranking). For each attribute in the sample MS dataset, the average of the median and the value following the median is calculated; xmedian represents the median value of the attribute and xmedian+1 represents the value after the median (see eq. (2.4(a))).

The average of xmedian and xmedian+1 is calculated as (xmedian + xmedian+1)/2 (see eq. (2.1)). The threshold value calculation for each attribute (column; j: 1, 2, …, 5) in the sample MS dataset D(20 × 5) listed in Table 2.13 is applied according to the following steps.

Table 2.12: An excel table to provide an example of synthetic MS data.

Table 2.13: Sample MS dataset D(20 × 5) chosen from the MS dataset.

Table 2.14: Results derived from the splitting of the binary groups for the attributes that belong to sample MS dataset (RRMS and SPMS subgroups).

The steps of the threshold value calculation for the j: 1: {EDSS} attribute are as follows:

In Table 2.13, the {EDSS} values (lines i: 1, …, 20) in column j: 1 are ranked from lowest to highest (when attribute values are the same, only one value is included in the ranking).

EDSS_sort = {3, 3.5, 4, 5, 5.5, 6, 6.5, 7, 7.5, 8}
EDSS_median = 5.5 (see eq. (2.4(a))), EDSS_median+1 = 6
mean = (EDSS_median + EDSS_median+1)/2 = (5.5 + 6)/2 = 5.75 (see eq. (2.1))

Two groups are formed for “EDSS ≥ 5.75” and “EDSS < 5.75.”

The steps for calculating the threshold value of j: 2: {MRI 1} attribute are as follows:

In Table 2.13, the {MRI 1} values (lines i: 1, …, 20) in column j: 2 are ranked from lowest to highest (when attribute values are the same, only one value is included in the ranking).

MRI1_sort = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
MRI1_median = 5, MRI1_median+1 = 6
mean = (MRI1_median + MRI1_median+1)/2 = (5 + 6)/2 = 5.5