Chapter 6. Classifying Disease Diagnosis

So far, we have worked with supervised learning to predict numerical values. In the real world, however, numbers are only part of the data. Real variables also take categorical values, which are not purely numerical but describe important features that influence the problems neural networks are applied to. This chapter presents a didactic yet interesting application involving categorical values and classification: disease diagnosis. It digs deeper into classification problems and the representation of categorical data, and shows how to design a classification algorithm using neural networks. The topics covered in this chapter are as follows:

  • Foundations of classification problems
  • Categorical data
  • Logistic regression
  • Confusion matrix
  • Sensitivity and specificity
  • Neural networks for classification
  • Disease diagnosis using neural networks
  • Diagnosis for cancer
  • Diagnosis for diabetes

Foundations of classification problems

One thing neural networks are really good at is classifying records. Even a very simple perceptron network draws a decision boundary that determines whether a data point belongs to one region or another, where each region denotes a class. Let's look at this visually on an x-y scatter chart:

[Figure: x-y scatter chart with dashed lines separating the points into classes]

The dashed lines explicitly separate the points into classes. These points represent data records that originally had the corresponding class labels. Their classes were already known, so this classification task falls into the supervised learning category.
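To make the idea of a decision boundary concrete, the following is a minimal sketch of a single perceptron acting as a straight line in the x-y plane. The class name PerceptronBoundary and the weight and bias values are illustrative assumptions; this is not the neural network code developed in this book, and the weights are fixed by hand rather than learned.

```java
// Minimal sketch: one perceptron as a linear decision boundary.
// PerceptronBoundary and its weight values are hypothetical, chosen for illustration.
class PerceptronBoundary {
    private final double wx;
    private final double wy;
    private final double bias;

    PerceptronBoundary(double wx, double wy, double bias) {
        this.wx = wx;
        this.wy = wy;
        this.bias = bias;
    }

    // Returns 1 if the point lies on the non-negative side of the line
    // wx*x + wy*y + bias = 0, and 0 otherwise (hard-limit activation).
    int classify(double x, double y) {
        double net = wx * x + wy * y + bias;
        return net >= 0.0 ? 1 : 0;
    }

    public static void main(String[] args) {
        // Boundary x + y - 1 = 0 (weights set by hand, not trained).
        PerceptronBoundary p = new PerceptronBoundary(1.0, 1.0, -1.0);
        System.out.println(p.classify(2.0, 2.0)); // point above the line
        System.out.println(p.classify(0.0, 0.0)); // point below the line
    }
}
```

In a supervised setting, the weights and bias would be adjusted from the labeled records instead of being fixed as they are here.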

A classification algorithm seeks to find the boundaries between the classes in the data hyperspace. Once the classification boundaries are defined, a new data point, with an unknown class, receives a class label according to the boundaries defined by the classification algorithm. The figure below shows how a new record is classified:

[Figure: a new record placed among the existing class boundaries]

Based on the current class configuration, the new record's class is the third class.
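Assigning a label to a new record can be sketched in code as follows, under the assumption that each class is described by one linear score (as with one output neuron per class) and that the record receives the class with the highest score. The class BoundaryClassifier and the weight values are hypothetical and serve only as an illustration of the idea.

```java
// Minimal sketch: labeling a new record from already-defined linear boundaries.
// Each row of weights holds {wx, wy, bias} for one class; all values are illustrative.
class BoundaryClassifier {

    // Returns the zero-based index of the class with the highest linear score.
    static int classify(double[][] weights, double x, double y) {
        int best = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < weights.length; c++) {
            double score = weights[c][0] * x + weights[c][1] * y + weights[c][2];
            if (score > bestScore) {
                bestScore = score;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] weights = {
            {-1.0, 0.0, 0.0}, // first class: favors points with negative x
            { 1.0, 0.0, 0.0}, // second class: favors points with positive x
            { 0.0, 1.0, 0.0}  // third class: favors points with large y
        };
        // A new record with an unknown class; index 2 means the third class.
        System.out.println(classify(weights, 0.5, 3.0));
    }
}
```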

Categorical data

Applications usually deal with the types of data shown in the following figure:

[Figure: types of data]

Data can be numerical or categorical or, simply speaking, numbers or words. Numerical data is represented by a numeric value, which can be continuous or discrete. This is the data type used so far in this book's applications. Categorical data is a wider class of data that includes words, letters, or even numbers, but with a quite different meaning. While numerical data supports arithmetic operations, categorical data is only descriptive and cannot be processed like numbers, even if the value is a number. An example is the severity degree of a disease on a scale (from zero to five, for example). Another property of categorical data is that a variable has a finite number of values; in other words, only a defined set of values can be assigned to a categorical variable. A subclass of categorical data is ordinal data. This class is particular because its values can be sorted in a predefined order. An example is a set of adjectives indicating the state or quality of something (bad, fair, good, excellent):

Numerical                                Categorical
Only numbers                             Numbers, words, letters, signs
Can support arithmetic operations        Do not support arithmetic operations
Infinite or undefined range of values    Finite or defined set of values

Continuous            Discrete                Ordinal                    Non-ordinal
Real values           Integers, decimal       Can be ordered             Cannot be ordered
Any possible value    Predefined intervals    Can be assigned numbers    Each possible value is a flag
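The distinctions above can be mirrored in Java types. The following is a minimal sketch, not code from this book: the class DataKinds, the BloodType enum, and the sample values are hypothetical. Numerical data maps to primitive number types, while categorical data maps naturally to enums, whose declaration order can carry the ranking of an ordinal variable.

```java
// Minimal sketch: numerical, ordinal, and non-ordinal data as Java types.
// DataKinds, BloodType, and the sample values are illustrative assumptions.
class DataKinds {
    // Ordinal categorical variable: declaration order encodes the ranking.
    enum Quality { BAD, FAIR, GOOD, EXCELLENT }

    // Non-ordinal categorical variable: the values have no meaningful order.
    enum BloodType { A, B, AB, O }

    // Ordinal values can be compared, even though arithmetic on them is not meaningful.
    static boolean isBetter(Quality a, Quality b) {
        return a.compareTo(b) > 0;
    }

    public static void main(String[] args) {
        double temperature = 36.6; // numerical, continuous
        int visits = 3;            // numerical, discrete
        System.out.println(temperature * visits); // arithmetic is valid on numbers

        System.out.println(isBetter(Quality.GOOD, Quality.FAIR));
        // For BloodType, only equality is meaningful.
        System.out.println(BloodType.A == BloodType.O);
    }
}
```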

Tip

Note that here we are addressing structured data only. In the real world, most data is unstructured, including text and multimedia content. Although these types of data are also processed in learning from data applications, neural networks require them to be transformed into structured data types.

Working with categorical data

Structured data files, such as CSV or Excel files, usually contain columns of numerical and categorical data. In Chapter 5, Forecasting Weather, we created the classes LoadCsv (for loading CSV files) and DataSet (for storing data from CSV files), but these classes are prepared only for numerical data. The simplest way of representing a categorical value is to convert each possible value into a binary column: if the given value is present in the original column, the corresponding binary column holds a one; otherwise, it holds a zero:

[Figure: converting a categorical column into binary columns]
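The binary-column conversion described above can be sketched as follows. This is a minimal standalone illustration: the class OneHotEncoder and its methods are hypothetical and are not part of the LoadCsv or DataSet classes from Chapter 5, and the column values are made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Minimal sketch: one binary column per distinct categorical value.
// OneHotEncoder is a hypothetical helper, not a class from this book.
class OneHotEncoder {

    // Collects the distinct values in order of first appearance;
    // each of them becomes one binary column.
    static List<String> categories(String[] column) {
        Set<String> distinct = new LinkedHashSet<>();
        for (String value : column) {
            distinct.add(value);
        }
        return new ArrayList<>(distinct);
    }

    // Row i gets a one in the column matching its value and zeros elsewhere.
    static double[][] encode(String[] column, List<String> categories) {
        double[][] encoded = new double[column.length][categories.size()];
        for (int row = 0; row < column.length; row++) {
            int col = categories.indexOf(column[row]);
            if (col < 0) {
                throw new IllegalArgumentException("Unknown category: " + column[row]);
            }
            encoded[row][col] = 1.0;
        }
        return encoded;
    }

    public static void main(String[] args) {
        String[] column = {"red", "green", "red", "blue"}; // illustrative values
        List<String> cats = categories(column);
        double[][] encoded = encode(column, cats);
        System.out.println(cats);
        for (double[] row : encoded) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```

Keeping the category list separate from the encoding step means the same list can be reused for new records, so a value always lands in the same binary column.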

Ordinal columns can keep their defined values as numbers in the same column; however, if the original values are letters or words, they need to be converted into numbers via a dictionary, for example a Java Map (the legacy java.util.Dictionary class is obsolete).
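A word-to-number conversion for an ordinal column might look like the sketch below. It assumes java.util.Map as the dictionary, and the class OrdinalEncoder and its methods are hypothetical names introduced only for this illustration. Each word receives its position in the predefined order.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch: mapping ordinal words to numbers through a dictionary.
// OrdinalEncoder is a hypothetical helper, not a class from this book.
class OrdinalEncoder {

    // Builds the dictionary: each value maps to its position in the given order.
    static Map<String, Integer> buildDictionary(String... orderedValues) {
        Map<String, Integer> dictionary = new LinkedHashMap<>();
        for (int i = 0; i < orderedValues.length; i++) {
            dictionary.put(orderedValues[i], i);
        }
        return dictionary;
    }

    // Replaces every word in the column by its numeric code.
    static double[] encode(String[] column, Map<String, Integer> dictionary) {
        double[] encoded = new double[column.length];
        for (int i = 0; i < column.length; i++) {
            Integer code = dictionary.get(column[i]);
            if (code == null) {
                throw new IllegalArgumentException("Unknown value: " + column[i]);
            }
            encoded[i] = code;
        }
        return encoded;
    }

    public static void main(String[] args) {
        Map<String, Integer> dictionary =
                buildDictionary("bad", "fair", "good", "excellent");
        double[] encoded = encode(new String[] {"good", "bad", "excellent"}, dictionary);
        System.out.println(java.util.Arrays.toString(encoded));
    }
}
```

Unlike the binary-column approach, this keeps the data in a single column, which is only appropriate when the order of the values is meaningful.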

You may implement the strategy described above as an exercise. Otherwise, you will have to convert the data manually, which can be time-consuming depending on the number of data rows.
