Visualizing distributions

In the last section, we discussed distributions and we saw some measures that describe them. In this section, we're going to see how to visualize distributions. Visualizations are more intuitive than numeric measures and they will help us to understand our data.

Rattle offers two different sets of charts, depending on the nature of the variables. For numeric variables, we can use Box Plot, Histogram, Cumulative, and Benford. For categorical variables, Rattle provides Bar Plot, Dot Plot, and Mosaic charts. We're going to explore the most common visual representations.

Before using Rattle to plot charts, make sure that the Advanced Graphics option is unchecked. With this option checked, some charts like histograms will not be plotted. This is shown in the following screenshot:

[Screenshot: the Advanced Graphics option in Rattle]

Numeric variables

We're going to use the variable Age from the Titanic passenger list to show the different types of charts for numeric variables. Load the dataset, set the variable Survived as the target, go to the Explore tab, and select the Distributions type. The central area of the screen is divided into two panels: the upper panel is reserved for numeric variables and the lower one for categorical variables.

In the numeric variables area, you can see six variables (Survived, Pclass, Age, SibSp, Parch, and Fare) and four different plots (Box Plot, Histogram, Cumulative, and Benford). To plot a chart, you have to select the appropriate checkbox and click on the Execute button, as shown in the following screenshot:

[Screenshot: the numeric variables panel with the plot checkboxes and the Execute button]

Box plots

The first chart we're going to discuss is the Box Plot. We're going to plot the variable Age from the Titanic passenger list. Select the Annotate checkbox to have the values of the data points labeled, as shown in the following screenshot:

[Screenshot: the Annotate checkbox selected for the Box Plot]

These plots summarize the distribution of a variable in a dataset. In the following screenshot, we can see the representation of the variable Age:

[Screenshot: box plots of the variable Age]

If you identified the target variable when you loaded the dataset, Rattle will create one plot for all observations and one plot for every possible value of the target variable. In this example, the target variable Survived has two possible values, 0 and 1.

We have highlighted some points of the central plot – the green part. In the center of the plot, the horizontal line labeled with a 28 is the median. The point labeled with a 21 is the first quartile, or Q1, and 39 represents the third quartile, or Q3. In this plot, the interquartile range is 39 - 21 = 18 (Q3 – Q1). The lower and upper points labeled with 1 and 66 are the whiskers: the most extreme observations that lie no more than 1.5 times the interquartile range below Q1 or above Q3. Points above the point labeled with a 66 are outliers.
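To make the whisker rule concrete, take the values read from the plot: Q1 - 1.5 × 18 = -6 and Q3 + 1.5 × 18 = 66, which is consistent with the whisker labels 1 and 66. The following is a minimal sketch in Python (not the R code that Rattle runs) of how these box plot numbers can be computed. The function name `box_plot_summary` and the list of ages are made up for illustration, and quartile conventions differ between tools, so the results may not match Rattle exactly.

```python
import statistics

def box_plot_summary(values, whisker_factor=1.5):
    """Return the numbers a box plot draws, plus the outliers."""
    data = sorted(values)
    # Quartiles; the "inclusive" method is one of several conventions.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Fences sit 1.5 * IQR beyond the quartiles (not the median).
    lower_fence = q1 - whisker_factor * iqr
    upper_fence = q3 + whisker_factor * iqr
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        # Whiskers end at the most extreme observations inside the fences.
        "lower_whisker": inside[0],
        "upper_whisker": inside[-1],
        "outliers": [v for v in data if v < lower_fence or v > upper_fence],
    }

# Made-up ages for illustration only (not the Titanic data).
ages = [1, 4, 18, 21, 22, 25, 28, 30, 35, 39, 40, 52, 66, 80]
summary = box_plot_summary(ages)
print(summary)
```

In this made-up sample, the only observation beyond the upper fence is reported as an outlier, and the whiskers stop at the last observations inside the fences.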

Histograms

Histograms give us a quick view of the spread of a distribution. Rattle's histogram combines three charts in one, namely the R histogram (the bars), the density plot (the line), and the rug plot. The rug plot is marked with a red arrow in this screenshot:

[Screenshot: histogram of the variable Age with the rug plot marked by a red arrow]

This histogram shows us the distribution in terms of age. The vertical bars are the original histogram. Every bar represents an Age range, and the height of the bar represents the Frequency, that is, the number of observations that fall in that age range. The density plot is a smoothed estimate of the same distribution, so it does not depend on where the bar boundaries happen to fall. Finally, in the rug plot, every line shows the exact value of an observation, as shown in the following screenshot:

[Screenshot: histogram of the variable Age]

In the preceding histogram, we can see that most people on the Titanic were between 20 and 40 years of age.
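The bars of a histogram are simply counts per range. As a rough sketch in Python (not Rattle's own code), the following groups ages into 10-year bins and prints a text version of the bars. The function name `histogram_counts`, the bin width, and the ages are assumptions made up for illustration.

```python
from collections import Counter

def histogram_counts(values, bin_width):
    """Count the observations in each [start, start + bin_width) bin."""
    counts = Counter(int(v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Made-up ages for illustration only (not the Titanic data).
ages = [2, 19, 22, 24, 27, 29, 31, 35, 38, 45, 54, 61]
counts = histogram_counts(ages, bin_width=10)

# One row per bin: the bar length is the frequency.
for start, count in counts.items():
    print(f"{start:>2}-{start + 10:<3} {'#' * count} ({count})")
```

The individual values in `ages` correspond to the lines of the rug plot, while `counts` corresponds to the bar heights.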

Cumulative plots

The cumulative plot represents the percentage of the population that has a value less than or equal to the value shown on the x axis. We've plotted the cumulative plot for the variable Age. If you look at the following screenshot, you can see that nearly 80 percent of the passengers were 40 years old or younger.

We've circled the younger passengers. As in the histogram we plotted before, this plot shows that young people had a greater probability of survival.

[Screenshot: cumulative plot of the variable Age]
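Each point on a cumulative plot is the share of observations at or below a given value. The following Python sketch (not Rattle's own code) computes that share for a single threshold. The function name `cumulative_share` and the ages are made up for illustration.

```python
from bisect import bisect_right

def cumulative_share(values, threshold):
    """Return the fraction of values less than or equal to threshold."""
    data = sorted(values)
    # bisect_right counts how many sorted values are <= threshold.
    return bisect_right(data, threshold) / len(data)

# Made-up ages for illustration only (not the Titanic data).
ages = [2, 19, 22, 24, 27, 29, 31, 35, 38, 40, 54, 61]
share_40 = cumulative_share(ages, 40)
print(f"{share_40:.0%} of this sample is 40 or younger")
```

Evaluating the function at every distinct value and joining the points gives the stepped curve of the cumulative plot.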

Categorical variables

We're now going to explore categorical variables. As with numeric variables, you have to load the Titanic dataset and set Survived as the target variable. Then go to the Explore tab and select the Distributions type.

To plot a new graph, you have to check the plot and the variable in the Categoric variable panel and click on Execute. This is illustrated in the following screenshot:

[Screenshot: the Categoric variable panel]

We'll use the variable embarked from the Titanic passenger list to plot a bar plot, a dot plot, and a mosaic plot.

Bar plots

The bar chart is probably the simplest chart and the easiest to understand – it uses vertical or horizontal bars to compare categories. In the following screenshot, we can see a bar chart of the variable embarked:

[Screenshot: bar plot of the variable embarked]

In the previous chapter, we introduced this dataset and we explained that the variable embarked has three possible values – C for Cherbourg, Q for Queenstown, and S for Southampton. This chart makes it quick and easy to see that most of the people (644) embarked at Southampton. Looking at the blue and green bars, we can see that around a third of the passengers who embarked at Southampton survived and around half of the passengers who embarked at Cherbourg survived.

Tip

Try to create a bar chart of the variable sex and you'll discover that 74.2 percent of females survived and only 18.9 percent of the males survived the Titanic disaster.
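The percentages read from a bar chart like this come from counting observations per category and per target value. As a minimal sketch in Python (not Rattle's own code), the following computes a survival rate per category. The function name `survival_rates` and the records are made up for illustration and do not reproduce the Titanic figures.

```python
from collections import Counter

def survival_rates(records):
    """Map each category to the share of its records with outcome 1."""
    totals = Counter(category for category, _ in records)
    survived = Counter(category for category, outcome in records if outcome == 1)
    return {category: survived[category] / totals[category] for category in totals}

# Made-up (harbor, survived) pairs for illustration only.
records = [
    ("S", 0), ("S", 0), ("S", 1), ("S", 0),
    ("C", 1), ("C", 0),
    ("Q", 0), ("Q", 1),
]
rates = survival_rates(records)
for harbor, rate in rates.items():
    print(f"{harbor}: {rate:.1%} survived")
```

Here `totals` plays the role of the overall bar for each category, and `survived` plays the role of the bar for the target value 1.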

Mosaic plots

The mosaic plot shows the distribution of the values for a variable. At the top of the plot, there are three letters (S, C, and Q) representing the three harbors. Below each letter, there is a bar divided into two sub-bars (blue and green). We have highlighted the bar below Q, as shown in the following screenshot:

[Screenshot: mosaic plot of the variable embarked with the bar below Q highlighted]

The width of each bar represents the number of occurrences. In our plot, the widest bar is the bar below S. This is the harbor where most of the people embarked. For each harbor, we have a green and a blue bar. The size of the green bar represents the number of people who didn't survive and the blue bar represents the number of people who survived.

As you can see, the mosaic plot gives us a quick understanding of how our data is distributed.
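The geometry of a mosaic plot can be sketched with two proportions per category. The Python example below (not Rattle's own code) assumes the usual mosaic convention: the width of a bar is the category's share of all observations, and each sub-bar's height is the share of that target value within the category. The function name `mosaic_layout` and the records are made up for illustration.

```python
from collections import Counter, defaultdict

def mosaic_layout(records):
    """Return the bar width and sub-bar heights for each category."""
    by_category = defaultdict(Counter)
    for category, outcome in records:
        by_category[category][outcome] += 1
    total = len(records)
    layout = {}
    for category, outcomes in by_category.items():
        n = sum(outcomes.values())
        layout[category] = {
            # Width: this category's share of all observations.
            "width": n / total,
            # Heights: share of each outcome within the category.
            "heights": {outcome: count / n for outcome, count in outcomes.items()},
        }
    return layout

# Made-up (harbor, survived) pairs for illustration only.
records = (
    [("S", 0)] * 4 + [("S", 1)] * 2
    + [("C", 0)] * 1 + [("C", 1)] * 2
    + [("Q", 0)] * 1
)
layout = mosaic_layout(records)
for harbor, bar in layout.items():
    print(harbor, round(bar["width"], 2), bar["heights"])
```

Under this convention, the area of each sub-bar (width times height) is proportional to the number of observations it represents.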
