Dataframes, lists, arrays, and matrices

Dataframes have several important features that make them useful for data analysis:

  • Rectangular data structures, with the typical use being cases (for example, the days in one month) listed down the rows and variables (page views, unique visitors, or referrers) listed along the columns
  • A mix of data types is supported. A typical data frame might include variables containing dates, numbers (integers or floats), and text
  • With subsetting and variable extraction, R provides a lot of built-in functionality to select rows and variables within a dataframe
  • Many functions include a data argument, which makes it very simple to pass dataframes into functions and process only the variables and cases that are relevant, which makes for cleaner and simpler code

We can inspect the first few rows of the dataframe using the head(analyticsData) command. The following screenshot shows the output of this command:

As you can see, there are four variables within the dataframe: one contains dates, two contain integer variables, and one contains a numeric variable.

Variables can be extracted from dataframes very simply using the $ operator, as follows:

> analyticsData$pageViews
[1] 836 676 940 689 647 899 934 718 776 570 651 816
[13] 731 604 627 946 634 990 994 599 657 642 894 983
[25] 646 540 756 989 965 821

Variables can also be extracted from dataframes using [], as shown in the following command:

> analyticsData[, "pageViews"]

Note the use of a comma with nothing before it to indicate that all rows are required. In general, dataframes can be accessed using dataObject[x,y], with x being the number(s) or name(s) of the rows required and y being the number(s) or name(s) of the columns required. For example, if the first 10 rows were required from the pageViews column, it could be achieved like this:

> analyticsData[1:10,"pageViews"]
[1] 836 676 940 689 647 899 934 718 776 570

Leaving the space before the comma blank returns all rows, and leaving the space after the comma blank returns all variables. For example, the following command returns the first three rows of all variables:

> analyticsData[1:3,]

The following screenshot shows the output of this command:

Dataframes are a special type of list. Lists can hold many different types of data, including lists. As with many data types in R, their elements can be named, which can be useful to write code that is easy to understand. Let's make a list of the options for dinner, with drink quantities expressed in milliliters.

In the following example, please also note the use of the c() function, which is used to produce vectors and lists by giving their elements separated by commas. R will pick an appropriate class for the return value, string for vectors that contain strings, numeric for those that only contain numbers, logical for Boolean values, and so on:

> dinnerList <- list("Vegetables" =
  c("Potatoes", "Cabbage", "Carrots"),
  "Dessert" = c("Ice cream", "Apple pie"),
  "Drinks" = c(250, 330, 500)
  )
Note that code is indented throughout, although entering code directly into the console will not produce indentations; it is done for readability.

Indexing is similar to dataframes (which are, after all, just a special instance of a list). They can be indexed by number, as shown in the following command:

> dinnerList[1:2]
$Vegetables
[1] "Potatoes" "Cabbage"  "Carrots"
    
$Dessert
[1] "Ice cream" "Apple pie"

This returns a list. Returning an object of the appropriate class is achieved using [[]]:

> dinnerList[[3]]
[1] 250 330 500

In this case, a numeric vector is returned. They can also be indexed by name, as shown in the following code:

> dinnerList["Drinks"]
$Drinks
[1] 250 330 500

Note that this also returns a list.

Matrices and arrays, which, unlike dataframes, only hold one type of data, also make use of square brackets for indexing, with analyticsMatrix[, 3:6] returning all rows of the third to sixth columns, analyticsMatrix[1, 3] returning just the first row of the third column, and analyticsArray[1, 2, ] returning the first row of the second column across all of the elements within the third dimension.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset