Configuring SparkR with RStudio

Assuming RStudio is already installed on your machine, follow these steps:

  1. Now open RStudio and create a new R script; then write the following code:
          SPARK_HOME = "/home/spark-2.0.0-bin-hadoop2.7/R/lib" 
          Sys.setenv(SPARK_MEM="8g") 
          Sys.setenv(SPARK_HOME = "/home/spark-2.0.0-bin-hadoop2.7") 
          .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R",
          "lib"),.libPaths())) 
    
  2. Load the necessary package for SparkR by using this code:
          library(SparkR, lib.loc = SPARK_HOME)
          library(SparkR) 
    
  3. Configure the SparkR environment as follows:
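          # Spark 2.0 deprecates sparkR.init() and sparkRSQL.init() in favor of
          # sparkR.session(); the calls below still run but emit warnings.
          sc <- sparkR.init(appName = "SparkR-DataFrame-example",
                            master = "local")
          sqlContext <- sparkRSQL.init(sc) 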
    
  4. Now let's create the first DataFrame and print the first few rows, as follows:
          df <- createDataFrame(sqlContext, faithful) 
          head(df) 
    
  5. You might need to install the following packages in order to make the devtools package work:
          install.packages("xml2", dependencies = TRUE) 
          install.packages("Rcpp", dependencies = TRUE) 
          install.packages("plyr", dependencies = TRUE) 
          install.packages("devtools", dependencies = TRUE) 
          install.packages("MatrixModels", dependencies = TRUE) 
          install.packages("quantreg", dependencies = TRUE)  
          install.packages("moments", dependencies = TRUE) 
          install.packages(c("digest", "gtable", "scales", "rversions",
          "lintr")) 
    
  6. Moreover, you might need to install libcurl for RCurl, which devtools depends on. To do this, run the following commands:
          sudo apt-get -y build-dep libcurl4-gnutls-dev 
          sudo apt-get install libcurl4-gnutls-dev 
          sudo apt-get install r-cran-plyr 
          sudo apt-get install r-cran-reshape2
    
  7. Now install the ggplot2.SparkR package from GitHub using the following code:
          library(devtools) 
          devtools::install_github("SKKU-SKT/ggplot2.SparkR") 
    
  8. Next, we will compute the skewness and kurtosis of a sample DataFrame. Before that, load the necessary packages:
          library(moments) 
          library(ggplot2) 
    
  9. Let's create the DataFrame for the daily exercise example shown in the Feature engineering and data exploration section of Chapter 4, Extracting Knowledge through Feature Engineering, and show the first few rows using the head() function:
          time_taken <- c (15, 16, 18, 17.16, 16.5, 18.6, 19.0, 20.4, 20.6, 
          25.15, 27.27, 25.24, 21.05, 21.65, 20.92, 22.61, 23.71, 35, 39, 50) 
          df_new <- data.frame(time_taken)  
          head(df_new)  
          df<- createDataFrame(sqlContext, data = df_new)  
          head(df) 
    
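  10. Now calculate the skewness and kurtosis. The functions from the moments package operate on local R data, so pass the local data frame df_new rather than the Spark DataFrame df:
          skewness(df_new) 
          kurtosis(df_new) 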
    

    We used the terms skewness and kurtosis in Chapter 4, Extracting Knowledge through Feature Engineering. If you are not familiar with them, here is a brief definition. From a statistical perspective, skewness is a measure of symmetry or, more precisely, of the lack of symmetry in the distribution of a dataset.

    A distribution is symmetric if it looks the same to the left and to the right of its center point.

    Kurtosis, on the other hand, is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution.

  11. Finally, let's draw the density plot by calling the ggplot() function of the ggplot2.SparkR package:
          ggplot(df, aes(x = time_taken)) + stat_density(geom="line",
          col= "green", size = 1, bw = 4) + theme_bw() 
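
The definitions of skewness and kurtosis given in step 10 can be made concrete with a small base R sketch that computes both from central moments, using the same time_taken values. The helper names central_moment, skewness_manual, and kurtosis_manual are illustrative and not part of the moments package; the formulas assume the population (divide-by-n) moments and Pearson's kurtosis, for which a normal distribution scores 3:

```r
# Same sample as in step 9 (local R vector, no Spark needed)
time_taken <- c(15, 16, 18, 17.16, 16.5, 18.6, 19.0, 20.4, 20.6,
                25.15, 27.27, 25.24, 21.05, 21.65, 20.92, 22.61, 23.71,
                35, 39, 50)

# k-th central moment: average of (x - mean)^k, dividing by n
central_moment <- function(x, k) mean((x - mean(x))^k)

# Skewness: third central moment scaled by the variance to the power 3/2
skewness_manual <- function(x) {
  central_moment(x, 3) / central_moment(x, 2)^1.5
}

# Pearson's kurtosis: fourth central moment scaled by the squared variance
kurtosis_manual <- function(x) {
  central_moment(x, 4) / central_moment(x, 2)^2
}

skew <- skewness_manual(time_taken)
kurt <- kurtosis_manual(time_taken)
print(skew)
print(kurt)
```

A positive skewness indicates a longer right tail, which is what the few large values (35, 39, 50) in this sample suggest, and a kurtosis above 3 indicates heavier tails than a normal distribution.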
    

If you are not familiar with the ggplot2 R package, note that ggplot2 is a plotting system for R based on the grammar of graphics, which builds on the strengths of base and lattice graphics. It takes care of many of the fiddly details that make plotting a hassle, such as placing and drawing legends, and it provides a powerful model of graphics. This makes it easier to produce both simple and complex multi-layered graphics.
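
To check locally what the density curve from step 11 represents, without Spark or ggplot2, the following base R sketch estimates the same kind of curve with stats::density(). It assumes that the bw = 4 argument in step 11 is the standard deviation of a Gaussian smoothing kernel, which is how ggplot2 documents that parameter; the variable names d and area are illustrative:

```r
# Same sample as in step 9 (local R vector)
time_taken <- c(15, 16, 18, 17.16, 16.5, 18.6, 19.0, 20.4, 20.6,
                25.15, 27.27, 25.24, 21.05, 21.65, 20.92, 22.61, 23.71,
                35, 39, 50)

# Gaussian kernel density estimate with a fixed bandwidth of 4
d <- density(time_taken, bw = 4, kernel = "gaussian")

# The estimate is evaluated on an evenly spaced grid; a Riemann sum over
# that grid should be close to 1, as expected for a density
area <- sum(d$y) * (d$x[2] - d$x[1])
print(area)

# Draw the curve with base graphics
plot(d, main = "Density of time_taken (bw = 4)", xlab = "time_taken")
```

A larger bandwidth gives a smoother curve, while a smaller one follows the individual observations more closely.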

Tip

More info about ggplot2 and its documentation can be found at the following website: http://docs.ggplot2.org/current/.
