Let's assume you have RStudio installed on your machine. Follow the steps mentioned here:
SPARK_HOME = "/home/spark-2.0.0-bin-hadoop2.7/R/lib"
Sys.setenv(SPARK_MEM = "8g")
Sys.setenv(SPARK_HOME = "/home/spark-2.0.0-bin-hadoop2.7")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR, lib.loc = SPARK_HOME)
library(SparkR)
sc <- sparkR.init(appName = "SparkR-DataFrame-example", master = "local")
sqlContext <- sparkRSQL.init(sc)
df <- createDataFrame(sqlContext, faithful)
head(df)
Install the packages needed to make the devtools package work:
install.packages("xml2", dependencies = TRUE)
install.packages("Rcpp", dependencies = TRUE)
install.packages("plyr", dependencies = TRUE)
install.packages("devtools", dependencies = TRUE)
install.packages("MatrixModels", dependencies = TRUE)
install.packages("quantreg", dependencies = TRUE)
install.packages("moments", dependencies = TRUE)
install.packages("xml2")
install.packages(c("digest", "gtable", "scales", "rversions", "lintr"))
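Re-running every install.packages() call can be slow when most of the packages are already present. The following is a minimal sketch that reports which of the required packages still need installing. It assumes a hypothetical helper named missing_packages(), which is not part of devtools or of the original steps.

```r
# Packages named in the installation step above.
required <- c("xml2", "Rcpp", "plyr", "devtools", "MatrixModels",
              "quantreg", "moments", "digest", "gtable", "scales",
              "rversions", "lintr")

# Hypothetical helper: return the subset of `pkgs` that is not
# found in any of the current library paths.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# Character vector of packages that are not installed yet
# (empty if everything is already available).
to_install <- missing_packages(required)
print(to_install)
```

If to_install is not empty, it can be passed to install.packages(to_install, dependencies = TRUE) so that only the missing packages are downloaded.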
Install libcurl for RCurl, which devtools depends on. To do this, just run this command:
sudo apt-get -y build-dep libcurl4-gnutls-dev
sudo apt-get install libcurl4-gnutls-dev
sudo apt-get install r-cran-plyr
sudo apt-get install r-cran-reshape2
Install the ggplot2.SparkR package from GitHub using the following code:
library(devtools)
devtools::install_github("SKKU-SKT/ggplot2.SparkR")
library(moments)
library(ggplot2)
Create a sample data frame and display its first rows with the head command:
time_taken <- c(15, 16, 18, 17.16, 16.5, 18.6, 19.0, 20.4, 20.6, 25.15, 27.27, 25.24, 21.05, 21.65, 20.92, 22.61, 23.71, 35, 39, 50)
df_new <- data.frame(time_taken)
head(df_new)
df <- createDataFrame(sqlContext, data = df_new)
head(df)
skewness(df_new)
kurtosis(df_new)
You are probably aware that we used the two terms skewness and kurtosis in Chapter 4, Extracting Knowledge through Feature Engineering. If you are not familiar with these two terms, here is a brief definition. From a statistical perspective, skewness is a measure of symmetry or, more precisely, of the lack of symmetry in the distribution of a dataset.
Now you might be wondering what symmetric means. A distribution is symmetric if it looks the same to the left and right of its center point.
Kurtosis, on the other hand, is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution.
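To make the two definitions concrete, here is a minimal sketch that computes both measures from central moments in base R, using the same time_taken values as the example data. The function names central_moment, skewness_manual, and kurtosis_manual are hypothetical, and the formulas are the plain population-moment versions, which may differ slightly from sample-corrected variants used elsewhere.

```r
time_taken <- c(15, 16, 18, 17.16, 16.5, 18.6, 19.0, 20.4, 20.6, 25.15,
                27.27, 25.24, 21.05, 21.65, 20.92, 22.61, 23.71, 35, 39, 50)

# k-th central moment: mean of the k-th power of deviations from the mean.
central_moment <- function(x, k) mean((x - mean(x))^k)

# Skewness: third central moment divided by the variance raised to 3/2.
# Zero for a perfectly symmetric sample.
skewness_manual <- function(x) {
  central_moment(x, 3) / central_moment(x, 2)^1.5
}

# Pearson kurtosis: fourth central moment divided by the squared variance.
# A normal distribution has a value of 3 under this definition.
kurtosis_manual <- function(x) {
  central_moment(x, 4) / central_moment(x, 2)^2
}

print(skewness_manual(time_taken))
print(kurtosis_manual(time_taken))
```

A positive skewness indicates a longer right tail, which is what the few large values (35, 39, 50) produce here. Under this definition, a kurtosis above 3 points to heavier tails than a normal distribution and a value below 3 to lighter tails.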
Plot the density of the data using the ggplot() method of the ggplot2.SparkR package:
ggplot(df, aes(x = time_taken)) +
  stat_density(geom = "line", col = "green", size = 1, bw = 4) +
  theme_bw()
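The curve in this plot is a kernel density estimate. As a rough sketch of the underlying computation, the following uses base R's density() function on the local vector, with its default Gaussian kernel and the same bandwidth of 4. This is an assumption made for illustration: it runs on local data and is not necessarily how ggplot2.SparkR evaluates the density on a Spark DataFrame.

```r
time_taken <- c(15, 16, 18, 17.16, 16.5, 18.6, 19.0, 20.4, 20.6, 25.15,
                27.27, 25.24, 21.05, 21.65, 20.92, 22.61, 23.71, 35, 39, 50)

# Gaussian kernel density estimate with a fixed bandwidth of 4,
# mirroring the bw = 4 argument used in the plot.
dens <- density(time_taken, bw = 4)

# The estimate is evaluated on a regular grid:
# dens$x holds the grid points and dens$y the estimated density.
grid_step <- diff(dens$x[1:2])

# Riemann-sum approximation of the area under the curve,
# which should be close to 1 for a density.
area <- sum(dens$y) * grid_step

print(dens$bw)
print(area)

# Grid point at which the estimated density is highest.
print(dens$x[which.max(dens$y)])
```

A larger bandwidth gives a smoother curve, while a smaller one follows individual observations more closely, so changing bw is a quick way to see how sensitive the shape is to this choice.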
If you are not familiar with the ggplot2 R package, note that ggplot2 is a plotting system for R based on the grammar of graphics, which builds on the strengths of base and lattice graphics. It takes care of many of the fiddly details that make plotting a hassle, for example, placing or drawing legends in a graph, and it provides a powerful model of graphics. This makes it easier to produce simple as well as complex multi-layered graphics.
More info about ggplot2 and its documentation can be found at the following website: http://docs.ggplot2.org/current/.