Book Description
Master the Julia language to solve business-critical data science challenges. After covering the importance of Julia to the data science community and several essential data science principles, we start with the basics, including how to install Julia and its powerful libraries. Many examples are provided as we illustrate how to leverage each Julia command, dataset, and function.

Specialized script packages are introduced and described. Hands-on problems representative of those commonly encountered throughout the data science pipeline are provided, and we guide you in using Julia to solve them with published datasets. Many of these scenarios make use of existing packages and built-in functions, as we cover:

An overview of the data science pipeline, along with an example illustrating the key points, implemented in Julia
Options for Julia IDEs
Programming structures and functions
Engineering tasks, such as importing, cleaning, formatting, and storing data, as well as performing data preprocessing
Data visualization and some simple yet powerful statistics for data exploration purposes
Dimensionality reduction and feature evaluation
Machine learning methods, ranging from unsupervised (different types of clustering) to supervised ones (decision trees, random forests, basic neural networks, regression trees, and Extreme Learning Machines)
Graph analysis, including pinpointing the connections among the various entities and how they can be mined for useful insights

Each chapter concludes with a series of questions and exercises to reinforce what you learned. The last chapter of the book will guide you in creating a data science application from scratch using Julia.
Table of Contents
Introduction

CHAPTER 1: Introducing Julia
How Julia Improves Data Science; Data science workflow; Julia’s adoption by the data science community; Julia Extensions; Package quality; Finding new packages; About the Book

CHAPTER 2: Setting Up the Data Science Lab
Julia IDEs; Juno; IJulia; Additional IDEs; Julia Packages; Finding and selecting packages; Installing packages; Using packages; Hacking packages; IJulia Basics; Handling files; Creating a notebook; Saving a notebook; Renaming a notebook; Loading a notebook; Exporting a notebook; Organizing code in .jl files; Referencing code; Working directory; Datasets We Will Use; Dataset descriptions; Magic dataset; OnlineNewsPopularity dataset; Spam Assassin dataset; Downloading datasets; Loading datasets; CSV files; Text files; Coding and Testing a Simple Machine Learning Algorithm in Julia; Algorithm description; Algorithm implementation; Algorithm testing; Saving Your Workspace into a Data File; Saving data into delimited files; Saving data into native Julia format; Saving data into text files; Help!; Summary; Chapter Challenge

CHAPTER 3: Learning the Ropes of Julia
Data Types; Arrays; Array basics; Accessing multiple elements in an array; Multidimensional arrays; Dictionaries; Basic Commands and Functions; print(), println(); typemax(), typemin(); collect(); show(); linspace(); Mathematical Functions; round(); rand(), randn(); sum(); mean(); Array and Dictionary Functions; in; append!(); pop!(); push!(); splice!(); insert!(); sort(), sort!(); get(); Keys(), values(); length(), size(); Miscellaneous Functions; time(); Conditionals; if-else statements; string(); map(); VERSION(); Operators, Loops and Conditionals; Operators; Alphanumeric operators (<, >, ==, <=, >=, !=); Logical operators (&&, ||); Loops; for-loops; while-loops; break command; Summary; Chapter Challenge

CHAPTER 4: Going Beyond the Basics in Julia
String Manipulation; split(); join(); Regex functions; ismatch(); match(); matchall(); eachmatch(); Custom Functions; Function structure; Anonymous functions; Multiple dispatch; Function example; Implementing a Simple Algorithm; Creating a Complete Solution; Summary; Chapter Challenge

CHAPTER 5: Julia Goes All Data Science-y
Data Science Pipeline; Data Engineering; Data preparation; Data exploration; Data representation; Data Modeling; Data discovery; Data learning; Information Distillation; Data product creation; Insight, deliverance, and visualization; Keep an Open Mind; Applying the Data Science Pipeline to a Real-World Problem; Data preparation; Data exploration; Data representation; Data discovery; Data learning; Data product creation; Insight, deliverance, and visualization; Summary; Chapter Challenge

CHAPTER 6: Julia the Data Engineer
Data Frames; Creating and populating a data frame; Data frames basics; Variable names in a data frame; Accessing particular variables in a data frame; Exploring a data frame; Filtering sections of a data frame; Applying functions to a data frame’s variables; Working with data frames; Altering data frames; Sorting the contents of a data frame; Data frame tips; Importing and Exporting Data; Accessing .json data files; Storing data in .json files; Loading data files into data frames; Saving data frames into data files; Cleaning Up Data; Cleaning up numeric data; Cleaning up text data; Formatting and Transforming Data; Formatting numeric data; Formatting text data; Importance of data types; Applying Data Transformations to Numeric Data; Normalization; Discretization (binning) and binarization; Binary to continuous (binary classification only); Applying data transformations to text data; Case normalization; Vectorization; Preliminary Evaluation of Features; Regression; Classification; Feature evaluation tips; Summary; Chapter Challenge

CHAPTER 7: Exploring Datasets
Listening to the Data; Packages used in this chapter; Computing Basic Statistics and Correlations; Variable summary; Correlations among variables; Comparability between two variables; Plots; Grammar of graphics; Preparing data for visualization; Box plots; Bar plots; Line plots; Scatter plots; Basic scatter plots; Scatter plots using the output of t-SNE algorithm; Histograms; Exporting a plot to a file; Hypothesis Testing; Testing basics; Types of errors; Sensitivity and specificity; Significance and power of a test; Kruskal-Wallis tests; T-tests; Chi-square tests; Other Tests; Statistical Testing Tips; Case Study: Exploring the OnlineNewsPopularity Dataset; Variable stats; Visualization; Hypotheses; T-SNE magic; Conclusions; Summary; Chapter Challenge

CHAPTER 8: Manipulating the Fabric of the Data Space
Principal Components Analysis (PCA); Applying PCA in Julia; Independent Components Analysis (ICA): most popular alternative of PCA; Feature Evaluation and Selection; Overview of the methodology; Using Julia for feature evaluation and selection using cosine similarity; Using Julia for feature evaluation and selection using DID; Pros and cons of the feature evaluation and selection approach; Other Dimensionality Reduction Techniques; Overview of the alternative dimensionality reduction methods; Genetic algorithms; Discernibility-based approach; When to use a sophisticated dimensionality reduction method; Summary; Chapter Challenge

CHAPTER 9: Sampling Data and Evaluating Results
Sampling Techniques; Basic sampling; Stratified sampling; Performance Metrics for Classification; Confusion matrix; Accuracy metrics; Basic accuracy; Weighted accuracy; Precision and recall metrics; F1 metric; Misclassification cost; Defining the cost matrix; Calculating the total misclassification cost; Receiver Operating Characteristic (ROC) Curve and related metrics; ROC Curve; AUC Metric; Gini Coefficient; Performance Metrics for Regression; MSE Metric and its variant, RMSE; SSE Metric; Other metrics; K-fold Cross Validation (KFCV); Applying KFCV in Julia; KFCV tips; Summary; Chapter Challenge

CHAPTER 10: Unsupervised Machine Learning
Unsupervised Learning Basics; Clustering types; Distance metrics; Grouping Data with K-means; K-means using Julia; K-means tips; Density and the DBSCAN Approach; DBSCAN algorithm; Applying DBSCAN in Julia; Hierarchical Clustering; Applying hierarchical clustering in Julia; When to use hierarchical clustering; Validation Metrics for Clustering; Silhouettes; Clustering validation metrics tips; Effective Clustering Tips; Dealing with high dimensionality; Normalization; Visualization tips; Summary; Chapter Challenge

CHAPTER 11: Supervised Machine Learning
Decision Trees; Implementing decision trees in Julia; Decision tree tips; Regression Trees; Implementing regression trees in Julia; Regression tree tips; Random Forests; Implementing random forests in Julia for classification; Implementing random forests in Julia for regression; Random forest tips; Basic Neural Networks; Implementing neural networks in Julia; Neural network tips; Extreme Learning Machines; Implementing ELMs in Julia; ELM tips; Statistical Models for Regression Analysis; Implementing statistical regression in Julia; Statistical regression tips; Other Supervised Learning Systems; Boosted trees; Support vector machines; Transductive systems; Deep learning systems; Bayesian networks; Summary; Chapter Challenge

CHAPTER 12: Graph Analysis
Importance of Graphs; Custom Dataset; Statistics of a Graph; Cycle Detection; Julia the cycle detective; Connected Components; Cliques; Shortest Path in a Graph; Minimum Spanning Trees; Julia the MST botanist; Saving and loading graphs from a file; Graph Analysis and Julia’s Role in it; Summary; Chapter Challenge

CHAPTER 13: Reaching the Next Level
Julia Community; Sites to interact with other Julians; Code repositories; Videos; News; Practice What You’ve Learned; Some features to get you started; Some thoughts on this project; Final Thoughts about Your Experience with Julia in Data Science; Refining your Julia programming skills; Contributing to the Julia project; Future of Julia in data science

APPENDIX A: Downloading and Installing Julia and IJulia

APPENDIX B: Useful Websites Related to Julia

APPENDIX C: Packages Used in This Book

APPENDIX D: Bridging Julia with Other Platforms
Bridging Julia with R; Running a Julia script in R; Running an R script in Julia; Bridging Julia with Python; Running a Julia script in Python; Running a Python script in Julia

APPENDIX E: Parallelization in Julia

APPENDIX F: Answers to Chapter Challenges
Chapter 2; Chapter 3; Chapter 4; Chapter 5; Chapter 6; Chapter 7; Chapter 8; Chapter 9; Chapter 10; Chapter 11; Chapter 12; Chapter 13

Index