Being a data scientist entails using certain software, some of which we discussed in previous chapters of this book. This software covers the basic technical know-how that you need in order to apply for a data scientist position. The actual position may go beyond the initial job description as is often the case in IT jobs. That’s good, in a way, because it provides opportunities for learning new things, which is an integral part of being in the fascinating field of data science.
In this chapter, we will explore the types of software that are commonly used in a data science setting. Not all of these programs will be used in the data scientist position you will get, but being aware of them may help you understand your options better. In particular, we will examine the Hadoop suite and a few of its most promising alternatives (such as Spark, Storm, etc.), the various object-oriented programming languages that come into play (Java, C++, C#, Ruby and Python), the data analysis software that is available (R, Matlab, SPSS, SAS or Stata), the visualization programs that you may have installed and the integrated big data systems (e.g., IBM’s BigInsights, Cloudera, etc.) that may be available for you to use. We’ll also see other programs that you may encounter, such as GIT, Excel, Eclipse, Emcien and Oracle. Note that this list of software will give you an idea of what to expect, although it may not reflect the actual programs you will be using; some companies may require specialized software for their industry, which you will probably be asked to get acquainted with as soon as you are hired. Familiarity with most of the software in this list should make that a relatively easy and straightforward task for you.
8.1 Hadoop Suite and Friends
Hadoop has become synonymous with big data software over the past few years; it is the backbone of a data scientist’s arsenal. It is important to know that Hadoop is not just a program, but more like a suite of tools (similar to MS Office). This suite is designed to handle, store and process big data. It also includes a scheduler (Oozie) and a metadata and table management framework (HCatalog). All data processing jobs in Hadoop are distributed over the computer cluster on which you have Hadoop installed. These jobs can be object-oriented programming (OOP) code, data analysis programs, data visualization scripts, or anything else that has a finite process time and is useful for the data analysis task. Hadoop makes sure that whatever you want to do with your data is done efficiently and is monitored in a straightforward way.
Hadoop does not have a particularly user-friendly software environment, as you can see in Fig. 8.1 where a screenshot of a typical Hadoop job is shown.
Fig. 8.1 Screenshot of a Task Dashboard in Hadoop.
The Hadoop suite comprises the following core modules, all of which are important:
- Hadoop Common: the shared libraries and utilities that support the other modules
- HDFS (Hadoop Distributed File System): distributed storage of large datasets across the cluster
- YARN: resource management and job scheduling for the cluster
- MapReduce: the programming model for parallel processing of big data
There are also a few other components of the Hadoop suite that are supplementary to these core ones. The best way to familiarize yourself with them is to download Hadoop and play around with it. If you prefer, you can read a tutorial instead (or, even better, a manual) while trying to solve a benchmark problem.
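To get a feel for the MapReduce model at the heart of Hadoop, here is a minimal word-count sketch in plain Java. It only mimics the map, shuffle and reduce phases in memory; the class and method names are our own, not part of Hadoop’s actual API.

```java
import java.util.*;

// A toy word count that mimics Hadoop's MapReduce phases in plain Java.
// No Hadoop dependencies: this only illustrates the programming model.
public class MiniMapReduce {

    // "Map" phase: emit a (word, 1) pair for every word in every line.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String word : line.toLowerCase().split("\\s+"))
                if (!word.isEmpty())
                    pairs.add(Map.entry(word, 1));
        return pairs;
    }

    // "Shuffle" phase: group the emitted values by key (word).
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        return grouped;
    }

    // "Reduce" phase: sum the grouped counts for each word.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        grouped.forEach((word, ones) ->
            counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("big data is big", "data science uses big data");
        System.out.println(reduce(shuffle(map(lines))));
        // prints {big=3, data=3, is=1, science=1, uses=1}
    }
}
```

In real Hadoop, each phase would run distributed over the cluster, with the framework handling the shuffle between machines; the value of the model is that your code only ever deals with one pair or one group at a time.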
Hadoop is not your only option when it comes to big data technology. An interesting alternative that is not as well known as it should be is Storm (used by Twitter, Alibaba, Groupon and several other companies). Storm is significantly faster than Hadoop, is also open source and is generally easy to use, making it a worthy alternative. Unlike Hadoop, Storm doesn’t run MapReduce jobs; it runs topologies instead. The key difference is that a MapReduce job ends eventually, while a topology runs forever or until it is killed by the user. (You can think of a topology as a background process that runs for as long as your OS does.) A topology can be visualized as a graph of computation that processes data streams. The sources of these data streams are called “spouts” (symbolized as taps), and they are linked to “bolts” (symbolized by lightning bolts). A bolt consumes any number of input streams, does some processing and potentially emits new streams. You can see an example of a Storm topology in Fig. 8.2.
Fig. 8.2 Example of a Topology in the Storm Software, a worthwhile Hadoop alternative. Creating a topology like this one is somewhat easier and more intuitive than a MapReduce sequence.
A topological approach to data processing guarantees the right results even in the case of failure (since topologies run continuously): if one of the computers in the cluster breaks down, this will not compromise the integrity of the job undertaken by the cluster. It should be noted that Storm topologies are programs usually written in Java, Ruby, Python or Fancy. The Storm software itself is written in Java and Clojure (a functional language that works well with Java), and its source code is one of the most popular open-source projects in this area of technology.
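To make the spout/bolt metaphor a bit more concrete, here is a toy sketch in plain Java that imitates a two-bolt word-counting topology. It uses none of Storm’s actual API (no TopologyBuilder, no storm-core classes); everything here is a simplified stand-in, and the stream is finite rather than endless.

```java
import java.util.*;
import java.util.function.Consumer;

// A toy imitation of Storm's spout/bolt pipeline in plain Java.
// Real Storm code uses spouts and bolts from the storm-core library;
// everything here is a simplified stand-in for illustration only.
public class ToyTopology {

    // Runs the "topology" over a finite stream of sentences and returns
    // the running word counts. In Storm, the spout would keep feeding
    // tuples forever instead of stopping at the end of a list.
    static Map<String, Integer> run(List<String> sentences) {
        Map<String, Integer> counts = new TreeMap<>();

        // "Count bolt": consumes single words, keeps running totals.
        Consumer<String> countBolt = word -> counts.merge(word, 1, Integer::sum);

        // "Split bolt": consumes sentences, emits one word at a time downstream.
        Consumer<String> splitBolt = sentence -> {
            for (String w : sentence.toLowerCase().split("\\s+"))
                countBolt.accept(w);
        };

        sentences.forEach(splitBolt);  // the "spout" feeding the first bolt
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("Storm processes streams", "Storm never stops")));
        // prints {never=1, processes=1, storm=2, streams=1, stops=1}
    }
}
```

The wiring (spout into split bolt into count bolt) is exactly the kind of graph a topology diagram like Fig. 8.2 depicts.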
The advantages of this software are its ability to process data in real time; its simple API; the fact that it’s scalable, fault tolerant, easy to deploy and use, free and open source and able to guarantee data processing; and that it can be used with a variety of programming languages. It also has a growing user community, spanning the West and East Coasts of the USA as well as London and several other places.
Although Storm is a very popular and promising Hadoop alternative, providing flexibility and ease of use, there are other players boasting of similar qualities that also challenge Hadoop’s dominance in the big data world. The most worthwhile ones (at the time of this writing) are:
Parallel to all these systems, there are several projects that can facilitate the work undertaken by Hadoop, working in a complementary way, so if you are going to learn Hadoop, you may want to check them out once you’ve got all the basics down. The most well-known of these projects are the following:
8.2 OOP Language
A data scientist needs to be able to handle an object-oriented programming (OOP) language and handle it well. Comparing the various OOP languages is beyond the scope of this book, so for the sake of example, Java will be discussed in this subchapter, as it is well-known in the industry. Like most OOP languages, Java doesn’t come with a graphical user interface (GUI), which is why many people prefer Python (which ships with a decent GUI from its developers). However, Java is very fast and elegant, and there is abundant educational material both online and offline. A typical Java program can be seen in Fig. 8.3.
Fig. 8.3 A Typical Java program for determining if a year is a leap year. The program is viewed in an editor that recognizes Java code.
Note that the highlighting of certain words and lines is done by the editor automatically (though this is not always the case, e.g., when using Notepad). Also, spacing is pretty much optional and is there to make the script easier to read. Note that most programs tend to be lengthier and more complicated than this simple example, yet they can usually be broken down into simple components like the one shown here.
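For reference, a leap year check like the one in Fig. 8.3 could look as follows (this is our own version of such a program, not necessarily the exact listing shown in the figure):

```java
// A simple leap year check, similar in spirit to the program in Fig. 8.3
// (our own version, not the exact listing from the figure).
public class LeapYear {

    // Gregorian rule: divisible by 4, except century years,
    // unless the century is also divisible by 400.
    static boolean isLeap(int year) {
        return (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
    }

    public static void main(String[] args) {
        int year = 2024;
        System.out.println(year + (isLeap(year) ? " is" : " is not") + " a leap year");
        // prints "2024 is a leap year"
    }
}
```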
Programming can be soul-crushing if you need to allocate a lot of your time to writing the scripts (usually in a text editor like Notepad++ or TextPad). To alleviate this, several integrated development environments (usually referred to as IDEs) have been developed over the years. These IDEs provide an additional layer on top of the programming language, integrating its engine, compiler and other components in a more user-friendly environment with a decent GUI. One such IDE, particularly popular among Java developers, is Eclipse (see Fig. 8.4), which also accommodates several other programming languages and even data analysis packages like R.
Fig. 8.4 Screenshot of Eclipse Running Java. Eclipse is an excellent Java IDE (suitable for other programming languages as well).
Other OOP languages you may want to consider are:
- C++
- Python
- Ruby
- C#
All of these are free and easy to learn via free tutorials (the IDE of the last one, Visual Studio, is proprietary software, however). Also, they all share some similarities, so if you are familiar with the basic OOP concepts, such as encapsulation, inheritance and polymorphism, you should be able to handle any one of them. Note that all of these programming languages are of the imperative paradigm (in contrast with the declarative/functional paradigm that is gradually becoming more popular). The statements that are used in this type of programming are basically commands to the computer for actions that it needs to take. Declarative/functional programming, on the other hand, focuses more on the end result without giving details about the actions that need to be taken.
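The three OOP concepts mentioned above can be sketched in a few lines of Java; the Shape, Circle and Square classes below are invented purely for illustration.

```java
// A minimal illustration of encapsulation, inheritance and polymorphism,
// using made-up example classes (not from any particular library).
public class OopDemo {

    // Encapsulation: the field is private and accessed only via a method.
    static abstract class Shape {
        private final String name;
        Shape(String name) { this.name = name; }
        String name() { return name; }
        abstract double area();  // to be specialized by subclasses
    }

    // Inheritance: Circle and Square both extend Shape.
    static class Circle extends Shape {
        private final double r;
        Circle(double r) { super("circle"); this.r = r; }
        @Override double area() { return Math.PI * r * r; }
    }

    static class Square extends Shape {
        private final double side;
        Square(double side) { super("square"); this.side = side; }
        @Override double area() { return side * side; }
    }

    public static void main(String[] args) {
        // Polymorphism: the same call, s.area(), dispatches to whichever
        // subclass implementation the object actually has at runtime.
        Shape[] shapes = { new Circle(1.0), new Square(2.0) };
        for (Shape s : shapes)
            System.out.printf("%s area = %.2f%n", s.name(), s.area());
    }
}
```

All of the languages listed above express these same three ideas, just with different syntax, which is why competence in one transfers fairly readily to the others.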
Although at the time of this writing, OOP languages are the norm when it comes to professional programming, there is currently a trend towards functional languages (e.g., Haskell, Clojure, ML, Scala, Erlang, OCaml, Clean, etc.). These languages have a completely different philosophy and are focused on the evaluation of functional expressions rather than the use of variables or the execution of commands in achieving their tasks.
The big plus of functional languages is that they are easily scalable (which is great when it comes to big data) and considerably less error-prone, since they don’t use a global workspace. Still, they are somewhat slower than their OOP counterparts for most data science applications, although some of them (e.g., OCaml and Clean) can be as fast as C when it comes to numeric computations. If this trend continues in the years to come, you may want to look into adding one of these languages to your skill set as well, just to be safe. Note that there can be an overlap between functional languages and traditional OOP languages such as those described previously. For example, Scala is a functional OOP language, one that’s probably worth looking into.
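You can feel the contrast between the two paradigms even within Java itself, whose streams API (introduced in Java 8) supports a more declarative style. The following sketch (our own example) computes the same result both ways:

```java
import java.util.List;

// Contrasting the imperative and functional styles within Java:
// both methods sum the squares of the even numbers in a list.
public class StyleContrast {

    // Imperative: explicit commands mutating a variable step by step.
    static int imperative(List<Integer> xs) {
        int total = 0;
        for (int x : xs)
            if (x % 2 == 0)
                total += x * x;
        return total;
    }

    // Functional (Java 8 streams): describe the result as a chain of
    // expressions, with no mutable state in our own code.
    static int functional(List<Integer> xs) {
        return xs.stream()
                 .filter(x -> x % 2 == 0)
                 .mapToInt(x -> x * x)
                 .sum();
    }

    public static void main(String[] args) {
        List<Integer> xs = List.of(1, 2, 3, 4, 5, 6);
        System.out.println(imperative(xs) + " == " + functional(xs));  // 56 == 56
    }
}
```

The functional version reads as a description of *what* the result is rather than *how* to compute it, which is exactly the shift in mindset that languages like Haskell or Clojure take to its conclusion.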
8.3 Data Analysis Software
What good would all the programming be for a data scientist if there was nothing to complement it and give meaning to it? That’s where data analysis software comes in. There are several options, the most powerful of which are Matlab and R. Though tempting, there will be no comparison between them, as it is usually a matter of preference. Interestingly, they are so similar in their syntax and function that it shouldn’t take you more than two to three weeks to learn one if you know the other at a satisfactory level.
As R is somewhat more popular, mainly because it is open source and has a huge community of users who contribute to it regularly, we will focus on it in this book. For those who are more inclined towards Matlab and are familiar with its advantages over R and other data analysis tools, keep an open mind. R also has an edge over other data analysis alternatives in that it is straightforward to write and run programs in, often without the need to include loops (a programming structure that generally slows down analysis done in a high-level language). Instead, it makes use of vector operations, which also extend to matrices. This characteristic is known as vectorization, and it matters mainly for data analysis scripts (OOP languages are inherently fast, so loops are not an issue for them).
The R programming environment is very basic (similar to Python, in a way) but still user-friendly enough, especially for small programs. The screenshot in Fig. 8.5 gives you an idea of what the environment is like.
Fig. 8.5 The R Environment (vanilla flavor). As can be seen here, although the programming environment is quite user-friendly, it lacks many useful accessories like an IDE.
R is great as a data analysis tool and its GUI is quite well made. However, if you are serious about using this tool, you’ll need to invest some time in learning and customizing an IDE for it. There are several of them available (most of which are free), but the one that stands out is RStudio (see Fig. 8.6 for a screenshot).
Fig. 8.6 One of the many R IDEs, RStudio. You can see here that in addition to the console (bottom-left window), it also has a script editor (top-left), a workspace viewer (top-right) and a plot viewer (bottom-right) among several other useful features that facilitate writing and running R programs.
Other alternatives to R for data analysis applications are:
- Matlab
- SPSS
- SAS
- Stata
Note that all of these are proprietary software, so they may never be as popular as R or attract as large user communities. If you are familiar with statistics and understand programming, they shouldn’t be very difficult for you to learn; with Matlab, you don’t even need to be familiar with statistics in order to use it. We will revisit R in subchapter 10.5, where we will examine how this software is used in a machine learning framework.
8.4 Visualization Software
The importance of visualizing the results of a data analysis is hard to overstate. That is why a number of dedicated visualization programs are available to round out your software arsenal. Although all data analysis programs provide some decent visualization tools, it often helps to have a more specialized alternative such as Tableau, which can make the whole process much more intuitive and efficient (see Fig. 8.7 for a screenshot of this software to get an idea of its usability and GUI).
Tableau is, unfortunately, proprietary software and is somewhat costly. However, it allows for fast data visualization, blending and exporting of plots. It is very user-friendly, easy to learn, has abundant material on the web, is fairly small in size (<100 MB) and its developers are very active in educating users via tutorials and workshops. It runs on Windows (any version from XP onwards) and has a two-week trial period. Interestingly, it is part of the syllabus of the “Introduction to Data Science” course of the University of Washington.
Fig. 8.7 Screenshot of Tableau, an excellent visualization program. As you can see, it’s quite intuitive and offers a variety of features.
In the industry, Tableau appears to have a leading role compared to other data visualization programs. Though more suitable for business intelligence applications, it can be used for all kinds of data visualization tasks, and it allows easy sharing of the visualizations it produces via email or online. It also offers interactive mapping and can handle data from different sources simultaneously.
If you are interested in alternatives to this software, you can familiarize yourself with one or more of the following programs:
Generally, data visualization programs are relatively easy to learn, so this is not an issue when adding them to your software arsenal. Before dedicating a lot of time to mastering any one of them, make sure that it integrates well with the other programs you plan to use. Also, take a look at which visualization programs appear in the job ads alongside the other programs in which you are interested.
8.5 Integrated Big Data Systems
Although not essential, it is good to be familiar with at least one integrated big data system. One such system, which is quite good despite the fact that it is still in its initial versions, is IBM’s BigInsights platform. The idea is to encapsulate most of the functions of Hadoop into a user-friendly package that has a decent GUI as well. As a bonus, it can also do some data visualization and scheduling, things that are useful to have in an all-in-one suite so that you can focus on other aspects of data science work. BigInsights runs on a cluster/server and is accessible via a web browser. A screenshot of the BigInsights platform can be seen in Fig. 8.8.
Fig. 8.8 IBM’s BigInsights platform running in the Mozilla Firefox browser. As you can see, it has a very good GUI and is quite user-friendly.
The big advantage of an integrated big data system is its GUI, which when combined with good documentation makes the whole system user-friendly, straightforward and relatively easy to learn. Also, as the GUI takes care of all the Hadoop operations, it allows you to focus on more high-level aspects of the data science process, freeing you from much of the low-level programming that’s needed.
An alternative to BigInsights is Cloudera, which is well known in the industry and more robust. Other worthy alternatives include KNIME, Alpine Data Labs’ suite, the Pivotal suite, etc. It is quite likely that by the time you read these lines there will be other integrated big data systems available, so be sure to become familiar with what they are and what they offer.
8.6 Other Programs
The above list of programs would be incomplete if some auxiliary ones were not included. These programs may vary from company to company, but they are generally a good place to start when it comes to refining your software arsenal. For example, the GIT version control program is one that definitely deserves your attention since you are quite likely to need one such program, especially if you are going to work on a large project along with other people (usually programmers). You can see a screenshot of its interface and its most commonly used commands in Fig. 8.9.
Fig. 8.9 The GIT version control program. Not the most intuitive program available, but very rich in terms of functionality and quite efficient in its job.
Note that there are several GUI add-ons for GIT available for all major operating systems. One that is particularly good for the Windows OS is GIT Extensions (open source), although there are several GUIs for other OSs as well. This particular GUI add-on makes the use of GIT much more intuitive while preserving the option of using its command prompt (something that’s not always the case with GIT GUIs).
It would be sacrilege to omit the Oracle SQL Developer software since it is frequently used for accessing the structured data of a company whose DBMS is Oracle. Although this particular software is probably going to be less essential in the years to come due to big data technology spreading rapidly, it is still something useful to know when dealing with data science tasks. You can see a screenshot of this program in Fig. 8.10.
Fig. 8.10 The Oracle SQL Developer database software, a great program for working with structured data in company databases and data warehouses.
The key part of this software is SQL, so in order to use it to its full potential, you need to be familiar with this query language. As we saw in an earlier chapter, this is a useful language to know as a data scientist even if you don’t have to use it that much. This is because there are several variants of it that are often used in big data database programs.
Some other useful programs to be familiar with when in a data science position are:
8.7 Key Points