
A year and a half ago, I had no clear idea what a data scientist was or why the role was important. Immersed in a dead-end job in an e-marketing company, I had started to forget all of the stuff I had learned through the many difficult years of my education. I am not sure what triggered my resolve to look into the matter more (at that time there were no decent books on the topic, and I had no one to mentor me), but I do remember coming to the realization that this was my life’s vocation. Naturally, there were problems with this new type of work – lots of things I hadn’t learned and no idea of how to learn them, especially if you factor in my 50-hour-per-week schedule and the fact that there wasn’t a decent data science course anywhere in the country in which I was living. But I did power through, my resolve fueled by the conviction that this was something worthwhile and enjoyable. And if I happened to fail in my pursuit, at least I would have picked up some useful skills in the process.

This book is for people who have the same desire to learn about this fascinating field. When I started my quest into the data science world, I had to learn the hard way, through trial and error, as well as through hard research via articles, videos and other sources on the Web. Fortunately, it will be much easier for you. That’s why I wrote this book: so that you have a manual, of sorts, to provide you with guidelines for this challenging transition.

Data science is a very rewarding field that deals with a fascinating new entity in the data world: big data, something that constitutes a quite intriguing challenge since there is no straightforward way of dealing with it effectively. This leaves a lot of room for creativity and a wider array of possibilities that you are called to explore as a data scientist. In addition, through this role you have the opportunity to develop aspects of yourself that no other role in the IT field provides: namely creativity, communication, direct links with the business world, etc. Through all this you have a chance of providing something useful to the organization you work for (which can be a company, government agency, or even a charity) through the intelligent use of the data that is available. Since this data is bound to be large, diverse, and quite messy, it is not something you would normally find in a tidy database. Hence the term big data and the role of the data scientist, the professional who deals with big data in a scientific, creative and understandable manner.

Over the past few years, there has been heightened awareness of big data and its implications in business, as well as its impact on the job market. But what is big data exactly? And how is it different from traditional data? The short definition of big data is “data that cannot be handled by a single computer.” Although this is usually due to its very large size, there are a few other reasons. In general, it is defined by four main characteristics, usually referred to as the four Vs of big data:

  • Volume. Compared with “normal” data, big data is significantly larger; i.e., it ranges from a few Terabytes (TB) to a few Zettabytes (ZB). One ZB is a billion TB, or a trillion Gigabytes (GB). That’s a lot of data! In 2010, the data of the whole world was about 1 ZB – that’s 125 billion 8 GB media players! What’s more, this number has been increasing rapidly over the past few years, and there is no sign of it stopping any time soon. This very large amount of data, in combination with the fact that big data cannot be processed efficiently using a single machine (even a supercomputer), has brought about the use of parallel computing (a cluster of computers working together via a network connection), something that is inherent in the vast majority of data science projects.
  • Variety. Big data is also quite varied, coming from non-traditional as well as traditional sources. The data we are used to processing is structured data, the kind of data usually found in databases. We know what its data type and size are, and we generally know what’s supposed to be in each field. Big data, however, includes unstructured and semi-structured data as well. Unstructured data lacks a pre-defined structure in its subcomponents (e.g., data found in Facebook posts, tweets, phone call transcripts, etc.), while semi-structured has some structure and is something in between structured and unstructured (e.g., data in machine logs and email address headers).
  • Velocity. Another important characteristic of big data is velocity, or the rate at which it arrives at the enterprise and is processed. Traditional data is thought to be slower and fairly static in terms of how it is developed and transferred from the location where it is generated to the location where it is processed. Contrast this with big data, which is constantly moving, and moving fast (though there may be some exceptions to this rule). This means that it needs to be processed quickly, in real-time if possible, in order to harness its potential. For example, a financial services company may need to analyze over 5 million market messages every second, with a latency of about 30 microseconds.
  • Veracity. This last one was added relatively recently, so there are still many references to the three Vs of big data in books and articles on the topic. Big data is also characterized by veracity, an attribute that relates to the quality (trustworthiness) of the data. As one would expect, there is a lot of noise in all of this data. Working with big data effectively means being able to discern the noise from the signals that may hide within. This is a challenging process that requires advanced analytical techniques. If one is not careful, it is easy to draw conclusions backed by statistical significance that don’t have any real value, or that may lead to questionable decisions.

There are two more Vs that are sometimes included, Variability and Visibility, but there is no consensus on these characteristics yet.
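The figures quoted for Volume and Velocity can be checked with a few lines of arithmetic. The Python sketch below is only an illustration: it assumes decimal (SI) byte units, since the text does not say whether decimal or binary prefixes are meant, and it reads the velocity example as "each message is handled within about 30 microseconds." The variable names are made up for this example, and the estimate of messages in flight uses Little's law (arrival rate multiplied by time in the system), which is an added assumption rather than something stated in the text.

```python
# Decimal (SI) byte units. This is an assumption; binary units (2**30, ...)
# would give slightly different numbers.
GB = 10**9
TB = 10**12
ZB = 10**21


def units_in(total_bytes, unit_bytes):
    """Return how many whole units of size unit_bytes fit into total_bytes."""
    return total_bytes // unit_bytes


# Volume: how big is one zettabyte?
tb_per_zb = units_in(ZB, TB)            # terabytes in 1 ZB
gb_per_zb = units_in(ZB, GB)            # gigabytes in 1 ZB
players_per_zb = units_in(ZB, 8 * GB)   # 8 GB media players needed to hold 1 ZB

# Velocity: 5 million messages per second, about 30 microseconds each.
messages_per_second = 5_000_000
latency_us = 30

# Average time available per message if they were handled one at a time.
time_budget_ns = 1_000_000_000 // messages_per_second

# Little's law (assumption): messages in flight = arrival rate * latency.
messages_in_flight = messages_per_second * latency_us // 1_000_000

print(f"1 ZB = {tb_per_zb:,} TB = {gb_per_zb:,} GB")
print(f"1 ZB fills {players_per_zb:,} media players of 8 GB each")
print(f"Serial time budget per message: {time_budget_ns} ns")
print(f"Messages in flight at any moment: about {messages_in_flight}")
```

A per-message budget far below the 30-microsecond latency means that many messages have to be handled at the same time, which is one way to see why big data tends to call for parallel processing.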

It doesn’t take much to realize that making effective use of big data is a challenge. Ignoring it is no longer an option in many industries as its information potential is becoming more and more evident and ways to make use of it constantly increase. Think of Amazon and Netflix, for example. Their clever use of big data has given them a competitive advantage and has opened new roads for their industries. If you were in the online shopping business, for example, and you had a large customer base that supplied you with large amounts of data, imagine what you could learn about buying patterns, the demographics of your customers, and the opportunities you could take advantage of by analyzing the data.

Building on this newly acquired knowledge, you could go one step further: namely, design a widget or an app that makes use of the insights you have derived and helps its user to gain similar insights into their experience with the environment of the data (in this case, the online shop). That’s actually one of the reasons Amazon became so successful. It not only offered a large variety of products to its users, but made the whole experience of shopping easier and more enjoyable through the use of interesting features on its site, such as its recommender system. This and many other similar mini-programs that are based on intelligent analysis of big data are usually referred to as data products and constitute the goal of the majority of data scientists. There are data scientists, however, who are not directly involved in the creation of these products and focus on engineering ways of facilitating other data scientists in their work. So the field is quite diverse in the particular tasks data scientists can undertake through the application of their specific skill-sets.

So the question is not whether or not to hop on the big data wagon, but how. This is where the data scientist comes in. The data scientist is a fairly new role in the industry, and since its introduction to the job market, it has grown in popularity. It involves all the different aspects of dealing with data, particularly big data, in an intelligent and very methodical manner, in order to create a useful product (the aforementioned data product). The product is usually a widget or an app that can provide meaningful information the users do not already know (the last part is something that is stressed by John Foreman, a very successful and experienced data scientist). Big data has brought about new paradigms in data processing and data visualization, equipping the data scientist with powerful tools that require a different mindset and a different skill-set to accompany it.

Many people confuse the data scientist with the data analyst. However, they are quite different roles, much like space flight is different from traditional flight. A data analyst uses techniques that may work with data that borders on being big data, but may be inefficient and lack the flexibility of the techniques employed by a data scientist. The former relies on a series of pre-made models to derive useful information from the data and creates reports for a businessperson to view. The latter often develops his own models or uses a completely data-driven approach in his analyses, often resulting in something that many other people can use, not just a businessperson in his company. The data analyst will create intuitive plots in his reports. The data scientist will create an interactive dashboard that will plot all the essential information in real-time.

In other words, data analysis is a very useful tool, but if one is to make use of the data the world is immersed in today, one needs to not only be efficient with data analysis techniques, but also gain a working knowledge of other aspects of data science that will be described in this book. Being a data analyst is great, but it will limit you to certain types of datasets that involve structured data only, and among these you will only be able to deal with the relatively small ones. If you want to take a stab at the larger and more complicated ones, you’ll need to learn the ways of the data scientist.

Being a data scientist is not only about know-how, though; to someone who’s interested, it can also be a very enjoyable and intriguing occupation. The domain of the data scientist is constantly changing as new technologies are developed, making it a very dynamic field. He is at the cutting edge of science and gets to communicate with interesting people, some of whom drive these changes. Data science is an inter-disciplinary field, so the data scientist expands his worldviews by learning to think in a more systemic way, integrating things from various fields. Most importantly, he often gets to be creative in the way he deals with the problems that arise and the ways data can be processed.

Being a data scientist is also a great profession. For example, given that it is a new role that can provide a strategic advantage to an organization (and there aren’t many people trained to do the role properly), the data scientist can be very well paid, usually more than other IT professionals with the same years of experience. In addition, a data scientist has the opportunity to develop a wide variety of skills, making him a very versatile and adaptable professional who may have the opportunity to communicate with all kinds of people in the industry and the scientific world and work in different industry sectors. This is particularly useful in times of financial turmoil, when job-hunting becomes challenging for specialized professionals.

This book consists of eighteen chapters, covering the basic aspects of the transition to the data science world. In the first few chapters you will learn more about what the field entails (what data science and big data are; why data science is very important, especially nowadays; and the different types of data scientists). Afterwards, you will have a chance to learn about what it takes to be a data scientist (the data scientist’s mindset, his technical qualifications, the experience that is required for this role, and a few things about networking). Next, you will have an opportunity to learn about the everyday life of a data scientist (what software he uses, the importance of learning new things in this line of work, the kind of problems he encounters, and the main stages of the data science process). In the chapter that follows, you will be presented with the various migration paths from existing roles (what to do and learn if you are a programmer/software developer, if you are a statistician or machine learning practitioner, if you are a data-related professional, or if you are a student). Afterwards, you will be given some practical and down-to-earth advice on what you need to do to land your first data science job (where to look, how to present yourself as a would-be data scientist, and what you need to consider if you wish to follow the freelance track). Finally, you will have a chance to read about some real-world data scientists, their experiences and their views on the matter, as well as some real job posting examples for data scientist positions. At the end of the book, there is a glossary of the most important terms that have been introduced, as well as three appendices – a list of useful sites, some relevant articles on the Web, and a list of offline resources for further reading. There is also a comprehensive index at the end of this text.

Throughout the book, the Kea bird is used to represent the data scientist. The Kea is known for its intelligence, innovative attitude, and curiosity, and it is one of the rarest species of its category. These attributes are the distinguishing features of the Kea and are shared by the data science professional.

I sincerely hope that this book is useful and, perhaps, even enjoyable for you. The transition itself is quite demanding (especially if you are at the beginning of your professional life), but it is an intriguing and rewarding experience. And when you eventually become a data scientist, the field continues to be just as interesting. Although it is not a role for the faint-hearted, being a data scientist is a wonderful experience on many levels and can be a fascinating journey. Are you ready to embark on it?

Dr. Zacharias Voulgaris

