Chapter 1
What is Data Science?

Data Science vs. Business Intelligence vs. Statistics

Nowadays, a growing number of people talk about data science and its various merits, yet many have a hard time distinguishing it from business intelligence and statistics. What's worse, some people who are adept at these other fields market themselves as data scientists, since they fail to see the difference and expect hiring managers to be equally ignorant on the matter. Despite the similarities among these three fields, however, data science is quite different in terms of the processes involved, the domain, and the skills required. Let's take a closer look at each of these three fields.

Data Science

Data science can be seen as the interdisciplinary field that deals with the creation of insights or data products from a given set of data files (usually in unstructured form), using analytics methodologies. The data it handles is often what is commonly known as "big data," although data science is also applied to conventional data streams, such as those usually found in a business's databases, spreadsheets, and text documents. We'll take a closer look at big data in the next section.

Data science is not a guaranteed tool for finding answers to the questions we have about the data, though it does a good job of shedding light on what we are investigating. For example, we may be interested in answering the question, "How can we predict customer attrition based on the demographics data we have?" This may not be possible with that data alone. However, investigating the data may help us come up with other questions, like "Can demographics data supplement an attrition prediction system based on the orders customers have made?" Also, data science is only as good as the data we have, so it doesn't make sense to expect breathtaking insights if the data is of low quality.

Business Intelligence

As for business intelligence, although it too deals with business data (almost exclusively), it does so through rudimentary data analysis methods (mainly statistics), data visualization, and other techniques, such as reports and presentations, with a focus on business applications. It also handles mainly conventionally sized, almost always structured data, with little to no need for in-depth data analytics. Moreover, business intelligence is primarily concerned with getting useful information from the data and doesn't involve the creation of data products (unless you count fancy plots as data products).

Business intelligence is not a kind of data science, nor is it a scientific field. Business intelligence is essential in many organizations, but if you are after hard-to-find insights or have challenging data streams in your company's servers, it will not suffice. Nevertheless, business intelligence is not completely unrelated to data science either: given some training and a lot of practice, a business intelligence analyst can evolve into a data scientist.

Statistics

Statistics is a field that is similar to data science and business intelligence, but it has its own domain. Namely, it involves doing basic manipulations on a set of data (usually tidy and easy to work with) and applying a set of tests and models to that data. It's like a conventional vehicle that you drive on city roads: it does a decent job, but you wouldn't want to take it onto country roads or off-road. For that kind of terrain you need something more robust and better equipped for messy data: data science. If your data comes straight from a database, it's fairly clean, and all you want to do is create a simple regression model or check whether February sales are significantly different from January sales, statistics will do the job. That's why statisticians remain in business, even if most of the methods they use are not as effective as the techniques a data scientist employs.
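As a concrete illustration of the kind of question statistics handles well, here is a minimal sketch of the January-versus-February comparison mentioned above. It assumes SciPy is available, and the sales figures are made up for illustration.

    from scipy import stats

    # Hypothetical daily sales figures for two months.
    january = [120, 135, 118, 142, 130, 125, 138]
    february = [128, 122, 119, 115, 121, 117, 124]

    # Two-sample t-test: are February sales significantly
    # different from January sales?
    t_stat, p_value = stats.ttest_ind(january, february)
    print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
    if p_value < 0.05:
        print("The difference is statistically significant.")
    else:
        print("No significant difference detected.")

Note that the t-test leans on assumptions about the data (e.g. roughly normal distributions), which is exactly the kind of caveat discussed below.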

Scientists make use of statistics, though it is not formally a scientific field. This is an important point. In fact, even mathematicians look down on the field of statistics, for the simple reason that it fails to create robust theories that can be generalized to other aspects of Mathematics. So, even though statistical techniques are employed in various areas, they are inherently inferior to most principles of Mathematics and of Science. Also, statistics is not a fool-proof framework when it comes to drawing inferences about the data. Despite the confidence metrics it provides, its results are only as good as the assumptions it makes about the distribution of each variable, and how well these assumptions hold. This is why many scientists also employ simulation methods to ensure that the conclusions their statistical models come up with are indeed viable and robust enough to be used in the real world.
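One common family of such simulation methods is the bootstrap: resampling the observed data with replacement many times to see how stable an estimate is, without leaning on distributional assumptions. Here is a minimal sketch using only Python's standard library; the figures are made up.

    import random

    data = [120, 135, 118, 142, 130, 125, 138, 128, 122, 119]
    n_resamples = 10_000

    # Resample with replacement and recompute the mean each time,
    # to estimate its variability without assuming a distribution.
    means = sorted(
        sum(random.choices(data, k=len(data))) / len(data)
        for _ in range(n_resamples)
    )

    # A rough 95% confidence interval from the resampled means.
    low = means[int(0.025 * n_resamples)]
    high = means[int(0.975 * n_resamples)]
    print(f"95% bootstrap CI for the mean: [{low:.1f}, {high:.1f}]")

If the simulated interval agrees with what a statistical model claims, that is a good sign the model's assumptions hold.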

Big Data, Machine Learning, and AI

Big Data

Big data can mean a wide variety of things, depending on whom you ask. For a data architect, for example, big data may be the data that calls for a particular kind of database, while for the business person, it is a valuable resource that can have a positive effect on the bottom line. For the data scientist, big data is our prima materia, the stuff we work with through various methods to extract useful and actionable information. Or, as the Merriam-Webster dictionary defines it, "an accumulation of data that is too large and complex for processing by traditional database management tools." Whatever the case, most people are bound to agree that it's a big deal, since it promises to solve many business problems, not to mention even larger issues (e.g. climate change or the search for extra-terrestrial life). It is not clear how big data came to gain so much traction so quickly, but one thing is certain: those who knew about it and knew how to harness the information it held could make a difference wherever they were. The people who were the first to systematically study big data and define it as a kind of resource were:

  • directly responsible for the development of data science as an independent field
  • adept at the data-related problems organizations faced (and are still facing to some extent)
  • knowledgeable about data in general

This may sound obvious, but remember that back in the early 2000s, it was only data architects, experienced software developers, and database administrators who were adept at the ins and outs of data. So it was rare for an analyst to know about this new beast called "big data." Whatever the case, the data analytics professionals who got a grip on big data first came to pinpoint its main characteristics, which distinguish it from other, more traditional kinds of data: the so-called 4 V's of big data:

  • Volume – Big data starts at many terabytes (TB) and goes well beyond. In fact, a good rule of thumb for this characteristic is that if there is so much data that it can't be handled by a single computer, then it's probably big data. That's why big data is usually stored in computer clusters and cloud systems, like Amazon's S3 and Microsoft's Azure, where the total amount of data you can store is virtually limitless (although there are limitations on the sizes of individual uploads, as described on the corresponding webpages, e.g. https://aws.amazon.com/s3/faqs). Naturally, even if the technology to store this data is available, having data at this volume makes analyzing it a challenging task.
  • Velocity – Big data also travels fast, which is why we often refer to the data we work with as data streams. Naturally, data moving at high speeds poses a completely different set of challenges, which is one of the reasons why big data isn't easy to work with (e.g. fast-changing data makes training certain models infeasible, while the data becomes stale quickly, making constant retraining of the models necessary). Although not all big data is this way, it is often the case that among the data streams available in an organization, a few have this attribute.
  • Variety – Big data is rarely uniform, as it tends to be an aggregate of various data streams that stem from completely different sources. Some of the data is dynamic (e.g. stock prices over time), while other data is fairly static (e.g. the area of a country). Some of the data may come from a database, while the rest may be derived from a social media platform's API. Putting all of that data together into a format that a data analytics model can use can be challenging (see the sketch after this list).
  • Veracity – Big data is also plagued with the issue of veracity, meaning the reliability of a data stream. This is due to the inherent uncertainty in the measurements involved or the unreliability of the sources (e.g. when conducting a poll for a sensitive topic). Whatever the case, more is not necessarily better, and since the world’s data tends to have its issues, handling more of it only increases the chances of it being of questionable veracity, resulting in unreliable or inaccurate predictive models.
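To make the variety challenge more tangible, here is a minimal sketch of combining a dynamic stream with a static one using pandas; all the column names and figures are hypothetical.

    import pandas as pd

    # A dynamic stream: daily stock prices per country (hypothetical).
    prices = pd.DataFrame({
        "country": ["US", "US", "DE"],
        "date": pd.to_datetime(["2017-01-02", "2017-01-03", "2017-01-02"]),
        "price": [101.2, 102.5, 88.1],
    })

    # A static stream: country areas in square km (hypothetical).
    areas = pd.DataFrame({
        "country": ["US", "DE"],
        "area_km2": [9833520, 357022],
    })

    # Join the two streams on a shared key so that a single
    # model can make use of both.
    combined = prices.merge(areas, on="country", how="left")
    print(combined)

Real big data work involves many more streams and far messier keys, but the principle is the same.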

Some people talk about additional characteristics (also starting with the letter V, such as variability) to show just how distinctive a kind of data big data is. Also, even though value is not considered a discernible characteristic of big data specifically, it is important here, just as it is with most other kinds of data. However, value in big data usually becomes apparent only after the data is processed through data science.

All of this is not set in stone, since just like data science, big data is an evolving field. IBM has created a great infographic on all this, which can be a good place to dive into this topic further: https://ibm.co/18nYiuo. Also, if you find books and articles stating that there are only three V’s of big data, chances are they are outdated, bringing home the point that veracity goes beyond just big data, as it applies to data science books too!

Machine Learning

One of the best ways to work with big data is through a set of advanced analytics methods commonly referred to as Machine Learning (ML). Machine learning is not derived from statistics. In fact, many ML methods take a completely different approach from statistical methods, as they are more data-driven, while statistical methods are generally model-driven. Machine learning methods also tend to be far more scalable, requiring fewer assumptions about the data at hand. This is extremely important when dealing with messy data, the kind of data that is often the norm in data science problems. Even though statistical methods could also work on many of these problems, the results they would yield may not be as crisp and reliable as necessary.

Machine learning is not entirely divorced from the field of statistics, though. Some ML methods are related to statistical ones, or may use statistical methods on the back-end, as in the case of many regression algorithms, in order to build something with a mathematical foundation that is proven to work effectively. Also, many data science practitioners use both machine learning and statistics, and sometimes combine the results to attain even better accuracy in their predictions. Keep that in mind when tackling a challenging problem: you don't necessarily have to choose one method or the other. You do need to know the difference between the two frameworks in order to decide how to use each of them with discernment.
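As an illustration of combining the two frameworks, here is a minimal sketch that averages the predictions of a model-driven method (linear regression) and a data-driven one (a random forest). It assumes scikit-learn is available, and the data is synthetic.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression

    # Synthetic data: a linear signal plus noise.
    rng = np.random.default_rng(42)
    X = rng.uniform(0, 10, size=(200, 3))
    y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(0, 1, 200)
    X_train, X_test = X[:150], X[150:]
    y_train = y[:150]

    # A model-driven method and a data-driven one, trained on the same data.
    lin = LinearRegression().fit(X_train, y_train)
    forest = RandomForestRegressor(n_estimators=100).fit(X_train, y_train)

    # Averaging the two predictions is one simple way to combine them.
    blended = (lin.predict(X_test) + forest.predict(X_test)) / 2
    print(blended[:5])

Averaging is only the simplest blending strategy; weighted combinations and stacking are common refinements.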

Machine learning is a vast field, and since it has gained popularity in the data analytics community, it has spawned a large variety of methods as well as heuristics. However, you don’t need to be an expert in the latest and greatest of ML in order to use this framework. Knowing enough background information can help you develop the intuition required to make good choices about the ML methods to use for a given problem and about the best way to combine the results of some of these methods.

AI – The Scientific Field, Not the Sci-fi Movie!

Machine learning has gained even more popularity due to its long-standing relationship with Artificial Intelligence (AI), a field of science that has to do with developing algorithms that emulate sentient beings in their information processing and decision making. A sub-field of computer science, AI is a discipline dedicated to making machines smart so they can be of greater use to us. This includes making them more adept at handling data and using it to make accurate predictions.

Even though a large part of AI research is focused on how to make robots interact with their environment in a sentient way (and without staging a worldwide coup in the process!), AI is also closely linked to data science. In fact, most data scientists rely on it so much that they have a hard time distinguishing it from the other frameworks used in data science. When it comes to tackling data analytics problems using AI, we usually make use of artificial neural networks (ANNs), particularly large ones. Since the term "large-scale artificial neural networks" sounds neither appealing nor comprehensible, the term "deep learning" was coined to describe exactly that. There are several other AI methods that also apply to data science, but this is by far the most popular one; it's versatile and can tackle a variety of data science problems that go beyond predictive analytics (which has traditionally been the key application of ANNs).
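To give a sense of what an ANN looks like in practice, here is a toy sketch using scikit-learn's MLPClassifier on a small bundled dataset of handwritten digits. Real deep learning involves far larger networks and dedicated libraries, so treat this only as a shape-of-the-workflow example.

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0
    )

    # A single hidden layer of 64 units; "deep" networks stack many
    # such layers (and use specialized architectures).
    ann = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
    ann.fit(X_train, y_train)
    print(f"Test accuracy: {ann.score(X_test, y_test):.2f}")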

The most popular alternative AI techniques that apply to data science are those related to fuzzy logic, which has found a number of applications over the years in all kinds of machines with limited computational power (for an overview of this framework, check out MathWorks' webpage on the topic at http://bit.ly/2sBVQ3M). However, even though such methods have been applied to data science problems, they are limited in how they handle data and don't scale as well as ANNs. That's why fuzzy logic techniques are rarely referred to as AI in a data science setting.
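The core idea behind fuzzy logic is graded set membership: instead of a temperature being "warm" or "not warm," it belongs to the "warm" set to some degree between 0 and 1. Here is a minimal sketch with a triangular membership function, a common textbook choice; the temperature figures are made up.

    def triangular(x, a, b, c):
        """Degree (0 to 1) to which x belongs to a fuzzy set that
        starts at a, peaks at b, and ends at c."""
        if x <= a or x >= c:
            return 0.0
        if x <= b:
            return (x - a) / (b - a)
        return (c - x) / (c - b)

    # How "warm" is 22 degrees, if "warm" spans 15-30 and peaks at 25?
    print(triangular(22, 15, 25, 30))  # 0.7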

The key benefit of AI in data science is that it is more self-sufficient, relying more on the data than on the person conducting the analysis. The downside is that it can make the whole process of data science superficial and mechanical, not allowing for in-depth analysis of the data. Also, even though AI methods are very good at adapting to the data at hand, they require a very large amount of data, making them impractical in many cases.

The Need for Data Scientists and the Products/Services Provided

Despite the variety of data-processing tools and automated systems available to the world today, there is still a great need for data scientists. There are a number of products and services that we as data scientists offer, even if most of them fall under the umbrella of predictive analytics or data products. Examples are dashboards relaying information about a KPI in real time, recommendation systems providing useful suggestions for books or videos, and insights into what the demand for product X is going to be or whether patient Y is infected with a disease. Also, what we do involves much more than playing around with various models, as is often the case in many Kaggle competitions or textbook problems. So, let's take a closer look at what a data scientist does when working with the given data.

What Does a Data Scientist Actually Do?

A data scientist applies the scientific method to the provided data to come up with scientifically robust conclusions about it, and engineers software that makes use of these findings, adding value for whoever is on the receiving end of this whole process, be it a client, a visitor to a website, or the management team.

There are three major activities within the data science process:

  • Data engineering – This involves a number of closely associated tasks aimed at getting the data ready for use in the stages that follow. It is not a simple process, and it is difficult to automate; that's why around 80% of our time as data scientists is spent in this stage. Luckily, some data is easier to work with than other data, so it's not always that challenging. Also, once you find a way to apply your creativity to data engineering, it can be a rewarding experience. Regardless, it is a necessary stage of data science, as it is responsible for cleaning up the data, formatting it, and picking the most information-rich parts of it to use later on. (All three stages are sketched in code after this list.)
  • Data modeling – This is probably the most interesting part of the data scientist’s work. It involves creating a model or some other system (depending on the application) that takes the data from the previous stage and does something useful with it. This is usually a prediction of sorts, such as “based on the characteristics of data point X, variable Y is going to take the value of 5.2 for that point.” The data modeling phase also involves validating the prediction, as well as repeating the process until a satisfactory model is created. It is then applied to data that hasn’t been used in the development of this model.
  • Information distillation – This aspect of the data scientist's work has to do with delivering the insights acquired in the previous stages, communicating them (usually through informative visuals), or in some cases developing a data product (e.g. an API that takes the values of variables related to a client and delivers how likely this person is to be a scammer). Whatever the case, the data scientist ties up any loose ends, writes the necessary reports, and gets ready for the next iteration of the process. This could be with the same data, sometimes enriched with additional data streams; the next iteration may focus on a somewhat different problem, or on an improved version of the model.
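Here is the promised sketch: a toy run through all three stages, assuming pandas and scikit-learn are available. The column names and figures are hypothetical, and a real project would involve far more work at every step.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    # 1. Data engineering: clean up a raw (hypothetical) customer table.
    raw = pd.DataFrame({
        "age": [34, None, 51, 23, 45, 38],
        "orders": [12, 3, 25, 1, 17, 9],
        "churned": [0, 1, 0, 1, 0, 1],
    })
    clean = raw.fillna({"age": raw["age"].median()})

    # 2. Data modeling: fit a model that predicts customer churn.
    X, y = clean[["age", "orders"]], clean["churned"]
    model = LogisticRegression().fit(X, y)

    # 3. Information distillation: turn the model into an insight
    #    someone can act on.
    new_customer = pd.DataFrame({"age": [30], "orders": [2]})
    prob = model.predict_proba(new_customer)[0, 1]
    print(f"Estimated churn probability: {prob:.0%}")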

Naturally, all of these aspects of the data scientist’s work are highly sophisticated in practice and are heavily dependent on the problem at hand. The general parts of the data science process, however, remain more or less the same and are useful guidelines to have in mind. We’ll go into more detail about all this in the next chapter, where we’ll examine the various steps of the data science pipeline and how they relate to each other.

What Does a Data Scientist Not Do?

Equally important to knowing what a data scientist does is knowing what a data scientist doesn't do, since there is a great deal of misconception about the limits of what data science can offer. One of the most obvious but often neglected things a data scientist cannot do is turn low-veracity data into anything useful, no matter how much of it you give him or how sophisticated a model he employs. A data scientist's work may appear as magic to someone who doesn't understand how data science works, but a data scientist is limited by the data as well as by the computing resources available to him. So, even a skilled data scientist won't be able to do much with poor-quality data or a tiny computer cluster.

Also, a data scientist does not create professional software independently, even if he is able to build an interactive tool that encapsulates the information he has distilled from the data. If you expect him to create the next killer app, you may be disappointed. This is why a data scientist usually works closely with software engineers, who can build an app that looks good and works well while making use of his models on the back-end. In fact, a data scientist tends to collaborate effectively with software developers, since they share a common frame of reference (computer programming).

Moreover, a data scientist does not always create his own tools. He may be able to tweak existing data analytics systems and get the most out of them, but if you expect him to create the next state-of-the-art system, you are in for a big disappointment. However, if he is on a team of data scientists who work well together, he may be able to contribute to such a product substantially. After all, most inventions in data science in today’s world tend to be the result of cumulative efforts and take place in research centers.

The Ever-growing Need for Data Science Professionals

So many of us are willing to undergo the time-consuming process of pushing the data science craft to its limits because there is a real need for data science and the professionals who make it practical. If today's problems could be solved by business intelligence people or statisticians, they would have been already. After all, these kinds of professionals are much more affordable to hire, and it's easier to train an information worker in those disciplines. However, if you want to gain something truly valuable from data that is too elusive for conventional data analytics methods, you need to hire a data scientist, preferably someone with the right kind of mindset, one that includes not just technical aptitude, but also creativity, the ability to communicate effectively, and other soft skills not so common among technical professionals.

The need for data science professionals is also due to the fact that most of the data today is highly unstructured and, in many cases, messy, making it unsuitable for conventional data analytics approaches. Also, the sheer volume of such data being generated has created the need for more powerful, scalable predictive analytics. As data science is the best, if not the only, way to go when it comes to this kind of data analysis, data scientists are an even more valuable resource.

Start-ups tend to appeal to individuals with an entrepreneurial vocation. However, many of them require a lot of capital at the beginning, which is hard to find, even if you are business savvy. Data science start-ups, however, don't cost that much to build, as they rely mainly on a good idea and an intelligent implementation of that idea. As for resources, with tech giants like Amazon, Microsoft, and IBM offering cloud infrastructure equipped with a variety of analytics software at affordable prices, it's feasible to make things happen in this field. Naturally, such companies are bound to spend a large part of their funding on product development, where data scientists play an integral part.

Finally, learning data science has never been as easy as it is today. With many books written on it (e.g. the ones from Technics Publications), many quality videos (e.g. the ones on Safari Books Online), and Massive Open Online Courses (MOOCs) on platforms such as edX and Coursera, it is merely a matter of investing time and effort. As for the software required, most of it is open-source, so there is no real obstacle to learning data science today. This whole phenomenon is not random: the increase in available data, the messiness of that data, and the value it holds for an organization all indicate that data science can be something worthwhile, both for the individual and for society as a whole. This results in a growing demand for data scientists, which in turn motivates many people to dedicate a lot of time to making all this possible. So, take advantage of this privilege, and make the most of it to jump-start your data science career.

Summary

Data science differs from business intelligence and statistics in these areas:

  • peculiarity of the data involved (aka big data)
  • messiness of the data
  • the use of more advanced data analytics techniques
  • the potential of data products
  • the inter-disciplinary nature of the field

Big data has four key characteristics:

  • Volume – it comes in very large quantities, unable to be processed by a single computer
  • Velocity – it is often generated and transmitted at high speeds
  • Variety – it is very diverse, comprising a number of different data streams
  • Veracity – it is not always of high quality, which sometimes makes it unreliable

Machine learning and AI are two distinct yet important technologies. Machine learning offers alternative, usually data-driven approaches to data analysis, employing heuristics and other methods. AI involves various algorithms that enable computers, and machines in general, to process information and make decisions in a sentient manner.

The three main stages of the data science process are:

  1. Data engineering – preparing the data so it can be used in the stages that follow
  2. Data modeling – creating and testing a model that does something useful with the data
  3. Information distillation – delivering insights from the model, creating visuals, and in some cases, deploying a data product

A data scientist is not a magician. If the data at hand is of low quality, or if there is not enough computing power, it is next to impossible to produce anything practically useful, no matter how much data is available.
