Data science is a response to the difficulties of working with big data and the other data analysis challenges we collectively face today. We examined this briefly in the introduction, but that only scratched the surface. In fact, there is so much literature on big data that even this whole chapter cannot do it justice. It will, however, give you a good idea of its importance in today’s world. Furthermore, it will help you understand what all the hype about big data is about (a hype that has increased significantly over the past year), and why data science is so important.
Big data is a fundamental asset for today’s businesses, and it is not a coincidence that the majority of businesses today are using, or are in the process of adopting, the corresponding technology. Despite all the hype about it in various media, this is not a fad. There are specific advantages to using this asset, and the fact that it is growing more abundant is an indication that it is imperative to do something about it, and do it fast! Perhaps it is not useful for certain industries right now as big data tends to be quite chaotic or even non-existent for them. Those who do have it and make intelligent use of it, though, reap its benefits and stand a good chance of being more successful in today’s competitive economic ecosystems.
1.1 Digging into Big Data
Big data is abundant and contains information that is relevant to the business problems at hand. If you are a manager of an e-commerce company, for example, the data you collect on your servers regarding your customers and the visitors to your site are rich with information that, when analyzed properly, can be used to increase your sales, enhance your site’s design, and improve your customer service. It can also provide you with ideas on marketing strategies and ways to improve your company’s overall strategy; all that from a bunch of ones and zeroes that dwell on your servers. You just need to extract the information from them, allocating a small part of your resources. Not a bad trade-off, for sure. We’ll come back to this example later on.
Not every amalgamation of data qualifies for the term big data, although most Web-related data falls under this umbrella. This is because big data is characterized by the four Vs².
Fig. 1.1 The four Vs of big data.
As we have already seen, these are:
Note that a piece of data may have one or more of these characteristics and still not be classified as big data. Big data has all four of these. Big data is a serious issue as it is not easy, even for a supercomputer, to manage it effectively, let alone perform a useful analysis of it.
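The "all four" rule can be written down as a simple checklist. The following is a minimal sketch, assuming the four Vs are the commonly cited volume, velocity, variety, and veracity; the class name `FourVs`, the function `is_big_data`, and the two example profiles are hypothetical and serve only as illustration. Whether each flag is set remains a judgment call for the analyst.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class FourVs:
    """Checklist of the four Vs (assumed names); each flag is a judgment call."""
    volume: bool    # more data than conventional tools handle comfortably
    velocity: bool  # data arrives or changes quickly
    variety: bool   # mixed formats (logs, text, images, and so on)
    veracity: bool  # quality and reliability of the data are uncertain


def is_big_data(profile: FourVs) -> bool:
    """Return True only if every one of the four characteristics is present."""
    return all(getattr(profile, f.name) for f in fields(profile))


# Hypothetical profiles, invented for illustration only.
web_logs = FourVs(volume=True, velocity=True, variety=True, veracity=True)
archive = FourVs(volume=True, velocity=False, variety=False, veracity=False)

print(is_big_data(web_logs))  # all four flags set
print(is_big_data(archive))   # only one flag set
```

A data set that is merely large, like the hypothetical `archive` above, fails the check because only one of the four characteristics is present.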
In the example we started with, a typical set of data that you would encounter would have the following qualities:
Based on all of the above observations, do you think that you are dealing with big data in this company or not? Why? If you have understood the above concepts, you should be confident in replying positively to this question. Each one of the bullet points describing the data situation in that company has to do with one of the Vs of big data.
1.2 Big Data Industries
Naturally, not all industries are equally affected by the big data movement. Depending on how much they rely on data and how profitable information is to them, they may be looking at a goldmine or one more asset that can wait. Based on recent statistics, the following industries appear to have benefited, or are inclined to benefit the most from big data:
Note that the benefit is not always directly related to the bottom line, but it is definitely of significant business value. For example, by employing big data technologies in healthcare, physicians can use previous data to gain a better understanding of the patients’ issues, yielding a better diagnosis and enabling them to take better care of their patients in general. This can eventually result in greater efficiencies in the medical system, translating into lower costs through the intelligent use of medical information derived from that data.
Another example comes from customer care, where big data can help companies respond to bad customer experiences. Through effective use of big data technologies, companies can gain a better understanding of what their customers like and don’t like in near real-time. This can help them amend their strategies for dealing with these customers and give them insight into how to improve their services in the future.
Note that there are many other industries that have the potential to gain from big data, but given their current status, it is not a worthwhile option for them. For example, the art industry is still not big on big data, since the data involved in this field is limited to descriptions of artwork and, in some cases, digitized forms of these works of art. However, this may change in the future, depending on how artists and galleries choose to collect data. For example, if a gallery used sensors to monitor the number of people who view a certain painting and combined those readings with other data (e.g., the number of people who bought tickets to the various exhibitions that hosted that painting), it could gradually build a large database containing the sensor readings, the ticket sales, and even the comments some people leave on the gallery’s blog about the various paintings. All this can potentially yield useful information about which pieces of art are more popular (and by how much), as well as what the optimum ticket prices should be for the gallery’s exhibitions throughout the year.
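To make the gallery scenario more concrete, here is a small sketch of how sensor readings and ticket sales might be combined to rank paintings by popularity. Everything in it is an assumption made for illustration: the painting names, the numbers, the record layout, and the function names `total_by_painting` and `popularity_ranking` are all hypothetical, and a real gallery would work with far larger and messier data.

```python
from collections import defaultdict

# Hypothetical records, invented purely for illustration.
sensor_readings = [  # (painting, viewers counted by the sensor in one day)
    ("Painting A", 120),
    ("Painting B", 45),
    ("Painting A", 150),
    ("Painting C", 80),
    ("Painting B", 60),
]
ticket_sales = [  # (painting hosted by the exhibition, tickets sold)
    ("Painting A", 300),
    ("Painting B", 200),
    ("Painting C", 100),
]


def total_by_painting(records):
    """Sum the counts of (painting, count) records for each painting."""
    totals = defaultdict(int)
    for painting, count in records:
        totals[painting] += count
    return dict(totals)


def popularity_ranking(readings, sales):
    """Rank paintings by total viewers and report viewers per ticket sold."""
    viewers = total_by_painting(readings)
    tickets = total_by_painting(sales)
    rows = []
    for painting, seen in viewers.items():
        sold = tickets.get(painting, 0)
        ratio = seen / sold if sold else None  # no ratio without ticket sales
        rows.append((painting, seen, sold, ratio))
    # Most viewed painting first.
    return sorted(rows, key=lambda row: row[1], reverse=True)


ranking = popularity_ranking(sensor_readings, ticket_sales)
for painting, seen, sold, ratio in ranking:
    print(painting, seen, sold, ratio)
```

The viewers-per-ticket ratio is only one possible popularity measure; it is used here because it relies on exactly the two data sources mentioned above.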
All this is great, but how is it of any real use to you? Well, higher profit margins and the potential to significantly boost productivity are not going to happen on their own. It is naïve to think that just installing a big data package and assigning it to an employee (even if they are a skilled employee) could result in measurable gains. In order to take advantage of big data, a company needs to hire qualified people who can undertake the task of turning this seemingly chaotic bundle of data into useful (actionable) information. This is the problem that all data scientists are asked to solve and one of the driving forces of all developments in the field that came to be known as data science.
1.3 Birth of Data Science
The field of data science resulted from the attempt to discover potential insights residing in big data and to overcome the challenges reflected in the four Vs described previously. This was made possible by a combination of technological advances in modern computing: specifically, parallel computing, sophisticated data analysis processes (mainly through machine learning), and powerful computing at lower prices. What’s more, the continuously accelerating progress of IT infrastructure and technology will enable us to generate, collect, and process significantly more data in the not-so-distant future. Through all this, data science addresses the issues of big data on a technical level, applying the intelligence and creativity that go into the development and use of these technologies. As a result, big data becomes somewhat manageable and can provide at least enough useful information to make the whole process worthwhile.
It’s important to note that data science is not a fad, but something that is here to stay and bound to evolve rapidly. If you were an IT professional when the World Wide Web came about, you might have seen it as a luxury or a fad that wouldn’t catch on, but those who managed to see its real value and the potential it held made very lucrative careers out of it. Imagine being one of the first people to learn HTML, CSS and JavaScript, or one of the first to create digital graphics to be used for websites. It would be like holding a winning lottery ticket, especially if you were good at your job. This is the situation with data science today. It would probably not be so well-known if it weren’t for so many people writing about its benefits. Still, most professionals and many students are not aware of what data science really means.
If you assimilate the aforementioned facts about big data, you will understand that data science is the solution to a real problem that is only going to become more pronounced in the years to come. This problem, as mentioned earlier, is reflected in the four Vs of big data, the characteristics that make it difficult to deal with using conventional technologies. As technology is on its side, data science is bound to become more robust and more diverse in the coming decade or so. There are already some post-graduate programs making an appearance in the academic world³, and there are plenty of respectable researchers writing papers on data science topics. This is not a coincidence. It shows a trend for the development of an infrastructure of knowledge and know-how that will nourish this field.
It is not very clear exactly when data science was born (researchers have been working in this field for several decades), but the first conference where it received the spotlight took place in 1996 (“Data Science, Classification, and Related Methods” by IFCS). It wasn’t until September 2005, however, that the term “data scientist” first appeared in the literature. Specifically, in a report released that year⁴, data scientists were defined as “the information and computer scientists, database and software engineers and programmers, disciplinary experts, curators and expert annotators, librarians, archivists, and others, who are crucial to the successful management of a digital data collection.” In June 2009, the importance of the role of the data scientist became more apparent with the publication of Nathan Yau’s article “Rise of the Data Scientist” on FlowingData⁵. Since then, references to and literature on data science have increased rapidly. Just take a look at how many conferences are being organized for it nowadays, appealing to both academics and people in the industry! What’s more, as several large companies that are leaders in their sectors (e.g., Amazon) make use of data science in their everyday workflow, it is quite likely that this trend will continue. Also, as the role of the data scientist adapts to the ever-changing requirements of the data world, it has come to include more than the original responsibilities, such as the application of state-of-the-art data analysis techniques.
1.4 Key Points