Chapter 16
Experienced Data Scientists Case Studies

We’ll begin the case studies with the stories of two experienced data scientists who work in the retail and law enforcement industries. In both cases, we’ll get to know them better through some basic professional and background information, then proceed with their views on data science in practice, how they see data science in the future and, finally, what advice they have for you, the aspiring data scientist. At the end of the chapter, we’ll have some take-away points, as usual, to help you remember the key lessons of these interviews.

16.1 Dr. Raj Bondugula

16.1.1 Basic Professional Information and Background

Dr. Bondugula has worked for Home Depot for the past few months although he’s been in the data science field for many years. He comes from a machine learning background and spent several years in academia, so he is more of a researcher type of data scientist. He is formally trained in computer vision and natural language processing (NLP), fields that are very relevant to data science as they both involve a great deal of challenging data. Although most of his work is in these two fields, his expertise goes beyond them; he has also spent several years practicing data science in the industry. At one point in his career, he worked for the Department of Defense on computer clusters.

Dr. Bondugula has been involved in associations, namely IEEE and the Computational Intelligence Society, specializing in fuzzy logic. He was also active in the research arena, contributing a number of papers in bioinformatics during his academic phase. Currently, he is involved in conferences focusing on data science technologies. He is also open to the idea of joining a data science group when he has more time.

16.1.2 Views on Data Science in Practice

This data scientist has a very mature and clear perception of what data science is, something that is uncommon among other data scientists, particularly those new to the field. Dr. Bondugula views dealing with data as an extension of machine learning practices, where the scope is scaled up while everything else pretty much stays the same. For him, Hadoop is an easy way to meet the challenge of parallelization for those unfamiliar with parallel computer programming. It not only saves a lot of time, but also a great deal of money (millions of dollars).

Dr. Bondugula runs a data science team for Home Depot. Along with his team, he handles data science projects from the conceptual level all the way to implementation and validation. Afterwards, they partner with IT to create their data products. As the whole group is relatively new to Home Depot (a little over a year, at the time of this writing), he is still the only official data scientist there although the team now includes some members who are adept in big data technology and a few machine learning practitioners. The people he works most closely with are a small subset of this group.

For Dr. Bondugula, the most important thing in this line of work is in-depth technical knowledge of the tools used and the ability to adapt to the problem at hand, including modifying the tools if necessary. This “fundamental understanding of the techniques,” as he calls it, has enabled him to use the same methods in a variety of domains, adapting them to fundamentally different problems effectively and efficiently.

His everyday work involves a variety of things. Sometimes he and his team are asked to improve internal manual processes by employing data science (using NLP, for example). Other times they come up with novel ideas to improve customer satisfaction, e.g., through a recommender system they have developed for the company’s website. They also undertake Web intelligence tasks at times in order to ensure that the website functions properly (e.g., pinpointing broken links).

Although Dr. Bondugula is not a senior data scientist yet (i.e., he doesn’t have other data scientists reporting to him), he is well on his way to becoming one. For him, there is a very fuzzy line between the two classes of data scientists; the division is not as clear-cut as it appears to be in job applications. He also finds that the titles of “data scientist” and “machine learning specialist” are pretty much the same thing since the former has been around only for a few years in the market.

Dr. Bondugula thinks that the retail industry lends itself to data science because of the amount of data available and the data-related problems faced by the industry. That’s what makes it interesting, too. He finds that most of the time he needs to apply and adapt existing methods rather than invent his own (as is often the case in the R&D departments of data-driven companies). An example of a data product he and his team are developing is a non-personalized, content-based recommender system for the company’s website. If a customer is looking to buy a bath faucet from Home Depot through its website, they may want to buy other bath products or other related items as well. His data science system will find those products and display them on the Web page for the customer to view, emulating the experience they would have if they were physically in the store.
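
The idea behind a non-personalized, content-based recommender can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system Dr. Bondugula’s team built: the catalog, the product descriptions and the function names (vectorize, cosine, recommend) are all hypothetical, and a simple bag-of-words cosine similarity stands in for whatever similarity measure a production system would use. The sketch is “content-based” because it compares product descriptions, and “non-personalized” because it uses no information about the individual customer.

```python
import math
from collections import Counter

# Hypothetical catalog (illustrative only): product name -> short description.
CATALOG = {
    "bath faucet": "chrome bath faucet single handle bathroom sink",
    "shower head": "chrome shower head bathroom rain spray",
    "towel bar": "chrome towel bar bathroom wall mount",
    "cordless drill": "cordless drill battery power tool",
    "circular saw": "circular saw blade power tool",
}


def vectorize(text):
    """Turn a description into a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(product, catalog=CATALOG, k=2):
    """Return up to k products whose descriptions are most similar
    to the description of the product being viewed."""
    target = vectorize(catalog[product])
    scored = [
        (cosine(target, vectorize(description)), name)
        for name, description in catalog.items()
        if name != product
    ]
    # Highest similarity first; ties are broken alphabetically by name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    # Items that share no words with the viewed product are dropped.
    return [name for score, name in scored[:k] if score > 0]


if __name__ == "__main__":
    print(recommend("bath faucet"))
```

In this toy catalog, a customer viewing the bath faucet would be shown the other bathroom items rather than the power tools, because only the former share words with the faucet’s description.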

He finds that although data science was not previously essential in this industry, nowadays it is a very useful tool of great importance to it. The reason is that it satisfies a need that was always there.

There is no doubt that Dr. Bondugula loves his job. He makes that clear when he says that his day job doesn’t begin when he gets to the office, but rather as soon as he wakes up, when he starts thinking about the data science problems he is currently tackling. It is evident that he is not only satisfied with this line of work but also very enthusiastic about it, stating that he is “having fun” every day in his work.

16.1.3 Data Science in the Future

Dr. Bondugula envisions that in the future, Extract, Transform, and Load (ETL) tools will become redundant and be replaced by Hadoop, which is, in his view, the most promising piece of data science technology today along with its “family”: Mahout, HDFS, etc. It is a technology that makes a very challenging task (computer parallelization) relatively easy, albeit not simple. Regarding Hadoop evolution, he foresees that it will employ more data analysis paradigms beyond the MapReduce one that is widely used in data science today.
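
For readers unfamiliar with the MapReduce paradigm mentioned above, the following is a minimal, single-process sketch of its three stages (map, shuffle, reduce) in plain Python. It does not use Hadoop or any Hadoop API; the function names and the sample records are assumptions made purely for illustration. In a real cluster, the map and reduce steps would run in parallel on many machines, which is the part Hadoop handles on the programmer’s behalf.

```python
from collections import defaultdict


def map_phase(record):
    """Map step: emit a (word, 1) pair for every word in one record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine the values for one key into a single result."""
    return key, sum(values)


def map_reduce(records):
    """Run the three stages sequentially over an iterable of text records."""
    pairs = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical search queries, used only as sample input.
    print(map_reduce(["bath faucet", "bath towel"]))
```

The example counts word occurrences, the customary introductory MapReduce task; other analyses fit the same pattern by changing what the map step emits and how the reduce step combines values.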

For Dr. Bondugula, the most challenging part of data science, which will probably be the focus of the field in the future, is forming the right questions to yield useful and meaningful answers from big data. Open questions like “what’s interesting about this data?” may not be popular in the future because they may not yield very insightful answers (even if they are scientifically valid). He believes that the source of a hypothesis (scientific inquiry) should come from business knowledge. This is where creativity, an inherently human attribute that is less likely to be undertaken by computers, comes into play, at least in this field.

On a personal level, Dr. Bondugula is confident that regardless of the domain he works with in the future, he’ll do just fine even if he needs to learn all the relevant domain knowledge from scratch. As one would expect, he plans to continue in this line of work for many years to come.

16.1.4 Advice to New Data Scientists

Dr. Bondugula advises new data scientists to “become an expert in one field, be it statistics, machine learning or, say, Java programming, and then try to get into the other ones.” You also need to be prepared to accept help from other people as you won’t be able to solve every single problem on your own. Moreover, networking and communication skills also matter a lot, so you need to develop a varied skill-set, which includes “soft” skills, too.

16.2 Praneeth Vepakomma

16.2.1 Basic Professional Information and Background

Mr. Vepakomma works at PublicEngines Inc., a company that develops software for law enforcement, including analytics and advanced predictive products. His current line of work, a rather unusual one in the business world, uses data science to create advanced spatio-temporal predictive models and algorithms that predict crime, helping law enforcement agencies use their resources accurately and efficiently. Mr. Vepakomma has worked as a researcher for about five years, three of which he has spent in the industry as a data scientist.

Mr. Vepakomma is a member of the American Statistical Association (ASA) and the American Mathematical Society (AMS) (similar to IEEE but for mathematicians) and a former member of the Data Science Atlanta meetup group, where he once gave a talk. He is actively involved in research and regularly participates in conferences such as ECML and PKDD. His academic research usually lies at the intersection of advanced mathematics, machine learning and statistics, in both the theoretical and applied realms.

A very amicable person, Mr. Vepakomma personifies many of the qualities of a data scientist. Apart from his technical strengths, he has great communication skills, curiosity about and interest in many things, and a willingness to learn constantly. He strongly believes in the importance of having technical breadth across many sub-domains in addition to depth of expertise in a few.

16.2.2 Views on Data Science in Practice

Mr. Vepakomma believes that it is very important to interact with all project participants throughout the development of a product, not just on algorithm development and core problem solving, but also on business strategy. He holds that the most important things to have on your resume as a data scientist are a strong quantitative background, very good communication skills and experience in having led a data science project. (The latter is not essential for junior data scientists, but it is always valued.) A track record of problem-solving that has led to end products with a high value proposition or a disruptive impact on the market would also boost your resume.

Mr. Vepakomma is part of a team of ten people, consisting mostly of engineers. Apart from daily interactions with them, he also has a direct line of communication with executives about strategy- and execution-related matters in regular meetings. His everyday work includes problem-solving, R&D, coming up with evaluation metrics, developing the algorithmic backend and creating optimization hacks for scaling the devised solution to save computational time and resources. In addition, he does some academic research on the side and strongly believes in the power of collaborative research.

Based on his experience, Mr. Vepakomma thinks that it is important to anticipate and investigate all the things that could go wrong when implementing a model in order to ensure better fault tolerance in the end result. Without this habit of looking for faults, many minor things are bound to go wrong sooner or later. Even if the core mathematical model is state-of-the-art, minor faults can arise in the pipeline that makes use of this mathematical secret sauce. That said, when it comes to developing a quality data product, he believes that sticking to the scientific method is the most important strategic guideline of all, and that it should shape the attitude towards product development and execution.

The data products that he has been involved in at PublicEngines and that are currently on the market are “Command Central Predictive” and “Command Central.” The former is crime prediction software that provides an accurate prediction of criminal activity in a small area (much smaller than a typical heat map), thereby yielding actionable information that proves invaluable to the police and law enforcement agencies. The focused tactical information generated by this multiple patent-pending predictive product helps reinforce ‘directed, actionable patrol plans’ and increase ‘resource-efficiency’ so that law enforcement agencies can positively impact their communities through accurate and efficient policing. The “Command Central” software is a platform that provides spatial and temporal crime analytics, while “Command Central Predictive” is focused on spatio-temporal predictive models and algorithms.

Working in a breadth of industries or domains would be quite easy for Mr. Vepakomma because, as he explains, the domain doesn’t matter that much for a data scientist. What really matters is the presence of the right environment to produce hi-tech disruptive technologies with high value propositions. It all starts with the right environment and the right skillsets, he emphasizes. He quotes John Tukey, saying that Tukey liked statistics and applied mathematics because he got to work in everyone’s backyard. This mindset has carried over into the realm of data science as well, he says. However, he prefers working in a company that has a worthwhile strategy for product development, and he doesn’t favor work environments whose monetization strategy is solely focused on aggressive marketing and is not naturally backed or driven by high-quality and impactful technological products. According to Mr. Vepakomma, there is a wonderful trade-off among the four pillars formed by the size of the market, the quality/relevance of the product, the marketing/delivery channels and the competition. A weakness in any of these four criteria has to be aggressively compensated for by the rest in order to sustain and develop financial traction while (most importantly) continuing to correct the weakness.

He believes that the roster of domains and industries where a data scientist can be an ‘A-player’ and contribute with high, lasting impact is pretty long. Naturally, he is very satisfied with his job and is very excited about working in the domain of law enforcement and predictive policing. He finds that having a noticeable impact on society through the data products he and his team develop is a great motivator.

16.2.3 Data Science in the Future

According to Mr. Vepakomma, the future of data science is extremely bright because the amount of available data is growing exponentially. This opens up many new opportunities for monetizing products and services through the development of mathematical/statistical models and algorithms that draw intelligent use cases out of this data. He finds that people with varied yet formal quantitative backgrounds can come together to make this possible under the umbrella of a data-science-centric organization.

16.2.4 Advice to New Data Scientists

Mr. Vepakomma believes that it is very important to learn new skills and constantly develop your existing ones while working on a breadth and depth of technical sub-domains. He also believes that the environment in which you work is very important regardless of the domain of the company you work for. A good environment will greatly help you, particularly in the early stages of your career, while a bad one is bound to hold you back. If you are a new data scientist, it helps to nurture yourself for one to two years by working with a competent and experienced data scientist (as part of their team) before going solo.

16.3 Key Points

  • Hadoop is a very important tool and plays an important role in the evolution of data science.
  • The most important thing in this line of work is in-depth technical knowledge (especially of the relevant tools), including the ability to adapt tools to the problem at hand.
  • The everyday life of a data scientist involves both optimizing existing processes and creating novel ones to improve the customer experience.
  • The titles of “data scientist” and “machine learning specialist” are pretty much synonymous in practice.
  • The retail industry lends itself to data science due to the amount of data available and the data-related problems the industry is facing. However, most of the data science work in this industry has to do with applying and adapting existing methods rather than developing innovative ones.
  • The most challenging part of data science, which is probably going to be the focus of the field in the future, is forming the right questions in order to find useful and meaningful answers from big data.
  • If you want to be a good data scientist, it is very important to master one particular field before moving on to the next one.
  • The work environment, more than the industry you are in, plays an important role in your development as a data scientist, particularly in the early stages of your career.
  • The future of data science is bright, and the field is bound to grow to be more varied in the years to come.
