Chapter 5
Technical Qualifications

As with many jobs today, you need a robust set of technical qualifications before you can apply for a data science job. The mindset of the data scientist, which was described in the previous chapter, is like an operating system you need to have installed in your mind, but it needs to be augmented with particular software (i.e., your technical skills) to enable you to get the job done. These skills fall into three broad categories: general programming, scientific background and specialized know-how (software and techniques). Naturally, the exact qualifications will vary greatly from one company to another, but having a core set of skills across all of these categories may help you qualify for most data science jobs.

In this chapter, we will look into the most commonly expected qualifications for a data scientist position today: the general programming skills required, the scientific background you will be expected to have and the specialized know-how you need to possess related to data analysis and data engineering.

5.1 General Programming

Unlike in many other branches of science, programming is a must-have for any data scientist. Professionals in academia may be able to get by without writing any code, but in data science you need to know languages that are:

  • Robust
  • Popular in the industry
  • Scalable, especially when it comes to large data sets

The (general purpose) languages that appear most commonly in data scientist job openings are:

  • Java
  • Python
  • C++ / C#
  • Perl

SQL is also required, but this is a more specialized language. You can get by in data science without knowing Perl or Java, but you won’t manage without SQL, since at one point or another you will need to access a database and run queries on it. In addition, many database query languages borrow heavily from SQL, so knowing it will help you work with similar languages and dialects such as Hive Query Language, AQL, BigSQL, etc. It also gives you a point of reference when you work with NoSQL data stores, some of which offer SQL-like query interfaces.
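
As a rough illustration of what accessing a database and running queries on it looks like in practice, here is a minimal sketch that uses Python's built-in sqlite3 module. The orders table, its columns and its values are hypothetical and exist only for this example; a real project would connect to the company's own database instead of an in-memory one.

```python
import sqlite3

# Hypothetical in-memory database; table name and rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 20.0), ("bob", 5.0), ("alice", 30.0), ("carol", 12.5)],
)

# Count orders and total spend per customer, largest total first.
query = """
    SELECT customer, COUNT(*) AS n_orders, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY total DESC
"""
rows = conn.execute(query).fetchall()
for customer, n_orders, total in rows:
    print(customer, n_orders, total)

conn.close()
```

The same SELECT, GROUP BY and ORDER BY pattern carries over, with minor syntax differences, to most SQL-like languages.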

Notice that the aforementioned general programming languages all support object-oriented (OO) programming; this is not a coincidence. There are other great languages (the most widespread of which is C) that may not work as well for you as a data scientist because the trend for the past few years has been towards OO languages. (Fortunately, C has an OO counterpart, C++.) One of the main reasons for this is that an OO language lets you organize sophisticated projects into reusable components and then combine your code with others’ code effectively. That’s a big plus when working on a team tackling big data, where agility is key.

Although there is a lot of interest in Python, it is not inherently better as a language than the others mentioned. It is, however, relatively easy for someone with no programming experience to pick up, and it has a wide variety of libraries. If you have done programming before, you may want to consider a more robust language such as Java. It is a good idea to master at least one language, but it doesn’t hurt to familiarize yourself with more than one since you never know when they will come in handy.

Note that knowing how to program well in one or more of these languages may not be enough. You will need to have some data processing experience with them, particularly with large data sets. After all, that’s what you’ll be using them with!
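
One habit that data processing experience with large data sets tends to build is streaming through the data instead of loading all of it into memory. The sketch below is only an illustration of that idea, using Python's standard library; the function name mean_by_key, the column names and the sample rows are all hypothetical.

```python
import csv
import io
from collections import defaultdict


def mean_by_key(lines, key_col, value_col):
    """Stream CSV rows and keep only running sums and counts per key,
    so memory use grows with the number of distinct keys, not rows."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.DictReader(lines):
        sums[row[key_col]] += float(row[value_col])
        counts[row[key_col]] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Small in-memory stand-in for a large file (made-up data).
# For a real file, pass an open file object instead of the StringIO.
sample = io.StringIO("region,sales\nnorth,10\nsouth,4\nnorth,20\nsouth,6\n")
result = mean_by_key(sample, "region", "sales")
print(result)
```

Because the function accepts any iterable of lines, the same code works on a file of a few rows or a few hundred million rows.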

5.2 Scientific Background

This is a key aspect of the qualifications bundle of a data scientist, one that differentiates a data scientist from other IT professionals. A data scientist typically has at least a master’s degree in a technical field (usually computer science, statistics, mathematics, systems engineering, or something along these lines). Alternatively, a background in a non-technical field combined with sufficient technical experience from previous jobs is also an acceptable option. Having a PhD, though, is a major advantage, regardless of your background, especially if your research has a quantitative component to it and if you are looking into a position with a higher salary. There are data scientists out there who have PhDs in very diverse disciplines such as psychology and physics.

A PhD can provide considerable experience in data analysis, especially if the research done for it is on real-world datasets. Acquiring a PhD is not considered to be formal work experience, but in reality, the experience and skills gained can be as useful as actual work experience. In fact, most of the professional attributes that real-world experience provides you with (time management, reliability, teamwork, etc.) are also skills you learn working on a PhD, especially if you are part of a research lab. So if you have a PhD that has provided you with applicable skills, you may want to refer to them in your resume as well as in interviews during the hiring process, depending on the people you’ll be working for. That’s a judgment call you’ll need to make since not everyone values PhDs the same way.

A solid theoretical understanding of various advanced analytical techniques, along with practical know-how in applying them, is also required as part of a scientific background. These techniques include (but are not limited to) data mining, machine learning and predictive modeling (aka predictive analytics). If you lack knowledge in advanced analytics, be prepared to offer something that no one else can, such as state-of-the-art knowledge of processing data effectively.

All of the above techniques are great tools that you need to know intimately. However, what binds them all together is a strong mathematics and statistics background, which is also an essential qualification that employers are looking for. This doesn’t mean that you need to know all theorems and their proofs, but you do need to be familiar with most of them and, most importantly, know how to use them with the data you have available. Not everything will work with all types of data, of course. Overall, you need to know enough to be able to do the following in a way that’s second nature to you:

  • Discern which tool to use when.
  • Fine-tune the tool you decide to use, customizing it to the problem at hand.
  • Know what to do with the results your tool yields.
  • Think of alternative approaches to solving a problem and be able to rank them in terms of resource requirements.

A solid understanding of the theory behind the techniques you are applying is crucial. To gain this understanding, you need to have taken several classes on mathematics and statistics and not be intimidated by anything in those fields. If you don’t know something, you need to be able to learn it by leveraging what you have already learned. You can do this by taking a seminar or an online class, or even just by reading a couple of books.

The scientific background you are expected to have as a data scientist will also enable you to formulate testable hypotheses, apply a reproducible methodology to the data at hand, make good use of the data science process (see Chapter 11) and have a thorough understanding of the results. Moreover, you will be able to fine-tune your methods, know where something has gone wrong and come up with alternative approaches to a problem. It is very hard to overestimate the importance of having a scientific background.
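
To make the idea of a testable hypothesis and a reproducible methodology a little more concrete, here is a minimal sketch of an exact permutation test for a difference in means between two small groups. The technique is chosen purely for illustration and is not prescribed by this chapter; the function name and the measurements are made up. The null hypothesis is that the group labels do not matter, so every way of splitting the pooled values into two groups is equally likely.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(a, b):
    """Exact two-sided permutation test for a difference in means.

    Enumerates every split of the pooled values into groups of the
    original sizes, so it is only practical for small samples.
    """
    pooled = list(a) + list(b)
    observed = abs(mean(a) - mean(b))
    n = len(pooled)
    extreme = 0
    total = 0
    for idx in combinations(range(n), len(a)):
        chosen = set(idx)
        group_a = [pooled[i] for i in idx]
        group_b = [pooled[i] for i in range(n) if i not in chosen]
        # Small tolerance guards against floating-point rounding.
        if abs(mean(group_a) - mean(group_b)) >= observed - 1e-12:
            extreme += 1
        total += 1
    return extreme / total


# Hypothetical measurements for two groups (illustrative values only).
treated = [12.1, 11.8, 12.5, 12.9]
control = [10.2, 10.9, 10.5, 11.0]
p_value = permutation_p_value(treated, control)
print(f"p-value: {p_value:.4f}")
```

A small p-value suggests that a difference this large would be unlikely if the labels were arbitrary. Because the test enumerates every split rather than sampling at random, rerunning it on the same data always gives the same answer.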

5.3 Specialized Know-How

Being a data scientist requires some specialized know-how that distinguishes you from other professionals. It is important that you have mastery of at least one of these statistics tools:

  • R (a widely used and highly extensible statistical analysis platform; open-source)
  • SPSS (another great statistical tool; proprietary)
  • SAS (a very popular statistical tool in the industry; proprietary)
  • Stata (another good statistical tool; proprietary)

Some employers might also include Matlab in the list, since Matlab lets you carry out a wide range of data analyses with relatively little code and comes with its own integrated development environment (IDE) that simplifies debugging and development considerably. The big drawback of Matlab is that its license is quite expensive, especially for commercial applications.

If you are not sure which tool to focus on, R is a good choice. Over the past few years, R has become more popular for several good reasons: R is open-source (and therefore completely free), it has a very large user community, it is easy to install and customize, it is fairly easy to learn, there is ample documentation for it as well as several books for a variety of levels, and it comes with a wide variety of libraries (known as packages) that enable you to do many complex tasks easily without having to do much coding. Note that although R has all the characteristics of an OO language (and all of the data structures in its workspace are treated as objects), it is still considered by most people to be a statistics tool.

If you already know Matlab reasonably well, you may want to learn another tool just in case an employer is not familiar with it or is unwilling to purchase a license or two. Note that the transition from Matlab to R and vice versa is quite easy, especially if you are somewhat familiar with OO programming.

Experience with big data storage frameworks is also an essential qualification. As we saw in a previous chapter, big data requires a different set of paradigms, one of which is new kinds of database schemas. Large-scale data frameworks such as Hadoop and Hive, as well as large-scale partitioned relational databases, are therefore something you need to be familiar with as a data scientist.

Finally, some experience in working with large datasets (TB class) is also very useful. Although this experience may not be required, it is something that you can gain very quickly and that doesn’t entail any additional know-how. Other qualifications that may be required include:

  • Visualization – this is an important aspect of the data science process, which has to do with the creation of graphics (usually plots, heat maps, graphs, etc.) that aim to help the user get a good idea of what the data illustrates without having to look at tables or statistics. Visualization is oftentimes done through the data analysis tool you are using.
  • Relational databases – depending on your project, you may need to work with relational databases. It will be useful to become familiar with them, especially if you do that while learning SQL.
  • Consumer modeling – this is a type of modeling that has to do with creating and using consumer profiles in order to understand the company’s target group better and facilitate all the marketing endeavors that employ this information. It is particularly useful if you are working for a company in the retail industry.
  • Big data integrated processing system (e.g., IBM’s BigInsights, Knime, Alpine, and Pivotal) – it is not likely that this will be a requirement in a data scientist job posting, but being familiar with a system like this gives you a better understanding of the bigger picture of big data processing. Since it does much of the low-level work for you, it also lets you approach the problem at a high level and focus on the most creative aspects of your job.

As the data science field matures, it is likely that additional specialized know-how will be required in order to be a data scientist. However, the qualifications identified in this chapter are bound to remain essential, particularly the data analysis tools. It is recommended that you keep up to date with the newest developments in the field so that you know how to adjust your training strategy and avoid wasting your resources on things that you may not need.

5.4 Key Points

  • As a data scientist, you need a specific set of technical skills that are the tools you will use in your everyday job.
  • You need to be familiar with one or more object-oriented programming languages such as Java or Perl. Having mastery of at least one of them is imperative.
  • You need to have a solid scientific background (even if your education is non-technical), making you adept in the following:
    • The scientific process
    • The theory behind various data analysis techniques
    • Using the above techniques in practice
    • Formulating and testing various hypotheses
    • Understanding the results of a data analysis method
  • Having a PhD in a technical discipline can be quite useful when it comes to data science, as it can compensate for lack of work experience, but it is not a prerequisite.
  • You need to have some specialized knowledge that is particular to the job of a data scientist, including:
    • Sufficient knowledge of one or more data analysis tools (e.g., R, SPSS, SAS, Stata, or Matlab) and mastery of at least one of them.
    • Experience with big data storage frameworks (e.g., Hadoop, Hive, etc.).
    • Other know-how that may or may not be a prerequisite for getting a data science job, such as visualization, relational databases, consumer modeling, a big data integrated processing system, and, of course, experience working with datasets in the big data domain.
  • The data science field evolves rapidly, so you need to keep up with the changes, particularly in the tools used so that you can adjust your training strategy accordingly.