Chapter 12
Specific Skills Required

Migrating to a data scientist role takes more than having the right mindset and knowing what the different data science tools are. It takes a certain kind of skill-set which needs to be both present in your mind and well-presented in your resume. In this chapter we will look into the specifics of that skill-set and how you can develop it coming from four major backgrounds: programming, statistics or machine learning, data modeling, and studentship. Furthermore, you will get acquainted with the field from a practical perspective, seeing how the data scientist’s skills relate to what you are already doing and making the whole transition not only an educational but also an enjoyable process.

12.1 The Data Scientist’s Skill-Set in the Job Market

As a data scientist candidate you are expected to possess a variety of technical skills, briefly described in Chapter 5. These are the so-called hard skills on the recruiter’s check list and therefore play a major role in the job-hunting process. However, it would be unwise to try to develop them in the order they are listed, as there are certain subtleties that need to be taken into account. Since you are most likely already versed in some of these skills, depending on your current professional status, it would be best to gradually expand your current skill-set so that it ends up including all of the following skills:

  • Data analysis skills
    • Cleaning data
    • Creating models
    • Applying statistics on data
    • Applying and developing machine learning algorithms
    • Validating models
    • Performing data visualization
  • Programming skills
    • One or more of the specialized data analysis platforms (R / Matlab / SPSS / SAS)
    • One or more OOP languages (Python, C++, Java, C#, Perl, etc.)
    • Other programming skills relevant to the industry (e.g., familiarity with HTML/CSS in the case of the Web industry)
  • Data management skills (particularly for big data)
    • Hadoop (particularly Hive/HBase, HDFS and MapReduce)
    • SQL
    • NoSQL
    • Other data management skills relevant to the company
  • Business skills
    • Familiarity with the Waterfall or the Agile frameworks
    • Understanding of how a company operates
    • Knowledge of the industry sector
    • Other business skills relevant to the company and industry
  • Communication skills (technically a soft skill)
    • Delivering engaging presentations / storytelling
    • Report-writing
    • Listening skills
    • Being able to translate customer requirements into specific action items
    • Other communication skills relevant to the company

It is tempting to think that one group of skills is more important than everything else and choose to focus on just that. However, as a data scientist you’ll need all of them even if you don’t have them in balance. In fact, it is quite unlikely that you’ll have them all in balance from the very beginning unless you start from a clean slate and gradually develop them through a university degree or a meticulously planned set of courses and books.

12.2 Expanding Your Current Skill-Set as a Programmer / SW Developer

Being a programmer or a software developer is actually relatively close to being a data scientist. However, depending on how experienced you are with data analysis and big data technologies, you may need to pick up a few skills to ensure that you are marketable as a data science professional. The exact skills and knowledge you’ll need depend on your particular profession as well as on your experience.

In this section, we will look into the skills you need to develop, the knowledge you need to acquire and how you can do all that, depending on where you come from. In particular, we’ll examine it from the perspective of you being an OO programmer, a software prototype developer or something else in that industry (e.g., a programming architect or a project manager). Whatever the case, we’ll make sure that you understand what it takes to migrate from your current professional state into a promising and fulfilling career in the data science field.

12.2.1 OO Programmer

If you are already in the object-oriented programming game, you are familiar with data structures and how to implement an algorithm efficiently in one or more OOP languages. You may even be adept at conserving resources and optimizing your code to meet a particular objective. So you have a decent head start towards becoming a data scientist since, as we have already seen, these are some of the essential skills you need to be a player in the data science game.

Unless you already have experience with Matlab or R, vectorization is something you need to learn, especially if you plan to work with one of these data analysis tools. Vectorization means applying one operation to many pairs of operands at the same time, writing code that is loop-free, instead of processing one pair of operands at a time and looping around to the next pair. The fewer loops in your code, the faster it will run on a Matlab or R platform as well as on any other data analysis tool that employs vectorization. This is because vectorized functions are built-in routines that are optimized and implemented in C or some other low-level language, enabling them to run very fast. This is a great point to remember, especially when you are dealing with large datasets. A vectorized approach may be many times faster than one using loops, even when the loop itself takes only a few lines of code. If you learn R, you will naturally learn vectorization because most tutorials don’t cover loops; if they do, they do so briefly at a later stage of the tutorial. Also, R has a large variety of built-in functions that save you the trouble of having to create loops doing the same thing on your own. So it lends itself to cleaner, faster vectorized scripts.
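
To make the contrast concrete, here is a minimal sketch in Python. It assumes the NumPy library is available, which is our choice for illustration; the text itself discusses Matlab and R, where the same idea applies. The function names are hypothetical.

```python
import numpy as np

def add_with_loop(a, b):
    """Add two equal-length sequences one pair of operands at a time."""
    result = []
    for x, y in zip(a, b):
        result.append(x + y)
    return result

def add_vectorized(a, b):
    """Add two sequences with a single loop-free array operation."""
    return np.asarray(a) + np.asarray(b)

a = [1.0, 2.0, 3.0]
b = [10.0, 20.0, 30.0]

looped = add_with_loop(a, b)        # plain Python list
vectorized = add_vectorized(a, b)   # NumPy array with the same values
```

Both functions return the same values. On arrays with millions of elements the vectorized version is typically much faster, because the element-by-element work happens inside compiled code rather than in the interpreter.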

But the purpose of this chapter is not to broadcast the merits of the R language; R can do that for itself. The point is that an OO programmer will be able to quickly assimilate the data analysis software used in data science, whether this is R, Matlab or any other software. The mental discipline that is required for effective OO programming work can be applied to any other software required. Even big data technologies, such as Hadoop, are not going to be a challenge for you if you have this quality. You will need to learn all these technologies, though, and it may be somewhat time consuming. For this, you can use the resources in Appendix 1 as well as all the other sources mentioned in the first part of Chapter 9, Learning New Things and Tackling Problems. How long it will take will depend on how dedicated you are and how much time you can devote to it.

You will want to pay close attention to the data visualization software as this is probably something that you are the least familiar with in your current work. It shouldn’t pose much of a challenge as all the programming required in such a piece of software is minimal, if not non-existent. Just familiarize yourself with one or more data visualization packages, and you will be good to go.

You will need to study the data analysis literature and mine it for know-how that you will need as a data scientist. You’ll particularly need to study statistics, if you haven’t taken a course on this subject already, and most importantly machine learning. You may not have time to go very deep on either one of them, but at least make sure you know enough to ace a statistics or machine learning course.

Finally, you need to learn more about how the end-user thinks, what he requires, how to interpret these requirements and how to communicate effectively in a non-technical language. Basically, hone the soft skills that can make you a software developer or a systems engineer (though the latter requires more than just this stuff). This is very important in cultivating the data scientist mindset and performing this role, as we’ve seen in Chapter 4.

Naturally, once you’ve learned all these things, you need to practice. You can start with the Kaggle challenges or the datasets available in the UCI machine learning repository. Just be sure that you acquire some hands-on experience before putting yourself out there as a data scientist for hire.

12.2.2 Software Developer

As a software developer, you are bound to be familiar with GUIs and the importance of the (usually non-technical) user of your work. This familiarity is invaluable. Being able to think as the user thinks allows you to appreciate their point of view and understand their concerns. Therefore, for a role in data science, you will need to focus your attention on your other technical skills.

As a developer you must already be familiar with two or more programming languages, most likely C# (along with the .NET framework) or possibly C++ and Java. That’s a great starting point. Just like your OO programming colleague, you have all the programming background to be a data scientist, so you should expand this by incorporating knowledge of big data technology and data analysis tools.

Your programming background and familiarity with the end-user will allow you to focus your efforts on the gaps in your knowledge. Similar to the OO programmer, you will need to develop your knowledge of visualization software and statistics.

You will also need to go deeper on the machine learning know-how as this is something many data scientists (all those not belonging to the researcher category) often lack. If you don’t know about clustering and pattern recognition, you need to gain an understanding of them as well as deep learning and other state-of-the-art machine learning techniques. Joining a relevant group is one strategy for achieving that objective.

As in the case of the programmer, you will need some hands-on experience with these new skills before being marketable as a data scientist. The methods described in the previous subchapter for acquiring this experience are applicable here, too. For more details about how to get the initial experience, you can review Section 6.3.

12.2.3 Other Programming-Related Career Tracks

Of course, you may be in the IT sector and not be an OO programmer or a software developer. For example, you may be a web designer, a web programmer, a QA analyst or a database administrator (this particular case we’ll cover separately in Section 12.4.1 as it has a special relationship with the data science world). What can you do, then, in order to become a data scientist?

In these cases you should expand your technical skills, focusing on the data science skills you need to cultivate the most. If you are a web programmer, you may want to work on your data analysis know-how, while if you are a QA analyst, you may want to refresh your programming skills first.

Whatever the case, for starters you’ll need to brush up on your statistics, read up on the machine learning literature, get up to date on the latest developments in these fields and familiarize yourself with all the relevant software (which we have already examined thoroughly in Chapter 8).

Similar to software developers, you may have experience dealing with users directly (e.g., if you are a web developer), which can aid you in your communication skills and requirements interpretation, giving you a chance to focus on the development of the hard skills.

Again, it’s essential that you become acquainted with all the big data technologies and get plenty of practice before marketing yourself as a data scientist.

12.3 Expanding Your Current Skill-Set as a Statistician or Machine Learning Practitioner

The transition from a statistician or a machine learning/A.I. practitioner to a data scientist is fairly smooth. That’s because someone in this field already has a working knowledge of the core of data science, i.e., data analysis and some object-oriented programming. Even if you are not familiar with all the specialized theory and know-how, or if your programming skills are very basic, you can easily pick up what you are missing and attain all the formal qualifications needed. What’s more, all the experience you have in your current field may be considered relevant data science experience (especially if you combine both statistics and machine learning). Unlike the other professions described in the previous section, a statistician’s or machine learning practitioner’s way of thinking is quite close to that of the data scientist and can easily evolve into it.

Let us now examine what you will need to learn in order to make this transition. We’ll examine coming from a statistics background (with packages like R being your bread and butter), a machine learning / A.I. background (with intelligent information systems being your playmates) or a mixed background, which is often the case for many new professionals. Let us get started now and see how you can organically transform your skills into those of a data scientist.

12.3.1 Statistics Background

Coming from a statistics background, you have a clear advantage over the classic programmer and the data-related professional: you already know the theory behind the majority of the data analysis needed, and you have some hands-on experience with it. Statistics may be a fairly straightforward subject to learn, but it takes a lot of work to master its use for real-world problems. Having succeeded at this daunting task, at least to some extent, means that you are good at learning challenging material, and so everything else in data science should be feasible for you. Before you know it, you can expand your skill-set so that it more closely resembles that of the data scientist as described in Chapter 5.

First, you will need to expand your theoretical knowledge so that it covers the particulars of the large modern datasets that make up the big data domain. Then you’ll have to get acquainted with the relevant paradigms that have been developed for it and invest some time in learning at least a couple of programming languages. If you are already familiar with R, SPSS, SAS or some other data analysis package, this shouldn’t be too difficult. Finally, you’ll need to learn some supplementary material to complete your skill-set, making your resume similar to a data scientist’s. Let’s look into each one of these in detail.

12.3.1.1 Theoretical Material to Learn

Statistics may be a great tool, but the data it deals with, at least in most cases, is somewhat limited in size. That’s not to say that a statistician never deals with large datasets, but the data that defines the data world today is a whole new ballgame. As mentioned in the first part of this book, big data is a big deal and a big challenge. So unless you have studied some of the new literature of data analysis, you are in for a big surprise. Big data requires a whole new approach, usually the MapReduce paradigm and several tools for dealing with unstructured as well as semi-structured data streams, as we saw in Chapter 8. First things first, though. Try to find as many sources as you can on MapReduce as well as on its ecosystems, the groups of toolboxes that have been developed to tackle big data using this paradigm.

Once you understand the concepts of the MapReduce architecture, you can learn about the ecosystems that employ it: mainly Hadoop and Spark. Familiarize yourself with their toolboxes and dig into their technical aspects (see subchapter 8.1 for details). If you don’t want to get your hands dirty, you may want to look into an integrated platform such as IBM InfoSphere’s BigInsights (commercial license). It all depends on how technical you want to get.

At a minimum, you will need to know all about big data, MapReduce and some software that employs these technologies. If you are not sure about how technical you want to get, it is recommended that you start with a high-level MapReduce platform. Once you are confident with it, proceed to learn lower-level details. All this material may be a bit daunting for someone who is not a computer scientist, so it is recommended that you take a university course on the subject (or a good MOOC) and get some hands-on experience in the field.

In parallel, you will need to get acquainted with some computer science theory. Pay close attention to algorithm complexity, algorithm design and data structures in particular. You don’t need to be an expert in information theory, though, as this is not too relevant to data science. The key thing to remember is that your resources are limited, and even if you are using a large cluster of computers, you’ll need to be able to run efficient programs on it. So don’t jump straight to coding; learn a few things about program design first.
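
A small illustration of why data structures and algorithm complexity matter, as a sketch in plain Python (the function names and the sample data are hypothetical). Both functions count how many events refer to a known ID, but the first scans a list for every event while the second builds a set once and then performs constant-time lookups on average.

```python
def count_known_ids_slow(events, known_ids):
    """known_ids is a list: each membership test scans it, O(n) per event."""
    return sum(1 for event in events if event in known_ids)

def count_known_ids_fast(events, known_ids):
    """Build a set once; each membership test is then O(1) on average."""
    known = set(known_ids)
    return sum(1 for event in events if event in known)

events = [3, 7, 7, 42, 100, 3]
known_ids = [3, 42, 99]

slow_count = count_known_ids_slow(events, known_ids)
fast_count = count_known_ids_fast(events, known_ids)
```

The two functions give the same answer. With a handful of values the difference is invisible, but with millions of events and IDs the list-based version does work proportional to the product of the two sizes, while the set-based one does work proportional to their sum.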

All this may seem a bit overwhelming, particularly if you haven’t ever taken any computer science courses. However, there are several courses you can take on these subjects; consult Appendix 1 for details. In addition, you can read one or more of the books listed in Appendix 3 or at least buy them to use as reference material.

12.3.1.2 Languages to Learn

The ability to write programs is probably the most important skill you will need in your transformation to a data scientist. Unlike other scientists, the data scientist must be a mean programmer who can efficiently implement his ideas into working programs (even if he is not as adept as a professional programmer). To do this, you will need to learn at least one programming language, preferably a powerful one. If you are not yet familiar with R, you will need to learn at least two.

Which specific languages you learn is not all that important as long as at least one of them is an object-oriented one. A very popular choice for this type of language is Python due to its simplicity and the fact that there is a plethora of packages available for use in your programs (saving you a lot of time in development and debugging). However, if you are up for it there are better options, the most popular of which are Java, C# and C++ (not to be confused with its predecessor, C, which is not an OO language). All of these languages (especially C++) are quite fast and therefore any programs written in them are quite scalable. You can find a large community of users for any one of these four OO languages, making the process of learning them much easier than it would be if you relied on a book or a course alone.

Ideally, you will employ several sources when learning any one of these languages. For someone who is not familiar with programming, this can be time consuming and frustrating at times. It is recommended that you pick a language, take a course on it (online or offline), read a book or two and practice a lot. You don’t have to design any fancy algorithms at this stage, just learn the syntax and the various functions of the language. You can implement some of the algorithms you already know from statistics and expand from there.
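
As an example of this kind of practice, here is a minimal sketch in Python that implements the sample mean and the unbiased sample variance by hand. The function names and the data are hypothetical. Checking your own result against Python's built-in statistics module is a simple way to confirm the implementation.

```python
import math
import statistics

def sample_mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)

def sample_variance(values):
    """Unbiased sample variance (divides by n - 1); needs at least two values."""
    m = sample_mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = sample_mean(data)
variance = sample_variance(data)

# Cross-check the hand-written version against the standard library.
matches_library = math.isclose(variance, statistics.variance(data))
```

Once something this simple works, you can move on to methods you know well from statistics, such as a t-test or a simple regression, and compare your results with those of your usual statistical package.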

A very important aspect of programming is the development environment you use (often referred to as an IDE). An excellent choice for this is Eclipse, which has various versions for a variety of programming languages. Python has its own development environment, but it is very basic, so you may want to consider getting acquainted with Eclipse even if you decide on that language. Eclipse is free, open-source software, and there are plenty of tutorials for it. Consult the relevant appendices for details.

If you decide to get more technical with Hadoop, you will also need to learn Pig and Hive. The former is used for the MapReduce programming, while the latter is for creating and running queries for data that is spread over the cluster. If you decide to avoid low-level programming in MapReduce, you will still need to learn the high-level big data platform of your choice well enough to be able to customize it. This entails learning some programming for it. If you know R already, that’s a big plus.

If all this programming sounds intimidating, don’t let it scare you off. Once you know one programming language, it is significantly easier to learn other ones. The logic is usually very similar and you only need to familiarize yourself with its particular characteristics: syntax, packages, functions, etc. Besides, there is nothing that’s too difficult to overcome with enough practice.

12.3.1.3 Other Material to Get Acquainted With

Apart from the above, you will need to learn a query language such as SQL. This is much simpler than the aforementioned languages, and it shouldn’t take you more than a couple of weeks, although you may want to practice a bit after that to make it second nature for you. SQL is designed to work with structured data such as the data that you find in databases. This is quite different from big data, which is usually unstructured, but you may well need to retrieve some data from the company’s database, and you’ll need to be able to do that on your own. Also, knowing SQL is very useful as a basis since there are SQL-like languages that are used with big data, e.g., AQL (Annotation Query Language) and the query language of Hive, and it is a helpful reference point when you get to know NoSQL (Not Only SQL) databases.
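
To give a flavor of what such a query looks like, here is a minimal sketch that uses Python's built-in sqlite3 module with an in-memory database. The table name, column names and values are made up for illustration; a company database would have its own schema and would usually run on a different database system.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 20.0), ("carol", 7.5)],
)

# Total amount per customer, largest total first.
rows = conn.execute(
    "SELECT customer, SUM(amount) AS total "
    "FROM orders GROUP BY customer ORDER BY total DESC"
).fetchall()
conn.close()
```

The SELECT ... GROUP BY ... ORDER BY pattern shown here covers a large share of everyday data retrieval, and the same pattern reappears in the SQL-like languages of the big data world.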

Another useful thing to learn, which does not fit in any of the bins above, is graph analysis. A relatively old field of mathematics, it has recently experienced a resurgence as its benefits have found fertile ground in the big data world. In fact, GraphLab is specialized software (which is also free) that deals with graph data for processing and visualization. Even though it is not a necessary thing to learn, it would be useful to at least learn it at a basic level since you already have the background for it.
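
For readers new to the topic, the basic ideas can be tried out without any specialized software. The following is a minimal sketch in plain Python (it does not use GraphLab, and all names and data are hypothetical): it stores an undirected graph as a dictionary of neighbor sets, computes each node's degree, and finds the length of the shortest path between two nodes with a breadth-first search.

```python
from collections import deque

def build_graph(edges):
    """Build an undirected graph as a dict mapping each node to its neighbors."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def shortest_path_length(graph, start, goal):
    """Breadth-first search; returns the number of edges, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor == goal:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None

edges = [("ann", "bob"), ("bob", "cat"), ("cat", "dan"), ("ann", "eve")]
graph = build_graph(edges)

degrees = {node: len(neighbors) for node, neighbors in graph.items()}
hops = shortest_path_length(graph, "eve", "dan")
```

Degree and shortest-path length are among the simplest graph measures, and they are a reasonable starting point before moving on to dedicated tools.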

If you learn all the above, you have a fighting chance of entering the data science field gracefully and without feeling inferior to the more technical professionals who aspire to the same thing. Learning what you need may take from a few months to two years or more. During that time you may also acquire some useful experience, especially if you plan your course carefully. A small bonus is that all this will beef up your resume with useful skills that are in demand beyond the data science field. Sounds like a good tradeoff for your time, wouldn’t you agree?

12.3.2 Machine Learning / A.I. Background

A background like this is ideal for a data scientist. It combines programming, some knowledge of statistics, and often some hacking skills. More importantly, it provides you with access to the state-of-the-art research in the technology that constitutes the heart of data science: machine learning.

Even an experienced machine learning / A.I. practitioner is bound to have some gaps in his knowledge when entering the data science domain. In that case, here are some useful things to consider looking into before marketing yourself as a data scientist.

12.3.2.1 Theoretical Material to Learn

Being a machine learning / A.I. practitioner may give you some statistical knowledge, but unless you are doing research on hybrid approaches to machine learning, your statistics may need some upgrading. You may want to expand your know-how to include less well-known methods and become familiar with the theory behind the methods you already know. It would be useful to do this while learning R, if you are not already familiar with it. There are other statistical packages, of course, but if you have to choose one to learn well for regular use, it should be R. In addition to being a very robust tool for statistical analysis (and data analysis in general), it can provide you with insight into how certain methods work. Moreover, it is designed for a wide variety of users and does not assume more than a basic knowledge of statistics, which as a machine learning / A.I. practitioner you should have. Furthermore, you may already be familiar with Matlab, so learning R should be a piece of cake for you.

Learning statistics in depth can be a daunting task, so it is recommended that you look into university courses on the subject, even online ones. A class or MOOC that includes some hands-on practice with a statistical package would be most beneficial.

However, statistics is merely the beginning. You will need to get acquainted with the MapReduce paradigm, as well as distributed computing in general. If you feel comfortable enough with the paradigm, consider designing a mapper and a reducer for practice in a language of your choice. Your learning style will dictate if you need to take a course on the subject or if a good book and some tutorials will be sufficient.
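
A practice exercise of this kind can start very small. The sketch below is the classic word count, written in plain Python and run on a single machine; it only imitates the map, shuffle and reduce steps and does not use Hadoop or any other framework. The function names and input lines are hypothetical.

```python
from collections import defaultdict

def mapper(line):
    """Map step: emit a (word, 1) pair for every word in a line of text."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer(key, values):
    """Reduce step: combine all the values for one key into a single result."""
    return key, sum(values)

lines = ["big data is big", "data science uses data"]

mapped = [pair for line in lines for pair in mapper(line)]
word_counts = dict(reducer(key, values) for key, values in shuffle(mapped).items())
```

In a real cluster the mappers and reducers run on different machines and the framework takes care of the shuffle step, but the division of labor between the two functions is the same.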

Finally, you will need to learn a few things about databases, if you are not already familiar with them. A good working knowledge of the various types of data structures can go a long way. Most languages, even the less robust ones like Python, can handle multiple data structures. However, if you don’t know them in some depth, you may not be able to take advantage of these features. Data science involves a lot of resource management, and using the right data structures can help you create efficient programs and access diverse data sources more easily.

12.3.2.2 Languages to Learn

Many machine learning and A.I. practitioners get by using Matlab, Octave or Python. All of these are great tools, but not sufficient for a data scientist. Invest some time to learn one of the more robust languages, such as Java, Scala, C++ or C#, for starters. If languages are not your thing and you had a hard time learning Matlab, then Python is always a popular option. There are also several courses available. Python is not as intuitive as Matlab, but it has a wide variety of packages and is completely free. It is also adept at handling large datasets.

In addition, you should have at least some working knowledge of R. Apart from being a very intuitive and high-level way to implement any statistical method, it has robust parallel computing capabilities, good memory management and a wide variety of packages including one for large datasets. There is also a very big user community for R. New statistical methods created by researchers are usually first implemented and made available in R. Finally, unlike other statistical programs, R (and all its packages) is completely free.

Finally, you will need to get familiar with at least one of the big data integrated platforms, such as IBM’s BigInsights, and with the underlying technology, Hadoop. If you are more interested in working on Hadoop without relying on a platform, you may want to learn about Pig and Hive, for starters, so that you can create your own Hadoop code.

Learning a language can be time consuming, but as a machine learning/A.I. practitioner you are already good with algorithms. Therefore, learning a language should be manageable for you. If you have time, you may want to learn more than two languages, including R, as different employers often value different languages. So once you master one of the robust languages, get at least some working knowledge of another language.

12.3.2.3 Other Material to Get Acquainted With

As mentioned in Section 12.3.1.3, some working knowledge of SQL is essential to your role as a data scientist. If you haven’t touched this language since your university days, you may want to refresh your skills and practice on a variety of datasets. While you are at it, you may want to get acquainted with NoSQL, as well, since that’s what is used for data in the big data domain.

If you are confident about your programming skills and enjoy algorithm design, you may prefer to write your programs in R or some other language, linking them to the platform you are using. Make sure you learn how this is done and practice on some benchmark datasets. This could be more useful than mastering a language since there is no one particular language (yet) that you can rely on completely for all your data science endeavors.

In addition, you will need to practice building data analysis models using your newly acquired knowledge of statistics. These models don’t have to be purely statistical, but the more statistical elements you incorporate in them the better. Also, look into ways to combine different techniques in these models and test them on some benchmark datasets. Machine learning and A.I. techniques are excellent, but statistical techniques can be quite good, especially when dealing with numeric data. Since data science applications usually deal with this kind of data, you may want to make more use of statistics in your models.

12.3.3 Mixed Background

Coming from a mixed background in statistics and machine learning or A.I. has a lot of advantages and makes the transition to data science easier than from any other background. You should already be familiar with statistics theory and various machine learning techniques, and you may know a programming language or two, but you need to become acquainted with parallel computing, the MapReduce paradigm and databases. If you don’t know an OOP language well, you will need to expand your knowledge of programming. Also, you will need to work on your data mining techniques, practice with various benchmark datasets and refine your R programming.

If you come from a mixed background, you may want to invest in gaining more experience in data analysis because that is your forte. All the experience you already have is an invaluable asset, so make sure you integrate that into the data scientist core that you are building. Don’t hesitate to practice on datasets you have already worked with by incorporating the new methods you learn. Focus on better resource management and honing your hacking skills. Review subchapter 6.3 on how to get initial experience, and study the data science process (Chapter 11) so that it becomes second nature to you.

Whether you are a statistics person, a machine learning practitioner, or a combination of both, you will need to enhance your skills before you are ready to enter the data science market. Focus on your strengths and think of ways to complement them with the knowledge and skills you are missing. Just be sure that along with all of the skills you learn, you also develop the mindset needed for success that we saw in Chapter 4 so that you are more than a moving data science library. It is best to view all of these skills as ways to expand your thinking and develop yourself to tackle data science problems. As the years go by, the tools are bound to change, but what you’ve learned through cultivating them will remain. It’s this aspect of education that will define you as a data scientist and, if you play your cards right, it’s what will help you land your first data scientist job.

12.4 Expanding Your Current Skill-Set as a Data-Related Professional

As a professional in a data-related field, you are already familiar with data types and structured data, so your emphasis should be on other aspects of data science such as OO programming, data visualization, data analysis, honing your communication skills and getting some hands-on experience.

In this section, we’ll look into three main data-related jobs: database administrator, data architect/modeler and BI analyst as well as how you can make the transition to data science from each one of them. For a full description of the required skills for a data scientist role, you may refer to the corresponding chapters (Chapter 8 for the necessary software and Chapters 4-7 for all the soft skills and practices).

12.4.1 Database Administrator

If you are a database administrator, you have a clear understanding of what a clean and ordered dataset looks like, how data can be gathered from a variety of sources and the different types of data that exist. You also have expertise in one or more database management systems and SQL-based software.

You are probably familiar with user requirements and able to interpret what they mean. You are familiar with querying strategies and are confident about importing and exporting (mainly structured) data in various formats from a database or a data warehouse.

In order to migrate to the data science world effectively, you’ll need to get acquainted with the big data technologies, starting with Hadoop. This shouldn’t be difficult, considering that at least one of the components of Hadoop (Hive) is queried in a language similar to SQL. In addition, if you are familiar with database schemas, creating an HBase database shouldn’t be much of a challenge. Finally, NoSQL is not a single language but a family of non-relational data stores; several of them offer SQL-like query interfaces, so your SQL background should make them somewhat easier for you to learn.
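To make the point about Hive’s resemblance to SQL a little more concrete, here is a minimal sketch of the kind of aggregate query a database administrator writes routinely. It assumes Python’s built-in sqlite3 module as a stand-in (it is not Hive), and the orders table with its region and amount columns is hypothetical. A query of this shape should look much the same when written against a Hive table.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in for illustration;
# this is not Hive, but the SELECT / GROUP BY syntax is very similar.
conn = sqlite3.connect(":memory:")

# Hypothetical table and data, invented for this example.
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 60.0), ("south", 40.0)],
)

# A familiar aggregate query: total and average order amount per region.
query = """
    SELECT region, SUM(amount) AS total, AVG(amount) AS average
    FROM orders
    GROUP BY region
    ORDER BY region
"""
summary = conn.execute(query).fetchall()
conn.close()

for region, total, average in summary:
    print(f"{region}: total={total:.2f}, average={average:.2f}")
```

The same habit of thinking in terms of grouping and aggregating carries over directly to the data analysis packages discussed later in this section.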

Learning programming may be more challenging for you, but chances are that you are already somewhat familiar with programming even if you are not a master of an OO language. After taking a couple of courses (or reading a few good books), you should be able to handle that aspect of data science, too. If you haven’t done any programming before, Python may be a good place to start as it’s probably the simplest OO language (though not the most powerful one).

You will need to invest time to learn visualization. Programs like Tableau and Flare, among many others, are great options for this task. Although they are not too difficult, they will require some time to learn and practice.

Once you’ve got your mind adjusted to this learning sprint, you may want to step it up a notch by tackling the statistics and machine learning aspect of the field. Regardless of your technical background, you’ll need to devote quite some time to this, especially if you haven’t seen a statistics book since your university days. In order to save time and to make the whole process more interesting, you may want to learn this material while taking up R, Matlab or some other data analysis package. You’ll find that a lot of it will make more sense once you see it in practice. You’ll also come to appreciate these wonderful pieces of software. Note that you’ll need to learn about vectorization as well, that is, expressing a computation as operations on whole vectors and matrices rather than as element-by-element loops. Some hands-on knowledge of linear algebra can be very beneficial for that.
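If vectorization is new to you, the following minimal sketch may help. It assumes Python with the NumPy library rather than R or Matlab (the idea carries over to both), and the small data matrix and the two function names are hypothetical, chosen only for illustration. Both functions subtract each column’s mean from that column: the first with explicit loops, the second as a single whole-array operation.

```python
import numpy as np

# Hypothetical measurements: rows are observations, columns are features.
data = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
    [4.0, 40.0],
])

def center_with_loops(matrix):
    """Subtract each column's mean, one element at a time."""
    n_rows, n_cols = matrix.shape
    result = np.empty_like(matrix)
    for j in range(n_cols):
        col_sum = 0.0
        for i in range(n_rows):
            col_sum += matrix[i, j]
        col_mean = col_sum / n_rows
        for i in range(n_rows):
            result[i, j] = matrix[i, j] - col_mean
    return result

def center_vectorized(matrix):
    """Same computation expressed as one operation on the whole array."""
    # mean(axis=0) yields one mean per column; the subtraction is then
    # applied to every row at once.
    return matrix - matrix.mean(axis=0)

looped = center_with_loops(data)
vectorized = center_vectorized(data)
print(np.allclose(looped, vectorized))
```

The vectorized version is shorter and, on large datasets, typically much faster, since the looping happens inside the library’s compiled code. Thinking of the data as a matrix, as linear algebra encourages, is what makes this style feel natural.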

Moreover, you may want to hone your communication skills, cultivating the art of storytelling. If you find that challenging, learn about the business world, read a few articles about investments and other business-related topics, speak to various business people and familiarize yourself with the business mindset. (Good documentaries on the topic may be useful as well.) You don’t have to get an MBA in order to do that, although an intro to economics or finance course may be very helpful. Even if you are great at communicating technical details efficiently, you’ll need to be able to communicate holistically as well, establishing links with non-technical people and expressing things in an engaging way, as if you were telling them a very interesting story. Report-writing may also help in that respect, especially if your reports are targeted at people unfamiliar with the technical aspect of your field.

Finally, it’s a good idea to acquire some experience by applying what you have learned to benchmark problems and/or datasets from online competitions.

12.4.2 Data Architect/Modeler

If you are a data architect (data modeler), you are probably familiar with the business side of the database world and have substantial experience with requirements and planning. You already possess many of the skills of a database administrator, and you probably have hands-on experience dealing with a variety of data types. All these are essential skills that can give you a good head start in your data science endeavors.

To take better advantage of this head start, you may want to invest in the scientific know-how related to the field or expand the business skills you have, aiming at a senior data scientist post. The former includes statistics and machine learning, which you can learn through courses and reading a few books. As for business skills, you may want to invest in developing project management skills, getting acquainted with the corresponding software and learning more about the data science process and how it can be broken down into specific independent tasks that can be delegated to others. Naturally, you’ll need to be comfortable doing each one of these tasks yourself, so some hands-on experience with the data science process is also essential (refer to Chapter 11 for details).

In order to gain this experience, you need to familiarize yourself with the big data technologies. The best place to start would be the Hadoop databases, namely Hive and HBase. NoSQL is also useful to know and in demand. As you are already familiar with database design, understanding big data technologies should come more naturally to you, making your transition to the data science world smoother. You’ll need to expand your knowledge to include MapReduce and all the other aspects of Hadoop.

As a data architect/modeler, you may already be familiar with some OO programming. If not, you will need to learn at least one OO language fluently. This can facilitate the next step: data analysis tools.

Just like the database admin, you’ll need to spend quite some time on learning data analysis tools, perhaps while you learn about the statistics and machine learning methods you will be using. You don’t need to know everything about these packages, but knowing some programming can be to your advantage. Try implementing various programming methods in your data analysis package of choice (e.g., R). Note that even though R and Matlab are not advertised as OO languages, they do support classes and every single thing in their workspaces is treated as an object. So don’t be fooled by their simple interfaces: they each have a real beast at their core! Unfortunately, there is no way around learning vectorization, as this is essential if you want to create efficient programs in either one of these packages. A solid understanding of linear algebra can be quite helpful for that.

Finally, although your communication skills are probably quite decent, you may want to practice presenting things, such as the models you develop and plots of the data, in a storytelling fashion, as this is something very useful in a data scientist role. This can be effectively combined with learning about data visualization.

12.4.3 Business Intelligence Analyst

Working in business intelligence (BI) gives you a firm grasp of the value of data, particularly in a business setting. If you are in that field, you can readily see how the big data movement can benefit the business world through BI. It is assumed that you already know about data types and that data visualization is your bread and butter. You may not be familiar with the particular data analysis packages that were described in this book, but it shouldn’t take you long to familiarize yourself with any one of them. You are probably comfortable with statistics although you should expand your repertoire of statistical methods and add some machine learning techniques into the mix as well. Start from what you already know, or something you are somewhat familiar with, and you can’t go wrong.

If you aren’t much of a programmer, you will want to take a couple of courses in an OO language of your choice (Python is probably the easiest option). Courses like “Intro to Programming” would be ideal. If you are so inclined, to save some time you can study the book Machine Learning in Action, where a variety of machine learning techniques are introduced and implementations in Python are made available. You’ll still need to read up on machine learning from other sources or take a course on the subject to ensure you know enough. In parallel, you can expand your statistical knowledge, as this is bound to have a synergistic effect that will complement your studies in machine learning.

Making the transition to the data science world will, of course, entail familiarizing yourself with big data technologies. If you are not entirely comfortable with SQL, master it. Then learn about NoSQL, Hive and HBase. Afterwards, learning about HDFS, MapReduce and the other Hadoop components should be easier.

Learning a data analysis package is the next logical step for your transition to the data science world. Both R and Matlab/Octave are great places to start, since learning either one of them will enable you to run virtually any data analysis method you are likely to need, but if you are already working with SPSS or SAS, you can expand your data analysis skills through them. Keep in mind that an open source alternative is often preferred by companies, unless, of course, they already have a license for some proprietary software.

Finally, you will want to polish your communication skills, particularly when it comes to presenting what you have found to people who are unfamiliar with the problem domain and lack technical expertise. This is common in the BI world, so it shouldn’t be too challenging for you. Still, more practice would be useful. You’ll also need to get some experience using benchmark datasets or competition problems (see subchapter 6.3 for details).

12.5 Developing the Data Scientist’s Skill-Set as a Student

If you are a student, you may feel left behind due to your limited experience in a field related to data science. However, you may have an advantage over the professionals who are about to enter the field, because you have the opportunity to cultivate a more balanced skill-set from the very beginning. If you take full advantage of this, it may save you time in your development as a data scientist since you’ll be honing your skills in a more organic way.

The best way to start would be to figure out your strengths and weaknesses in relation to the skill-set described at the beginning of this chapter. Afterwards, you can develop a plan to cultivate an existing strength and a weakness at the same time so that you are not overwhelmed (in the case of working only on a weakness) or overconfident (in the case of working only on a strength). So if you are good with programming but not so good with business concepts, you can work on these two skills in parallel. Namely, you can expand your programming skill by either learning the language you know in more depth or by learning a new language that is used in data science. At the same time, you can take a course on business models, micro-economics, finance or business administration to get a feel for how companies and the economy in general function. Reading business articles can also be very helpful in that respect.

You need to be sure to incorporate a lot of hands-on exercises while you are developing those skills. All the data analysis techniques you learn are useless if you don’t know how to apply them to a specific data analysis problem. Even the most mundane things can become quite interesting if you engage with them with a hands-on approach. Enjoying the whole process of learning can actually be a great benefit to your development as a data scientist. If you feel uninspired sometimes, it would be useful to read up on stories of successful data scientists and talk to them if possible (via a Meetup group, for example). The more tangible the role of data scientist is for you, the easier it will be to eventually adopt. The skills you cultivate will then make more sense and be more meaningful. This is key in the long run since the initial enthusiasm does not always linger till the end of your training. However, if you are committed to your goal, you may find ways to revive it and develop a more lasting passion for the role, something that will be reflected in the quality of your work.

12.6 Key Points

  • The transition from being a student, OO programmer, software developer or someone in another related career track to data science can be smooth and relatively easy, provided that you have the focus, discipline and determination to do it.
  • As an OO programmer, you need to make sure that you do the following (preferably in this order):
    • Learn vectorization
    • Learn about data analysis tools such as Matlab, R, etc.
    • Study statistics and machine learning
    • Get acquainted with big data tech
    • Get acquainted with how end-users think and understand them
  • As a software prototype developer, you need to include in your to-do list all the things mentioned for the OO programmer (except the last one).
  • If you are in another career track related to the above, you need to adjust the list of things to do in the OO programmer section to your specific needs, giving emphasis to programming and data analysis tools and theory.
  • As an OO programmer you’ll need to get plenty of practice on large datasets such as the ones found on the Kaggle site and in the UCI machine learning repository.
  • As a statistician you’ll need to learn more about machine learning and programming, get acquainted with big data technologies and expand your business skills, preferably in that order.
  • As a ML/A.I. practitioner, you’ll need to learn more about statistics, expand your programming skills, learn about big data technologies and expand your business skills, preferably in that order.
  • As someone who knows both statistics and ML/A.I., you’ll need to focus on expanding your programming skills, getting acquainted with big data technologies and learning more about the business world, preferably in that order.
  • As a professional in a data-related field, you are already familiar with data types and structured data, so your emphasis should be on other aspects of data science such as OO programming, data visualization, data analysis, honing your communication skills and getting some hands-on experience.
  • If you are a DB administrator, you’ll need to add the following to your to-do list (recommended in this particular order):
    • Big data technology, starting with Hive and NoSQL
    • OO programming language, possibly Python since it’s easier
    • Data visualization software such as Tableau, Flare, etc.
    • Statistics and machine learning, preferably parallel to a data analysis package such as R or Matlab
    • Communication skills, focusing on storytelling and presentations
    • Practice with benchmarks or competition datasets
  • If you are a data architect/modeler, you’ll need to focus on the following things (preferably in this order):
    • Statistics and machine learning
    • Project management, if you aspire to get a senior data scientist position
    • Big data technology, starting with Hive, HBase and NoSQL
    • OO programming (expand existing knowledge so that you have fluency in at least one language)
    • Data analysis packages such as R or Matlab (possibly combined with OO programming)
    • Data visualization and presentation skills (storytelling)
    • Practice with benchmarks or competition datasets
  • If you are a BI analyst, you’ll need to concentrate on the following skills (in the suggested order, if possible):
    • OO programming with a simple language like Python if you are new to programming. (Possibly combine this with studying machine learning via Machine Learning in Action.)
    • SQL and big data technology, starting with Hive, NoSQL and HBase.
    • Data analysis tools such as R or Matlab (though expanding the knowledge of one you already know is also an option).
    • Communication skills (storytelling).
    • Practice with benchmarks or competition datasets.
  • If you are a student, you need to learn all of the core skills of data science, data analysis, programming, big data technologies, business skills, etc., in a balanced manner. It is recommended that you identify your strengths and weaknesses first.
  • Regardless of which discipline you are coming from, it is very important to pay close attention to communication skills as they are crucial for a data science role.