Chapter 9
Learning New Things and Tackling Problems

Learning new things is an integral part of being a data scientist, especially since innovations are fairly common in the field. If you are new to data science, you are bound to have gaps in your knowledge, so learning the missing material is essential. Of course, continuous learning is something you would do in any profession if you wanted to evolve professionally. In data science, however, new programs come out all the time, so even if you were the best data scientist in the world right now, your skills would become somewhat obsolete in a few years if you did not keep abreast of developments in the field.

Tackling problems is similar to learning new things in that it requires the same flexibility and mental agility. Although this is common with many IT-related professions, in data science problems are a bit more commonplace, mainly because it’s an interdisciplinary field. However, by tackling the problems that arise with a positive attitude and a creative approach, you’ll also learn more new things than you’d normally be able to learn otherwise.

In this chapter we’ll examine various ways that you can upgrade your knowledge and, more importantly, your skill-set right now as well as while on the job. In the first four subchapters you’ll find out about how you can learn from workshops, conferences, online courses (often referred to as MOOCs) and data science groups. In the later subchapters, you’ll learn about the various problems that may arise in your work as a data scientist: namely, resource issues, requirements issues, insufficient know-how for a task you undertake, and integration issues.

9.1 Workshops

Workshops are the most efficient way to learn something new, especially when it comes to technical know-how. Fortunately, due to the increased popularity of the data science field there are numerous workshops available from which to learn any aspect of the field.

Workshops tend to be somewhat expensive (several hundred dollars each) but they are a good investment, especially if you are good at picking up new knowledge and know-how. Free alternatives for learning new things will be covered in subchapter 9.3. How to find the best workshops will be discussed later in this section.

So why bother with workshops if there are other ways to learn new things? Well, workshops provide networking opportunities, can enhance your resume (if you have no other data science related qualifications), and often provide more useful knowledge and know-how than university courses, regardless of the university. This is because university courses are often based on the available literature in scientific books, journal papers and conference proceedings and are designed to give students the foundation on which to build more advanced knowledge, whereas workshops tend to focus on know-how that you can apply directly.

Workshops are also very time efficient, squeezing into a few hours material that would normally take days to learn on your own. They are often hard and demand all of your concentration, but they enable you to learn something you would normally not have the time or resources to learn on your own.

The key things to keep in mind when choosing to register for a workshop are what you are going to learn and how it can be useful for your job as a data scientist. This sounds obvious, but it is really easy to get sold on workshops that you don’t need since they all appear quite appealing at the sites that promote them.

To ensure that you stay focused on the appropriate workshops, make a list of the skills and knowledge that you want or need, then research workshops that are being offered. Update your list if you find workshops that offer something you haven’t thought of; if there are several workshops that offer it, it is usually something useful to know in the industry. Finally, pick the workshop that is most suitable for what you want or need, taking into account its location, the time of the year it’s offered and, of course, its price. You can’t go wrong with a strategy like that.

9.2 Conferences

Conferences are like workshops but are designed for larger groups of people. They offer some innovative pieces of knowledge based on research and case studies as well as more foundational information for those who are newer to the subject of the conference. More often than not, conferences offer workshops to attract more people. Note that in this book we are referring to non-academic conferences, since the academic ones have a different mission and scope.

Conferences are a great way to learn a variety of new things in a short period of time, meet new people, exchange war stories and get acquainted with other challenges in the field. Conferences are quite interactive and provide great mental stimulation, very similar to some good university classes, but without the stress of exams and written assignments. They are usually costly, making them a viable option mainly for full-time professionals. However, given the benefits they can provide, they are a worthy alternative for anyone interested in expanding their skill-set and data science knowledge. Fortunately, companies often cover at least some (if not all) of the expenses of their employees who are participating in such conferences.

The big advantage of this option for learning new things is that it is very time efficient, especially when combined with a couple of workshops. If you can relate this new knowledge to an existing problem you are facing, that’s even better. The bottom line is that if you are open to new things, a conference can prove to be a very fruitful experience that may enrich your understanding of data science and your particular role, too. You can find out about the various conferences that are being offered by searching the web directly or through the various data science groups (see subchapter 9.4).

9.3 Online Courses

Although the world today has a lot of issues, it’s also the first time in our history that refined knowledge on a large variety of subjects is publicly available at no cost. This is through the various online courses, particularly MOOCs.

The first MOOCs appeared in 2008 and have grown in popularity and in variety since then. The largest MOOC provider, Coursera, is an initiative of two faculty members of Stanford University, Prof. Daphne Koller and Prof. Andrew Ng. The courses on this site span from calculus to philosophy to history of art. Since one of the founders, Prof. Ng, is a leading machine learning expert, there are several worthwhile courses on data science (Prof. Ng’s course “Machine Learning” is one of the best MOOCs out there, not just within the Coursera site). Coursera’s website (www.coursera.org) is user-friendly and straightforward, and so are its applications for smartphones and tablets to facilitate the use of the site’s content while you’re on the move.

There are several other places where you can find MOOCs, the most well-known of which are:

  • Udacity – this MOOC site covers a variety of courses on science (especially computer science), design and business.
  • edX – focusing mainly on science, this site also offers some courses on humanities and business/economics.
  • Khan Academy – a great site for the younger students, this is a good resource for mathematics and science courses.
  • Codecademy – if you want to learn or just practice programming, this is a good place to start. The focus is on web-based programming, though.
  • Open Learning Initiative – having a relatively short collection of MOOCs, this site focuses on quality and variety. It is still quite new but appears to be promising.
  • Open Yale Courses – a very well-organized archive of courses offered in the famous university, this is a place where you can find good quality material (videos, transcripts, etc.) on various subjects to download and use at your own pace.
  • OpenLearn (Open University) – one of the most established free learning resources, which offers high-veracity information. It has a variety of courses (even language-related ones) and a large community of users. Worth looking into if you have time. Note that this is a serious MOOC provider, requiring a certain level of commitment.
  • canvas.net – a great MOOC site with a large variety of subjects. Somewhat irrelevant to data science, however.
  • openHPI – a relatively small MOOC provider focusing on web-related technologies. Still, it is quite relevant to data science.
  • NovoEd – an interesting place for MOOCs, mainly on business and management as well as a few other subjects. Despite its limited variety, it has some very good courses from Stanford University.
  • MongoDB – this is a highly specialized MOOC provider focusing on the MongoDB database framework. Still, it is very relevant to data science, especially if you are interested in this particular piece of big data technology.
  • Open2Study – this is an excellent resource for small courses (lasting 4 weeks). It focuses on business and management MOOCs, but also has some computer science courses and a few other kinds of MOOCs.

All the above alternatives are great, but it’s good to keep in mind that none of them come anywhere close to the Coursera site in terms of quality and popularity (a typical data science course at Coursera has 50,000 to 100,000 students enrolled). In addition, the courses of Coursera are quite interactive, and if you commit to them, they can be a very enjoyable experience. However, if you can’t find the course you are looking for on that particular site, it is worth taking a look at the alternative MOOC providers to supplement your learning.

The (Coursera) MOOCs on data science that are definitely worth looking into are:

  • Web Intelligence and Big Data (Indian Institute of Technology) – a great place to learn about big data, the MapReduce algorithm and how they relate to the Web (which is one of the main sources of big data nowadays). The instructor, Prof. Gautam Shroff, is very knowledgeable and methodical, making this course a must for any aspiring data scientist. A Certificate of Accomplishment (CA) is available.
  • Statistics One (Princeton) – a great course for learning the basics and more of this fundamental subject. Also, a great place to practice using the R data analysis platform. No CA is available, though.
  • Computing for Data Analysis (Johns Hopkins) – a great place to learn about R and use it in various data analysis applications. Difficult if you are unfamiliar with R since it is quite short (4 weeks). CA is available.
  • Machine Learning (Stanford) – as mentioned earlier, this is the course given by one of the founders of Coursera, Prof. Ng. One of the best courses available on this subject, with a variety of topics related to data science. Note that this is just an introductory course, though, so it doesn’t go into depth on any of the methods presented in it. Programming language used: Matlab/Octave. CA is available.
  • Data Analysis (Johns Hopkins) – this course focuses on data analysis methodology using R, so some familiarity with it is very useful. CA is available.
  • Statistics: Making Sense of Data (University of Toronto) – a great place to learn the basics of statistics and have a good time doing so. Plenty of examples, interesting case studies and very charismatic instructors. CA is available.
  • Introduction to Data Science (University of Washington) – this is a must for any aspiring data scientist. Unfortunately, no new sessions of this MOOC have been scheduled for months, but the lectures of the course remain available on the MOOC’s webpage. Programming languages used in this course: Python (ver. 2.7), R and SQL. CA is available.
  • Machine Learning (University of Washington) – this is different from Prof. Ng’s course, but it is quite good. It covers topics that the other machine learning course doesn’t, and you can use whatever programming language you prefer for the assignments. CA is available.
  • Passion Driven Statistics (Wesleyan) – an interesting course covering topics beyond an introductory statistics course. However, the statistics package used is SAS (not the easiest tool to learn), and it is not available in certain countries due to licensing issues. CA is available.
  • Introduction to Databases (Stanford) – one of the first courses on Coursera, this MOOC provides a good introduction to the subject although it is recommended that you have a solid background in computer science already. This is a self-study course, so you can do it at your own pace. No information about CA availability.

Note that as more and more universities develop MOOCs, there may be new data science courses that are not on this list. So keep your eyes open and ask around. Oftentimes, the Coursera forums are a great place to get informed about courses similar to the ones you are taking, plus you can get some useful feedback on how good they are from classmates of yours who have taken them. A great place to get additional evaluations of the various courses is Coursetalk (coursetalk.org), so check this out too before enrolling in a course to make the most of your time. Finally, Coursera has recently started offering specializations, which are basically amalgamations of courses from a university with an exam or project at the end and a specialized certificate if you pass all the classes. Not all specializations are free. Currently there is one specialization for data science, offered by the Johns Hopkins University.

9.4 Data Science Groups

Data science groups are one of the most enjoyable ways to learn, especially if you like socializing and networking, and they are popping up all over the place. If you live in a large city, the chances are that you will find one in your area. Data science groups are a great place to network and make acquaintances that may lead to job opportunities. You can read more about that in Chapter 13.

Since data science is a buzzword, some data science groups use the name in order to get a lot of people involved without living up to their promise. So always check out a group’s organizer(s) before joining it since time is a very valuable resource and there may be better ways of using it for your data science endeavors. If the organizer is an actual data scientist or someone who strikes you as very knowledgeable on the subject, you can hop onboard. In addition, check out the events that the group hosts. If they include a lot of talks by respectable professionals who are related to the field, it is a worthwhile choice. If most of the meetings are just conversations among the members, maybe you can skip that one. Finally, make sure that there are several members in the group (the more the better) to ensure that you will meet lots of interesting professionals, improving your chances for learning. Note that a group about machine learning or data mining is also relevant, so don’t consider only groups with “data science” in their names.

You can learn from a data science group in (at least) two ways. First of all, you can attend the events where a knowledgeable speaker presents a data science topic. This person could be a researcher in the field or an industry professional, possibly even a developer of some promising new big data program. We already talked a bit about Storm and how great an alternative it is to Hadoop. This piece of software became popular through the developers’ various presentations. Imagine attending one such presentation and being one of the first people to learn about the software. If you played your cards right and acted on the knowledge, you would have an edge when it came to this program. And if a company was looking for someone who was familiar with it, you’d be one of the people to be shortlisted.

The other way to learn with a data science group is through active conversation with the other members of the group. (This approach is useful for all kinds of professional events, by the way.) This means actively participating in a conversation, asking meaningful questions, providing brief and focused replies, etc. If you enter a conversation and let yourself vent about the problems you are facing at work or about topics that are of no interest to the others, don’t expect to learn much or keep the other participants of the conversation interested for long. Listening is the key here as well as being able to ask questions that will make the other person think and offer meaningful responses, engaging them in a creative debate.

Apart from all these sources of learning, there are also the various data science websites and blogs, which you are probably somewhat familiar with. Books may be great at providing you with some reliable fundamental knowledge about the field, but when it comes to staying updated, nothing beats the Web. It would be a futile task to try to list all the various online resources on data science, especially considering how quickly the Web is changing. However, there are a few that are definitely worth looking into (see Appendices 1 and 2). One that seems particularly easy to digest is the Data Science 101 blog – http://datascience101.wordpress.com.

9.5 Requirements Issues

Requirements issues are a type of problem you may encounter although this greatly depends on the company you are in (or your clients, if you are working as an independent consultant or freelancer). Many IT professionals encounter problems with requirements, and it is not unusual to see similar problems in a data science setting. Issues with requirements have to do with the miscommunication and misunderstanding of a project’s requirements as well as how they are implemented in a working prototype. That’s surely something that some good communication can fix, right?

Well, it’s a bit more complicated than that. When two parties of completely different backgrounds and priorities communicate, even if they communicate well, there may be subtle differences in how things settle after they are filtered by what’s feasible (taking into account the resource limitations described previously). For example, your manager may want you to create a prediction model for a particular dataset, but after examining it you realize that this may not be feasible with the data you have or the tools you can muster (let alone the timeframe in which this needs to be done). The requirements need to be reasonable, but reasonable is a relative term when it comes to something that has not been created yet. Creating a data product may not work as you imagine it to work, so you and your manager or client need to agree on a set of requirements that also outline the desired end result. You need to be able to manage your client’s expectations, help them understand the limitations of your tools and hardware and find a solution that is mutually satisfactory. That takes a lot of creativity, diplomacy and communication, not to mention patience.

9.6 Insufficient Know-How Issues

Lack of knowledge is the most commonly encountered issue for people who are new in a data science job though it can be encountered by more established data scientists as well, especially when changing industries. If you have insufficient know-how, it’s better to admit it and offer a strategy for overcoming the issue rather than hiding it and pretending you know everything. Try to augment your knowledge from one or more of the following sources:

  • Reliable article – an article written by someone knowledgeable. Reliable articles are usually found on established data science portals and journals as well as on LinkedIn.
  • Relevant technical book – choose a book written specifically on the topic on which you want to expand your knowledge. Technics Publications has many reliable books on a variety of data related topics, so that would be a good place to start. In general, try to avoid books that are written for non-technical people when seeking a specific piece of know-how. Even this book may be less than ideal for filling most of your technical know-how gaps.
  • Reliable website – try a technical blog or forum rather than a generic one that is there just to generate traffic for its ads. What’s important is that it’s run by someone who knows what they are talking about. If you are familiar with SEO, look into the site’s source code to get hints of whether it is worth the traffic it receives and whether or not it has been developed by a professional.
  • Worthwhile workshop – this source was covered in subchapter 9.1.
  • Specialist – if the above options don’t work out, or if, for whatever reason, you prefer to bypass them, consult a fellow data scientist. Here’s where networking pays off. You can find a fellow data scientist in a data science group (see subchapter 9.4), physically or online. It would be best to avoid shooting an email to the author of this book asking technical questions, though, and you should always be considerate of others’ schedules before contacting them. Note that in the initial stages of your data science career, you may want to have such a specialist as a mentor since you may have a number of things to ask that may not be readily answered in the aforementioned sources.

Issues related to insufficient know-how may be challenging, but with enough humility, open-mindedness and research, it’s only a matter of time before you resolve them. No one was born knowing everything, and no one knows all there is to know in this ever-changing field. So don’t hesitate to seek the missing pieces of your know-how puzzle and evolve as a professional through this experience.

9.7 Tool Integration Issues

Tool integration issues are common in the work of a data scientist. You may have a great tool (e.g., Matlab), but the other software you work with doesn’t integrate with its scripts. Or you may develop a great data analysis script in R (which nearly every data science tool integrates with), but the format of the data file or data stream you have to process is not recognized by your script.

In general, when dealing with an integration issue you need to research the technologies used and how other people have resolved (or at least tried to resolve) the problem at hand. Clever use of a search engine is a big asset here, so make use of it extensively. You will also need to employ your communication skills, your creativity and, of course, patience. Here’s another instance when the contacts you’ve made through networking can be useful. Solving tool integration issues will enable you to become more intimately acquainted with the software you use and allow you to exercise your creativity and research skills, both of which are essential to your role.
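As one illustration of the file-format case mentioned above, here is a minimal sketch in Python (a language chosen only for illustration, not one prescribed in this chapter). It assumes that the incoming data arrives as text in one of three formats (comma-separated, tab-separated or JSON) and converts each into the same list of records, so that the analysis code downstream only has to deal with one structure. The function name load_records and the format labels are hypothetical.

```python
import csv
import io
import json


def load_records(text, fmt):
    """Convert text in a known format into a list of dict records.

    load_records and the labels "csv", "tsv" and "json" are
    hypothetical names used only for this sketch.
    """
    if fmt == "csv":
        # The first line is treated as the header row.
        return [dict(row) for row in csv.DictReader(io.StringIO(text))]
    if fmt == "tsv":
        # Same as CSV, but with tab as the field separator.
        reader = csv.DictReader(io.StringIO(text), delimiter="\t")
        return [dict(row) for row in reader]
    if fmt == "json":
        data = json.loads(text)
        # Accept either a single object or a list of objects.
        return data if isinstance(data, list) else [data]
    # Fail loudly on anything else instead of guessing.
    raise ValueError(f"unsupported format: {fmt!r}")


if __name__ == "__main__":
    print(load_records("id,score\n1,0.5\n2,0.8\n", "csv"))
    print(load_records('{"id": 1, "score": 0.5}', "json"))
```

Note that the CSV and tab-separated branches return every value as a string, while the JSON branch keeps numbers as numbers, so a real script would probably need a type-conversion step after loading. Raising an error for an unknown format, rather than guessing, makes an integration problem visible as soon as it occurs.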

9.8 Key Points

  • Keeping your knowledge up to date is a very important aspect of being a data scientist, especially as innovations keep coming about in the field.
  • Workshops are the most efficient way to learn something new, particularly about technical subjects. Workshops may be expensive, but they can be a worthwhile investment of your resources because they can enhance your resume and often provide more useful knowledge than university courses.
  • Conferences are an excellent way to expand your understanding of data science, get updated on recent innovations, meet lots of interesting people in the field and learn about useful things you can apply to the problems you are facing in your data science endeavors.
  • Online courses, particularly MOOCs, are one of the best ways to increase and refine your knowledge of a variety of topics. There are several data science related courses available on Coursera, the most established MOOC provider.
  • Data science groups are a great and quite enjoyable way to learn new things about the field. You need to find a group that hosts a lot of educational events and has many members in the data science field (not just beginners), and you need to practice active conversation when socializing with the other members of the group.
  • Resource issues are quite common and involve dealing with the limited resources that are available for data analysis tasks.
  • Requirements issues are commonplace and involve miscommunication, misunderstanding and misinterpretation in terms of implementation of the requirements your manager or client has. These issues can be effectively tackled by employing creativity, diplomacy and communication as well as patience.
  • Insufficient know-how for your work can be overcome by reading a good article/book/website or by consulting a specialist.
  • Integration issues are quite common in the IT world and involve getting different programs as well as datasets of various formats to work together. This is particularly difficult when working with newly developed programs. You can overcome this kind of issue by employing good communication, creativity and patience.