You’ve now finished a bootcamp, a degree program, a set of online courses, or a series of data projects in your current job. Congratulations—you’re ready to get a data scientist job! Right?
Well, maybe. Part 2 of this book is all about how to find, apply for, and get a data science position, and you can certainly start this process now. But another step can really help you be successful: building a portfolio. A portfolio is a set of data science projects that you can show to people so they can see what kind of data science work you can do.
A strong portfolio has two main parts: GitHub repositories (repos for short) and a blog. A GitHub repo hosts the code for a project, and the blog shows off your communication skills and the non-code part of your data science work. Most people don’t want to read through thousands of lines of code (your repo); they want a quick explanation of what you did and why it’s important (your blog). And who knows—you might even get data scientists from around the world reading your blog, depending on the topic. As we discuss in the second part of this chapter, you don’t have to just blog about analyses you did or models you built; you could also explain a statistical technique, write a tutorial for a text analysis method, or even share career advice (such as how you picked your degree program).
This isn’t to say that you need to have a blog or GitHub repos filled with projects to be a successful data scientist. In fact, the majority of data scientists don’t, and people get jobs without a portfolio all the time. But creating a portfolio is a great way to help you stand out and to practice your data science skills and get better. We hope that it’s fun, too!
This chapter walks you through the process of building a good portfolio. The first part is about doing a data science project and organizing it on GitHub. The second part goes over best practices for starting and sharing your blog so that you get the most value out of the work you’ve done. Then we walk you through two real projects we’ve done so that you can see the process end to end.
A data science project starts with two things: a dataset that’s interesting and a question to ask about it. You could take government census data, for example, and ask “How are the demographics across the country changing over time?” The combination of question and data is the kernel of the project (figure 4.1), and with those two things, you can start doing data science.
When you’re thinking about what data you want to use, the most important thing is finding data that’s interesting to you. Why do you want to use this data? Your choice of data is a way to show off your personality or the domain knowledge you have from your previous career or studies. If you’re in fashion, for example, you can look at articles about Fashion Week and see how styles have changed in the past 20 years. If you’re an enthusiastic runner, you can show how your runs have changed over time and maybe look to see whether your running time is related to the weather.
Something you shouldn’t do is use the Titanic dataset, MNIST, or any other popular beginner dataset. It’s not that these datasets aren’t good learning experiences; they can be. But you’re probably not going to find anything novel that would surprise and intrigue employers or teach them more about you.
Sometimes, you let a question lead you to your dataset. You may be curious, for example, about how the gender distribution of college majors has changed over time and whether that change is related to median earnings after graduation. Then you’d take to Google and try to find the best source of that data.
But maybe you don’t have a burning question that you’ve just been waiting to have data science skills to answer. In this case, you can start by browsing datasets and seeing whether you can come up with any interesting questions. Here are a few suggestions for where you might start:
What makes a side project interesting? Our recommendation is to pick an exploratory analysis in which any result will probably teach the reader something or demonstrate your skills. You might create an interactive map of 311 calls in Seattle, color-coded by category; this map clearly demonstrates your visualization skills and shows that you can write about the patterns that emerge. On the other hand, if you try to predict the stock market, you likely won’t be able to, and it’s hard for an employer to assess your skills if you have a negative outcome.
Another tip is to see what comes up when you Google your question. If the first results are newspaper articles or blog posts that answer exactly the question you were asking, you may want to rethink your approach. Sometimes, you can expand on someone else’s analysis or bring in other data to add another layer to the analysis, but you may need to start the process over.
Building a portfolio doesn’t need to be a huge time commitment. The perfect is definitely the enemy of the good here. Something is better than nothing; employers are first and foremost looking for evidence that you can code and communicate about data. You may be worried that people will look and laugh at your code or say, “Wow, we thought this person might be okay, but look at their terrible code!” It’s very unlikely that this will happen. One reason is that employers tailor their expectations to seniority level: you won’t be expected to code like a computer science major if you’re a beginning data scientist. Generally, the bigger worry is that you can’t code at all.
It’s also good to think about the areas of data science we covered in chapter 1. Do you want to specialize in visualization? Make an interactive graph using D3. Do you want to do natural language processing? Use text data. Machine learning? Predict something.
Use your project to force yourself to learn something new. Doing this kind of hands-on analysis will show you the holes in your knowledge. When data you’re really interested in is on the web, you’ll learn web scraping. If you think that a particular graph looks ugly, you’ll learn how to make better visualizations. If you’re self-studying, doing a project is a nice way to overcome the paralysis of not knowing what to learn next.
A common problem with self-motivated projects is overscoping: wanting to do everything, or adding more and more as you go. You can always keep improving, editing, and supplementing, but then you never finish. One strategy is to think like Hollywood and create sequels: set yourself a question and answer it, and if you think you may want to revisit it later, end your write-up with a question or topic for further investigation (or even “To be continued . . .”, if you must).
Another problem is not being able to pivot. Sometimes, the data you wanted isn’t available. Or there’s not enough of it. Or you’re not able to clean it. These situations are frustrating, and it can be easy to give up at this point. But it’s worth trying to figure out how you can salvage the project. Have you already done enough work to write a blog post tutorial, maybe on how you collected the data? Employers look for people who learn from their mistakes and aren’t afraid to admit them. Just showing what went wrong so that others might avoid the same fate is still valuable.
Maybe you’re in a bootcamp or a degree program in which you’re already doing your own projects. You’ve even committed your code to GitHub. Is that enough?
Nope! A minimal requirement for a useful GitHub repository is filling out the README file. You have a couple of questions to answer:
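At minimum, a README should tell a visitor what the project is, what you found, and how to run the code. Here’s a minimal sketch (the project, file names, and headings are purely illustrative, not a required template):

```markdown
# Pet Name Analysis

An analysis of Seattle pet license data to find the most
unusual dog and cat names.

## What's here

- `clean_data.R` — pulls and cleans the raw license data
- `analysis.R` — computes name frequencies and makes the plots

## How to run it

Install the packages listed at the top of each script,
then run `clean_data.R` followed by `analysis.R`.
```

Even a few lines like these turn a pile of scripts into something a stranger (or interviewer) can navigate.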
But although doing a project and making it publicly available in a documented GitHub repo is good, it’s very hard to look at code and understand why it’s important. After you do a project, the next step is writing a blog post, which lets people know why what you did was cool and interesting. No one cares about pet_name_analysis.R, but everyone cares about “I used R to find the silliest pet names!”
Blogs allow you to show off your thinking and projects, but they can also offer a nontechnical view of your work. We know, we know—you just learned all this great technical stuff! You want to show it off! But being a data scientist almost always entails communicating your results to a lay audience, and a blog will give you experience translating your data science process into business language.
Suppose that you’ve created a blog. Are people really going to be interested in your projects? You don’t even have a data scientist title yet; how can you teach anyone anything?
Something good to remember is that you’re best positioned to teach people who are a few steps behind you. Right after you’ve learned a concept, such as using continuous integration for your package or making a TensorFlow model, you still remember the misconceptions and frustrations you had. Years later, it’s hard to put yourself in that beginner’s mindset. Have you ever had a teacher who was clearly very smart yet couldn’t communicate concepts at all? You didn’t doubt that they knew the topic, but they couldn’t break it down for you and seemed frustrated that you didn’t just get it right away.
Try thinking of your audience as the you of six months ago. What have you learned since then? What resources do you wish had been available? This exercise is also great for celebrating your progress. With so much to learn in data science, it’s easy to feel that you’ve never done enough; pausing to see what you’ve accomplished is nice.
You can group data science blog posts into four categories:
But where should you put your interesting writing? For a blog, you have two main options:
One common question people have about blogging is how often they need to post and how long those posts should be. These things are definitely personal choices. We’ve seen people who have micro blogs, publishing short posts multiple times a week. Other people go months between posts and publish longer articles. There are some limitations; you do want to make sure that your posts don’t start to resemble Ulysses. If your post is very long, you can split it into parts. But you want to show that you can communicate concisely, as that’s one of the core data science skills. Executives and even your manager probably don’t want or need to hear all the false starts you had or the 20 different things you tried. Although you may choose to include a brief summary of your false starts, you need to get to the point and your final path quickly. One exception, though, is if your final method is going to surprise readers. If you didn’t use the most popular library for a problem, for example, you may want to explain that you didn’t because you found that the library didn’t work.
What if you’re worried that no one will read your blog and that all your work will be for nothing? Well, one reason to have a blog anyway is that it helps your job applications. You can put links to your blog posts on your résumé when you reference data science projects and even show them to people in interviews, especially if the posts have nice interactive visualizations or dashboards. It’s not important to have hundreds or thousands of readers. It can be nice if you get claps on Medium or if you’re featured in a data science company’s newsletter, but it’s more important to have an audience that will read, value, and engage with the material than to have high metrics.
That’s not to say there’s nothing you can do to build readership. For one thing, you should advertise yourself; although it’s a cliché, having a #brand is useful for building a network in the long term. Even if a topic seems simple, it’s probably new to plenty of practicing data scientists, simply because the field is so big. People at the companies you want to work for may even read your stuff! Twitter is a good place to start; you can share the news when you release a post and use the appropriate hashtags to get wider readership.
But your blog is valuable even if no one (besides your partner and pet) reads it. Writing a blog post is good practice; it forces you to structure your thoughts. Just like teaching in person, it also helps you realize when you don’t know something as well as you thought you did.
In this section, we walk you through two example projects, from the initial idea through the analysis to a final public artifact. We’ll use real projects that the authors of this book did: creating a web application for data science freelancers to find the best-fit jobs and learning neural networks by training one on a dataset of banned license plates.
Emily Robinson
When I was an aspiring data scientist, I became interested in one way some data scientists make extra money: freelancing. Freelancing is doing projects for someone you’re not employed by, whether that’s another person or a large company. These projects range from a few hours to months of full-time work. You can find many freelancing jobs posted on freelancing websites like UpWork, but because data science is a very broad field, jobs in that category could be anything from web development to an analysis in Excel to natural language processing on terabytes of data. I decided to see whether I could help freelancers wade through thousands of jobs to find the ones that are the best fit for them.
To gather the data, I used UpWork’s API to pull currently available jobs and the profiles of everyone in the Data Science and Analytics category. I ended up with 93,000 freelancers and 3,000 jobs. Although the API made the data relatively easy to access (I didn’t have to do web scraping), I still had to write functions to make hundreds of API calls, handle the calls that failed, and transform the results into a usable form. The advantage of this process was that because the data wasn’t readily available, there weren’t hundreds of other people working on the same project, as there would have been if I’d used data from a Kaggle competition.
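That retry-and-paginate pattern comes up in almost any API-heavy project. Here’s a minimal Python sketch of it (the real project was written in R, and `fetch_page` is a hypothetical stand-in for a single API call, not an actual UpWork client function):

```python
import time

def with_retries(fetch, max_attempts=3, wait_seconds=1.0):
    """Wrap fetch() so that failed calls are retried with a growing pause."""
    def wrapped(*args, **kwargs):
        for attempt in range(1, max_attempts + 1):
            try:
                return fetch(*args, **kwargs)
            except Exception:
                if attempt == max_attempts:
                    raise  # give up after the last attempt
                time.sleep(wait_seconds * attempt)  # back off a little each time
    return wrapped

def fetch_all_pages(fetch_page, page_size=100):
    """Page through an API, collecting results until a batch comes back empty."""
    fetch_page = with_retries(fetch_page)
    results, offset = [], 0
    while True:
        batch = fetch_page(offset=offset, count=page_size)
        if not batch:
            return results
        results.extend(batch)
        offset += page_size
```

The same two pieces (a retry wrapper and a pagination loop) apply whatever API you’re pulling from.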
After I got the data in good shape, I did some exploratory analysis. I looked at how education levels and country affected how much freelancers earned. I also made a correlation graph of the skills that freelancers listed, which showed the different types of freelancers: web developers (PHP, jQuery, HTML, and CSS), finance and accounting (financial accounting, bookkeeping, and financial analysis), and data gathering (data entry, lead generation, data mining, and web scraping), along with the “traditional” data science skill set (Python, machine learning, statistics, and data analysis).
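A correlation graph like that starts from a binary skill-indicator matrix: one row per freelancer, one column per skill, then pairwise correlations between columns. Here’s a small pure-Python sketch of that computation (the real analysis was done in R, and the function name is my own):

```python
from itertools import combinations
from math import sqrt

def skill_correlations(freelancer_skills):
    """Pearson correlation between binary skill indicators across freelancers.

    freelancer_skills: list of sets, one set of skill names per freelancer.
    Returns {(skill_a, skill_b): correlation} for each pair of skills.
    """
    skills = sorted(set().union(*freelancer_skills))
    n = len(freelancer_skills)
    # Indicator vector per skill: 1 if the freelancer lists it, else 0.
    vec = {s: [1 if s in fs else 0 for fs in freelancer_skills] for s in skills}

    def corr(x, y):
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = sqrt(sum((a - mx) ** 2 for a in x))
        sy = sqrt(sum((b - my) ** 2 for b in y))
        return cov / (sx * sy) if sx and sy else 0.0

    return {(a, b): corr(vec[a], vec[b]) for a, b in combinations(skills, 2)}
```

Skills that travel together (PHP with jQuery, bookkeeping with financial analysis) show up as strongly positive pairs, which is what makes the clusters visible in the graph.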
Finally, I created a similarity score between profile text and the job text, and combined that score with the overlap in skills (both freelancers and jobs listed skills) to create a matching score for a freelancer and a job.
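The shape of that score can be sketched simply. This hypothetical Python version blends a cosine similarity over raw term counts with a Jaccard overlap of listed skills; the project’s actual scoring was more involved, and the 50/50 weighting here is just an assumption for illustration:

```python
from collections import Counter
from math import sqrt

def text_cosine(a, b):
    """Cosine similarity between two texts, using raw term counts."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def skill_overlap(a, b):
    """Jaccard overlap between two sets of listed skills."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def match_score(profile_text, profile_skills, job_text, job_skills, text_weight=0.5):
    """Weighted blend of text similarity and skill overlap, between 0 and 1."""
    return (text_weight * text_cosine(profile_text, job_text)
            + (1 - text_weight) * skill_overlap(profile_skills, job_skills))
```

Sorting jobs by this blended score is what lets the application rank listings per freelancer rather than just filter them.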
In this case, I didn’t end up writing a blog post. Instead, I made an interactive web application in which someone could enter their profile text, skills, and requirements for jobs (such as a minimum feedback score for the job poster and how long the job would take), and the available jobs would be filtered to meet those requirements and sorted by how well they fit the user.
I didn’t let the perfect be the enemy of the good here; there are plenty of ways I could have made the project better. I pulled the jobs only once, four years ago, so although the application still works, none of those jobs are available anymore. To make the application valuable over the long term, I’d need to pull jobs nightly and update the listings. I also could have made a more sophisticated matching algorithm, sped up the initial loading time of the app, and made the appearance fancier. But despite these limitations, the project met a few important goals. It showed that I could take a project and allow people to interact with it rather than be limited to static analyses that lived on my laptop. It had a real-world use case: helping freelancers find jobs. And it took me through the full data science project cycle: gathering the data, cleaning it, running exploratory analyses, and producing a final output.
Jacqueline Nolis
As I grew as a data scientist, I was always frustrated when I saw hilarious blog posts in which people trained neural networks to generate things like new band names, new Pokémon, and weird cooking recipes. I thought these projects were great, but I didn’t know how to make them myself! One day, I remembered that I had heard of a dataset of all the custom license plates that were rejected by the state of Arizona for being too offensive. If I could get that dataset, it would be perfect for finally learning how to make neural networks—I could make my own offensive license plates (figure 4.2)!
After submitting a public records request to the Arizona Department of Transportation, I got a list of thousands of offensive license plates. I didn’t know anything about neural networks, so after receiving the data, I started scouring the internet for blog posts describing how to make one. As primarily an R user, I was happy to find the Keras package by RStudio for making neural networks in R.
I loaded the data into R and then checked out the RStudio Keras package example for generating text with neural networks. I modified the code to work with license-plate data; the RStudio example was for generating sequences of long text, but I wanted to train on seven-character license plates. This meant creating multiple training data points for my model from each license plate (one data point to predict each character in the license plate).
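That expansion step can be written in a few lines. Here’s an illustrative Python version (the actual project used R and Keras; the padding character and the fixed seven-character width are my assumptions for the sketch):

```python
def training_pairs(plate, pad="_"):
    """Expand one license plate into (prefix, next_char) training examples.

    One example per character: the model sees the characters so far
    (left-padded to a fixed width) and learns to predict the next one.
    """
    examples = []
    for i in range(len(plate)):
        prefix = plate[:i].rjust(7, pad)  # plates are at most 7 characters
        examples.append((prefix, plate[i]))
    return examples
```

A four-character plate becomes four training examples, so even a modest list of plates yields a reasonably sized training set.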
Next, I trained the neural network model, although it didn’t work at first. After putting the project down for a month, I came back and realized that my data wasn’t being processed quite right. When I fixed this problem, the results that the neural network generated were fantastic. Ultimately, even though I didn’t change the RStudio example much, by the end, I felt much more confident in creating and using neural networks.
I wrote a blog post about the project that walks through how I got the data, how I processed it to be ready for the neural network, and how I modified the RStudio example code to work for me. The blog post was very much an “I’m new at neural networks, and here is what I learned” style of post; I didn’t pretend that I already knew how all this worked. As part of the blog post, I made an image that took the text output of my neural network and made it look like Arizona license plates. I also put the code on GitHub.
Since I wrote that blog post and made my code available, numerous other people have modified it to make their own funny neural networks. What I learned from this goofy project eventually helped me to make high-impact machine learning models for important consulting engagements. Just because the original work isn’t serious doesn’t mean that there isn’t value in it!
David Robinson is the co-author (with Julia Silge) of the tidytext package in R and the O’Reilly book Text Mining with R. He’s also the author of the self-published e-book Introduction to Empirical Bayes: Examples from Baseball Statistics and the R packages broom and fuzzyjoin. He holds a PhD in quantitative and computational biology from Princeton University. Robinson writes about statistics, data analysis, education, and programming in R on his popular blog: varianceexplained.org.
I first started blogging when I was applying for jobs near the end of my PhD, as I realized that I didn’t have a lot out on the internet that showed my skills in programming or statistics. When I launched my blog, I remember having the distinct fear that once I wrote the couple of posts I had ready, I would run out of ideas. But I was surprised to find that I kept coming up with new things I wanted to write about: datasets I wanted to analyze, opinions I wanted to share, and methods I wanted to teach. I’ve been blogging moderately consistently for four years since then.
I did get my first job from something I wrote publicly online. Stack Overflow approached me based on an answer I’d written on Stack Overflow’s statistics site. I’d written that answer years ago, but some engineers there found it and were impressed by it. That experience really led me to have a strong belief in producing public artifacts, because sometimes benefits will show up months or years down the line and lead to opportunities I never would have expected.
People whose résumés might not show their data science skills and who don’t have a typical background, like having a PhD or experience as a data analyst, would particularly benefit from public work. When I’m evaluating a candidate, if they don’t have those kinds of credentials, it’s hard to say if they’ll be able to do the job. But my favorite way to evaluate a candidate is to read an analysis they’ve done online. If I can look at some graphs someone created, how they explained the story, and how they dug into the data, I can start to understand whether they’re a good fit for the role.
The way I used to view projects is that you made steady progress as you kept working on something. In graduate school, an idea wasn’t very worthwhile, but then it became some code, a draft, a finished draft, and finally a published paper. I thought that along the way, my work was getting slowly more valuable.
Since then, I’ve realized I was thinking about it completely wrong. Anything that is still on your computer, however complete it is, is worthless. If it’s not out there in the world, it’s been wasted so far, and anything that’s out in the world is much more valuable. What made me realize this is a few papers I developed in graduate school that I never published. I put a lot of work into them, but I kept feeling they weren’t quite ready. Years later, I’ve forgotten what’s in them, I can’t find them, and they haven’t added anything to the world. If along the way I’d written a couple of blog posts, sent a couple of tweets, and maybe made a really simple open source package, all of those would have added value along the way.
I’ve built up a habit that every time I see a dataset, I’ll download it and take a quick look, running a few lines of code to get a sense of the data. This helps you build up a bit of data science taste; working on enough projects gives you a feel for which pieces of data are going to yield an interesting bit of writing and which might be worth giving up on.
My advice is that whenever you see the opportunity to analyze data, even if it’s not in your current job or you think it might not be interesting to you, take a quick look and see what you can find in just a few minutes. Pick a dataset, decide on a set amount of time, do all the analyses that you can, and then publish it. It might not be a fully polished post, and you might not find everything you’re hoping to find and answer all the questions you wanted to answer. But by setting a goal of one dataset becoming one post, you can start getting into this habit.
Don’t get stressed about keeping up with the cutting edge of the field. It’s tempting when you start working in data science and machine learning to think you should start working with deep learning or other advanced methods. But remember that those methods were developed to solve some of the most difficult problems in the field. Those aren’t necessarily the problems that you’re going to face as a data scientist, especially early in your career. You should start by getting very comfortable transforming and visualizing data; programming with a wide variety of packages; and using statistical techniques like hypothesis tests, classification, and regression. It’s worth understanding these concepts and getting good at applying them before you start worrying about concepts at the cutting edge.
Practical Data Science with R, 2nd ed., by Nina Zumel and John Mount (Manning Publications)
This book is an introduction to data science that uses R as the primary tool. It’s a great supplement to the book you’re currently holding because it goes much deeper into the technical components of the job. It works through taking datasets, thinking about the questions you can ask of them and how to do so, and then interpreting the results.
Doing Data Science: Straight Talk from the Frontline, by Cathy O’Neil and Rachel Schutt (O’Reilly Publications)
Another introduction to data science, this book is a mixture of theory and application. It takes a broad view of the field and tries to approach it from multiple angles rather than being a set of case studies.
R for Everyone, 2nd ed., by Jared Lander, and Pandas for Everyone, by Daniel Chen (Addison-Wesley Data and Analytics)
R for Everyone and Pandas for Everyone are two books from the Addison-Wesley Data and Analytics Series. They cover using R and Python (via pandas) from basic functions to advanced analytics and data science problem-solving. For people who feel that they need help in learning either of these topics, these books are great resources.
Think Like a Data Scientist: Tackle the Data Science Process Step-by-Step, by Brian Godsey (Manning Publications)
Think Like a Data Scientist is an introductory data science book structured around how data science work is actually done. The book walks through defining the problem and creating the plan, solving data science problems, and then presenting your findings to others. This book is best for people who understand the technical basics of data science but are new to working on a long-term project.
Getting What You Came For: The Smart Student’s Guide to Earning an M.A. or a Ph.D., by Robert L. Peters (Farrar, Straus and Giroux)
If you’ve decided to go to graduate school for a master’s or PhD, you’re in for a long, grueling journey. How to get through exams and qualifications, persevere through research, and finish quickly are things no one teaches you directly. Although this book is fairly old, the lessons it offers about how to succeed still apply to grad school today.
Bird by Bird: Some Instructions on Writing and Life, by Anne Lamott (Anchor)
Bird by Bird is a guide to writing, but it’s also a great guide for life. The title comes from something Anne Lamott’s father said to her brother when he was freaking out about doing a report on birds he’d had three months to do but had left to the last night: “Bird by bird, buddy. Just take it bird by bird.” If you’ve been struggling with perfectionism or figuring out what you can write about, this book may be the one for you.
Bootcamp rankings, by Switchup.org
https://www.switchup.org/rankings/best-data-science-bootcamps
Switchup provides a listing of the top 20 bootcamps based on student reviews. Although you may want to take the reviews and orderings with a grain of salt, this list is still a solid starting point for choosing which bootcamps to apply to.
What’s the Difference between Data Science, Machine Learning, and Artificial Intelligence?, by David Robinson
http://varianceexplained.org/r/ds-ml-ai
If you’re confused about what distinguishes data science from machine learning and artificial intelligence, this post offers one helpful way to tell them apart. Although there are no universally agreed-upon definitions, we like this taxonomy in which data science produces insights, machine learning produces predictions, and artificial intelligence produces actions.
What You Need to Know before Considering a PhD, by Rachel Thomas
https://www.fast.ai/2018/08/27/grad-school
If you’re thinking that you need a PhD to be a data scientist, read this blog first. Thomas lays out the significant costs of getting a PhD (in terms of both potential mental health costs and career opportunity costs) and debunks the myth that you need a PhD to do cutting-edge research in deep learning.
Thinking of Blogging about Data Science? Here Are Some Tips and Possible Benefits, by Derrick Mwiti
If chapter 4 didn’t convince you of the benefits of blogging, maybe this post will. Mwiti also offers some great tips on making your posts engaging, including using bullet points and new datasets.
How to Build a Data Science Portfolio, by Michael Galarnyk
This is an excellent, detailed post on how to make a data science portfolio. Galarnyk shows not only what types of projects to include (and not include) in a portfolio, but also how to incorporate them into your résumé and share them.