You’ve now finished a bootcamp, a degree program, a set of online courses, or a series of data projects in your current job. Congratulations—you’re ready to get a data scientist job! Right?
Well, maybe. Part 2 of this book is all about how to find, apply for, and get a data science position, and you can certainly start this process now. But another step can really help you be successful: building a portfolio. A portfolio is a set of data science projects that you can show to people so they can see what kind of data science work you can do.
A strong portfolio has two main parts: GitHub repositories (repos for short) and a blog. A GitHub repo hosts the code for a project, and the blog shows off your communication skills and the non-code part of your data science work. Most people don’t want to read through thousands of lines of code (your repo); they want a quick explanation of what you did and why it’s important (your blog). And who knows—you might even get data scientists from around the world reading your blog, depending on the topic. As we discuss in the second part of this chapter, you don’t have to just blog about analyses you did or models you built; you could also explain a statistical technique, write a tutorial for a text analysis method, or even share career advice (such as how you picked your degree program).
This isn’t to say that you need to have a blog or GitHub repos filled with projects to be a successful data scientist. In fact, the majority of data scientists don’t, and people get jobs without a portfolio all the time. But creating a portfolio is a great way to help you stand out and to practice your data science skills and get better. We hope that it’s fun, too!
This chapter walks you through the process of building a good portfolio. The first part is about doing a data science project and organizing it on GitHub. The second part goes over best practices for starting and sharing your blog so that you get the most value out of the work you’ve done. Then we walk you through two real projects we’ve done so that you can see the process end to end.
A data science project starts with two things: a dataset that’s interesting and a question to ask about it. You could take government census data, for example, and ask “How are the demographics across the country changing over time?” The combination of question and data is the kernel of the project (figure 4.1), and with those two things, you can start doing data science.
When you’re thinking about what data you want to use, the most important thing is finding data that’s interesting to you. Why do you want to use this data? Your choice of data is a way to show off your personality or the domain knowledge you have from your previous career or studies. If you’re in fashion, for example, you can look at articles about Fashion Week and see how styles have changed in the past 20 years. If you’re an enthusiastic runner, you can show how your runs have changed over time and maybe look to see whether your running time is related to the weather.
Something you shouldn’t do is use the Titanic dataset, MNIST, or any other popular beginner dataset. It’s not that these datasets aren’t good learning experiences; they can be. But you’re probably not going to find anything novel that would surprise and intrigue employers or teach them more about you.
Sometimes, you let a question lead you to your dataset. You may be curious, for example, about how the gender distribution of college majors has changed over time and whether that change is related to median earnings after graduation. Then you’d take to Google and try to find the best source of that data.
But maybe you don’t have a burning question that you’ve just been waiting to have data science skills to answer. In this case, you can start by browsing datasets and seeing whether you can come up with any interesting questions. Here are a few suggestions for where you might start:
What makes a side project interesting? Our recommendation is to pick an exploratory analysis in which any result will probably teach the reader something or demonstrate your skills. You might create an interactive map of 311 calls in Seattle, color-coded by category; this map clearly demonstrates your visualization skills and shows that you can write about the patterns that emerge. On the other hand, if you try to predict the stock market, you likely won’t be able to, and it’s hard for an employer to assess your skills if you have a negative outcome.
Another tip is to see what comes up when you Google your question. If the first results are newspaper articles or blog posts that answer exactly the question you were asking, you may want to rethink your approach. Sometimes, you can expand on someone else’s analysis or bring in other data to add another layer to the analysis, but you may need to start the process over.
Building a portfolio doesn’t need to be a huge time commitment. The perfect is definitely the enemy of the good here. Something is better than nothing; employers are first and foremost looking for evidence that you can code and communicate about data. You may be worried that people will look and laugh at your code or say, “Wow, we thought this person might be okay, but look at their terrible code!” It’s very unlikely that this will happen. One reason is that employers tailor their expectations to seniority level: you won’t be expected to code like a computer science major if you’re a beginning data scientist. Generally, the bigger worry is that you can’t code at all.
It’s also good to think about the areas of data science we covered in chapter 1. Do you want to specialize in visualization? Make an interactive graph using D3. Do you want to do natural language processing? Use text data. Machine learning? Predict something.
Use your project to force yourself to learn something new. Doing this kind of hands-on analysis will show you the holes in your knowledge. When data you’re really interested in is on the web, you’ll learn web scraping. If you think that a particular graph looks ugly, you’ll learn how to make better visualizations. If you’re self-studying, doing a project is a nice way to overcome the paralysis of not knowing what to learn next.
A common problem with self-motivated projects is overscoping: wanting to do everything, or adding more and more as you go. You can always keep improving, editing, and supplementing, but then you never finish. One strategy is to think like Hollywood and create sequels: set yourself a question and answer it, and if you think you may want to revisit it later, end your write-up with a question or topic for further investigation (or even “To be continued . . .”, if you must).
Another problem is not being able to pivot. Sometimes, the data you wanted isn’t available. Or there’s not enough of it. Or you’re not able to clean it. These situations are frustrating, and it can be easy to give up at this point. But it’s worth trying to figure out how you can salvage the project. Have you already done enough work to write a blog post tutorial, maybe on how you collected the data? Employers look for people who learn from their mistakes and aren’t afraid to admit them. Just showing what went wrong so that others might avoid the same fate is still valuable.
Maybe you’re in a bootcamp or a degree program in which you’re already doing your own projects. You’ve even committed your code to GitHub. Is that enough?
Nope! A minimal requirement for a useful GitHub repository is filling out the README file. You have a couple of questions to answer:
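At minimum, a README should tell a visitor what the project is, what you found, and how to run the code. Here’s a minimal sketch (the project, file names, and headings are purely illustrative, not a required template):

```markdown
# Pet Name Analysis

An analysis of Seattle pet license data to find the most
unusual dog and cat names.

## What's here

- `clean_data.R` — pulls and cleans the raw license data
- `analysis.R` — computes name frequencies and makes the plots

## How to run it

Install the packages listed at the top of each script,
then run `clean_data.R` followed by `analysis.R`.
```

Even a few lines like these turn a pile of scripts into something a stranger (or interviewer) can navigate.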
But although doing a project and making it publicly available in a documented GitHub repo is good, it’s very hard to look at code and understand why it’s important. After you do a project, the next step is writing a blog post, which lets people know why what you did was cool and interesting. No one cares about pet_name_analysis.R, but everyone cares about “I used R to find the silliest pet names!”
Blogs allow you to show off your thinking and projects, but they can also offer a nontechnical view of your work. We know, we know—you just learned all this great technical stuff! You want to show it off! But being a data scientist almost always entails communicating your results to a lay audience, and a blog will give you experience translating your data science process into business language.
Suppose that you’ve created a blog. Are people really going to be interested in your projects? You don’t even have a data scientist title yet; how can you teach anyone anything?
Something good to remember is that you’re best positioned to teach people who are a few steps behind you. Right after you’ve learned a concept, such as using continuous integration for your package or making a TensorFlow model, you still remember the misconceptions and frustrations you had. Years later, it’s hard to put yourself in that beginner’s mindset. Have you ever had a teacher who was clearly very smart yet couldn’t communicate concepts at all? You didn’t doubt that they knew the topic, but they couldn’t break it down for you and seemed frustrated that you didn’t just get it right away.
Try thinking of your audience as the you of six months ago. What have you learned since then? What resources do you wish had been available? This exercise is also great for celebrating your progress. With so much to learn in data science, it’s easy to feel that you’ve never done enough; pausing to see what you’ve accomplished is nice.
You can group data science blog posts into four categories:
But where should you put your interesting writing? For a blog, you have two main options:
One common question people have about blogging is how often they need to post and how long those posts should be. These things are definitely personal choices. We’ve seen people who have micro blogs, publishing short posts multiple times a week. Other people go months between posts and publish longer articles. There are some limitations; you do want to make sure that your posts don’t start to resemble Ulysses. If your post is very long, you can split it into parts. But you want to show that you can communicate concisely, as that’s one of the core data science skills. Executives and even your manager probably don’t want or need to hear all the false starts you had or the 20 different things you tried. Although you may choose to include a brief summary of your false starts, you need to get to the point and your final path quickly. One exception, though, is if your final method is going to surprise readers. If you didn’t use the most popular library for a problem, for example, you may want to explain that you didn’t because you found that the library didn’t work.
What if you’re worried that no one will read your blog and that all your work will be for nothing? Well, one reason to have a blog anyway is that it helps your job applications. You can put links to your blog posts on your résumé when you reference data science projects and even show them to people in interviews, especially if the posts have nice interactive visualizations or dashboards. It’s not important to have hundreds or thousands of readers. It can be nice if you get claps on Medium or if you’re featured in a data science company’s newsletter, but it’s more important to have an audience that will read, value, and engage with the material than to have high metrics.
That’s not to say there’s nothing you can do to build readership. For one thing, you should advertise yourself; although it’s a cliché, having a #brand is useful for building a network in the long term. Even if a topic seems simple, it’s probably new to plenty of practicing data scientists, simply because the field is so big. People at the companies you want to work for may even read your stuff! Twitter is a good place to start; you can share the news when you release a post and use the appropriate hashtags to get wider readership.
But your blog is valuable even if no one (besides your partner and pet) reads it. Writing a blog post is good practice; it forces you to structure your thoughts. Just like teaching in person, it also helps you realize when you don’t know something as well as you thought you did.
In this section, we walk you through two example projects, from the initial idea through the analysis to a final public artifact. We’ll use real projects that the authors of this book did: creating a web application for data science freelancers to find the best-fit jobs and learning neural networks by training one on a dataset of banned license plates.
Emily Robinson
When I was an aspiring data scientist, I became interested in one way some data scientists make extra money: freelancing. Freelancing is doing projects for someone you’re not employed by, whether that’s another person or a large company. These projects range from a few hours to months of full-time work. You can find many freelancing jobs posted on freelancing websites like UpWork, but because data science is a very broad field, jobs in that category could be anything from web development to an analysis in Excel to natural language processing on terabytes of data. I decided to see whether I could help freelancers wade through thousands of jobs to find the ones that are the best fit for them.
To gather the data, I used UpWork’s API to pull currently available jobs and the profiles of everyone in the Data Science and Analytics category. I ended up with 93,000 freelancers and 3,000 jobs. Although the API made the data relatively easy to access (I didn’t have to do web scraping), I still had to write functions to make hundreds of API calls, handle the calls that failed, and transform the results into a usable form. The advantage of this process was that because the data wasn’t readily available, there weren’t hundreds of other people working on the same project, as there would have been if I’d used data from a Kaggle competition.
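That retry-and-paginate pattern comes up in almost any API-heavy project. Here’s a minimal Python sketch of it (the real project was written in R, and `fetch_page` is a hypothetical stand-in for a single API call, not an actual UpWork client function):

```python
import time

def with_retries(fetch, max_attempts=3, wait_seconds=1.0):
    """Wrap fetch() so that failed calls are retried with a growing pause."""
    def wrapped(*args, **kwargs):
        for attempt in range(1, max_attempts + 1):
            try:
                return fetch(*args, **kwargs)
            except Exception:
                if attempt == max_attempts:
                    raise  # give up after the last attempt
                time.sleep(wait_seconds * attempt)  # back off a little each time
    return wrapped

def fetch_all_pages(fetch_page, page_size=100):
    """Page through an API, collecting results until a batch comes back empty."""
    fetch_page = with_retries(fetch_page)
    results, offset = [], 0
    while True:
        batch = fetch_page(offset=offset, count=page_size)
        if not batch:
            return results
        results.extend(batch)
        offset += page_size
```

The same two pieces (a retry wrapper and a pagination loop) apply whatever API you’re pulling from.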
After I got the data in good shape, I did some exploratory analysis. I looked at how education levels and country affected how much freelancers earned. I also made a correlation graph of the skills that freelancers listed, which showed the different types of freelancers: web developers (PHP, jQuery, HTML, and CSS), finance and accounting (financial accounting, bookkeeping, and financial analysis), and data gathering (data entry, lead generation, data mining, and web scraping), along with the “traditional” data science skill set (Python, machine learning, statistics, and data analysis).
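A correlation graph like that starts from a binary skill-indicator matrix: one row per freelancer, one column per skill, then pairwise correlations between columns. Here’s a small pure-Python sketch of that computation (the real analysis was done in R, and the function name is my own):

```python
from itertools import combinations
from math import sqrt

def skill_correlations(freelancer_skills):
    """Pearson correlation between binary skill indicators across freelancers.

    freelancer_skills: list of sets, one set of skill names per freelancer.
    Returns {(skill_a, skill_b): correlation} for each pair of skills.
    """
    skills = sorted(set().union(*freelancer_skills))
    n = len(freelancer_skills)
    # Indicator vector per skill: 1 if the freelancer lists it, else 0.
    vec = {s: [1 if s in fs else 0 for fs in freelancer_skills] for s in skills}

    def corr(x, y):
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = sqrt(sum((a - mx) ** 2 for a in x))
        sy = sqrt(sum((b - my) ** 2 for b in y))
        return cov / (sx * sy) if sx and sy else 0.0

    return {(a, b): corr(vec[a], vec[b]) for a, b in combinations(skills, 2)}
```

Skills that travel together (PHP with jQuery, bookkeeping with financial analysis) show up as strongly positive pairs, which is what makes the clusters visible in the graph.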
Finally, I created a similarity score between profile text and the job text, and combined that score with the overlap in skills (both freelancers and jobs listed skills) to create a matching score for a freelancer and a job.
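The shape of that score can be sketched simply. This hypothetical Python version blends a cosine similarity over raw term counts with a Jaccard overlap of listed skills; the project’s actual scoring was more involved, and the 50/50 weighting here is just an assumption for illustration:

```python
from collections import Counter
from math import sqrt

def text_cosine(a, b):
    """Cosine similarity between two texts, using raw term counts."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def skill_overlap(a, b):
    """Jaccard overlap between two sets of listed skills."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def match_score(profile_text, profile_skills, job_text, job_skills, text_weight=0.5):
    """Weighted blend of text similarity and skill overlap, between 0 and 1."""
    return (text_weight * text_cosine(profile_text, job_text)
            + (1 - text_weight) * skill_overlap(profile_skills, job_skills))
```

Sorting jobs by this blended score is what lets the application rank listings per freelancer rather than just filter them.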
In this case, I didn’t end up writing a blog post. Instead, I made an interactive web application in which someone could enter their profile text, skills, and requirements for jobs (such as a minimum feedback score for the job poster and how long the job would take), and the available jobs would be filtered to meet those requirements and sorted by how well they fit the user.
I didn’t let the perfect be the enemy of the good here; there are plenty of ways I could have made the project better. I pulled the jobs only once, four years ago, so although the application still works, none of those jobs are available anymore. To make the application valuable over the long term, I’d need to pull jobs nightly and update the listings. I also could have made a more sophisticated matching algorithm, sped up the initial loading time of the app, and made the appearance fancier. But despite these limitations, the project met a few important goals. It showed that I could take a project and allow people to interact with it rather than be limited to static analyses that lived on my laptop. It had a real-world use case: helping freelancers find jobs. And it took me through the full data science project cycle: gathering the data, cleaning it, running exploratory analyses, and producing a final output.
Jacqueline Nolis
As I grew as a data scientist, I was always frustrated when I saw hilarious blog posts in which people trained neural networks to generate things like new band names, new Pokémon, and weird cooking recipes. I thought these projects were great, but I didn’t know how to make them myself! One day, I remembered that I had heard of a dataset of all the custom license plates that were rejected by the state of Arizona for being too offensive. If I could get that dataset, it would be perfect for finally learning how to make neural networks—I could make my own offensive license plates (figure 4.2)!
After submitting a public records request to the Arizona Department of Transportation, I got a list of thousands of offensive license plates. I didn’t know anything about neural networks, so after receiving the data, I started scouring the internet for blog posts describing how to make one. As primarily an R user, I was happy to find the Keras package by RStudio for making neural networks in R.
I loaded the data into R and then checked out the RStudio Keras package example for generating text with neural networks. I modified the code to work with license-plate data; the RStudio example was for generating sequences of long text, but I wanted to train on seven-character license plates. This meant creating multiple training data points for my model from each license plate (one data point to predict each character in the license plate).
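That expansion step can be written in a few lines. Here’s an illustrative Python version (the actual project used R and Keras; the padding character and the fixed seven-character width are my assumptions for the sketch):

```python
def training_pairs(plate, pad="_"):
    """Expand one license plate into (prefix, next_char) training examples.

    One example per character: the model sees the characters so far
    (left-padded to a fixed width) and learns to predict the next one.
    """
    examples = []
    for i in range(len(plate)):
        prefix = plate[:i].rjust(7, pad)  # plates are at most 7 characters
        examples.append((prefix, plate[i]))
    return examples
```

A four-character plate becomes four training examples, so even a modest list of plates yields a reasonably sized training set.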
Next, I trained the neural network model, although it didn’t work at first. After putting the project down for a month, I came back and realized that my data wasn’t being processed quite right. When I fixed this problem, the results that the neural network generated were fantastic. Ultimately, even though I didn’t change the RStudio example much, by the end, I felt much more confident in creating and using neural networks.
I wrote a blog post about the project that walks through how I got the data, how I processed it to be ready for the neural network, and how I modified the RStudio example code to work for me. The blog post was very much an “I’m new at neural networks, and here is what I learned” style of post; I didn’t pretend that I already knew how all this worked. As part of the blog post, I made an image that took the text output of my neural network and made it look like Arizona license plates. I also put the code on GitHub.
Since I wrote that blog post and made my code available, numerous other people have modified it to make their own funny neural networks. What I learned from this goofy project eventually helped me to make high-impact machine learning models for important consulting engagements. Just because the original work isn’t serious doesn’t mean that there isn’t value in it!
David Robinson is the co-author (with Julia Silge) of the tidytext package in R and the O’Reilly book Text Mining with R. He’s also the author of the self-published e-book Introduction to Empirical Bayes: Examples from Baseball Statistics and the R packages broom and fuzzyjoin. He holds a PhD in quantitative and computational biology from Princeton University. Robinson writes about statistics, data analysis, education, and programming in R on his popular blog: varianceexplained.org.
I first started blogging when I was applying for jobs near the end of my PhD, as I realized that I didn’t have a lot out on the internet that showed my skills in programming or statistics. When I launched my blog, I remember having the distinct fear that once I wrote the couple of posts I had ready, I would run out of ideas. But I was surprised to find that I kept coming up with new things I wanted to write about: datasets I wanted to analyze, opinions I wanted to share, and methods I wanted to teach. I’ve been blogging moderately consistently for four years since then.
I did get my first job from something I wrote publicly online. Stack Overflow approached me based on an answer I’d written on Stack Overflow’s statistics site. I’d written that answer years ago, but some engineers there found it and were impressed by it. That experience really led me to have a strong belief in producing public artifacts, because sometimes benefits will show up months or years down the line and lead to opportunities I never would have expected.
People whose résumés might not show their data science skills and who don’t have a typical background, like having a PhD or experience as a data analyst, would particularly benefit from public work. When I’m evaluating a candidate, if they don’t have those kinds of credentials, it’s hard to say if they’ll be able to do the job. But my favorite way to evaluate a candidate is to read an analysis they’ve done online. If I can look at some graphs someone created, how they explained the story, and how they dug into the data, I can start to understand whether they’re a good fit for the role.
The way I used to view projects is that you made steady progress as you kept working on something. In graduate school, an idea wasn’t very worthwhile, but then it became some code, a draft, a finished draft, and finally a published paper. I thought that along the way, my work was getting slowly more valuable.
Since then, I’ve realized I was thinking about it completely wrong. Anything that is still on your computer, however complete it is, is worthless. If it’s not out there in the world, it’s been wasted so far, and anything that’s out in the world is much more valuable. What made me realize this is a few papers I developed in graduate school that I never published. I put a lot of work into them, but I kept feeling they weren’t quite ready. Years later, I’ve forgotten what’s in them, I can’t find them, and they haven’t added anything to the world. If along the way I’d written a couple of blog posts, sent a couple of tweets, and maybe made a really simple open source package, all of those would have added value along the way.
I’ve built up a habit that every time I see a dataset, I’ll download it and take a quick look, running a few lines of code to get a sense of the data. This helps you build up a bit of data science taste; working on enough projects gives you a feel for which pieces of data are going to yield an interesting bit of writing and which might be worth giving up on.
My advice is that whenever you see the opportunity to analyze data, even if it’s not in your current job or you think it might not be interesting to you, take a quick look and see what you can find in just a few minutes. Pick a dataset, decide on a set amount of time, do all the analyses that you can, and then publish it. It might not be a fully polished post, and you might not find everything you’re hoping to find and answer all the questions you wanted to answer. But by setting a goal of one dataset becoming one post, you can start getting into this habit.
Don’t get stressed about keeping up with the cutting edge of the field. It’s tempting when you start working in data science and machine learning to think you should start working with deep learning or other advanced methods. But remember that those methods were developed to solve some of the most difficult problems in the field. Those aren’t necessarily the problems that you’re going to face as a data scientist, especially early in your career. You should start by getting very comfortable transforming and visualizing data; programming with a wide variety of packages; and using statistical techniques like hypothesis tests, classification, and regression. It’s worth understanding these concepts and getting good at applying them before you start worrying about concepts at the cutting edge.
Practical Data Science with R, 2nd ed., by Nina Zumel and John Mount (Manning Publications)
This book is an introduction to data science that uses R as the primary tool. It’s a great supplement to the book you’re currently holding because it goes much deeper into the technical components of the job. It works through taking datasets, thinking about the questions you can ask of them and how to do so, and then interpreting the results.
Doing Data Science: Straight Talk from the Frontline, by Cathy O’Neil and Rachel Schutt (O’Reilly Publications)
Another introduction to data science, this book is a mixture of theory and application. It takes a broad view of the field and tries to approach it from multiple angles rather than being a set of case studies.
R for Everyone, 2nd ed., by Jared Lander, and Pandas for Everyone, by Daniel Chen (Addison-Wesley Data and Analytics)
R for Everyone and Pandas for Everyone are two books from the Addison-Wesley Data and Analytics Series. They cover using R and Python (via pandas) from basic functions to advanced analytics and data science problem-solving. For people who feel that they need help in learning either of these topics, these books are great resources.
Think Like a Data Scientist: Tackle the Data Science Process Step-by-Step, by Brian Godsey (Manning Publications)
Think Like a Data Scientist is an introductory data science book structured around how data science work is actually done. The book walks through defining the problem and creating the plan, solving data science problems, and then presenting your findings to others. This book is best for people who understand the technical basics of data science but are new to working on a long-term project.
Getting What You Came For: The Smart Student’s Guide to Earning an M.A. or a Ph.D., by Robert L. Peters (Farrar, Straus and Giroux)
If you’ve decided to go to graduate school for a master’s or PhD, you’re in for a long, grueling journey. How to get through exams and qualifications, persevere through research, and finish quickly are things no one teaches you directly. Although this book is fairly old, the lessons it offers about how to succeed still apply to grad school today.
Bird by Bird: Some Instructions on Writing and Life, by Anne Lamott (Anchor)
Bird by Bird is a guide to writing, but it’s also a great guide for life. The title comes from something Anne Lamott’s father said to her brother when he was freaking out about doing a report on birds he’d had three months to do but had left to the last night: “Bird by bird, buddy. Just take it bird by bird.” If you’ve been struggling with perfectionism or figuring out what you can write about, this book may be the one for you.
Bootcamp rankings, by Switchup.org
https://www.switchup.org/rankings/best-data-science-bootcamps
Switchup provides a listing of the top 20 bootcamps based on student reviews. Although you may want to take the reviews and orderings with a grain of salt, this list is still a solid starting point for choosing which bootcamps to apply to.
What’s the Difference between Data Science, Machine Learning, and Artificial Intelligence?, by David Robinson
http://varianceexplained.org/r/ds-ml-ai
If you’re confused about what distinguishes data science from machine learning and artificial intelligence, this post offers one helpful way to tell them apart. Although there are no universally agreed-upon definitions, we like this taxonomy in which data science produces insights, machine learning produces predictions, and artificial intelligence produces actions.
What You Need to Know before Considering a PhD, by Rachel Thomas
https://www.fast.ai/2018/08/27/grad-school
If you’re thinking that you need a PhD to be a data scientist, read this blog first. Thomas lays out the significant costs of getting a PhD (in terms of both potential mental health costs and career opportunity costs) and debunks the myth that you need a PhD to do cutting-edge research in deep learning.
Thinking of Blogging about Data Science? Here Are Some Tips and Possible Benefits, by Derrick Mwiti
If chapter 4 didn’t convince you of the benefits of blogging, maybe this post will. Mwiti also offers some great tips on making your posts engaging, including using bullet points and new datasets.
How to Build a Data Science Portfolio, by Michael Galarnyk
This is an excellent, detailed post on how to make a data science portfolio. Galarnyk shows not only what types of projects to include (and not include) in a portfolio, but also how to incorporate them into your résumé and share them.