Introduction

For the third year running, we at O’Reilly Media have collected survey data from data scientists, engineers, and others in the data space about their skills, tools, and salary. Some of the same patterns we saw last year are still present: newer, scalable open source tools in general correlate with higher salaries, and Spark in particular continues to establish itself as a top tool. Much of this is apparent from other sources: large software companies that traditionally produced only proprietary software have begun to embrace open source; Spark courses, training programs, and conference talks have sprung up in great numbers. But who actually uses which tools (and are the old ones really disappearing)? Which tools do the highest earners use, and is it fair to attribute a particular variation in salary to using a certain tool? We hope that the findings in this iteration of the Data Science Salary Survey will go beyond what is already obvious to any data scientist or Strata attendee.

Preliminaries

This report is based on an online survey open from November 2014 to July 2015, publicized to the O’Reilly audience but open to anyone who had the link. Of the 820 respondents who answered at least one question, about a quarter dropped out before completing the survey and have been excluded from all segments of analysis except for those showing responses to single questions. We should be careful when drawing conclusions from a self-selecting sample (it is a major assumption to claim it is an unbiased representation of all data scientists and engineers), but with a little knowledge about our audience, the information in this report should be sufficiently qualified to be useful. As is clear from the survey results, the O’Reilly audience tends to use more of the newer, open source tools, and it underrepresents non-tech industries such as insurance and energy. O’Reilly content (in books, online, and at conferences) is focused on technology, in particular new technology, so it makes sense that our audience would tend to be early adopters of some of the newer tools.

A final word on the self-selecting nature of the sample: differences between results in this survey and other surveys may simply arise from the samples’ idiosyncrasies and not from any meaningful difference. Findings from other salary survey reports (there have been a few recently in the data space) sometimes conflict directly with our findings, but this doesn’t necessarily imply that one set of findings is erroneous. Likewise, discrepancies between our own salary surveys don’t necessarily imply a trend. The methodologies of this year’s survey and last year’s are close enough to allow us to draw some conclusions from year-to-year differences, but only when the numbers are very strong.

Introducing the Sample: Basic Demographics

Before we discuss salary we should describe who exactly took the survey. Despite the fact that this is a “data science” survey, only one-quarter of the respondents have job titles that explicitly identify them as “data scientists.” Of course, it is debatable how much meaning can be assumed simply from a job title (more on that later), but it’s safe to say that the data science world is inhabited by people who call themselves something else: by job title, 14% of the sample are analysts, 10% are engineers (usually “data,” “software,” or “analytics” engineers), 6% are programmers/developers, 3% are architects (of various kinds), 4% are in the business intelligence sector, and 1% are statisticians. Management is also present in the sample: managers (9%) and directors (5%) are the most significant groups, with a handful of VPs, CxOs, and founders as well. The rest of the sample consisted mostly of students, postdocs, professors, and consultants. Judging by the tools used by the sample, the vast majority, even the managers, had some technical side to their role, regardless of job title.

Beyond job title, the sample includes respondents from 47 countries and 38 states across multiple industries, including software, banking, retail, healthcare, publishing, and education. Two-thirds of the survey sample is based in the US, and compared to its share in population, California is disproportionately represented (22% of the US respondents, 15% of the total sample). The software industry’s 23% share is the largest among industries, and this excludes other “tech” industries such as IT consulting, computers/hardware, cloud services, search, and (computer) security; when considered in aggregate, these account for 40% of the sample. A third of the sample is from companies with over 2,500 employees, while 29% comes from companies with fewer than 100 employees. One-third of the sample is age 30 or younger, while less than 10% is older than 45.

In terms of education, 23% of the sample hold a doctorate, and 44% (not including the PhDs) hold a master’s. Many respondents reported being a “student, full- or part-time, any level”: aside from the 3% who gave job titles indicating full-time study (usually at the graduate level), 15% of the sample (data scientists, analysts, and engineers) said they were students. Two-thirds of respondents had academic backgrounds in computer science, mathematics, statistics, or physics.

Salary: The Big Picture

The median annual base salary of the survey sample is $91,000, and among US respondents it is $104,000. These figures show no significant change from last year.[1] The middle 50% of US respondents earn between $77,000 and $135,000. To understand how salary varies with respondent features, we introduce a linear model; for now we consider only basic demographic variables, but later we will introduce others that describe respondents’ work and skills in more detail. While looking at median salaries for a particular slice of respondents gives a general idea of how much a certain demographic might influence salary, a linear model is a simple way of isolating and estimating the “effect” of a certain variable.[2]

Management

Because the directors, VPs and CxOs, and founders, in this order, come from companies of decreasing size, their actual hierarchical level is more or less even (and, it turns out, so are their salaries), so we group them together when constructing salary models. We call this group “upper management” to distinguish them from regular “managers” (who include project and product managers), although it should be remembered that few, if any, respondents above the director level come from large companies. For the basic model we will ignore job title distinctions except for the two management categories. That is, the first model treats data “scientists” and data “analysts” the same. However, we exclude those respondents who are students.[3]

A basic, parsimonious linear model

We created a basic, parsimonious linear model using the lasso, with an R² of 0.382.[4] Most features were excluded from the model as insignificant:

70577 intercept
 +1467 age (per year above 18; e.g., 28 is +14,670)
 –8026 gender=Female
 +6536 industry=Software (incl. security, cloud services)
–15196 industry=Education
 –3468 company size: <500
  +401 company size: 2500+
+32003 upper management (director, VP, CxO)
 +7427 PhD
+15608 California
+12089 Northeast US
  –924 Canada
–20989 Latin America
–23292 Europe (except UK/I)
–25517 Asia

Base pay

Starting at a base salary of $70,577, we add $1,467 for every year of age past 18 (so the base for a 48-year-old is $114,587). Salaries at larger companies tend to be higher (add another $401 if your company has more than 2,500 employees, but subtract $3,468 if it has fewer than 500[5]), and the software industry is the only one to have a significant positive coefficient. Education has a negative coefficient; presumably, these are largely respondents who work at a university. Those in upper management take home an average of $32,000 extra in their base salary.
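
The arithmetic of the model can be sketched in a few lines of Python. The coefficients below are copied from the model listing; the dictionary keys and the function name estimate_salary are illustrative names chosen for this sketch, not part of the survey's own analysis. The result is a model estimate for a profile, not a prediction for any individual.

```python
# Coefficients copied from the basic lasso model listing (US dollars).
INTERCEPT = 70577
PER_YEAR_ABOVE_18 = 1467

# Illustrative keys (an assumption of this sketch) for the binary features.
ADJUSTMENTS = {
    "female": -8026,
    "industry_software": 6536,
    "industry_education": -15196,
    "company_under_500": -3468,
    "company_2500_plus": 401,
    "upper_management": 32003,
    "phd": 7427,
    "california": 15608,
    "northeast_us": 12089,
    "canada": -924,
    "latin_america": -20989,
    "europe_except_uk_ireland": -23292,
    "asia": -25517,
}


def estimate_salary(age, features=()):
    """Sum the intercept, the age term, and each applicable adjustment."""
    total = INTERCEPT + PER_YEAR_ABOVE_18 * (age - 18)
    for name in features:
        total += ADJUSTMENTS[name]  # raises KeyError for an unknown feature
    return total


# The base figure for a 48-year-old quoted in the text.
print(estimate_salary(48))
# A hypothetical 30-year-old PhD at a California software company.
print(estimate_salary(30, ["phd", "california", "industry_software"]))
```

Features that the lasso dropped from the model simply have no entry here, which is equivalent to a coefficient of zero.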

Gender

Just as in the 2014 survey results, the model points to a huge discrepancy in earnings by gender, with women earning $8,026 less than men in the same locations at the same types of companies. Its magnitude is lower than last year’s coefficient of $13,000, although this may be attributed to the differences in the models (the lasso has a dampening effect on variables to prevent over-fitting), so it is hard to say whether this is any real improvement.
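
The dampening effect of the lasso can be illustrated with a minimal sketch. It assumes the textbook special case of orthonormal features, where minimizing half the squared error plus lam times the sum of absolute coefficients gives each least-squares coefficient passed through a soft-thresholding step. The function name and the numbers are hypothetical and are not taken from the survey; the survey model was not fitted this way.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within lam of zero become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


# Hypothetical unpenalized coefficients (not survey values).
unpenalized = [5.0, -0.5, 1.5, -3.0]
penalty = 1.0  # hypothetical penalty strength

shrunk = [soft_threshold(z, penalty) for z in unpenalized]
print(shrunk)
```

Every surviving coefficient is pulled toward zero by the same amount, and small ones are removed entirely. This is why a lasso coefficient can be smaller in magnitude than the corresponding coefficient from an unpenalized model fitted to the same data.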

Geography

In terms of geography, the top-earning locations are California (+$16,000) and the Northeast (+$12,000; from NY/NJ into New England), while the rest of the country, as well as UK/Ireland and Australia/NZ, are estimated to be roughly equal. The rest of Europe, meanwhile, is much lower (–$23,000), not far off from Asia (–$26,000) and Latin America (–$21,000). Making reliable distinctions in salary between countries, as opposed to the continental aggregates, is not possible due to the relatively small non-US sample.

Education

According to this model, a PhD is worth about $7,400 (each year) to a data scientist. As for a master’s degree, its estimated contribution to salary was not significant enough for the algorithm to include it in this first model.
