Chapter 2. Data science companies

This chapter covers

  • Types of companies hiring data scientists
  • The pros and cons of each company type
  • The tech stacks you may see at different jobs

As discussed in chapter 1, data science is a wide field with lots of different roles: research scientist, machine learning engineer, business intelligence analyst, and more. Although the work you do as a data scientist depends on your role, it is equally influenced by the company where you’re working. Big company versus small, tech versus traditional industry, and young versus established can influence project focus, supporting technology, and team culture. By understanding a few archetypes of companies, you’ll be better prepared when you’re looking at places to work, either for your first data science job or your nth one.

The aim of this chapter is to give you an understanding of what some typical companies are like to work at each day. We’re going to present five fictional companies that hire data scientists. None of these companies are real, but all are based on research and our own work experiences, and they illustrate basic principles that can be broadly applied. Although no two companies are exactly alike, knowing these five archetypes should help you assess prospective employers.

Even though these stereotypes are based on what we’ve seen as trends in these industries, they’re certainly not gospel. You may find a company that totally breaks the mold of what we say here—or a specific team in the company that’s unlike the company itself.

Although the companies in this chapter are fake, all the blurbs you’ll see are by real data scientists working at real companies!

2.1. MTC: Massive Tech Company

  • Similar to: Google, Facebook, and Microsoft
  • Company age: 20 years
  • Size: 80,000 employees

MTC is a tech company with a massive footprint, selling cloud services, consumer productivity software such as a text editor, server hardware, and countless one-off business solutions. The company has amassed a large fortune and uses it to fund unusual research and development (R&D) projects such as self-driving scooters and virtual reality (VR) technology. Its R&D makes the news, but most members of the technical workforce are engineers who make incremental improvements to existing products: adding features, improving the user interface, and launching new versions.

2.1.1. Your team: One of many in MTC

MTC has nearly a thousand data scientists spread across the company. These data scientists are largely grouped into teams, each supporting a different product or division, or individually placed within a non-data science team to fully support it. There are, for example, VR-headset data scientists on one team, marketing data scientists on a second team, and VR-headset marketing data scientists on a third team, while the VR-headset supply chain team has its own data scientist too.

If you joined one of those data science teams, you would be onboarded quickly. Large organizations hire new people every day, so the company should have standard processes for getting you a laptop and access to data, as well as training you on how to use any special tools. On the team, you’d be tasked with doing data science for your particular area of focus. That area could include creating reports and charts that executives could use to justify funding projects. It could also include building machine learning models that would be handed off to software developers to put into production.

Your team is likely to be large and full of experienced people. Because MTC is a large, successful tech company, it has the broad footprint to draw in many good recruits. With so many people on the team, members may be working on nearly unrelated tasks; one person could be doing an exploratory analysis in R for a director, for example, and another could be building a machine learning model in Python for a sister team. The size of the team is a blessing and a curse: you have a large body of expert data scientists to discuss ideas with, but most of them probably don’t have familiarity with the particular tasks you are working on. Also, there is an established hierarchy on your team. The people with the more senior positions tend to be listened to more because they have more experience in the field and more experience with dealing with different departments at MTC.

The work your team does is likely a healthy balance of keeping the company running, such as making monthly reports and providing quarterly machine learning model updates, as well as doing new projects, such as creating a forecast that has never been done before. The team’s manager has to balance the flood of requests for data science work from other teams, which help those teams in the short term, with the desire to do innovative but unrequested work that may provide long-term benefits. With MTC’s large cash stores, the company can afford to do a lot more innovation and R&D than other companies, a fact that trickles down into willingness to try interesting new data science projects.

2.1.2. The tech: Advanced, but siloed across the company

MTC is a massive organization, and with organizations of this size, it’s impossible to avoid using different types of technology throughout the company. One department may store order and customer data in a Microsoft SQL Server database; a different department may keep records in Apache Hive. Worse, not only is the technology to store data disjointed, but also, the data itself may be. One department may keep customer records indexed by phone number; a different department may use email addresses to index customers.
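To make the mismatched-key problem concrete, here is a minimal Python sketch of how a data scientist might join two departments’ records. Everything in it is hypothetical: the sample records, the field names, the `join_departments` helper, and above all the assumption that some crosswalk table linking phone numbers to email addresses exists. In practice, finding or building that crosswalk is often the hard part.

```python
import re


def normalize_phone(phone: str) -> str:
    """Keep digits only so differently formatted numbers compare equal."""
    return re.sub(r"\D", "", phone)


def normalize_email(email: str) -> str:
    """Trim whitespace and lowercase so casing differences don't block a match."""
    return email.strip().lower()


# Hypothetical records: one department indexes customers by phone number...
orders_by_phone = [
    {"phone": "(555) 010-1234", "order_total": 120.0},
    {"phone": "555-010-9999", "order_total": 45.5},
    {"phone": "555.010.1234", "order_total": 30.0},
]

# ...and another department indexes customers by email address.
tickets_by_email = [
    {"email": "Ana@Example.com", "tickets": 2},
    {"email": "li@example.com", "tickets": 1},
]

# Assumed crosswalk linking the two identifiers (not every customer is in it).
crosswalk = [
    {"phone": "555-010-1234", "email": "ana@example.com"},
]


def join_departments(orders, tickets, crosswalk):
    """Combine order totals and ticket counts per customer, keyed by email.

    Returns (matched, unmatched): matched maps email -> combined record,
    unmatched lists orders whose phone number has no crosswalk entry.
    """
    phone_to_email = {
        normalize_phone(row["phone"]): normalize_email(row["email"])
        for row in crosswalk
    }
    ticket_counts = {
        normalize_email(row["email"]): row["tickets"] for row in tickets
    }

    matched, unmatched = {}, []
    for order in orders:
        email = phone_to_email.get(normalize_phone(order["phone"]))
        if email is None:
            unmatched.append(order)  # no way to link this order to an email
            continue
        customer = matched.setdefault(
            email, {"order_total": 0.0, "tickets": ticket_counts.get(email, 0)}
        )
        customer["order_total"] += order["order_total"]
    return matched, unmatched


matched, unmatched = join_departments(orders_by_phone, tickets_by_email, crosswalk)
```

Tracking the unmatched orders separately is deliberate: when two systems disagree on how to identify a customer, knowing how many records failed to link matters as much as the joined result.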

Most MTC-size companies have their own homemade technology stacks. Thus, as a data scientist at MTC, you have to learn the specific ways to query and use data that are particular to MTC. Learning these specialized tools is great for getting you more access within MTC, but the knowledge you gain can’t be transferred to other companies.

As a data scientist, you will likely have several tools to choose among. Because MTC is so big, it has plenty of support for major languages such as R and Python, which many people use. Some teams may also use commercial tools such as SAS or SPSS, but this situation is a bit rarer. If you want to use an unusual language that you enjoy but that few other people use, such as Haskell, you may or may not be able to, depending on your manager.

The machine learning stack varies dramatically depending on what part of the company you are in. Some teams use microservices and containers to deploy models efficiently, whereas others have antiquated production systems. The diversity in tech stack for deploying software makes it difficult to connect to other teams’ APIs; there is no one central location for learning about and understanding what is going on.

2.1.3. The pros and cons of MTC

Being a data scientist at MTC means having an impressive job at an impressive company. Because MTC is a tech company, people know what a data scientist is and what helpful things you can do. Having a universal understanding of your role makes the job a lot easier. The high number of data scientists in the company means that you have a large support network you can rely on if you are struggling, as well as smooth processes for joining the company and gaining access to required resources. You’ll rarely find yourself stuck and on your own.

Having lots of data scientists around you comes with cons as well. The tech stack is complex and difficult to navigate because so many people have built it up in so many ways. An analysis you’ve been asked to re-create may be written in a language you don’t know by a person who’s no longer around. It’ll be harder to stand out and be noticed because there are so many other data scientists around you. And you may find it difficult to find an interesting project to work on because so many of the obvious projects have already been started by other people.

Because MTC is an established company, working there gives you more job security. There is always the risk of layoffs, but working for MTC isn’t like working for a startup, where funding could dry up at any moment. Also, at large companies, managers lean more toward finding a new team for someone to work on rather than firing them; firing opens all sorts of legal complications and requires thorough documentation to back up the termination decision.

Something that’s both a pro and con of MTC is that people serve in many specialized roles within the company. Data engineers, data architects, data scientists, market researchers, and more all perform different roles that relate to data science, which means you’ll have lots of people to pass work off to. You have a low chance of being forced to create your own database, for example. This situation is great for passing off work outside your expertise, but it also means that you can’t stretch your skills.

Another con of MTC is the bureaucracy. In a large company, getting approvals for things like new technology, trips to conferences, and starting projects can require going far up the chain of command. Worse, the project you’ve been working on for years could be canceled because two executives are fighting, and your project is collateral damage.

MTC is a great company for data scientists who are looking to help solve big problems by using cutting-edge techniques—both decision scientists who want to do analyses and machine learning engineers who want to build and deploy models. Large companies have lots of problems to solve and a budget that allows for trying new things. You may not be able to make big decisions yourself, but you’ll know that you’ve contributed.

MTC is a poor choice for a data scientist who wants to be the decision-maker and call the shots. The large company has established methods, protocols, and structures that you have to follow.

2.2. HandbagLOVE: The established retailer

  • Similar to: Payless, Bed Bath & Beyond, and Best Buy
  • Company age: 45 years
  • Size: 15,000 employees (10,000 in retail stores, 5,000 in corporate)

HandbagLOVE is a retail chain with 250 locations across the United States, all selling purses and clutches. The company has been around for a long time and is filled with experts on how to lay out a store and improve the customer experience. The company has been slow to adopt new technology, taking plenty of time before getting its first website and its first app.

Recently, HandbagLOVE has seen its sales drop, as Amazon and other online retailers have eaten away at its market share. Knowing that the writing is on the wall, HandbagLOVE has been looking to improve via technology, investing in an online app and an Amazon Alexa skill, and trying to extract value from its data. HandbagLOVE has had financial analysts employed for many years calculating high-level aggregate statistics on its orders and customers, but only recently has the company considered hiring data scientists to help it understand customer behavior better.

The newly formed data science team was built on a base of financial analysts who previously made Excel reports on performance metrics for the company. As HandbagLOVE supplemented these people with trained data scientists, the team started to provide more-sophisticated products: monthly statistical forecasts on customer growth in R, interactive dashboards that allow executives to understand sales better, and a customer segmentation that buckets customers into helpful groups for marketing.

Although the team has made machine learning models to power the new reports and analyses, HandbagLOVE is far from deploying machine learning models into continuously running production. Any product recommendations on its website and app are powered by third-party machine learning products rather than having been built within the company. There is talk on the data science team about changing this situation, but no one knows how many years away that is.

2.2.1. Your team: A small group struggling to grow

The team leans heavily toward data scientists who can do reporting rather than those trained in machine learning, because machine learning is so new to the company. When members of the team have needed modern statistical and machine learning methods, they’ve had to teach themselves because no one around already knew them. This self-teaching is great in that people get to learn new techniques that are interesting to them. The downside is that some of the technical methods used may be inefficient or even wrong because there are no experts to check the work.

HandbagLOVE has laid out general paths for data scientists to progress into senior roles. Unfortunately, these career paths aren’t specific to data science; they’re high-level goals copied and pasted from other positions, such as software development, because no one really knows what the metrics should be. To progress in your career, you have to convince your manager that you’re ready, and with luck, your manager can get approval to promote you. On the plus side, if the team ends up growing, you’ll quickly become a senior person on the team.

Because the data science team provides reports and models for departments throughout the company (such as marketing, supply chain, and customer care), the members of the data science team are well known. This fact has given the team a great deal of respect within the company, and in turn, the data science team has a lot of camaraderie within it. The combination of the size of the team and the level of influence within the company allows data scientists to have far more influence than they would in other companies. It’s not unusual for someone on the data science team to meet with top-level executives and contribute to the conversation.

2.2.2. Your tech: A legacy stack that’s starting to change

A common phrase you hear when talking about technology at HandbagLOVE is “Well, that’s how it’s always been.” Order and customer data are stored in an Oracle database that’s directly connected to the cash register technology and hasn’t changed in 20 years. The system has been pushed well past its limits and has had many modifications bolted on. All that being said, the system still works. Other data is collected and stored in the central database as well: data collected from the website, data from the customer care calls, and data from promotions and marketing emails. All these servers live on the premises (on-prem), not in the cloud, and an IT team keeps them maintained.

Because all the data is stored in one large server, you have the freedom to connect and join the data however you want. And although your queries sometimes take forever or overload the system, usually you can find a workaround that gets you something usable. The vast majority of analyses are done on your laptop. If you need a more powerful computer to train a model, getting it is a hassle. The company doesn’t have a machine learning tech stack because it doesn’t have any in-house machine learning.

2.2.3. The pros and cons of HandbagLOVE

At HandbagLOVE, you have a lot of influence and the ability to do what you think is wise. You can go from proposing a customer lifetime value model to building it and using it within the company without having to persuade too many people to let you run with your idea. This freedom, which is due to a combination of the size of the company and the newness of data science, is very rewarding; you’re incredibly empowered to do what you think is best. The downside of this power is that you don’t have many people to call on for help. You’re responsible for finding a way to make things work or dealing with the fallout when things don’t work.

The tech stack is antiquated, and you’ll have to spend a lot of time making workarounds for it, which is not a great use of time. You may want to use a newer technology for storing data or running models, but you won’t have the technical support to do it. If you’re not able to set up any new technology entirely by yourself, you’ll just have to get by without using it.

A data scientist’s salary won’t be as high as it would be at bigger companies, especially tech ones. HandbagLOVE just doesn’t have the cash available to pay high data science paychecks. Besides, the company doesn’t need the best of the best anyway—just people who can do the basics. That being said, the salary won’t be terrible; it’ll certainly be well above what most people at the company make with similar years of experience.

HandbagLOVE is a good company to work at for data scientists who are excited to have the freedom to do what they think is right but perhaps aren’t interested in using the most state-of-the-art methods. If you’re comfortable using standard statistical methods and making more mundane reporting, HandbagLOVE should be a comfortable place to grow your career. If you’re really interested only in using state-of-the-art machine learning methods, you won’t find many projects to do at HandbagLOVE; neither will you find many people there who know anything about what you’re talking about.

2.3. Seg-Metra: The early-stage startup

  • Similar to: a thousand failed startups you haven’t heard of
  • Company age: 3 years
  • Size: 50 employees

Seg-Metra is a young company that sells a product that helps client companies optimize their websites by customizing them for unique segments of customers. Seg-Metra sells its product to businesses, not consumers. Early in its brief history, Seg-Metra got a few big-name clients to start using the tool, which helped the company get more funding from venture capitalists. Now, with millions of dollars at hand, the company is looking to scale in size quickly and improve the product.

The biggest improvement that the founders have been pitching to investors is adding basic machine learning methods to the product, which they have described as “cutting-edge AI.” With this new funding in hand, the founders are looking for machine learning engineers to build what was pitched. They also need decision scientists to start reporting on the use of the tool, allowing the company to better understand what improvements to make in the product.

2.3.1. Your team (what team?)

Depending on when a data scientist gets hired, they may very well be the first data scientist in the company. If they’re not the first, they’ll be among the first few data science hires and likely report to the one who was hired first. Due to the newness of the team, there will be few to no protocols—no established programming languages, best practices, ways of storing code, or formal meetings.

Any direction will come from that first data scientist hire. The culture of the team will likely be set by their benevolence. If that person is open to group discussion and trusts the other team members, the data science team as a whole will decide things such as what language to use. If that person is controlling and not open to listening, they will make these decisions themselves.

Such an unstructured environment can create immense camaraderie. The whole data science team works hard, struggles to get new technologies, methods, and tools working, and can form deep bonds and friendships. Alternatively, those who have power could inflict immense emotional abuse on those who don’t have power, and because the company is so small, there is little accountability. Regardless of exactly how Seg-Metra’s growth shakes out, the data scientists at this early-stage company are in for a bumpy and wild ride.

The work of the team can be fascinating or frustrating, depending on the day. Oftentimes, data scientists are doing analyses for the first time ever, such as making the first attempt to use customer purchase data to segment customers or deploying the first neural network to production. These first-time analyses and engineering tasks are exciting because they’re uncharted territory within the company, and the data scientists get to be the pioneers. On other days, the work can be grueling, such as when a demo has to be ready for an investor and the model still isn’t converging the day before. Even if the company has data, the infrastructure may be so disorganized that the data can’t feasibly be used. Although the work is chaotic, all these tasks mean that the data scientists learn lots of skills very quickly while working at Seg-Metra.

2.3.2. The tech: Cutting-edge technology that’s taped together

Because it’s a young company, Seg-Metra isn’t constrained by having to maintain old legacy technology. Seg-Metra also wants to impress its investors, which is a lot easier to do when your technology stack is impressive. Thus, Seg-Metra is powered by the latest and greatest methods of software development, data storage and collection, and analysis and reporting. Data is stored in an assortment of modern cloud technologies, and nothing is done on-prem. The data scientists connect directly to these databases and build neural network models on large Amazon Web Services (AWS) virtual machine instances with GPU processing. These models are deployed by means of modern software engineering methods.

At first glance, the tech stack is certainly impressive. But the company is so young and growing so fast that issues continually arise with the different technologies working together. When the data scientists suddenly notice missing data in the cloud storage, they have to wait for the overworked data engineer to fix it (and that’s if they’re lucky enough to have a data engineer). It would be great if Seg-Metra had a dedicated development operations (DevOps) team to help keep everything running, but so far, the budget has been spent elsewhere. Further, the technology was installed so quickly that even though the company is young, it would be difficult to monitor it all.

2.3.3. Pros and cons of Seg-Metra

As a growing startup, Seg-Metra has a lot of appeal. The growth of the company is providing all sorts of interesting data science work and an environment in which data scientists are forced to learn quickly. These sorts of positions can teach skills that jump-start a career in data science—skills like working under deadlines with limited constraints, communicating effectively with non-data scientists, and knowing when to pursue a project or to decide that it’s not worthwhile. Especially early in a career, developing these skills can make you much more attractive as an employee than people who have worked only at larger companies.

Another pro of working at Seg-Metra is that you get to work with the latest technologies. Using the latest tech should make your job more enjoyable: presumably, the new technologies coming out are better than the old technologies. By learning the latest tech, you should also have a more impressive résumé for future jobs. Companies looking to use newer technology will want you to help guide the way.

Although the pay is not as competitive as at larger companies, especially tech companies, the job does provide stock options that have the potential to be enormously valuable. If the company eventually goes public or gets sold, those options could be worth hundreds of thousands of dollars or more. Unfortunately, the odds of that happening are somewhere between getting elected to city council and getting elected to the U.S. Congress. So this fact is a pro only if you enjoy gambling.

One con of working at Seg-Metra is that you have to work very hard. Having 50- to 60-hour work weeks is not uncommon, and the company expects everyone to contribute everything they can. In the eyes of the company, if everyone isn’t working together, the company won’t succeed, so are you really going to be the one person to use all their vacation time in a year? This environment can be hugely toxic, ripe for abuse and a lot of employee burnout.

The company is volatile: it relies on finding new clients and on help from investors to stay afloat, so job security at Seg-Metra is low. It’s possible that in any year, the company could decide to lay off people or go under entirely. These changes can happen without warning. Job insecurity is especially difficult for people who have families, which causes the demographics of the company to skew younger. A young workforce can also be a con if you want to work with a more diverse, experienced team.

Overall, working at Seg-Metra provides a great opportunity to work with interesting technology, learn a lot quickly, and have a small chance of making a ton of money. But doing so requires an immense amount of work and may mean enduring a toxic environment. So this company is best for data scientists who are looking to gain experience and then move on.

Rodrigo Fuentealba Cartes, lead data scientist at a small government consulting company

The company I work at provides analytics, data science, and mobile solutions for governmental institutions, armed and law enforcement forces, and some private customers. I am the lead data scientist, and I am the only one in charge of data science projects in the entire company. We don't have data engineers, data wranglers, or any other data science roles there because the department is relatively new. Instead, we have database administrators, software developers, and systems integrators, and I double as a system/software architect and open source developer. That might look odd and definitely puts a strain on me, but it works surprisingly well.

One strange story from my job: I was working on a project that involved using historical information from many environmental variables, such as daily weather conditions. There was a lack of critically needed data because an area of study didn’t have weather stations installed. The project was in jeopardy, and the customer decided to shut the project down in a week if their people could not find the information.

I decided to fly to the area and interview some fishermen, and I asked how they knew that it was safe to sail. They said they usually sent a ship that transmitted the weather conditions over the radio. I visited a radio station, and they had handwritten transcripts of communications since 1974. I implemented an algorithm that could recognize handwritten notes and extract meaningful information, and then implemented a natural language processing pipeline that could analyze the strings. Thanks to going out to the field and finding this unusual data, the project was saved.

Gustavo Coelho, data science lead at a small startup

I have been working for the last 11 months in a relatively new startup which focuses on applying AI to HR management. We predict future performance of candidates or their likelihood of being hired by a certain company. Those predictions are aimed at helping speed up the hiring process. We rely heavily on bias mitigation in our models. It’s a small company; we have 11 people; and the data science team makes up five of them, including me. The whole of the company is dedicated to helping the data science team deliver the trained models into production.

Working at a small startup gives me the chance to learn new concepts and apply them every day. I love thinking about the best way to set up our data science processes so we can scale and give more freedom to our data scientists to focus on data science. HR is not a tech-savvy field, so more than half of the project length is spent explaining the solution to our clients and helping them get comfortable with the new concepts. And then when we finally get the go-ahead, there is also a lot of time spent coordinating with the client’s IT department to integrate into our data pipeline.

2.4. Videory: The late-stage, successful tech startup

  • Similar to: Lyft, Twitter, and Airbnb
  • Company age: 8 years
  • Size: 2,000 people

Videory is a late-stage, successful tech startup that runs a video-based social network. Users can upload 20-second videos and share them with the public. The company has just gone public, and everyone is ecstatic about it. Videory isn’t close to the size of MTC, but it’s doing well as a social network and growing the customer base each year. It’s data-savvy and has probably had data analysts or scientists for a few years now or even since the start. The data scientists on the team are very busy doing analyses and reporting to support the business, as well as creating machine learning models to help pair people with artists to commission work.

2.4.1. The team: Specialized but with room to move around

Videory is still at the point where you can gather all the data scientists in an extra-large conference room. Given the size of the company, the team may be organized in a centralized model. Every data science person reports to a data science manager, and all are in a single large department of the organization. The central data science team helps other groups throughout the company, but ultimately, the team sets its own priorities. Some data scientists are even working on internal long-term academic research projects that have no immediate benefits.

There’s specialization among the data science team at Videory, given the size of the company. There’s also some delineation among people who do the heavy machine learning, statistics, or analytics. Videory is small enough that it’s possible to switch between these groups over time. The data scientists usually have some interaction—such as training sessions, monthly meetings, and a shared Slack channel—that you wouldn’t find at companies like MTC, which are too big for everyone to talk together. The subteams are likely to use different tools, and a group of people with PhDs publish academic papers and do more theoretical work.

2.4.2. The tech: Trying to avoid getting bogged down by legacy code

Videory has a lot of legacy code and technology, and probably at least a few tools that were developed internally. The company is likely trying to keep up with tech developments, and it plans to switch over to a new system or supplement the existing ones with new technologies. As in most companies, a data scientist will almost definitely query a SQL database to get data. The company probably has some business intelligence tools as well, because there are a lot of non-data science consumers.

As a data scientist at Videory, you’ll definitely get to learn something new. Companies like this one have big data and systems to deal with it. SQL alone won’t be enough, because the company needs to process billions of events every month. You may get to try Hadoop or Spark when you need to pull out custom data that’s not stored in the SQL database.

The data science is typically done in R or Python, with plenty of experts available to provide assistance if things prove to be difficult. The machine learning is deployed through modern software development practices such as using microservices. Because the company is well known as a successful startup, lots of talented people work there, using their cutting-edge approaches.

2.4.3. The pros and cons of Videory

Videory can be a good size for data scientists; enough other data scientists are around to provide mentorship and support, but the team is still small enough that you can get to know everyone. Data science is recognized on the company level as being important, which means that your work can get recognition from vice presidents and maybe even the C suite (CEO, CTO, and so on). You’ll have data engineers to support your work. The data pipelines may get slow sometimes or even break, but you won’t be responsible for fixing them.

In an organization of more than 1,000 people, you’ll need to deal with inevitable political issues. You may be pressured to generate numbers that match what people want to hear (and can tell their bosses in order to get a bonus) or face unrealistic expectations about how fast something can be developed. You can also end up working on things that the business doesn’t really need because your manager asked you to. Sometimes, you’ll end up feeling that you’ve had no direction or your time was wasted. While it won’t change as much as at an early-stage startup, the organization will still change a lot; what’s a priority one quarter can be totally ignored the next.

Although other data scientists at Videory will be more knowledgeable than you on most data science topics, you might quickly become the expert on a specific one, such as time series analysis. This situation can be great if you like mentoring and teaching others, especially if your work supports taking time to learn more about your particular field of expertise by reading papers or taking courses. But it can be hard when you feel that no one can check your work or push you to learn new things. You’ll always have more to learn, but what you learn may not be in the area you want to focus on.

Overall, Videory provides a nice blend of some of the benefits of the other archetypes. It’s large enough that there are people around to provide help and assistance when needed, but not so large that requests get stuck in bureaucratic madness or departments overlap in scope. Data scientists who work at the company get plenty of chances to learn, but due to the specialization of roles, they don’t get the opportunity to try everything. This company is a great place for data scientists who are looking for a safe bet that provides chances to grow, but not an overwhelming number of chances.

Emily Bartha, the first data scientist at a midsize startup

I work at a midsize startup that has a product focused on insurance. As the first data scientist, I get to help define our strategy around using data and introducing machine learning into our product. I sit on the data team in the company, so I work very closely with data engineers, as well as our data product manager.

A day in my life at work starts with morning standup with the data team. We talk about what we have planned for the day and any blockers or dependencies. I spend a lot of time digging through data: visualizing, creating reports, and investigating quality issues or quirks in the data. I spend a lot of time on documentation too. When I code, I use GitHub, like the rest of the engineering team, and have team members review my code (and I review theirs). I also spend a good chunk of the day in meetings or side-of-desk collaboration with members of my team.

Having worked at bigger companies in the past, I love working at a small company! There is a lot of freedom to take initiative here. If you have an idea and want to work to make it a reality, no one will get in your way. Look for a company that has already made an investment in data engineering. When I arrived, there were already several data engineers and a strategy for instrumentation, data collection, and storage. When you work at a small company, things are constantly changing and priorities are shifting, which makes it important to be adaptable. People who enjoy diving deep on a project and working on it for months may not enjoy working at a startup, because it often requires developing solutions that are good enough and moving on to the next thing.

2.5. Global Aerospace Dynamics: The giant government contractor

  • Similar to: Boeing, Raytheon, and Lockheed Martin
  • Company age: 50 years
  • Employees: 150,000

Global Aerospace Dynamics (GAD) is a huge and rich company, bringing in tens of billions of dollars in revenue each year through various government contracts. The company develops everything from fighter jets and missiles to intelligent traffic-light systems. The company is spread across the country through various divisions, most of which don’t talk to one another. GAD has been around for decades, and many people who work there have been there for decades too.

GAD has been slow on the uptake when it comes to data science. Most of the engineering divisions have been collecting data, but they struggle to understand how it can be used in their very regimented existing processes. Because of the nature of the work, code must be ruthlessly tested and extremely unlikely to contain bugs, so the idea of implementing a machine learning model, which has limited predictability when live, is dicey at best. In general, the pace of work at the company is slow; the tech-world motto “Move fast and break things” is the polar opposite of the mentality at GAD.

With the number of articles on artificial intelligence, the rise of machine learning, and the need to use data to transform a business, the executives of GAD are ready to start hiring data scientists. Data scientists are showing up on teams throughout the organization, performing tasks such as analyzing engineering data for better reporting, building machine learning models to put into products, and working as service providers to help GAD customers troubleshoot problems.

2.5.1. The team: A data scientist in a sea of engineers

Although their role depends on where in GAD they are and what project they’re working on, the average data scientist is a single person on a team of engineers. At best, there may be two or three data scientists on your team. The data scientist has the job of supporting the engineers with analysis, model building, and product delivery. Most of the engineers on the team have only a very loose understanding of data science; they remember regressions from college but don’t know the basics of collecting data or feature engineering, the difficulties of validating a model, or how models get deployed. You’ll have few resources to help you when things go wrong, but because so few people understand your job, no one else might notice that things are going wrong.

Many of the engineers on the team will have been with the company for ten or more years, so they’ll have plenty of institutional knowledge. They’ll also be more likely to have the mindset “We’ve been doing things this way since I’ve been here, so why should we change?” That attitude will make it more difficult for ideas proposed by data scientists to be implemented. The slower nature of the defense industry means that people tend to work less hard than in other places; people clock in for 40 hours a week, but casually slipping down below that total isn’t unusual. At other companies, you can be overwhelmed by having too many tasks, whereas at GAD, the stress comes from not having enough work to do and being bored.

Promotions and raises are extremely formulaic, because managers must follow rules to reduce bias (and thus be less likely to get GAD sued) and also because that’s how things have been done for decades. Getting raises and promotions largely has to do with how many years you’ve worked at the company. Being an extremely hard worker may make your next promotion come a year earlier or earn you a somewhat higher bonus, but there’s little chance that a junior data scientist will rise quickly to become a lead data scientist. The flip side is that employees rarely get fired.

2.5.2. The tech: Old, hardened, and on security lockdown

Although the technology stack varies greatly between groups in GAD, it all tends to be relatively old, on-prem instead of in the cloud, and covered in security protocols. Because the data involved covers topics like fighter-jet performance, it’s essential for the company that the data isn’t leaked. Further, the company needs legal accountability for any technology it uses in case something goes wrong, so open source tends to be frowned upon. Although Microsoft SQL Server is more expensive than PostgreSQL, for example, GAD is happy to pay Microsoft the extra money, knowing that if there’s a security bug, they can call Microsoft to deal with it.

In practice, this setup looks like data being stored in SQL Server databases run by an IT team that’s extremely stingy about who has access to what. The data scientists are allowed to access the data, but they have to run Python on special servers that have limited internet access so that any libraries don’t secretly send data to foreign countries. If the data scientists want to use special open source software, there’s little chance that IT and security will approve it, which makes it much more difficult for the data scientists to work.

If code needs to be deployed to production systems, it tends to be deployed in traditional ways. GAD is just beginning to adopt modern methods of putting machine learning code into production.

2.5.3. The pros and cons of GAD

The pros of working at GAD are that the data science jobs are slow, comfortable, and secure. The less rigorous pace of the job means that you’re more likely to have energy left over when you get home for the evening. You’ll often find yourself with free time when you’re working, which you can spend reading data science blogs and articles without anyone complaining. The fact that few other people know the basics of data science means that you’ll have fewer people questioning you. And because GAD is a massive organization that’s worried about legal liabilities, you’d have to really underperform to get fired.

The main downside of working at GAD is that you’re less likely to learn new skills than you would be at other companies. You’ll likely be assigned to a single project for years, so the technologies and tools used for that project will quickly become mundane. Worse, the skills you do learn will be for outdated technology that isn’t transferable to other institutions. And although you won’t get fired easily, you also won’t get promoted easily.

GAD is a great place to work if you find a team doing projects that you find enjoyable and you don’t want work to be your life. Many people work for GAD for decades because it’s comfortable, and they’re happy with being comfortable. But if you demand challenges to keep you going, GAD might not be a good fit.

Nathan Moore, data analytics manager for a utilities company

The company I work at provides and sells power for hundreds of thousands of people, and the company is partially owned by the government. The company itself has around 1,000 employees spread across many different functions. My job involves investigating and prototyping new data sources and working with the database specialists to clean and document current data sources. We’ve got a bunch of legacy systems and new initiatives happening, so there’s always something to do.

At the moment, a day in the life involves meetings, reviewing specifications for ETL, trying out a new machine learning technique I found on Twitter, giving feedback on reporting, learning to use JIRA and Confluence, and answering many emails. In the past I’ve been involved in model development and assessment, data analysis when some overnight processing fails, and submissions to government on an industrywide review of the sector.

We’re large enough that we’ve got a good team of analysts to work on a variety of projects, from day-to-day reporting to large customer segmentation projects. I’ve had lots of opportunities to move around in the business and have worked here for 11 years. But since we have billions of dollars of assets, risk aversion is high within the company, and pace of change is a little bit slow. We have a large-enough IT department that can support everyday functions, but any significant project, like the systems upgrade, means resources are scarce for any nonpriority improvements. Everything needs to be justified and budget set aside, and there is plenty of politics to navigate.

2.6. Putting it all together

When you’re looking at companies to work for, you’ll find that many of them are similar to these companies in various ways. As you go through job applications and interviews, it can be helpful to try to understand the strengths and weaknesses of working at these companies (table 2.1).

Table 2.1. A summary of companies that hire data scientists

Criteria          MTC           HandbagLOVE  Seg-Metra  Videory   GAD
                  Massive tech  Retailer     Startup    Mid-tech  Defense
Bureaucracy       Lots          Little       None       Some      Lots
Tech stack        Complex       Old          Fragile    Mixed     Ancient
Freedom           Little        Lots         TONS       Lots      None
Salary            Amazing       Decent       Poor       Great     Decent
Job security      Great         Decent       Poor       Decent    Great
Chances to learn  Lots          Some         Lots       Lots      Few

2.7. Interview with Randy Au, quantitative user experience researcher at Google

Randy Au works on the Google Cloud team. Having worked in data science with a focus on human behavior for more than a decade, he blogs about how to think about working at startups and different types of companies at https://medium.com/@randy_au.

Are there big differences between large and small companies?

Yes. Usually, it’s more organizational and structural. There are points in a company where culture changes because of the scale. At a 10-person startup, everyone does everything because everyone’s wearing all the hats. Meanwhile, around 20 people, things start specializing. You start getting three- or four-person teams dedicated to specific things. People can think harder about certain things, and you don’t have to learn everything about a company. At around 80 to 100 people, the existing teams don’t scale anymore. Then there’s a lot more process around things. You don’t know everyone in the company anymore. You don’t know what everyone’s up to, and so there’s a lot more overhead to reach common understanding. Beyond that, after about 150 to 200 people, it’s impossible to know what’s going on around the company, so the bureaucracy has to exist. Then you go to Google, which is 100,000 people. There, you have no idea what most of the company is doing.

The smaller the company, the more likely you’re going to interact with everyone in the company. At a 40-person company, I would have the CEO sitting at my desk as we’re both exploring a dataset together. That will never happen in Google. But are you okay with the situation that happens in a lot of startups where you’re building an F1 car and you’re driving it at the same time, and everyone’s arguing whether you should have a steering wheel? When you’re the data person at a small company, the methods don’t really matter as much; you’re just trying to squeeze all the data and get some insights out of it. It’s okay to not be as rigorous so you can make decisions more quickly.

Are there differences based on the industry of the company?

Some industries have historically had math or data people. An insurance company has actuaries, for example. Those people have been around for a hundred years, and they really know their stats. If an insurance company is going to bring in data scientists, they come with a slightly different view on it. They already have this built-in structure for extremely talented stats people. They’re going to be filling a gap somewhere else: there’s going to be a gap in their big data or in optimizing their website or something.

Finance also has a long tradition of having quants. I remember failing a quant finance interview once because they did a code test. But as a data scientist, I just make sure my code is functional and gives the correct answer; I don’t think too hard about performance until it becomes a problem. Their coding test literally tested you on performance and dinged you points for not being performant automatically. I was like, “Oh, yeah, you guys are in finance. I get it.”

I think if you talk to everyone who’s doing data science work, the vast but silent majority are people who are doing this kind of grunt work that’s not sexy at all. I got a ridiculous amount of responses to the article I wrote about data science at startups that were people saying, “Yeah, this is my life.” This is not what people talk about when they talk about data science. It’s not the sexy “Here’s a new shiny algorithm I applied from this arXiv paper.” I don’t think I’ve applied anything in an arXiv paper in the 12 years I’ve worked. I’m still using regression because regression really works! I think that is the reality of it.

You’re going to be cleaning up your data; I don’t think there’s anyone even at the Facebooks and the Googles who doesn’t have to clean data. It might be slightly easier to clean up your data because there’s structure around it. But no, you’re going to have to clean up your data. It’s a fact of life.

What’s your final piece of advice for beginning data scientists?

Know your data. This does take a long time—six months to a year or more if it’s a complicated system. But your data quality is the foundation of your universe. If you don’t know your data, you’re going to make a really bizarre statement about something that your data just can’t let you say. Some people will say, “Oh, I have the number of unique cookies visiting my website, and that’s equal to the number of unique people.” But that’s not true. What about those people who are using multiple devices or browsers?

To really know your data, you need to make friends with the people with domain knowledge. When I was doing financial reports, I made friends with the finance people so I could learn the conventions accounting has about how they name things and the order of how things are subtracted. Maybe you got 50 million pages from this one IP, and someone else will realize that’s IBM. You won’t know all this stuff, but someone probably will.
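
The two pitfalls mentioned in this answer, treating unique cookies as unique people and overlooking a single IP address that dominates traffic, can be sketched in a few lines. This is only an illustration under stated assumptions: the log format, the cookie and user identifiers, and the IP addresses are all invented, and real web logs are far messier.

```python
from collections import Counter

# Invented page-view log: (cookie_id, user_id, ip_address).
# user_id is None when the visitor was not logged in.
page_views = [
    ("cookie_a", "user_1", "10.0.0.1"),  # user_1 on a laptop
    ("cookie_b", "user_1", "10.0.0.2"),  # the same person on a phone
    ("cookie_c", "user_2", "10.0.0.3"),
    ("cookie_d", None, "192.0.2.7"),
    ("cookie_e", None, "192.0.2.7"),
    ("cookie_f", None, "192.0.2.7"),
]

# Counting cookies overstates the audience: one person can hold several.
unique_cookies = len({cookie for cookie, _, _ in page_views})
known_people = len({user for _, user, _ in page_views if user is not None})

# A single IP with an outsized share of traffic deserves a question to
# someone with domain knowledge before it is counted as ordinary visitors.
views_per_ip = Counter(ip for _, _, ip in page_views)
top_ip, top_ip_views = views_per_ip.most_common(1)[0]
top_ip_share = top_ip_views / len(page_views)

print(f"unique cookies: {unique_cookies}, known logged-in people: {known_people}")
print(f"busiest IP: {top_ip} with {top_ip_share:.0%} of page views")
```

Neither number is "the" count of people. The point of the sketch is that the same log supports several different answers, and only knowledge of how the data was collected tells you which one you are allowed to report.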

Summary

  • Many types of companies hire data scientists.
  • Data science jobs vary, largely based on each company’s industry, size, history, and team culture.
  • It’s important to understand what kind of company you’re considering.