Chapter 2. Case Study 1: Moonshot

In this chapter, we discuss a large-scale project called Moonshot. We share several examples of tools, processes, and techniques that pushed the project forward and conclude with a postmortem of lessons learned from the project.

Overview

In 2010, the senior Storage SRE leadership declared that the Moonshot project would soon be underway. This project required teams to migrate all of the company’s systems from GFS1 to its successor, Colossus, by the end of 2011. At the time, Colossus was still a prototype, and this migration would be the largest data migration in the history of Google. The mandate was so ambitious that people dubbed the project Moonshot. As an internal newsletter to engineers put it:

If migrating all of our data in 2010 still sounds like a pretty aggressive schedule, well, yes it is! Will there be problems such as minor outages? Probably. However, the Storage teams and our senior VPs believe that it’s worth the effort and occasional hiccup, and there are plenty of incentives for early adopters, including reduced quota costs, better performance, and lots of friendly SRE support.

The initial communication completely undersold the effort, complexity, and difficulty of this project. In reality, it took a full two years to migrate all of Google’s services from GFS to Colossus.

GFS was designed in 2001 as Google’s first cluster-level file system. It supported many petabytes of storage and allowed thousands of servers to interact with thousands of clients. Machines ran a daemon called the chunkserver, which stored blocks of data (chunks) on the local filesystem. The GFS client code split files into a series of chunks and stored them on the chunkservers, replicating each chunk to other chunkservers for redundancy and bandwidth. The “GFS Master” kept the list of chunks for a given file along with other file metadata. GFS maintained shadow replicas so that a new GFS Master could be selected if the primary one was down.
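
To make this division of labor concrete, here is a minimal Python sketch of the metadata and data paths, assuming a 64 MB chunk size; the class names, fields, and the read_chunk call are illustrative stand-ins, not GFS’s actual interfaces.

    # Illustrative sketch of the GFS split between metadata (master) and
    # data (chunkservers). Names and structures are hypothetical.
    CHUNK_SIZE = 64 * 1024 * 1024  # GFS used 64 MB chunks

    class GFSMaster:
        def __init__(self):
            self.file_chunks = {}      # file path -> ordered list of chunk handles
            self.chunk_locations = {}  # chunk handle -> chunkserver addresses

        def lookup(self, path, offset):
            """Return the chunk handle and replica locations covering offset."""
            handle = self.file_chunks[path][offset // CHUNK_SIZE]
            return handle, self.chunk_locations[handle]

    def read(master, chunkservers, path, offset, length):
        """Client read: ask the master where the chunk lives, then fetch the
        bytes directly from one of the chunkservers holding a replica."""
        handle, replicas = master.lookup(path, offset)
        return chunkservers[replicas[0]].read_chunk(handle, offset % CHUNK_SIZE, length)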

GFS’s limitations only started to surface roughly six years later. These were some of the limitations encountered:

  • Google production clusters had grown larger, holding far more than the thousands of machines GFS was designed for.

  • User-facing, “serving” systems like Gmail increasingly used GFS as their backend storage. For these systems, failures lasting minutes resulted in outages that were no longer acceptable.

  • The GFS Master stored chunk locations in RAM, so the metadata it could track was limited by the maximum amount of memory you could physically put in a single machine (see the sketch after this list).

  • There was no persistent index of chunk locations, so restarting the GFS Master required recomputing the map of chunk locations. When a GFS cell (a virtual unit of measurement that represents a group of datacenter machines all managed by the same process) “restarted” for any maintenance reason, the master took 10–30 minutes to retrieve the full inventory of chunks.

  • The GFS Master itself ran on a single machine as a single process. Only the primary master processed mutations of file system metadata. Shadow master processes running on other machines helped share the workload by offering read-only access to the GFS metadata.

  • The GFS Master software couldn’t take full advantage of SMP hardware because portions of it were single threaded. Plans to make the software multithreaded were in the works, but even that would not have been enough to meet the growing demand within a few years.
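
As a rough illustration of why keeping all chunk locations in one machine’s RAM becomes the bottleneck, consider the back-of-envelope estimate below; the per-chunk metadata cost is an assumption for the example, not a published GFS figure.

    # Back-of-envelope estimate of GFS Master memory. The per-chunk metadata
    # cost is assumed; the point is that metadata grows linearly with data.
    CHUNK_SIZE = 64 * 1024 * 1024       # 64 MB chunks
    BYTES_PER_CHUNK_METADATA = 100      # assumed, for illustration only

    def master_ram_needed(logical_bytes, replication=3):
        chunks = logical_bytes * replication / CHUNK_SIZE
        return chunks * BYTES_PER_CHUNK_METADATA

    # 10 PB of logical data at 3x replication needs roughly 47 GiB of master
    # RAM under these assumptions -- already near the limit of a single
    # machine of that era.
    print(master_ram_needed(10 * 2**50) / 2**30, "GiB")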

In early 2006, Google developers made available the initial implementation of Colossus—the eventual, though not originally intended, replacement for GFS. The developers had built Colossus with the specific purpose of being a backing store for giant BigTables.2 By the summer of 2007, Colossus was developing into a proper cluster file system that could replace GFS as the cluster-level filesystem for BigTable. The developers set up the first production Colossus cell in January 2008, and videos began streaming directly from Colossus two months later.

Initially, many engineers were hesitant about the change to Colossus. Some resented the “mandate from above,” feeling that they “had no choice.” “This is a huge headache for us,” one engineer said in an interview.

Other engineers, however, were more optimistic. As Ben Treynor, VP of engineering, pointed out, “Moonshot is an extremely important initiative to Google. It is important enough that in order to make the fast progress necessary, we are willing to take greater risk in order to accomplish its goals—even to the point of taking outages, so long as we don’t lose Customer data.” This sentiment was summed up neatly by another engineer when they remarked, “It’s crazy but it might just work.”

Every year, Google’s user base and services increased, and GFS could no longer support this ever-evolving ecosystem. We needed to move our systems to a cluster-level file system that was fault tolerant, designed for large-scale use, enabled more efficient use of machine resources, and provided minimum disruption to user-facing services.

To say that the migration was complex is an understatement. What started as a four-person operation later grew to include engineers volunteering their time as 20%ers3 and, eventually, a dedicated team of 14–18 SREs per site to support the Colossus storage layer after the migration.

Pushing the project forward required most service owners to manually change their job configs to work with the new storage system. In other words, it required work from several SREs, software engineers (SWEs), technical program managers (TPMs), SRE managers, product managers (PMs), and engineering directors. Moonshot also exposed issues with systemic resource usage, spurring the Steamroller project, a separate effort to reorganize how machine-level resources were allocated across the entire Google production fleet.

The following diagram shows a high-level, very simplified view of what the overall Moonshot migration entailed.

Tools

As with any large-scale infrastructure change, creating the appropriate change structure and policies was critical to implementing the change successfully. Google SREs created migration tools that automated as much work as possible. These tools helped teams migrate to Colossus with less effort; some are described in detail next.

Quota and Storage Usage Dashboard

A custom-built dashboard used to identify how much quota each team historically used.

This was instrumental in showing trending resource usage across machines, teams, and product areas (PAs). Note that “resource” here means both storage and compute resources. For the Moonshot project, the dashboard identified where GFS quota could be reclaimed for Colossus. It became so effective that it eventually turned into a widely used and supported tool for viewing resource utilization across the machine fleet.
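
As a sketch of the kind of calculation the dashboard enabled, the snippet below estimates reclaimable GFS quota per team from trailing usage samples; the data layout, the 10% headroom, and the team names are assumptions for illustration.

    # Hypothetical aggregation behind the dashboard: given a team's quota and
    # its trailing usage samples, report how much GFS quota could be reclaimed
    # while preserving some headroom above peak usage.
    def reclaimable_quota(quota_bytes, usage_samples, headroom=0.10):
        peak = max(usage_samples)
        needed = peak * (1 + headroom)
        return max(0, quota_bytes - needed)

    teams = {
        "team-a": {"quota": 500 * 2**40, "usage": [300 * 2**40, 320 * 2**40]},
        "team-b": {"quota": 200 * 2**40, "usage": [190 * 2**40, 195 * 2**40]},
    }
    for name, t in teams.items():
        free = reclaimable_quota(t["quota"], t["usage"])
        print(f"{name}: {free / 2**40:.0f} TiB reclaimable")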

Quota Move Service

A custom-built service, designed specifically for Moonshot, that periodically freed quota from a source cell, down to a minimum threshold, and added the freed quota to the destination cell.

This service enabled automatic, fine-grained moves of quota from GFS to D (discussed later in “The Steamroller Project”), so that as storage usage shifted, the quota adjusted and maintained appropriate headroom for the service on both sides, avoiding disruptions. Think of moving water from one bucket to another, except that instead of just moving the water, the buckets were also resized (the destination bucket grew bigger, the source bucket grew smaller) so that each bucket kept a similar amount of empty space at all times.
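
The bucket analogy translates into a simple control loop. The sketch below, with an assumed 15% headroom target and a fixed move step, is one way such a loop could be written; it is not the actual service’s logic.

    # Sketch of the periodic quota-move decision: take quota away from the
    # source (GFS) only when it no longer needs it, and give it to the
    # destination (D/Colossus) only when it is running short of headroom.
    def rebalance(src_quota, src_usage, dst_quota, dst_usage,
                  min_headroom=0.15, step=10 * 2**40):
        """Return (new_src_quota, new_dst_quota), moving at most one step."""
        # Moving a step must not drop the source below its headroom target.
        if (src_quota - step) - src_usage < src_usage * min_headroom:
            return src_quota, dst_quota
        # Only move if the destination actually lacks headroom.
        if dst_quota - dst_usage >= dst_usage * min_headroom:
            return src_quota, dst_quota
        return src_quota - step, dst_quota + step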

Migration Planning and Scheduling Tool

A custom-built tool that provided a free quota loan when you moved and created your scheduled migration window, targeted at a time that was least disruptive to your service.

This tool analyzed your files’ directory structure, determined how to separate your data into chunks for moving, generated the namespace mapping files needed to copy data from GFS to Colossus, and generated the commands to perform the bulk data migration.
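
A simplified sketch of that planning step might look like the following; the directory walk, batch size, and the copytool command name are all placeholders rather than the real internal tooling.

    # Hypothetical planning pass: walk a GFS directory tree, batch its files,
    # and emit a namespace mapping file plus a copy command per batch.
    import os

    def plan_migration(gfs_root, colossus_root, batch_size=1000):
        batches, batch = [], []
        for dirpath, _, filenames in os.walk(gfs_root):
            for name in filenames:
                src = os.path.join(dirpath, name)
                dst = colossus_root + src[len(gfs_root):]
                batch.append((src, dst))
                if len(batch) == batch_size:
                    batches.append(batch)
                    batch = []
        if batch:
            batches.append(batch)
        return batches

    def emit_commands(batches):
        for i, batch in enumerate(batches):
            mapping = f"mapping_{i:04d}.txt"
            with open(mapping, "w") as f:
                for src, dst in batch:
                    f.write(f"{src} {dst}\n")
            print(f"copytool --namespace_map={mapping}")  # placeholder command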

Migration Tracking Tool

A custom-built frontend webserver used to keep track of all GFS-to-D migrations for Moonshot, create or update migrations for users, and identify the available capacity of each datacenter for migration needs.

The Moonshot team used this tool frequently to execute and monitor the project throughout its phases and to communicate the data back to relevant stakeholders for review.

Bulk Data Migration Service

An internally built service used for bulk copying files from one location on GFS/Colossus to another destination in production.

It is still used to this day for data copies of arbitrary sizes. The Moonshot team used it to move files from GFS to Colossus, one directory at a time.
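
In spirit, driving the bulk copy looked something like the loop below, with a basic size check after each directory; the copy primitive here uses the standard library as a stand-in for the internal service.

    # Stand-in for directory-at-a-time bulk copying with a post-copy size check.
    import os
    import shutil

    def dir_size(path):
        return sum(os.path.getsize(os.path.join(d, f))
                   for d, _, files in os.walk(path) for f in files)

    def migrate_directories(directories, gfs_root, colossus_root):
        for d in directories:
            src = os.path.join(gfs_root, d)
            dst = os.path.join(colossus_root, d)
            shutil.copytree(src, dst, dirs_exist_ok=True)
            assert dir_size(src) == dir_size(dst), f"size mismatch for {d}"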

These tools automated a non-trivial amount of work and helped the team manage and track both the migration and the quota, so that team members did not have to do so manually. The tools made the migration less troublesome and minimized human error. Beyond tools, however, there were additional ways to make the migration easier. One was to use processes to manage it.

Processes

We define processes as predefined methods of engagement, or methods used to execute the project. Moonshot used a number of processes within the team and outside the team to successfully move the project along.

In one internal process, the Moonshot team set up a weekly progress check-in meeting for those directly working on the project. People used this meeting to discuss updates on Colossus feature development, migration tooling development, Colossus rollout status, service migration status, risks and blockers on the critical path to be addressed, and so on. In some meetings, team members identified outdated documentation, missing procedures, and communication opportunities to increase awareness for affected customers. These turned into action items assigned to owners who followed up and provided an update at the next meeting if needed. In one instance, meeting attendees identified that the migration itself needed a separate meeting for more in-depth discussion, so they made that happen. Such a process facilitated communication within the team and helped team members manage the project more easily.

In another internal process, the Moonshot team divided the migration into phases, limiting the number of services impacted at the same time. Video and Bigtable were the first customers on Colossus, since GFS limitations hit them heavily and they were actively looking for a replacement storage system. Migrating these two early adopters helped the Moonshot team understand how long it would take to migrate a service at a per-cell level and what tactical steps the migration required (e.g., how to turn up Colossus production and test cells, when to add and remove quota, etc.). Later on, they set up a pilot phase for two quarters, and a few smaller services (e.g., Analytics, Google Suggest, and Orkut) elected to migrate to Colossus. With each phase, the team discovered and addressed complexities before proceeding. Lessons folded in later included defining the common required migration tasks per service, auditing file paths and permissions after a migration was complete, understanding how much quota was needed to turn up a Colossus cell while the service was still running in a GFS serving cell, and much more.

For processes external to the team, Moonshot created various communication channels to ensure ongoing feedback and prevent communication from getting lost in one person’s inbox. Some examples included creating a dedicated project announcement mailing list and user list (e.g., [email protected] and [email protected]), setting up office hours for one-on-one consulting on specific use cases, creating an exception procedure for folks who could not migrate by the targeted deadline, creating FAQs and migration instructions, and creating a form for users to submit feature requests or bugs they discovered. Each of these brought transparency to the questions and answers raised and gave people a forum to collectively help one another.

Using these processes within the team and outside the team made information flow easier. These processes encouraged regular feedback, as well as communication and information sharing, within the immediate team and between teams. This kept the project moving and prioritized execution.

We’ve seen that processes are important for managing infrastructure change, but when the infrastructure change involves migrating huge amounts of data, we need to consider capacity as well. We talk about that in the next section.

Capacity Planning

Shortly after the senior Storage SRE leadership announced Project Moonshot, the Moonshot team discovered that the storage migration from GFS and chunkserver-backed Colossus cells to D-backed Colossus cells required effectively creating CPU, memory, and storage resources out of thin air. The Borg4 ecosystem had no visibility into GFS chunkserver resource accounting, since GFS predated it, so there was not enough quota to turn up a D cell for the Moonshot migration. Therefore, the senior Storage SRE leadership announced the Steamroller project, an effort to address this problem, on an internal engineering announce list:

Because this is a prerequisite for the Moonshot migration, there is no “opt out” planned. In the case of extreme emergency, a limited deferment may be possible. Note: private cells and dedicated resources are exempted from this procedure, but are expected to have their own migration plans to accomplish Moonshot goals.

What is D?

Before we discuss the Steamroller project in more detail, we need to briefly introduce the D5 (short for “disk”) server, the Colossus equivalent of the GFS chunkserver, which runs on Borg. D formed the lowest storage layer and was designed to be the only application with direct read and write access to files stored on the physical disks attached to the machines it ran on. Similar to how a GFS chunkserver managed reading and writing the raw chunk data while the GFS Master kept track of the filesystem metadata, the D server managed access to the raw data while Colossus managed the map of what data was on which D server. The Colossus client talked to Colossus to figure out which D servers to read from or write to, and then talked directly to those D servers to perform the reads and writes.

See the following diagram for a visual overview.
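
To complement the diagram, here is a minimal Python sketch of the read path just described, assuming hypothetical class and method names: the client asks Colossus for the block map, then reads the bytes directly from the D servers.

    # Illustrative D-backed Colossus read path. All names are hypothetical.
    class ColossusMetadata:
        def __init__(self, block_map):
            self.block_map = block_map  # file path -> [(d_server, block_id), ...]

        def locate(self, path):
            return self.block_map[path]

    def read_file(metadata, d_servers, path):
        """Fetch each block straight from the D server that stores it."""
        data = b""
        for d_addr, block_id in metadata.locate(path):
            data += d_servers[d_addr].read_block(block_id)
        return data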

To get D-backed Colossus onto Borg, there needed to be storage resources available to spin up D servers in each cell. The Moonshot team created a separate dedicated team, drawn from the development team and SREs, to acquire resources for D through the Steamroller project.

The Steamroller Project

The Steamroller project primarily focused on recovering resources from the production fleet so that they could be repurposed to turn up D cells for Moonshot. The project also covered rolling out general critical infrastructure changes in order to minimize the number of disruptive events. This included rightsizing quota usage and applying Borg job resource limits, removing and preventing overallocation of cells, reinterpreting Borg job priorities to determine what gets scheduled first, and more.

In order to accomplish these goals, every team had to manually modify their configurations for all production jobs, one Borg cell at a time. They then had to restart their Borg jobs to match the new reallocation of capacity. The Borg job configurations had been written earlier when resources were plentiful and sometimes included generous padding for future growth possibilities. The modifications “steamrolled” these resources and also placed every Borg job into containers to impose these limits.

From a numbers perspective, the Steamroller project was successful. The team completed the project within the short span of a year, reclaiming a large amount of shared Borg resources from Borg jobs and returning a non-trivial amount back to the fleet. This enabled the Moonshot project to move forward since there were enough resources to allocate to D. From a change management perspective, however, several factors did not go well.

For example, the Steamroller team used a 90-day usage period to identify the 100th percentile for RAM and the 99th percentile for CPU. The team applied these as the new baseline limits for each Borg job after restarting it, but did not account for small spikes in RAM usage. As a result, if any task went over its new memory limit, even by a small amount, Borg killed it immediately, causing localized service disruptions and latency.
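
A minimal sketch of that baseline calculation, assuming per-task usage samples over the 90-day window, is shown below; the optional RAM buffer illustrates the spike allowance that was missing.

    # Sketch of deriving new Borg job limits from 90 days of usage samples:
    # max (100th percentile) for RAM, ~99th percentile for CPU. A ram_buffer
    # of 0 reproduces the spike problem described above.
    import statistics

    def new_limits(ram_samples, cpu_samples, ram_buffer=0.0):
        ram_limit = max(ram_samples) * (1 + ram_buffer)
        cpu_limit = statistics.quantiles(cpu_samples, n=100)[98]  # 99th pct
        return ram_limit, cpu_limit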

In another example, Google engineers felt that the Steamroller team communicated the project timeline in a pressing tone, which left them feeling that they had received notice of the upcoming change too late and had little information beyond the email. In hindsight, Moonshot’s initial schedule turned out to be overly optimistic, which is what drove the aggressive timeline, but this only became clear after a critical mass of work had already begun.

Finally, the Steamroller project failed to get its initial staffing request of at least four full-time TPMs and received only two. With more TPMs, the project could have engaged more with service owners during the initial service notifications, the exception review process, and the manual effort needed to overcome the lack of robust self-service tools. Management recognized the staffing shortfall at the time and chose to delay communication of the project, limit the automation effort, and reduce the project’s scope.

On a positive note, Steamroller did unblock the Moonshot project. The Moonshot project proceeded as shown in the Byte Capacity Utilization graph, which shows the increase in Colossus utilization and the decrease in GFS utilization within the fleet.

Looking back on Moonshot and Steamroller, there were many lessons learned that may be useful for your own organization. We explore what we learned next.

Lessons Learned

The Moonshot project was the first large-scale data migration ever implemented at Google. Moonshot did not reach its advertised goal of moving off GFS to Colossus/D within one year, but as Sabrina Farmer, VP of Engineering, noted, “[it] achieved several other very important but unstated goals. Focusing only on the advertised goal means we miss the other benefits a project brought.” We hope these lessons shed light on what went well and what did not, so you can apply them to your own infrastructure change.

The Moonshot project forced all teams to migrate by the target deadline.

People felt they had no choice with this declared mandate. The team asked service owners to “swap out a known and tested storage system for one that was incomplete and had a comparatively low number of ‘road miles.’” Rather than forcing all teams to migrate by a target deadline, it would have been better to collaborate closely with the teams supporting complex services and to give smaller teams more time to migrate in the background. Smaller teams often have less free capacity to dedicate to additional complex projects such as a migration.

The Moonshot team was composed of 20%ers.

The migration team consisted of a handful of SREs, SWEs, PMs, and TPMs from various teams, who volunteered to work 20% of their time to make the migration happen. Such a distributed team meant they had both broad and deep levels of domain knowledge (for BigTable, GFS, D, and Colossus) to push the project forward. However, the fact that they were 20%ers and based in different offices in varying time zones added more complexity to the project. Eventually, management pulled in these SREs to work 100% on the migration effort and grouped them together, but it would have helped to have had a core migration team from the start. Nevertheless, if it’s not possible to get what the team requested—such as what happened with Moonshot—you make do with what you have. Regarding Steamroller, one engineer summed this lesson up nicely when they said, “Develop for the team you have, not the one you’re promised.”

The Moonshot and Steamroller teams made conscious trade-offs between developing automation and meeting the deadline.

As a result of the aggressive deadline, the migration team did their best to automate the data migration as much as possible using the various tools discussed earlier. What would have helped is a robust workflow engine that automatically migrated data for users once a set of requirements was met. Much of the migration required users to run commands, wait, and then run more commands; a workflow engine would have reduced that overhead. In the words of one TPM on the Steamroller project, “It was very hands on . . . we had to look at a lot of monitoring as the migration happened and then manually move.” Before rolling out a large-scale infrastructure change in your organization, make a conscious assessment of the trade-offs (in the context of the triple constraint model)6 and understand what risks you will accept. This can surface important project considerations for effective discussion.

Each change as part of the Moonshot project caused a rippling effect of customer frustration.

Even though the migration team had published information about the migration, Colossus’s improved performance compared to GFS, and what people should expect, people still had questions and expressed frustration when the time came to migrate and when even more changes took place, such as the Steamroller project. We learned that it helps to widely advertise a large-scale infrastructure change through many different channels: large, Google-wide technical talks; office hours; user mailing lists; and company-wide announcements. Keep in mind, however, that there will always be someone unhappy with the change. The best you can do is reduce the blast radius of the change for affected users. You do this by automating as much of the manual work as possible, getting support from your technical influencers (i.e., your tech leads, engineering managers, or others with hierarchical seniority) to help disseminate information, and utilizing the different communication channels mentioned.

We applied these lessons learned from Moonshot to several infrastructure change management projects we worked on afterward, including our next case study, Diskless.

1 For an introduction to GFS, see this paper: http://bit.ly/2MRLcTO.

2 For more info about BigTables, please read this paper: http://bit.ly/2MOF1jp.

3 One day (20%) of a five-day week dedicated to a personal project. See Chapter 5 of The Google Way by Bernard Girard.

4 Cluster management system for scheduling and managing applications across all Google data centers. For more info, see this paper: http://bit.ly/2sduOFs.

5 For a high level storage overview, see this paper: http://bit.ly/38n5NrX.

6 Time, scope, and cost information is available online (https://www.pmi.org/learning/library/triple-constraint-erroneous-useless-value-8024).
