About the Authors

Tammy Butow is the principal SRE at Gremlin, where she works on Chaos Engineering, the facilitation of controlled experiments to identify systemic weaknesses. Gremlin helps engineers build resilient systems using its control plane and API. Tammy previously led SRE teams at Dropbox responsible for databases and storage systems used by over 500 million customers. Prior to this, Tammy worked at DigitalOcean and at one of Australia’s largest banks in security engineering, product engineering, and infrastructure engineering.

Michael Kehoe is a staff SRE at LinkedIn working on incident response, disaster recovery, visibility engineering, and reliability principles. He specializes in maintaining large system infrastructure, as demonstrated by his work at LinkedIn (applications, automation, and infrastructure) and at the University of Queensland (networks). Michael has also spent time building small satellites at NASA and writing thermal environments software at Rio Tinto.

Jay Holler is an engineering manager of the Core Infrastructure Services SRE team at Twitter in San Francisco, CA. Jay previously worked on the Twitter Command Center team, which was responsible for incident command and the uptime and availability of all Twitter properties, including Twitter, Vine, and Periscope. Prior to this, Jay worked for many years at Nasdaq in the NOC, which was responsible for all exchange platforms, including equities trading, options trading, FINRA reporting, and back office clearing.

Rodney Lester is the technical lead for the Reliability Pillar of the AWS Well-Architected Framework at Amazon Web Services. He implemented, tested, and operated many highly available applications before working with AWS customers as one of the first AWS professional services consultants. He has discovered that much of the knowledge he has gained is not commonly known. In addition to expanding his knowledge, he has educated customers and partners of AWS in how to build, test, and operate highly available systems.

Ramin Keene is the founder of fuzzbox.io, which brings together machine learning and chaos engineering to help companies explore the failure modes of their applications, uncover risk, and manage complexity safely. He was previously CTO at StockX and has helped large companies put machine learning into production and scale their data infrastructure. He is based in Seattle, WA.

Jordan Pritchard is director of infrastructure at SambaTV, where he is responsible for the SRE and Infrastructure Engineering teams. Jordan previously led teams at Cloudflare, Rackspace, and Sutter Health. He’s a passionate advocate for creating a culture of high reliability, and believes the first and second rules of SRE should be “Trust your people and give them autonomy,” followed by “Treat near misses with the same level of focus as you give to major incidents.”
