Chapter 1. The Shift to the Cloud

The rate of innovation is faster today than ever before. CIOs have an extremely tough job balancing “keeping the lights on” with delivering new features to market and keeping current with investments in new technologies. Pick up any trade magazine and you will see success stories of large companies adopting emerging technologies such as cloud computing, machine learning, artificial intelligence, blockchain, and many others. At the same time, companies are trying to adopt new ways to improve agility and quality. Whether it’s DevOps, Scaled Agile Framework (SAFe), Site Reliability Engineering (SRE), or another favorite buzzword, there is so much change coming at CIOs that it’s a full-time job just keeping up. Each new trend is designed to solve a specific set of problems, but it often takes a combination of tools, trends, methodologies, and best practices to deliver cloud at scale.

The Journey to the Cloud Is a Long, Hard Road

Even as CIOs embrace cloud computing, they have to follow the policies of corporate governance, risk, and compliance (GRC) and security teams. In many companies, those teams don’t welcome change; they usually have a strong incentive to make sure “we never end up in the Wall Street Journal” for any kind of breach or system failure. CEOs, however, also worry about appearing in the Harvard Business Review when the company goes bankrupt due to its dated, conservative view on technology adoption. CIOs are being asked to deliver more value faster in this very strained environment. The fear of risk and the need to develop new capabilities are competing priorities. Many organizations are responding by shifting traditional domain-specific functions like testing, security, and operations to the software engineering teams and even to Agile teams that are more aligned to business functions or business units.

The challenge this presents is that many IT people are not sufficiently skilled to take on these new roles effectively. It only takes one server with an open port to the web to cause your chief information security officer to lock everything down to the point where nobody can get timely work done in the cloud. When domain expertise shifts without corresponding changes in existing organizational structures, roles, responsibilities, and processes, the result is usually undesirable—and sometimes even catastrophic. IT leaders need to step back and redesign their organizations to build and run software in the cloud at scale. That means rethinking the entire software development value stream, from business value ideation to business value running in production. Additionally, since most journeys to the cloud take years, organizations must harmonize their processes across their legacy environments and new cloud environments.

<Value Stream Image>

Figure 1-1. Caption TK

Cloud Transformation

When we talk to companies about cloud transformation, we first try to establish what we mean by the term. Foundationally, a cloud transformation is the journey of a company to achieve two separate but related goals: to create a faster, cheaper, safer IT function, and to use its capabilities to build the company’s future. It can achieve the first goal by leveraging the wonderful array of new technologies available on demand in the cloud, from infrastructure as a service (IaaS) to artificial intelligence (AI), machine learning, the Internet of Things (IoT), Big Data, and other exciting new building blocks of digital business. Today, every company is a technology company. Those who use the best technology to build digital products, services, and customer experiences are best positioned in the global marketplace. A cloud transformation is truly a path to the future for any CIO.

No two cloud transformations are the same, but the patterns for success and failure are very common. Many companies that succeed in the cloud have learned tough lessons about the flaws in their initial thoughts and strategies. Nobody gets it right at the beginning, but if you start your transformation expecting some bumps and bruises along the way, you can start making progress. “Progress over perfection” is a principle in the spirit of the Agile Manifesto that is very appropriate for the cloud. In fact, many “born in the cloud” tech companies, like Netflix and Etsy, are on their second or even third generation of cloud architectures and approaches. The pace of change continues to increase once you arrive in the cloud.

Your company’s culture must embrace transparency and continuous learning; expect to adjust constantly and improve your cloud transformation over time. Your authors, Mike and Ken, attend a few conferences each year, such as AWS re:Invent, Google Cloud Next, and DOES (DevOps Enterprise Summit), and we hear great customer success stories. If you’re working at a company that hasn’t achieved that level of success, you can get disheartened at these conferences because it can seem like all the other companies are getting it right. However, many of those companies are battling with the same issues. Don’t be fooled; most of the presenters’ success stories only represent a single team (or product line or business unit) within a very large organization, not the entire organization. They started their journey with challenges similar to those your company is facing, and much of their organization may still be in the very early stages. Keep your chin up. This book will share a lot of lessons learned for what to do and, more importantly, what not to do as you embark on your cloud journey. Getting started can be hard, even daunting, but remember the words of Chinese philosopher Lao Tzu: “A journey of a thousand miles begins with a single step”.

A cloud transformation is a multiyear journey that is never really complete. The term cloud may be dropped one day (just as we really don’t say “client-server” systems anymore), but operating (and optimizing) in the cloud is a never-ending journey. What’s more important than getting it right at the beginning is actually starting. Too many organizations get caught up in trying to create the perfect strategy with low tolerance for risks and failures. These companies often have only years of strategy documents, PowerPoint decks, and a few consulting bills to show for their efforts, even as their competitors keep advancing. Why have their efforts created so little business value?

The reason some companies don’t get very far is that they don’t see the cloud as a transformation; they see it only as a technology project, like adopting a new tool. Being too conservative to allow the company to move forward and do anything significant in the cloud might be more of a failure than moving to the cloud and running into problems with availability and resiliency. At least with the latter the company is gaining experience with the cloud and increasing its maturity. Slowing or stopping a cloud transformation “done wrong” is very similar to what we saw in the 1990s, when companies stopped moving from mainframes to client-server architectures due to “client-server done wrong.”

When companies don’t recognize the need to transform their organization to build, operate, and think differently about software, they take their old business processes, tooling, and operating model with them to the cloud—which almost always results in failure. Even worse, sometimes they then declare victory—“Yes, we have a cloud!”—thereby further inhibiting the business.

The evolution of cloud adoption

<cloud maturity curve image>

Figure 1-2. Cloud maturity curve.

Coauthor Mike created this maturity curve based on customer requests he’d been receiving since 2013. He explains:

When I first started consulting in the cloud computing space, most of the client requests were for either a TCO (total cost of ownership) or an ROI (return on investment) analysis for a cloud initiative or overall cloud strategy. Many leaders had a hard sell to convince their CEO and board that cloud computing was the way forward. The cloud, like virtualization before it, was viewed as a cost-saving measure achieved by better utilization of resources. At that time, about 80% of the requests focused on private cloud, while only 20% were for the public cloud, almost exclusively AWS. In November 2013, at the annual re:Invent conference, AWS announced a wide variety of enterprise-grade security features. Almost immediately our phones rang off the hook with clients asking for public cloud implementations. A year later our work requests were completely flipped, with over 80% for public cloud and 20% for private cloud.

Why the private cloud isn’t

From 2005 to 2012, many large enterprises focused their cloud efforts on building a private cloud. Security and regulatory uncertainty made them believe they needed to retain complete control of their computing environments. The traditional hardware vendors were more than happy to encourage this point of view: “Yes, buy all of our latest stuff and you will be able to gain all the advantages you would get by going to a public cloud!” Quite a few Fortune 500 companies invested hundreds of millions of dollars to build world-class datacenters they promised would, through the power of virtualization and automation, deliver the same benefits as going to the cloud. While such efforts were often declared successes (they did tend to save money on hardware), they fell well short of turning the company into the next Uber or Amazon.

We have seen some companies jump all-in to the public cloud with positive results. As adoption of the public cloud increased, companies moved or built new workloads in the cloud at rates much faster than they had traditionally deployed software. However, two common antipatterns emerged.

Wild West antipattern

Developers, business units, and product teams now had access to on-demand infrastructure and leveraged the cloud to get product out the door faster than ever. Since the cloud was new to the organization, there was no set of guidelines or best practices. Development teams were now taking on many responsibilities that they had never had before. They were delivering value to their customers faster than ever before, but often exposing the organization to more security and governance risks than before and delivering less resilient products. Another issue was that each business unit or product team was reinventing the wheel: buying their favorite third-party logging, monitoring, and security tools. They each took a different approach to designing and securing the environment and often implemented their own CI/CD toolchains, with very different processes.

Command and control antipattern

Management, infrastructure, security, and/or GRC teams put the brakes on access to the public cloud. They built heavily locked-down cloud services and processes that made developing software in the cloud cumbersome, destroying one of the key value propositions of the cloud—agility. We have seen companies take 3-6 months to provision a virtual machine in the cloud, something that should take only 5-10 minutes. The command-and-control cops would force cloud developers to go through the same ticketing and approval processes required in the datacenter. These processes were often decades old, designed when deployments occurred 2-3 times a year and all infrastructure consisted of physical machines owned by a separate team. Ken relates this story from his very early experiences with cloud adoption:

The company had decided on an aggressive cloud adoption plan. Being a large SAP shop, they were very excited about the elastic nature of the cloud. SAP environments across testing, staging, development, etc. can be very expensive, so there are rarely as many as teams would like. Shortly after the non-production environments moved to the cloud, I had the cloud infrastructure team show me how they had automated the provisioning of an entire SAP environment. What had previously taken months could now be done in a few hours! With excitement in my stride, I strolled over to the testing team.

“You must be thrilled that we can now provision a whole new SAP line in a few hours!” I exclaimed.

“What are you talking about?” they asked. I went on to explain what I had just learned about the automated provisioning of the SAP environment.

“Well, that’s not how it really works,” they told me. “If we want a new environment, we have to create a Remedy ticket requesting it. Our environment manager has 5 business days to review it. If he approves it without questioning it (he never does), it then goes to finance for approval. They meet monthly to review budget change requests. If the budget is approved, we then need to submit it to the architecture review board. They meet the third Friday of every month. That process typically requires at least two cycles of reviews. So I’m thrilled that someone can do it in a few hours, but I’m still looking at several months.”

Totally deflated, I realized the truth in the old saying: “We have met the enemy, and he is us.”

These two antipatterns drove a lot of work over the next few years. In the Wild West pattern, production environments became unpredictable and unmanageable due to a lack of rigor and governance. There were regular security breaches because teams did not understand the fundamentally different security postures of the public cloud, believed that security was “somebody else’s job,” or both. The command-and-control pattern created very little value while consuming large amounts of money on ongoing strategy and policy work and on building internal platforms that did not meet developers’ needs. Worse yet, this created an upsurge of shadow IT: groups or teams running their own mini IT organizations because their needs were not being met through normal channels.

All of these issues have created an awareness of the need for a strong focus on cloud operations and a new cloud operating model. Since 2018, one of our clients’ most frequent requests has been for help modernizing operations and designing new operating models.

Many of the companies we work with are 2-3 years or more into their journey. In those early years, they pay a lot of attention to cloud technologies. They improve their technical skills for building software and guardrails in the cloud. They often start at the IaaS layer because they are comfortable dealing with infrastructure. As their cloud experience matures, they realize that the true value of the cloud is higher up in the cloud stack, and they look into PaaS and SaaS services.

<cloud value image>

Figure 1-3. Cloud value.

Development shops have been embracing automation and leveraging concepts like continuous integration (CI) and continuous delivery (CD). The rest of this book will focus on the impact of concepts like DevOps, cloud native architecture, and cloud computing on traditional operations.

From Hardened Datacenter to Blank Canvas

When you start building in the public cloud, you are basically starting from scratch. You have no existing cloud datacenter, no guardrails, no financial management tools and processes, no disaster recovery or business continuity plan—just a blank canvas. Conventional wisdom is to just apply all the tools, processes, and organizational structures from the datacenter to the cloud. That’s a recipe for disaster.

Regardless of how well or badly an organization manages its datacenter, people are accustomed to the existing policies and processes and generally know:

What to do when incidents, events, or outages arise

What processes to follow to deploy software

What the technology stack is for the products they are building and managing

What process to follow to introduce new technology

When applications are moved, refactored, or built new in the cloud, they are deployed to a brand-new virtual environment that is radically different from the environments people are used to in their existing datacenters. The processes and policies governing how work gets done in a datacenter have typically evolved over many years of change, across numerous shifts in technology: mainframes, client-server architectures, internet-enabled applications, and today’s modern architectures. Many of these processes were defined in a different era: in the 1980s, a gigabyte of storage cost hundreds of thousands of dollars, while in the cloud it’s about two cents per month. Human labor was the cheap component. Along with these legacy processes comes a whole host of tools, many of which are legacy themselves and were never intended to support software that runs in the cloud.

Too often, teams from infrastructure, security, GRC, and other domain-specific areas insist on sticking to their existing tools and processes. If these tools are not cloud native, or at least cloud friendly, a painful integration must take place to make them work effectively.

This creates unnecessary friction for getting software out the door. It can also create complexity, which can increase costs, reduce performance, and even reduce resiliency. Another issue is that these legacy tools are often tied to legacy processes, which makes it challenging and sometimes impossible to automate the end-to-end software build and release processes.

Another common antipattern is the desire to keep an existing on-premises logging solution in place rather than moving to a cloud-native solution. When you do this, all logs must be sent from the public cloud back to the datacenter through a private channel, incurring data transfer costs and creating an unnecessary dependency on the datacenter. These legacy logging solutions often have dependencies on other software solutions, as well as processes that create dependencies between the cloud and the datacenter. This means that a change in the datacenter can cause an outage in the cloud because nobody knew of the dependency. These issues are very hard to debug and fix quickly.

Here is another example. We did an assessment of a client’s tools, recommended tools that would work well in the cloud, and advised them on which ones should be replaced by a more cloud-suitable solution. One we recommended replacing dealt with monitoring incoming network traffic. The group that managed the tool refused to look into a new tool because they were comfortable with the existing tools and didn’t want to have to manage two tools. This created a single point of failure for all of the applications and services running in their public cloud. One day the tool failed and no traffic was allowed to flow to the public cloud, thus taking down all cloud applications.

The lesson here is to evaluate replacements for tools that are not well suited for the cloud. We often see resistance to process change lessen when new tools are brought in; tools are often tightly tied to the processes they are used in. Try to reduce the number of dependencies that the applications running in the public cloud have on the datacenter, and have a plan to mitigate the failure of any datacenter dependency.

The new cloud operating model that we are advocating brings domain experts closer together. We hope it will reduce these avoidable incidents as companies rethink their approach to the cloud.

Shared Responsibility: The Datacenter Mindset versus the Cloud Mindset

A common mistake that companies make is treating the cloud just like a datacenter. They think in terms of physical infrastructure instead of leveraging cloud infrastructure as a utility, like electricity or water. There are two major capabilities that are different in the cloud when compared to most on-premises infrastructures. First, in the public cloud, everything is software defined and software addressable (that is, it has an API). This creates an incredible opportunity to automate, streamline, and secure systems. While software-defined everything has made significant strides in the datacenter in the last decade, most of our clients still have major components that must be configured and cared for manually. The second major difference in the public cloud is the inherent design for multi-tenancy. This “from the ground up” view of multi-tenancy has driven a high degree of configuration isolation in the cloud.

Here’s an example. In most companies, there are one or two engineers who are allowed to make DNS changes. Why is that? Because the tooling we often use on-premises does not isolate workloads (or teams) from each other. This means that if we let Joe manage his own DNS, he might accidentally change Sue’s DNS, causing disruption. So we have made sure that only David and Enrique are allowed to change DNS for everyone in the whole company. In contrast, in the cloud, everyone’s accounts are naturally isolated from each other. Joe can have full authority over his DNS entries while he might not even be able to browse, let alone change, Sue’s entries. This core difference is often overlooked and is one of the key facets that allows for self-service capability in the public cloud.
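To make the isolation concrete, here is a minimal sketch, assuming AWS (IAM plus Route 53) as the provider; the hosted zone ID and policy name are hypothetical placeholders, not anything from a real account. The policy grants Joe full control over record sets in his team’s hosted zone only, so Sue’s zone is simply out of reach:

```python
# Hypothetical sketch: scoping DNS self-service to a single hosted zone with AWS IAM.
# The zone ID and statement ID are made-up placeholders.
import json

joe_dns_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ManageOnlyJoesTeamZone",
            "Effect": "Allow",
            "Action": [
                "route53:ChangeResourceRecordSets",
                "route53:ListResourceRecordSets",
            ],
            # Joe's team's hosted zone only; Sue's zone is a different resource ARN,
            # so Joe cannot even list her records, let alone change them.
            "Resource": "arn:aws:route53:::hostedzone/Z0000000EXAMPLE",
        }
    ],
}

print(json.dumps(joe_dns_policy, indent=2))
```

Because permissions are attached to isolated, software-defined resources rather than to a shared appliance, self-service stops being a risk that only two trusted engineers can absorb.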

Enterprises that have been building and running datacenters for many years often have a challenge shifting their mindset from procuring, installing, maintaining, and operating physical infrastructure to a cloud mindset where infrastructure is consumed as a service. (Randy Bias has memorably described the difference between physical servers and cloud servers as being like the difference between pets and cattle: one is named and cared for personally, the other is numbered and replaceable.)

An analogy we like to use is buying a house versus renting one. It really boils down to assets that are purchased versus assets that are rented, and the responsibilities that go along with each. When you buy a house, you are investing in both the property and the physical structure(s) on that property. You are responsible for maintaining the house, landscaping, cleaning, and everything else that comes with home ownership. When you rent, you are paying for the time that you inhabit the rental property. It is the landlord’s responsibility to maintain it. The biggest difference between renting and buying is what you, as the occupant of the house, have control over. (And just as people get more emotionally attached to homes they own than to apartments they rent, plenty of infrastructure engineers have true emotional attachments to their servers and storage arrays.)

When you leverage the cloud, you are renting time in the cloud provider’s “house.” What you control is very different from what you control in your own datacenter. For people who have spent much of their career defining, designing, and implementing processes and technologies for the controls they are responsible for in their datacenter, shifting some of those controls to a third party can be extremely challenging.

The two groups who probably struggle the most to grasp the cloud shared-responsibility model are auditors and GRC teams. These teams have processes and controls for physically auditing datacenters. When you pause to think about it, physically evaluating a datacenter is a bit of a vestigial process. Sure, 50 years ago nearly all IT processes (including application development) probably happened in the datacenter building, but today many datacenters run with skeleton crews, and IT processes are distributed across many locations, often globally. Yet the auditors expect to be able to apply these exact processes and controls in the cloud. The problem is, they can’t. Why? Because these datacenters belong to the cloud service providers (CSPs), who have a duty to keep your data isolated from their other clients’ data. Would you want your competitor walking on the raised floor at Google where your software is running? Of course not. That’s just one simple example.

At one meeting we attended, a representative of one of the CSPs was explaining how they handle live migrations of servers, which they can run at any time of day with no impact on customers. The client was adamant about getting all of the CSP’s logs to feed into their company’s central logging solution. Under the shared-responsibility model, the CSP is responsible for logging and auditing the infrastructure layer, not the client. The client was so used to being required to store this type of information for audits that they simply would not budge. We finally had to explain that under the new shared-responsibility model, that data would no longer be available to them. We asked where they stored the logs for failed sectors in their disk arrays and how they logged CRC (cyclic redundancy check) errors in their CPUs. Of course, they didn’t.

We explained to the client that they would have to educate their audit team and adjust their processes. To be clear, the policy that required the client to store those logs is still valid; how you satisfy that policy in the cloud is completely different. If the auditors or GRC teams cannot change their mindset and come up with new ways to satisfy their policy requirements, they might as well not go to the public cloud. But does an auditor or a GRC team really want to hold an entire company back from leveraging cloud computing? Should the auditor be making technology decisions at all? A key task in the cloud modernization journey is the education of these third-party groups that have great influence in the enterprise. As technology becomes more capable and automated, the things that we have to monitor will change—because the risk profile has changed fundamentally.

In the datacenter world, teams are traditionally organized around skill domains as they relate to infrastructure. It is common to find teams responsible for storage, for network, for servers, for operating systems, for security, and so forth. In the cloud, much of this infrastructure is abstracted and available to developers as an API call. The need to create tickets and send them off to another team to perform a variety of tasks to stand up physical infrastructure like a SAN (storage area network) simply does not exist in the public cloud. Developers have access to storage as a service and can simply write code to provision the necessary storage. This self-service ability is crucial to enabling one of the prizes of cloud transformation: higher-velocity IT.
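As a minimal sketch of what that looks like in practice, assuming AWS as the CSP and the boto3 SDK with valid credentials (the bucket name and object key are made-up placeholders), a developer can provision and use durable object storage with a couple of API calls:

```python
# A minimal sketch of "storage as a service": a developer provisions durable object
# storage with one API call instead of filing a ticket for a SAN allocation.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# In us-east-1 no LocationConstraint is needed; other regions require one.
s3.create_bucket(Bucket="acme-retail-orders-archive-example")

# The storage is immediately usable, with no hardware purchase and no hand-off
# to a separate storage team.
s3.put_object(
    Bucket="acme-retail-orders-archive-example",
    Key="orders/2021/01/order-12345.json",
    Body=b'{"order_id": 12345, "status": "shipped"}',
)
```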

Networking teams in the datacenter leverage third-party vendors who provide appliances, routers, gateways, and many other important tools required to build a secure, compliant, and resilient network. Many of these features are available as a service in the cloud. For areas where the cloud providers don’t offer the necessary network security functionality, there are many third-party SaaS or pay-as-you-go solutions available, either directly from the vendor or from the CSP’s marketplace. Procuring these solutions in the cloud, where they are consumed as SaaS, PaaS, or IaaS, is different from how similar tools are procured for the datacenter. In the public cloud, there are usually no physical assets being purchased. Gone are the days of buying software and paying 20-25% of the purchase price for annual maintenance. In the cloud you pay for what you use, and pricing is usually consumption based.

Use what you have vs. Use what you need

Before cloud computing was an option, almost all of the development we were involved in was deployed within datacenters that our employers and clients owned. Each piece of the technology stack was owned by specialists in that given technology. For databases, there was a team of DBAs (database administrators) who installed and managed software from vendors like Oracle, Microsoft, Netezza, and others. For middleware, there were system administrators who installed and managed software like IBM’s WebSphere, Oracle’s WebLogic, Apache Tomcat, and others. The security team owned various third-party software solutions and appliances. The network team owned a number of physical and software solutions, and so forth. Whenever development wanted to leverage a different solution from what was offered in the standard stack, it took a significant amount of justification, for the following reasons:

The solution had to be purchased up front.

The appropriate hardware had to be procured and implemented.

Contractual terms had to be agreed upon with the vendor.

Annual maintenance fees had to be budgeted for.

Employees and/or consultants needed to be trained or hired to implement and manage the new stack component.

Adopting new stack components in the cloud, if not constrained by legacy thinking or processes, can be accomplished much more quickly, especially when these stack components are native to the CSP. Here are some reasons why:

No procurement is necessary if the solution is available as a service.

No hardware purchase and implementation are necessary if the service is managed by the CSP.

No additional contract terms should be required if the proper master agreement is set up with the CSP.

There are no annual maintenance fees in the pay-as-you go model.

The underlying technology is abstracted and managed by the CSP, so the new skills are only needed at the software level (how to consume the API, for example).

Let’s say that Acme Retail, a fictitious big-box retailer, has standardized on Oracle for all of its OLTP (online transaction processing) database needs and Teradata for its data warehouse and NoSQL needs. Now a new business requirement comes in, and Acme needs a document store database to address it. In the non-cloud model, adopting a document store database would require new hardware, new software licensing, DBAs trained in the new database technology, new disk storage, and many other stack components. The process required to get all of the relevant hardware and software approved, procured, implemented, and secured would be significant. In addition, Acme would need to hire or train DBAs to manage the new database technology.

Now let’s look at how much simpler this can be in the public cloud. If Acme’s CSP of choice offers a managed service for a document store database, most of these steps are eliminated entirely. Acme no longer needs new hardware, software licensing, additional DBAs to manage the database service, or new disk storage devices; there is no procurement at all. All Acme would need to do is learn how to use the API for the document store database to start building its solution.
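As an illustrative sketch only (the text doesn’t name a specific CSP or service, so here we assume AWS DynamoDB as the managed document store, a table that already exists with “sku” as its partition key, and placeholder item data), consuming the service is just a few API calls:

```python
# Sketch of consuming a managed document store, assuming AWS DynamoDB via boto3.
# Table name and attributes are illustrative placeholders; the table is assumed
# to exist already (created through the console or infrastructure as code).
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("AcmeProductCatalog")  # pay-as-you-go, no servers, no DBAs

# Store a semi-structured document; no schema migration or storage request needed.
table.put_item(
    Item={
        "sku": "WIDGET-42",
        "name": "Deluxe Widget",
        "attributes": {"color": "red", "sizes": ["S", "M", "L"]},
    }
)

# Read it back by key.
response = table.get_item(Key={"sku": "WIDGET-42"})
print(response.get("Item"))
```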

Of course, they would still need to work with the security team, the network team, the governance team, and others to get approvals, but much of the time it takes to adopt the new technology is eliminated. In Chapter 2 we will discuss shifting responsibilities closer to development (shifting left). But for the sake of this example, let’s assume that we still have silos for each area of technology expertise (network, storage, server, security, etc.).

Let’s say we have 120 days to deliver a new feature to the business. The best solution for the requirement would be a document store database like MongoDB. However, we estimate that the effort required to get approvals, procure all of the hardware and software, train or hire DBAs, and implement the database would exceed our 120-day window. Therefore, we decide to leverage Oracle, our standard relational database engine, to solve the problem. This is suboptimal, but at least we can meet our date. And remember, this is just the burden to get started; we are not yet factoring in the burden that comes from day-to-day change management as we define and build the solution.

This decision process repeats itself over and over from project to project, which results in a ton of technical debt because we keep settling for suboptimal solutions due to our constraints. Now let’s see how different this can all be if we are allowed to embrace a database-as-a-service solution in the public cloud.

After doing some testing in our sandbox environment in the cloud, we determine that the document store managed service on our favorite CSP’s platform is a perfect fit for our requirements. We can essentially start building our solution right away, because the database is already available in a pay-as-you-go model, complete with autoscaling. Leveraging stack components as a service can cut months from a project. It also allows you to embrace new technologies with a lot less risk. But most importantly, you no longer have to make technology compromises because of the legacy challenges of adopting new stack components. Unfortunately, we do see some companies, especially early in their cloud journeys, destroy this model by using the new technology with their old processes (approvals, hand-offs, and SLAs). This change-resistant approach is why we are so adamant about linking new processes with new technology.

To illustrate what this architecture looks like, here are two designs. The first diagram depicts Acme’s legacy architecture for a typical application (see Figure 1-4). Acme currently has standard stack components for a relational database and an enterprise data warehouse (EDW).

Consuming stack components as a service provides greater flexibility for architects. It is important that this is understood across all domains within IT. If it is not understood universally, there is a good chance that the legacy constraints will be forced upon the cloud architects and they will wind up building suboptimal greenfield solutions that will create new technical debt in the cloud.

Immutable Infrastructure

Earlier we discussed the pets/cattle analogy. Let’s expand on that thinking here. In the cloud, a best practice is to treat infrastructure as expendable (cattle). We use the term immutable to describe the practice of destroying and rebuilding a virtual machine (VM), network configuration, database, etc., in the cloud rather than modifying it in place. In the legacy model, servers exist continuously, and a lot of time and effort goes into keeping them healthy. Release planning in the legacy model usually requires a backout or rollback strategy. Even regular security patching is a headache that all teams dread. Removing software updates from a production system can often create even more issues than the unsuccessful release caused in the first place. Rollbacks can be extremely risky, especially in a complex system.

In the cloud, when a virtual machine is unhealthy, we can simply shut it down and create a new one. This allows us to focus our immediate attention on SLAs in areas such as availability, performance, and reliability, instead of spending time trying to determine what caused the issue. Once the system is back to being stable and meeting its SLAs, we can perform forensics on the data we captured from the terminated infrastructure. A best practice is to take a snapshot of the machine image, which can be used in conjunction with logging and monitoring tools to triage the problem.
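Here is a minimal sketch of that replace-instead-of-repair flow, assuming AWS EC2 via boto3; the instance ID, AMI ID, and instance type are placeholders, and production code would add waits and error handling:

```python
# Sketch: treat an unhealthy VM as expendable (AWS EC2 via boto3).
# IDs and instance type are made-up placeholders; error handling omitted.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

unhealthy_id = "i-0123456789abcdef0"

# 1. Capture the machine image for later forensics; logs and metrics already live
#    in the logging and monitoring platform, the image preserves disk state.
ec2.create_image(
    InstanceId=unhealthy_id,
    Name="forensics-" + unhealthy_id,
    NoReboot=True,
)

# 2. Terminate the unhealthy instance rather than troubleshooting it in place.
ec2.terminate_instances(InstanceIds=[unhealthy_id])

# 3. Launch a known-good replacement from the golden image used for the last release.
ec2.run_instances(
    ImageId="ami-0abc1234example",
    InstanceType="t3.medium",
    MinCount=1,
    MaxCount=1,
)
```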

Treating virtual machines in the cloud as immutable gives us many advantages for releasing software as well. Instead of designing complex rollback processes, we can implement a solution like the following process that is designed for web servers sitting behind a load balancer.

In this example, we have three web servers sitting behind a load balancer. The approach is to deploy the new software to three brand-new VMs and take a snapshot of the current VMs in case we need to back out changes. We then attach the new VMs to the load balancer and start routing traffic to them. We can either take the older VMs out of rotation at this point or simply make sure no traffic gets routed to them. Once the new release looks stable, we can shut down the old VMs and detach them from the load balancer to complete the deployment. If there are any issues, we can simply reroute traffic to the older VMs and then shut down the new VMs. This allows us to stabilize the system without introducing the risk of backing out the software. Once the system is stable, we can fix the issues without worrying about creating any new issues from a complex rollback scheme.
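The sketch below shows one way that swap might look, assuming an AWS Application Load Balancer target group and boto3; the ARN and instance IDs are placeholders, and health-check waits are omitted for brevity:

```python
# Sketch of the traffic swap described above, using an ALB target group via boto3.
# ARN and instance IDs are placeholders; health-check polling is left out.
import boto3

elbv2 = boto3.client("elbv2", region_name="us-east-1")

target_group_arn = (
    "arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/web/abc123"
)
old_instances = ["i-0aaa1111", "i-0bbb2222", "i-0ccc3333"]  # current release
new_instances = ["i-0ddd4444", "i-0eee5555", "i-0fff6666"]  # new release, already built

# Put the new VMs into rotation alongside the old ones.
elbv2.register_targets(
    TargetGroupArn=target_group_arn,
    Targets=[{"Id": i} for i in new_instances],
)

# Once the new targets pass health checks, drain and remove the old VMs.
elbv2.deregister_targets(
    TargetGroupArn=target_group_arn,
    Targets=[{"Id": i} for i in old_instances],
)

# Rollback is just the reverse: re-register old_instances, deregister new_instances.
```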

This is just one of many ways to embrace the concepts of immutable infrastructure for deployments. Companies with deep experience in this area can design advanced processes to create highly resilient systems, all of which can be fully automated.

Microservices and Monoliths

Historically, applications were built, packaged, and deployed as a large unit of work made up of thousands or even millions of lines of code. These large systems are known as monoliths. Monoliths have many disadvantages. First, they tend to be fragile: any change to a line of code could impact the entire system. Because of the large number of dependencies in these systems, monoliths are changed infrequently, with changes often bundled into quarterly, biannual, or even annual releases. These dependencies also create huge costs in regression testing (the enemy of high-velocity delivery). Infrequent changes reduce agility and create long wait times for customers to gain access to new features and products.

Many companies have moved to or are experimenting with microservices. A microservices architecture is “an approach to developing a single application as a suite of small services, each running in its own process and communicating with lightweight mechanisms, often an HTTP resource API,” according to Martin Fowler and James Lewis. Each service can be run on its own infrastructure, which can be a server, a virtual machine, or a container (or now, a serverless function). This style of architecture is often referred to as loosely coupled, because the services are independent of each other and not hard-coded to the underlying infrastructure.


Figure 1-7. Microservices Architecture. Source: Martin Fowler https://martinfowler.com/articles/microservices.html
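To make the idea concrete, here is a toy example of one such small service: a single HTTP resource API that owns one business capability. Flask is just one of many ways to implement it, and the endpoint and data are invented for illustration:

```python
# Toy microservice: one HTTP resource API owning a single capability (pricing).
# The route, SKUs, and prices are made-up examples.
from flask import Flask, jsonify

app = Flask(__name__)

# In a real system this data would live in the service's own datastore.
PRICES = {"WIDGET-42": 19.99, "GADGET-7": 4.50}


@app.route("/prices/<sku>")
def get_price(sku):
    if sku not in PRICES:
        return jsonify({"error": "unknown sku"}), 404
    return jsonify({"sku": sku, "price": PRICES[sku]})


if __name__ == "__main__":
    # Each service runs in its own process (or container) and is deployed independently.
    app.run(port=8080)
```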

The advantage of microservices is that each service can be deployed separately (because each can be tested separately), resulting in much more frequent deployments. Developers can make small changes or add new services without being dependent on any other changes to the overall product. When you hear of a company deploying multiple times a day, it typically has a microservices architecture or an architecture made up of many individual components.

The disadvantage is that managing a system made up of many individual parts can be challenging. Traditional monolithic systems are made up of three major parts: a web front end, a database, and backend processes where most of the business logic lives. In a microservices architecture, the system is made up of many independent services. Operating a service-based product requires new tooling, new processes (especially for building and deploying software), and new skills.

We will discuss microservices in more detail in Chapter 2.

The need for speed

We now build and deploy software faster than ever before.

Cloud computing is one of the strongest contributors to that speed of deployment.

The cloud providers offer developers a robust service catalog that abstracts the underlying infrastructure and a variety of platform services, allowing developers to focus more on business requirements and features. When cloud computing first became popular in the mid-to-late 2000s, most people used the infrastructure as a service (IaaS) model. As developers became more experienced, they started leveraging higher levels of abstraction. Platform as a service (PaaS) abstracts away both the infrastructure and the application stack (operating system, databases, middleware, etc.). Software as a service (SaaS) vendors provide full-fledged software solutions where enterprises only need to make configuration changes to meet their requirements and manage user access.

Each one of these three cloud service models (IaaS, PaaS, SaaS) can be a huge accelerator for the business. Businesses used to have to wait for their IT departments to procure and install all of the underlying infrastructure and application stack, and then build and maintain the entire solution. Depending on the size of the application, this could take several months or even years.

In addition to the cloud service models, technologies like serverless computing, containers, and fully managed services (for example, databases as a service, blockchain as a service, and streaming as a service) are providing capabilities for developers to build systems much faster. We will discuss each of these concepts in more detail in Chapter 2.
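As a small, hedged illustration of the serverless model, assuming AWS Lambda with a Python runtime (the event shape and field names are hypothetical), the team writes only the business logic below; the platform handles provisioning, scaling, and runtime patching:

```python
# Sketch of a serverless building block: an AWS Lambda handler in Python.
# The event shape and field names are illustrative; no servers are provisioned,
# patched, or scaled by the team that owns this code.
import json


def lambda_handler(event, context):
    """Handle a hypothetical order-created event and return a confirmation payload."""
    order = event.get("detail", {})
    # Business logic goes here; the platform handles scaling, availability,
    # and the underlying runtime.
    return {
        "statusCode": 200,
        "body": json.dumps(
            {"received_order": order.get("order_id"), "status": "accepted"}
        ),
    }
```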

All of these new ideas challenge our traditional operating models and processes. Adopting them in a silo, without addressing the impacts on people, processes, and technology across all of the stakeholders involved in the entire software development life cycle (SDLC), is a recipe for failure. We will discuss some of the patterns and antipatterns for operating models in Chapter 3.
