Back in Chapter 1, we discussed the core principles of FinOps. The principles help guide actions, and in Chapter 7 we discussed the FinOps Framework, which was built on those principles. This chapter covers the ongoing, iterative phases of the FinOps lifecycle and how to apply them to the capabilities of FinOps. The principles haven’t changed much since the first edition of this book, but we have seen many companies and organizations build and iterate on top of them, making them their own. Much like the FinOps Framework, which is a set of building blocks meant to be assembled in ways relevant to your organization, the principles are open source bedrock on which you can build a FinOps plan for your organization.
Again, the principles are as follows:
Teams need to collaborate.
Decisions are driven by the business value of cloud.
Everyone takes ownership of their cloud usage.
FinOps reports should be accessible and timely.
A centralized team drives FinOps.
Take advantage of the variable cost model of the cloud.
Let’s look at how each principle plays out in action with real-world ramifications, and later we’ll dig into how specific framework capabilities are designed to leverage them to achieve specific results.
First and foremost, FinOps is a cultural change that focuses on breaking down the silos between teams that historically haven’t worked closely together. When done well, the finance team uses language and reporting that moves at the speed and granularity of cloud, product managers fine-tune their application scaling forecasts to accommodate expected income from new features, while engineering teams consider cost as a new efficiency metric.
At the same time, the FinOps team works to continuously improve agreed-upon metrics for efficiency. They help define governance and parameters for cloud usage that provide some control, but focus first on ensuring innovation and speed of delivery can flourish alongside cost efficiency.
Adding a blameless culture to this principle enables the company to learn from mistakes. Removing the need to find a person or team to blame for a cost overrun lets a postmortem focus instead on how a similar overrun can be avoided in the future and what changes the organization needs to make in response to the incident.
If a culture of finger pointing and shaming for “doing” the wrong thing prevails, people will not bring issues to light for fear of punishment.
Betsy Beyer et al., Site Reliability Engineering (O’Reilly)
Think first about the business value of cloud spend, not the cost. It’s easy to think of cloud as a cost center, especially when the spend reaches material levels. The cloud is a value creator, but the more you use it, the more cost it will incur. The role of FinOps is to help maximize the value created by that spend. Instead of focusing on the cost per month, focus on the cost per business metric, and always make decisions with the business value in sight.
Cloud costs are based on cloud use, which comes with a straightforward correlation: if you’re using the cloud, you are incurring costs and thus are accountable for cloud spending. Embrace this fact by pushing cloud spend accountability to the edges of your organization, all the way to individual engineers and their teams. And give them the information and guidance to be able to do this important job for the organization.
In the world of per-second—or even microsecond—compute resources, unlimited cloud storage, shared Kubernetes clusters, automated deployments, and services that can incur costs based on externally controlled triggers, monthly or quarterly reporting of cloud spending isn’t good enough. Real-time decision making is about getting data—such as spend changes or anomaly alerts—quickly to the people who deploy and manage cloud resources.
As discussed in Chapter 8, FinOps data should be put in the path of the people making infrastructure decisions, giving them the information they need without the extra effort of having to find it. Real-time decisions enable these people to create a fast feedback loop through which they can continuously improve their spending patterns, make intelligent decisions, and improve efficiency.
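One way to get spend changes quickly to the people who deploy resources is a simple daily anomaly check. The following is a minimal sketch, not a production detector: the function name `is_spend_anomaly`, the three-standard-deviation threshold, and the sample figures are all assumptions made for illustration.

```python
from statistics import mean, stdev


def is_spend_anomaly(history, today, threshold=3.0):
    """Return True when today's spend is unusually high versus recent history.

    history:   recent daily spend values (at least two data points)
    today:     today's spend
    threshold: standard deviations above the mean that count as anomalous
    """
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Perfectly flat history: any increase at all is worth a look.
        return today > baseline
    return (today - baseline) / spread > threshold


# Hypothetical daily spend in dollars for one team over the past week.
last_week = [100, 102, 98, 101, 99, 100, 100]

spike_alert = is_spend_anomaly(last_week, 140)   # far above the usual range
normal_alert = is_spend_anomaly(last_week, 102)  # within normal variation
```

A check like this only helps if its result is routed to the team that owns the resources, for example through the chat or ticketing tools they already use.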
Focus relentlessly on clean data to drive decisions. FinOps decisions should be based on fully loaded and properly allocated costs. The costs should be amortized to include any prepayments made as part of commitment programs and should reflect the actual discounted rates a company is paying for cloud resources. They should also equitably factor in shared costs and be mapped to the business’s organizational structure. Without these adjustments to your spending data, your teams will make decisions based on incorrect data and hamstring value creation.
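To make "fully loaded" concrete, here is a small sketch of two of the adjustments described above: amortizing a prepayment over its term and distributing a shared cost pool in proportion to each team's direct spend. The function names, the proportional allocation rule, and all figures are hypothetical. Treating the amortized commitment as part of the shared pool is a simplification, since it could instead be attributed to the resources that benefited from it.

```python
def amortized_monthly_fee(upfront_payment, term_months):
    """Spread a commitment prepayment evenly across the months of its term."""
    return upfront_payment / term_months


def fully_loaded_costs(direct_costs, shared_pool):
    """Add each team's proportional share of a shared cost pool to its direct spend.

    direct_costs: mapping of team name to discounted direct spend for the month
    shared_pool:  costs to distribute, such as support fees
    """
    total_direct = sum(direct_costs.values())
    if total_direct <= 0:
        raise ValueError("need positive direct spend to allocate shared costs")
    return {
        team: round(cost + shared_pool * cost / total_direct, 2)
        for team, cost in direct_costs.items()
    }


# Hypothetical figures: a $12,000 one-year prepayment and a $500 support fee.
monthly_commitment = amortized_monthly_fee(12_000, 12)
shared_pool = monthly_commitment + 500

direct = {"checkout": 6_000, "search": 3_000, "data": 1_000}
loaded = fully_loaded_costs(direct, shared_pool)
```

With these inputs, the $1,500 monthly pool is split 60/30/10, so no team sees a bill that ignores the prepayment or the shared fee.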
Cultural change works best with a flag bearer. A central FinOps function drives best practices into the organization through education, standardization, and evangelism. This centralized team is where you find the subject matter experts who make the changes to culture through advocacy and education. The FinOps team improves the available data via better tooling and modifies the business processes that enable your organization to do FinOps. You maximize the results from rate optimization efforts by centralizing them, which gives your teams on the edge the freedom to maximize the results from usage optimization. Remember, the most successful companies decentralize responsibility to use less, and centralize responsibility to pay less.
FinOps practitioners use performance benchmarking to provide context for how well their organization is performing. Cloud performance benchmarking gives a company objective evidence on how well it’s doing. Benchmarking lets teams know whether they’re spending the correct amount or whether they could be spending less, spending differently, or spending in a better way. Companies should use both internal benchmarks to determine how individual teams compare to each other in key areas such as optimization, and external benchmarks based on industry standards to compare the company as a whole to others like it.
In the decentralized world of the cloud, planning for capacity moves from a forward-looking “What are you going to need to cover demand?” perspective to a real-time “How can we stay within our budget given what we’re already using?” perspective. Instead of basing capacity purchases on possible future demand, base your rightsizing, volume discounts, and RI/SP/CUD purchases on your actual usage data. Since you can always purchase more capacity to fit demand, the emphasis shifts to making the most of the services and resources you’re currently using, and to using as few of them as possible for as long as possible.
As your cloud practice matures, aim to take advantage of cloud native services that scale with demand and models such as spot instances that can leverage low-cost resources when they are needed.
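Basing commitment purchases on actual usage can be illustrated with two common measures: coverage (how much of your usage a commitment would cover) and utilization (how much of the commitment you would actually use). This is a simplified sketch that assumes a flat commitment measured in instance-hours. The function names and usage numbers are hypothetical.

```python
def coverage(hourly_usage, committed):
    """Fraction of used instance-hours that a flat commitment would cover."""
    total = sum(hourly_usage)
    if total == 0:
        return 0.0
    covered = sum(min(used, committed) for used in hourly_usage)
    return covered / total


def utilization(hourly_usage, committed):
    """Fraction of committed instance-hours that would actually be used."""
    if committed == 0 or not hourly_usage:
        return 0.0
    covered = sum(min(used, committed) for used in hourly_usage)
    return covered / (committed * len(hourly_usage))


# Hypothetical instances running in each of six sampled hours.
usage = [4, 4, 6, 10, 8, 4]

baseline_coverage = coverage(usage, 4)        # commit only to the always-on floor
baseline_utilization = utilization(usage, 4)
larger_coverage = coverage(usage, 8)          # commit closer to the peak
larger_utilization = utilization(usage, 8)
```

Committing to the floor of 4 instances is fully utilized but covers only two-thirds of usage. Committing to 8 covers more usage but leaves part of the commitment idle, which is the trade-off a purchase decision has to weigh.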
Now that we’ve detailed the core principles, let’s explore how they’re implemented across three distinct phases: inform, optimize, and operate (see Figure 9-1). These phases aren’t linear—you should plan to cycle through them constantly:
The inform phase gives you the visibility for allocation and for creating shared accountability by showing teams what they’re spending and why. This phase enables individuals who can now see the impact of their actions on the bill.
The optimize phase empowers your teams to identify and measure efficiency optimizations, such as rightsizing, matching storage tiers to access frequency, or improving RI coverage. Goals are set based on the identified optimizations and align with each team’s area of focus.
The operate phase defines and implements processes that achieve the goals of technology, finance, and business. Automation can be deployed to enable these processes to be performed in a reliable and repeatable manner.
The lifecycle is inherently a loop and is never complete. The most successful companies take an approach of gradual improvement and get a little better each time they go through it, building muscle memory through repetition and practice.
In each phase of the lifecycle you will take actions and perform activities based on the state of your cloud spend and your FinOps practice. Chapter 7 covered the FinOps Foundation Framework, which describes these aspects of the FinOps operating model. We described this using the metaphor of a garden where you will pivot to the tasks that need doing each day—watering, mulching, and weeding—based on the state in which you find the garden at that time.
The inform phase is where you will look at the state of your FinOps garden. The optimize phase is where you will look at which capabilities offer actions you might perform to make your FinOps garden healthier. And the operate phase is where you will take those actions in your organization’s environment to ensure your garden flourishes.
Some of the capabilities lend themselves well to a certain phase of the lifecycle, and others might be used in more than one or all the phases.
Let’s review each phase and some of the actions you’ll take as you loop through the lifecycle. It’s important to note that you won’t perform all actions during every pass through the lifecycle. Chapter 1 covered the Prius Effect, which translates real-time feedback loops into data-driven decision making. Similarly, each pass through the lifecycle should be informed by the latest data you have to focus your efforts on the smallest set of most important activities needed at that time.
The inform phase is where you start to understand your costs and the drivers behind them. By giving teams visibility into their costs on a near-real-time basis, you drive better behavior. During the inform phase, you get visibility into cloud spend, drill down into granular cost allocation, and create shared accountability. Teams learn what they’re spending on what services, and why, by using various benchmarks and reporting. For the first time, individuals can see the impact of their actions on the bill.
Some of the activities in this phase include:
The FinOps team should provide the data needed for teams to generate forecasts of cloud usage for different projects and propose budgets for each. These budgets and forecasts should consider all aspects of a cloud architecture, including cloud native services, containers, and related costs. Managing teams to budgets lets you know when to lean in with optimization or spend remediation help. It also enables a conversation about why spending has changed.
Forecasting of spend should be done for each team, service, or workload based on fully loaded costs and properly allocated spending, with the ability to model changes to the forecast based on different inputs such as history and cost basis.
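As a rough illustration of managing a team to a budget, the sketch below projects month-end spend from month-to-date spend and compares it with the budget. The straight-line run-rate projection, the 10% tolerance, and the figures are assumptions. A real forecast would typically also draw on history and other inputs, as described above.

```python
def run_rate_forecast(month_to_date, days_elapsed, days_in_month):
    """Project month-end spend by extending the average daily spend so far."""
    if days_elapsed <= 0:
        raise ValueError("days_elapsed must be positive")
    return month_to_date / days_elapsed * days_in_month


def budget_variance(forecast, budget):
    """Forecast overrun (positive) or underrun (negative) as a fraction of budget."""
    return (forecast - budget) / budget


# Hypothetical team: $4,500 spent in the first 10 days of a 30-day month,
# against a $12,000 monthly budget.
forecast = run_rate_forecast(4_500, 10, 30)
variance = budget_variance(forecast, 12_000)

# Assumed policy: start a conversation once the forecast exceeds budget by 10%.
needs_attention = variance > 0.10
```

Here the team is on pace for $13,500, or 12.5% over budget, which is the kind of signal that prompts a conversation about why spending has changed.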
The optimize phase identifies measured improvements to your cloud and sets goals for the upcoming operate phase. Cost-avoidance and cost-optimization targets come into play during this phase, with cost avoidance being the first priority.
Processes are required to set and track the near-real-time business decisions that enable your organization to optimize its cloud. We’ll also look at the cloud service provider’s offerings that can help to reduce cloud costs. This phase includes the following activities:
Whereas the optimize phase sets the goals for improving, the operate phase sets up the processes for taking actions to achieve those goals. This phase also stresses continuous improvement of processes. Once automations are in place, management takes a step back to ensure spending levels are aligned with company goals. It’s a good time to discuss particular projects with other FinOps team members to determine whether the team should make some changes. Here are some of the activities that take place during the operate phase:
There are a few key considerations you should review in your FinOps practice when beginning your journey through the lifecycle. They align with the key ideas of FinOps: having a clear understanding of your spend, creating a company-wide movement, driving innovation, and, ultimately, helping the business reach its goals.
You will want to evaluate the following:
An important step is to tie cloud spend to actual business outcomes. If your business is growing and you’re scaling in the cloud, it’s not necessarily a bad thing that you’re spending more money. This is especially true if you know what the cost is to service a customer and you’re continuously driving it down. Tying spend metrics to business metrics is a key step in your FinOps journey.
Unit economics provide a clear, common lexicon so that all levels of the organization can discuss cloud spending in a meaningful way. Instead of management setting arbitrary spend goals, it can set targets that are tied to outcomes. The management advice becomes “Don’t worry about the total bill; just make sure you’re driving down the cost per ride” instead of the more restrictive “Spend less on cloud.”
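The "cost per ride" idea can be shown with a short calculation. In this sketch, the monthly figures are invented for illustration and `cost_per_unit` is a hypothetical helper. The point is that total spend and unit cost can move in opposite directions.

```python
def cost_per_unit(cloud_cost, units):
    """Cloud spend divided by a business metric, such as rides served."""
    if units <= 0:
        raise ValueError("units must be positive")
    return cloud_cost / units


# Hypothetical months: (label, total cloud cost in dollars, rides served).
months = [
    ("Jan", 100_000, 400_000),
    ("Feb", 120_000, 600_000),
    ("Mar", 135_000, 750_000),
]

unit_costs = {
    label: round(cost_per_unit(cost, rides), 4)
    for label, cost, rides in months
}

total_bill_rising = months[-1][1] > months[0][1]
unit_cost_falling = unit_costs["Mar"] < unit_costs["Jan"]
```

In this example the bill grows by 35% from January to March while the cost per ride drops from $0.25 to $0.18, so the higher bill reflects growth rather than waste.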
You start by asking questions that kick-start the inform phase. Think of the FinOps lifecycle as a closed-loop racetrack—you can jump in at any point, and you’ll eventually loop back around. However, we recommend you start at inform before you get into optimize or operate. Gain visibility into what’s happening in your cloud environment and do the hard—but important—work of cleaning up your allocation so that you know who is truly responsible for what before you start making changes.
And no matter where you are in the lifecycle, you should be continually focused on culture and governance. The true power of FinOps comes from combining the actions and tools with cultural shifts that change how your whole organization relates to using the cloud.
Whatever you do, don’t try to boil the ocean. A few years ago, we saw a major retailer try to go from 0% to 80% RI coverage in a single purchase. The company studied its infrastructure, consulted its engineering teams, checked its operating systems, and made a $2 million purchase. Managers high-fived each other on their awesomeness and then went back to work for the next few weeks. The next month the cloud bill was considerably higher, and the VP was furious. Upon review, the company found it had purchased the wrong RIs in the wrong operating system due to naiveté about how BYOL (“bring your own license”) models are applied. That same retailer is now at 80% coverage, but it took a multiyear effort to uplevel finance teams and business units who were gun-shy after the earlier disaster. Take your time. Like anything, building competence takes time but can be accelerated by learning from those who have gone before you.
Before you start telling teams to turn off this resource or downsize that one, you must get a true sense of what the cost drivers are and let the teams see the impact of their spending on the business.
This will drive some surprising, autonomous results. We learned about a great example of this via a Slack message from a team member, where a manufacturing company enabled six-figure-a-year savings simply by showing a team what they were spending (see Figure 9-2).
The best part of this story is that FinOps didn’t make any recommendations to the team. All they did was shine a light on the team’s cloud usage. The team took charge to make improvements based on their understanding of the infrastructure cost. This is why you push reduction of usage out to the teams responsible for spending.
Remember, mastering the FinOps lifecycle is an iterative process that can take years of education and process improvement in a large enterprise.
To summarize:
The FinOps lifecycle comprises three main phases that you continuously cycle through.
Improve with each iteration of the lifecycle—don’t try to do everything at once.
Involve all your cross-functional teams early and often so they can learn with you.
Constantly look for opportunities to refine your processes, but move quickly from phase to phase.
The most critical thing you can do is provide your teams with granular, real-time visibility into their spending.
Before you can do anything else, you need to fully load and allocate your costs, factoring in your custom rates, filling allocation gaps, distributing shared costs, remapping the spend to your organizational structure, and accounting for amortizations.
This may sound like a lot of work, but it’s actually easy to get started. Next up, we’ll work through the first phase of the lifecycle so you can start addressing the questions you need to answer.