Chapter 9. The Infrastructure Aspect

In this chapter, we take a look at the kinds of services to create at the infrastructure layer. We explore a variety of infrastructure-related concepts that are important within the universe of deconstructed design, including Infrastructure as Code (IaC), automated pipelines, feature toggling and machine learning, Chaos Engineering, and many more tools and methods.

Considerations for Architects

Sometimes, architects are viewed as part of only the application or product development team, and they limit their specifications to the software and services layer alone. Just as we saw that the effective architect’s purview also includes the business view, this individual must also contemplate the infrastructure, seeing all the aspects of business, application/services, data, and infrastructure working together.

As you consider how to design your infrastructure, the following are critical issues to address:

  • Definition of approach to infrastructure creation in support of your project, including containerization and IaC

  • Toolsets in support of these

  • Release engineering and management

  • Process definition for Continuous Delivery, Continuous Deployment, and Continuous Integration

  • Process definition for change control

  • Budgeting and financial management of the infrastructure

  • Capacity planning

  • Patching

  • Disaster recovery

  • Monitoring 

  • Logging and auditing

  • Roles and responsibilities definitions for DBAs, DevOps, architects, and application owners and/or system owners

These are all important considerations in the purview of the effective enterprise architect. They should be captured and addressed in your Design Definition Document. Although these aspects of your infrastructure are critical, they will be specific to your business and project needs.

If you are working in a cloud environment, many of these will change from your on-premises approach. For example, the saying goes that in the cloud, you treat your infrastructure as cattle, not pets. This refers to the cloud best practice of never actually patching servers: instead, you take them offline and replace them entirely with new, fully upgraded servers built by your automation tools.

Disaster recovery is another area that tends to change dramatically from on-premises to cloud. Historically, you needed two different datacenters, grudgingly negotiating vendor contracts for a separate disaster recovery site. That DR site tends not to have the same capacity, the same setup, or the same versions of the application and data. There is usually some lag, because disasters don’t happen every day and there is less urgency about keeping the two perfectly synchronized. There is also significant cost associated with something you hope never to actually use. If you design your services properly, to run statelessly on top of automatically replicating peer-to-peer data services, you can have a very resilient application running across multiple datacenters, even across multiple continents in an active–active configuration. This puts your services close to your customers and maximizes both your resilience and the cost/benefit.

Architects assist in the budgeting and financial planning aspect by using tools such as cloud provider cost calculators to estimate the infrastructure and the monthly rental costs. Make sure when you do this to specify different needs for development, testing, integration, User Acceptance Testing (UAT)/staging, and production as necessary. Defining your infrastructure across several environments like this can become expensive. This is one reason why automation through IaC is crucial: it allows you not only to scale up, but to scale back down. You can shut down entire environments when they aren’t needed, to save costs. If all you need to do is push a single button to kick off the automatic creation of your entire infrastructure and deployment, you’ll be more likely to manage this closely and carefully.

Capacity planning will also require significant changes in how you operate in the cloud. Instead of trying to guess up front, months in advance before you have any real traffic patterns or load to plan for, you can take advantage of autoscaling groups. These allow you to define rules such that when a trigger circumstance is met (for instance, when a server reaches 80% CPU and stays there for some time), you can have the cloud automatically provision another server and add it to the cluster behind the load balancer. Likewise, for cost management reasons, you’ll want to define rules that remove a server in the event that usage becomes very low.
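To make this concrete, here is a minimal sketch of creating such a scale-out rule with the AWS SDK for Java. The group name “web-asg” and the policy name are hypothetical, and in practice you would define the policy and its CloudWatch alarm in your IaC templates rather than in ad hoc code; this only illustrates the shape of the rule:

import com.amazonaws.services.autoscaling.AmazonAutoScaling;
import com.amazonaws.services.autoscaling.AmazonAutoScalingClientBuilder;
import com.amazonaws.services.autoscaling.model.PutScalingPolicyRequest;
import com.amazonaws.services.autoscaling.model.PutScalingPolicyResult;

public class ScaleOutPolicy {
    public static void main(String[] args) {
        AmazonAutoScaling autoScaling = AmazonAutoScalingClientBuilder.defaultClient();

        // Add one server to the (hypothetical) "web-asg" group whenever this policy fires.
        // A CloudWatch alarm on CPUUtilization at or above 80% for several periods would
        // invoke it; a mirror-image policy with a negative adjustment scales back in
        // when usage becomes very low.
        PutScalingPolicyRequest scaleOut = new PutScalingPolicyRequest()
                .withAutoScalingGroupName("web-asg")
                .withPolicyName("scale-out-on-high-cpu")
                .withAdjustmentType("ChangeInCapacity")
                .withScalingAdjustment(1)
                .withCooldown(300); // wait five minutes before scaling again

        PutScalingPolicyResult result = autoScaling.putScalingPolicy(scaleOut);
        System.out.println("Created scaling policy: " + result.getPolicyARN());
    }
}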

This all means that your infrastructure is more closely related to your business than ever, and potentially more closely coupled to your applications than ever. We have been used to two separate horizontal layers in the false dichotomy of infrastructure versus application. But we deconstruct that false binary opposition, and with the cloud and IaC, see the entirety of our servers, networking, and application all defined as versioned plain text and code in a single image, automated, and all working together in near real time.

However your relationship with the enterprise operations/run team is structured within your organization, make sure that you have clearly defined the aforementioned items. The one place no one likes surprises is in the infrastructure. Your goals should be clarity, predictability, transparency, and cost-aware resilience.

DevOps

Another story that we comfort ourselves with in software is patently false: that if we use this tool, this framework, this practice, we will “save time” by eliminating effort. The person who invented the ship also invented the shipwreck, which reminds us that every solution creates new problems; we are often not solving problems so much as trading them for others we (hopefully) would rather have. If we focus on the idea that we are “solving problems” and “saving time,” we will miss much of the picture. Similarly, we must let go of the idea that we are eliminating effort. Effort, like problems, is typically just moved, not eliminated. This presents one of the major difficulties in DevOps today.

DevOps attempts to conflate the two jobs of development and operations. It is encouraging that DevOps is a deconstruction of the traditional binary opposition between development and operations. But the responsibilities of the two jobs do not go away.

The stated aims of DevOps are, as you might expect, improved productivity, speed, scale, reliability, collaboration, and the other usual suspects that have been the aim of most initiatives in our industry in the past half century.

There are a variety of DevOps models, and we as an industry have been discussing and debating the practice, what it is, and how to go about it for more than a decade. For our purposes, let’s do a quick overview to make sure we have defined the term and highlighted some of the key principles that might make the most material difference to architects/designers:

  • In DevOps, the application developers and the operational folks are not siloed in a Plan/Build/Run–type model in which the builders throw completed code over the wall to the runners. Instead, they work together on the same team for a more integrated lifespan of the project, and share practices and duties. Development, infrastructure, and security are viewed as part of the holistic set of concerns everyone shares.

  • DevOps focuses on IaC as a practice, which requires that traditional infrastructure folks work more as developers, but with an infrastructure and operations mindset. They need to be not just more aware of developer practices such as Agile methods, code repositories, testing, software design, commenting, and so forth, but they need to be very skilled in these practices. 

  • It represents a philosophical shift in mindset wherein both roles are focused on developer productivity, resilience and reliability, automation, and security. Instead of serializing customer needs through product management, then development, then infrastructure, the DevOps engineer is less abstracted from the customer, working side by side with the application makers.

Although the way different organizations have tried to realize DevOps can vary, there are a few practices that seem consistent and important across applications:

  • Small, frequent updates as opposed to large, infrequent major pushes. This requires a CI pipeline, and a continuous (or at least frequent) delivery pipeline. Such pipelines allow you to be more responsive to your customers and improve reliability because changes are isolated to small batches instead of large and less predictable updates.

  • Service-oriented development. Aligning a single function with a single service that is independently deployable, scalable, and versionable, and in turn aligning that service with an Agile team on your org chart can also help productivity, accountability, speed to market, and reliability.

  • Other important practices include IaC, configuration management, and integrating monitoring and logging with the application development practices as we will discuss throughout this chapter.

These are the principles and ideas that will be most relevant to you as you consider your infrastructure angle further in light of your organization’s position and needs.

Infrastructure as Code

IaC allows you to describe declaratively in plain text the infrastructure that you want to create. Software systems read those declarations and spin up the infrastructure to match them. Instead of negotiating contracts, enlisting the procurement department, and spending capital dollars far in advance to provision your datacenter in a nonrepeatable and hard-to-visualize process, IaC allows you to define your entire datacenter in plain text using a configuration syntax. This gives you a blueprint of your datacenter. There are tremendous advantages here:

  • You can readily understand the comprehensive picture of your datacenter and all of the components that underpin your applications and services.

  • You can also repeat that datacenter to deploy across multiple cloud regions by simply changing the region or zone names.  

  • Additionally, infrastructure definitions can be shared and reused by other teams to give them a jump start on their projects.

  • Because they are plain text files, they can and should be stored in your code repository, which means your IaC definitions can be versioned. You can roll back entire datacenters to a Last Known Good state if something gets out of whack.

  • Your infrastructure environment becomes more testable. You can (and should) write a battery of tests for checking the health and compliance of your infrastructure.

  • You can define Governance as Code, checking that resources are properly provisioned, tagged, and compliant with guidance (a minimal sketch of such a check appears below).

For these reasons, IaC is an important element of deconstructive software system design. Anything in your business applications sphere that can be code should be code, so that it can be presented with an API and invoked through automated processes.
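As a small illustration of the Governance as Code idea referenced in the list above, here is a sketch that uses the AWS SDK for Java to flag EC2 instances missing a required tag. The “cost-center” tag is a hypothetical stand-in for your own tagging guidance, and in practice you would wrap a check like this in your test framework of choice and run it as part of your battery of infrastructure tests:

import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.Instance;
import com.amazonaws.services.ec2.model.Reservation;
import com.amazonaws.services.ec2.model.Tag;

public class TaggingGovernanceCheck {
    public static void main(String[] args) {
        AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();

        // Flag any instance missing the (hypothetical) required "cost-center" tag.
        // For brevity this reads only the first page of results.
        for (Reservation reservation : ec2.describeInstances().getReservations()) {
            for (Instance instance : reservation.getInstances()) {
                boolean tagged = instance.getTags().stream()
                        .map(Tag::getKey)
                        .anyMatch("cost-center"::equals);
                if (!tagged) {
                    System.out.println("OUT OF COMPLIANCE: " + instance.getInstanceId()
                            + " has no cost-center tag");
                }
            }
        }
    }
}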

Following are some of the popular tools for implementing IaC:

  • Provision local and remote systems with a tool like Vagrant. Vagrant is a free and open source tool created by HashiCorp that allows you to package a complete, portable environment in a single file called a “box.” You can then share this box, along with the Vagrantfile that configures it, across teams so that everyone has the same repeatable, working OS with all the same versions of all the same tools. This goes a long way toward combating the “It Works on My Machine” syndrome. You define your Vagrant virtual machines in a Vagrantfile, using Ruby syntax. You can also search for existing boxes to give you a jump start.

  • A popular Platform as a Service (PaaS) tool is Heroku, which assists you by provisioning and orchestrating containers (which it calls “dynos”), managing and monitoring their life cycle, and providing proper network configuration, HTTP routing, log aggregation, and more. Because it’s a full PaaS tool, the platform regularly performs audits and maintains PCI, HIPAA, ISO, and SOC compliance, taking a variety of necessary but often cumbersome tasks off your plate. With Heroku, you can add extensions for Kafka, Redis, Postgres, and more. Heroku supports Ruby, Java, Node.js, Scala, Clojure, Python, PHP, and Go.

  • Define, manage, and test automated systems with Chef or Puppet. These tools allow you to perform configuration management. Puppet is declarative: you declare resources and any dependencies between them, and Puppet works out the order in which to satisfy them. Chef, on the other hand, is more procedural, converging resources in the order in which they appear in the recipe.

  • Automate the creation of production infrastructure with Jenkins, Ansible, and Terraform. These help you to deploy on environments including Amazon Web Services (AWS), Google Cloud Platform (GCP), OpenStack, and DigitalOcean. Terraform, also by HashiCorp, lets you define and provision datacenter infrastructure using its own high-level configuration language called HashiCorp Configuration Language (HCL); you can also use JSON. With Terraform, you can configure your corporate GitHub account, dynamically create servers across multiple IaaS providers, register their names at another DNS provider, enable their monitoring from a third-party monitoring company, and specify that application logs be sent to an aggregator service.

Depending on your environment and needs, any of these in combination can be helpful to you. You can use these in conjunction with Docker and Kubernetes to create a more portable infrastructure foundation.

If you are using the AWS cloud, you would likely use AWS CloudFormation as the templating system, and something like Ansible or Jenkins to help you execute the scripts. AWS CloudFormation templates are written in YAML or JSON. You can use them to describe the Amazon EC2 servers, the autoscaling groups, the security groups, databases, network routing and DNS, edge services, and basically everything you can create in AWS.

The primary mental shift that you will need to negotiate with your enterprise operations teams is this: historically, the operations and infrastructure folks want nothing to change. Change of any kind is often viewed as nothing but an opportunity for failure and uncertainty that keeps people up at night and away from their families and rest. IaC asks you to embrace change, and provides a set of practices and accompanying tools that support this mental shift. Changes in an IaC world are viewed as an opportunity for improvement, rather than an obstacle or hardship.

The second challenge you’ll see organizationally is that people sometimes do not want to give up what they know or are reluctant to learn new ways of doing things. They can feel threatened or think that their jobs will go away or change and they’ll lose power and control. Don’t underestimate the force of this kind of resistance, and include the enterprise operations teams who might be running traditional datacenters, or (worse) bring a traditional datacenter operations mindset to the cloud.

Metrics First

In our rush to make deadlines, and in the absence of any demand to produce metrics numbers for a product that hasn’t been launched yet, we often begin designing and coding without consideration of metrics.

If you don’t create a few key metrics up front, you’ll not only miss out on showing how successful you’ve been, you’ll also have nothing to start reporting from when deadlines grow near and when budgets are almost used up and management begins asking questions.

Define the metrics for success of your overall project up front. Then, before actually recording any values, check with executives to see that these metrics, if you did the work to track against them and give them real values, would in fact tell them the story that they need to hear to determine whether you’re being successful in the ways that matter to them. This is a crucial point of difference for us as deconstructive designers. It’s like Test-Driven Development (TDD) in which you create the test from a client point of view, it fails because there’s no code to fulfill it, and then you fill in the code to make the test pass. You want to do this on an organizational/project level, and defining the metrics up front is like defining your own set of tests for the project.

If you define them at the end, you will be doing a “Texas Two Step”: right at the point when you’re all exhausted from the big push of delivering your project, you’ll have a second small project on your hands to figure out what the right metrics are, hope that you have things in place to procure them, add those tools and processes in when you inevitably don’t have them all, and then go through weeks, if not months, of remediating your product for performance or security, right at the worst time.

With respect to the broad infrastructure, you should consider the following success metrics:

  • Are there health checks on every service? To get a jump start on adding health checks to your services, you can check out the Netflix runtime health check library (a minimal hand-rolled health check endpoint is sketched after this list).

  • Do you have a battery of automated tests for the infrastructure itself to show that all the correct services are present and properly networked and connected?

  • Do you have regularly running Veracode scans to produce an Open Web Application Security Project (OWASP) secure coding practices report? This is especially useful throughout the project so that you are keeping the security tidy and manageable throughout. You don’t want to get to the end and discover that you have a long list of security bugs to work through before going live.

  • Do you have a mechanism in place to measure mean time between failures (MTBF) through your monitoring tool?

  • Do you have a mechanism in place for recording mean time to recovery (MTTR)? This is the more important metric going forward, but often not really measurable until you have an incident in production. You should, however, decide up front and agree on how you will measure this. Usually the Ops team will have a virtual room or a PagerDuty-type tool and process defined for capturing the duration of incidents.
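For illustration, here is a minimal hand-rolled health check endpoint using only the JDK’s built-in HTTP server. A real service would more likely rely on its framework’s health check support or the Netflix library mentioned above, and the dependency check shown here is just a placeholder:

import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class HealthCheckServer {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);

        // Monitoring tools and load balancers poll /health; anything other than 200
        // marks the instance unhealthy and (in an autoscaling group) gets it replaced.
        server.createContext("/health", exchange -> {
            boolean healthy = dependenciesAreReachable(); // e.g., database ping, queue depth
            byte[] body = (healthy ? "{\"status\":\"UP\"}" : "{\"status\":\"DOWN\"}")
                    .getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(healthy ? 200 : 503, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
    }

    private static boolean dependenciesAreReachable() {
        return true; // placeholder: check downstream dependencies here
    }
}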

At the application level, you want to set up certain metrics that will tell you how well your application is performing. Although these are not strictly infrastructure-related, their collection and measurement will probably need to be defined in collaboration with your Ops team. Here are some of the key infrastructure-oriented performance metrics to define, collect, and reflect:

Latency per service

This gives you a concrete measure of how long it takes to perform a task, whether that time is spent in network transit, queuing, or processing. Focusing sharply on the latency in your mission-critical services will be an important key to success. Being able to consistently measure, say, your shopping response times will help you find bottlenecks in performance and fine-tune your infrastructure or your code to improve them. It will also help with forecasting financial needs and scalability ceilings. Don’t forget offline batch jobs: create service-level agreements (SLAs) around them and measure how frequently they complete on time. (A brief sketch of recording such a timing appears after this list of metrics.)

Traffic

This is the measure of load and demand on your system so that you are clear on how much work each component is doing. Collecting traffic data will indicate whether you need to provision more supporting infrastructure, whether you can redesign a component to do more work in parallel, or whether asynchronous processing can be employed. As you measure your traffic, look to view it in patterns and trends. If those swing significantly, this might indicate where you can add or fine-tune autoscaling groups to scale up and down accordingly.

Availability

This is important and notoriously difficult to consistently measure. People seem to argue about it all the time, so it’s good to be clear on what you mean when you say “available.” For this reason, it’s common to see advice suggesting that you measure primary functions during business hours, all functions during business hours, and both of these in a 24x7 measure. Consider the nature of your application or product, and the impact that availability failures can have at different times. A financial reporting application, for instance, might be able to go offline for hours on the weekend with little or no user impact. Does your measure account for planned downtime or only unplanned downtime? Define it in whatever way makes sense for your business and your product, but make sure you’re consistent.

Incidents

Number of production incidents measured by severity (priority one, priority two, and so forth). I don’t see a lot of value in defining more than a few priority levels because they tend to just incite arguments and defensiveness and cause people to lose sight of customer focus.

There are other metrics that your organization might prefer. The point here is to be sure to define measurable metrics, start figuring out early on how to track them and report on them, and be sure that they are metrics that drive the behavior that you want to see.
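Picking up the latency metric mentioned above, here is one sketch of how a service might record its own timings, assuming the Micrometer metrics library; the service, the metric name, and the in-memory registry are hypothetical stand-ins, and in production you would wire the registry to whatever monitoring backend your Ops team runs so the timings feed dashboards and SLA reports:

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;
import java.util.concurrent.TimeUnit;

public class CheckoutService {
    private final Timer checkoutTimer;

    public CheckoutService(MeterRegistry registry) {
        // Hypothetical metric name; the SimpleMeterRegistry is only for illustration.
        this.checkoutTimer = registry.timer("shopping.checkout.latency");
    }

    public void checkout(Order order) {
        checkoutTimer.record(() -> processOrder(order)); // wraps the call and records elapsed time
    }

    private void processOrder(Order order) { /* business logic lives in an injected strategy */ }

    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();
        CheckoutService service = new CheckoutService(registry);
        service.checkout(new Order());
        System.out.println("mean checkout time (ms): "
                + registry.timer("shopping.checkout.latency").mean(TimeUnit.MILLISECONDS));
    }

    static class Order { }
}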

Compliance Map

Depending on the size of your organization and the role of your department, you might also consider having a compliance map. This would essentially be a list of applications in your purview and how well they comply with your next generation toolset. Create a spreadsheet with the list of applications and several columns to capture specific aspects of the application current state versus your target or future state toolset. Next, you can assign a score with a color code of red/yellow/green to indicate how far away each application is. Then, you can assign a business priority to each application. This would generate for you a score in a 2x2 type quadrant: applications with high strategic business value that are far out of compliance might be prioritized first.

You can then use this as a data view to discuss with your executives and product management to create a prioritized roadmap for application remediation.

Automated Pipelines Also First

Often, we go through projects and we add automation close to the end, when we’re almost done. We wait until we have a significant part of the work done and need to turn our attention to deploying to certification, staging, or production environments. This creates a second, hidden project.

Instead, we want to start with automation, even when we have nothing. We create the simplest “Hello World” project, and then immediately begin automating the build, the testing suite, and deployment on top of the IaC-defined environments. That’s how you get the most bang for the buck, because you can use your own automation throughout the development of the project. This adds efficiency and predictability overall. Moreover, when you do it in this order, you are less likely to start with application-specific needs (because the application is just a kind of empty shell at this point), and your automation pipelines can enjoy more reuse across the organization.

The Production Multiverse and Feature Toggling

We in software tell ourselves many comforting stories. One is that we have a reliable staging environment that is very much like production, and that if we test our code there, we should be good in production. The problem with this story is that it’s almost never true.

You must decouple unit tests from integration tests as well as performance tests and penetration tests. Think of these as separate matters that can be kicked off, or not, as your current situation demands. Penetration tests by definition occur in production. But the rest occur before you get to the production environment.

Try this thought experiment: imagine that you had no staging environment at all, and imagine then what you would need to do differently to perform responsible deployments. It is impossible to test completely.

One problem with our typical way of thinking is that we have an idea of a perfect piece of software in a perfect environment (whether that is staging or production) and these ideas are all monolithic. Even if your application is decomposed into microservices, the idea here is monolithic, unified, perfect, complete.

If you abandon the idea that there is a perfect application, a perfect environment, you can start to create compensatory actions as a native and integral part of your design. And these compensatory actions will not only make up for the fact that you are not relying so much on the false foundation of staging, but will create new benefits.

If we see our application as rhizomatic (as being made of decentralized root systems), that is a more honest and accurate view of the world that will benefit our software. Although that sounds abstract, consider this: your source code management system exists as a series of roots. They can be merged back to the trunk, and different people can be working simultaneously on different areas of the code. In a large development project, there is no single, unified field of the code base. The code base is a set of multiplicities. One key reason our software is less resilient and of lower quality than it could be is not, I wager, that we didn’t spend a million dollars on a staging environment that looks “exactly” like production. That is a fantasy that we must abandon. When your uptime availability is measured in ten-thousandths of a percent, a “pretty close to production” environment is not even in the ballpark. No, I think the reason is instead that we are happy with the idea of a multiplicity of code bases in development, and force ourselves into an inaccurate translation that there must now be a single, unified, monolithic idea of “The Production Code” right around staging time.

The idea that staging will save us does not serve us well enough. We cannot replicate the complete production environment precisely. You won’t have all of the exact same licenses, which can be prohibitively expensive. We certainly don’t have the same network setup, firewall rules, and routing tables. Is everything authorized to third-party APIs the same way, with the same throttling and service level? No. Are all the file paths identical and security groups identical and URLs identical? The data is not the same, the keys are not the same. Clinging to this idea hurts us.

If instead, we carried that multiplicity of development branches forward into production, what would that mean? What benefit would it give us? What would we need to do? What if we gave up on the idea of staging, and moved that matter into production? We would need to build those paths, that extensibility, that configurability, into our code base and subvert the idea of production in order to make it more resilient.

In what I hope by now is a more intuitive first thought in our deconstructive design, we look for the binary opposition, see which term is privileged, and overturn the hierarchy in order to determine how they are interrelated and interdependent and how they can inform each other to create a new space for an improvement. In this case, we would not privilege production over nonproduction by treating it as pristine, wholly distinct. Yes, of course, we must have it properly secured. Nothing here is saying to play fast and loose with what surely must be a hardened, resilient, secure production environment or to encourage sloppiness or entropy.

The point rather is to suggest that the code base itself, as deployed in production, might have many credible paths through it that can be turned on and off for different users, different countries, different percentages. I have heard it said that there are hundreds of “versions” of Expedia.com all running simultaneously in production. Consider production less as a single monolith and more as a choose your own adventure book, or a set of train tracks at a major rail station: the tracks can be switched to route trains (user requests) through to various points.

A good way of achieving this is through feature toggles or feature flags.

Implementing Feature Toggles

There are two primary use cases for feature toggles. One is that you have a new version of an algorithm that you want to try out on a subset of users. You might not be sure how it will perform or whether it will convert shoppers at a higher rate. So you want to introduce it slowly to a subset of your site visitors rather than rolling it out in speculation en masse and hoping it works; if it doesn’t, you are faced with rolling it all back and figuring out what to do. Feature toggles deconstruct this binary opposition of “all or nothing” and “totally on or totally off.” They allow you to see the world on a gradient and implement new features or algorithms accordingly.

The second primary use case for feature toggles is rather similar: you want to have two versions running at once in an A/B or multivariate testing scenario and gather data to learn which performs or converts better. This is common in ecommerce, where we’re likely to have a few different merchandising messages, colors, photo placements, and so forth. You might have different button labels with variations on the same message, such as “Buy Now” or “Add to Cart” or “Book it!” and you want to show these to different users of the same kind in order to measure which is more successful. If the “Buy Now” button shows a conversion rate 10% higher, you might want to settle on that wording and eventually let the other label candidates go at the conclusion of your test.

Let’s consider how we might implement feature toggles. In the most rudimentary way, you could comment out the old lines of code in favor of running the new code and then redeploy and switch back if it didn’t work out. This is not, however, what we’re talking about. Beyond the fact that commenting out code is a terrible practice, it does not achieve our aim of separating the idea of deployment from the idea of what is “released.”

A slightly more advanced way to do it would be to make the flags dynamic so that you can have both options of the functions/algorithms/whatever you’re toggling available, and then flip a Boolean in a configuration or runtime parameter to state which to run:

if (flagEnabled) { return excitingNewThing(); }
else { return standardThing(); }

You can get fancier with this, such that you have a function to determine which path this runtime request takes. You can even build a UI to make it easy to see all the flags you have and turn them on or off. The inadequacy here is that a Boolean just means on or off; you must pick between one of two states. But worse, your code will become littered with conditional logic all over the place, and the more flags you put in place, the more complex the resulting state machine becomes to reason about. In this situation, the chance of at least some users winding up in a bad state becomes much higher.

A more sophisticated way of doing this is to use a Strategy pattern. This is my preferred method. If the development teams know that when designing every microservice they must ensure that the service contains no actual business logic, but rather that all business logic is “injected” via Strategy pattern, you will be able to keep your code very clean, intuitive, readable, and manageable while still providing the ability to feature toggle. You can have one strategy with the exciting new algorithm and one strategy remaining for the old one. Then, you can create a toggle router component that sets the Toggle Context. This has a plain-text configuration to associate various strategy implementations with runtime attributes. For example, you might want to send 5% of requests as selected by the load balancer to the strategy A path, and the rest to strategy B. Or you might select a path based on country of request origin, geolocation, logged in users, loyalty members, random cookie settings, an HTTP header setting, or whatever suits your needs. Using the Strategy pattern should be standard in your microservice design, and for feature toggling, it allows you to avoid any conditionals littering your code.
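Here is a minimal sketch of that arrangement. The pricing strategies, the hard-coded 5 percent, and the country rule are all hypothetical stand-ins for whatever your plain-text toggle configuration would supply:

import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

// The business logic lives behind a strategy interface rather than in the service itself.
interface PricingStrategy {
    double priceFor(String sku);
}

class StandardPricing implements PricingStrategy {
    public double priceFor(String sku) { return 100.0; } // existing algorithm
}

class ExcitingNewPricing implements PricingStrategy {
    public double priceFor(String sku) { return 95.0; }  // new algorithm under trial
}

// The toggle router reads configuration (here hard-coded for brevity) and selects a
// strategy per request based on runtime attributes: percentage, country, loyalty tier, etc.
class ToggleRouter {
    private final PricingStrategy standard = new StandardPricing();
    private final PricingStrategy experimental = new ExcitingNewPricing();
    private final int experimentalPercent = 5; // would normally come from plain-text configuration

    PricingStrategy select(Map<String, String> requestAttributes) {
        if ("CA".equals(requestAttributes.get("country"))) {
            return experimental; // example of country-based routing
        }
        return ThreadLocalRandom.current().nextInt(100) < experimentalPercent
                ? experimental : standard;
    }
}

// The service itself contains no conditionals and no business logic; it just runs
// whatever strategy the router hands it.
class PricingService {
    private final ToggleRouter router = new ToggleRouter();

    double price(String sku, Map<String, String> requestAttributes) {
        return router.select(requestAttributes).priceFor(sku);
    }
}

public class FeatureToggleExample {
    public static void main(String[] args) {
        PricingService service = new PricingService();
        System.out.println("Price: " + service.price("SKU-123", Map.of("country", "US")));
    }
}

Notice that PricingService contains no conditionals: adding a third algorithm means adding another strategy and a routing rule, not editing the service.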

Strategy Pattern

We’ve discussed the venerable Gang of Four Strategy pattern earlier, but it’s always a great time to be reminded of this simple, powerful design technique. See the explanation, diagram, and examples at DoFactory. The examples are in C#, and I like to refer people to it because the explanation is very clear and it’s easy to translate.

You can find a good article on thinking about and designing feature toggles at Martin Fowler’s website. It is very long, so you are making a commitment, but if this idea is important to you, it’s a good read.

Finally, if you really love this idea of feature toggling and find yourself wanting to go all-out with it, you might also be interested in feature flags as a service, which you can check out at LaunchDarkly.

Putting the idea of feature toggling first, and assuming that you will have multiple versions of your application running in production at once, is an excellent way to learn what your users truly prefer, how they use your application, and what works best for your business.

Multi-Armed Bandits: Machine Learning and Infinite Toggles

An outstanding extension of this idea is the Netflix user interface. Instead of deciding which toggle path to choose, the more modern and advanced way of doing feature toggling is to do so much toggling on so many aspects that you end up with many thousands of simultaneous versions of your application, such that the entire idea of toggling sort of goes away and is sublated into the realm of machine learning. This level of personalization represents a key facet of deconstructed design.

Netflix uses machine learning to select not only the movies to recommend to you, but even the image thumbnails for those movies, based on your preferences. I highly recommend reading up on how the company does this on its engineering blog. Using a multi-armed bandit machine learning algorithm, Netflix selects the best image for you personally, based on items you have previously enjoyed. For example, if you have watched and liked several Matt Damon movies, the image Netflix selects for you when recommending Good Will Hunting would include a picture of him. If you’ve never watched another Matt Damon movie, but enjoy lots of comedies, it might instead select an image from that movie featuring Robin Williams.

The name “multi-armed bandit” (MAB) is derived from the image of slot machines, colloquially called “one-armed bandits” because slot machines “steal” your money. You pull the arm to place a bet. If you find a machine that you think is “hot,” or that is paying off well, you might tend to stick with it and continue pulling the arm of the same machine. However, in the row of slot machines before you, others might pay off better. You’ll never get the optimal payout unless you fashion a combination of continuing with machines that you know work and occasionally trying other machines that might work better. These two axes on which a MAB operates are known as “exploit” and “explore”: you continue to execute what you know works (exploit) in an optimal balance with exploring other options that could work better. The machine learning algorithm converges when, after many executions, it learns this optimal balance. This is how a basic recommender engine works, suggesting that the people who bought the sleeping bag also bought the flashlight, and then occasionally recommending something that might have less chance of hitting but which would represent a higher revenue point and profit margin, like recommending a tent, too. What your MAB should be optimizing here is not the number of conversions but the total revenue or total profit.
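To make the explore/exploit balance concrete, here is a minimal epsilon-greedy bandit sketch. Real systems, including Netflix’s, use more sophisticated formulations such as Thompson sampling or contextual bandits, and as noted the reward you record should be revenue or profit rather than a simple conversion count:

import java.util.Random;

// Minimal epsilon-greedy multi-armed bandit: mostly exploit the best-known arm,
// occasionally explore another one at random.
public class EpsilonGreedyBandit {
    private final double epsilon;        // fraction of pulls spent exploring
    private final double[] estimates;    // running average reward per arm
    private final int[] pulls;
    private final Random random = new Random();

    public EpsilonGreedyBandit(int arms, double epsilon) {
        this.epsilon = epsilon;
        this.estimates = new double[arms];
        this.pulls = new int[arms];
    }

    public int selectArm() {
        if (random.nextDouble() < epsilon) {
            return random.nextInt(estimates.length);        // explore
        }
        int best = 0;
        for (int arm = 1; arm < estimates.length; arm++) {  // exploit the best estimate so far
            if (estimates[arm] > estimates[best]) best = arm;
        }
        return best;
    }

    // Reward could be the revenue or profit from the option shown, not just a click.
    public void recordReward(int arm, double reward) {
        pulls[arm]++;
        estimates[arm] += (reward - estimates[arm]) / pulls[arm]; // incremental mean
    }
}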

Your data scientists should be able to pull together a good multi-armed bandit in short order. If you don’t have a strong data science team or want to test it yourself quickly, Jason Liu has put his multi-armed bandit library for Java on GitHub. That’s an easy way to get started.

As you can see, it’s difficult at this point to say that there is one “Netflix website.” It hardly makes sense to refer to “the Netflix website” as if there is just one and it’s always the same for everyone. The same is true for Google, where you see personalized results based on your patterns, including results that no one but you sees.

In your design work, ask yourself how you can unravel the idea of the single monolithic unified application in ways that make sense for your users and workload. What would make things quicker and easier for them? Have you designed a single monolithic workflow as the One Grand Narrative to rule them all? Or have you considered that you have both novices and power users, and thought about how you can distinguish between the two and in real time modify the workflow steps or the additional controls you reveal to them? This is a silent, seamless, wonderful way to make the easy things easy and the hard things possible.

How can you introduce paths for the production multiverse?

Infrastructure Design and Documentation Checklist

In your lookbook or Design Document, you will want to be clear and directive with teams regarding the infrastructure decisions you have made. The following should all be things that you outline and take a clear, declarative stance on in your architecture:

  • Statement of what infrastructure provider you are using. Will this be on-premises, in cloud (if so, which one), or a hybrid?

  • Operating system. This should include whether you want to use a cloud vendor’s version of the software; that often has the advantage of being regularly patched and updated as a service, relieving your teams of that responsibility.

  • If you are in a public cloud, you must explicitly state which region you will deploy to. Base your decision on where your customers are, latency between that cloud region and any home runs the systems there will need to make back to your datacenters, and the tools available in each. Not all regions have the same capabilities even within the same cloud vendor, so be sure to check.

  • How many datacenters (“Availability Zones” in AWS) will you deploy to within that region? 

  • Will you be using an edge cache? Through which vendor?

  • Are there particular infrastructure requirements for your application’s design? For example, you might choose to forego web servers altogether, and instead deploy your static assets such as JavaScript, CSS, and images to a storage service such as Amazon S3 and have them served from your edge cache. 

  • Define how you will handle Security Groups and Access Control Lists (ACLs). Which services will be in which security groups and how will you balance the complexity challenge of maintainability when they each are accessible only via their own load balancer? What connections will you require between datacenters? Will you use a Direct Connect? Do you require use of a bastion or jump server to control access to environments?

  • Define how you will handle key management.

  • How do you anticipate scaling? What will be the next one or two regions you expect to deploy to?

  • How will you handle disaster recovery (DR)? Or you might choose not to have DR, but rather what I refer to as “built-in DR,” where you run active–active in three or more datacenters and merge your DR investment with your active runtime investment. This, of course, must be designed into the application.

  • To support your related infrastructure practices such as IaC, you should specify the design of your pipelines. Also specify some of the seemingly small matters that can end up making a big difference, such as disallowing anyone from using the cloud provider’s UI console to make changes. Instead, mandate that any and all infrastructure changes occur only through the IaC automated process.

  • To help control costs, you should specify how you will do resource tagging. If you’re an AWS user, be sure to read its tagging guide.

  • Of course, you must specify the typical matters such as load balancers, DNS names and relevant IP lists, firewalls, routing, reverse proxies, and the infrastructure communications setup including what type of servers, from what vendor, with what power and capacity, and how requests will route through them. What protocols will you allow and disallow?

  • What monitoring must be in place? What alerts and triggers do you have? 

  • How will you perform autoscaling? What are the thresholds defined for those rules?

  • Will you employ service or server virtualization? Gateways? How will you throttle traffic from the internet? Do you need to have tiers of API service (in which case, you must be able to identify traffic properly)?

  • List the environments you expect to have. Is this Production, Testing, and Development only? Or will you also have Integration, Demo, Staging, UAT, Certification, and Load Test, or do these overlap in some way? Be very clear on this because it becomes the specification for the IaC people to build, and it has significant implications for cost and manageability. Be sure you’re clear on who will use them, when, how, and for what purpose. Put this in a chart. It seems obvious, but it will require at least two meetings to sort out, and then another one later to tighten the belt when either it’s not being followed according to the specification or when finance comes to find you.

Make sure that you do the calculations and projections for costing as you make these choices. If you end up with an incredibly resilient architecture that costs a million dollars a month to run, you might be asked to revise your plan. Make sure you are working with finance and product closely as you make these decisions. They are, after all, business decisions.

Chaos

As deconstructionist designers, we recall that an important part of our work in the production of concepts is to identify the values, the arguments, the principles, and the apparent superstructure in order to discover their opposites: we find the binary oppositions that adhere in the set of concepts we are working with. When we find a pair of binary oppositions, we can identify which term in the pair is privileged, and which is marginalized, secondary, or ancillary. Through analysis, we will discover how the privileged term actually relies on that marginalized term, how they are intertwined. Such analysis allows us to subvert that privileging, which is desirable because it will help us discover a more innovative, better design. It will be better because it will be a more accurate and less myopic view of the world, so our concept will be cleaner, richer, and reflect a truer state of affairs. This will improve our design, which is nothing but a transcription in code of our concept.

A very common binary opposition in our world is that of development versus production. Development is hopefully not the Wild West, but we expect it to be a bit messy and not at all presentable to guests. We expect it to be very dynamic, and that we must break stuff in development almost by definition because we are in the process of making the thing in the first place.

Production, on the other hand, should be frozen, the opposite of dynamic: pristine and perfect, never to be touched, something to be tiptoed past silently, stepping delicately, not looking at it or speaking above a whisper. Production is clearly the privileged term in this binary opposition.

Chaos Engineering is a term coined by Netflix around the year 2010. This is a wonderful, innovative practice that makes perfect sense for us as deconstructive designers: it has engineers invert the sacrosanct idea that production should never go down. Instead of thinking of production as the place you hope never breaks, and which you do everything to prevent breaking, you break production on purpose in order to make your application more resilient. It’s beautiful. And if you actually do it, very effective.

The tool that Netflix made for this and eventually open sourced is called Chaos Monkey. You can think of Chaos Monkey as Failure as a Service. It does a few basic things to create problems for your application services. By creating these common problems and then observing how your application responds, you can design and plan changes to your application to improve its behavior under these adverse conditions. In this way, you’re creating a terrific feedback loop. Perhaps by way of analogy, it’s a little bit like vaccinations: you infect your application with a little bit of real-world trouble in order to build up terrific defenses against it ruining things when it occurs in the wild.

It tends to work along a few lines:

Resource

Starve your service of resources it needs to operate properly. These can include CPU, memory, or disk. In the real world, common problems like these are caused by runaway threads, stalled processes, and log files filling up due to improper configuration and (on Linux) too many open file descriptors.

State

Change the state of your service’s underlying environment. This can mean shutting down the operating system of one of the servers in a cluster, rebooting a machine, or changing the network time. It might mean removing a dependency.

Network

Create simulated network stability problems. You can kill a specific process or flood the network.

Request

Randomly create problems for specific requests.

Its popularity and usefulness within Netflix caused the company to spawn an entire “Simian Army,” including Chaos Gorilla, which destroys an entire datacenter, and Chaos Kong, which destroys an entire region. Other monkeys are more “helpful,” such as Janitor Monkey, which scans for leftovers and unused resources and cleans up, and Conformity Monkey, which runs at regular intervals and checks that all your resources conform with predefined rules, such as being tagged properly, creating a simple form of Governance as Code.

A great place to begin is to read the Principles of Chaos. Then, you can download Chaos Monkey to run locally and read the documentation for how to install and use it there.

You can also try Chaos as a Service using Gremlin instead of trying to set it up and run it yourself.

Stakeholder Diversity and Inside/Out

We often talk about the customer as the person outside our company who is buying our product. In this binary opposition of inside/outside in which the external stakeholder is privileged, we might actually be doing them a disservice by not considering the many different customers we have.

If you do a thought experiment and imagine that your internal colleagues are your customers, too, that everyone’s back end is someone else’s front end, that the development environment is the production environment to a developer, you might change a few practices that can help your external paying customer.

Who really are all the users of the system? At different stages, there are many of them.

Developers are the first users of the system. There are a few things you can do to set the table for them that will pay off richly in happier developers who aren’t dealing with the same low-level and uninteresting headaches every day:

  • Invest in automation. This includes deployment, testing, and provisioning. There is a step that needs to be inserted in the software development life cycle before coding starts to make sure that the table is set for them. In a sense, you’re building production first, but with the developer as the customer.

  • Do everything you can as an architect and a leader of influence to take process bureaucracies away from them. The more time they spend filling out tickets just to get access to the environment that they work in every day, the grumpier and more distracted from the important work they will be.

  • The developer is a user of the system after it’s deployed into the hands of paying customers. Ensuring that they have commented thoughtfully and made the code readable and properly named and properly segmented will help to make their work more efficient when they are going back to fix bugs and make maintenance updates.

The Network Operations Center team and the Bunch of Poor People on the Phone at 3 A.M. on a Crit Sit Call are also users of your system. Make sure to put the following in place to take care of this customer:

  • There must be proper monitoring for them to gain transparency and clarity.

  • Log messages properly and actually design the logging subsystem and naming conventions thoughtfully. Consider how messages will be written in order to be quickly looked up, indexed by a system like Splunk, and consider how they will be aggregated.

  • Can you build components as managed components (think of something like Managed Beans in Java as part of Java Management Extensions or JMX)? Wrap or decorate your services as managed components for vendor-agnostic viewing, monitoring, and even updating at runtime. The Apache Cassandra database does this, and it essentially turns the software system inside out, making all the runtime components available in this way. It’s a fantastic feature of the system and allows vendors to build monitoring and manipulation control panels on top of it very easily and plug in existing ones.

Make Managed Components

Even if you’re not using Java, that doesn’t matter; the point is to apply this idea of the managed component in whatever language you use. See how Apache Cassandra does it in the source code on GitHub with the CommitLog and CommitLogMBean. You don’t need to worry about how Cassandra works or what a commit log is; this is just an accessible example.
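As a minimal sketch of the idea in Java itself, here is a hypothetical OrderCache component exposed through JMX. The naming convention (an interface named for the class plus “MBean”) is what makes it a standard MBean:

import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Standard MBean convention: the interface name is the class name plus "MBean".
interface OrderCacheMBean {
    int getSize();
    void clear();
}

class OrderCache implements OrderCacheMBean {
    private final java.util.Map<String, Object> cache = new java.util.concurrent.ConcurrentHashMap<>();

    public int getSize() { return cache.size(); }
    public void clear() { cache.clear(); }
}

public class ManagedComponentExample {
    public static void main(String[] args) throws Exception {
        OrderCache cache = new OrderCache();

        // Registering the component makes its attributes and operations visible to any
        // JMX console (jconsole, VisualVM) or monitoring vendor at runtime.
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        server.registerMBean(cache, new ObjectName("com.example:type=OrderCache"));

        System.out.println("OrderCache registered; attach a JMX console to inspect it.");
        Thread.sleep(Long.MAX_VALUE); // keep the JVM alive so the MBean stays visible
    }
}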

Testers and auditors are also users of the system. Consider their needs along these lines. The obvious point is that if you think only of the “user” who is sitting in front of the UI clicking, your long-term product will suffer. Everything you do in the design to support this more diverse customer set will pay off.

Summary

In this chapter, we reviewed a variety of modern practices and methodologies that you can employ to make your infrastructure more scalable, resilient, predictable, and manageable.

There can be entire books written on the subject of infrastructure architecture that go into more detail about specific fine points of the areas we have touched on here. We have focused primarily on considerations for infrastructure with respect to building software products or applications, because there is no point in infrastructure in its own right; it exists solely for the purpose of providing a platform for some kind of application.

In Chapter 10, we turn our attention to broader development methods, operations, and change-management processes.
