Chapter 2. High-Level Code and Data Design

In this chapter we review the primary structures of a Puppet codebase, and discuss how the structure of code and data will affect the cost of module maintenance, code reuse, debugging, and scaling.

Tip

A codebase is the complete body of source code for a software program or application. For our purposes, it is the total body of code used to build Puppet catalogs.

This chapter covers best-practice use of core features of Puppet, including the following:

  • Node facts

  • Hiera data

  • Puppet modules

We introduce broad categories for code and data used with Puppet, creating a common vocabulary for use in later chapters. This shared basis makes it easier for us to discuss appropriate ways of handling the concerns of each topic.

As with Chapter 1, this chapter focuses primarily on “why” rather than “how.”

Code and Data Organization

The organization of your Puppet codebase is critical for all of the following reasons:

  • Organization-based structures for data allow multiple teams to share responsibility for Puppet data.

  • Separation of code and data promotes code reuse as you bring up new applications with similar needs.

  • Puppet modules provide independently testable implementations that are simpler to debug and improve.

  • Independent and dynamic environments improve testing opportunities and minimize the disruption caused by local environmental changes.

  • Versioning and release management allow highly customized controls for stability and speed of change in each environment.

All of these work together to reduce the impact of changes driven by ever-evolving business and operational requirements.

Code and Data Categories

When discussing Puppet implementations, it helps to identify the types of code and data seen in the infrastructure. Naming these categories enables discussions about the organization of your code and how it maps to the various features and design patterns of Puppet. We have found the following categories useful for that discussion:

  • Application logic

  • Business logic

  • Site-specific data

  • Service-specific data

  • Node-specific data

Why is it important to categorize information in this manner? These categories loosely map to a number of Puppet features and design patterns. The following outlines a common data source pattern used by many organizations:

  • The logic to configure and deploy a single application fits within a Puppet module (code).

  • You can find business logic that informs app configuration in the following:

    • Static information stored in Hiera

    • Role and profile definitions declared in Puppet code

  • Site-specific data is often static, and usually stored in Hiera

  • Service-specific data is usually:

    • Static information stored in Hiera

    • Stateful information pulled from PuppetDB or another service discovery system

  • Node-specific data is usually:

    • Facts submitted by the node

    • Static information stored in Hiera

    • Infrastructure data provided by a node classifier

As you can see from this structure, each type of data has different sources and can have diverse needs for accuracy and freshness. Although these categories have different sources and needs, a good structure for the code and data will allow a single Puppet module to provide the complete logic to manage a single application, service, or subsystem, regardless of the diversity of data sources we just outlined.

Types of Code Logic

Taking time to evaluate which type of logic is being implemented might seem strange, but this provides essential knowledge useful to untangle the intertwined layers of an implementation. Take the time now, or you’ll spend the time plus significant interest later.

Application logic

Application logic is the logic to manage a single application or component and optionally its dependencies. A MySQL database, the Postfix mail server, and the Chocolatey Windows package manager are applications that a Puppet module can configure by using application logic.

Application logic can contain logic and data to handle platform-specific implementation details. For example, an Apache module should contain platform-specific logic to ensure the following:

  • The correct package is installed (e.g., httpd on Red Hat and apache2 on Debian).

  • The configuration file is installed to the platform-appropriate location.

  • The configuration contains platform-specific configurations.

  • The correct service name is used when managing the Apache service.

The Apache module contains a lot of data about Apache: package names, configuration details, file locations, default port numbers, docroot, and file ownership information.
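Platform-specific selection of this kind is often expressed with a simple case statement on node facts. Here is a minimal sketch (variable names are illustrative; a real module such as puppetlabs/apache handles many more details):

```puppet
# Select platform-specific names based on the os.family fact.
case $facts['os']['family'] {
  'RedHat': {
    $package_name = 'httpd'
    $service_name = 'httpd'
    $config_file  = '/etc/httpd/conf/httpd.conf'
  }
  'Debian': {
    $package_name = 'apache2'
    $service_name = 'apache2'
    $config_file  = '/etc/apache2/apache2.conf'
  }
  default: {
    fail("Unsupported platform: ${facts['os']['family']}")
  }
}
```

In Hiera v5 and later, the same mapping is more commonly expressed as module data, as discussed later in this chapter.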

In some cases, the application might include some default data that can be overridden with site- or service-specific data. For example, the NTP module includes a set of publicly available Network Time Protocol (NTP) servers for use if specific servers are not configured.

Site-specific data

Tip

We use the term site to refer to a distinct grouping that has some shared configuration. In most situations, this maps cleanly to a physical or logical location such as an office or datacenter. In cloud environments, it depends on how shared resources are managed: the site could be unique at the VPC (Amazon Web Services) or Virtual Network (Microsoft Azure) level, or shared at the Account or Subscription level.

Site-specific data is the data that is unique to your site, and isn’t fundamental to any of your applications. Following are some examples of site-specific data:

  • The package repositories used by nodes for installation and updates

  • The authoritative time sources used by nodes for time synchronization

  • The passwords used by applications to authenticate to database servers

Node data

Node-specific data is a super-localized instance of site-specific data. Node-specific data identifies properties specific to a single node, such as the following:

  • Location

  • Node name

  • IP addresses

  • Tier or group

Although this data can be maintained in static form with your site-specific data (such as Hiera), it’s more commonly sourced from external data sources used to manage node inventory. In these days of cloud computing, nodes are constantly brought up, shut down, modified, and reassigned. Nodes are often maintained by a different team (or company!) than the one that manages the services.

Tip

In Puppet terms, a node is a unique instance configured by a Puppet agent. It’s not uncommon for a virtualization engine to be a node managed by Puppet and for each virtualized node provided by that engine to also run Puppet.

In the early days of Puppet, nodes were managed by using node statements within a site-wide Puppet manifest. Node classification was a mix of code and data that attempted to group configurations using node inheritance. This mixture of code and data led to inconsistent, often surprising results.

These days, node definition is entirely data, often retrieved by lookup from an external node management service (e.g., Foreman or the cloud provider’s API).
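An ENC is simply an executable that accepts a node name and prints YAML describing that node. A minimal example of the documented output format (the role class name and parameter values are illustrative):

```yaml
# YAML returned by an ENC for node web01.example.com
---
environment: production
classes:
  - role::webserver        # the role assigned to this node
parameters:
  site: dc1                # node-specific data, exposed as top-scope variables
  tier: frontend
```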

Service data

Service data provides configuration details for a specific service as part of a specific technology stack. This kind of data is prevalent with horizontally scaled, multitier applications.

For example, the classic three-tier technology stack consisting of frontend web servers, mid-tier application servers, and backend databases would typically have two pools of service data that would need to be maintained to configure every host in the stack:

  • Service data for the Tomcat servers would be consumed by the frontend web servers to load balance requests to the application servers.

  • Service data for the database instances would be consumed by the application servers to balance read requests across a pool or identify the write master.

Service data can be defined as static site-specific data; however, this manual approach does not facilitate automatic scaling or failover. Service discovery from PuppetDB or other APIs enables dynamic Puppet configuration that adjusts to instance availability and automatic node provisioning by autoscaling groups.
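The load-balancer case above is commonly implemented with exported resources, which require PuppetDB. The following is a minimal sketch using a defined type modeled on the puppetlabs/haproxy module; names and ports are illustrative:

```puppet
# On each application server: export a balancer member describing this node.
@@haproxy::balancermember { $facts['networking']['fqdn']:
  listening_service => 'tomcat_pool',
  ports             => '8080',
}

# On the load balancer: collect every exported member into the service pool.
Haproxy::Balancermember <<| listening_service == 'tomcat_pool' |>>
```

When a node is provisioned or retired, its next Puppet run updates PuppetDB, and the load balancer's pool adjusts on its following run with no data file edits.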

We discuss service discovery data access in “Service Discovery Backends”.

Business logic

Business logic provides configuration data at a higher level than applications or instances. Logic at this level isn’t concerned with package names or the platform-specific implementation details of services. For example, it would define a web server instance without specifying whether to use Apache or NGINX. The implementation details of each component are abstracted away as application code.

Business data commonly contains organization-wide requirements, such as compliance requirements for password expirations or log retention.

Examples of Logic Types

The following presents an example breakdown of the data types in action:

  • The application logic for Apache contains resources to install the Apache httpd package, manage the httpd.conf configuration file, and ensure that the httpd service is running.

  • The service logic for a web service hosted by Apache might define an Apache virtual host to provide service for the application.

  • The site-specific logic might identify shared resources for the application, such as backend databases.

  • The load balancer might utilize node data to configure nodes in the service pool.

  • The business logic might describe instance names, user policies, and operational constraints.

Because a server can host multiple applications, services, and their dependencies, there can be a bit of overlap in possible places for data to be relevant. Don’t stress too much about nailing each one down, but do spend some time thinking about the appropriate category for each type here. You’ll gain tremendous insight into how data is used in your organization.

Mapping Data Types to Puppet Use Cases

Let’s take a look at how these concerns map to the features and common design patterns of Puppet.

Application Logic and Puppet Modules

For high-level design purposes, when we discuss module design, we are almost always looking at modules from the perspective of service or application management using the Puppet DSL. The Puppet documentation describes modules as “self-contained bundles of code and data.” Modules serve many purposes in Puppet, and it’s difficult to provide a more specific description while remaining concise and accurate. With this scope in mind, well-designed Puppet modules configure applications or services utilizing business and site-specific data.

Modules are most effective when they serve a single purpose, limit dependencies, and concern themselves only with state related to their named purpose. A well-designed module will usually manage a single service and accept enough input parameters to be used in multiple ways. Modules should neither contain nor declare their dependencies: doing so invariably embeds business logic in the module. The prerequisites and dependencies of the service can be managed by other modules, and multiple modules can be used together to create technology stacks.

Example component module

The NTP module is fairly simple as far as Puppet modules are concerned. It contains few resources, and a lot of data. However, all of this data is specific to the NTP module. You can easily use this module by applying it with default parameters or override the default parameters with your own site-specific data.

Although this module is concerned with the NTP service, the list of authoritative NTP sources tends to remain static. Service configuration data can be provided as site-specific data stored in Hiera.
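For example, with the puppetlabs/ntp module a node can simply `include ntp` and pick up the bundled defaults, while site-specific servers are supplied from Hiera data (hostnames illustrative):

```yaml
# data/common.yaml -- site-specific override of the module's default servers
ntp::servers:
  - time1.example.com
  - time2.example.com
```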

From a business logic perspective, this module would be part of your baseline system profile, and would most likely be applied to every node in your infrastructure.

Identifying business logic in Puppet modules

There’s no absolute guide for identifying the business logic antipattern, but it’s not terribly difficult to spot. As a general rule, the code implements business logic rather than application logic when the following conditions are met:

  • The module conflicts with another module that does not overlap in functional intent.

  • The module includes a dependency outside the explicit scope or concern of this module.

  • The module implements a subsystem that could be a standalone application or service.

  • The module implements a feature that has been implemented by a module focused on that feature.

When a module’s code meets one or more of the aforementioned criteria, consider whether you could split the module into smaller, more focused modules that each implement one feature.

Identifying site-specific data in Puppet modules

Modules tend to contain a lot of data: package and filenames, network ports, and application-specific default values. The ideal module contains no data specific to your organization, service, or site, but does contain the minimum necessary data to bring this application or service up in a generic way with appropriate values for supported platforms.

An example of application-specific data is the name of a package that should be deployed and the format of the configuration file. This is specific to the application and necessary to configure the application.

An example of site-specific data is a URL to download a file from an internal server. This is clearly specific to a given site’s implementation.

There are several important reasons to keep site-specific data out of your module, even if your module is completely proprietary and would never be released to the public:

Embedding data in the module creates module interdependencies
When a module contains data about your site, it’s tempting to de-duplicate the data by referencing it from other modules. Using the module as a data source creates explicit interdependencies and violates the principles of interface-driven design. Refactoring this module will be impossible without breaking other modules.
Data changes—constantly
This is by far the greatest issue created by embedding site-specific data in modules. If the site-specific data is spread across multiple modules, simple changes to your site’s configuration can demand a massive effort to refactor all interdependent modules.
Data stored inside a module isn’t easily accessible
The site-specific data in a module will be placed for the convenience of the module author. This will rarely be formatted in a consistent way and might be spread across a large set of files. Making use of the data can require manipulation within another module to get the necessary values, causing the two modules to be tightly interwoven and thus fragile.
Modules are rarely stored in the same source repository as data
Data stored in modules might not have the appropriate permissions for the resource manager to keep it up to date. Worse yet, the data’s existence within the module might not be documented, forcing a person to manually track down data that is used in multiple locations.

All of these problems can be avoided by keeping site-specific and business data out of Puppet component modules. By adhering to this pattern, you encourage more flexible module design and module reuse within diverse profiles, and you centralize site-specific data within Hiera.

Business Logic Should Not Be Written into Component Modules

A major feature of Puppet is module reuse. Component modules provide a way to model applications and services in a portable and reusable way. Technology stacks can be created by combining component modules together in interesting ways using the roles and profiles pattern.

For example, an instance of WordPress would need a technology stack containing the following components:

  • A database server (MySQL)

  • A web server (such as Apache)

  • PHP

  • WordPress

Even though a monolithic Puppet module could configure WordPress and all of its dependencies, such a module would be quite complex and inflexible. It would likely have conflicts with other modules used on the same node.

Less is more

A module that concerns itself only with the deployment of WordPress and relies on other modules to provide the dependencies would be much more flexible. For example, the WordPress module could depend on the following:

  • The puppetlabs/apache module to manage the web server

  • The puppetlabs/mysql module to manage the database

  • The puppet/php module to manage PHP

Using this approach allows us to rely on the high-quality Puppet-supported modules for Apache and MySQL, along with a Puppet-approved module for PHP. Using the community-supported modules takes advantage of their shared experience, improvements, and bug fixes.

Small components are flexible building blocks

The modular approach facilitates code reuse. By writing a module that focuses on WordPress, the consumer of your module (different team, different site, different service within your own group) could easily swap in a different web server or database server.

Distinct components avoid conflicts

If two modules assigned to a node both depend on Apache, which of those modules should be the one to configure it? If one of the modules expects the other module to configure Apache, both modules are required. Refactoring one module risks breaking the other.

Using shared, flexible modules allows you to avoid this design problem. A node could apply multiple modules that depend on Apache without conflict.

Small components are easily testable

A self-contained module makes it simpler to test each system in isolation and helps reduce the amount of code that would need to be reviewed to identify and isolate bugs.

Business Logic with Roles and Profiles

If you should keep modules small and focused on specific applications, how should you configure a complete technology stack? You can do this cleanly and safely by abstracting the application stack into roles and profiles.

Roles describe business logic and site-specific configurations

A role contains responsibility for implementing a site-specific configuration. To this end, a role is often no more than a list of profiles to be applied to a node.
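In code, a role can be as simple as this sketch (profile names are illustrative):

```puppet
# A role is little more than a named list of profiles.
class role::wordpress {
  include profile::base       # baseline applied to every node (NTP, users, etc.)
  include profile::wordpress  # the technology stack for this role
}
```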

Profiles implement technology stacks

The profile simply declares the modules needed to build a given technology stack, their ordering dependencies, and any profile-specific parameters that should be passed to the component modules. Because a profile is site- or service-specific, profiles provide all site-specific details to the component modules about the stack requirements.

For example, we could have a WordPress profile include the WordPress, Apache, PHP, and MySQL component modules. It would configure Apache appropriately for the WordPress service, configure WordPress’ dependencies in PHP, and ensure that WordPress has a database available for its use in MySQL. Thus, the profile defines the technology stack.
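A sketch of such a profile follows. The `apache::vhost` and `mysql::db` defined types come from the puppetlabs component modules; the vhost name, paths, and the Hiera key for the database password are illustrative:

```puppet
class profile::wordpress {
  include php
  include mysql::server

  # Site-specific Apache configuration for this stack.
  class { 'apache':
    default_vhost => false,
  }

  apache::vhost { 'wordpress.example.com':
    port    => 80,
    docroot => '/var/www/wordpress',
  }

  # Ensure WordPress has a database available for its use.
  mysql::db { 'wordpress':
    user     => 'wordpress',
    password => lookup('profile::wordpress::db_password'),
  }

  include wordpress
}
```

Note that every site-specific detail lives in the profile (or in Hiera data it looks up), while the component modules remain generic.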

We take a closer look at the roles and profiles design pattern in Chapter 7.

Roles and profiles versus node classifiers

Roles and profiles provide a necessary abstraction layer between your modules and your node classifier. Before roles and profiles became common, it was necessary to use external node classifiers (ENCs) to manage Puppet node assignments, which presented a wide variety of data management problems. Now the ENC is best suited to identify the role to be assigned to the node and pass along provisioning and node-specific details it is well suited to manage.

We explore roles and profiles and how they have replaced other node classification mechanisms in Chapter 7.

Business, Service, Site, Node, and Application Data

Hiera is the data store for node, application, service, site, and business-specific data. Hiera contains three layers for data, each of which can implement a unique hierarchy specific to that layer’s needs. With the addition of pluggable modules to provide external data lookup, Hiera has become the ubiquitous data source for Puppet.

Having all data lookup available through a consistent data access mechanism makes it easy to query data using a standard tool, puppet lookup, without knowledge or care about the origin of the data. This can be extremely valuable for debugging modules when data comes from diverse sources or multiple teams.

Global layer: business data

The global layer is best used for business data that should never vary on a site-specific basis. As it comes first in the data lookup, any value it contains will override values in a lower layer.

Depending on your business needs, you might want to put security controls and mandatory compliance data at this layer. If the site-specific configurations are diverse, it might not be useful to have any data in the global layer.

Environment layer: site, service, and node-specific data

A node’s catalog is built for a specific environment. The data hierarchy in the environment should provide a layered lookup to retrieve node, service, and site-specific data.

Following are some examples of site-specific data appropriate for the environment layer:

  • User account information or authentication services

  • Local data storage replicas

  • DNS resolvers and NTP synchronization sources

Site-specific data is local to a specific implementation and tends to change more often than application code or business logic. New sites come up, infrastructure changes, nodes are added or removed. Placing the data describing your service or site in the Puppet environment makes it easy to localize data that is relative to the context in which it is used.
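An environment-layer hierarchy implementing this layered lookup might look like the following hiera.yaml sketch. The `version`, `defaults`, and `hierarchy` keys are standard Hiera v5 configuration; the paths and the `service` and `site` custom facts are illustrative assumptions:

```yaml
---
version: 5
defaults:
  datadir: data
  data_hash: yaml_data
hierarchy:
  # Most-specific data first: node data overrides service and site data.
  - name: "Node-specific data"
    path: "nodes/%{trusted.certname}.yaml"
  - name: "Service-specific data"
    path: "services/%{facts.service}.yaml"
  - name: "Site-specific data"
    path: "sites/%{facts.site}.yaml"
  - name: "Common defaults"
    path: "common.yaml"
```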

Module layer: application defaults

As a general rule, application defaults do not belong in the global or environment layers of Hiera. Default values for basic usage of the module should be provided within the module data.

One of the most eagerly awaited features provided by Hiera v5 was data in modules. Instead of writing code to introspect the environment (as done in the older params.pp pattern) the Hiera data hierarchy within a module provides a consistent data lookup mechanism for returning application defaults. This allows application defaults to be based on node facts, like os.family or os.version.major.
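A module's data layer is configured by a hiera.yaml at the module root. The following sketch shows fact-based application defaults (the module and key names are illustrative):

```yaml
# <module root>/hiera.yaml -- the module data layer (Hiera v5)
---
version: 5
defaults:
  datadir: data
  data_hash: yaml_data
hierarchy:
  - name: "OS family defaults"
    path: "%{facts.os.family}.yaml"
  - name: "Common defaults"
    path: "common.yaml"

# data/RedHat.yaml could then supply, for example:
#   mymodule::package_name: httpd
```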

Warning

A module may not contain data outside its own namespace. This prevents a module from declaring values for the environment or a different module.

Hiera Data Sources

Hiera provides a data lookup hierarchy with pluggable backend data providers. We examine usage of Hiera in depth in Chapter 6; however, we mention it here to discuss how Hiera can provide access to the different types of data.

Static data in Hiera

Hiera has built-in data backends to read text files in three different formats:

  • YAML

  • JSON

  • HOCON

In a small site that changes infrequently or with stable services, service data can be statically declared in Hiera data files. Any change to the data can be made by pushing changes to those files. Even when service discovery data is available, it’s not uncommon for core services such as DNS servers or package repositories to be maintained in static data.
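Such static service data is nothing more than keys in a data file; a profile then retrieves the values with `lookup()`. The key names and values here are illustrative:

```yaml
# data/common.yaml -- statically declared core services
dns_resolvers:
  - 10.0.0.2
  - 10.0.0.3
ntp_servers:
  - time1.example.com
  - time2.example.com
package_repo_url: 'https://repo.example.com/el/8/x86_64'
```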

Service discovery from Hiera

Modern computing environments are rarely static. Autoscaling pools are increasingly common, service clusters are increasingly large, and dynamic cloud provisioning has become the new normal. In these cases, manually editing data files every time a node comes up becomes burdensome and unrealistically slow. It also creates high levels of churn in your data, which magnifies the opportunity for human error.

Service discovery is the act of discovering information about services from live (as opposed to preconfigured) data. Dynamic service discovery allows internode relationships to be retrieved on demand. For example, a load-balancer module can create service pools using a list of nodes that provide the application.

If you have a large, autoscaled environment, it might be necessary to acquire service discovery data from another source. We cover the use of Hiera backends that access exported resources from PuppetDB for service discovery in “Service Discovery Backends”.

Accessing third-party external data sources in Hiera

You can configure any level of the Hiera hierarchy to source data from an external third-party database, application, framework—really anything that can be queried. This makes reuse of existing data sources easy to manage.

Node Classification

Node data is a form of site-specific data usually handled separately from general site data. Site-specific data tends to be relatively static, whereas nodes are added to your data store every time a node is brought up or down. The node data contains unique information about each node, such as the IP address of the node, the physical location of the node, and the Puppet environment assigned to that node.

ENCs

An ENC utilizes node information provided by a data source to determine what roles (and thus profiles) should be assigned to a specific node. Such data sources are typically provisioning systems that maintain their own host databases and manage more properties of a node than Puppet requires.

Besides tracking node details, the ENC can provide a list of Puppet classes for the node. This is best used to assign the node’s role.

There are many options available for external node classification:

  • Puppet Enterprise includes a node management interface and classifier.

  • Mature infrastructure management solutions such as Foreman and Cobbler provide node classification for Puppet.

  • Node data can be retrieved from the Lightweight Directory Access Protocol (LDAP), a database, or a NoSQL implementation like MongoDB.

  • Node data can be queried from cloud or infrastructure vendor APIs.

Because ENCs are simple to write, you can use any data source that can provide information about the nodes.

Tip

Using an external data store as a classifier does not preclude you from using PuppetDB for reporting and analysis purposes.

Hiera as a node classifier

An alternative to using an external classifier is to assign roles and profiles to the node based on information the node knows about itself. The node supplies facts, which can be used in the Hiera hierarchy to include the role appropriate for the node. This is usually done with custom facts used specifically to assist with node classification.
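A common pattern for this (not a Puppet built-in, but widely used) is a hierarchy level keyed on a custom fact such as `role`, with each data file listing the classes to apply under a conventional `classes` key. The site manifest then becomes a one-liner:

```puppet
# site.pp -- look up the 'classes' array, merged uniquely across all
# hierarchy levels, and include each listed class.
lookup('classes', Array[String], 'unique').include
```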

Appropriate uses of node data

The best use of node classification is to gather enough data to assign the appropriate role to a node and, optionally, to provide node-specific configuration details. Attempts to source site and service data from node provisioning systems invariably lead to painful duplication and inconsistency in the data.

Provisioning systems rarely match the flexibility of Hiera for assigning roles, defining application profiles, and general node configuration. For example, the node classifier in Cobbler cannot set class parameters and has limited support for defining groups of classes. The Puppet Enterprise node classifier and Foreman both use all features of the Classifier API, but the API was never intended to provide dependencies, class ordering, or relationships. These are business logic that should be expressed in the roles and profiles.

Summary

This chapter introduced categories for code and data used in a Puppet deployment and discussed how categories are related to components of Puppet. The roles and profiles design pattern was introduced for effective usage of code and data to deploy a complete application stack.

Here are this chapter’s takeaways:

  • Before writing code, take a moment to categorize the scope of the data using the list presented at the beginning of this chapter.

  • Keep modules small, focused, and modular so that you can reuse them.

  • Isolate application logic to the Puppet module.

  • Manage business logic and application dependencies using the roles and profiles pattern.

  • Use an ENC to retrieve node data from your provisioning system.

  • Use Hiera at the environment layer to localize site- and service-specific data.

  • Consider using service discovery to dynamically manage internode relationships and service data.
