Chapter 3. What Software Developers Want

To think about how a team can work together for a more cohesive view of data management, it’s necessary to step back and look at what’s important to the various players. Let’s start with the developer’s perspective.

As developers, we have a substantial impact on the way our users do business. All we have to do is keep up with a never-ending list of new requirements, bug fixes, and technical debt.1 How fast we can respond to these needs dictates how fast others can get on with their jobs.

To address the stream of changes, we need to be flexible. The software we work with needs to be flexible, too.

Freedom from Rigid Schemas

We’ve come to realize that requirements are constantly evolving. This is natural—as our understanding of business needs changes, our software must adapt to the current demands. Attempting to define a schema up front that will handle all of our data and meet all of our needs is time consuming and frustrating—and changing requirements will affect what needs to be stored, updated, searched, and queried.

We understand that having a schema isn’t bad at all. There’s value in knowing what your data looks like. The challenge presented by relational databases isn’t that they require a schema—it’s that they require exactly one.

One of the hurdles to being nimble is the extremely painful process of schema changes. Identifying the needed change is just the first step, to be followed by a migration plan, scripting the change, adapting the business layer code to the new schema, and working with the DBA to find a window where the change can be made.
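To make that cost concrete, here is a minimal sketch of the mechanical steps such a change involves, using Python’s bundled sqlite3 module with a hypothetical table and column (the names are invented for illustration): alter the schema, backfill every existing row, and only then adapt the application code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO customers (name) VALUES ('Ada'), ('Grace')")

# Step 1: the schema change itself.
conn.execute("ALTER TABLE customers ADD COLUMN email TEXT")

# Step 2: backfill every existing row so it matches the new shape.
conn.execute("UPDATE customers SET email = '' WHERE email IS NULL")

# Step 3: application code must now read and write 'email' too; in
# production this also needs a migration plan and a maintenance window.
rows = conn.execute("SELECT id, name, email FROM customers").fetchall()
print(rows)  # every existing row now carries the new column
```

Even in this toy case, three distinct steps must be coordinated; on a production table with millions of rows, each step becomes a project of its own.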

We need an approach to managing data that gives us the access and flexibility necessary to address the constantly changing demands of the business.

Schemas rigidly constrain the data that can be persisted into a particular table (Figure 3-1). Every row of the table must match this schema, and if the schema changes, every row of existing data must be reformed. Other tools can use schemas to validate data while still allowing the ingestion of data that fails validation. Where ingestion at huge scale and complexity is required, this can save on data reconciliation costs.2

ddsd 0301
Figure 3-1. Rigid schema (RDBMS)

Dynamic Schemas

Without any schemas, almost any type of data can be persisted. If the data is Extensible Markup Language (XML), an explicit schema can exist on a per-document level. JavaScript Object Notation (JSON) documents can reference schemas as well, using JSON Schema or a competing standard.3 XML and JSON are both considered self-describing4 data. Text, comma-separated values (CSV), and other formats offer enough structure that at least some record-based indexing can be done.
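A small illustration of what “self-describing” means in practice (the record and its fields are invented for the example): both JSON and XML carry their field names along with the data, so a reader can recover the structure with no external schema at all.

```python
import json
import xml.etree.ElementTree as ET

# The same record as self-describing JSON and XML: the field names
# travel with the data, so no external schema is needed to parse it.
json_doc = '{"id": 7, "title": "Q3 report", "tags": ["finance", "draft"]}'
xml_doc = '<doc><id>7</id><title>Q3 report</title></doc>'

record = json.loads(json_doc)
root = ET.fromstring(xml_doc)

print(record["title"])          # Q3 report
print(root.find("title").text)  # Q3 report
```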

Selective Use of Schemas

Our applications deal with all types of collections, and schemas should be there to help us when and if we want them. When we are processing diverse types of data, we would like to be able to work without schemas, and when we want to maintain structure and rigor, we want schemas.

For example, some applications require only a Google-style free text search to find the data we require. At the other extreme, some applications require complex aggregates against rigidly fielded data types. We want tools that work with us as we accept, transform, and deliver data. By their nature, our applications make assumptions about the data they process. The old RDBMS model requires that homogeneous collections of records populate each table, and this assumption simply does not fit many of the situations we encounter. One can use a filesystem to persist heterogenous objects and one can use a directory to hold a collection. (See Figure 3-2.) Most of us have developed programs that use both an RDBMS and a filesystem. Such a system is using a polyglot persistence example (described in more detail momentarily). The point is simply that our applications require flexibility in how they persist data.

Multi-model databases provide a single system to store multiple types of data, including JSON, XML, text, CSV, triples (RDF), and binary. Accessing these data types from a single source using a common interface simplifies the application code, compared to using different systems to store the same information. This provides much greater flexibility to implement features that require data integration across these boundaries. These tools can therefore maintain homogeneous, constrained, and/or heterogeneous collections.

When such tools are used, applications can be written to be constrained “just enough.” For example, saving to a homogeneous collection that is constrained to a single schema might be the right choice for storing financial data early in an application’s lifecycle. A constrained collection might be a good fit as that application evolves to handle more varied inputs. It might be enough if all the documents had just a few fields in common. Setting up a cross-reference sub-system in that same application might be best supported by a heterogeneous collection.
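One way to read “constrained just enough” in code (a hand-rolled sketch, not any particular database’s API): accept any document as long as the few common fields are present, and pass everything else through untouched.

```python
REQUIRED = {"id", "kind"}  # the few fields all documents must share

def accepts(doc: dict) -> bool:
    # Constrained "just enough": only the common fields are enforced;
    # any extra, document-specific fields pass through untouched.
    return REQUIRED <= doc.keys()

print(accepts({"id": 1, "kind": "invoice", "total": 99.5}))  # True
print(accepts({"kind": "note", "body": "missing id"}))       # False
```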

Taking this example a bit further, the cross-reference sub-system could benefit from full-text searches across all collections and data types. Such needs are satisfied in a polyglot persistence system with a dedicated search tool. Multi-model databases incorporate features that could support all aspects of data persistence for this hypothetical application.
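The cross-collection text search described above can be sketched with a toy inverted index (a teaching sketch with invented documents, not how a production search engine is built): every document, whatever its shape, contributes its words to a single searchable index.

```python
from collections import defaultdict

# Text drawn from documents of different types in different collections.
docs = {
    "inv-1": "invoice for ACME consulting services",
    "note-2": "call ACME about the overdue invoice",
}

# Build a toy inverted index: word -> set of document ids.
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():
        index[word].add(doc_id)

print(sorted(index["acme"]))     # both documents match
print(sorted(index["overdue"]))  # only the note matches
```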

ddsd 0302
Figure 3-2. A multi-model DBMS should be capable of persisting homogeneous, constrained, and heterogeneous collections

Ability to Handle Different Data Types

The relational approach comes with a couple of assumptions: data comes in rows and columns, and the specific set of columns to be stored within a given table is the same for each row. Actual data, of course, comes in various shapes and sizes, and using tools that constrain our choices makes it more difficult for developers to meet the current and ongoing needs of our stakeholders. Persisted data across collections may include schema-full and schema-less data. Accessing them through a single interface simplifies code and makes the delivery of value to our stakeholders more efficient.

Many software projects begin with existing sets of data. These may come from APIs offering JSON or XML, binary documents produced with desktop software, or even linked data describing connections among entities. Applications may also generate new types of data. There is no one-size-fits-all approach to storing information like this; for each of these types of data, an architect needs to determine the best way to store it, based on what the data looks like and what one needs to do with it.

Architectural Simplicity

As projects expand over time, more data and more types of data are included. If we are using a multi-model database, then our system can handle incoming data in its raw form and process it using tools appropriate to those types. A multi-model data management system that supports text, JSON, XML, CSV, triples, binary, and geospatial models—and also supports indexing across all of those models—can provide a robust basis for a variety of applications. Without multi-model data support, polyglot persistence solutions can be assembled. But that assembly can be difficult to get right.5 As it turns out, performing join operations and building and invoking indexes for searching are best done by databases, not by application code.

Polyglot Persistence

Neal Ford coined the term polyglot programming to express the idea that complex applications combine different types of problems, so picking the right language for each job may be more productive than trying to fit all aspects into a single language.6 This same concept can be applied to databases; you can have an application that talks to different databases using each for what they are best at to achieve an end goal, thus giving birth to polyglot persistence. While polyglot persistence has advantages, issues arise from needing to integrate multiple tools to accomplish this. We characterize this as a multiproduct approach. (See Figure 3-3.)

As was mentioned before, to qualify as multiproduct, we need only two persistence mechanisms. Our simplest example illustrates the problem. If our application tier is writing to both a local filesystem and a database, the consistency guarantees for the records persisted in each are very different. If one has two instances of the application server running, files written on one server will not be visible from the other. Yes, this can be solved,7 but this choice has made our lives and the lives of other professionals in our organization a little more complicated!8
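The consistency gap is easy to reproduce in miniature (a deliberately failing sketch with invented names): the database write participates in a transaction and is rolled back, but the file write that accompanied it is not, and survives.

```python
import sqlite3
import tempfile
from pathlib import Path

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
attachment = Path(tempfile.mkdtemp()) / "order-1.txt"

try:
    with conn:  # transactional: rolled back if an exception occurs
        conn.execute("INSERT INTO orders (status) VALUES ('new')")
        attachment.write_text("receipt for order 1")  # NOT transactional
        raise RuntimeError("crash before commit")
except RuntimeError:
    pass

# The database rolled back, but the file survived: the two stores now
# disagree, which is exactly the polyglot consistency problem.
print(conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0])  # 0
print(attachment.exists())  # True
```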

ddsd 0303
Figure 3-3. Multi-product, multi-model persistence

Multi-Model DBMS

With a unified multi-model database, all the issues surrounding persisting data in our organization still must be addressed—exactly once.

Minimize Conversion

In an effort to maintain stability and to control complexity, many organizations will try to limit the types of data persistence they allow. This is quite understandable, but it trades one type of complexity for another. Rather than pushing our colleagues to adopt a myriad of persistence tools, we are asked to push and pull our data into the limited tools that are available. For us, this conversion code is some of the most complex, voluminous, and brittle code we write. Polyglot persistence and multi-model databases can both improve this particular situation.
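A tiny example of the conversion code this paragraph describes (the document shape is invented for illustration): flattening a nested document into the rows a relational store expects. Every new field or level of nesting means revisiting this mapping, which is why such code grows brittle.

```python
doc = {
    "order_id": 17,
    "customer": {"name": "ACME", "region": "EMEA"},
    "lines": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
}

# Conversion code: push a nested document into flat relational rows,
# repeating the parent fields on every child row.
rows = [
    (doc["order_id"], doc["customer"]["name"], line["sku"], line["qty"])
    for line in doc["lines"]
]
print(rows)  # one row per order line, parent fields repeated
```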

Functionality

What will our applications do with the data? Our end users may want full-text searches, fast transactions, analytical data, or—most likely—a combination of many capabilities. However the solutions are structured, they must implement whichever set of capabilities is required. Even better, our solutions should be flexible enough to handle future needs that haven’t surfaced yet. To maintain a smooth, Agile approach to feature delivery, having a fundamental set of tools available for all types of data likely to be encountered becomes a requirement (see Figure 3-4).

ddsd 0304
Figure 3-4. Characteristics of multi-model DBMS

Shaping Data and Impact on Other Roles

To address requirements in better and faster ways, we need to have access to the data, as well as the ability to shape it. We must be able to determine how to store a wide variety of data types, adjusting as we go.

Now that we have determined our “wants,” let’s look at the primary responsibility of others involved in the development process. By doing so, we can understand the impact of the changes we want to pursue on these other domains.
