Chapter 2
Key Concepts

In this chapter, I'll explain some of the key design concepts that underlie Git. The implementation of these concepts forms the basis for how Git works and how you use it. I'll broadly break these concepts down into two categories, user-facing and internal, and show how they differ from those of more traditional source management systems. Lastly, I'll focus on some important considerations for creating repositories in Git and for managing special content such as binary files.

DESIGN CONCEPTS: USER-FACING

Version control systems (VCS) such as Git can be broadly classified as either centralized or distributed. Git is an example of a distributed version control system (DVCS). Other systems in this category include Mercurial and Bazaar. Examples of centralized version control systems (CVCS) include Concurrent Versions System (CVS) and Subversion.

The fundamental differences between a DVCS and a CVCS have to do with how the system manages repositories and the workflow that the user employs to get content into the server-side part of the system.

Centralized Model

Figure 2.1 illustrates a traditional centralized model. In this model, you have a central server that holds all of the repositories, with all of the history and all versions of changes that have been put into the system over time. This area is the one source of truth—the container of all the repositories.


Figure 2.1 A traditional centralized version control model

When users want to work with a file in one of these repositories, they connect to the server via a client, and retrieve the files and the versions they want to work with. They then make whatever changes they need to, connect to the server again, and send the update back to it. There, the differences from the previous version are determined and stored in the repository as updates.

In this type of model, users are dependent on the central server. If, for some reason, users cannot connect to the server, they cannot do any source management operations.

Distributed Model

In a distributed system, the model is somewhat different. There is still a server that holds the shared repositories, and that clients interact with. However, when users want to start making changes, instead of getting individual files or directories from the server, they get a copy of the entire repository. The copy comes from the server side and has all content (including history) up to the point in time when the copy is created.

In Git terminology, the server side is called the remote repository (or just the remote). The copy operation is referred to as a clone. The area on your local system that holds the cloned repository can be called your local environment, because it consists of several layers (which you'll explore in the next chapter). For simplicity, I'll refer to the remote repository as just the remote throughout the rest of this discussion. Figure 2.2 illustrates this model.


Figure 2.2 A distributed version control model
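
As an example of how simple this operation is for the user, cloning a remote repository onto your local system is a single command. (The URL here is just a placeholder for wherever your remote is hosted.)

    $ git clone https://githost.example.com/team/project.git
    Cloning into 'project'...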

The actual cloned (copied) repository within the local environment is called the local repository. It has all of the files, history, and other data that were in the remote. A change recorded in the local repository is called a commit, similar in concept to a check-in in some other systems.

Once users have cloned from a remote, they can do all of their source management operations against the local repository. When users have made all the commits they want in the local repository, they then push their changes to the remote.
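
As a sketch of that day-to-day flow (the file name, branch name, and commit message here are only examples):

    $ git add feature.c                      # stage a modified file
    $ git commit -m "Add input validation"   # record the change in the local repository
    ... more local changes and commits ...
    $ git push origin main                   # send the accumulated commits to the remote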

The key difference here is that, in a DVCS such as Git, users are performing the source management operations against a local copy of the server-side (remote) repository instead of making them against the actual server-side repository. Until users need to push the changes back to the remote, they do not even need to be connected to it. The connection between the local and the remote side is not constant. Rather, it is activated when updates need to be synchronized between the two repositories.

Because users do not have to be connected to the remote to do their source management operations, they can work disconnected from the remote. As noted in Chapter 1, this is referred to as being able to do disconnected development. Figure 2.3 shows a conceptual model of this approach.


Figure 2.3 Disconnected development

In Figure 2.3, starting on the left, a user makes a change to a file in the local repository without any connection to the remote. Then a second change is made in the same way. Finally, the local environment is synced up with the remote side so that both areas have the latest content.

One other thing to note is that a remote can actually be any Git repository that is set up to function that way. Most commonly, a remote is a Git repository hosted on a server and running as a daemon process. However, there are various protocols for communicating between Git clients and servers, even a simple one that operates via shared folders. I'll have more to say about these protocols in Chapter 12 where I discuss remotes in more detail.
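
As a preview, the protocol in use is reflected in the remote's URL. All of the following forms are valid, depending on how the remote is set up (hosts and paths here are placeholders):

    $ git clone https://githost.example.com/team/project.git   # HTTP(S)
    $ git clone git@githost.example.com:team/project.git       # SSH
    $ git clone git://githost.example.com/team/project.git     # the Git daemon
    $ git clone /srv/git/project.git                           # a local or shared folder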

DESIGN CONCEPTS: INTERNAL

Another area where Git differs significantly from traditional source management systems is in the way it represents and stores changes internally.

Delta Storage

In a traditional source management system, content is managed on a file-by-file basis. That is, each file is managed as an independent entity in the repository. When a set of files is added to a repository for the first time, each file is stored as a separate object in the repository, with its complete contents. The next time any changes to any of these files are checked in, the system computes the differences between the new version and the previous version for each file. It constructs a delta, or patch set, for each file from the differences. It then stores that delta as the file's next revision.

This model is called delta storage. Figure 2.4 illustrates this process.


Figure 2.4 The delta storage model

In the first iteration, files A, B, and C are checked in. Then, changes are made to the three files and those changes are checked in. When that occurs, the system computes the deltas between the current and previous versions. It then constructs the patch set that will allow it to re-create the current version from the previous version (the set of lines added, deleted, changed, and so on). That patch set is stored as the next revision in the sequence. The process repeats as more changes are made. Each delta is dependent on the previous one in order to construct that version of the file.

To deliver the most current version of a file when a client requests it, the system starts with the original version of the file and then applies each delta in turn to arrive at the desired version. As the files continue to be updated over time, more and more deltas are created. In turn, more deltas must be applied in sequence to deliver a requested version. Eventually, this can lead to performance degradation, among other issues.
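
As a rough sketch of the concept using generic Unix tools (this is just an illustration, not how any particular CVCS is actually implemented), a delta-based history amounts to something like this:

    $ diff -u fileA.v1 fileA.v2 > fileA.delta1   # store only the differences
    $ diff -u fileA.v2 fileA.v3 > fileA.delta2
    $ cp fileA.v1 current                        # to reconstruct version 3...
    $ patch current < fileA.delta1               # ...apply every delta in turn
    $ patch current < fileA.delta2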

Snapshot Storage

Git uses a different storage model, called snapshot storage. Whereas in the delta model, revisions are tracked on a file-by-file basis, Git tracks revisions at the level of a directory tree. You can think of each revision within a Git repository as being a slice of a directory tree structure at a point in time—a snapshot. The structure that Git bases this on is the directory structure in your workspace (minus any files or directories that Git is told to ignore—more about that later).

When a commit is made into a Git repository, it represents a snapshot of part or all of the directory tree in the workspace at that point in time. When the next commit is made, another snapshot is taken of the workspace, and so on. In each of these snapshots, Git captures the contents of all of the involved files and directories as they exist in your workspace at that moment. It records the full content; no deltas are computed at this point.
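
You can verify this yourself in any Git repository with the plumbing command git cat-file, which prints the raw contents of Git's internal objects (the hashes shown here will of course differ in your repository):

    $ git cat-file -p HEAD             # a commit points to a tree (the snapshot)
    tree 9ae20f81c7ba...
    parent 3b18e512dba7...
    author ...

    $ git cat-file -p 'HEAD^{tree}'    # the tree lists full file contents (blobs)
    100644 blob 5716ca5987cb...    fileA
    100644 blob 7601d6f95d97...    fileB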

The snapshot storage model is shown in Figure 2.5. In this model, you have the same set of three files, A, B, and C. At the point they are initially put into the repository, a snapshot of their state in the workspace is taken and that snapshot (with each of the file's full contents) is stored in Git and referenced as a unit.


Figure 2.5 The snapshot storage model

As additional changes are made to any of the files and further commits are done, each commit is built as a snapshot of the structure as it is at that point. If a file hasn't changed from one commit to the next, Git is smart enough not to store a new copy, and simply links to the previous version. Note that no deltas are being computed at this point, and Git is managing content at the level of a commit rather than individual files.

Later, when you want to get one of these snapshots back, Git can just hand back the specific set of content associated with that commit, without going through the extensive reconstruction process required by the delta model.

Git's Storage Requirements

One of the questions that usually comes to mind when people are introduced to the snapshot storage concept is, “Doesn't this use a lot of disk space?” There are a few points related to that. First, as I just noted, Git can use links in some cases to avoid storing duplicate content. Second, Git compresses content using zlib compression. (Notice the smaller compressed size of the blocks representing content in the repository in Figure 2.5.) Third, at certain trigger points, such as when garbage collection runs, Git looks for content that is very similar between revisions and packs those revisions together into a compressed pack file. In these cases, it can actually create an associated delta of sorts that represents the differences between very similar revisions. The delta here is what it takes to get back to previous revisions; Git assumes that the most recent revision is the one that will be requested most often, and thus the best one to keep as a full, ready revision.

So, in the Git model, the use of any deltas is a deliberate optimization for storage rather than the default versioning mechanism. Figure 2.6 illustrates a way to think about this concept, where multiple objects have been packed together internally. This is invisible to the user. From a user perspective, Git still manages interactions with the user in terms of individual snapshots, regardless of whether or not content ends up packed in the repository.


Figure 2.6 A representation of Git's packing behavior to optimize content size

All of these approaches help to reduce the space a Git repository requires. In fact, if you were to compare the corresponding disk space requirements for a source control system that uses the delta model to the snapshot model that Git uses, you might find that in the best cases, Git actually uses less.
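
If you're curious, you can observe this packing behavior in your own repositories (the counts will vary):

    $ git count-objects -v   # shows loose objects vs. objects in pack files
    $ git gc                 # manually triggers garbage collection and packing
    $ git count-objects -v   # most objects should now appear under "in-pack"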

(You may be wondering how a model like this handles binary files since those don't lend themselves to a delta model. I cover dealing with Git and binary files in more detail later in this chapter.)

A final, related point is that Git is designed to work with multiple, smaller repositories rather than large, monolithic repositories, a characteristic I'll explore in more detail in the next section.

So, to summarize, there are two differences between delta and snapshot storage:

  1. Delta storage manages content on a file-by-file basis, as opposed to snapshot storage where content is managed at a directory tree level.
  2. Delta storage manages versions over time by computing the differences from revision to revision (the delta) and storing that information. It reconstructs later revisions by starting with the base version and applying deltas on top of it. Because snapshot storage stores a capture of the entire tree, it usually does not have to do any reconstruction, or only a very small amount if the content has been packed.

Git's approaches in these areas create a very powerful model to build on, especially as they pertain to branching. However, they also create the need to structure repositories appropriately in Git for the best usability and performance. This is the topic of the next section.

REPOSITORY DESIGN CONSIDERATIONS

When beginning to work with Git, whether creating repositories for new content or migrating existing content from another source management system, it is important to consider how you size and structure your repositories. For existing content, unless your code is already broken down into very distinct, separate modules, a one-to-one migration is unlikely to be the best approach. This is because of repository scope.

Repository Scope

A key point to keep in mind when beginning to work with Git is that it is designed to be used as a set of many, smaller repositories. How small? Well, as an example, consider the case of a Java project managed in a traditional, centralized source management system. You might have a single repository for a Java project that's made up of ten different JARs, with all of the source code for all of the JARs stored in different subdirectories in the repository. This arrangement typically works well in a centralized model where each file is managed separately. In the working model for that system, you don't typically check out or check in the entire repository each time. You can manage things at smaller granularities, such as only checking out the subdirectory with the code for one particular JAR, modifying a few files, and then checking those files back in.

In the Git model, a more common scenario would be to have a separate repository for the code associated with each separate JAR. Why? Recall that Git manages changes as commits that are a snapshot of the larger workspace—the set of files and directories. While Git is efficient in how it stores and retrieves data, this efficiency is still relative to the size of the content. If the content is inordinately large, you may find yourself waiting longer than you'd expect for operations that get or put data from or into the repository.

In addition, as I alluded to in Chapter 1, because Git manages content in terms of snapshots, any changes by two users within the scope of the same snapshot, regardless of whether or not they are to the same file, have potential to cause a merge conflict, depending on timing.

To illustrate this, suppose you and another user clone the same Git repository down to your local systems, and the repository contains directories 1 and 2. The other user makes a change to file A in directory 1, commits it, and pushes it up to the remote. Then you make a change to file B in directory 2, commit it, and attempt to push your changes back to the remote. Git will reject your push at that point, because Git considers that something else (anything else) has changed in this repository since you originally got your copy of the code. Even though you didn't touch the same file as the other user, you have a merge conflict within the snapshot, because someone else made a change before you could get yours in. This is one of the key frustrations for new Git users. I'll talk more about this in Chapter 13, including how to resolve the merge conflicts. (Also, see the following Note.)
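
A typical exchange looks something like this (the repository URL and branch names are placeholders, and the exact wording of Git's messages varies by version):

    $ git push origin main
    To https://githost.example.com/team/project.git
     ! [rejected]        main -> main (fetch first)
    error: failed to push some refs to 'https://githost.example.com/team/project.git'
    hint: Updates were rejected because the remote contains work that you do
    hint: not have locally. ...

    $ git pull origin main   # fetch and merge the other user's change first
    $ git push origin main   # now the push is accepted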

In addition to repository size, there's a second point to consider. Ideally, you want to create repositories that will not have too many users working in them at the same time, and making (from Git's viewpoint) conflicting changes. This will help limit the number of rejected pushes and the amount of merging work that has to be done.

Having smaller repositories with only a few users making changes also allows for closer collaboration and coordination. It helps to keep operations in Git working quickly and smoothly when Git is manipulating content at the scope of a repository. This also applies to development environments, such as Eclipse, that look at projects as equating to a repository when interfacing with Git.

In general, you can think of one repository in Git as equating in scope to one module of your project. If your code is not already modularized, it can sometimes be difficult to figure out what should constitute a module. One general guideline is to map the code that builds a JAR, DLL, EXE, or other single component to a repository. Think in terms of what code you would use to build a single deliverable in an application such as a Gradle or Maven project, or in a developer interface such as Eclipse, IntelliJ, or Visual Studio. Consider code that is owned and maintained by only one or a few people, to reduce the risk of merge conflicts. If your code does not easily map out this way, then it's worth spending some time up front to figure out how to get it into a structure that is more modular. You can then base your Git repositories on that revised structure.

When considering how to organize code in Git repositories, it's also important to consider whether all categories of content related to a module are appropriate to migrate or store in a repository. There are general guidelines (especially around very large files) that apply, mostly independent of the source management application. I'll explore those guidelines next.

File Scope

When dealing with very large files, there are a number of considerations and approaches to take into account. An arbitrary definition of very large might be over 100 MB for text files, and less for binary files, for reasons I'll talk about in the next few sections. Nearly all of these considerations apply to any source management system, not just Git. I'll now discuss some points you should consider.

Storage Model

Source management systems generally can't compute meaningful deltas between versions of binary files. As a result, they end up storing a full copy for each change to a binary file. This is necessary, but inefficient, and can quickly consume significant disk space if the files are large. Even in a system such as Git that compresses content, most binary files do not compress well. For certain types of smaller binary content, such as icons or other graphical elements, storing those files in the system usually doesn't present a problem and makes sense. For larger files, some pre-planning of alternative approaches can help avoid issues in the repository. One common alternative is to store these files in a separate repository.

Separate Repositories

For the reasons outlined previously, storing very large files, especially binaries, in a repository such as Git is not the best approach. This also applies to generated files. Instead, there are specially designed applications for working with these types of files: artifact repositories. Artifact repositories work much like a source control system, but are designed to be a good fit for managing versions of files that don't really belong or fit well in your standard source repositories. Builds and other parts of a pipeline can pull source code from the source management system and resolve needed pre-built binary dependencies from artifact repositories. Some of the more popular artifact repositories today include Artifactory and Nexus.

There is also an option to store large files that need to be managed in source control in a second, separate Git repository designated for them. This approach still suffers from the problems discussed in the “Storage Model” section. However, it does remove the impact of dealing with the large binaries in the other smaller repositories.

Extensions to Git

Not surprisingly, a set of applications and packages has been created around trying to solve the limitations of Git with large files. Among these are extensions to Git, such as the git-annex and Git Large File Storage (Git LFS) open-source packages. There are also other packages, but these two seem the most likely to continue to receive support and development. This is primarily due to their incorporation into two of the major Git-hosting applications: git-annex has now been incorporated into GitLab as GitLab-Annex, and Git LFS is now incorporated into GitHub as well as some versions of Bitbucket—another Git repository hosting system.

In these implementations, the large files are stored in a separate space, but referenced by pointers inside of a normal Git repository. The applications vary in terms of characteristics that include the following:

  • Performance
  • Configurability (Can files be stored on user-configurable locations?)
  • Ease of use (registering files, and using existing commands versus new commands)
  • Cost for long-term/large-scale use
  • Learning curve

All of these characteristics factor into the transparency and usability of the process, but some setup and overhead is always required.
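
As a brief example of what using one of these extensions looks like in practice, here is the basic Git LFS flow (the file pattern and names are only examples, and your hosting service must support LFS):

    $ git lfs install                  # one-time setup on a machine
    $ git lfs track "*.psd"            # manage matching files as LFS pointers
    $ git add .gitattributes design.psd
    $ git commit -m "Add design assets via LFS"
    $ git push origin main             # the large content goes to the LFS store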

Generated Content

Files generated from source code stored in your source control system should not themselves be stored in the source management system. If these files are generated from sources that you control, then they can always be reproduced from those sources.

In a model where the generated files are stored in the source repository, if the sources change frequently, then the generated content must also be updated frequently in the repository. This can be challenging to keep in sync and can lead to the problems discussed in the “Storage Model” section.

Generally, the reason files produced from existing source end up stored in the source management system boils down to wanting them easily accessible, or to using the source management system as a transport mechanism between processes. However, there are better ways to meet those needs, such as using an artifact repository (described in the “Separate Repositories” section) that is designed for this purpose.
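
The usual way to keep generated content out of a Git repository in the first place is a .gitignore file at the top level of the repository. A minimal sketch for a Maven-style Java project might look like this (adjust the patterns to your own build tooling):

    # .gitignore
    # Build output directory
    target/
    # Compiled bytecode and packaged artifacts
    # (keep JARs in an artifact repository instead)
    *.class
    *.jar
    *.log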

Shared Code

While I'm on the topic of easily accessing code in the source management system, at times, it may seem that you need to share code from one repository to another. Git provides a way to do this through a construct called submodules. A submodule is essentially a static reference to another repository that resides in your local environment. Git understands that it is a separately managed repository even though it is in your tree structure.

Submodules can be useful in certain cases, such as when an organization needs to share source for development dependencies that one group is working on with other groups. However, they can be challenging to keep in sync without careful attention to updates. Managing them requires a different, coordinated set of operations, and it can be easy to end up with an out-of-date (back-level) copy for yourself or other users. For these reasons, submodules can be problematic and are not generally recommended for beginning Git users.

Git also supports another construct called subtrees that provides similar benefits to submodules, but with a simpler structure and a simpler set of operations to manage them. Both submodules and subtrees are explored in detail in Chapter 14, and I advise reading that chapter before attempting to use either of these constructs.
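
For reference, the basic setup commands look like this (the URL and paths are placeholders; again, see Chapter 14 before using these in earnest):

    $ git submodule add https://githost.example.com/team/lib.git lib
    $ git submodule update --init    # populate submodules after a fresh clone

    $ git subtree add --prefix=lib https://githost.example.com/team/lib.git main --squash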

Another alternative approach is to just build the needed artifacts from other repositories separately and specify them as compile-time or run-time dependencies to pull them in, if this fits with how your project is organized.

SUMMARY

In this chapter, you learned about some of the differences between Git's overall design and functioning and that of more traditional centralized source management systems. I covered the model that Git uses to clone a repository and create a stand-alone local environment in which to do source management operations, versus the typical legacy model of always doing the operations on the server. Along these lines, I talked about how Git is structured with the local environment and the remote environment. I also introduced the concept of disconnected development, which is one of the appealing aspects of using Git. All of this allows you to get things the way you want them locally before you share them back with others in a public repository.

I also shared some insights on how Git manages things internally. You learned how Git sees sets of files involved in a commit as a unit and works at a granularity that is directory tree–based, not file-based. You also looked at how it stores and manages commits over time.

Finally, I discussed some considerations when creating or migrating to Git repositories, defining some guidelines for repository scope and file scope, especially around large files and binaries. Git is not strong in managing very large files, but there are good alternatives.

In the next chapter, you'll expand your understanding of the local environment that Git provides by looking at the Git promotion model, as well as looking at the workflow to move content through the different levels.
