Chapter 8. Managing Code

Working on a software project that involves more than one person is tough. Everything slows down and gets harder. This happens for several reasons. This chapter exposes those reasons and provides some ways to mitigate them.

This chapter is divided into two parts, which explain:

  • How to work with a version control system
  • How to set up continuous development processes

First of all, a code base evolves so much that it is important to track all the changes that are made, even more so when many developers work on it. That is the role of a version control system.

Next, several brains that are not directly wired together can still work on the same project. They have different roles and work on different aspects. Therefore, a lack of global visibility generates a lot of confusion about what is going on and what is being done by others. This is unavoidable, and some tools have to be used to provide continuous visibility and mitigate the problem. This is done by setting up a series of tools for continuous development processes such as continuous integration or continuous delivery.

Now we will discuss these two aspects in detail.

Version control systems

Version control systems (VCS) provide a way to share, synchronize, and back up any kind of file. They are categorized into two families:

  • Centralized systems
  • Distributed systems

Centralized systems

A centralized version control system is based on a single server that holds the files and lets people check in and check out the changes that are made to those files. The principle is quite simple: everyone can get a copy of the files on his/her system and work on them. From there, every user can commit his/her changes to the server. The changes are applied and the revision number is incremented. Other users are then able to get those changes by synchronizing their repository copy through an update.

The repository evolves through all the commits, and the system archives all revisions into a database to undo any change or provide information on what has been done:
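
As a rough illustration, the basic check-out/commit/update cycle with Subversion looks more or less like the following (the repository URL and file name are hypothetical):

    # Get a working copy of the project from the central server
    svn checkout https://svn.example.com/repos/project/trunk project
    cd project

    # Edit a file, then record the change on the server;
    # the repository revision number is incremented
    svn commit -m "Fix the configuration parser" src/config.py

    # Fetch the revisions committed by other users in the meantime
    svn update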

Figure 1

Every user in this centralized configuration is responsible for synchronizing his/her local repository with the main one in order to get the other users' changes. This means that conflicts can occur when a locally modified file has also been changed and checked in by someone else. In such cases, a conflict resolution mechanism is carried out on the user's system, as shown in the following figure:

Figure 2

The following steps will help you understand the process better (a rough command-level sketch follows the list):

  1. Joe checks in a change.
  2. Pamela attempts to check in a change on the same file.
  3. The server complains that her copy of the file is out of date.
  4. Pamela updates her local copy. The version control software may or may not be able to merge the two versions seamlessly (that is, without a conflict).
  5. Pamela commits a new version that contains the latest changes made by Joe and her own.
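
In Subversion terms, Pamela's side of this exchange could look roughly as follows (the file name and commit messages are only examples; the exact conflict handling depends on whether the changes overlap):

    # Pamela tries to check in, but her working copy is out of date,
    # so the server rejects the commit
    svn commit -m "Rework the login form" login.py

    # She updates her working copy; Subversion merges Joe's revision
    # into it and may report a conflict that she has to resolve
    svn update
    svn resolve --accept=working login.py

    # The new commit now contains both Joe's changes and her own
    svn commit -m "Rework the login form" login.py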

This process works perfectly fine on small projects that involve a few developers and a small number of files, but it becomes problematic for bigger ones. For instance, a complex change can involve many files and take a long time to complete, and keeping all of that work local until it is done is unfeasible. The problems with such an approach are:

  • It is dangerous because the user may keep changes on his/her computer that are not necessarily backed up
  • It is hard to share the work with others until it is checked in, and sharing it before it is done would leave the repository in an unstable state, so other users would not want to do that

Centralized VCSes have addressed this problem by providing branches and merges. It is possible to fork from the main stream of revisions to work on a separate line and then merge back into the main stream.

In Figure 3, Joe starts a new branch from revision 2 to work on a new feature. The revisions are incremented in the main stream and in his branch every time a change is checked in. At revision 7, Joe finishes his work and commits his changes into the trunk (the main branch). Most of the time, this requires some conflict resolution.
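
In Subversion, such a branch is created as a cheap server-side copy and later merged back with svn merge. A minimal sketch, with hypothetical paths, could look like this:

    # Joe forks a feature branch from the trunk
    svn copy https://svn.example.com/repos/project/trunk \
             https://svn.example.com/repos/project/branches/new-feature \
             -m "Create a branch for the new feature"

    # ... Joe checks his work into the branch as a series of revisions ...

    # From a working copy of the trunk, he merges the finished branch,
    # resolves any conflicts, and commits the result
    svn merge ^/branches/new-feature
    svn commit -m "Merge the new-feature branch back into the trunk"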

But in spite of these advantages, centralized VCSes have several pitfalls:

  • Branching and merging is quite hard to deal with. It can become a nightmare.
  • Since the system is centralized, it is impossible to commit changes offline. This can lead to a huge, single commit to the server when the user gets back online.
  • It doesn't work very well for projects such as Linux, where many companies permanently maintain their own branch of the software and there is no central repository that everyone has an account on.

Some tools, such as SVK, make it possible to work offline, but the more fundamental problem is how centralized VCS works in the first place.

Figure 3

Despite these pitfalls, centralized VCS is still quite popular among many companies, mainly due to the inertia of corporate environments. The main examples of centralized VCSes used by many organizations are Subversion (SVN) and Concurrent Versions System (CVS). The obvious issues with the centralized architecture of version control systems are the reason why most open source communities have already switched to the more reliable architecture of distributed VCS (DVCS).

Distributed systems

Distributed VCS is the answer to the centralized VCS deficiencies. It does not rely on a main server that people work with, but on peer-to-peer principles. Everyone can hold and manage his/her own independent repository for a project and synchronize it with other repositories:

Figure 4

In Figure 4, we can see an example of such a system in use:

  1. Bill pulls the files from HAL's repository.
  2. Bill makes some changes on the files.
  3. Amina pulls the files from Bill's repository.
  4. Amina changes the files too.
  5. Amina pushes the changes to HAL.
  6. Kenny pulls the files from HAL.
  7. Kenny makes changes.
  8. Kenny regularly pushes his changes to HAL.

The key concept is that people push and pull the files to or from other repositories, and this behavior changes according to the way people work and the way the project is managed. Since there is no main repository anymore, the maintainer of the project needs to define a strategy for people to push and pull the changes.

Furthermore, people have to be a bit smarter when they work with several repositories. In most distributed version control systems, revision numbers are local to each repository, and there are no global revision numbers anyone can refer to. Therefore, tags have to be used to make things clearer. They are textual labels that can be attached to a revision. Lastly, users are responsible for backing up their own repositories, which is not the case in a centralized infrastructure, where the administrator usually sets up the backup strategy.
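
With Git, for example, pulling, pushing, and tagging boil down to a few explicit commands. The remote names and URLs below are hypothetical:

    # Clone another repository; "origin" will point back to it
    git clone https://git.example.com/hal/project.git
    cd project

    # Register a second developer's repository as an additional remote
    git remote add bill https://git.example.com/bill/project.git

    # Pull Bill's changes, commit some work, and push it back to origin
    git pull bill master
    git commit -a -m "Adjust the import script"
    git push origin master

    # Attach a human-readable label to the current revision and share it
    git tag -a v1.2.0 -m "Release 1.2.0"
    git push origin v1.2.0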

Distributed strategies

A central server is, of course, still desirable with a DVCS if you're working in a company setting with everyone working toward the same goal. But the purpose of that server is completely different from that of the server in a centralized VCS. It is simply a hub that allows all developers to share their changes in a single place rather than pull and push between each other's repositories. Such a single central repository (often called upstream) also serves as a backup for all the changes tracked in the individual repositories of all team members.

Different approaches can be applied to sharing code with the central repository in DVCS. The simplest one is to set up a server that acts like a regular centralized server, where every member of the project can push his/her changes into a common stream. But this approach is a bit simplistic. It does not take full advantage of the distributed system, since people will use push and pull commands in the same way as they would with a centralized system.

Another approach consists of providing several repositories on a server with different levels of access:

  • An unstable repository is where everyone can push changes.
  • A stable repository is read-only for all members except the release managers. They are allowed to pull changes from the unstable repository and decide what should be merged.
  • Various release repositories correspond to the releases and are read-only, as we will see later in the chapter.

This allows people to contribute, and managers to review, the changes before they make it to the stable repository. However, depending on the tools used, this approach may create too much overhead. In many distributed version control systems, this can also be handled with a proper branching strategy.
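
In practice, such a setup often means that each developer configures several remotes and that write access is restricted on the server side. The following is only a sketch with hypothetical repository URLs:

    # Everyone can push their work to the unstable repository
    git remote add unstable https://git.example.com/project/unstable.git
    git push unstable master

    # Only release managers can write to the stable repository;
    # they pull reviewed changes from unstable and publish them there
    git remote add stable https://git.example.com/project/stable.git
    git pull unstable master
    git push stable master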

Other strategies can be made up, since DVCS allows for infinite combinations. For instance, the Linux kernel, which uses Git (http://git-scm.com/), is based on a star model, where Linus Torvalds maintains the official repository and pulls changes from a set of developers he trusts. In this model, people who wish to push changes to the kernel will, hopefully, try to push them to the trusted developers so that the changes reach Linus through them.

Centralized or distributed?

Just forget about the centralized version control systems.

Let's be honest. Centralized version control systems are a relic of the past. In a time when most of us have the opportunity to work remotely full-time, it is unreasonable to be constrained by all the deficiencies of centralized VCS. For instance, with CVS or SVN you can't track changes while offline. And that's silly. What should you do when the Internet connection at your workplace is temporarily broken or the central repository goes down? Should you forget about your whole workflow, just let the changes pile up until the situation improves, and then commit them as one huge blob of unstructured updates? No!

Also, most centralized version control systems do not handle branching efficiently. And branching is a very useful technique that allows you to limit the number of merge conflicts in projects where many people work on multiple features. Branching in SVN is so ridiculous that most developers try to avoid it at all costs. Instead, most centralized VCSes provide file-locking primitives that should be considered an anti-pattern for any version control system. The sad truth about every version control tool is that if it contains a dangerous option, someone in your team will eventually start using it on a daily basis. And locking is one such feature: in return for fewer merge conflicts, it will drastically reduce the productivity of your whole team. By choosing a version control system that does not allow for such bad workflows, you create a situation in which it is more likely that your developers will use the tool effectively.

Use Git if you can

Git is currently the most popular distributed version control system. It was created by Linus Torvalds for maintaining versions of the Linux kernel when its core developers had to give up the proprietary BitKeeper that had been used previously.

If you have not used any version control system yet, you should start with Git from the beginning. If you already use some other tool for version control, learn Git anyway. You should definitely do that even if your organization is unwilling to switch to Git in the near future; otherwise you risk becoming a living fossil.

I'm not saying that Git is the ultimate and best DVCS. It surely has some disadvantages. Most of all, it is not an easy-to-use tool and is very challenging for newcomers. Git's steep learning curve is already the source of many jokes online. There may be version control systems that perform better for a lot of projects, and the full list of open source Git contenders would be quite long. Anyway, Git is currently the most popular DVCS, so the network effect really works in its favor.

Briefly speaking, the network effect means that the overall benefit of using a popular tool is greater than that of using alternatives, even slightly better ones, precisely because of its high popularity (this is how VHS killed Betamax). It is very probable that people in your organization, as well as new hires, are already somewhat proficient with Git, so the cost of adopting this particular DVCS will be lower than the cost of trying something less popular.

Anyway, it is always good to know more, and familiarizing yourself with other DVCSes won't hurt you. The most popular open source rivals of Git are Mercurial, Bazaar, and Fossil. The first one is especially neat because it is written in Python and served as the official version control system for CPython sources. There were signs that this might change, so CPython developers may already be using Git by the time you read this book. But it really does not matter. Both systems are great. If there were no Git, or if it were less popular, I would definitely recommend Mercurial. There is evident beauty in its design. It's definitely not as powerful as Git, but a lot easier for beginners to master.

Git flow and GitHub flow

A very popular and standardized methodology for working with Git is simply called Git flow. Here is a brief description of its main rules (a rough command-level sketch follows the list):

  • There is a main working branch, usually called develop, where all the development of the latest version of the application occurs.
  • New project features are implemented in separate branches called feature branches that always start from the develop branch. When work on a feature is finished and the code is properly tested, this branch is merged back to develop.
  • When the code in develop is stabilized (without known bugs) and there is a need for a new application release, a new release branch is created. This release branch usually requires additional tests (extensive QA tests, integration tests, and so on), so new bugs will definitely be found. If additional changes (such as bug fixes) are included in a release branch, they eventually need to be merged back to the develop branch.
  • When code on a release branch is ready to be deployed/released, it is merged to the master branch and the latest commit on the master is labeled with an appropriate version tag. No other branches but release branches can be merged to the master. The only exceptions are hot fixes that need to be immediately deployed or released.
  • Hot fixes that require an urgent release are always implemented on separate branches that start from master. When the fix is done, it is merged to both the develop and master branches. Merging of a hot fix branch is done as if it were an ordinary release branch, so it must be properly tagged and the application version identifier should be modified accordingly.
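
Translated into plain Git commands (without any helper extension), the life cycle of a single feature, release, and hot fix could look roughly like the following; the branch names and version numbers are only examples:

    # Start a feature branch from develop
    git checkout -b feature/user-profiles develop
    # ... implement, commit, and test the feature ...
    git checkout develop
    git merge --no-ff feature/user-profiles

    # When develop is stable, prepare a release branch
    git checkout -b release/1.3.0 develop
    # ... QA fixes are committed here and merged back to develop ...

    # Release: merge into master and tag the released revision
    git checkout master
    git merge --no-ff release/1.3.0
    git tag -a v1.3.0 -m "Version 1.3.0"

    # Hot fixes branch off master and are merged to both master and develop
    git checkout -b hotfix/1.3.1 master
    # ... fix and commit ...
    git checkout master
    git merge --no-ff hotfix/1.3.1
    git tag -a v1.3.1 -m "Version 1.3.1"
    git checkout develop
    git merge --no-ff hotfix/1.3.1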

A visual example of Git flow in action is presented in Figure 5. For those who have never worked in such a way, and have never used a distributed version control system, this may be a bit overwhelming. Anyway, it is really worth trying in your organization if you don't have any formalized workflow. It has multiple benefits and solves real problems. It is especially useful for teams of multiple programmers who work on many separate features and need to provide continuous support for multiple releases.

This methodology is also handy if you want to implement continuous delivery using a continuous deployment process, because it is always clear which version of the code represents a deliverable release of your application or service. It is also a great tool for open source projects because it provides great transparency to both users and active contributors.

Figure 5: Visual presentation of Git flow in action

So, if you think that this short summary of Git flow makes some sense and has not scared you yet, you should dig deeper into online resources on the topic. It is really hard to say who the original author of the preceding workflow is, but most online sources point to Vincent Driessen. Thus, the best starting material for learning about Git flow is his online article titled A successful Git branching model (refer to http://nvie.com/posts/a-successful-git-branching-model/).

Like every other popular methodology, Git flow has gained a lot of criticism on the Internet from programmers who do not like it. The most commented-on part of Vincent Driessen's article is the (strictly technical) rule saying that every merge should create a new artificial commit representing that merge. Git has an option to do fast-forward merges, and Vincent discourages using it. This is, of course, an unsolvable dispute, because the best way to perform merges is a completely subjective matter for the organization in which Git is being used. Anyway, the real issue with Git flow is that it is noticeably complicated. The full set of rules is really long, so it is easy to make mistakes. It is very probable that you would prefer to choose something simpler.

One such flow is used at GitHub and described by Scott Chacon on his blog (refer to http://scottchacon.com/2011/08/31/github-flow.html). It is referred to as GitHub flow and is very similar to Git flow:

  • Anything in the master branch is deployable
  • The new features are implemented on separate branches

The main difference from Git flow is simplicity. There is only one main development branch (master), and it is always stable (in contrast to the develop branch in Git flow). There are also no release branches, and there is no big emphasis on tagging the code. There is no such need at GitHub because, as they say, when something is merged into master it is usually deployed to production immediately.
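
A minimal sketch of this workflow with plain Git commands might look as follows (the branch name is just an example; at GitHub the review and merge steps usually happen through pull requests):

    # Branch off the always deployable master
    git checkout -b improve-signup-form master
    # ... commit the work and push the branch to share it for review ...
    git push origin improve-signup-form

    # After the review, merge the branch into master and publish it
    git checkout master
    git merge improve-signup-form
    git push origin master
    # master is deployed to production right after the merge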

GitHub flow seems like a good and lightweight workflow for teams that want to set up a continuous deployment process for their project. Such a workflow is, of course, not viable for any project that has a strong notion of a release (with strict version numbers), at least not without some modifications. It is important to know that the main assumption of an always deployable master branch cannot be ensured without proper automated testing and building procedures. This is what continuous integration systems take care of, and we will discuss it a bit later. The following is a diagram presenting an example of GitHub flow in action:

Figure 6: Visual presentation of GitHub flow in action

Note that both Git flow and GitHub flow are only branching strategies, so despite having Git in their names, they are not limited to that single DVCS solution. It's true that the official article describing Git flow mentions specific git command parameters that should be used when performing a merge, but the general idea can easily be applied to almost any other distributed version control system. In fact, due to the way it suggests handling merges, Mercurial seems like a better tool for this specific branching strategy! The same applies to GitHub flow. It is only a branching strategy sprinkled with a bit of specific development culture, so it can be used with any version control system that allows you to easily create and merge branches of code.

As a last comment, remember that no methodology is carved in stone and no one forces you to use one. Methodologies are created to solve existing problems and keep you from making common mistakes. You can take all of their rules or modify some of them to your own needs. They are great tools for beginners who may easily fall into common pitfalls. If you are not familiar with any version control system, you should start with a lightweight methodology like GitHub flow without any custom modifications. You should start thinking about more complex workflows only when you have gained enough experience with Git, or any other tool of your choice. Anyway, as you gain more and more proficiency, you will eventually realize that there is no perfect workflow that suits every project. What works well in one organization does not have to work well in others.
