Handling concurrent merges to master in multi-stage Azure DevOps pipelines

In this post I describe a problem we were seeing in our large Azure DevOps pipeline, where a merge to master would cause PR builds to break mid-run. I describe the problem, why it occurs, and the solution we took to fix it.

This is definitely a niche issue, but it was causing us headaches more often than we liked. The workaround I describe in this post, while not pretty, solved all our issues!

Background: Multi-stage Azure DevOps pipelines

At work on the Datadog .NET APM Tracer, we have a big Azure DevOps pipeline. I tried to fit the whole thing into the overview below, but it's just too big!

Screenshot of our Azure DevOps pipeline

The reason it's so large is that we have to test on a lot of platforms, and support a lot of frameworks. So as a vague overview we have to:

Build the tracer on all the supported platforms—Windows (x64 and x86) and Linux (x64, musl x64, arm64).
Run unit tests for each platform, against all the supported frameworks (.net461, netcoreapp2.1, netcoreapp3.0, etc).
Run unit tests for each platform, against all the supported frameworks (.net461, netcoreapp2.1, netcoreapp3.0, etc).
Build installers for Windows (MSI, NuGet, dd-trace) and Linux (deb, rpm, tar etc)
Run smoke tests for all the installers against a range of supported OSes
Run benchmarks
Run throughput tests
etc

Each of those steps is a different "Stage" in Azure DevOps, and each platform/framework permutation within a stage is a separate "Job". We have dependencies between different stages (e.g. the testing stages depend on the build stages, the smoke tests depend on the installer stages etc), so you end up with a directed acyclic graph, as shown in the previous diagram.

Each Job in Azure Devops runs on a separate host, so we can parallelize the majority of work that happens in a stage. The downside to that is that each job does a fresh clone of the repository, and has to re-download all the assets and artifacts it needs from previous jobs. But short of running everything in a single stage (completely impractical), we don't really have a lot of choice other than using a multi-stage setup.

Background: GitHub pull requests and phantom merge commits

Our Azure DevOps pipeline builds every pull request (PR) that gets created in GitHub. We have some GitHub Actions too, but for Reasons™, most of the build happens in Azure DevOps.

When the PR is created, GitHub sends a webhook to Azure via the DevOps/GitHub integration, containing details about the pull request, such as the pull request number, the branch, and the commit.

This is where things get…tricky.

When you create a PR, GitHub creates a "merge" branch, separate to the branch you created the PR with, called something like refs/pull/:prNumber/merge. It then does a git merge between this branch and your master/main branch, and it's that commit which is sent to Azure DevOps

For example, imagine you have a branch called test. You push it to GitHub, and create a PR, let's say PR number 3. GitHub then creates a pull/3/merge commit, and does a git merge. It's the commit with SHA 582984347 in the example below that gets tested in your CI, not the test branch:

The effective behaviour of GitHub with PRs

The "merge branch" behaviour is described in the GitHub documentation, but it's not very intuitive, and if you aren't aware of it, it will likely bite you at some point, but in most cases, it does make sense.

Essentially, you don't want to test the branch, you want to test the impact of the branch after it's merged to the main branch. Testing the merge commit does this (and it has the same effect whether you use merge, rebase, or squash).

The problem: a merge to master while a run is happening

Where things get tricky is if someone merges to master after you've created your PR. When this happens, GitHub refreshes all of the merge branches. If the master/main branch changes, then the git merge for all the PR branches must change too.

This normally doesn't have an impact on any existing PRs. If there is a running build for a PR, it will continue to use the previous merge commit, but if you push to the branch again, or restart the build, you will get the new merge commit. No problem.

The problems arise if you have a big multi-stage build, like we do…

The problem is that for each stage in our pipeline, we do a checkout using Azure DevOps' built-in checkout action. If someone merges to master while a build is on-going. different stages of the build will use different merge commits.

This would lead to scenarios like the following:

A PR is created.
Azure DevOps checks out the merge commit, and starts the build stages.
Someone merges to master that adds a new feature, and adds additional tests for the feature.
The build stages of the PR complete.
The test stages start, and check out the new merge commit
The tests for the new feature fail because the binary built in the build stages was built using a commit that didn't include this feature.

Fundamentally the problem is that Azure DevOps doesn't use a consistent commit across build stages. This is so painful to deal with when you have a long CI process like we do 😩

In the next section you'll see how we decided to hack around it.

The workaround : use a consistent commit ID

The solution we came up with involved three pieces:

Don't use Azure DevOps's built-in checkout command, use a custom one instead
Add an additional stage at the start of the pipeline, on which all others depend, that records the commit ID of the master branch at the start of the pipeline.
In stages that need to checkout, do a manual git merge between the PR branch being tested and the commit ID recorded. This ensures a consistent code base is tested across all stages.

If you think this sounds convoluted, yeah, you're right. Don't use this is if you don't need it!😃 But we haven't had any issues related to concurrent merges to master since.

First of all, we have the extra stage, which grabs the commit ID of the master branch

- stage: master_commit_id
  dependsOn: []
  jobs:
  - job: fetch
    pool:
      vmImage: ubuntu-18.04

    steps:
    - checkout: none
    - bash: |
        git clone --quiet --no-checkout --depth 1 --branch master $BUILD_REPOSITORY_URI ./s
        cd s
        MASTER_SHA=$(git rev-parse origin/master)
        echo "Using master commit ID $MASTER_SHA"
        echo "##vso[task.setvariable variable=master;isOutput=true]$MASTER_SHA"
      failOnStderr: true
      displayName: Fetch master id
      name: set_sha

This stage consists of a single job that uses the Azure DevOps build variable BUILD_REPOSITORY_URI and does a shallow clone of the git repository. It then uses git rev-parse origin/master to extract the current commit ID of the master branch, and writes that to an output variable. Subsequent stages can access this variable, as you'll see shortly.

Next, we need to create our custom checkout implementation. As we reuse this in every stage, I created a template called clone-repo.yml, and added the following. There's a lot to this file so I'll give an overview subsequently:

parameters:
  - name: masterCommitId
    type: string

steps:
- ${{ if endsWith(variables['build.sourceBranch'], '/merge') }}:
  - checkout: none
  - bash: |
      # As this is a pull request, we need to do a fake merge
      # uses similar process to existing checkout task
      prBranch=$SYSTEM_PULLREQUEST_SOURCEBRANCH
      echo "Checking out merge commit for ${{ parameters.masterCommitId }} and $prBranch"
      git version
      git lfs version
      echo "Updating git config ..."
      git config init.defaultBranch master
      echo "Initializing repository at $BUILD_REPOSITORY_LOCALPATH ..."
      git init "$BUILD_REPOSITORY_LOCALPATH"
      echo "Adding remote $BUILD_REPOSITORY_URI ..."
      git remote add origin "$BUILD_REPOSITORY_URI"
      git config gc.auto 0
      git config --get-all http.$BUILD_REPOSITORY_URI.extraheader
      git config --get-all http.extraheader
      git config --get-regexp .*extraheader
      git config --get-all http.proxy
      git config http.version HTTP/1.1
      echo "Force fetching master and $prBranch ..."
      git fetch --force --tags --prune --prune-tags --progress --no-recurse-submodules origin +refs/heads/master:refs/remotes/origin/master +refs/heads/$prBranch:refs/remotes/origin/$prBranch
      echo "Checking out $prBranch..."
      git checkout --force $prBranch
      echo "Resetting $prBranch to origin/$prBranch ..."
      git reset origin/$prBranch --hard
      echo "Running git clean -ffdx ..."
      git clean -ffdx
    displayName: checkout
  - bash: |
      echo "Updating credentials ..."
      git config user.email "[email protected]"
      git config user.name "Automatic Merge"

      echo "Merging $prBranch with ${{ parameters.masterCommitId }} ..."
      git merge ${{ parameters.masterCommitId }}
      
      mergeInProgress=$?
      if [ $mergeInProgress -ne 0 ]
      then
        echo "Merge failed, rolling back ..."
        git merge --abort
        exit 1;
      fi

      echo "Merge successful"
      git status
    displayName: merge
- ${{ else }}:
  - checkout: self
    clean: true

This template takes a masterCommitId as a parameter, which the calling job provides. That keeps this template relatively "standalone".

The first thing the template does is check if we're building a git branch that ends in /merge. If we are, we assume we're building in a PR branch, and we do the custom checkout. If not, then we fallback to a "standard" checkout to avoid the complexity of handling multiple scenarios. It's only the merge-branch scenario we handle here.

Checking for the /merge suffix is a bit crude, but it's worked for us in practice. You could easily take a different route based on other parameters if you prefer

If we're testing a PR, then we explicitly set checkout:none to skip Azure DevOps's implicit checkout. We then configure the git repo with various settings (I mostly just copied what the default checkout:self does). Next we do a git fetch to pull in the master branch and the original PR branch (i.e. not the merge branch). We then checkout the PR branch and clean/reset (to make sure our self-hosted runners are reset correctly).

In the next step we do the actual merge. Merging requires git credentials, so we add some dummy ones (we're not going to actually commit anything here), and then run git merge using the masterCommitId. Finally, we check that the merge was actually successful, and if so, we're done!

To use the task in a build stage, you should do three things:

Add a dependency on master_commit_id for the stage
Grab the output ID from the master_commit_id stage
Call the template by passing the relative path to the definition.

For example, for our build stage, we have this:

- stage: build_windows_tracer
  dependsOn: [master_commit_id]    # depend on the master_commit_id stage
  variables:
    masterCommitId: $[ stageDependencies.master_commit_id.fetch.outputs['set_sha.master']]    # grab the stage output
  jobs:

  - job: build
    pool:
      vmImage: windows-2019
    steps:
    - template: steps/clone-repo.yml    # Call the template providing the relative path
      parameters:
        masterCommitId: $(masterCommitId)    # inject the commit ID
    # ...

With these changes, we no longer had issues with our PR builds breaking because someone merged to master! If you're interested you can see the PR where I originally implemented it here, though we tweaked some things subsequently, as shown in this post.

Limitations

There are two main arguments against using the approach described in this post:

Surely this should be the default behaviour in Azure DevOps?
It's complicated and convoluted.
A PR isn't tested against the latest master commit, just whatever the branch was when the build started.

The first point is easy: yeah, it should. It feels kinda weird that it isn't, but I presume there are technical reasons that it works like this.

The second point is also easy: yes, it's complicated. If you don't need this, don't do it. I hope you don't have to use it. I considered other options to avoid it, but they all ran into other road blocks, so this was the easiest approach in the end.

The third point is more subtle, but yes, you're right. It's tested against a point in time. But that's also true without this fix. If you build a PR, the build finishes, and then someone merges to master, your PR build may pass, but it also wasn't tested against the latest master. Nothing about that changes in this PR.

You can avoid this and enforce PRs are always up to date in GitHub (see this post for details). The downside (and the reason we haven't implemented it), is it could add a lot more pressure to your CI, with every PR needing to be rebuilt whenever anyone merges to master.

All my fix does is ensure we use a consistent commit for a PR. For that, it works well.

Well, it works, anyway.

Summary

In this post I describe a problem with multi-stage Azure DevOps builds where someone committing to the master branch while a PR branch is building causes the PR to use different code for different parts of the build. I solved the issue by recording the commit ID of the master branch at the start of the build, and manually doing git checkout and git merge at the start of each stage. This ensured a consistent code checkout in every stage.

Andrew Lock | .NET Escapades Andrew Lock