Builds randomly fail


#1

I’ve been noticing some of my builds randomly failing. It’s unclear how long this has been happening as things get automatically retried and eventually work. Here’s the end of the build log:

   deployment-service: Installing
mkdir: created directory '/hab/pkgs/chef/deployment-service'
mkdir: created directory '/hab/pkgs/chef/deployment-service/0.1.0'
mkdir: created directory '/hab/pkgs/chef/deployment-service/0.1.0/20180703142035'
mkdir: created directory '/hab/pkgs/chef/deployment-service/0.1.0/20180703142035/bin'
   deployment-service: Installing generated binaries
   deployment-service: Writing configuration
   deployment-service: Writing service management scripts
   deployment-service: Using run hook /src/components/automate-deployment/habitat/hooks/run
   deployment-service: Building package metadata
   deployment-service: Generating blake2b hashes of all files in the package
   deployment-service: Generating signed metadata FILES
» Signing deployment-service_blake2bsums
☛ Signing deployment-service_blake2bsums with chef-20160614114050 to create /hab/pkgs/chef/deployment-service/0.1.0/20180703142035/FILES
★ Signed artifact /hab/pkgs/chef/deployment-service/0.1.0/20180703142035/FILES.
   deployment-service: Creating manifest
   deployment-service: Generating package artifact
/hab/pkgs/core/tar/1.29/20170513213607/bin/tar: Removing leading `/' from member names
/hab/cache/artifacts/.chef-deployment-service-0.1.0-20180703142035-x86_64-linux.tar (1/1)
  100 %       6308.3 KiB / 25.3 MiB = 0.243
/hab/pkgs/core/xz/5.2.2/20170513214327/bin/xz: /hab/cache/artifacts/.chef-deployment-service-0.1.0-20180703142035-x86_64-linux.tar: File seems to have been moved, not removing
   deployment-service: Build time: 1m22s
   deployment-service: Exiting on error
🚨 Error: The command exited with status 1

#2

Is it possible a CI pipeline is clobbering that file before xz can run to generate the hart from the tar? Or maybe a process clearing out the cache directory on an interval?


#3

Looks like it is related to our build environment. @ssd and @srenatus did some investigating:

This is a summary of what we've found so far. We haven't solved
it but hopefully this might help someone continue to dig in if
they have time.

This error is being emitted from `xz`, which is called during the
habitat build process:

https://github.com/habitat-sh/habitat/blob/master/components/plan-build/bin/hab-plan-build.sh#L2112-L2123

The error inside xz is emitted here, when xz tries to unlink the
file it was compressing:

https://git.tukaani.org/?p=xz.git;a=blob;f=src/xz/file_io.c;h=48ef8223ca8bdae23bbd7a11ac7ecd0828c67b62;hb=HEAD#l321

As one can see from the code, the error is emitted if (1)
stat/lstat fails on the file it is trying to remove, (2) the
st_dev of the file has changed or (3) the st_ino (inode) of the
file has changed.

While that could happen if some concurrent process was moving the
file, we aren't aware of any concurrent operations that would be
modifying the files in that way so we started looking into other
possibilities.

One possibility is that the inode of the file is changing because
of some quirks with overlayfs.  The overlayfs documentation says:

    While directories will report an st_dev from the overlay-filesystem,
    non-directory objects may report an st_dev from the lower filesystem or
    upper filesystem that is providing the object.  Similarly st_ino will
    only be unique when combined with st_dev, and both of these can change
    over the lifetime of a non-directory object.  Many applications and
    tools ignore these values and will not be affected.
   
    In the special case of all overlay layers on the same underlying
    filesystem, all objects will report an st_dev from the overlay
    filesystem and st_ino from the underlying filesystem.  This will
    make the overlay mount more compliant with filesystem scanners and
    overlay objects will be distinguishable from the corresponding
    objects in the original filesystem.
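
To make the three failure conditions above concrete, here is a minimal
Python sketch of that style of check. It is not xz's code: the names
`open_and_remember`, `unlink_if_unmoved` and `FileMovedError` are made up
for illustration, and the only thing it takes from xz is the idea of
comparing st_dev and st_ino before removing the source file.

```python
import os


class FileMovedError(Exception):
    """Raised when a path no longer refers to the file that was opened."""


def open_and_remember(path):
    """Open path read-only and record the (st_dev, st_ino) seen at open time."""
    fd = os.open(path, os.O_RDONLY)
    st = os.fstat(fd)
    return fd, (st.st_dev, st.st_ino)


def unlink_if_unmoved(path, identity):
    """Remove path only if it still reports the identity recorded earlier.

    Mirrors the three conditions from the summary: the stat call fails,
    st_dev differs, or st_ino differs.
    """
    message = f"{path}: File seems to have been moved, not removing"
    try:
        now = os.lstat(path)
    except OSError as exc:
        # Condition (1): the path cannot be stat'ed any more.
        raise FileMovedError(message) from exc
    if (now.st_dev, now.st_ino) != identity:
        # Conditions (2) and (3): device or inode number changed.
        raise FileMovedError(message)
    os.unlink(path)
```

If overlayfs reports a different st_dev or st_ino between the `fstat` at
open time and the `lstat` before removal, a check like this fails even
though nothing actually moved the file.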

We confirmed by running `mount` as part of the build that overlayfs is
involved at various levels of the build.

As the second paragraph mentions, a partial fix for this problem
has been added to linux here:

https://github.com/torvalds/linux/commit/b948abf53a381a0c681aadd612e2affba47f62bc

which should be available in the linux kernel we are using; however,
that fix doesn't cover the cases of the overlayfs being composed of
multiple underlying filesystems.  We were not yet able to get insight
into whether that might be the case or not.

Others have experienced similar bugs related to this, as can be seen
here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1728489
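
For anyone who wants to pick up the open question of whether the overlay
layers live on more than one filesystem, a rough Python sketch follows.
It assumes the usual `/proc/mounts` layout (device, mount point, type,
options) and the `lowerdir=` / `upperdir=` overlay mount options; the
function names are hypothetical. Inside a container the layer paths are
often host paths that may not be visible, so this probably has to run on
the build host.

```python
import os


def overlay_layers(mounts_text):
    """Yield (mount_point, layer_dirs) for each overlay entry in mounts_text.

    mounts_text is expected to be the contents of /proc/mounts.
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] != "overlay":
            continue
        layers = []
        for option in fields[3].split(","):
            key, _, value = option.partition("=")
            if key == "lowerdir":
                # Multiple lower layers are separated by colons.
                layers.extend(value.split(":"))
            elif key == "upperdir":
                layers.append(value)
        yield fields[1], layers


def layer_devices(layers):
    """Return the set of st_dev values for the layer paths that can be stat'ed."""
    devices = set()
    for path in layers:
        try:
            devices.add(os.stat(path).st_dev)
        except OSError:
            # Layer path is not visible from this mount namespace; skip it.
            continue
    return devices


if __name__ == "__main__":
    with open("/proc/mounts") as handle:
        for mount_point, layers in overlay_layers(handle.read()):
            devices = layer_devices(layers)
            print(mount_point, "layers:", len(layers), "distinct st_dev:", len(devices))
```

More than one distinct st_dev for a mount would suggest the layers span
multiple filesystems, which is the case the kernel fix above does not cover.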

#4

Are these builds being run in containers?


#5

@elliott-davis In the case we investigated, yes. The TL;DR of that summary is that xz tries to detect whether a file has moved by comparing inodes, but overlayfs doesn’t guarantee that stat will report the same inode, so the error case is occasionally triggered. Exactly how to fix it depends on the kernel version you are on and the container setup (unless we patch xz to ignore that check or something).