How to debug supervisor which does not update a certain package to its latest version


#1

Any suggestions on how to debug or collect information about a supervisor which does not update a package to its latest version?

Habitat is set up via the latest official Chef Habitat cookbook. The supervisor manages two services: one that updates without any issues and one that stays at the currently installed version. An identical setup runs on another node and updates just fine; that node is, however, a little younger than the one in question.

I found that there is one way to update to the latest version manually:

hab svc unload myorg/mypkg
hab pkg install --auth mytoken myorg/mypkg (no explicit version, but it installs the latest)
hab svc load myorg/mypkg

Invoking unload and load without install does not update to the latest version.
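
The three manual steps above could be wrapped in a small shell function. This is only a sketch: the function name `force_update` is hypothetical, and it assumes the `hab` CLI is on the PATH and accepts the token via `--auth` exactly as in the commands above.

```shell
# Hypothetical helper: reinstall the latest release of a package and reload it.
# Usage: force_update <origin/name> <auth-token>
force_update() {
  pkg="$1"
  token="$2"
  if [ -z "$pkg" ] || [ -z "$token" ]; then
    echo "usage: force_update <origin/name> <auth-token>" >&2
    return 2
  fi
  # Unload the service so the supervisor stops managing the old release.
  hab svc unload "$pkg" || return 1
  # No version is given, so the latest available release is installed.
  hab pkg install --auth "$token" "$pkg" || return 1
  # Load the service again; it now starts from the freshly installed release.
  hab svc load "$pkg"
}
```

Calling `force_update myorg/mypkg mytoken` runs the same sequence as the manual steps. It only works around the symptom and does not explain why the supervisor skips the update on its own.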


#2

Strange, there are a couple of things that could cause this behavior, usually related to envvars or the flags passed when the service was started. I don’t know much about the habitat cookbook, but I think the first thing you’re going to want to do is get the supervisor logs. With the cookbook I think the sup is (usually) started as a systemd service, so you’re going to want to use journalctl to see what’s going on. If you could grab those logs and dump them here, that’s probably the easiest way to deduce what’s going on.

You could also theoretically use ps aux | grep hab to see if one of those services wasn’t started with an upgrade strategy.


#3

Thank you for the response. I had already checked the system logs before and could not find anything related to this issue. Today I compared the running processes on both systems and watched the logs during a deployment. I found the following:

  • ps aux | grep hab is identical on both nodes. I don’t see any hint regarding the update strategy. The java process is the one that won’t update:
root      1651  0.2  0.4 218112 26284 ?        Sl   Oct01  31:02 /hab/pkgs/core/hab-sup/0.63.0/20180914030447/bin/hab-sup run
root     19843  0.0  0.0  37716  6088 ?        S    18:36   0:00 nginx: master process /hab/pkgs/core/nginx/1.15.5/20181010011756/bin/nginx -c /hab/svc/www-core/config/nginx.conf
hab      19850  0.0  0.0  38028  4992 ?        S    18:36   0:00 nginx: worker process
hab      19851  0.0  0.0  38028  4528 ?        S    18:36   0:00 nginx: worker process
hab      20218  1.1  4.7 4091364 288564 ?      Sl   18:56   0:45 java -Dvertx.disableFileCPResolving=true -Dvertx.logger-delegate-factory-class-name=io.vertx.core.logging.Log4j2LogDelegateFactory -jar src-0.3.13-fat.jar run groovy:com.superevilcompany.core.Server -conf /hab/svc/superevilcompany-core/config/conf.json
  • journalctl only shows messages regarding deployments of the other package that works fine. There is nothing listed regarding the package that won’t update

So far I think I need a more sophisticated strategy to find out what really is going on.

Just double checked. It is the same habitat version that is running on both systems. (hab 0.63.0/20180914025124)

This is the journalctl output for hab svc load after I unloaded the service (no explicit install). Yet again, I do not see anything of interest here…

Oct 11 17:05:03 stack-01 hab[1108]: hab-sup(AG): The myorg/mypkg-core service was successfully loaded
Oct 11 17:05:06 stack-01 hab[1108]: hab-sup(MR): Starting myorg/mypkg-core
Oct 11 17:05:06 stack-01 hab[1108]: service-core.default(UCW): Watching user.toml
Oct 11 17:05:06 stack-01 hab[1108]: www-core.default(HK): Hooks compiled
Oct 11 17:05:06 stack-01 hab[1108]: service-core.default(HK): Hooks compiled
Oct 11 17:05:06 stack-01 hab[1108]: service-core.default(SR): Initializing
Oct 11 17:05:06 stack-01 hab[1108]: service-core.default(SV): Starting service as user=hab, group=hab
Oct 11 17:05:07 stack-01 hab[1108]: www-core.default(HK): Hooks compiled
Oct 11 17:05:07 stack-01 hab[1108]: service-core.default(HK): Hooks compiled

Invoking install after unload installs the new version, which then runs fine if I invoke load again without a version number. It looks somewhat like trying to load a private package when no hab token is present; however, if that were the case, the other package would not work either… So, yeah… pretty strange…


#4

Ah, I definitely left something out of my previous response: you’re gonna want to use the RUST_LOG=debug envvar to get more useful output in the logs. That should give us a better idea of what’s occurring for each service.
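
Debug output is verbose, so a small filter can help to find the relevant lines. The sketch below is an assumption-laden example: the function name `spec_strategies` is hypothetical, and the pattern assumes the line format "Writing service spec to '<path>': ServiceSpec { ... update_strategy: <value>, ... }" that the 0.63.0 supervisor prints in the debug line quoted later in this thread. How RUST_LOG=debug gets into the supervisor's environment depends on the init system and is not covered here.

```shell
# Hypothetical filter: read supervisor debug output on stdin and print
# "<spec file> <update strategy>" for every "Writing service spec" line.
# Lines that do not match the assumed format are dropped.
spec_strategies() {
  sed -n "s/.*Writing service spec to '\([^']*\)'.*update_strategy: \([A-Za-z]*\).*/\1 \2/p"
}

# Example, assuming the supervisor logs to the journal:
#   journalctl --no-pager | spec_strategies
```

A value of `None` in the second column would mean the service was loaded without an update strategy.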


#5

yeah, that makes sense :smiley:
I manually modified the upstart script to include the env var, and after a reboot it clearly shows that the service is not started with an update strategy:

DEBUG 2018-10-15T09:35:15Z: habitat_sup::manager::service::spec: Writing service spec to '/hab/sup/default/specs/mypkg-core.spec': ServiceSpec { ident: PackageIdent { origin: "mypkg", name: "mypkg-core", version: None, release: None }, group: "default", application_environment: None, bldr_url: "https://bldr.habitat.sh", channel: "stable", topology: Standalone, update_strategy: None, binds: [], binding_mode: Strict, config_from: None, desired_state: Up, svc_encrypted_password: None, composite: None }

However, the recipe in Chef Manage shows that the service is configured to use the ‘at-once’ strategy:

hab_service 'myorg/mypkg-core' do
  action :load
  strategy 'at-once'
  channel node.habitat['channel']
end

So either this is an issue with the chef habitat cookbook, or with habitat itself. However, since this node has gone through a few upgrades, this might also have happened somewhere along the way. If there is any additional information I could gather that would be helpful, please do let me know. Otherwise I’ll just throw away that node, start with a clean state, and see if I run into this again.

Still, Chef should probably be able to detect that a service is running with the wrong strategy and correct it. After looking through the cookbook, I don’t think there is any implementation that checks whether a service is configured correctly. I’ll open an issue over at the repo. Maybe we can find a way to improve this.
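
Until something like that exists in the cookbook, a check along the following lines could flag the mismatch. This is a sketch under assumptions: the function name `check_strategy` is hypothetical, and it assumes that the spec files under /hab/sup/default/specs (the path from the debug line above) are TOML files containing a line such as `update_strategy = "at-once"`. That on-disk format should be verified on the node before relying on it.

```shell
# Hypothetical drift check: compare the update strategy stored in a service
# spec file with the one the Chef recipe asks for.
# Usage: check_strategy <spec-file> <expected-strategy>
check_strategy() {
  spec="$1"
  expected="$2"
  [ -r "$spec" ] || { echo "cannot read $spec" >&2; return 2; }
  # Assumed line format in the spec file: update_strategy = "at-once"
  actual=$(sed -n 's/^update_strategy *= *"\([^"]*\)".*/\1/p' "$spec")
  if [ "$actual" = "$expected" ]; then
    echo "ok: $spec uses '$actual'"
  else
    echo "drift: $spec uses '${actual:-unset}', expected '$expected'"
    return 1
  fi
}
```

For the service in question this would be called as `check_strategy /hab/sup/default/specs/mypkg-core.spec at-once`. If it reports drift, unloading the service and loading it again with `hab svc load myorg/mypkg-core --strategy at-once` should write a spec with the intended strategy, though that has not been tried in this thread.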


#6

That’s some damn good detective work! I haven’t heard anyone else reporting a bug like this so I bet you’re right, probably an issue with the cookbook. Thank you so much for following that through to the end!