Rolling strategy not working


#1

Hello all, I’m pretty new to Habitat and I’m trying to test the rolling update strategy, but I have some issues and it doesn’t work; the at-once strategy works fine.
I’m loading the service with the command hab svc load --strategy rolling --group wat mfol/wat, but the service does not update when I build a new package and promote it. I even tried restarting the hab service with systemctl restart hab. The system is CentOS 7.5. I only see the errors below in the logs:

Sep 20 12:36:45 ip-10-247-2-17 hab: ERROR 2018-09-20T12:36:45Z: habitat_butterfly::server::inbound: PingReq request PingReq { membership: [Membership { member: Member { id: "5bc0362b2486431d86d4f26d747d9b0c", incarnation: 9, address: "10.247.1.134", swim_port: 9638, gossip_port: 9638, persistent: true, departed: false }, health: Confirmed }, Membership { member: Member { id: "e100da3f94cf4e2daa3508c67676b347", incarnation: 1, address: "10.247.6.47", swim_port: 9638, gossip_port: 9638, persistent: false, departed: true }, health: Departed }, Membership { member: Member { id: "de24f603244f45abafc1783c3ae63a21", incarnation: 0, address: "10.247.1.241", swim_port: 9638, gossip_port: 9638, persistent: false, departed: false }, health: Departed }, Membership { member: Member { id: "c10b17f7563543e3bde11c69de36675f", incarnation: 0, address: "10.247.1.160", swim_port: 9638, gossip_port: 9638, persistent: false, departed: false }, health: Departed }, Membership { member: Member { id: "a17d57d3ed8c415e943059e3897188c0", incarnation: 0, address: "10.247.1.121", swim_port: 9638, gossip_port: 9638, persistent: false, departed: false }, health: Alive }, Membership { member: Member { id: "1d1f0b5d0ebd40feb5a8e0c6630e2f4f", incarnation: 7, address: "10.247.1.214", swim_port: 9638, gossip_port: 9638, persistent: false, departed: false }, health: Departed }], from: Member { id: "7a82e915e3334a64bfa13590abc530b5", incarnation: 0, address: "10.247.1.113", swim_port: 9638, gossip_port: 9638, persistent: false, departed: false }, target: Member { id: "5bc0362b2486431d86d4f26d747d9b0c", incarnation: 9, address: "10.247.1.134", swim_port: 9638, gossip_port: 9638, persistent: true, departed: false } } for invalid target
Sep 20 12:36:45 ip-10-247-2-17 hab: ERROR 2018-09-20T12:36:45Z: habitat_butterfly::server::inbound: PingReq request PingReq { membership: [Membership { member: Member { id: "38b2c00dfba84fae85fe13cde55bac04", incarnation: 1, address: "10.247.3.98", swim_port: 9638, gossip_port: 9638, persistent: true, departed: true }, health: Departed }], from: Member { id: "480d2f0a7ecd47aaa0ba764656960aae", incarnation: 0, address: "10.247.1.220", swim_port: 9638, gossip_port: 9638, persistent: false, departed: false }, target: Member { id: "38b2c00dfba84fae85fe13cde55bac04", incarnation: 1, address: "10.247.3.98", swim_port: 9638, gossip_port: 9638, persistent: true, departed: true } } for invalid target
Sep 20 12:36:47 ip-10-247-2-17 hab: ERROR 2018-09-20T12:36:47Z: habitat_butterfly::server::inbound: PingReq request PingReq { membership: [Membership { member: Member { id: "1b344197e5734f4c9b53a098adab1b5e", incarnation: 64, address: "10.247.3.156", swim_port: 9638, gossip_port: 9638, persistent: true, departed: false }, health: Departed }], from: Member { id: "128500670b3a4d82a9492c2eb70de3f5", incarnation: 0, address: "10.247.1.28", swim_port: 9638, gossip_port: 9638, persistent: false, departed: false }, target: Member { id: "1b344197e5734f4c9b53a098adab1b5e", incarnation: 64, address: "10.247.3.156", swim_port: 9638, gossip_port: 9638, persistent: true, departed: false } } for invalid target
Sep 20 12:36:49 ip-10-247-2-17 hab: ERROR 2018-09-20T12:36:49Z: habitat_butterfly::server::inbound: PingReq request PingReq { membership: [Membership { member: Member { id: "5bc0362b2486431d86d4f26d747d9b0c", incarnation: 9, address: "10.247.1.134", swim_port: 9638, gossip_port: 9638, persistent: true, departed: false }, health: Confirmed }, Membership { member: Member { id: "21c13f8cad3849d69a55bb539759596b", incarnation: 0, address: "10.247.1.100", swim_port: 9638, gossip_port: 9638, persistent: true, departed: false }, health: Alive }, Membership { member: Member { id: "1d1f0b5d0ebd40feb5a8e0c6630e2f4f", incarnation: 7, address: "10.247.1.214", swim_port: 9638, gossip_port: 9638, persistent: false, departed: false }, health: Departed }, Membership { member: Member { id: "6e8ae50e5fc3489f921cb7781565da43", incarnation: 0, address: "10.247.3.150", swim_port: 9638, gossip_port: 9638, persistent: true, departed: false }, health: Alive }, Membership { member: Member { id: "513a7fee20d44ce68ae8d2663d8ffa7a", incarnation: 16, address: "10.247.1.110", swim_port: 9638, gossip_port: 9638, persistent: false, departed: false }, health: Departed }, Membership { member: Member { id: "7a82e915e3334a64bfa13590abc530b5", incarnation: 0, address: "10.247.1.113", swim_port: 9638, gossip_port: 9638, persistent: false, departed: false }, health: Alive }], from: Member { id: "553d0c6230e6495cafe4b54e01ccb409", incarnation: 0, address: "10.247.1.249", swim_port: 9638, gossip_port: 9638, persistent: false, departed: false }, target: Member { id: "5bc0362b2486431d86d4f26d747d9b0c", incarnation: 9, address: "10.247.1.134", swim_port: 9638, gossip_port: 9638, persistent: true, departed: false } } for invalid target
Sep 20 12:36:49 ip-10-247-2-17 hab: ERROR 2018-09-20T12:36:49Z: habitat_butterfly::server::inbound: PingReq request PingReq { membership: [Membership { member: Member { id: "8682fba4031348e6a1b528a95914cd6e", incarnation: 1, address: "10.247.3.36", swim_port: 9638, gossip_port: 9638, persistent: true, departed: true }, health: Departed }], from: Member { id: "fdb8a10b90834532828e370a059c7ab6", incarnation: 0, address: "10.247.1.221", swim_port: 9638, gossip_port: 9638, persistent: false, departed: false }, target: Member { id: "8682fba4031348e6a1b528a95914cd6e", incarnation: 1, address: "10.247.3.36", swim_port: 9638, gossip_port: 9638, persistent: true, departed: true } } for invalid target
hab sup status 
mfol/wat/36.5.0/20180920091846          standalone  up       up     1612         9170  wat.wat

and

cat /hab/sup/default/specs/wat.spec
ident = "mfol/wat"
group = "wat"
bldr_url = "https://bldr.habitat.something"
channel = "stable"
topology = "standalone"
update_strategy = "rolling"
binds = []
binding_mode = "strict"
desired_state = "up"

Can you help me with this error? What does it mean, and how can I investigate it?


#2

Can you tell me more about the setup of your network? In particular, what services are the Supervisors at the following hosts running, particularly with respect to your wat.wat service group?

10.247.1.134
10.247.3.98
10.247.3.156
10.247.3.36

And have you recently restarted these Supervisors with entirely new identities (e.g., wiped out their /hab/sup/default/ directories and restarted them)?
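For reference, the addresses listed above are the target member of each rejected PingReq in the posted log. A minimal sketch of pulling them out automatically, assuming the log line layout shown in the first post (the function name and the sample list are made up for illustration, and the sample lines are abridged tails of the real ones):

```python
import re

# Matches the trailing "target: Member { ... }" part of a rejected PingReq line.
# Membership entries use "member: Member" and the sender uses "from: Member",
# so only the target is captured.
TARGET_RE = re.compile(
    r'target: Member \{ id: "([0-9a-f]+)", incarnation: \d+, address: "([^"]+)"'
)

def invalid_ping_targets(lines):
    """Return unique (address, member id) pairs of rejected targets, first seen first."""
    found = {}
    for line in lines:
        if "for invalid target" not in line:
            continue
        match = TARGET_RE.search(line)
        if match:
            member_id, address = match.groups()
            found[(address, member_id)] = None  # dict keeps insertion order
    return list(found)

# Abridged tails of log lines from the first post (membership and sender elided).
SAMPLE_LINES = [
    'PingReq request PingReq { ... target: Member { id: "5bc0362b2486431d86d4f26d747d9b0c", '
    'incarnation: 9, address: "10.247.1.134", swim_port: 9638, gossip_port: 9638, '
    'persistent: true, departed: false } } for invalid target',
    'PingReq request PingReq { ... target: Member { id: "38b2c00dfba84fae85fe13cde55bac04", '
    'incarnation: 1, address: "10.247.3.98", swim_port: 9638, gossip_port: 9638, '
    'persistent: true, departed: true } } for invalid target',
    # Same target as the first entry, reported again by a different sender.
    'PingReq request PingReq { ... target: Member { id: "5bc0362b2486431d86d4f26d747d9b0c", '
    'incarnation: 9, address: "10.247.1.134", swim_port: 9638, gossip_port: 9638, '
    'persistent: true, departed: false } } for invalid target',
]

for address, member_id in invalid_ping_targets(SAMPLE_LINES):
    print(address, member_id)
```

Feeding it the full Supervisor log instead of the sample should give one line per distinct target, which makes it easier to see which hosts the rest of the ring still has stale records for.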


#3

I have three Supervisors which form a bastion ring:

10.247.1.100
10.247.2.149
10.247.3.150

At the moment I also have 5 CentOS 7.5 servers running the mfol/wat service, each connected to the bastion ring with the command

/bin/hab sup run --no-color --peer bastion.kris.something

The bastion.kris.something domain points to all three Supervisors.

I have restarted the Supervisors and removed the /hab/sup/default/ content, and the error is gone. Thanks! :slight_smile:

But the rolling strategy still doesn’t work, and at the moment I have three services running

mfol/wat/36.5.0/20180920091846

and two which were started later, after an update was promoted to stable, running

mfol/wat/36.5.0/20180920112143

Is there a way I can debug the update strategy?
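To make the mixed state above easier to track while testing, here is a rough sketch that counts how many group members run each release. It assumes the fully qualified package identifiers have been collected by hand from hab sup status on each server; the helper names and the RUNNING list are illustrative only. Since the release part of an identifier is a build timestamp, the newest release within the same version is simply the largest string.

```python
from collections import Counter

def release_of(ident):
    """Return the release (build timestamp) from an origin/name/version/release identifier."""
    origin, name, version, release = ident.split("/")
    return release

def releases_in_group(idents):
    """Count how many members run each release."""
    return Counter(release_of(ident) for ident in idents)

# Identifiers reported by the five servers, as described in this post.
RUNNING = (
    ["mfol/wat/36.5.0/20180920091846"] * 3
    + ["mfol/wat/36.5.0/20180920112143"] * 2
)

counts = releases_in_group(RUNNING)
newest = max(counts)  # valid only while every member is on the same version
behind = sum(n for release, n in counts.items() if release != newest)

print("newest release:", newest)
print("members still behind:", behind)
```

Re-running this after each promotion would show whether the number of members behind ever goes down, which is what I would expect from a working rolling update.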


#4

Sorry, I didn’t necessarily mean for you to wipe out the /hab/sup/default directories… I was just trying to determine what had happened with those particular servers, based on the error messages you posted initially. Deleting the /hab/sup/default directories will certainly work, but that’s a big hammer :smile:

What were the exact hab svc load commands you used on each of these Supervisors to load up your wat service?

If you run the hab supportbundle command on each of the machines, it will create a tarball of the relevant Supervisor state. You can DM them to me in Slack and I can take a look at them (don’t post them here, as they may include information you don’t want to share with the rest of the world).


#5

Just an update here. I’ve been chatting with @Kris and working through some potential causes.

We’ve ruled out the following:

  • more-recent packages being installed locally
  • no access to credentials to get private packages

All Supervisors appear to be starting up properly, and all services appear to be loaded and configured properly.

Running the Supervisor with

RUST_LOG=debug,rustc_metadata=error,cargo=error,jobserver=error,rustc_trans=error,rustc_driver=error,rustc_mir=error,rustc=error,tokio_core::reactor=info

which enables debug logging, but filters out a lot of low-level stuff we don’t care about, yields output like this:

Sep 21 09:14:27 ip-10-247-1-246 hab: DEBUG 2018-09-21T09:14:27Z: habitat_sup::manager::service_updater: We're in an update but it's not our turn
Sep 21 09:14:28 ip-10-247-1-246 hab: DEBUG 2018-09-21T09:14:28Z: habitat_sup::manager::service_updater: We're in an update but it's not our turn
Sep 21 09:14:28 ip-10-247-1-246 hab: WARN 2018-09-21T09:14:28Z: habitat_butterfly::server::outbound: Timed out waiting for Ack from 4f29c9b7cabb4ab791e59d474061521c@10.247.2.149:9638
Sep 21 09:14:29 ip-10-247-1-246 hab: DEBUG 2018-09-21T09:14:29Z: habitat_sup::manager::service_updater: We're in an update but it's not our turn
Sep 21 09:14:29 ip-10-247-1-246 hab: DEBUG 2018-09-21T09:14:29Z: habitat_butterfly::server: Successfully persisted rumors to disk: /hab/sup/default/data/03151e3cac59459fb8ff40b89609d342.rst
Sep 21 09:14:30 ip-10-247-1-246 hab: DEBUG 2018-09-21T09:14:30Z: habitat_sup::manager::service_updater: We're in an update but it's not our turn
Sep 21 09:14:30 ip-10-247-1-246 hab: WARN 2018-09-21T09:14:30Z: habitat_butterfly::server::outbound: Timed out waiting for Ack from 4f29c9b7cabb4ab791e59d474061521c@10.247.2.149:9638
Sep 21 09:14:30 ip-10-247-1-246 hab: WARN 2018-09-21T09:14:30Z: habitat_butterfly::server::outbound: Marking 4f29c9b7cabb4ab791e59d474061521c as Suspect
Sep 21 09:14:31 ip-10-247-1-246 hab: DEBUG 2018-09-21T09:14:31Z: habitat_sup::manager::service_updater: We're in an update but it's not our turn
Sep 21 09:14:31 ip-10-247-1-246 hab: WARN 2018-09-21T09:14:31Z: habitat_butterfly::server::outbound: Timed out waiting for Ack from 11e6b26957be42deb0074fff409d1c51@10.247.2.170:9638

So it does appear to be something legitimate in the rolling update processing. I’m not 100% sure what the cause is yet, but will do some further investigation and report back here.
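In the meantime, a small sketch for condensing this kind of debug output, in case anyone wants to compare several Supervisors. It only assumes the three message formats visible in the excerpt above; the function name is made up, and the sample lines are the excerpt with the syslog prefix stripped.

```python
import re
from collections import Counter

NOT_OUR_TURN = "We're in an update but it's not our turn"
ACK_TIMEOUT_RE = re.compile(r"Timed out waiting for Ack from ([0-9a-f]+)@([\d.]+):\d+")
SUSPECT_RE = re.compile(r"Marking ([0-9a-f]+) as Suspect")

def summarize(lines):
    """Return (waiting-for-turn count, Ack timeouts per peer address, suspect member ids)."""
    waiting = 0
    timeouts = Counter()
    suspects = set()
    for line in lines:
        if NOT_OUR_TURN in line:
            waiting += 1
            continue
        match = ACK_TIMEOUT_RE.search(line)
        if match:
            timeouts[match.group(2)] += 1
            continue
        match = SUSPECT_RE.search(line)
        if match:
            suspects.add(match.group(1))
    return waiting, timeouts, suspects

SAMPLE_LINES = [
    "DEBUG 2018-09-21T09:14:27Z: habitat_sup::manager::service_updater: We're in an update but it's not our turn",
    "DEBUG 2018-09-21T09:14:28Z: habitat_sup::manager::service_updater: We're in an update but it's not our turn",
    "WARN 2018-09-21T09:14:28Z: habitat_butterfly::server::outbound: Timed out waiting for Ack from 4f29c9b7cabb4ab791e59d474061521c@10.247.2.149:9638",
    "DEBUG 2018-09-21T09:14:29Z: habitat_sup::manager::service_updater: We're in an update but it's not our turn",
    "DEBUG 2018-09-21T09:14:29Z: habitat_butterfly::server: Successfully persisted rumors to disk: /hab/sup/default/data/03151e3cac59459fb8ff40b89609d342.rst",
    "DEBUG 2018-09-21T09:14:30Z: habitat_sup::manager::service_updater: We're in an update but it's not our turn",
    "WARN 2018-09-21T09:14:30Z: habitat_butterfly::server::outbound: Timed out waiting for Ack from 4f29c9b7cabb4ab791e59d474061521c@10.247.2.149:9638",
    "WARN 2018-09-21T09:14:30Z: habitat_butterfly::server::outbound: Marking 4f29c9b7cabb4ab791e59d474061521c as Suspect",
    "DEBUG 2018-09-21T09:14:31Z: habitat_sup::manager::service_updater: We're in an update but it's not our turn",
    "WARN 2018-09-21T09:14:31Z: habitat_butterfly::server::outbound: Timed out waiting for Ack from 11e6b26957be42deb0074fff409d1c51@10.247.2.170:9638",
]

waiting, timeouts, suspects = summarize(SAMPLE_LINES)
print("not-our-turn messages:", waiting)
print("Ack timeouts by peer:", dict(timeouts))
print("suspect members:", sorted(suspects))
```

On the excerpt above this reports five "not our turn" messages and Ack timeouts from 10.247.2.149 and 10.247.2.170; the first of those is one of the three bastion Supervisors listed earlier in the thread. That is only an observation at this point, not a diagnosis.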