`core/postgresql` stuck waiting for election


#21

I think departure is not working. I wrote some additional shell functions to streamline querying the ring, and I see the following after departing the old Postgres primary's member ID and then rebuilding all three hosts that should run `core/postgresql`:

    service_group_leaders 'ssh ubuntu@staging-permanent-peer-0.domain.tld' 'postgresql.staging'
    Hostname:  ip-172-31-12-169
    Address:   172.31.12.169
    Alive:     false
    Leader:    true
    Departed:  true
    MemberID:  9df128caafd34ad3873c3e4c08596b7a
    ======

    service_group_members 'ssh ubuntu@staging-permanent-peer-0.domain.tld' 'postgresql.staging'
    Hostname:  ip-172-31-6-32
    Leader:    false
    Departed:  false
    MemberID:  572f4d4d34164be9a91d2b09f247ffb1
    ======
    Hostname:  ip-172-31-13-204
    Leader:    false
    Departed:  false
    MemberID:  68813a17f4044220b39685cf7a6c63f4
    ======
    Hostname:  ip-172-31-15-72
    Leader:    false
    Departed:  false
    MemberID:  f1b82aa4459e4cc4a16ea9bbef24af4e
    ======

#22

Hello again! Thanks for your patience. I’ve now read enough of the elections code to have a decent general understanding of how it’s supposed to work, and I see some places where it may be going wrong. However, that code also has woefully little logging, which makes it hard to tell exactly what’s happening.

The thing that jumps out at me now is that (as you said) it’s likely a problem with membership, and that makes sense with the log message you posted in the first message:

    postgresql.staging(SR): Waiting to execute hooks; election in progress, and we have no quorum.

In order for an election to occur, we look at all the nodes in the service group (that is, ones that have added Service rumors) which are not Departed (meaning their health is Alive, Suspect, or Confirmed), and a majority of those must be Alive. I think the way we handle departing nodes needs to be changed, and I also think we need a mechanism for leaving a service group. These issues would explain why your election isn’t proceeding: it’s a quorum problem.
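To make that rule concrete, here’s a small sketch in shell (my own illustration, not Habitat’s actual implementation): the electorate is every non-Departed member, and an election can start only when a strict majority of the electorate is Alive.

```shell
# Sketch of the quorum rule described above (illustrative only, not
# Habitat's actual code). The electorate is every non-Departed member
# (Alive, Suspect, or Confirmed); a strict majority must be Alive.
has_quorum() {
  alive=$1; suspect=$2; confirmed=$3
  electorate=$(( alive + suspect + confirmed ))
  # Strict majority: alive > electorate / 2, written without division
  # to avoid integer-truncation surprises.
  [ $(( alive * 2 )) -gt "$electorate" ]
}

# 2 Alive, 1 Suspect, 0 Confirmed => 2 of 3 Alive, quorum holds
has_quorum 2 1 0 && echo "quorum"
# 1 Alive, 1 Suspect, 1 Confirmed => 1 of 3 Alive, no election
has_quorum 1 1 1 || echo "no quorum"
```

Note that Suspect and Confirmed members still count toward the electorate, which is why a ring full of dead-but-not-Departed members can block elections indefinitely.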

I also looked into whether we have an issue with the suitability hook. Originally, the fact that it wasn’t returning distinct values for different nodes seemed suspicious to me. And while that still seems wrong, it shouldn’t cause an unbreakable tie, since we fall back to using the member ID. You pointed out this log message:

    postgresql.staging hook[init]:(HK): Waiting for leader to become available before initializing

If the suitability hook depends on the service itself, but the service can’t run because it’s waiting on an election, that could certainly deadlock things. However, it looks like the init hook hasn’t succeeded here, so consider this code:

    // A service that hasn't completed its init hook is not initialized,
    // so it reports no suitability at all.
    pub fn suitability(&self) -> Option<u64> {
        if !self.initialized {
            return None;
        }
        // Otherwise, run the suitability hook (if one is defined) and use
        // its output as this member's suitability.
        self.hooks.suitability.as_ref().and_then(|hook| {
            hook.run(
                &self.service_group,
                &self.pkg,
                self.svc_encrypted_password.as_ref(),
            )
        })
    }

The service shouldn’t be marked initialized, so I don’t think the suitability hook should be getting called at all (again, more logging would help confirm this). Based on the Postgres init hook code, I’d expect the hook to exit with a status of 1.

In that case, there should be a log from this line containing Initialization failed. Do you see that? If so, suitability is not the issue, and we probably just need to address the membership/quorum problems that are preventing the election from getting started.


#23

I do indeed see multiple occurrences of Initialization failed in the logs we had saved from the broken Postgres ring:

    null_resource.postgresql_services[1] (remote-exec): Oct 17 16:17:46 ip-172-31-5-152 hab[11107]: postgresql.staging(HK): Initialization failed! 'init' exited with status code 1

#24

@bixu: have you tried running hab sup depart on the known-dead member IDs? We’ll definitely work on fixing up our membership issues, but this might suffice as a workaround in the meantime.

I’ll add more logging to the elections code to make these kinds of issues easier to diagnose in the future.


#25

We did indeed write some code to handle departures. However, it didn’t have the effect we wanted. Not that it had a bad effect, but the issues we’re debugging here were present even after the departure code was added.


#26

A couple of thoughts about the Postgres plan:

Regarding the suitability hook, there’s definitely a clear bug in local_xlog_position: it should return an integer value even when the psql command fails - probably 0.

I think it absolutely should still be based on the latest xlog position; this shouldn’t be changed. The idea is that you don’t accidentally elect a new leader that has older data - that can be disastrous. If two members have the same xlog position, they are arguably equally qualified to become the leader.

One thing we could do is add some number (1?) to the suitability value if the member is the current leader, making it less likely that you’d arbitrarily switch leaders on a topology change. Thoughts?
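A rough sketch of both ideas together (hypothetical helper names, not the actual core/postgresql hook): treat the xlog position as a number, fall back to 0 when it can’t be read, and give the sitting leader a +1 bonus so equal positions don’t flip leadership.

```shell
# Hypothetical sketch of the suitability scheme discussed above, not the
# actual core/postgresql hook. Suitability is the WAL position as a
# number; a failed lookup scores 0; the sitting leader gets a +1 bonus
# so equal positions don't arbitrarily switch leaders.
lsn_to_int() {
  # Convert an LSN like "0/16B2D80" to high * 2^32 + low (both hex).
  hi=${1%%/*}; lo=${1##*/}
  echo $(( 0x$hi * 4294967296 + 0x$lo ))
}

suitability() {
  lsn=$1; is_leader=$2
  score=$(lsn_to_int "$lsn" 2>/dev/null) || score=0
  [ "$is_leader" = "true" ] && score=$(( score + 1 ))
  echo "$score"
}

suitability "0/16B2D80" false   # follower at this position
suitability "0/16B2D80" true    # leader at the same position scores 1 higher
```

A +1 bonus is the smallest possible tiebreaker: any follower that is even one byte ahead of the leader still wins, so the stickiness never overrides the “don’t elect stale data” rule.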

Regarding the init hook bombing out here when a leader isn’t ready: we may need to modify this behavior for an already-established cluster - I’m not sure. While it makes sense during initial cluster setup (keep retrying the follower setup until the leader is ready), it’s clearly impeding re-election. I’m open to ideas on what we can do here.


#27

pg_controldata could be used instead of a psql connection - it can report WAL location even if the server is down.

There should actually be checks on the system identifier AND the timeline in there somewhere as well.
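To illustrate, all three values can be scraped from pg_controldata without connecting to the server. In this sketch the heredoc stands in for running `pg_controldata "$PGDATA"`; the field labels match real pg_controldata output, but the values are made up:

```shell
# Sketch of scraping pg_controldata instead of using psql. The heredoc
# stands in for `pg_controldata "$PGDATA"`; the field labels match real
# pg_controldata output, but the values here are invented.
controldata() {
  cat <<'EOF'
Database system identifier:           6470392161038740590
Latest checkpoint's TimeLineID:       1
Latest checkpoint's REDO location:    0/16B2D80
EOF
}

system_id=$(controldata | grep 'Database system identifier' | awk '{print $NF}')
timeline=$(controldata  | grep 'TimeLineID'                 | awk '{print $NF}')
redo_lsn=$(controldata  | grep 'REDO location'              | awk '{print $NF}')

echo "systemid=$system_id timeline=$timeline wal=$redo_lsn"
```

Comparing the system identifier guards against members from a different cluster, and comparing timelines guards against electing a member that diverged after a failover.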


#28

Great idea @jamessewell , pg_controldata is way better than depending on PG to be up!

Would you be interested in pairing up on implementing these checks? It seems like you have quite a bit of expertise on this topic!


#29

I don’t think you’d ever change leaders as long as the existing leader continues to be Alive, but I’ll confirm.

That one I need to give some more thought to. I believe leader election should only require that the service is loaded (not that the init hook has completed). This may mean the suitability hook gives an error, but that can be dealt with.


#30

I really don’t agree with this approach - leader election should only use instances that are passing monitoring checks (although I know this is hard at the moment, as monitoring isn’t first class).