AWS 3-node cluster: "Election in progress, and we have no quorum"


#1

Hi Guys,

I’m trying to deploy a 3-node Elasticsearch cluster on EC2 using Terraform + Habitat. I have a security group set up to allow inbound traffic on 9200, 9631 and 9638 (both TCP and UDP), with the default Habitat gossip settings.

I’m not able to start my cluster in a leader topology. It hangs with the message “Election in progress, and we have no quorum”.

Here is my systemd file:
(The vm.max_map_count and LimitNOFILE settings are needed for ES to start.)

#!/usr/bin/env bash
sysctl -w vm.max_map_count=262144
curl https://raw.githubusercontent.com/habitat-sh/habitat/master/components/hab/install.sh | bash
useradd hab
groupadd hab
cat << EOF >> /etc/systemd/system/hab-supervisor.service
[Unit]
Description=Habitat Supervisor

[Service]
ExecStart=/bin/hab sup run  --peer ${peer}
Restart=on-failure
LimitNOFILE=65536
[Install]
WantedBy=default.target
EOF

systemctl daemon-reload
systemctl start hab-supervisor

sleep 5s

hab sup start core/elasticsearch --peer ${peer} --topology leader

Any ideas?


#2

Hmm, we might end up needing to get some logs on this one. But first, can you tell me what version of habitat you’re running?

Second, can you tell me how your ${peer} variable is getting populated? Is that something coming out of Terraform?


#3

Sure!

The peer is generated from the elastic IP resource. It is set to the public IP of the first node that gets created.

Hab version is at:
hab 0.55.0/20180321220925


#4

Rad, so to start it would probably be good to get the state of the ring.

If you can, curl ip-of-node-with-sup:9631/butterfly and ip-of-node-with-sup:9631/services (you might want to check the data and sanitize it if there’s anything you need to keep private). That should give us a starting point for what’s going on.
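
For reference, fetching both endpoints could look something like the sketch below. `SUP_HOST` and the function names are placeholders, not part of Habitat; 9631 is the default http-gateway port mentioned in this thread.

```shell
#!/usr/bin/env bash
# Placeholder address of a node running the Supervisor; override as needed.
SUP_HOST="${SUP_HOST:-127.0.0.1}"

# Build the http-gateway URL for an endpoint name such as "butterfly".
sup_url() {
  printf 'http://%s:9631/%s\n' "$SUP_HOST" "$1"
}

# Save /butterfly and /services to local JSON files for review.
dump_sup_state() {
  local endpoint
  for endpoint in butterfly services; do
    curl --silent --show-error --max-time 5 \
      --output "${endpoint}.json" "$(sup_url "$endpoint")" \
      || echo "could not reach $(sup_url "$endpoint")" >&2
  done
}
```

Set `SUP_HOST` to the node's IP, call `dump_sup_state`, and look over `butterfly.json` and `services.json` before sharing them.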


#5
{
	"member": {
		"members": {},
		"health": {},
		"update_counter": 0
	},
	"service": {
		"list": {
			"elasticsearch.default": {
				"6a4d71e36daf4f47af38dab96871b9ae": {
					"type": 2,
					"tag": [],
					"from_id": "6a4d71e36daf4f47af38dab96871b9ae",
					"service": {
						"member_id": "6a4d71e36daf4f47af38dab96871b9ae",
						"service_group": "elasticsearch.default",
						"package": "core/elasticsearch/6.2.2/20180419233650",
						"incarnation": 1,
						"cfg": {
							"http-port": 9200,
							"transport-port": 9300
						},
						"sys": {
							"ip": "172.31.26.118",
							"hostname": "ip-172-31-26-118",
							"gossip_ip": "0.0.0.0",
							"gossip_port": 9638,
							"http_gateway_ip": "0.0.0.0",
							"http_gateway_port": 9631
						},
						"initialized": false
					}
				}
			}
		},
		"update_counter": 1
	},
	"service_config": {
		"list": {},
		"update_counter": 0
	},
	"service_file": {
		"list": {},
		"update_counter": 0
	},
	"election": {
		"list": {
			"elasticsearch.default": {
				"election": {
					"type": 3,
					"tag": [],
					"from_id": "6a4d71e36daf4f47af38dab96871b9ae",
					"election": {
						"member_id": "6a4d71e36daf4f47af38dab96871b9ae",
						"service_group": "elasticsearch.default",
						"term": 0,
						"suitability": 0,
						"status": 2,
						"votes": ["6a4d71e36daf4f47af38dab96871b9ae"]
					}
				}
			}
		},
		"update_counter": 1
	},
	"election_update": {
		"list": {},
		"update_counter": 0
	},
	"departure": {
		"list": {},
		"update_counter": 0
	}
}

#6

:9631/services

[{
	"service_group": "elasticsearch.default",
	"bldr_url": "https://bldr.habitat.sh",
	"channel": "stable",
	"spec_file": "/hab/sup/default/specs/elasticsearch.spec",
	"spec_ident": {
		"origin": "core",
		"name": "elasticsearch",
		"version": null,
		"release": null
	},
	"start_style": "Transient",
	"topology": "leader",
	"update_strategy": "none",
	"cfg": {
		"action": {
			"destructive_requires_name": "true"
		},
		"bootstrap": {
			"memory_lock": "false"
		},
		"cluster": {
			"name": "elasticsearch",
			"routing": {
				"allocation": {
					"awareness-attributes": "",
					"node_concurrent_recoveries": "2",
					"node_initial_primaries_recoveries": "4",
					"same_shard-host": "false"
				}
			}
		},
		"discovery": {
			"minimum_master_nodes": 1,
			"ping_unicast_hosts": "[]",
			"zen_fd_ping_timeout": "30s"
		},
		"gateway": {
			"expected_data_nodes": "0",
			"expected_master_nodes": "0",
			"expected_nodes": "0",
			"recover_after_nodes": "",
			"recover_after_time": ""
		},
		"indices": {
			"breaker": {
				"fielddata-limit": "60%",
				"fielddata-overhead": "1.03",
				"request-limit": "40%",
				"request-overhead": "1",
				"total-limit": "70%"
			},
			"fielddata": {
				"cache-size": ""
			},
			"recovery": {
				"max_bytes_per_sec": "20mb"
			}
		},
		"logger": {
			"level": "info"
		},
		"network": {
			"host": "_site_",
			"port": 9200
		},
		"node": {
			"data": "true",
			"master": "true",
			"max_local_storage_nodes": 1,
			"name": "",
			"rack_id": "",
			"zone": ""
		},
		"path": {
			"data": "",
			"logs": "logs"
		},
		"plugins": {
			"cloud_aws_signer": ""
		},
		"runtime": {
			"es_java_opts": "",
			"es_startup_sleep_time": "",
			"heapsize": "1g",
			"max_locked_memory": "",
			"max_open_files": ""
		},
		"transport": {
			"port": 9300
		}
	},
	"pkg": {
		"ident": "core/elasticsearch/6.2.2/20180419233650",
		"origin": "core",
		"name": "elasticsearch",
		"version": "6.2.2",
		"release": "20180419233650",
		"deps": [{
			"origin": "core",
			"name": "coreutils-static",
			"version": "8.25",
			"release": "20170514151156"
		}, {
			"origin": "core",
			"name": "gcc-libs",
			"version": "5.2.0",
			"release": "20170513212920"
		}, {
			"origin": "core",
			"name": "glibc",
			"version": "2.22",
			"release": "20170513201042"
		}, {
			"origin": "core",
			"name": "jre8",
			"version": "8.172.0",
			"release": "20180419233349"
		}, {
			"origin": "core",
			"name": "libxau",
			"version": "1.0.8",
			"release": "20171013025301"
		}, {
			"origin": "core",
			"name": "libxcb",
			"version": "1.12",
			"release": "20180409205314"
		}, {
			"origin": "core",
			"name": "libxdmcp",
			"version": "1.1.2",
			"release": "20171013025332"
		}, {
			"origin": "core",
			"name": "libxext",
			"version": "1.3.3",
			"release": "20180409205946"
		}, {
			"origin": "core",
			"name": "libxi",
			"version": "1.7.9",
			"release": "20180409210147"
		}, {
			"origin": "core",
			"name": "libxrender",
			"version": "0.9.10",
			"release": "20180409205945"
		}, {
			"origin": "core",
			"name": "libxtst",
			"version": "1.2.3",
			"release": "20180409210348"
		}, {
			"origin": "core",
			"name": "linux-headers",
			"version": "4.3",
			"release": "20170513200956"
		}, {
			"origin": "core",
			"name": "xlib",
			"version": "1.6.5",
			"release": "20180409205516"
		}],
		"env": {
			"JAVA_HOME": "/hab/pkgs/core/jre8/8.172.0/20180419233349",
			"PATH": "/hab/pkgs/core/elasticsearch/6.2.2/20180419233650/es/bin:/hab/pkgs/core/jre8/8.172.0/20180419233349/bin:/hab/pkgs/core/glibc/2.22/20170513201042/bin:/hab/pkgs/core/libxext/1.3.3/20180409205946/bin:/hab/pkgs/core/libxrender/0.9.10/20180409205945/bin:/hab/pkgs/core/coreutils-static/8.25/20170514151156/bin:/hab/pkgs/core/busybox-static/1.24.2/20170513215502/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
		},
		"exposes": ["9200", "9300"],
		"exports": {
			"http-port": "network.port",
			"transport-port": "transport.port"
		},
		"path": "/hab/pkgs/core/elasticsearch/6.2.2/20180419233650",
		"svc_path": "/hab/svc/elasticsearch",
		"svc_config_path": "/hab/svc/elasticsearch/config",
		"svc_data_path": "/hab/svc/elasticsearch/data",
		"svc_files_path": "/hab/svc/elasticsearch/files",
		"svc_static_path": "/hab/svc/elasticsearch/static",
		"svc_var_path": "/hab/svc/elasticsearch/var",
		"svc_pid_file": "/hab/svc/elasticsearch/PID",
		"svc_run": "/hab/svc/elasticsearch/run",
		"svc_user": "hab",
		"svc_group": "hab"
	},
	"sys": {
		"version": "0.55.0/20180321222338",
		"member_id": "6a4d71e36daf4f47af38dab96871b9ae",
		"ip": "172.31.26.118",
		"hostname": "ip-172-31-26-118",
		"gossip_ip": "0.0.0.0",
		"gossip_port": 9638,
		"http_gateway_ip": "0.0.0.0",
		"http_gateway_port": 9631,
		"permanent": false
	},
	"initialized": false,
	"user_config_updated": false,
	"health_check": "Unknown",
	"last_election_status": "None",
	"needs_reload": false,
	"needs_reconfiguration": false,
	"smoke_check": "Pending",
	"binds": [],
	"hooks": {
		"health_check": {
			"render_pair": "/hab/svc/elasticsearch/hooks/health_check",
			"stdout_log_path": "/hab/svc/elasticsearch/logs/health_check.stdout.log",
			"stderr_log_path": "/hab/svc/elasticsearch/logs/health_check.stderr.log"
		},
		"init": {
			"render_pair": "/hab/svc/elasticsearch/hooks/init",
			"stdout_log_path": "/hab/svc/elasticsearch/logs/init.stdout.log",
			"stderr_log_path": "/hab/svc/elasticsearch/logs/init.stderr.log"
		},
		"file_updated": null,
		"reload": null,
		"reconfigure": null,
		"suitability": null,
		"run": {
			"render_pair": "/hab/svc/elasticsearch/hooks/run",
			"stdout_log_path": "/hab/svc/elasticsearch/logs/run.stdout.log",
			"stderr_log_path": "/hab/svc/elasticsearch/logs/run.stderr.log"
		},
		"post_run": null,
		"smoke_test": null,
		"post_stop": null
	},
	"config_from": null,
	"manager_fs_cfg": {
		"butterfly_data_path": "/hab/sup/default/data/butterfly.dat",
		"census_data_path": "/hab/sup/default/data/census.dat",
		"services_data_path": "/hab/sup/default/data/services.dat",
		"data_path": "/hab/sup/default/data",
		"specs_path": "/hab/sup/default/specs",
		"composites_path": "/hab/sup/default/composites",
		"member_id_file": "/hab/sup/default/MEMBER_ID",
		"proc_lock_file": "/hab/sup/default/LOCK"
	},
	"process": {
		"pid": null,
		"state": "Down",
		"state_entered": 1524751400
	},
	"svc_encrypted_password": null,
	"composite": null
}]

#7

OK, so if those are both complete sets of data, then your supervisors are not seeing each other. We should see multiple entries for elasticsearch, but we’re only seeing the one. So, for some reason, these nodes are not connected as a ring. Can we validate that the IP being passed to peer is correct? You could check your systemd logs to see what happened there.
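
One rough way to check that from a saved /butterfly dump is to count the distinct member IDs it mentions. This is only a heuristic sketch using grep, and the function name is made up; a healthy 3-node ring would be expected to show more than one ID, whereas the dump posted in #5 contains a single one.

```shell
#!/usr/bin/env bash
# Read /butterfly JSON on stdin and print how many distinct
# "member_id" values appear anywhere in it (heuristic, not a parser).
count_members() {
  grep -o '"member_id": *"[0-9a-f]*"' | sort -u | wc -l | tr -d ' '
}
```

For example, `curl -s ip-of-node-with-sup:9631/butterfly | count_members` prints the number of distinct supervisors that node has heard about through service or election rumors.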


#9

I am able to curl the services endpoint from all 3 nodes via the public IP (peer).

My systemd logs all look identical from when I start up hab.

Apr 26 14:03:16 ip-172-31-27-158 cloud-init[1318]: hab-sup(MR): Supervisor Member-ID c434cc154c8349fda02f7f6f13ab6702
Apr 26 14:03:16 ip-172-31-27-158 cloud-init[1318]: hab-sup(MR): Starting core/elasticsearch
Apr 26 14:03:16 ip-172-31-27-158 cloud-init[1318]: elasticsearch.default(UCW): Watching user.toml
Apr 26 14:03:16 ip-172-31-27-158 cloud-init[1318]: hab-sup(MR): Starting gossip-listener on 0.0.0.0:9638
Apr 26 14:03:16 ip-172-31-27-158 cloud-init[1318]: hab-sup(MR): Starting http-gateway on 0.0.0.0:9631
Apr 26 14:03:16 ip-172-31-27-158 cloud-init[1318]: elasticsearch.default(HK): health_check, compiled to /hab/svc/elasticsearch/hooks/health_check
Apr 26 14:03:16 ip-172-31-27-158 cloud-init[1318]: elasticsearch.default(HK): init, compiled to /hab/svc/elasticsearch/hooks/init
Apr 26 14:03:16 ip-172-31-27-158 cloud-init[1318]: elasticsearch.default(HK): run, compiled to /hab/svc/elasticsearch/hooks/run
Apr 26 14:03:16 ip-172-31-27-158 cloud-init[1318]: elasticsearch.default(HK): Hooks compiled
Apr 26 14:03:16 ip-172-31-27-158 cloud-init[1318]: elasticsearch.default(SR): Hooks recompiled
Apr 26 14:03:16 ip-172-31-27-158 cloud-init[1318]: default(CF): Updated elasticsearch.yml 671e5a9462e3b6e2518189bc38762aef0cebcbaea055b54264f687ffb9d0b735
Apr 26 14:03:16 ip-172-31-27-158 cloud-init[1318]: default(CF): Updated jvm.options d5d9493fddf0eda515e6358db539b7bb592aa2cc075141bee2af60916d50b196
Apr 26 14:03:16 ip-172-31-27-158 cloud-init[1318]: default(CF): Updated log4j2.properties f90c00c17eff0d97b1356d0f26b93515442613427f31fe5c5fbb5c34f07cc169
Apr 26 14:03:16 ip-172-31-27-158 cloud-init[1318]: elasticsearch.default(SR): Configuration recompiled
Apr 26 14:03:16 ip-172-31-27-158 cloud-init[1318]: elasticsearch.default(SR): Waiting to execute hooks; election in progress, and we have no quorum.

#10

Do your firewall configs allow traffic to both of those ports on public and private IPs, or just private? If it’s only private, are all of the nodes on the same private subnet?

If you can curl the endpoints on all the public IPs, is it possible your Terraform config is using a private IP for that ${peer} flag?


#11

Turns out you need to allow egress for UDP too, not just TCP. Thanks for all the help!
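
For anyone applying the same fix by hand, a sketch with the AWS CLI might look like this. The security group ID and CIDR are hypothetical placeholders and the helper names are made up; adjust the CIDR to wherever your peers actually live (in this thread the peer was a public IP).

```shell
#!/usr/bin/env bash
# Hypothetical placeholders; replace with your own values.
SG_ID="${SG_ID:-sg-0123456789abcdef0}"
PEER_CIDR="${PEER_CIDR:-172.31.0.0/16}"

# Build an --ip-permissions shorthand value for one protocol
# on the default gossip port (9638).
gossip_egress_permission() {
  printf 'IpProtocol=%s,FromPort=9638,ToPort=9638,IpRanges=[{CidrIp=%s}]\n' \
    "$1" "$PEER_CIDR"
}

# Add egress rules for both TCP and UDP gossip traffic.
allow_gossip_egress() {
  local proto
  for proto in tcp udp; do
    aws ec2 authorize-security-group-egress \
      --group-id "$SG_ID" \
      --ip-permissions "$(gossip_egress_permission "$proto")"
  done
}
```

Calling `allow_gossip_egress` requires configured AWS credentials. The same rule can of course be expressed in the Terraform security group instead.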


#12

OH!! I’m so glad you figured it out! I’m going to mark this as the solution for anyone that comes in behind you seeing this same thing. Thanks for working through it and posting the solution. :)