production elasticsearch failures

Recently we encountered a problem with our Elasticsearch cluster in production. The root cause of the failure is still unknown and is being investigated by the AWS EC2 team.

Current infrastructure:

  • 3x i3.2xlarge nodes – Data+Master using XFS filesystem
  • 1x m4.large node – Kibana+Coordinating

Production Event:

  • Around 1 AM Eastern, our monitoring system notified us that the Elasticsearch health check was failing across the entire cluster
  • After receiving the notifications and confirming they were not a false positive, I investigated the failure
  • At this time, the Node-A and Node-B data+master nodes had no running Java process
  • Node-C was still running successfully
  • Restarting Node-A and Node-B worked for a while, but both nodes kept failing and faulting out, so I retried allocation of the failed shards:
curl -s -XPOST \
'https://elastic:redacted@elasticsearch-01.example.com:9201/_cluster/reroute?retry_failed=true'
  • By restarting the failed nodes I was able to recover 99% of all indexes; the exception was about 2GB worth of shards for a specific index
  • I provisioned 2 additional nodes to keep the cluster healthy while I attempted to recover the failing shards
  • After I unmounted and re-mounted the data volume for one of the instances holding the shard data, the shards initialized successfully on a new instance
  • I ended up with 4 nodes, which caused warnings within Elasticsearch; I resolved them by draining shards from one node so it could be removed:
curl -XPUT \
'https://elastic:redacted@elasticsearch-01.example.com:9201/_cluster/settings' \
-H "Content-Type: application/json" -d @cluster.json

cluster.json
{
  "transient" : {
    "cluster.routing.allocation.exclude._ip" : "1.2.3.4"
  }
}
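
Before removing the drained node, it helps to confirm that no shards are left on it, and afterwards to clear the exclusion so it does not linger in the cluster settings. The following is a minimal sketch, not the exact commands I ran: the helper names shards_on_ip and exclude_body are hypothetical, ES_URL is a placeholder without credentials, and it assumes the _cat/shards output is requested with the columns index, shard, prirep, state and ip in that order.

```shell
#!/bin/sh
# ES_URL is a placeholder (assumption); point it at your own cluster.
ES_URL="${ES_URL:-https://elasticsearch-01.example.com:9201}"

# Count the shards still located on a given IP.
# Expects `_cat/shards?h=index,shard,prirep,state,ip` output on stdin,
# so the IP is the fifth whitespace-separated column.
shards_on_ip() {
  awk -v ip="$1" '$5 == ip { n++ } END { print n + 0 }'
}

# Build the JSON body for the cluster settings call.
# With an IP argument the node is excluded; with no argument the
# setting is reset by sending null.
exclude_body() {
  if [ -n "${1:-}" ]; then
    value="\"$1\""
  else
    value="null"
  fi
  printf '{"transient":{"cluster.routing.allocation.exclude._ip":%s}}\n' "$value"
}

# Example usage, commented out so that sourcing this file has no side effects:
# curl -s "$ES_URL/_cat/shards?h=index,shard,prirep,state,ip" | shards_on_ip 1.2.3.4
# exclude_body | curl -s -XPUT "$ES_URL/_cluster/settings" \
#   -H "Content-Type: application/json" -d @-
```

Once shards_on_ip reports 0 for the excluded address, the node should be safe to stop, and sending the body from exclude_body with no argument removes the exclusion again.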

Example kernel log buffer:

[1234] Buffer I/O error on dev nvme0n1, logical block 644930, lost async page write
[1235] Buffer I/O error on dev nvme0n1, logical block 644931, lost async page write
[1236] Buffer I/O error on dev nvme0n1, logical block 644932, lost async page write
[1237] blk_update_request: I/O error, dev nvme0n1, sector 3868640
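
To see quickly which block devices are affected, errors like the ones above can be counted per device. This is a rough sketch rather than part of the original investigation: the function name io_errors_by_dev is hypothetical, and it only matches the two message shapes shown in the log excerpt.

```shell
#!/bin/sh
# Summarise kernel I/O errors per device.
# Reads kernel log lines on stdin and prints "<device> <count>" per device.
io_errors_by_dev() {
  grep -E 'Buffer I/O error on dev|I/O error, dev' |
    sed -E 's/.*dev ([A-Za-z0-9_-]+).*/\1/' |
    sort | uniq -c | awk '{ print $2, $1 }'
}

# Example usage (commented out; reading the kernel log may require root):
# dmesg | io_errors_by_dev
```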

About Ryan Gravlin

Over 25 years experience as a system operator.

Miami, FL https://dbag.tech
