Craig Glennie

Once we realised that Elasticsearch’s own timeout functionality only works like the docs say in some situations, we were still left with the problem of how to kill long-running searches after some amount of time had elapsed. Remember, we’re trying to protect against (or recover from) badly-behaved searches sucking up our cluster resources and causing slowness or an outage.

Our problem: we have a lot of searches happening in our Elasticsearch cluster. Some of them consume a lot of resources in some scenarios, and some of those are eligible for cancellation if they run too long. Cancelling a user’s search isn’t a great user experience, but it’s a useful protection mechanism for the cluster.

There were three things we needed to do:

  1. Figure out which searches were eligible for cancellation
  2. Know when a search had been running for too long
  3. Cancel it

The solution was to use a combination of the Tasks API’s cancellation function and the X-Opaque-Id header. I was able to use that header to tag searches with the name of the service that created them, and a timestamp (from the client’s perspective) of when the search was created. This was possible because the X-Opaque-Id header can be any string you want it to be. ES doesn’t care, won’t parse or validate it, and just attaches it to the tasks that end up running. You could say that, from ES’s perspective, it’s opaque…

So I can make X-Opaque-Id a JSON string, which gives me an extensible data blob. I can chuck a couple of helpful values in there, so it looks like {"service": "apiServer", "clientTimestamp": "2020-05-26T05:30:00Z"}
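Here’s a minimal sketch of what that tagging looks like from the client side. It’s Python with the requests library against a hypothetical local cluster; the index name, query, and service name are placeholders rather than anything from our actual setup.

    import json
    from datetime import datetime, timezone

    import requests

    ES_URL = "http://localhost:9200"  # assumed cluster address

    # Build our own JSON blob; ES just carries it along untouched.
    opaque_id = json.dumps({
        "service": "apiServer",
        "clientTimestamp": datetime.now(timezone.utc).isoformat(),
    })

    # Attach the blob to the search via the X-Opaque-Id header.
    resp = requests.post(
        f"{ES_URL}/my-index/_search",
        headers={"X-Opaque-Id": opaque_id, "Content-Type": "application/json"},
        json={"query": {"match_all": {}}},
    )
    resp.raise_for_status()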

Now, if I want to cancel a search that has been running too long, all I have to do is ask the Tasks API to list all searches, deserialize the X-Opaque-Id header, find the service I care about, and check clientTimestamp to see how old the search is.
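As a sketch, under the same assumptions as above (and remembering that service and clientTimestamp are purely our own convention, not something ES understands), that lookup can go something like this:

    import json
    from datetime import datetime, timezone

    import requests

    ES_URL = "http://localhost:9200"  # assumed cluster address
    MAX_AGE_SECONDS = 30              # assumed limit, pick your own

    def find_stale_search_tasks(service="apiServer"):
        """Return (task_id, parent_task_id) pairs for over-age searches."""
        # List search tasks (and their children) with task details included.
        resp = requests.get(
            f"{ES_URL}/_tasks",
            params={"actions": "*search*", "detailed": "true"},
        )
        resp.raise_for_status()
        stale = []
        now = datetime.now(timezone.utc)
        for node in resp.json()["nodes"].values():
            for task_id, task in node["tasks"].items():
                raw = task.get("headers", {}).get("X-Opaque-Id")
                if not raw:
                    continue
                try:
                    tag = json.loads(raw)
                except ValueError:
                    continue  # someone else's opaque id, not our JSON blob
                if tag.get("service") != service:
                    continue
                started = datetime.fromisoformat(
                    tag["clientTimestamp"].replace("Z", "+00:00")
                )
                if (now - started).total_seconds() > MAX_AGE_SECONDS:
                    stale.append((task_id, task.get("parent_task_id")))
        return stale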

Then I have to kill it, which turns out (at least in 7.5.2) not to be the most reliable thing in the world. At least, not if you try to kill the search by using its parent_task_id and only try once. What I found was that the cancel API call would keep returning data, and the search would keep running, until I’d called the endpoint two or three times. In production, when the cluster was heavily loaded, we’d see searches run way over their limit and simply fail to cancel.
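A sketch of that cancel-by-parent-and-retry approach is below. The attempt count and delay are made up, and treating an empty response as “nothing left to cancel” is my own assumption about what the cancel endpoint is telling us.

    import time

    import requests

    ES_URL = "http://localhost:9200"  # assumed cluster address

    def cancel_by_parent(parent_task_id, attempts=3, wait_seconds=1.0):
        """Repeatedly ask ES to cancel all tasks under a parent task."""
        for _ in range(attempts):
            resp = requests.post(
                f"{ES_URL}/_tasks/_cancel",
                params={"parent_task_id": parent_task_id},
            )
            resp.raise_for_status()
            # The cancel response lists the tasks it matched; an empty "nodes"
            # object is read here (an assumption) as "nothing still running".
            if not resp.json().get("nodes"):
                return True
            time.sleep(wait_seconds)
        return False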

So, using parent_task_id mostly works, if you try enough times. But I figured out a way to get slightly more reliable results: just try to cancel every task that matches the criteria, and kill their parents too, for good measure. It’s a (virtual) bloodbath, but it seems to result in more rapid actual search cancellation than just killing the parent tasks and waiting for ES to clean everything up.
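A sketch of that more aggressive approach, reusing the hypothetical find_stale_search_tasks() helper from the earlier snippet:

    import requests

    ES_URL = "http://localhost:9200"  # assumed cluster address

    def cancel_stale_searches(service="apiServer"):
        """Cancel every over-age search task for a service, and its parent."""
        to_cancel = set()
        for task_id, parent_task_id in find_stale_search_tasks(service):
            to_cancel.add(task_id)
            if parent_task_id:
                to_cancel.add(parent_task_id)
        for task_id in to_cancel:
            # POST /_tasks/<task_id>/_cancel cancels a single task by id.
            requests.post(f"{ES_URL}/_tasks/{task_id}/_cancel").raise_for_status()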