Elasticsearch Request Timeouts
Recently my team at Zignal Labs has been working on protecting our Elasticsearch cluster from runaway searches, which can suck up our cluster resources and impact all users.
One crude tool for this is the request timeout: if a search has been running for longer than X seconds then we expect that it is likely to be a “bad” search, and we should kill it. This has negative side-effects for the user running the query: they can’t get their data. But protecting the availability of the cluster as a whole trumps that concern.
Fortunately for us, request timeouts are supported by Elasticsearch! This setting “Specifies the period of time to wait for a response. If no response is received before the timeout expires, the request fails and returns an error”.
Except that it doesn’t. We bashed our heads against this repeatedly: we would set a limit of 1 second, and the search would run for 30s before it was killed; with a limit of 30s we’d have searches running for minutes, and often not being killed at all. Searches with 10s timeouts would cancel eventually in staging, but never in production. What was going on?
Well, it turns out that the timeout setting is not a limit on how long a search can run from the time it is dispatched by the client. Obviously you’d expect it to be that kind of timeout, because
- it’s called timeout, and
- that’s what the docs say it is!
But then you look in GitHub and find issues like
- Add a hard time limit for the entire search request
- Apply timeout on the coordinating node too
- SearchRequest cancellation based on timeout
And it turns out that the timeout parameter is actually…
[…] only applied at the shard level. Shards are expected to wrap up ongoing work as soon as they hit the timeout and return a partial response. This is usually good enough, unless shard requests spend time in the search queue.
So, it’s not doing what we think it’s doing. Or what the docs say it’s doing. If your cluster is not loaded then your search queue is probably empty (or items don’t spend very long in it) and timeout is going to work pretty much like how it’s documented. But if you don’t live in that world (perhaps, say, your cluster is experiencing heavy load and your search queue is filling up) then you’re going to get a pretty different kind of behaviour.
The kind of behaviour that leads to spending a couple of days being utterly confused about what’s going on.
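To make the gap concrete, here is a deliberately simplified toy model in Python. It assumes the only thing the timeout does not cover is time spent waiting in the search queue, and that the shard stops promptly once its timeout fires. The function name and the 29-second queue wait are made up for illustration; they are not measurements from our cluster.

```python
def observed_latency(queue_wait_s: float, work_s: float, timeout_s: float) -> float:
    """Toy model: the timeout clock starts only when the shard begins executing,
    so time spent in the search queue is added on top of it."""
    return queue_wait_s + min(work_s, timeout_s)


# Idle cluster: no queueing, so a 1s timeout cuts the search off after about 1s.
idle = observed_latency(queue_wait_s=0.0, work_s=45.0, timeout_s=1.0)

# Loaded cluster (hypothetical numbers): the shard request sits in the queue
# for 29s before the 1s timeout even starts counting.
loaded = observed_latency(queue_wait_s=29.0, work_s=45.0, timeout_s=1.0)

print(idle, loaded)  # 1.0 30.0
```

Under this model the client sees the same timeout setting produce wildly different wall-clock times depending purely on how busy the queue is, which is the pattern we kept running into.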
As is often the case, the truth is in the GitHub comments. I ended up implementing our own task cancellation.
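For a rough idea of what that can look like, here is a minimal sketch built on Elasticsearch’s task management API (GET _tasks to list running tasks, POST _tasks/<task_id>/_cancel to cancel one). This is not the code we actually run: the function names, the base URL argument and the threshold are all illustrative assumptions, and it leaves out authentication, TLS and error handling entirely.

```python
import json
import urllib.request

# Action name of the top-level (coordinating) search task. Shard-level child
# tasks carry suffixes such as "[phase/query]" and are cancelled along with
# their parent, so only the parent is matched here.
SEARCH_ACTION = "indices:data/read/search"


def overdue_search_tasks(tasks_response: dict, max_seconds: float) -> list[str]:
    """Return IDs of cancellable search tasks running longer than max_seconds."""
    limit_nanos = max_seconds * 1_000_000_000
    overdue = []
    for node in tasks_response.get("nodes", {}).values():
        for task_id, task in node.get("tasks", {}).items():
            if (
                task.get("action") == SEARCH_ACTION
                and task.get("cancellable", False)
                and task.get("running_time_in_nanos", 0) > limit_nanos
            ):
                overdue.append(task_id)
    return overdue


def cancel_overdue_searches(base_url: str, max_seconds: float) -> list[str]:
    """List running search tasks and cancel the ones over the limit.

    Assumes an unauthenticated cluster reachable at base_url (hypothetical).
    """
    list_url = f"{base_url}/_tasks?actions=*search*&detailed=true"
    with urllib.request.urlopen(list_url) as resp:
        tasks = json.load(resp)

    cancelled = overdue_search_tasks(tasks, max_seconds)
    for task_id in cancelled:
        req = urllib.request.Request(
            f"{base_url}/_tasks/{task_id}/_cancel", method="POST"
        )
        urllib.request.urlopen(req).close()
    return cancelled
```

Something like cancel_overdue_searches("http://localhost:9200", 30) would then be run on a schedule. Because the running time is measured from when the task was created on the coordinating node, this kind of limit also covers time spent in the search queue, which is exactly what the timeout parameter misses.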