Regular expressions in Elasticsearch behave differently than they do in most programming languages, and they have some interesting and noteworthy quirks worth knowing about.
There are 3 important things to know about regular expressions in Elasticsearch.
1. Matching is done at the token level, not against the whole string
2. The operators and syntax are different from most other languages
3. Wildcard matchers dramatically affect performance
Below is a short description of my biggest takeaways.
Here is the key sentence from the documentation: “Elasticsearch will apply the regexp to the terms produced by the tokenizer for that field, and not to the original text of the field.” This means that regular expressions only run against the tokenizer output of the text field you are querying.
You can see what tokens a regular expression will run against for a given string by running a query like this, though you will need to make sure that the tokenizer matches the one used by the field you are querying.
POST /_analyze
{
  "tokenizer": "standard",
  "text": "This text will get tokenized"
}
If you run this query, you can see that the string gets broken up into the tokens “This”, “text”, “will”, “get”, and “tokenized” (the standard tokenizer preserves case; lowercasing happens in a separate token filter when the full standard analyzer runs). A regular expression query will attempt to match against each of these individual tokens. That means a pattern like “This.*tokenized” will not match, because there is no single token that contains both “This” and “tokenized”.
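To make this concrete, here is a sketch of a regexp query that would match the text above. The index name (my-index) and field name (message) are hypothetical; the pattern succeeds only because it matches a single complete token:

POST /my-index/_search
{
  "query": {
    "regexp": {
      "message": "tokeni.*"
    }
  }
}

Swapping the pattern for “This.*tokenized” would return no hits against an analyzed text field, for the reason described above.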
There is a workaround: store or copy the full string to an unanalyzed (keyword) field. You can read more about that solution here.
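A minimal sketch of that approach (index and field names hypothetical) uses a keyword multi-field, which stores the original string as one unanalyzed term:

PUT /my-index
{
  "mappings": {
    "properties": {
      "message": {
        "type": "text",
        "fields": {
          "raw": { "type": "keyword" }
        }
      }
    }
  }
}

A regexp query against message.raw can then match across the entire string, e.g. the pattern “This.*tokenized”, because the keyword field is a single token. Note that a regexp query must match the whole term, not just part of it.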
Elasticsearch is missing some of the regex features common in other languages. The two most noticeable to me are shorthand character classes like \d and \s, and lookaheads.
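For example, since \d is not supported, a query for four-digit tokens has to spell out the character class explicitly (index and field names hypothetical):

POST /my-index/_search
{
  "query": {
    "regexp": {
      "message": "[0-9]{4}"
    }
  }
}

The frustrating part is that a pattern containing \d will often just silently fail to match rather than raise an error.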
That said, there is still a lot you can do. There are too many capabilities to list here, but it’s important that you don’t rely on your “muscle memory” from other programming languages because you may not be getting the results you expect…and you may not get an error message letting you know there’s a problem.
Read more about Elasticsearch’s regular expressions.
The documentation offers a few warnings about performance, advising you to use a long literal prefix before any wildcards in your regular expression and to avoid large wildcard searches.
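As a sketch of the difference (index, field, and values hypothetical), the first query below gives Lucene a literal prefix (“error-”) it can seek to directly in the terms dictionary:

POST /my-index/_search
{
  "query": {
    "regexp": {
      "message": "error-[0-9]+"
    }
  }
}

The second, with a leading wildcard, forces Lucene to test a far larger set of terms and is much more expensive:

POST /my-index/_search
{
  "query": {
    "regexp": {
      "message": ".*error.*"
    }
  }
}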
Here’s a great quote:
"Regular expressions are dangerous because it’s easy to accidentally create an innocuous looking one that requires an exponential number of internal determinized automaton states (and corresponding RAM and CPU) for Lucene to execute."
There are a couple of settings that limit the performance impact of regular expressions. Both can be raised if needed (max_determinized_states per query, max_regex_length per index), although the defaults already seem pretty generous to me.
max_determinized_states (default 10000)
max_regex_length (default 1000)
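For instance, a regexp query can raise its own determinized-state budget inline, using the expanded query form (index, field, and pattern hypothetical):

POST /my-index/_search
{
  "query": {
    "regexp": {
      "message": {
        "value": "complicated.*pattern",
        "max_determinized_states": 20000
      }
    }
  }
}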
I learned a few things when I was reading up about regular expressions in Elasticsearch, and I hope you did too. I would love to hear about anything interesting or weird you learned while working with them.