Substring search in Elasticsearch

escoder · Jun 12, 2021 · 2 min read

Consider a scenario where you need to replicate a SQL query that uses the LIKE operator, or you want to perform a partial match or a substring match using Elasticsearch.

In these cases, you might wonder how to achieve this, since commonly used queries like match, term, and multi_match with the standard analyzer will not get you there. You will need to use either a wildcard query or a query_string query, or configure your index with a custom analyzer and tokenizer.

Some ways to do a substring search

Suppose the following data is indexed in your index, and your task is to find all the documents that contain foo.
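The original post showed the indexed documents as screenshots. As a stand-in, here is a hypothetical data set of three documents; the index name my-index, the field name name, and the values are assumptions for illustration (any three documents containing foo would do):

```json
// Index three sample documents (hypothetical data)
PUT /my-index/_doc/1
{ "name": "myfoobar" }

PUT /my-index/_doc/2
{ "name": "asfoozar" }

PUT /my-index/_doc/3
{ "name": "foobarzar" }
```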

This task can be done in different ways, depending on your use case. Three of the methods are shown below.

Wildcard query

  • It is a term-level query that returns the documents matching the specified wildcard pattern. The wildcard expression is not analyzed, and the only special characters it evaluates are * and ?.

  • For example, *foo?ar would match "myfoobar", "asfoozar", etc. * matches any number of characters, whereas ? matches exactly one character.
  • Leading wildcards should be avoided, as they lead to a scan of the entire index: a leading wildcard increases the number of iterations required to find the matching documents. A sketch of a wildcard query is shown below.

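A minimal wildcard query sketch, assuming the my-index index and name field from the sample data above; *foo* matches any token that contains foo:

```json
GET /my-index/_search
{
  "query": {
    "wildcard": {
      "name": {
        "value": "*foo*"
      }
    }
  }
}
```

Note that the leading * here is exactly the expensive case described above; it is shown only for completeness.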

Query string

This query returns documents based on the provided query string. In query_string you can also use wildcard searches with * or ?. Wildcards combined with query_string should be used very wisely, as they can badly affect cluster performance.

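A query_string sketch under the same assumptions (my-index, name); the wildcard syntax inside the query string mirrors the wildcard query:

```json
GET /my-index/_search
{
  "query": {
    "query_string": {
      "query": "*foo*",
      "fields": ["name"]
    }
  }
}
```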

Note: Both the query_string and wildcard queries can be slow on larger data sets. You should look into ngrams or edge ngrams to address these performance issues.

N-gram tokenizer

Ngrams and edge ngrams are two very useful ways of tokenizing text in Elasticsearch. The ngram tokenizer splits words into multiple subtokens of specified lengths, controlled by the min_gram and max_gram settings, which determine the sizes of the generated tokens. Documents are then matched against these indexed tokens. An index mapping for generating ngrams is sketched below.

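A minimal mapping sketch, assuming a custom analyzer named ngram_analyzer built on an ngram tokenizer with min_gram 3 and max_gram 5 (values chosen to reproduce the tokens mentioned below). The index.max_ngram_diff setting must be raised to allow the gap between min_gram and max_gram, and the index must be created with this mapping before the documents are indexed:

```json
PUT /my-index
{
  "settings": {
    "index": { "max_ngram_diff": 2 },
    "analysis": {
      "analyzer": {
        "ngram_analyzer": {
          "type": "custom",
          "tokenizer": "ngram_tokenizer"
        }
      },
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 5,
          "token_chars": ["letter", "digit"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "ngram_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}
```

Setting search_analyzer to standard means the query text itself is not split into ngrams at search time; only the indexed side carries the subtokens.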

You can check the generated tokens by using the Analyze API. For example, myfoobar will generate tokens like "foo", "foob", "fooba", "oba", etc.
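A sketch of such an Analyze call, assuming the ngram_analyzer defined above:

```json
POST /my-index/_analyze
{
  "analyzer": "ngram_analyzer",
  "text": "myfoobar"
}
```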

Search Query -

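With the ngram mapping above in place, a plain match query is enough; the term foo matches the indexed subtokens directly:

```json
GET /my-index/_search
{
  "query": {
    "match": {
      "name": "foo"
    }
  }
}
```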

All three of the queries shown above will return all three documents in the search results.