Fuzzy search for keywords in free text
Free text entered by users often contains typos. How can we find approximate, misspelt matches of a search string?
--
Advantages and drawbacks of free text
At work, some people fill in the text fields of forms. Later, their colleagues analyse or process the free text from thousands of forms and draw meaningful conclusions from it. For example, imagine doctors who fill in patient records and also do clinical research using the data collected inside their hospital.
Compared to complicated, inflexible forms with many controls and many predefined options, forms made up mostly of text fields have one important advantage: they let users enter any information quickly. Later, when it is really needed, the text can be transformed into more structured data with some additional effort.
When we type, we inevitably make typos. The more users are in a rush, the more mistakes they make. Spell checkers help to correct spelling errors, but not errors in words that are outside the language dictionary, for example fancy product names or scientific terms.
Sample use cases of fuzzy search
To detect search terms in user-entered text even when they are misspelt, approximate string matching should be used. Searching for inexact, misspelt matches is also called fuzzy search. Compared to the search term, a result of a fuzzy search might have inserted, deleted, substituted or swapped (transposed) letters.
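One common way to quantify "inserted, deleted, substituted or swapped letters" is an edit distance that counts each of those four operations as one step. The sketch below is a minimal illustration, assuming the optimal string alignment variant of the Damerau-Levenshtein distance; the function name `osa_distance` and the sample words are made up for this example, and this is not necessarily the algorithm used later in this post.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance between a and b.

    Each insertion, deletion, substitution or swap of two adjacent
    letters costs 1. (Illustrative helper, not from any library.)
    """
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i letters of a and first j of b
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i letters
    for j in range(cols):
        d[0][j] = j  # insert all j letters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # swap of two adjacent letters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


# Each misspelling below differs from "aspirin" by a single edit.
for typo in ["asprin", "aspirinn", "aspirim", "apsirin"]:
    print(typo, osa_distance("aspirin", typo))
```

A word could then be treated as an approximate match when its distance to the search term is at most some small limit, such as 1 or 2; the right limit depends on the word length and on how many false positives are acceptable.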
In principle, fuzzy search can be used to find similar words (as in a spell checker) or to find similar substrings (possibly composed of several words) in a string. These two cases call for different algorithms. In this post I consider only the search for words, which has more common applications.
For example, the following two common tasks can be implemented in a similar way, given that the search terms (product names or technical terms) might be misspelt in the source text:
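To make the word-level case concrete, here is a rough sketch that splits a text into words and keeps those that resemble the search term. It uses the similarity ratio from Python's standard `difflib` module as a stand-in measure, which is an assumption for illustration rather than the method of this post; the function name `find_fuzzy_words`, the 0.8 threshold and the sample text are likewise hypothetical.

```python
import re
from difflib import SequenceMatcher


def find_fuzzy_words(text: str, term: str, threshold: float = 0.8):
    """Return (start, end, word) for each word in text that resembles term.

    The threshold of 0.8 is an arbitrary illustrative value.
    """
    matches = []
    term_lower = term.lower()
    for m in re.finditer(r"\w+", text):
        word = m.group()
        # ratio() is 1.0 for identical strings and falls towards 0.0
        # as the two strings share fewer matching letters
        ratio = SequenceMatcher(None, term_lower, word.lower()).ratio()
        if ratio >= threshold:
            matches.append((m.start(), m.end(), word))
    return matches


text = "Patient takes ibuprofin daily and aspirin weekly."
print(find_fuzzy_words(text, "ibuprofen"))
```

Returning character positions along with the matched word means the same result can serve both to decide whether a document contains the term and to mark the matched span when the text is displayed.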
- Find documents that contain matches of the search term, including inexact ones.
- Highlight approximate matches in displayed text so that a user can more easily manually split the…