
Highlighting Large Documents in ElasticSearch

In December 2016 we started working on Ambar, a document search system. Ambar uses ElasticSearch as its core search engine.

While developing Ambar we ran into many ES-related issues, and we'd like to share the valuable experience we gained. Let's start with an essential function of every search system: search results highlighting.

Proper results highlighting is the most valuable usability feature of any search system. First, it shows the user the internal search logic and why a particular result was retrieved. Second, it lets the user judge results with a quick glance at the highlights instead of downloading and exploring the whole document.

Since Ambar is a document search system (and by documents I mean files too), it has to handle documents that are really large in terms of full-text search, with sizes exceeding 100 MB. This paper explains how to achieve high-performance highlighting of large documents with ElasticSearch.

Highlighting example

The First Step Or Defining The Problem

Ambar uses ES as an engine to search through parsed files/documents content and their meta. Here is an example of a document that Ambar stores in ES:

    {
        sha256: "1a4ad2c5469090928a318a4d9e4f3b21cf1451c7fdc602480e48678282ced02c",
        meta: [{
            id: "21264f64460498d2d3a7ab4e1d8550e4b58c0469744005cd226d431d7a5828d0",
            short_name: "quarter.pdf",
            full_name: "//winserver/store/reports/quarter.pdf",
            source_id: "crReports",
            extension: ".pdf",
            created_datetime: "2017-01-14 14:49:36.788",
            updated_datetime: "2017-01-14 14:49:37.140",
            extra: [],
            indexed_datetime: "2017-01-16 18:32:03.712"
        }],
        content: {
            size: 112387192,
            indexed_datetime: "2017-01-16 18:32:33.321",
            author: "John Smith",
            processed_datetime: "2017-01-16 18:32:33.321",
            length: "",
            language: "",
            state: "processed",
            title: "Quarter Report (Q4Y2016)",
            type: "application/pdf",
            text: ".... laaaaaarge text here ...."
        }
    }
The JSON document above is a parsed .pdf file containing a financial report; the file is about 100 MB in size. The content.text field contains the report's parsed text, which is also about 100 MB long.

Let's make a simple experiment: index a thousand documents like the one above, without any index tuning or custom mappings, and see how quickly ES searches through them and retrieves highlights for the content.text field.
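For reference, the experiment's search request body might look like the sketch below (the sample phrase is an assumption). With no highlighter settings specified, ES falls back to its default Plain highlighter for content.text.

```python
# A sketch of the experiment's request body: a match_phrase search on
# content.text with default highlighting (no custom options, so ES
# uses the Plain highlighter). The phrase is an arbitrary example.
query = {
    "query": {
        "match_phrase": {
            "content.text": "quarter report"
        }
    },
    "highlight": {
        "fields": {
            # empty options object: defaults apply, including the highlighter type
            "content.text": {}
        }
    }
}
```

This body would be POSTed to the index's `_search` endpoint.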

Here are the results:

  • A match_phrase query search in the content.text field takes from 5 to 30 seconds
  • Highlight retrieval for the content.text field takes on average 10 seconds per hit

This kind of performance is unacceptable. Any user of a search system expects results to appear instantly after hitting the "Search" button, not half a minute later. Let's look into the slow highlighting issue and sort it out.

Choosing Highlighter

ES and the underlying Lucene offer three highlighters to choose from; take a look at the official manual. Here is a quick guide:

  • Plain - the default highlighter in ES. The slowest one, but it does the most precise highlighting, almost completely matching Lucene's search logic. To retrieve a highlight it has to load the whole document into memory and re-analyze it.

  • Postings - a faster one. It splits the document's field into sentences and retrieves only the sentences with matching tokens, using BM25 to sort the results. Requires additional sentence positions to be stored within the index.

  • Fast Vector Highlighter (FVH) - seems to be the fastest one, especially for large documents. Requires storing position offsets for every token within the index. To retrieve highlights it doesn't need to fetch the whole document, only the tokens close to the hit, which is quite fast since the position of every token is known.
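The FVH requirement above translates into the mapping: the field must store term vectors with positions and offsets. A minimal mapping sketch (the ambar_file type name is our assumption; the field path follows the document shown earlier):

```python
# Mapping sketch that enables FVH on content.text. Setting term_vector to
# "with_positions_offsets" makes ES store, for every token, its position
# and character offsets, so FVH can cut fragments without re-analyzing
# the whole field. "ambar_file" is a hypothetical type name.
mapping = {
    "mappings": {
        "ambar_file": {
            "properties": {
                "content": {
                    "properties": {
                        "text": {
                            "type": "text",
                            "term_vector": "with_positions_offsets"
                        }
                    }
                }
            }
        }
    }
}
```

Note that term vectors noticeably increase index size, which is the price for fast fragment retrieval.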

So now you can guess why ES handles large-document highlighting so badly out of the box: retrieving the whole document for every hit and re-analyzing it is quite expensive in terms of performance, especially for documents larger than 1 MB.

Since we definitely could not use the Plain highlighter, we tested both Postings and FVH. The final choice was FVH, and here's why:

  • Highlighting a 100 MB document takes about 10-20 milliseconds with FVH; Postings does it in about one second

  • Postings doesn't always divide a document's field into sentences correctly, so highlight sizes can vary greatly (from 50 words to thousands in some cases). FVH doesn't have this problem since it retrieves a fixed number of tokens, not sentences.

  • Postings highlights tokens in any order and doesn't work properly with complex queries. For example, it won't properly highlight the results of a match_phrase query with a slop value specified: it interprets it as a bool query and highlights every matching token in the whole field.
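With term vectors in place, requesting FVH explicitly is just a matter of highlight options. A sketch (the fragment sizes and the sample phrase are arbitrary choices, not Ambar's actual settings):

```python
# Request body sketch forcing the fast vector highlighter on content.text.
# fragment_size and number_of_fragments bound how much text comes back,
# which matters when the field itself is ~100 MB.
fvh_request = {
    "query": {
        "match_phrase": {
            "content.text": "quarter report"
        }
    },
    "highlight": {
        "fields": {
            "content.text": {
                "type": "fvh",              # explicitly pick FVH
                "fragment_size": 150,       # ~150 characters per fragment
                "number_of_fragments": 3    # at most 3 fragments per hit
            }
        }
    }
}
```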


During FVH testing we found one pretty nasty issue: it doesn't interpret a match_phrase query exactly as Lucene's search does. FVH highlights tokens only in the order specified in the query, while Lucene's search treats tokens in any order as hits. If you search for the phrase 'John Smith' but a document's field contains 'Smith John', ES retrieves the document as a hit, yet FVH won't highlight it. The solution we came up with is phrase permutation: we submit different queries for search and for highlighting. The search gets the default query, while the highlighter gets a query built by permuting the word positions of the source phrase.
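The permutation trick can be sketched as follows (the function name and wiring are our illustration, not Ambar's actual code):

```python
from itertools import permutations

def build_highlight_query(field, phrase, slop=0):
    """Build a bool/should query that matches every word-order
    permutation of the phrase, for use as a separate highlight query."""
    words = phrase.split()
    return {
        "bool": {
            "should": [
                {"match_phrase": {field: {"query": " ".join(p), "slop": slop}}}
                for p in permutations(words)
            ]
        }
    }

# Two words give two permutations: "John Smith" and "Smith John"
hq = build_highlight_query("content.text", "John Smith")
```

The resulting query can be passed via the highlight_query option of the content.text highlight field, while the main query stays untouched. Keep in mind that the number of permutations grows factorially with phrase length, so in practice you'd cap it for long phrases.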

To Sum up

ES can actually deal with large documents and still deliver good performance; it's just important to set up your index correctly and keep ES-specific issues in mind.

Subscribe to our updates, and stay tuned!

Originally published at http://blog.rdseventeen.com/