Information Retrieval

So far, you have seen how to save massive amounts of data in a distributed file system, so the storage problem is taken care of. But how do you search for a particular piece of data in this massive pile of distributed storage? You are effectively looking for a needle in a haystack, and that is where Information Retrieval (IR) techniques are applied.

Google was one of the pioneers in providing efficient IR solutions, making it possible to search for documents across the internet. IR has been one of the hot research areas ever since.

The New Way

Traditional RDBMS systems were not good enough for storing massive amounts of semi-structured textual data, such as web pages, that must be searched and retrieved constantly. A new technique was invented, based on the following steps:

  • Create an inverted index (sometimes called a reverse index) that maps each word (keyword) in the textual content to the Documents containing it - called the Indexing step
  • Save the textual data as a Document that contains certain field values
  • Search the index for the indexed words to retrieve the matching Documents along with the fields that were saved
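
The three steps above can be sketched in a few lines of Python. This is only a minimal illustration of the idea, not how Lucene is implemented: the class name TinyIndex, the tokenize helper, and the "title" and "body" field names are all hypothetical, and the sketch assumes simple lowercase word splitting with no stemming or ranking.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """A toy inverted index (hypothetical, for illustration only)."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> ids of Documents containing it
        self.documents = {}               # Document id -> saved field values

    def add(self, doc_id, fields, indexed_field="body"):
        """Save the Document's fields and index the words of one field."""
        self.documents[doc_id] = fields
        for word in tokenize(fields[indexed_field]):
            self.postings[word].add(doc_id)

    def search(self, query):
        """Return the saved fields of Documents containing every query word."""
        words = tokenize(query)
        if not words:
            return []
        # .get avoids creating empty entries for words that were never indexed
        matching = set.intersection(*(self.postings.get(w, set()) for w in words))
        return [self.documents[doc_id] for doc_id in sorted(matching)]


if __name__ == "__main__":
    index = TinyIndex()
    index.add(1, {"title": "Storage", "body": "Saving data in a distributed file system"})
    index.add(2, {"title": "Search", "body": "Searching data with an inverted index"})
    print(index.search("data"))            # both Documents contain "data"
    print(index.search("inverted index"))  # only the second Document matches
```

Note that a search never scans the stored text: it only looks up each query word in the index and intersects the resulting sets of Document ids. Because this sketch does no stemming, a query for "search" would not match the indexed word "searching".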

All of these steps are implemented in Lucene, an open-source library. Today, Lucene is often the first option considered when implementing search in a software system.

Let us now look at these concepts in greater detail.

The Document
