Inverted Index: The basic ingredient behind the recipe called Search Engine

The Wikipedia definition of an inverted index is:

“An inverted index (also referred to as postings file or inverted file) is an index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents”

Now refer to the diagram below for a pictorial description of the concepts explained.

Say I am running a website, and one of its pages (URL: www.DelightedSandip.com/CoreContent.aspx) has some content. When a search engine crawls the webpage corresponding to that (or any other) URL and extracts the text from it, it can save the URL, the text, etc. in some database (an RDBMS or a NoSQL DB), an XML file, and so on. So the DB or XML file (say DB1 or XML1) can contain a structure with columns (in case of a DB) or attributes (in case of an XML file) such as docID, url, text, timestamp, etc.
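As a minimal sketch, the record stored per crawled page might look like the following (the class name `CrawledDoc` and the sample values are purely illustrative, not from any particular framework):

```python
from dataclasses import dataclass
from datetime import datetime

# Illustrative record for one crawled page -- one row in DB1
# (or one element in XML1). Field names mirror the columns
# mentioned above: docID, url, text, timestamp.
@dataclass
class CrawledDoc:
    doc_id: int
    url: str
    text: str
    timestamp: datetime

doc = CrawledDoc(
    doc_id=1,
    url="www.DelightedSandip.com/CoreContent.aspx",
    text="inverted index is the basic ingredient of a search engine",
    timestamp=datetime(2014, 1, 1),
)
```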

N.B. -> The basic details of crawling and extracting text from a webpage are not covered in this post – I will probably write another post on that – but first things first, so let's concentrate on the Inverted Index.

So now what we can do is get the text from the DB or XML file (DB1 or XML1) and split it on the space separator (in .NET one can use text.Split(" ")) to get all the words within the text. In addition, one can also get the position of each word within the text (in .NET one can use text.IndexOf(word)). We can then save all of this in another DB or XML file (say DB2 or XML2) with columns (in case of a DB) or attributes (in case of an XML file) such as word, docid, position, etc. All of this can be done in the background using a Windows Service (in the Java world, something like Quartz can be used for the same purpose – a .NET port of Quartz is also available here). This background process may run on a weekly or monthly basis.
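A minimal Python sketch of this indexing step (the .NET equivalents being text.Split and text.IndexOf); the resulting word -> (docid, position) postings mirror the columns of DB2/XML2. Note it records the word position (index in the token list) rather than the character offset, a common convention for positional indexes; the sample texts are made up for illustration:

```python
# Build the inverted index: word -> list of (doc_id, position) postings.
def index_document(doc_id, text, inverted_index):
    # Split on the space separator and record each word's position.
    for position, word in enumerate(text.lower().split(" ")):
        inverted_index.setdefault(word, []).append((doc_id, position))

inverted_index = {}
index_document(1, "inverted index is the heart of a search engine", inverted_index)
index_document(2, "a search engine crawls the web", inverted_index)

# "search" occurs in both documents, at different positions.
print(inverted_index["search"])  # -> [(1, 7), (2, 1)]
```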

Now, when the user inputs his/her "search text" in the search box and clicks the Search button, the button-click event handler can extract all the words within the "search text", along with their positions, and run a query against the DB or XML file (DB2 or XML2) to get all the docids that most closely match the words of the "search text" and their positions. Using these docids, one can then run a join query against the DB or XML file (DB1 or XML1) to get the respective URLs and the actual content. Then, in the UI, all the URLs can be shown along with a trimmed version of the respective contents.
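The query side can be sketched as intersecting the posting lists of the query words and then "joining" the surviving docids back to the document store, which stands in for the join query against DB1 (the posting lists, URLs, and texts below are illustrative sample data, and the ranking-by-position part is omitted):

```python
# Posting lists as built during indexing: word -> list of (doc_id, position).
inverted_index = {
    "search": [(1, 7), (2, 1)],
    "engine": [(1, 8), (2, 2)],
    "web":    [(2, 5)],
}

# Document store (DB1): doc_id -> (url, text).
documents = {
    1: ("www.DelightedSandip.com/CoreContent.aspx",
        "inverted index is the heart of a search engine"),
    2: ("www.example.com/Crawling.aspx",
        "a search engine crawls the web"),
}

def search(query, inverted_index, documents):
    words = query.lower().split(" ")
    # One set of doc ids per query word...
    doc_sets = [{doc_id for doc_id, _ in inverted_index.get(w, [])} for w in words]
    # ...intersected, so only documents containing every word survive.
    matching = set.intersection(*doc_sets) if doc_sets else set()
    # "Join" back to the document store to recover url and content.
    return [documents[d] for d in sorted(matching)]

# Both documents contain "search" and "engine"; only doc 2 contains "web".
print(search("search engine", inverted_index, documents))
print(search("web", inverted_index, documents))
```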

So, as can be understood, the use of an Inverted Index lets you search millions of web documents lightning fast.

P.S. – What is explained here is a very basic version of Indexing (using the Inverted Index concept, which happens in the background) and Querying. For a full-fledged Search Engine, besides Indexing and Querying, one needs to consider issues related to Crawling, the Forward Indexer/Content Processor (such indexers/processors do stemming, lemmatization, synonym expansion, entity extraction, part-of-speech tagging, etc.) and Ranking (Google became famous not because of its knowledge of Information Retrieval basics – Information Retrieval knowledge existed long before – but for its PageRank algorithm; a solid understanding of Linear Algebra and Probability Theory is a must to understand PageRank). Even then, there are lots of other things to be considered in Indexing and Querying (e.g., for availability, scalability and fast retrieval of search results, one needs to deploy each of these in a distributed environment). To get a gist of what the basic architecture (and related aspects) of a real Search Engine looks like, one can go through this post by Ricky Ho.

Also, I have tried to explain Web Search here in its most basic form – one can use similar concepts in Enterprise Search, where one needs to do indexing (using the same Inverted Index concept) on textual DB columns for full-text search. An open-source search engine framework that handles most search concerns is Lucene (Apache Lucene.NET page). Elasticsearch is another awesome option based on Lucene. FlexSearch and SearchRoo are two good end-to-end open-source samples of Search Engines developed in .NET. A very good .NET client for Solr is SolrNet. Also, check the Magic Quadrant for Enterprise Search from Gartner to get a gist of the vendors in the Enterprise Search market. For a solid understanding of Information Retrieval and associated topics, one can refer to Stanford's Information Retrieval Resources Recommendations.

