Code on Mars: Lucene for beginners

This tutorial is a brief introduction to Lucene search, meant to get you started with it. Lucene is an open source full-text search library maintained under the Apache project.

It is powerful and is used by many companies.

What are the prerequisites for understanding Lucene?

a) you should know what indexing means

b) you should have built a basic algorithm for searching strings

c) you should understand the vector space model (optional)

d) you should have a passion for search and how it works

Is Lucene scalable?

Yes, Lucene is a high-performance, scalable search library.

What are the basics of Lucene?

IndexWriter

This class creates or opens an index and adds, updates and removes documents in it.

Directory

This class represents the location of a Lucene index.

Analyzer

This class extracts the relevant words (tokens) from text before the text is indexed.

By relevant we mean that common words such as 'a', 'the', 'for' and 'this' are removed from the phrase or sentence. Not every analyzer does this, as the list below shows.

Types of Analyzers

WhitespaceAnalyzer: splits text into tokens on whitespace characters. It does not lowercase the tokens.

SimpleAnalyzer: splits text into tokens at non-letter characters, then lowercases each token. Because it splits on every non-letter, it quietly discards digits and punctuation.

StopAnalyzer: removes common English words (an, the, this etc.). Otherwise it works the same as SimpleAnalyzer.

StandardAnalyzer: has quite a bit of logic to identify certain kinds of tokens, such as company names, email addresses and hostnames. It also lowercases each token and removes stop words and punctuation. In most cases this is the analyzer to use.
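The differences between the analyzers are easiest to see by running the same text through each of them. The following is a minimal sketch, assuming Lucene.Net 3.0.x (the version the snippets in this post appear to target). The class name, the helper name and the sample sentence are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.Tokenattributes;

public static class AnalyzerDemo
{
    // Runs the text through the analyzer and collects the tokens it emits.
    public static List<string> Tokens(Analyzer analyzer, string text)
    {
        var tokens = new List<string>();
        // "body" is only a label here; these analyzers treat every field the same way.
        TokenStream stream = analyzer.TokenStream("body", new StringReader(text));
        ITermAttribute term = stream.AddAttribute<ITermAttribute>();
        while (stream.IncrementToken())
        {
            tokens.Add(term.Term);
        }
        return tokens;
    }

    public static void Main()
    {
        const string text = "The Quick brown-fox 42"; // made-up sample text
        Lucene.Net.Util.Version version = Lucene.Net.Util.Version.LUCENE_30;
        var analyzers = new Analyzer[]
        {
            new WhitespaceAnalyzer(),
            new SimpleAnalyzer(),
            new StopAnalyzer(version),
            new StandardAnalyzer(version)
        };
        foreach (Analyzer analyzer in analyzers)
        {
            string joined = string.Join(" | ", Tokens(analyzer, text).ToArray());
            Console.WriteLine(analyzer.GetType().Name + ": " + joined);
        }
    }
}
```

With this input, WhitespaceAnalyzer should keep "The", "Quick", "brown-fox" and "42" untouched, while SimpleAnalyzer should produce the lowercased letter runs "the", "quick", "brown" and "fox".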

Document

This class is a collection of fields, such as id, title and body.

Core searching classes

IndexSearcher

IndexSearcher is to searching what IndexWriter is to indexing. IndexSearcher opens the index in read-only mode.

Term

Term takes two parameters: the field name and the text to search for. During searching, you construct Term objects and use them together with TermQuery.

Query

Query is the common, abstract parent class of all queries. It contains several utility members, such as Query.Boost.

TermQuery

TermQuery is the most basic type of query supported by Lucene. It’s used for matching documents that contain fields with specific values.

TopDocs

TopDocs holds the most relevant search results, ranked by score.

Okay, let's start with a basic search application

Indexing data

Suppose our data consists of an id and a body. Our task is to index the body so that it can be searched, and to store the id so that each search result can be traced back to its record.

We need two paths:

a) A path to a directory where we store the Lucene index.

b) A path to a directory that contains the files we want to index.

FSDirectory: a directory on the hard disk.

RAMDirectory: a directory in RAM (faster to access, but the index is lost once the application stops).

new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30): the version argument tells Lucene which release's behaviour to match. If a newer version of Lucene introduces a change that causes bugs in your scenario, you can keep passing the older version that works for you.

true or false: whether to create a new index (true) or update the existing one (false).

IndexWriter.MaxFieldLength.LIMITED: not all tokens are indexed; only the first 10,000 terms of each field are.

IndexWriter.MaxFieldLength.UNLIMITED: all tokens are indexed.

Field.Store.YES: store the field value.

Field.Store.NO: do not store the value. The field can still be indexed, so it is searchable, but its value cannot be read back from the search results.

Field.Index.NO: do not index the field, so it cannot be searched.

Field.Index.ANALYZED: run the value through the analyzer and index the resulting tokens.
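Put together, the options above could be used as in the following sketch. It assumes Lucene.Net 3.0.x. The class name, method name and sample texts are made up, and a RAMDirectory stands in for the index path so that the example is self-contained.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

public static class IndexingDemo
{
    // Indexes each body under a generated id and returns how many documents the index holds.
    public static int BuildIndex(Directory directory, string[] bodies)
    {
        var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
        // true = create a new index (an existing one would be overwritten).
        using (var writer = new IndexWriter(directory, analyzer, true,
                                            IndexWriter.MaxFieldLength.UNLIMITED))
        {
            for (int i = 0; i < bodies.Length; i++)
            {
                var document = new Document();
                // id: stored for reference only, not searchable.
                document.Add(new Field("id", i.ToString(), Field.Store.YES, Field.Index.NO));
                // body: searchable, but the original text is not stored.
                document.Add(new Field("body", bodies[i], Field.Store.NO, Field.Index.ANALYZED));
                writer.AddDocument(document);
            }
            writer.Commit();
        }
        // Count the documents by matching all of them through a read-only searcher.
        using (var searcher = new IndexSearcher(directory, true))
        {
            return searcher.Search(new MatchAllDocsQuery(), 1).TotalHits;
        }
    }

    public static void Main()
    {
        // For an index on disk, FSDirectory.Open(new System.IO.DirectoryInfo(indexPath))
        // could be used instead, where indexPath is your own index folder.
        Directory directory = new RAMDirectory();
        string[] bodies = { "Lucene is a search library", "Indexing comes before searching" };
        Console.WriteLine(BuildIndex(directory, bodies));
    }
}
```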

Architecture of Segments in lucene

Every Lucene index consists of one or more segments. Each segment is a standalone index holding a subset of the indexed documents. During a search, each segment is visited separately and the results are combined.

Deleting documents from an index

DeleteDocuments(Term): deletes all documents containing the provided term.

DeleteDocuments(Query): deletes all documents matching the provided query.

DeleteAll(): deletes all documents in the index. This is exactly the same as closing the writer and opening a new writer with create=true, without having to close your writer.

writer.DeleteDocuments(new Term("ID", documentID)); // the "ID" field must be indexed for the delete to find it
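A delete by term can be tried out on a throwaway index, as in the sketch below (assuming Lucene.Net 3.0.x; the class and method names are hypothetical). The ID field is indexed with Field.Index.NOT_ANALYZED, an option not covered above that indexes the whole value as a single token, so the term matches the id exactly.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

public static class DeleteDemo
{
    // Indexes documents with ids "1", "2", "3", deletes one id, and returns the remaining count.
    public static int CountAfterDelete(string idToDelete)
    {
        Directory directory = new RAMDirectory();
        var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
        using (var writer = new IndexWriter(directory, analyzer, true,
                                            IndexWriter.MaxFieldLength.UNLIMITED))
        {
            foreach (string id in new[] { "1", "2", "3" })
            {
                var document = new Document();
                document.Add(new Field("ID", id, Field.Store.YES, Field.Index.NOT_ANALYZED));
                writer.AddDocument(document);
            }
            // Deletes every document whose ID field contains exactly this term.
            writer.DeleteDocuments(new Term("ID", idToDelete));
            writer.Commit();
        }
        using (var searcher = new IndexSearcher(directory, true))
        {
            return searcher.Search(new MatchAllDocsQuery(), 1).TotalHits;
        }
    }

    public static void Main()
    {
        Console.WriteLine(CountAfterDelete("2")); // one of the three documents is removed
        Console.WriteLine(CountAfterDelete("9")); // no document has this id, nothing is removed
    }
}
```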

Updating an Index

Sometimes you need to update certain fields in a document. For this Lucene provides UpdateDocument(term, document), but it works by first deleting the documents that contain the term and then adding the new document to the index.
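The delete-then-add behaviour can be observed with a small sketch like this one (assuming Lucene.Net 3.0.x; the names and texts are invented for the example). After the update the old body text is no longer found, the new text is, and the number of documents is unchanged.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

public static class UpdateDemo
{
    static Document Make(string id, string body)
    {
        var document = new Document();
        // NOT_ANALYZED keeps the id as one exact token so that a Term can address it.
        document.Add(new Field("ID", id, Field.Store.YES, Field.Index.NOT_ANALYZED));
        document.Add(new Field("body", body, Field.Store.NO, Field.Index.ANALYZED));
        return document;
    }

    // Returns { hits for "old", hits for "new", total documents } after updating document "1".
    public static int[] Run()
    {
        Directory directory = new RAMDirectory();
        var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
        using (var writer = new IndexWriter(directory, analyzer, true,
                                            IndexWriter.MaxFieldLength.UNLIMITED))
        {
            writer.AddDocument(Make("1", "old text"));
            writer.AddDocument(Make("2", "other text"));
            // Replaces the whole document whose ID is "1"; fields not re-added are gone.
            writer.UpdateDocument(new Term("ID", "1"), Make("1", "new text"));
            writer.Commit();
        }
        using (var searcher = new IndexSearcher(directory, true))
        {
            int oldHits = searcher.Search(new TermQuery(new Term("body", "old")), 1).TotalHits;
            int newHits = searcher.Search(new TermQuery(new Term("body", "new")), 1).TotalHits;
            int total = searcher.Search(new MatchAllDocsQuery(), 1).TotalHits;
            return new[] { oldHits, newHits, total };
        }
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", Array.ConvertAll(Run(), n => n.ToString())));
    }
}
```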

Boost factor

Boosting documents

If certain documents are more relevant in your scenario, you can give them a boost factor (any float value) so that they appear nearer the top of the search results.
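As a rough illustration, the sketch below (assuming Lucene.Net 3.0.x; the class name, ids and texts are made up) indexes two documents with identical bodies and boosts only the second one. With a boost greater than 1, the boosted document should come back first even though it was added last.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

public static class BoostDemo
{
    static Document Make(string id)
    {
        var document = new Document();
        document.Add(new Field("id", id, Field.Store.YES, Field.Index.NO));
        document.Add(new Field("body", "lucene search", Field.Store.NO, Field.Index.ANALYZED));
        return document;
    }

    // Returns the id of the top hit for the word "lucene".
    public static string TopId(float boost)
    {
        Directory directory = new RAMDirectory();
        var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
        using (var writer = new IndexWriter(directory, analyzer, true,
                                            IndexWriter.MaxFieldLength.UNLIMITED))
        {
            writer.AddDocument(Make("plain"));

            Document boosted = Make("boosted");
            boosted.Boost = boost; // the boost must be set before the document is added
            writer.AddDocument(boosted);
        }
        using (var searcher = new IndexSearcher(directory, true))
        {
            TopDocs top = searcher.Search(new TermQuery(new Term("body", "lucene")), 1);
            return searcher.Doc(top.ScoreDocs[0].Doc).Get("id");
        }
    }

    public static void Main()
    {
        Console.WriteLine(TopId(3f));
    }
}
```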

Boosting Fields

You can also set a boost factor on individual fields.

Field field = new Field("id", i.ToString(), Field.Store.YES, Field.Index.ANALYZED); // i : id

field.Boost = 5f;

document.Add(field);

Search Methods

Lucene's built-in query types:

TermQuery

Term t = new Term("contents", "java");

A TermQuery accepts a single Term:

Query query = new TermQuery(t);

This finds all documents that contain the word "java" in the contents field. TermQuery is especially useful for retrieving documents by key.
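To show how IndexSearcher, TermQuery and TopDocs fit together, here is a minimal end-to-end sketch, assuming Lucene.Net 3.0.x. The class name and the sample texts are made up. Note that a TermQuery is not passed through an analyzer, so the term has to match the indexed (lowercased) token exactly.

```csharp
using System;
using System.Collections.Generic;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

public static class TermQueryDemo
{
    // Builds a tiny in-memory index from made-up sample data.
    static Directory BuildIndex()
    {
        Directory directory = new RAMDirectory();
        var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
        string[] contents = { "Java in a nutshell", "Cooking for beginners", "Search with Java" };
        using (var writer = new IndexWriter(directory, analyzer, true,
                                            IndexWriter.MaxFieldLength.UNLIMITED))
        {
            for (int i = 0; i < contents.Length; i++)
            {
                var document = new Document();
                document.Add(new Field("id", i.ToString(), Field.Store.YES, Field.Index.NO));
                document.Add(new Field("contents", contents[i], Field.Store.NO, Field.Index.ANALYZED));
                writer.AddDocument(document);
            }
        }
        return directory;
    }

    // Returns the stored ids of the documents whose contents contain the given term.
    public static List<string> Search(string word)
    {
        var ids = new List<string>();
        using (var searcher = new IndexSearcher(BuildIndex(), true)) // true = read-only
        {
            Query query = new TermQuery(new Term("contents", word));
            TopDocs top = searcher.Search(query, 10); // keep at most 10 hits
            foreach (ScoreDoc hit in top.ScoreDocs)
            {
                ids.Add(searcher.Doc(hit.Doc).Get("id"));
            }
        }
        return ids;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", Search("java").ToArray()));
        // "Java" with a capital J finds nothing, because the indexed tokens are lowercase.
        Console.WriteLine(Search("Java").Count);
    }
}
```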

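TermRangeQuery

Matches documents whose terms in a field fall between a lower and an upper bound, compared in lexicographic (dictionary) order.

TermRangeQuery query = new TermRangeQuery("title", "b", "h", true, true);

The two boolean arguments say whether the lower and the upper bound themselves are included. The query above matches titles containing a term from "b" up to and including "h". It can be slow when many distinct terms fall inside the range.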
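The range semantics are easy to get wrong, so here is a small sketch, assuming Lucene.Net 3.0.x, with invented one-word titles. Only "banana" and "grape" fall between "b" and "h"; "hat" sorts after "h" and is therefore outside the range.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

public static class RangeDemo
{
    // Counts the titles that have a term between lower and upper (both inclusive).
    public static int CountInRange(string lower, string upper)
    {
        Directory directory = new RAMDirectory();
        var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
        string[] titles = { "apple", "banana", "grape", "hat", "kiwi" }; // sample data
        using (var writer = new IndexWriter(directory, analyzer, true,
                                            IndexWriter.MaxFieldLength.UNLIMITED))
        {
            foreach (string title in titles)
            {
                var document = new Document();
                document.Add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
                writer.AddDocument(document);
            }
        }
        using (var searcher = new IndexSearcher(directory, true))
        {
            var query = new TermRangeQuery("title", lower, upper, true, true);
            return searcher.Search(query, 10).TotalHits;
        }
    }

    public static void Main()
    {
        Console.WriteLine(CountInRange("b", "h"));
    }
}
```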

Query Parser
It turns text entered by the user into a Query object, using the given analyzer to split the text into tokens and to remove unwanted words from the sentence or phrase.

QueryParser parser = new QueryParser(Lucene.Net.Util.Version.LUCENE_30, "body", analyzer); // arguments: match version, default field to search (here "body"), analyzer
Query query = parser.Parse(searchTerm);

It is helpful for search strings like "server and its applications".
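The sketch below runs the example search string through a QueryParser and executes the resulting query. It assumes Lucene.Net 3.0.x, where QueryParser lives in the Lucene.Net.QueryParsers namespace. The class name and the indexed texts are made up.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;

public static class QueryParserDemo
{
    // Parses the user's text against the "body" field and returns the number of hits.
    public static int CountHits(string userText)
    {
        Directory directory = new RAMDirectory();
        var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
        string[] bodies = { "Server and its applications", "Database tuning" }; // sample data
        using (var writer = new IndexWriter(directory, analyzer, true,
                                            IndexWriter.MaxFieldLength.UNLIMITED))
        {
            foreach (string body in bodies)
            {
                var document = new Document();
                document.Add(new Field("body", body, Field.Store.YES, Field.Index.ANALYZED));
                writer.AddDocument(document);
            }
        }
        // Use the same analyzer for parsing as for indexing, so the tokens line up.
        var parser = new QueryParser(Lucene.Net.Util.Version.LUCENE_30, "body", analyzer);
        Query query = parser.Parse(userText);
        Console.WriteLine("Parsed query: " + query);
        using (var searcher = new IndexSearcher(directory, true))
        {
            return searcher.Search(query, 10).TotalHits;
        }
    }

    public static void Main()
    {
        Console.WriteLine(CountHits("server and its applications"));
    }
}
```

Printing the parsed query is a convenient way to see which words the analyzer kept and which it dropped.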

PrefixQuery

For searching terms beginning with a specific string.

BooleanQuery

BooleanQuery combines other queries, for example several TermQuery instances, into one. Each clause is added with a flag: MUST (the clause has to match, like AND), SHOULD (the clause may match, like OR) or MUST_NOT (the clause must not match).
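A possible combination of two TermQuery clauses is sketched here, assuming Lucene.Net 3.0.x, where the clause flags are values of the Occur enum. The class name and sample texts are invented. The query asks for documents that contain "java" but not "coffee".

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

public static class BooleanQueryDemo
{
    // Counts the documents that contain mustHave but do not contain mustNotHave.
    public static int CountHits(string mustHave, string mustNotHave)
    {
        Directory directory = new RAMDirectory();
        var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
        string[] bodies =
        {
            "java search library",
            "java coffee beans",
            "python search library"
        }; // sample data
        using (var writer = new IndexWriter(directory, analyzer, true,
                                            IndexWriter.MaxFieldLength.UNLIMITED))
        {
            foreach (string body in bodies)
            {
                var document = new Document();
                document.Add(new Field("body", body, Field.Store.YES, Field.Index.ANALYZED));
                writer.AddDocument(document);
            }
        }
        var query = new BooleanQuery();
        query.Add(new TermQuery(new Term("body", mustHave)), Occur.MUST);
        query.Add(new TermQuery(new Term("body", mustNotHave)), Occur.MUST_NOT);
        using (var searcher = new IndexSearcher(directory, true))
        {
            return searcher.Search(query, 10).TotalHits;
        }
    }

    public static void Main()
    {
        Console.WriteLine(CountHits("java", "coffee"));
    }
}
```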

WildcardQuery

If a term contains an asterisk or a question mark, it is treated as a WildcardQuery: * matches any sequence of characters and ? matches a single character. It can be very slow and should be used only when there is no other option.

FuzzyQuery

It is used for searching similar terms. For example, a search for "distance" also fetches results containing the word "distances".
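PrefixQuery, WildcardQuery and FuzzyQuery can be compared on the same small index, as in this sketch (assuming Lucene.Net 3.0.x; the class name and the word list are made up). The FuzzyQuery is given a minimum similarity of 0.8, stricter than the default, so that only close matches such as "distances" are accepted.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

public static class PatternQueryDemo
{
    // Builds a one-word-per-document index from made-up sample data.
    static Directory BuildIndex()
    {
        Directory directory = new RAMDirectory();
        var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
        string[] words = { "distance", "distances", "district", "instance" };
        using (var writer = new IndexWriter(directory, analyzer, true,
                                            IndexWriter.MaxFieldLength.UNLIMITED))
        {
            foreach (string word in words)
            {
                var document = new Document();
                document.Add(new Field("body", word, Field.Store.YES, Field.Index.ANALYZED));
                writer.AddDocument(document);
            }
        }
        return directory;
    }

    // Runs the query against the sample index and returns the number of hits.
    public static int Count(Query query)
    {
        using (var searcher = new IndexSearcher(BuildIndex(), true))
        {
            return searcher.Search(query, 10).TotalHits;
        }
    }

    public static void Main()
    {
        // Terms starting with "dist".
        Console.WriteLine(Count(new PrefixQuery(new Term("body", "dist"))));
        // Terms starting with "dist" and ending with "s".
        Console.WriteLine(Count(new WildcardQuery(new Term("body", "dist*s"))));
        // Terms very similar to "distance" (minimum similarity 0.8).
        Console.WriteLine(Count(new FuzzyQuery(new Term("body", "distance"), 0.8f)));
    }
}
```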

Friday 25 October 2013
