Friday, 25 October 2013

Lucene for beginners

In this tutorial I will give you a brief introduction to Lucene search to get you started. Lucene is an open-source library from Apache for full-text search.

It is powerful and is used by many companies.

What are the prerequisites for understanding Lucene?
a) you should know what indexing means
b) you should have built a basic search algorithm for searching strings
c) you should understand the vector space model (optional)
d) you should have a passion for search and how it works

Is Lucene scalable?
Yes, Lucene is a high-performance, scalable search library.




What are the basics of Lucene?
IndexWriter
This class creates or opens an index and adds, updates and removes documents in it.

Directory
This class represents the location of a Lucene index.

Analyzer
This class is responsible for extracting the relevant words (tokens) from text before it is indexed.
By relevant we mean, for example, removing words like 'a', 'the', 'for', 'this' from a phrase or sentence.

Types of Analyzers

WhitespaceAnalyzer : splits text into tokens on whitespace characters. It does not lowercase the tokens.

SimpleAnalyzer : first splits tokens at non-letter characters, then lowercases each token. Because it splits at every non-letter, this analyzer quietly discards digits and punctuation.

StopAnalyzer : removes common English words (an, the, this, etc.). Otherwise it works the same as SimpleAnalyzer.

StandardAnalyzer : It has quite a bit of logic to identify certain kinds of tokens, such as company names, email addresses, and hostnames. It also lowercases each token and removes stop words and punctuation.
In most cases this analyzer is used.
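
To see how the analyzers differ, it helps to print the tokens each one produces. The following is a minimal sketch, assuming the Lucene.Net 3.0.3 API (the version used by the snippets in this post); the class name AnalyzerDemo and the field name "body" are made up for illustration.

```csharp
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Tokenattributes;

// Hypothetical helper, not part of Lucene.Net.
public static class AnalyzerDemo
{
    // Runs the text through the analyzer and collects the tokens it emits.
    public static List<string> Tokens(Analyzer analyzer, string text)
    {
        var tokens = new List<string>();
        // "body" is only a label here; these analyzers treat all fields alike.
        TokenStream stream = analyzer.TokenStream("body", new StringReader(text));
        ITermAttribute term = stream.AddAttribute<ITermAttribute>();
        while (stream.IncrementToken())
        {
            tokens.Add(term.Term);
        }
        return tokens;
    }
}
```

For example, calling AnalyzerDemo.Tokens with new WhitespaceAnalyzer(), new SimpleAnalyzer(), new StopAnalyzer(Lucene.Net.Util.Version.LUCENE_30) and new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30) on the same sentence lets you compare the token lists side by side.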


Document
This class is a collection of fields, such as id, title, body, etc.

Core searching classes
IndexSearcher
IndexSearcher is to searching what IndexWriter is to indexing. IndexSearcher opens the index in read-only mode.

Term
A Term takes 2 parameters: the field name and the text to search for. During searching, you construct Term objects and use them together with TermQuery.

Query
Query is the common, abstract parent class. It contains several utility members, like Query.Boost.

TermQuery
TermQuery is the most basic type of query supported by Lucene. It’s used for matching documents that contain fields with specific values.

TopDocs
It holds the top-ranked (most relevant) search results returned by a search.

Okay, let's start with a basic search application.

Indexing data
Suppose we have data that includes an id and a body. Our task is to store these documents, index the body for search, and just store the id for reference (so that we can act on a result using its id).










We need 2 paths:
a) A path to a directory where we store the Lucene index.
b) A path to a directory that contains the files we want to index.

FSDirectory - for a directory on the hard disk
RAMDirectory - for a directory in RAM (faster access, but the index is lost once the application stops running).

new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30) : the version argument tells the analyzer which Lucene release's behaviour to match. If a newer version causes problems in your scenario, you can pass the constant of an older version that works for you.

true or false : true creates a new index (replacing any existing one), false opens the existing index for updating.

IndexWriter.MaxFieldLength.LIMITED :    only the first 10,000 tokens of each field are indexed
IndexWriter.MaxFieldLength.UNLIMITED :  all tokens are indexed

Field.Store.YES : store the field value
Field.Store.NO : do not store the value. You can still index it (the field is searchable), but its value will not be available in the results.

Field.Index.NO : the field is not indexed
Field.Index.ANALYZED : the text is run through the analyzer and the resulting tokens are indexed.
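
The options above can be put together as follows. This is only a sketch under assumptions, not the original code of this post: it assumes the Lucene.Net 3.0.3 API, the class name IndexingDemo is made up, and indexing the id with Field.Index.NOT_ANALYZED (one untokenized term) is my own choice so that the id can later be used as a key.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;

// Hypothetical helper, not part of Lucene.Net.
public static class IndexingDemo
{
    // Indexes one document per body and returns how many were added.
    public static int BuildIndex(Directory directory, string[] bodies)
    {
        var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
        int added = 0;
        // true = create a new index, replacing any existing one in this directory.
        using (var writer = new IndexWriter(directory, analyzer, true,
                                            IndexWriter.MaxFieldLength.UNLIMITED))
        {
            for (int i = 0; i < bodies.Length; i++)
            {
                var document = new Document();
                // id: stored for reference, indexed as a single term (assumption).
                document.Add(new Field("id", i.ToString(),
                                       Field.Store.YES, Field.Index.NOT_ANALYZED));
                // body: analyzed for full-text search, value not stored.
                document.Add(new Field("body", bodies[i],
                                       Field.Store.NO, Field.Index.ANALYZED));
                writer.AddDocument(document);
                added++;
            }
            writer.Commit();
        }
        return added;
    }
}
```

Pass new RAMDirectory() for a throwaway in-memory index, or FSDirectory.Open(new System.IO.DirectoryInfo(indexPath)) to keep the index on disk, where indexPath is the first of the two paths mentioned above.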

















Architecture of Segments in lucene
Every Lucene index consists of one or more segments. Each segment is a standalone index, holding a subset of the indexed documents. During search, each segment is visited separately and the results are combined.


Deleting documents from an index
DeleteDocuments(Term) :  deletes all documents containing the provided term.
DeleteDocuments(Query) :  deletes all documents matching the provided query.
DeleteAll() :  deletes all documents in the index. This is exactly the same as closing the writer and opening a new writer with create=true, without having to close your writer.

writer.DeleteDocuments(new Term("ID", documentID)); // the ID field must be indexed for the delete to find the document

Updating an Index
Sometimes you need to update certain fields in a document. For this Lucene provides a method called UpdateDocument(term, document). It works by first deleting every document that contains the term, then adding the new document to the index.
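
A small sketch of update and delete working together, assuming the Lucene.Net 3.0.3 API. The class name UpdateDeleteDemo, the helper Make and the sample texts are made up for illustration; the id field is indexed as a single term so it can act as a key.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

// Hypothetical helper, not part of Lucene.Net.
public static class UpdateDeleteDemo
{
    private static Document Make(string id, string body)
    {
        var document = new Document();
        document.Add(new Field("id", id, Field.Store.YES, Field.Index.NOT_ANALYZED));
        document.Add(new Field("body", body, Field.Store.YES, Field.Index.ANALYZED));
        return document;
    }

    // Returns how many documents remain after one update and one delete.
    public static int Run()
    {
        var directory = new RAMDirectory();
        var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
        using (var writer = new IndexWriter(directory, analyzer, true,
                                            IndexWriter.MaxFieldLength.UNLIMITED))
        {
            writer.AddDocument(Make("1", "first draft"));
            writer.AddDocument(Make("2", "second draft"));
            // Delete the old document with id 1, then add the new version.
            writer.UpdateDocument(new Term("id", "1"), Make("1", "first final"));
            // Remove the document with id 2.
            writer.DeleteDocuments(new Term("id", "2"));
            writer.Commit();
        }
        using (var searcher = new IndexSearcher(directory, true))
        {
            return searcher.Search(new MatchAllDocsQuery(), 10).TotalHits;
        }
    }
}
```

The update does not create a second copy of document 1, because the old version is deleted before the new one is added.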

Boost factor

Boosting documents
If certain documents are more relevant in your scenario, you can boost them by setting a boost factor (any float value). Those documents will then rank higher in the search results.
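
As a rough illustration of document boosting, here is a sketch assuming the Lucene.Net 3.0.3 API, where the boost is set through the Document.Boost property at indexing time. The class name DocumentBoostDemo, the ids and the boost value 3f are made up for the example.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

// Hypothetical helper, not part of Lucene.Net.
public static class DocumentBoostDemo
{
    private static Document Make(string id, string body, float boost)
    {
        var document = new Document();
        document.Add(new Field("id", id, Field.Store.YES, Field.Index.NOT_ANALYZED));
        document.Add(new Field("body", body, Field.Store.NO, Field.Index.ANALYZED));
        document.Boost = boost; // 1f is the default, larger values rank higher
        return document;
    }

    // Indexes two documents with the same text and returns the id of the top hit.
    public static string TopId()
    {
        var directory = new RAMDirectory();
        var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
        using (var writer = new IndexWriter(directory, analyzer, true,
                                            IndexWriter.MaxFieldLength.UNLIMITED))
        {
            writer.AddDocument(Make("plain", "lucene search", 1f));
            writer.AddDocument(Make("boosted", "lucene search", 3f));
            writer.Commit();
        }
        using (var searcher = new IndexSearcher(directory, true))
        {
            TopDocs hits = searcher.Search(new TermQuery(new Term("body", "lucene")), 10);
            return searcher.Doc(hits.ScoreDocs[0].Doc).Get("id");
        }
    }
}
```

Both documents contain exactly the same text, so the only thing that separates their scores is the boost.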



Boosting Fields
You can also set boost factor for certain fields.
Field field = new Field("id", i.ToString(), Field.Store.YES, Field.Index.ANALYZED); // i : id
field.Boost = 5f;
document.Add(field);


Search Methods

Lucene built-in Query types:

TermQuery
A TermQuery accepts a single Term:
Term t = new Term("contents", "java");
Query query = new TermQuery(t);
It will find all documents whose "contents" field contains the word "java". TermQuery is especially useful for retrieving documents by key.

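TermRangeQuery
For searching for terms that fall, in sorted (lexicographic) order, between a lower and an upper term.
TermRangeQuery query = new TermRangeQuery("title", "b", "h", true, true);
This matches documents whose title contains a term from "b" up to "h"; the two booleans make both ends inclusive. It can be slow when the range covers many distinct terms.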

Query Parser
It turns user-entered text into a Query object, using the given analyzer to split the text into tokens and remove unwanted words from the sentence or phrase.

QueryParser parser = new QueryParser(Lucene.Net.Util.Version.LUCENE_30, "body", analyzer); // matchVersion, default field, analyzer
Query query = parser.Parse(searchTerm);

It is helpful for searching free text like "server and its applications".
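
To show the parser inside a complete search, here is a sketch assuming the Lucene.Net 3.0.3 API. The class name QueryParserDemo and the three sample documents are made up; the same StandardAnalyzer is used for indexing and for parsing so that both sides produce the same tokens.

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;

// Hypothetical helper, not part of Lucene.Net.
public static class QueryParserDemo
{
    // Indexes three sample bodies, then returns the ids of the hits, best first.
    public static List<string> Search(string userText)
    {
        string[] bodies = { "server and its applications", "desktop applications", "web server" };
        var directory = new RAMDirectory();
        var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
        using (var writer = new IndexWriter(directory, analyzer, true,
                                            IndexWriter.MaxFieldLength.UNLIMITED))
        {
            for (int i = 0; i < bodies.Length; i++)
            {
                var document = new Document();
                document.Add(new Field("id", i.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
                document.Add(new Field("body", bodies[i], Field.Store.NO, Field.Index.ANALYZED));
                writer.AddDocument(document);
            }
            writer.Commit();
        }

        // "body" is the default field for terms that do not name a field.
        var parser = new QueryParser(Lucene.Net.Util.Version.LUCENE_30, "body", analyzer);
        Query query = parser.Parse(userText);

        var ids = new List<string>();
        using (var searcher = new IndexSearcher(directory, true))
        {
            TopDocs hits = searcher.Search(query, 10);
            foreach (ScoreDoc hit in hits.ScoreDocs)
            {
                ids.Add(searcher.Doc(hit.Doc).Get("id"));
            }
        }
        return ids;
    }
}
```

By default the parser combines the words with OR, so a document that contains only some of the words still matches, but ranks lower than one that contains all of them.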



PrefixQuery
For searching for terms beginning with a specific string.

BooleanQuery
Normal queries can be combined using a BooleanQuery. Each clause is added with an occurrence flag: MUST (AND), SHOULD (OR) or MUST_NOT (NOT).
Below example is for normal TermQuery.
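
A minimal sketch of such a combination, assuming the Lucene.Net 3.0.3 API, where the occurrence flags live in the Occur enum. The class name BooleanQueryDemo and the sample documents are made up for illustration.

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

// Hypothetical helper, not part of Lucene.Net.
public static class BooleanQueryDemo
{
    // Returns ids of documents that contain "server" but not "desktop".
    public static List<string> ServerButNotDesktop()
    {
        string[] bodies = { "web server", "desktop server", "desktop applications" };
        var directory = new RAMDirectory();
        var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
        using (var writer = new IndexWriter(directory, analyzer, true,
                                            IndexWriter.MaxFieldLength.UNLIMITED))
        {
            for (int i = 0; i < bodies.Length; i++)
            {
                var document = new Document();
                document.Add(new Field("id", i.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
                document.Add(new Field("body", bodies[i], Field.Store.NO, Field.Index.ANALYZED));
                writer.AddDocument(document);
            }
            writer.Commit();
        }

        var query = new BooleanQuery();
        // TermQuery text is not analyzed, so it must match the lowercased indexed term.
        query.Add(new TermQuery(new Term("body", "server")), Occur.MUST);
        query.Add(new TermQuery(new Term("body", "desktop")), Occur.MUST_NOT);

        var ids = new List<string>();
        using (var searcher = new IndexSearcher(directory, true))
        {
            foreach (ScoreDoc hit in searcher.Search(query, 10).ScoreDocs)
            {
                ids.Add(searcher.Doc(hit.Doc).Get("id"));
            }
        }
        return ids;
    }
}
```

Replacing Occur.MUST with Occur.SHOULD on every clause gives OR behaviour instead.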



























WildcardQuery
If a term contains an asterisk (matches zero or more characters) or a question mark (matches exactly one character), it is treated as a WildcardQuery. It can be very slow and should be used only when there is no other option.


FuzzyQuery
It is used for searching for similar terms. For example, if you search for "distance", it will also fetch results containing the word "distances".
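
The pattern-style queries from this section (TermRangeQuery, PrefixQuery, WildcardQuery and FuzzyQuery) can be tried side by side on a tiny index. This is a sketch assuming the Lucene.Net 3.0.3 API with FuzzyQuery's default similarity threshold; the class name PatternQueryDemo and the three sample words are made up.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

// Hypothetical helper, not part of Lucene.Net.
public static class PatternQueryDemo
{
    // Indexes one word per document and returns how many documents match the query.
    public static int Count(Query query)
    {
        string[] words = { "distance", "distances", "lucene" };
        var directory = new RAMDirectory();
        var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
        using (var writer = new IndexWriter(directory, analyzer, true,
                                            IndexWriter.MaxFieldLength.UNLIMITED))
        {
            foreach (string word in words)
            {
                var document = new Document();
                document.Add(new Field("body", word, Field.Store.YES, Field.Index.ANALYZED));
                writer.AddDocument(document);
            }
            writer.Commit();
        }
        using (var searcher = new IndexSearcher(directory, true))
        {
            return searcher.Search(query, 10).TotalHits;
        }
    }

    // Terms from "d" to "e", both ends inclusive.
    public static Query Range() { return new TermRangeQuery("body", "d", "e", true, true); }

    // Terms starting with "dist".
    public static Query Prefix() { return new PrefixQuery(new Term("body", "dist")); }

    // "dist", then anything, then a final "s".
    public static Query Wildcard() { return new WildcardQuery(new Term("body", "dist*s")); }

    // Terms similar to "distance".
    public static Query Fuzzy() { return new FuzzyQuery(new Term("body", "distance")); }
}
```

For example, PatternQueryDemo.Count(PatternQueryDemo.Fuzzy()) counts the exact term "distance" as well as the near match "distances", while "lucene" is too different to match.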

