Friday, January 16, 2015

Apache Solr 2014: A Year In Review

The last few years have seen Apache Solr move at a reasonably aggressive pace. 2013 marked the debut of SolrCloud: a newly architected, truly distributed mode of Solr. With each new release in 2014, Solr, and SolrCloud in particular, matured in terms of ease of use, more predictable scalability, and reliability in production environments.

Continue reading the post on the official Lucidworks blog here:


Tuesday, October 29, 2013

Third Bangalore Apache Solr/Lucene Meetup

We just had the third Bangalore Apache Solr/Lucene meetup this last weekend. It's good fun to see the community grow to 200+. Actually, as I type this, we're already 3 short of the 250 mark.

At the request of quite a few members, we had a "Solr 101" session by Varun Thacker from Unbxd. Right before leaving for his talk at Lucene Revolution Europe 2013, he got a lot of attendees both introduced to and interested in Solr.




His talk was followed by a talk on a "DIY Bookmarks Manager using Lucene" by Umar Shah from SimplyPhi, Bangalore. It was a nice demo, and I'm sure it again motivated people to try out similar DIY projects using either Lucene or Solr.



This was followed by a quick talk by Ganesh M from Dell Sonicwall on "Building Custom Analyzers in Lucene". A short, quick take on a complex but interesting part of Lucene, it certainly got the advanced users interested.



The last talk of the day was mine, on "MoreLikeThis in Solr/Lucene". I started off with some history of the current MLT implementation, how it actually works, and the kinds of issues people are known to run into when using the feature. I also spoke about a MoreLikeThis QueryParser for Solr that I've been working on as a part of my official work at LucidWorks. We plan to open-source it as soon as I find the time to put it out and document it a bit.





This may well be the last meetup I organize and attend in Bangalore this year. Good luck to Varun and Shalin, who'll be organizing these going forward.

Thanks to Microsoft Accelerator for giving us the venue to host the meetup yet again. It's one of the most centrally located and well equipped spaces in Bangalore for such meetups.

In case you'd like to join the meetup group and be a part of the active community, here's where to do that: http://www.meetup.com/Bangalore-Apache-Solr-Lucene-Group/ .

Collection Aliasing in SolrCloud

One of the many features that have come out of SolrCloud is collection aliasing. As the name suggests, it aliases a collection; in other words, it gives a collection another name, a pointer that can be changed behind the scenes.
Among other things, the most important use of aliasing a collection is the ability to change the underlying index without having to change or modify the client applications; it decouples the view the clients see from the actual index. So let's see how we can practically use this feature for some rather common tasks.

Collection aliasing command:
http://<hostname>:<port>/solr/admin/collections?
                                         action=CREATEALIAS&
                                         name=alias-name&
                                         collections=list-of-collections

Fig 1. myindex with a read and a write alias each

Firstly, it gives users the ability to reindex their content into another collection and then swap it out. If you begin with a setup as in Fig. 1, you'd follow these steps (example calls are sketched after the list):

  • Switch the write alias to a new index.
  • Start re-indexing using the index update client that you use. That way you never change the name/alias but behind the scenes, all the updates go to a new index.
  • Once the re-indexing is complete, change the read alias to use the new index too.
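
For example, if the aliases in Fig. 1 were named myindex-write and myindex-read and the freshly built collection is myindex_v2 (names assumed here purely for illustration), each switch is just a fresh CREATEALIAS call:

http://<hostname>:<port>/solr/admin/collections?
                                         action=CREATEALIAS&
                                         name=myindex-write&
                                         collections=myindex_v2

and, once re-indexing into myindex_v2 is complete:

http://<hostname>:<port>/solr/admin/collections?
                                         action=CREATEALIAS&
                                         name=myindex-read&
                                         collections=myindex_v2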

Updating an existing alias:
An existing alias can be updated with just a fresh CREATEALIAS call with the new alias specifications.

Secondly, the collection aliasing command lets users specify a single name for a set of collections. This comes in handy if the data is partitioned into time windows, e.g. month-wise. Every month can be a collection by itself, and aliases like last-month or last-quarter can point to the appropriate set of monthly collections. It can also be useful as data gets added, e.g. in the case of travel/geo search. A continent could be an alias consisting of collections holding data for certain countries; as data for other countries from the continent comes in, you may create new collections for them and add those to the existing continent alias.
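
As a sketch (collection names assumed here for illustration), a last-quarter alias spanning three monthly collections would be created with:

http://<hostname>:<port>/solr/admin/collections?
                                         action=CREATEALIAS&
                                         name=last-quarter&
                                         collections=oct_2013,nov_2013,dec_2013

Queries against last-quarter would then fan out over the three collections. Aliases that span multiple collections are generally meant for reads; updates are best sent to an alias that points to a single collection.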

There's no real limit to what aliasing can be used for in practice, but I hope the use cases above give you an idea of what aliasing is broadly about.

Related Readings:
JIRA: https://issues.apache.org/jira/browse/SOLR-4497

Apache Solr Guide


Sunday, July 28, 2013

Second Bangalore Apache Solr/Lucene Meetup

After a rather good first meetup, I, along with Shalin Mangar and Varun Thacker, organised the second one for the Bangalore Apache Solr/Lucene community on 27th July, 2013. The meetup group has grown from the 85-odd members it had around the time of the first meetup, a month and a half ago, to almost 200.

Though we weren't able to get the same venue as last time, i.e. Microsoft Accelerator, due to availability issues, we'd like to thank Flipkart for providing us with more than we bargained for: from a nice space to host the event to a constant flow of breakfast and coffee.

As for the event itself, the audience, like last time, was a mixed bag. Though more than a few people were completely new to Solr and very few were advanced Solr users, it was good to have so many people take an interest in the Apache Lucene/Solr project.

Shalin Shekhar Mangar from LucidWorks started the talks for the day with a short 'Introduction to Solr' that enabled the newbies in the audience to relate to most of what was spoken about through the rest of the day. He also demoed a working Solr instance, the admin interface, and document upload from the UI (new in Solr 4.4).

This was followed by a talk by Venkatesh Prasanna from Infosys Technologies. He spoke about 'Knowledge Management at Infosys' and how they use Solr and Nutch within Infosys.

Flipkart didn't just give us a venue but also a talk about what happens in Flipkart Search, along with some insight on e-commerce search challenges in general. Umesh Prasad and Thejus VM spoke about their architecture and why the vector space model doesn't work for them.

The last talk of the day by Jaideep Dhok from Inmobi kept people interested until we called it a day. Talking about "DIY Percolator using Lucene", he highlighted possibilities, use cases and also demoed his code.

All in all, it was a productive day and I assume everyone found it worth waking up early, on a Saturday morning.

In case you're interested in attending these events, feel free to join the meetup group [here]. Also, if you'd like to present at the next meetup, kindly drop me a line.

P.S.: The presentations from the meetup should be up soon. We are also figuring out whether the video recording (especially the audio) quality was good enough to be uploaded and shared. If we end up posting those, I'll share the links on the meetup page.

Thursday, June 20, 2013

Shard Splitting in SolrCloud

It's been around two years since my last post, and after this hiatus, I intend to be more regular at it.
I've been a part of a couple of organizations since then. After being part of the small team that launched the first-ever AWS search service, CloudSearch, I am back in the open source world.

In Jan '13 I joined LucidWorks, the primary company backing and supporting the Apache Solr project. The last few months have been spent trying to absorb the changes the Apache Lucene/Solr project went through over the year and a half I was busy with A9. Since then, I've also actively worked on and contributed to the Shard Splitting feature (SOLR-3755) for SolrCloud, along with Shalin Shekhar Mangar.

It certainly feels good to be back with this stuff, and to begin with, here's my post on Shard Splitting in SolrCloud: http://searchhub.org/2013/06/19/shard-splitting-in-solrcloud/.
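
For the impatient, shard splitting is driven through the Collections API. A sketch of splitting shard1 of a collection named mycollection (names assumed here for illustration) looks like:

http://<hostname>:<port>/solr/admin/collections?
                                         action=SPLITSHARD&
                                         collection=mycollection&
                                         shard=shard1

This creates two sub-shards, each covering one half of the original shard's hash range; the post linked above covers the details and the caveats.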


Monday, August 29, 2011

Indexing MongoDB Data for Solr

Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project.
MongoDB (from "humongous") is a scalable, high-performance, open source, document-oriented data store. 

I was happy using MongoDB and my very own search engine written using/extending Lucene, until the Solr and Lucene trunks were merged. The merge meant that Solr now used the same release of Lucene that I was using, unlike in the past when there was some disconnect between the two. I realized that a lot of what I was trying to build was already available through Solr.
Though Solr is used by a lot of organizations (which can be found here), and I'm sure at least a few of them use Mongo, for some reason there was/is no straightforward, out-of-the-box import handler for data stored in MongoDB.

This made me search for a framework/module/plugin to do the job, but in vain.

All said and done, here's how I finally was able to index my MongoDB data into Solr.
I've used SolrJ to access my Solr instance and the Mongo Java driver to connect to Mongo. Having written my own sweet layer that has access to both elements of the app, I've been able to inject data as required.

--snip--
// SolrJ 3.x-era client; newer SolrJ releases replace CommonsHttpSolrServer with HttpSolrServer/HttpSolrClient
public SolrServer getSolrServer(String solrHost, String solrPort) throws MalformedURLException {
    String urlString = "http://" + solrHost + ":" + solrPort + "/solr";
    return new CommonsHttpSolrServer(urlString);
}

--/snip--
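
In case you haven't used the Mongo Java driver before, here's a minimal sketch of obtaining the cursor that the next snippet iterates over. The host, port, database and collection names are assumptions for illustration, and the API shown is the 2.x driver's:

--snip--
// Connect to Mongo and fire the query (2.x Java driver)
Mongo mongo = new Mongo("localhost", 27017);          // assumed host/port; throws UnknownHostException
DB db = mongo.getDB("mydb");                          // assumed database name
DBCollection collection = db.getCollection("docs");   // assumed collection name
DBCursor curr = collection.find();                    // fetch everything; add query criteria as needed
--/snip--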

Fire the mongo query, iterate and add to the index

--snip--

SolrServer server = getSolrServer(..); // Get a server instance

DBCursor curr = ..; // Fire query @ mongo, get the cursor

while (curr.hasNext()) { // iterate over the result set
    BasicDBObject record = (BasicDBObject) curr.next();
    MyDoc doc = toBean(record); // Do some magic: map the Mongo record to an @Field-annotated SolrJ bean (MyDoc/toBean() are your own code)
    server.addBean(doc);
}
server.commit(); // make the newly added documents searchable
--/snip--

This should get you started on your way to indexing Mongo data into a running Solr instance.
Also, remember to configure Solr correctly (the schema fields in particular) for this to run smoothly.

Download Resources:

Tuesday, August 9, 2011

Searching off the RAM

Search engines are a lot about precision, recall and speed. These three factors pretty much define the quality of a search engine. I'll only talk about the last point here: speed. The time a search engine takes to answer a query is such a critical factor that an improvement of a few hundred milliseconds is of extreme importance to anyone developing or designing search engines.
More often than not, as a short-term gain, we look at throwing more money at hardware to improve a system's performance. Though this might look like a solution, it's bound to fail if you use it to run away from fixing the application architecture, which is generally the root cause of poorly performing applications.
For those who have already done whatever it takes to optimize the search process, here are a couple of ways that are generally used to host the search index in RAM in order to improve search speed.
You may mount a tmpfs/ramfs on your machine and copy the index onto it. You can then open an index reader/searcher on this copy of the index. This helps reduce I/O latency and improves search speed.
The differences between tmpfs and ramfs are:
  • ramfs will grow dynamically unlike tmpfs which is static and initialized with a fixed size.
  • tmpfs uses swap memory whereas ramfs doesn't.
I have personally used tmpfs and it works rather efficiently.
One thing to remember is that both tmpfs and ramfs are volatile. They both get erased on a system restart and hence you'd need to re-mount and copy the index on system startup.

Mounting tmpfs: 
mkdir -p /mnt/tmp 
mount -t tmpfs -o size=20m tmpfs /mnt/tmp
 
Mounting ramfs (note that ramfs ignores the size option and will simply grow as needed):
mkdir -p /mnt/ram 
mount -t ramfs ramfs /mnt/ram
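
Once the index has been copied onto the mount, opening a searcher on it is no different from opening one on disk. A minimal sketch, assuming the index was copied to /mnt/tmp/index:

--Snip--

// Open the copy of the index that lives on the tmpfs/ramfs mount
FSDirectory dir = FSDirectory.open(new File("/mnt/tmp/index"));
IndexReader reader = IndexReader.open(dir);
IndexSearcher searcher = new IndexSearcher(reader);

--Snip--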

The other approach, in case you're using Lucene, is to have the IndexReader use a RAMDirectory. In this case you'd need a JVM instance with enough heap to load all of the index contents into memory, and then some more for processing search queries.
That may well translate to 'requiring' a 64-bit JVM so that it can use a pseudo-unlimited address space.

--Snip--

// Open the on-disk index and load a complete copy of it into memory
NIOFSDirectory nDir = new NIOFSDirectory(new File(indexDir));
RAMDirectory directory = new RAMDirectory(nDir);

// Open the reader/searcher against the in-memory copy
IndexReader ir = IndexReader.open(directory);
IndexSearcher indexSearcher = new IndexSearcher(ir);

--Snip--

I haven't been particularly fond of this approach as it effectively forces a 64-bit architecture, but it certainly works, and there's no overhead of manually managing the data as with maintaining a tmpfs: the re-mounting, the clean-ups, etc.

These are the two basic techniques to use if you want your index served off the RAM. It's a frequent question on the Lucene users mailing list, so perhaps people can now stop asking it... well.. almost...
All said and done, don't stop optimizing the engine/app if your search is slow; 99% of the time, that is where the problem has to be handled.