<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-6888929711570901099</id><updated>2011-11-08T22:33:27.563+05:30</updated><category term='solr'/><category term='sparse data'/><category term='technology'/><category term='new delhi'/><category term='mongo'/><category term='sphinx'/><category term='search engine'/><category term='info edge'/><category term='benchmark'/><category term='open source'/><category term='go dutch'/><category term='generic code'/><category term='naukri labs'/><category term='shiksha'/><category term='lucene 2.3.1'/><category term='new media'/><category term='extensible code'/><category term='lucene optimization'/><category term='lucid imagination'/><category term='improving search speed'/><category term='solrj'/><category term='data solution'/><category term='apache'/><category term='facebook'/><category term='vector space'/><category term='lucid gaze'/><category term='teche'/><category term='auto complete'/><category term='ramdirectory'/><category term='anshum'/><category term='ramfs'/><category term='intent'/><category term='comparision'/><category term='lucene'/><category term='anshum gupta'/><category term='bomb blast'/><category term='government'/><category term='anti terrorism'/><category term='indexing'/><category term='lucene statistical analysis'/><category term='naukri'/><category term='india'/><category term='mongodb'/><category term='open source search engine'/><category term='image search'/><category term='official blog'/><category term='document representation'/><category term='twitter'/><category term='tmpfs'/><category term='search'/><category term='configuration vs extending'/><category 
term='IR'/><category term='lucene-2.4.1'/><category term='google'/><title type='text'>Anshum's Blog : The Artificial Intelligence Cafe`</title><subtitle type='html'>Sharing thoughts on AI. Primarily stuff related to Search Engines.
Expect a lot of NLP, Machine Learning and Collective Intelligence (and Wisdom of Crowd). Also at times, my personal thoughts on other engines (the critic that I am)!</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://ai-cafe.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6888929711570901099/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://ai-cafe.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Anshum</name><uri>http://www.blogger.com/profile/12495708906968224036</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>12</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-6888929711570901099.post-7775297459445277090</id><published>2011-08-29T14:53:00.002+05:30</published><updated>2011-08-29T15:00:27.310+05:30</updated><category scheme='http://www.blogger.com/atom/ns#' term='indexing'/><category scheme='http://www.blogger.com/atom/ns#' term='solr'/><category scheme='http://www.blogger.com/atom/ns#' term='lucene'/><category scheme='http://www.blogger.com/atom/ns#' term='solrj'/><category scheme='http://www.blogger.com/atom/ns#' term='mongo'/><category scheme='http://www.blogger.com/atom/ns#' term='search'/><category scheme='http://www.blogger.com/atom/ns#' term='mongodb'/><title type='text'>Indexing MongoDB Data for Solr</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;&lt;div style="text-align: left;"&gt;&lt;a href="http://lucene.apache.org/solr/"&gt;&lt;b&gt;&lt;span style="font-size: 
x-large;"&gt;S&lt;/span&gt;&lt;/b&gt;olr&lt;/a&gt; is the popular, blazing-fast open source enterprise search platform from the Apache &lt;a href="http://lucene.apache.org/"&gt;Lucene&lt;/a&gt; project.&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;a href="http://www.mongodb.org/"&gt;MongoDB&lt;/a&gt; (from "hu&lt;b&gt;mongo&lt;/b&gt;us") is a scalable, high-performance, &lt;a href="http://www.mongodb.org/display/DOCS/Source+Code"&gt;open source&lt;/a&gt;, document-oriented data store.&amp;nbsp;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;I was happy using MongoDB and my very own search engine, written by using and extending Lucene, until the trunks for Solr and Lucene were merged. The merge meant that Solr now used the same release of Lucene that I was using, unlike in the past, when there was some disconnect between the two. I realized that a lot of what I was trying to build was available through Solr.&lt;/div&gt;&lt;div style="text-align: left;"&gt;Though Solr is used by a lot of organizations (which can be found &lt;a href="http://wiki.apache.org/solr/PublicServers#Solr_Powered"&gt;here&lt;/a&gt;) and I'm sure that at least a few of them are using Mongo, for some reason there was/is no straightforward, out-of-the-box import handler for data stored in MongoDB.&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;This made me search for a framework/module/plugin to do the same, but in vain.&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;All said and done, here's how I was finally able to index my MongoDB data into Solr.&lt;/div&gt;&lt;div style="text-align: left;"&gt;I've used &lt;a href="http://wiki.apache.org/solr/Solrj"&gt;SolrJ&lt;/a&gt; to access my Solr instance and a Mongo driver to connect to Mongo. 
Having written my own sweet layer that has access to both the elements of the app, I have been able to inject data as required.&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;b&gt;&lt;i&gt;--snip--&lt;/i&gt;&lt;/b&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;i&gt;public SolrServer getSolrServer(String solrHost, String solrPort) throws MalformedURLException {&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; String urlString = "http://"+solrHost+":"+solrPort+"/solr";&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; return new CommonsHttpSolrServer(urlString);&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;i&gt;}&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;i&gt;&lt;br /&gt;&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;b&gt;&lt;i&gt;--/snip--&lt;/i&gt;&lt;/b&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;Fire the mongo query, iterate and add to the index&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;b&gt;&lt;i&gt;--snip--&lt;/i&gt;&lt;/b&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;i&gt;&lt;br /&gt;&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;i&gt;SolrServer server = getSolrServer(..); //Get a server instance&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;i&gt;&lt;br /&gt;&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;i&gt;DBCursor curr = ..; //Fire query @ mongo, get the cursor&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;i&gt;while (curr.hasNext()) { //iterate over the result set&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; BasicDBObject record = (BasicDBObject) curr.next();&lt;br /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; //Do some magic, get a 
document bean &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;i&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; server.addBean(doc);&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;i&gt;}&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;i&gt;server.commit();&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;b&gt;&lt;i&gt;--/snip--&lt;/i&gt;&lt;/b&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;This should get you started on indexing Mongo data into a running Solr instance.&lt;/div&gt;&lt;div style="text-align: left;"&gt;Also, remember to configure Solr correctly for this to run smoothly.&lt;/div&gt;&lt;br /&gt;&lt;b&gt;&lt;i&gt;Download Resources:&lt;/i&gt;&lt;/b&gt;&lt;br /&gt;&lt;ul style="text-align: left;"&gt;&lt;li&gt;&lt;a href="http://www.mongodb.org/display/DOCS/Drivers"&gt;Mongo Drivers&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://wiki.apache.org/solr/Solrj1.3%20"&gt;SolrJ&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6888929711570901099-7775297459445277090?l=ai-cafe.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://ai-cafe.blogspot.com/feeds/7775297459445277090/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6888929711570901099&amp;postID=7775297459445277090&amp;isPopup=true' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6888929711570901099/posts/default/7775297459445277090'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6888929711570901099/posts/default/7775297459445277090'/><link rel='alternate' 
type='text/html' href='http://ai-cafe.blogspot.com/2011/08/indexing-mongodb-data-for-solr.html' title='Indexing MongoDB Data for Solr'/><author><name>Anshum</name><uri>http://www.blogger.com/profile/12495708906968224036</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6888929711570901099.post-5426075428543009445</id><published>2011-08-09T17:21:00.000+05:30</published><updated>2011-08-09T17:21:07.297+05:30</updated><category scheme='http://www.blogger.com/atom/ns#' term='lucene'/><category scheme='http://www.blogger.com/atom/ns#' term='ramdirectory'/><category scheme='http://www.blogger.com/atom/ns#' term='tmpfs'/><category scheme='http://www.blogger.com/atom/ns#' term='open source search engine'/><category scheme='http://www.blogger.com/atom/ns#' term='improving search speed'/><category scheme='http://www.blogger.com/atom/ns#' term='ramfs'/><title type='text'>Searching off the RAM</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;&lt;b&gt;&lt;span style="font-size: x-large;"&gt;S&lt;/span&gt;&lt;/b&gt;earch engines are a lot about precision, recall and speed. These three factors pretty much define the quality of a search engine. I'll only talk about the last of these here: speed. The time a search engine takes to answer a query is such a critical factor that an improvement of a few hundred milliseconds is of extreme importance to anyone associated with developing/designing search engines.&lt;br /&gt;&lt;b&gt;M&lt;/b&gt;ore often than not, as a short-term fix, all of us look at spending more money on hardware to improve a system's performance. 
Though this might look like a solution, it's bound to fail if you try to run away from actually fixing the application architecture, which is generally the root cause of poorly performing applications.&lt;br /&gt;&lt;b&gt;F&lt;/b&gt;or those who have already done whatever it takes to optimize the search process, here are a few ways that are generally used to host the search index in RAM in order to improve the search speed.&lt;br /&gt;&lt;b&gt;Y&lt;/b&gt;ou may mount a tmpfs/ramfs on your machine and copy the index onto it. You may then open an index reader/searcher on this copy of the index. This reduces the I/O latency and improves the search speed.&lt;br /&gt;&lt;b&gt;T&lt;/b&gt;he differences between&lt;b&gt; tmpfs and ramfs&lt;/b&gt; are:&lt;br /&gt;&lt;ul style="text-align: left;"&gt;&lt;li&gt;ramfs grows dynamically with no upper limit, unlike tmpfs, which is capped at the size it is mounted with.&lt;/li&gt;&lt;li&gt;tmpfs can use swap space whereas ramfs can't.&lt;/li&gt;&lt;/ul&gt;I have personally used tmpfs and it works rather efficiently.&lt;br /&gt;One thing to remember is that both tmpfs and ramfs are volatile. 
They both get erased on a system restart and hence you'd need to re-mount and copy the index on system startup.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Mounting tmpfs:&lt;/b&gt;&lt;i&gt;&lt;span style="font-size: small;"&gt;&amp;nbsp;&lt;/span&gt;&lt;/i&gt;&lt;br /&gt;&lt;i&gt;&lt;span style="font-size: small;"&gt;mkdir -p /mnt/tmp&lt;/span&gt;&lt;/i&gt;&lt;i&gt;&lt;span style="font-size: small;"&gt;&amp;nbsp;&lt;/span&gt;&lt;/i&gt;&lt;br /&gt;&lt;i&gt;&lt;span style="font-size: small;"&gt;mount -t tmpfs -o size=20m tmpfs /mnt/tmp&lt;/span&gt;&lt;/i&gt;&lt;br /&gt;&lt;pre&gt;&amp;nbsp;&lt;/pre&gt;&lt;b&gt;Mounting ramfs&lt;/b&gt; (no size option here, since ramfs ignores it)&lt;b&gt;:&lt;/b&gt;&lt;br /&gt;&lt;i&gt;&lt;span style="font-size: small;"&gt;mkdir -p /mnt/ram&lt;/span&gt;&lt;/i&gt;&lt;i&gt;&lt;span style="font-size: small;"&gt;&amp;nbsp;&lt;/span&gt;&lt;/i&gt;&lt;br /&gt;&lt;i&gt;&lt;span style="font-size: small;"&gt;mount -t ramfs ramfs /mnt/ram&lt;/span&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;i&gt;&lt;span style="font-size: large;"&gt;T&lt;/span&gt;&lt;/i&gt;he other approach, in case you're using Lucene, is to have the IndexReader use a &lt;a href="http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/store/RAMDirectory.html"&gt;RAMDirectory&lt;/a&gt;. 
In this case you'd need a JVM instance with enough memory to load all of the index contents into memory and still have some left for processing search queries.&lt;br /&gt;Also, that may translate to '&lt;b&gt;requiring&lt;/b&gt;' a 64-bit JVM so that it can use a practically unlimited address space.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;--Snip--&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;i&gt;NIOFSDirectory nDir = &lt;/i&gt;&lt;i&gt;new NIOFSDirectory(new File(indexDir)); // the on-disk index&lt;/i&gt;&lt;br /&gt;&lt;i&gt;RAMDirectory directory = new RAMDirectory(nDir); // copies the index contents into memory&lt;br /&gt;IndexReader ir = IndexReader.open(directory);&lt;br /&gt;IndexSearcher indexSearcher = new IndexSearcher(ir);&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;i&gt;--/Snip--&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;I haven't been really fond of this approach as it forces a 64-bit architecture to be used, but it does work, and it avoids the overhead of manually copying and cleaning up the data that comes with maintaining a tmpfs.&lt;br /&gt;&lt;br /&gt;These are the two basic techniques to use if you want your index to be served off the RAM. It is a frequent question on the Lucene users mailing list, so perhaps people can now stop asking that question... well.. almost...&lt;br /&gt;All said and done.. don't stop optimizing the engine/app if your search is slow.. 99% of the time.. 
that is where it has to be handled.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6888929711570901099-5426075428543009445?l=ai-cafe.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://ai-cafe.blogspot.com/feeds/5426075428543009445/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6888929711570901099&amp;postID=5426075428543009445&amp;isPopup=true' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6888929711570901099/posts/default/5426075428543009445'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6888929711570901099/posts/default/5426075428543009445'/><link rel='alternate' type='text/html' href='http://ai-cafe.blogspot.com/2011/08/searching-off-ram.html' title='Searching off the RAM'/><author><name>Anshum</name><uri>http://www.blogger.com/profile/12495708906968224036</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6888929711570901099.post-7603165477969823835</id><published>2011-08-04T12:07:00.001+05:30</published><updated>2011-08-04T12:15:59.397+05:30</updated><category scheme='http://www.blogger.com/atom/ns#' term='sparse data'/><category scheme='http://www.blogger.com/atom/ns#' term='IR'/><category scheme='http://www.blogger.com/atom/ns#' term='open source search engine'/><category scheme='http://www.blogger.com/atom/ns#' term='vector space'/><category scheme='http://www.blogger.com/atom/ns#' term='data solution'/><category scheme='http://www.blogger.com/atom/ns#' term='document representation'/><title type='text'>Dealing with High Dimensional 
Data</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;&lt;div style="text-align: left;"&gt;&lt;span style="font-size: x-large;"&gt;I&lt;/span&gt; believe that data is best represented as a vector. For those who haven't heard this before, let me start with a very basic example of 'how' this is done.&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;Let's assume a corpus of documents which has only 2 unique words (A, B) in its dictionary (if that was hard to follow, comment and I shall follow up with what that means). Now a document containing only a single occurrence of 'A' is a unit vector along the direction of 'A', and likewise for a document containing only a single occurrence of 'B'. Documents with 'x' As and 'y' Bs can hence be represented as:&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;i&gt;&lt;b&gt;x a + y b &lt;/b&gt;(a and b are unit vectors along A and B).&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;When the corpus comprises documents in which a lot of terms have a very low document frequency, it is referred to as high dimensional data. An example would be a list of proper nouns, e.g. hotel names.&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;High dimensional data poses a lot of issues, primarily due to its sparseness in the vector space. The sparseness of the data makes a lot of tasks, like clustering and tagging, challenging. In order to process this data, more often than not, there is a need to reduce the dimension (and with it the sparseness) of the documents. I'll discuss a relatively easy way to reduce the dimension of such data.&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;Given a corpus of high dimensional data, create a document vector for each document. 
Create a term frequency matrix for the corpus and then drop all terms that occur in less than 10% of the documents (the threshold might vary as per the corpus/dataset). Statistically this should remove around 60% of the terms.&lt;/div&gt;&lt;div style="text-align: left;"&gt;Also, removing the terms that occur in more than 80% of the documents would remove a considerable share of terms that are redundant and too frequent. Such terms are generally tagged as stop-words and removed by most standard data/text processing algorithms.&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;The residue that remains now is of a considerably reduced dimension. This is a straightforward way of projecting the original data onto a lower dimensional subspace, one spanned by the dimensions (terms) that remain.&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;This data can now be consumed for any processing, viz. clustering, classification etc.&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;Posts on how to cluster and on various clustering techniques will follow soon.. 
unlike this one which took ages!&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6888929711570901099-7603165477969823835?l=ai-cafe.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://ai-cafe.blogspot.com/feeds/7603165477969823835/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6888929711570901099&amp;postID=7603165477969823835&amp;isPopup=true' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6888929711570901099/posts/default/7603165477969823835'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6888929711570901099/posts/default/7603165477969823835'/><link rel='alternate' type='text/html' href='http://ai-cafe.blogspot.com/2011/08/dealing-with-high-dimensional-data.html' title='Dealing with High Dimensional Data'/><author><name>Anshum</name><uri>http://www.blogger.com/profile/12495708906968224036</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6888929711570901099.post-4490250637135451653</id><published>2010-07-15T11:54:00.008+05:30</published><updated>2010-07-15T12:33:44.715+05:30</updated><category scheme='http://www.blogger.com/atom/ns#' term='anshum'/><category scheme='http://www.blogger.com/atom/ns#' term='facebook'/><category scheme='http://www.blogger.com/atom/ns#' term='twitter'/><category scheme='http://www.blogger.com/atom/ns#' term='auto complete'/><category scheme='http://www.blogger.com/atom/ns#' term='google'/><category scheme='http://www.blogger.com/atom/ns#' term='intent'/><title type='text'>Google Intelligence!</title><content 
type='html'>&lt;div&gt;&lt;div style="text-align: center; "&gt;&lt;div style="text-align: left;"&gt;&lt;b&gt;&lt;div style="text-align: left; display: inline !important; "&gt;&lt;span class="Apple-style-span" style="font-weight: normal; "&gt;&lt;b&gt;&lt;i&gt;&lt;span class="Apple-style-span" style="font-size: x-large;"&gt;G&lt;/span&gt;&lt;/i&gt;&lt;/b&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;oogle seems to be the definitive household name when it comes to the term 'intelligent'! Detecting intent and context has always been a tough nut to crack, but it seems like the search major has its own way of handling it.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;/b&gt;&lt;/div&gt;&lt;/div&gt;&lt;div style="text-align: left; "&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;So when you type '&lt;i&gt;facebook is&lt;/i&gt;' or '&lt;i&gt;twitter is&lt;/i&gt;' into Google's search box, these are exactly the kind of suggestions it throws at you. Try it out yourself.&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: left; "&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;Source for idea/thought: &lt;/span&gt;&lt;a href="http://blog.cleartrip.com/"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;Cleartrip official blog&lt;/span&gt;&lt;/a&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_ordInieEeXw/TD6yliDyN1I/AAAAAAAAI18/nCvZEsjFQrg/s1600/facebook-google-cropped.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 253px;" src="http://1.bp.blogspot.com/_ordInieEeXw/TD6yliDyN1I/AAAAAAAAI18/nCvZEsjFQrg/s400/facebook-google-cropped.png" border="0" alt="" 
id="BLOGGER_PHOTO_ID_5494024953306167122" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_ordInieEeXw/TD6ylFA1JEI/AAAAAAAAI10/CGtHJJgMr7M/s1600/twitter-google-cropped.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 281px;" src="http://2.bp.blogspot.com/_ordInieEeXw/TD6ylFA1JEI/AAAAAAAAI10/CGtHJJgMr7M/s400/twitter-google-cropped.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5494024945509147714" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: center;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6888929711570901099-4490250637135451653?l=ai-cafe.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://ai-cafe.blogspot.com/feeds/4490250637135451653/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6888929711570901099&amp;postID=4490250637135451653&amp;isPopup=true' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6888929711570901099/posts/default/4490250637135451653'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6888929711570901099/posts/default/4490250637135451653'/><link rel='alternate' type='text/html' href='http://ai-cafe.blogspot.com/2010/07/google-intelligence.html' title='Google Intelligence!'/><author><name>Anshum</name><uri>http://www.blogger.com/profile/12495708906968224036</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' 
width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_ordInieEeXw/TD6yliDyN1I/AAAAAAAAI18/nCvZEsjFQrg/s72-c/facebook-google-cropped.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6888929711570901099.post-7777913824858722796</id><published>2009-09-07T16:50:00.014+05:30</published><updated>2009-09-07T18:05:09.730+05:30</updated><category scheme='http://www.blogger.com/atom/ns#' term='open source'/><category scheme='http://www.blogger.com/atom/ns#' term='anshum gupta'/><category scheme='http://www.blogger.com/atom/ns#' term='lucene statistical analysis'/><category scheme='http://www.blogger.com/atom/ns#' term='lucene'/><category scheme='http://www.blogger.com/atom/ns#' term='lucene optimization'/><category scheme='http://www.blogger.com/atom/ns#' term='lucid gaze'/><category scheme='http://www.blogger.com/atom/ns#' term='lucene-2.4.1'/><category scheme='http://www.blogger.com/atom/ns#' term='lucid imagination'/><category scheme='http://www.blogger.com/atom/ns#' term='search engine'/><title type='text'>Lucid Gaze - Tough Nut!</title><content type='html'>&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span" style="font-size: x-large;"&gt;&lt;b&gt;&lt;i&gt;A&lt;/i&gt;&lt;/b&gt; &lt;/span&gt;one-stop solution for a lot of statistical analysis of Lucene internals is the all-new Lucid Gaze for Lucene. Perhaps it has been around for a while for Solr, and I'm left to wonder.. dude...where's the documentation? There aren't many places on the information superhighway where I could spot info on how to use Lucid Gaze. A Google search for the same would prove my point.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;After I did figure out how to use it, it seemed like a good tool, easy to use (after the eureka moment, at least for me). 
Here's how I analyzed various things using Lucid Gaze by Lucid Imagination.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;Pre-requisites:&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li style="text-align: justify;"&gt;Lucene core jar from Lucid Imagination [&lt;a href="http://www.lucidimagination.com/Downloads/LucidWorks-for-Lucene"&gt;Here&lt;/a&gt;]&lt;/li&gt;&lt;/ul&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;i&gt;&lt;b&gt;W&lt;/b&gt;&lt;/i&gt;&lt;/span&gt;rite the indexing/search logic. Open the Reader/Writer/Searcher as usual.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;Create a RamUsageEstimator:&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li style="text-align: justify;"&gt;&lt;i&gt;RamUsageEstimator estimator = new RamUsageEstimator();&lt;/i&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;At the point where you want to analyze, call &lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;ul&gt;&lt;li&gt;&lt;i&gt;estimator.estimateRamUsage(ir);&lt;/i&gt;&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;Where &lt;b&gt;&lt;i&gt;ir&lt;/i&gt;&lt;/b&gt; is an IndexReader/IndexWriter/IndexSearcher.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;i&gt;__Snip__&lt;/i&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;i&gt;&lt;b&gt;Stats s;&lt;/b&gt;&lt;/i&gt;&lt;/div&gt;&lt;div&gt;&lt;i&gt;&lt;b&gt;s = LuceneCore.getIndexStats(); //For getting IndexStats&lt;/b&gt;&lt;/i&gt;&lt;/div&gt;&lt;div&gt;&lt;i&gt;&lt;b&gt;s = LuceneCore.getStoreStats();  //For getting StoreStats&lt;/b&gt;&lt;/i&gt;&lt;/div&gt;&lt;div&gt;&lt;i&gt;&lt;b&gt;s = LuceneCore.getSearchStats(); //For getting SearchStats&lt;/b&gt;&lt;/i&gt;&lt;/div&gt;&lt;div&gt;&lt;i&gt;&lt;b&gt;s = LuceneCore.getAnalysisStats(); //For getting AnalysisStats&lt;/b&gt;&lt;/i&gt;&lt;/div&gt;&lt;/div&gt;&lt;div style="text-align: 
justify;"&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;b&gt;&lt;i&gt;__Snip Ends__&lt;/i&gt;&lt;/b&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;b&gt;&lt;i&gt;&lt;br /&gt;&lt;/i&gt;&lt;/b&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;Once the above step is done, s is populated with a Stats object containing the Index/Store/Search/Analysis Stats (as per the function call).&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;b&gt;&lt;i&gt;__Snip__&lt;/i&gt;&lt;/b&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;i&gt;&lt;b&gt;HashMap&amp;lt;String,Number&amp;gt; h = (HashMap&amp;lt;String,Number&amp;gt;) s.getCurrentCounters(); // &lt;/b&gt;&lt;/i&gt;&lt;span class="Apple-style-span"  style="font-family:'Times New Roman';"&gt;&lt;i&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Retrieve counters accumulated since last &lt;/span&gt;&lt;/b&gt;&lt;/i&gt;&lt;code&gt;&lt;i&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;#resetStats()&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;&lt;/code&gt;&lt;i&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;.&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;i&gt;&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;b&gt;&lt;i&gt;__Snip Ends__&lt;/i&gt;&lt;/b&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;This HashMap is populated with current stat counters. 
The keys are documented in the javadoc.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;The following code iterates through each HashMap, and the output after it shows the kind of result to expect.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;b&gt;&lt;i&gt;__Code__&lt;/i&gt;&lt;/b&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;div style="text-align: justify;"&gt;&lt;i&gt;&lt;b&gt;IndexReader ir = IndexReader.open(indexName);&lt;/b&gt;&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;i&gt;&lt;b&gt;        &lt;/b&gt;&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;i&gt;&lt;b&gt;RamUsageEstimator estimator = new RamUsageEstimator();&lt;/b&gt;&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;i&gt;&lt;b&gt;estimator.estimateRamUsage(ir);&lt;/b&gt;&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;i&gt;&lt;b&gt;        &lt;/b&gt;&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;i&gt;&lt;b&gt;Stats s;&lt;/b&gt;&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;i&gt;&lt;b&gt;HashMap&amp;lt;String,Number&amp;gt; h;&lt;/b&gt;&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;i&gt;&lt;b&gt;        &lt;/b&gt;&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;i&gt;&lt;b&gt;s=LuceneCore.getIndexStats();&lt;/b&gt;&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;i&gt;&lt;b&gt;h = (HashMap&amp;lt;String,Number&amp;gt;) s.getCurrentCounters();&lt;/b&gt;&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;i&gt;&lt;b&gt;for(String key:h.keySet())&lt;/b&gt;&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;i&gt;&lt;b&gt;  System.out.println("indexStats: "+key+"/"+h.get(key));&lt;/b&gt;&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: 
justify;"&gt;&lt;i&gt;&lt;b&gt;        &lt;/b&gt;&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;i&gt;&lt;b&gt;s = LuceneCore.getStoreStats();&lt;/b&gt;&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;i&gt;&lt;b&gt;h = (HashMap&amp;lt;String,Number&amp;gt;) s.getCurrentCounters();&lt;/b&gt;&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;i&gt;&lt;b&gt;for(String key:h.keySet())&lt;/b&gt;&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;i&gt;&lt;b&gt;  System.out.println("storeStats: "+key+"/"+h.get(key));&lt;/b&gt;&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;i&gt;&lt;b&gt;        &lt;/b&gt;&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;i&gt;&lt;b&gt;s=LuceneCore.getSearchStats();&lt;/b&gt;&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;i&gt;&lt;b&gt;h = (HashMap&amp;lt;String,Number&amp;gt;) s.getCurrentCounters();&lt;/b&gt;&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;i&gt;&lt;b&gt;for(String key:h.keySet())&lt;/b&gt;&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;i&gt;&lt;b&gt;  System.out.println("searchStats: "+key+"/"+h.get(key));&lt;/b&gt;&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;i&gt;&lt;b&gt;        &lt;/b&gt;&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;i&gt;&lt;b&gt;s=LuceneCore.getAnalysisStats();&lt;/b&gt;&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;i&gt;&lt;b&gt;h = (HashMap&amp;lt;String,Number&amp;gt;) s.getCurrentCounters();&lt;/b&gt;&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;i&gt;&lt;b&gt;for(String key:h.keySet())&lt;/b&gt;&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;i&gt;&lt;b&gt;  System.out.println("analysisStats: "+key+"/"+h.get(key));&lt;/b&gt;&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;i&gt;&lt;b&gt;        &lt;/b&gt;&lt;/i&gt;&lt;/div&gt;&lt;div 
style="text-align: justify;"&gt;&lt;i&gt;&lt;b&gt;ir.close();&lt;/b&gt;&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;b&gt;&lt;i&gt;&lt;br /&gt;&lt;/i&gt;&lt;/b&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;b&gt;&lt;i&gt;__Code Ends__&lt;/i&gt;&lt;/b&gt;&lt;/div&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;b&gt;&lt;i&gt;&lt;br /&gt;&lt;/i&gt;&lt;/b&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;b&gt;&lt;i&gt;&lt;br /&gt;&lt;/i&gt;&lt;/b&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;b&gt;&lt;i&gt;__Output__&lt;/i&gt;&lt;/b&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;i&gt;&lt;output&gt;&lt;/output&gt;&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;indexStats: iw_adT/2000180&lt;/div&gt;&lt;div style="text-align: justify;"&gt;indexStats: ir_C/0&lt;/div&gt;&lt;div style="text-align: justify;"&gt;indexStats: ir_newC/1&lt;/div&gt;&lt;div style="text-align: justify;"&gt;indexStats: iw_C/1&lt;/div&gt;&lt;div style="text-align: justify;"&gt;indexStats: iw_segs/0&lt;/div&gt;&lt;div style="text-align: justify;"&gt;indexStats: iw_adC/15&lt;/div&gt;&lt;div style="text-align: justify;"&gt;indexStats: iw_buf/0&lt;/div&gt;&lt;div style="text-align: justify;"&gt;indexStats: iw_segC/1&lt;/div&gt;&lt;div style="text-align: justify;"&gt;indexStats: ir_ram/0&lt;/div&gt;&lt;div style="text-align: justify;"&gt;indexStats: iw_newC/1&lt;/div&gt;&lt;div style="text-align: justify;"&gt;indexStats: iw_ram/10487&lt;/div&gt;&lt;div style="text-align: justify;"&gt;storeStats: iw_adT/2000180&lt;/div&gt;&lt;div style="text-align: justify;"&gt;storeStats: ir_C/0&lt;/div&gt;&lt;div style="text-align: justify;"&gt;storeStats: ir_newC/1&lt;/div&gt;&lt;div style="text-align: justify;"&gt;storeStats: iw_C/1&lt;/div&gt;&lt;div style="text-align: justify;"&gt;storeStats: iw_segs/0&lt;/div&gt;&lt;div style="text-align: justify;"&gt;storeStats: iw_adC/15&lt;/div&gt;&lt;div style="text-align: 
justify;"&gt;storeStats: iw_buf/0&lt;/div&gt;&lt;div style="text-align: justify;"&gt;storeStats: iw_segC/1&lt;/div&gt;&lt;div style="text-align: justify;"&gt;storeStats: ir_ram/0&lt;/div&gt;&lt;div style="text-align: justify;"&gt;storeStats: iw_newC/1&lt;/div&gt;&lt;div style="text-align: justify;"&gt;storeStats: iw_ram/10487&lt;/div&gt;&lt;div style="text-align: justify;"&gt;analysisStats: toks/1&lt;/div&gt;&lt;div style="text-align: justify;"&gt;analysisStats: tss/30&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;i&gt;&lt;/i&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;b&gt;&lt;i&gt;__Output Ends__&lt;/i&gt;&lt;/b&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;b&gt;&lt;i&gt;&lt;br /&gt;&lt;/i&gt;&lt;/b&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;b&gt;&lt;i&gt;T&lt;/i&gt;&lt;/b&gt;&lt;/span&gt;his gives pretty much all of the desired information to optimize the search/index or any other process involving a lucene index Reader/Writer/Searcher.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;Thanks to the developers @ Lucid Imagination for coming up with this.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;Thanks to&lt;b&gt; &lt;/b&gt;&lt;a href="http://jayant7k.blogspot.com/"&gt;&lt;b&gt;&lt;i&gt;Jayant&lt;/i&gt;&lt;/b&gt;&lt;/a&gt;&lt;b&gt;&lt;i&gt; , Nitish&lt;/i&gt;&lt;/b&gt; for the help. 
:)&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;Download Lucid Gaze for Lucene &lt;a href="http://www.lucidimagination.com/Downloads/LucidWorks-for-Lucene"&gt;Here&lt;/a&gt; [@Lucid Imagination]&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6888929711570901099-7777913824858722796?l=ai-cafe.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://ai-cafe.blogspot.com/feeds/7777913824858722796/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6888929711570901099&amp;postID=7777913824858722796&amp;isPopup=true' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6888929711570901099/posts/default/7777913824858722796'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6888929711570901099/posts/default/7777913824858722796'/><link rel='alternate' type='text/html' href='http://ai-cafe.blogspot.com/2009/09/lucid-gaze-tough-nut.html' title='Lucid Gaze - Tough Nut!'/><author><name>Anshum</name><uri>http://www.blogger.com/profile/12495708906968224036</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6888929711570901099.post-5464717589850170926</id><published>2009-08-29T00:10:00.001+05:30</published><updated>2009-08-29T00:13:24.342+05:30</updated><category scheme='http://www.blogger.com/atom/ns#' term='generic code'/><category scheme='http://www.blogger.com/atom/ns#' term='anshum 
gupta'/><category scheme='http://www.blogger.com/atom/ns#' term='extensible code'/><category scheme='http://www.blogger.com/atom/ns#' term='configuration vs extending'/><title type='text'>Generic Code? Extensible Code?</title><content type='html'>&lt;div  style="text-align: justify;font-family:times new roman;"&gt;&lt;span style="font-family: arial;font-size:100%;" &gt;&lt;span style="font-size:180%;"&gt;I&lt;/span&gt;t is a question that pops up every time I write code (for me, at least): 'How generic should this be?' By generic, I mean the ability to (re)use the same piece of code without changing anything 'inside' it, changing only a configuration file (XML or whatever the implementation choice is).&lt;br /&gt;&lt;/span&gt;&lt;span style="font-family: arial;font-size:100%;" &gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-family: arial;font-size:100%;" &gt;More often than not I end up trying to write code so generic that the purpose it was primarily built for (whatever the application) gets complicated. True, everything is now in the conf file, but writing the conf file is itself such a task that the only people who would rather write that conf than rewrite application-specific code are those who are 'programming challenged'.&lt;br /&gt;&lt;/span&gt;&lt;span style="font-family: arial;font-size:100%;" &gt;I've realized that if only a few questions were answered before writing such generic code, the developer/designer would be far more at ease.&lt;br /&gt;&lt;/span&gt;&lt;span style="font-family: arial;font-size:100%;" &gt;&lt;span style="font-style: italic;"&gt;* Who asked for it?&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-family: arial;font-size:100%;" &gt;&lt;span style="font-style: italic;"&gt;* Would someone else ever use it? Really? 
Or is it merely an assumption that someday the world will run on it?&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-family: arial;font-size:100%;" &gt;&lt;span style="font-style: italic;"&gt;* Assuming that the world might run on it someday, do I need to write code for all of that right now? Can I just write what I want, optimize it for what is required at the moment and a little more, and then just let it be? Along the lines of the early days of the internet [design it now, and let it get corrected as it goes, with future users correcting it themselves].&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-family: arial;font-size:100%;" &gt;There are a lot of other questions that should be answered before attempting to write 'the universal machine'. Attempts to write code generally drift towards writing a universal machine that would do all we can think of, all we can imagine, and all that the machine itself would be able to imagine years from now! 
:)&lt;br /&gt;&lt;/span&gt;&lt;span style="font-family: arial;font-size:100%;" &gt;Let's write for 'now', and design it well.&lt;br /&gt;&lt;/span&gt;&lt;span style="font-family: arial;font-size:100%;" &gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;Let them extend it rather than configure it....&lt;/span&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://ai-cafe.blogspot.com/feeds/5464717589850170926/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6888929711570901099&amp;postID=5464717589850170926&amp;isPopup=true' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6888929711570901099/posts/default/5464717589850170926'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6888929711570901099/posts/default/5464717589850170926'/><link rel='alternate' type='text/html' href='http://ai-cafe.blogspot.com/2009/08/generic-code-extensible-code_29.html' title='Generic Code? 
Extensible Code?'/><author><name>Anshum</name><uri>http://www.blogger.com/profile/12495708906968224036</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6888929711570901099.post-7989888065463690388</id><published>2009-08-12T16:59:00.008+05:30</published><updated>2009-08-12T17:36:51.458+05:30</updated><category scheme='http://www.blogger.com/atom/ns#' term='anshum gupta'/><category scheme='http://www.blogger.com/atom/ns#' term='comparision'/><category scheme='http://www.blogger.com/atom/ns#' term='lucene'/><category scheme='http://www.blogger.com/atom/ns#' term='open source search engine'/><category scheme='http://www.blogger.com/atom/ns#' term='lucene 2.3.1'/><category scheme='http://www.blogger.com/atom/ns#' term='apache'/><category scheme='http://www.blogger.com/atom/ns#' term='search engine'/><category scheme='http://www.blogger.com/atom/ns#' term='sphinx'/><category scheme='http://www.blogger.com/atom/ns#' term='benchmark'/><title type='text'>Lucene Vs Sphinx - A Showdown on a large dataset</title><content type='html'>&lt;div style="text-align: justify;"&gt;&lt;span style="font-size:100%;"&gt;&lt;span style="font-weight: bold; font-style: italic;font-family:arial;font-size:180%;"  &gt;T&lt;/span&gt;here has long been a battle between the "pro Java" and "pro C" developer communities. Not that I'd want to be strictly associated with either, but I would always say the robustness/exception handling/stability vector has a better cosine value with Java than with C [with all respect to C, for being the progenitor of Java]. Let me not go off on a tangent and instead move towards the core here. The last few weeks were spent benchmarking 'Java' Lucene as a search engine against 'C' Sphinx. 
Neither engine was exactly in its vanilla form; after a lot of modifications to both, we finally ran a lot of tests on the two engines.&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;Both ran on a common playground with the following specifications:&lt;br /&gt;&lt;/span&gt;&lt;span style="font-weight: bold; font-style: italic;font-size:100%;" &gt;Processor(s)      : Intel Quad Core X 2&lt;/span&gt;&lt;span style="font-size:100%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-weight: bold; font-style: italic;font-size:100%;" &gt;RAM               : 24G&lt;/span&gt;&lt;span style="font-size:100%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-weight: bold; font-style: italic;font-size:100%;" &gt;Operating System  : RHEL 32 Bit&lt;/span&gt;&lt;span style="font-size:100%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-weight: bold; font-style: italic;font-size:100%;" &gt;Document Corpus   : 18 million+ Documents&lt;/span&gt;&lt;span style="font-size:100%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-weight: bold; font-style: italic;font-size:100%;" &gt;Source Size       : 90G [RDBMS Table]&lt;/span&gt;&lt;span style="font-size:100%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;The tests ran on a corpus of approximately 18 million records, all indexed but not all stored, using multi-field queries with varying boost values and a fair level of complexity. 
Here is the result sheet:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://lucene.apache.org/"&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;Lucene&lt;/span&gt;&lt;/a&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-weight: bold;font-size:100%;" &gt;Index Size : 20G&lt;/span&gt;&lt;span style="font-size:100%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-weight: bold;font-size:100%;" &gt;Concurrency : 30 [5*6 Daemons, 2G each]&lt;/span&gt;&lt;span style="font-size:100%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;Total Searches :  64931&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;Slow Query Count (&gt;=10 secs) :   3803     ( 5.86%)&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;Total Duration (secs)        : 238094.574&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;Mean Duration            :      3.667&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;Mode Duration            :      0.835&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;Minimum Duration             :      0.001&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;Maximum Duration             :   1174.757&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;Duration Standard Deviation  :     15.441&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;Search Time (secs) Distribution :-&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;    [0,0.25) :   2770 ( 4.27%)&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;  [0.25,0.5) :   5666 ( 8.73%)&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;     [0.5,1) :  13515 (20.81%)&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;     [1,1.5) :  10928 (16.83%)&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;     [1.5,2) :   7330 (11.29%)&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;       [2,3) :   8476 (13.05%)&lt;br /&gt;&lt;/span&gt;&lt;span 
style="font-size:100%;"&gt;       [3,5) :   7222 (11.12%)&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;      [5,10) :   5221 ( 8.04%)&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;     [10,20) :   2335 ( 3.60%)&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;   [20,+inf) :   1468 ( 2.26%)&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-weight: bold;font-size:100%;" &gt;Concurrency : 5 [1 Daemon * 2G]&lt;/span&gt;&lt;span style="font-size:100%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;Total Searches               : 225906&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;Slow Query Count (&gt;=10 secs) :    972     ( 0.43%)&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;Total Duration (secs)        : 186700.646&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;Mean Duration                :      0.826&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;Mode Duration                :      0.003&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;Minimum Duration             :      0.001&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;Maximum Duration             :    467.647&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;Duration Standard Deviation  :      2.864&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;Duration (secs) Bins         :-&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;    [0,0.25) :  64621 (28.61%)&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;  [0.25,0.5) :  58947 (26.09%)&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;     [0.5,1) :  56894 (25.18%)&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;     [1,1.5) :  19836 ( 8.78%)&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;     [1.5,2) :   9397 ( 4.16%)&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;       [2,3) :   7941 ( 3.52%)&lt;br /&gt;&lt;/span&gt;&lt;span 
style="font-size:100%;"&gt;       [3,5) :   4810 ( 2.13%)&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;      [5,10) :   2488 ( 1.10%)&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;     [10,20) :    684 ( 0.30%)&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;   [20,+inf) :    288 ( 0.13%)&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://sphinxsearch.com/"&gt;Sphinx&lt;/a&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-weight: bold;font-size:100%;" &gt;Index Size: 60G&lt;/span&gt;&lt;span style="font-size:100%;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;Concurrency: 30&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;Total Searches               : 244431&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;Slow Query Count (&gt;=10 secs) :  27479     (11.24%)&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;Total Duration (secs)        : 1243474.213&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;Mean Duration                :      5.087&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;Mode Duration                :      0.007&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;Minimum Duration             :      0.001&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;Maximum Duration             :   1869.063&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;Duration Standard Deviation  :     17.833&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;Average Queries              :      2.783&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;Duration (secs) Bins         :-&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;    [0,0.25) :  51186 (20.94%)&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;  [0.25,0.5) :  27798 (11.37%)&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;     [0.5,1) :  32372 (13.24%)&lt;br /&gt;&lt;/span&gt;&lt;span 
style="font-size:100%;"&gt;     [1,1.5) :  20490 ( 8.38%)&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;     [1.5,2) :  16915 ( 6.92%)&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;       [2,3) :  21833 ( 8.93%)&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;       [3,5) :  23550 ( 9.63%)&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;      [5,10) :  22808 ( 9.33%)&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;     [10,20) :  14540 ( 5.95%)&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;   [20,+inf) :  12939 ( 5.29%)&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-weight: bold;font-size:100%;" &gt;Concurrency: 5&lt;/span&gt;&lt;span style="font-size:100%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;Total Searches               : 226528&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;Slow Query Count (&gt;=10 secs) :   9895     ( 4.37%)&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;Total Duration (secs)        : 453296.517&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;Mean Duration                :      2.001&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;Mode Duration                :      0.007&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;Minimum Duration             :      0.001&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;Maximum Duration             :    164.713&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;Duration Standard Deviation  :      4.543&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;Average Queries              :      2.773&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;Duration (secs) Bins         :-&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;    [0,0.25) :  71001 (31.34%)&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;  [0.25,0.5) :  36500 (16.11%)&lt;br /&gt;&lt;/span&gt;&lt;span 
style="font-size:100%;"&gt;     [0.5,1) :  32799 (14.48%)&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;     [1,1.5) :  20416 ( 9.01%)&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;     [1.5,2) :  16385 ( 7.23%)&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;       [2,3) :  13951 ( 6.16%)&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;       [3,5) :  13330 ( 5.88%)&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;      [5,10) :  12251 ( 5.41%)&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;     [10,20) :   7563 ( 3.34%)&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;   [20,+inf) :   2332 ( 1.03%)&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;As per the analysis, &lt;span style="font-weight: bold;"&gt;for the dataset&lt;/span&gt; &lt;span style="font-weight: bold;"&gt;analyzed,&lt;/span&gt; Lucene was found to win convincingly over its rival.&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;More details on the same to come soon!&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;&lt;span style="font-weight: bold;"&gt;P.S: &lt;/span&gt;Though lucene works great for a lot of cases, so does sphinx. 
Here, lucene seemed to have the upper hand.&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://ai-cafe.blogspot.com/feeds/7989888065463690388/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6888929711570901099&amp;postID=7989888065463690388&amp;isPopup=true' title='9 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6888929711570901099/posts/default/7989888065463690388'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6888929711570901099/posts/default/7989888065463690388'/><link rel='alternate' type='text/html' href='http://ai-cafe.blogspot.com/2009/08/lucene-vs-sphinx-showdown-on-large.html' title='Lucene Vs Sphinx - A Showdown on a large dataset'/><author><name>Anshum</name><uri>http://www.blogger.com/profile/12495708906968224036</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>9</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6888929711570901099.post-4666752688102656650</id><published>2008-09-28T17:48:00.002+05:30</published><updated>2008-09-28T17:56:24.292+05:30</updated><category scheme='http://www.blogger.com/atom/ns#' term='anshum'/><category scheme='http://www.blogger.com/atom/ns#' term='go dutch'/><category scheme='http://www.blogger.com/atom/ns#' term='shiksha'/><category scheme='http://www.blogger.com/atom/ns#' term='naukri'/><category scheme='http://www.blogger.com/atom/ns#' term='teche'/><category scheme='http://www.blogger.com/atom/ns#' term='naukri labs'/><category 
scheme='http://www.blogger.com/atom/ns#' term='official blog'/><category scheme='http://www.blogger.com/atom/ns#' term='info edge'/><title type='text'>Go Dutch! @ Info Edge</title><content type='html'>Nice read....&lt;br /&gt;&lt;br /&gt;&lt;a href="http://teche-go-dutch.blogspot.com/"&gt;http://teche-go-dutch.blogspot.com/&lt;/a&gt;</content><link rel='replies' type='application/atom+xml' href='http://ai-cafe.blogspot.com/feeds/4666752688102656650/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6888929711570901099&amp;postID=4666752688102656650&amp;isPopup=true' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6888929711570901099/posts/default/4666752688102656650'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6888929711570901099/posts/default/4666752688102656650'/><link rel='alternate' type='text/html' href='http://ai-cafe.blogspot.com/2008/09/go-dutch-info-edge.html' title='Go Dutch! 
@ Info Edge'/><author><name>Anshum</name><uri>http://www.blogger.com/profile/12495708906968224036</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6888929711570901099.post-616204998333044303</id><published>2008-09-28T09:10:00.008+05:30</published><updated>2009-09-08T00:26:15.449+05:30</updated><category scheme='http://www.blogger.com/atom/ns#' term='new delhi'/><category scheme='http://www.blogger.com/atom/ns#' term='anti terrorism'/><category scheme='http://www.blogger.com/atom/ns#' term='new media'/><category scheme='http://www.blogger.com/atom/ns#' term='technology'/><category scheme='http://www.blogger.com/atom/ns#' term='india'/><category scheme='http://www.blogger.com/atom/ns#' term='image search'/><category scheme='http://www.blogger.com/atom/ns#' term='government'/><category scheme='http://www.blogger.com/atom/ns#' term='search engine'/><category scheme='http://www.blogger.com/atom/ns#' term='bomb blast'/><title type='text'>Intelligent Video Surveillance</title><content type='html'>&lt;div style="text-align: justify;"&gt;&lt;span style="font-weight: bold; font-style: italic;font-family:arial;font-size:180%;"  &gt;I&lt;/span&gt;n the light of the bombing in New Delhi, I thought of steering my thoughts (actually seeded by Manisha) towards a solution to the important problem of monitoring an overpopulated city. If we look at it and analyze it, we can compare it to a typical search problem. 
It has excessive amounts of data, limited time to process it and a need for a high level of accuracy.&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;I&lt;/span&gt;f we could design a search system so potent and so intelligent that it detects and reports anything/anyone it thinks is worth a mention, and integrate it with new media to flash alerts on the cell phones and billboards around the area of detection, we would have an amazing (though ideal) system. And trust me, unlike other ideal systems, this one is doable.&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;G&lt;/span&gt;etting into the technicalities of the issue, question 1 is: what do we already have in place? A terrorist database with photographs and other details, and a police force (though not large enough to monitor and act at the same time).&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;S&lt;/span&gt;o taking all that data and, first of all, digitizing it (which I believe the government has already done) is the way to start. Once the data is digitized, it has to be preprocessed (to prepare it for indexing), exactly the way all data is treated for search engines.&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;O&lt;/span&gt;nce we structure the data and order it accordingly (this includes, and primarily consists of, images), we are ready to index it. Image indexing is the pivotal thing here, as the images are large and numerous.&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;W&lt;/span&gt;hat happens after we are ready with the indexed data (which happens to be a lot of images)?&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;W&lt;/span&gt;e need to build an image search engine. OK, so how does it differ from a Google image search or a Yahoo image search? 
Unlike those search engines, which are a function of text and not &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_3"&gt;RGB&lt;/span&gt; values, this has to be one that matches an image against an image (and finds similar images).&lt;br /&gt;In other words, this runs a search on an input 'image' rather than a keyword search that pulls up all images tagged with the keyword.&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;T&lt;/span&gt;his is the most important part, as it involves bit stream matches and needs an algorithm that learns to filter out noise over time (so that its noise removal works better with each passing day). It must also understand that a person could wear a helmet or scarf and would still have to be detected.&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;A&lt;/span&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_4"&gt;lso&lt;/span&gt;, there could be voice &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_5"&gt;matchers&lt;/span&gt; that match the voice to make sure that the person is the same, with a mechanism that learns about human voice modulations and variations.&lt;br /&gt;There's a lot more that the search engine could work on and handle (I'm sure more people thinking about it would come up with better ideas, or at least more ideas, on the issue).&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;A&lt;/span&gt;nother question, again a pivotal one: would a human do such matching? That is, would the input to the system come from a human being? I say it could, but rarely; mostly it would have to be integrated with cameras.. '&lt;span style="font-weight: bold; font-style: italic;"&gt;Intelligent Video &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_6"&gt;Surveillance&lt;/span&gt;&lt;/span&gt;' cameras. With the current state of technology, and Canon having its amazing multiple face detection &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_7"&gt;technology&lt;/span&gt;, we are almost there on this front. 
Integrating that technology with frames from video cameras (on a sampled basis, as not all of them could be handled) would let the system search for faces using the engine already described. This search would be an ongoing process, and as soon as something fishy or known is detected by the system, it could raise an alarm.&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;W&lt;/span&gt;e could then integrate it with video devices in police stations and control rooms to flash the captured and detected 'may-be terrorist' images, which go in as leads to the existing police forces.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;T&lt;/span&gt;his, I believe, would help the country, the police force and the law enforcement agencies like nothing else (&lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_8"&gt;at least&lt;/span&gt; as far as the current issue is concerned).&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;T&lt;/span&gt;here's a need for better technology in government agencies, with terrorism taking a new-age format that is highly dependent on &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_9"&gt;technology&lt;/span&gt;.&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;H&lt;/span&gt;ope someone reads this.. or thinks about it.. 
someone who has the power for this implementation at national level.&lt;br /&gt;&lt;br /&gt;--&lt;br /&gt;&lt;span style="font-size:85%;"&gt;&lt;span style="font-weight: bold;"&gt;P.S : Comments would be appreciated!&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6888929711570901099-616204998333044303?l=ai-cafe.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://ai-cafe.blogspot.com/feeds/616204998333044303/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6888929711570901099&amp;postID=616204998333044303&amp;isPopup=true' title='7 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6888929711570901099/posts/default/616204998333044303'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6888929711570901099/posts/default/616204998333044303'/><link rel='alternate' type='text/html' href='http://ai-cafe.blogspot.com/2008/09/intelligent-video-serveillance.html' title='Intelligent Video Surveillance'/><author><name>Anshum</name><uri>http://www.blogger.com/profile/12495708906968224036</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>7</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6888929711570901099.post-1042517911446691370</id><published>2008-09-04T22:39:00.001+05:30</published><updated>2008-09-04T22:40:15.062+05:30</updated><title type='text'>An Ideal Agent</title><content type='html'>&lt;div style="text-align: justify;"&gt;&lt;span style="font-style: italic;font-size:180%;" &gt;&lt;span style="font-weight: bold;"&gt;D&lt;/span&gt;&lt;/span&gt;on't we just feel amazing (and willing to pay that extra buck) 
when we are offered personalized service? I guess we do, and so we have personal managers. Each time an HNI (High Net-worth Individual) steps into a bank, he is sold a service with a 'personal assistant/agent'.&lt;br /&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;W&lt;/span&gt;hy do only the select few HNIs get this kind of benefit and satisfaction? Perhaps it's the cost factor. An agent costs money (salary) and the financial institution cannot afford to have an agent for a non-HNI.&lt;br /&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;W&lt;/span&gt;hen the need for a personalized service seems so obvious, what stops us from having a pseudo-personal agent? An artificial agent? Someone over the internet? The agent knows what we do, where we went for our last vacation, who we went with, our last job, our girlfriend (and her birthday), and just does what we humans are ideally expected to do!&lt;br /&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;Doesn't seem impossible, but then is there something we should fear?&lt;/span&gt;&lt;br /&gt;I already sense the so-called 'what happens to my personal space' and invasion of privacy issues. I say, when you could share your wealth information with a human being, what harm could a machine cause? It would just help you remember your girlfriend's (or wife's) birthday on time, and might as well help you with other stuff. I do understand the security risks involved, but how many of us fear storing our emails in our Gmail accounts, considering the fact that Google's privacy document (that each Gmail user signs) lets them store a copy of all emails sent/received by you (for so-called security purposes)? 
If we are not scared of storing mails, why the hype about everything else?&lt;br /&gt;What are the odds that you wouldn't want this kind of agent, considering that it'd help you each time you log on to the internet by flashing your kind of news, informing you about your favourite sports star (say Sachin or Michael Schumacher) or your favourite actor (might be SRK, for instance)? It could also tell you about the time you spent last month visiting all the wrong websites while planning your vacation (and so would keep you away from such sites).&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;I&lt;/span&gt;'m certainly in favour of having an ideal agent taking care of all of us.... considering our choices. Imagine girls with a machine talking to them all the while they surf.... talking their language.... 'hey.... doesn't that diamond ring at abc.com look so awesome?' and then you'd check and realize that it exactly matches your choice.... (just because the machine can't think by itself.. it only knows how to think your way.. and by now it has started to understand your likes and your taste....). All of this, though, might have one impact.... we might just start liking machines more than human beings... and then we'd (or the biotech guys) be designing 'intelligent' human beings..... 
:)&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6888929711570901099-1042517911446691370?l=ai-cafe.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://ai-cafe.blogspot.com/feeds/1042517911446691370/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6888929711570901099&amp;postID=1042517911446691370&amp;isPopup=true' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6888929711570901099/posts/default/1042517911446691370'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6888929711570901099/posts/default/1042517911446691370'/><link rel='alternate' type='text/html' href='http://ai-cafe.blogspot.com/2008/09/ideal-agent.html' title='An Ideal Agent'/><author><name>Anshum</name><uri>http://www.blogger.com/profile/12495708906968224036</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6888929711570901099.post-3801335004068386490</id><published>2008-09-01T11:13:00.012+05:30</published><updated>2008-09-01T12:06:29.631+05:30</updated><title type='text'>Page Rank - My Version!</title><content type='html'>&lt;div style="text-align: justify;"&gt;&lt;span style="font-weight: bold; font-style: italic;font-size:180%;" &gt;S&lt;/span&gt;o what is '&lt;span style="font-style: italic;"&gt;Page Rank&lt;/span&gt;'?&lt;br /&gt;&lt;span style="font-weight: bold; font-style: italic;font-size:130%;" &gt;T&lt;/span&gt;here are a little too many answers to the question. Seems like everyone has a 'similar' version. 
In other terms, it's the same style of cooking but a different , leading to a distinction in taste. I would not really try to define the ranking algorithm that was finely designed by the two Stanford University PhD students (and, as one line of thought goes, named after one of its designers).&lt;br /&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;I &lt;/span&gt;would like to add my thoughts on the algo to the already existing ones, though mine is a more generic one. I'd say that the PageRank of a page (or a document) is the probability of a web searcher looking for 'X' keywords ending his/her search at that particular page, assuming that he/she had an infinite amount of time at his/her disposal.&lt;br /&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;N&lt;/span&gt;ow the question is, how do we (or a machine) calculate that probability?&lt;br /&gt;There are a lot of things that have some relationship with this calculation. Starting off with the most heard of: 'back links'. Probability and logic have it that the more links there are from other pages to this page, the higher the chances of someone reaching the page. That's logic, and the probability is math.&lt;br /&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;T&lt;/span&gt;he odds of terminating the search at a particular website also depend on the reliability of the page (or the website). Credibility is as important as anything, because no one would want to call it a day with all the wrong information. Would you, in that position, terminate your search there? I'm sure 'No' is the answer (unless you have a deadline, and meeting it is the priority for now as compared to reliability :P ). 
This is the reason why .gov, .edu, .google.co* and Wikipedia are given the boost they are given when Google ranks pages for your search.&lt;br /&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;A&lt;/span&gt;lso, having spoken about the back linking of websites, there is a damping factor associated with each hop, so that having traversed 2 edges to reach a page weighs less than a single hop. The damping factor is supposedly 0.85.&lt;br /&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;N&lt;/span&gt;ext question that comes to mind is, what about the newbies? In that case, there's a default value (supposedly 0.15, i.e. 1 - 0.85) for them.&lt;br /&gt;Those are the important components of page rank.&lt;br /&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;A&lt;/span&gt;ll said and done, we would still be concerned with the question of 'where do we start', and how do we get a back-link based rank until we first build the network? This boils down to the chicken and egg problem.. right!!! 
:)&lt;br /&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;W&lt;/span&gt;ell, it is supposed to start off with default values at the start and then the process is repeated 'n' times (where 'n' is a very high number) to fine tune the ranks to get closer to their real values.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;T&lt;/span&gt;his is my version of the way the best 'generic' search engine on the planet works (and there's a lot more than this that is used for fine tuning search results and which is beyond the scope of this entry).&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6888929711570901099-3801335004068386490?l=ai-cafe.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://ai-cafe.blogspot.com/feeds/3801335004068386490/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6888929711570901099&amp;postID=3801335004068386490&amp;isPopup=true' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6888929711570901099/posts/default/3801335004068386490'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6888929711570901099/posts/default/3801335004068386490'/><link rel='alternate' type='text/html' href='http://ai-cafe.blogspot.com/2008/09/page-rank-my-version.html' title='Page Rank - My Version!'/><author><name>Anshum</name><uri>http://www.blogger.com/profile/12495708906968224036</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6888929711570901099.post-215525927983987279</id><published>2008-08-03T08:28:00.000+05:30</published><updated>2008-08-03T08:55:52.153+05:30</updated><title type='text'>Search Engine and Artificial Intelligence!</title><content type='html'>&lt;strong&gt;&lt;span style="font-size:180%;"&gt;S&lt;/span&gt;&lt;/strong&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_0"&gt;earch&lt;/span&gt; engines encompass a variety of subdomains or specialized fields of computer science, ranging from &lt;strong&gt;Information Retrieval &lt;/strong&gt;(as they were in their primitive form) to &lt;strong&gt;Artificial Intelligence &lt;/strong&gt;(check out the few options mentioned below). Ideally, search engines are self &lt;strong&gt;learning machines&lt;/strong&gt; that can correctly decipher queries given to them in &lt;strong&gt;Natural Language&lt;/strong&gt;, while also understanding the &lt;strong&gt;intent &lt;/strong&gt;and &lt;strong&gt;context &lt;/strong&gt;of the searcher, e.g. 
differentiating between Jaguar - the car and Jaguar - the animal.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Anatomy of a Search Engine&lt;/strong&gt;&lt;br /&gt;Generally, indexing comprises:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Porting/Crawling (to form a data source)&lt;/li&gt;&lt;li&gt;Mapping, Cleanup&lt;/li&gt;&lt;li&gt;Indexing&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;And searching comprises:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Index reader&lt;/li&gt;&lt;li&gt;Query formation&lt;/li&gt;&lt;li&gt;Searcher (for selecting data)&lt;/li&gt;&lt;li&gt;Scoring mechanism (for ordering data)&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Other ideas that are of use while designing a search engine:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Mapping&lt;/li&gt;&lt;li&gt;Clustering&lt;/li&gt;&lt;li&gt;Wisdom of the crowd for &lt;ul&gt;&lt;li&gt;Relevance scoring&lt;/li&gt;&lt;li&gt;Spam filtering&lt;/li&gt;&lt;li&gt;Suggestion Engine&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;Natural Language Processing for &lt;ul&gt;&lt;li&gt;Query formation&lt;/li&gt;&lt;li&gt;Intent determination&lt;/li&gt;&lt;li&gt;One box searching - G****E like!&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;There's a lot more that could be of use, but let's start off with this for the moment!&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6888929711570901099-215525927983987279?l=ai-cafe.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://ai-cafe.blogspot.com/feeds/215525927983987279/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6888929711570901099&amp;postID=215525927983987279&amp;isPopup=true' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6888929711570901099/posts/default/215525927983987279'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/6888929711570901099/posts/default/215525927983987279'/><link rel='alternate' type='text/html' href='http://ai-cafe.blogspot.com/2008/08/search-engine-and-artificial.html' title='Search Engine and Artificial Intelligence!'/><author><name>Anshum</name><uri>http://www.blogger.com/profile/12495708906968224036</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry></feed>
