Quest for search Thursday, October 2, 2008

This was originally posted on Vidya blog - Digital Information Archive at Amrita

Here's the post for the blog :-)

Quest for Search Ended

All this started when I met Ajai Narendran sir (Vidya's architect and's webmaster). The first time, it was just a normal meeting. I had always wanted a search engine that works for Vidya, a huge archive, because it would make life a lot easier. Some time later, while we were talking in general, I told him that a search engine for Vidya must have a place where you can add new archives so that they also become searchable. He replied with something like "human imagination has no bounds". He was right, and that was imagination at its peak. I had been doing search engine optimization for the past year, so I had a background in how search engines work and how to make a web site show up in search results. I knew I could try solving the problem of finding a search engine for Vidya, and I was sure that once I found a solution, Ajai sir would let me deploy it on Vidya. Knowledge is not enough. We must apply! He is always open to new ideas. I went ahead to find a solution, not taking it too seriously, just adding it to my list of things to explore.

Being practical, I did not set out to write a search engine of my own. It was clear to me that there are open source products for enterprise search. I analyzed products like Nutch, from the Apache Software Foundation. Google, the king of search engines, has a product that fits into an enterprise and makes all its documents searchable, though not for free as on the Internet. The minimal Google Search Appliance, a hardware plus software product, costs $2,995. It is a monster with 16GB of RAM and terabytes of storage space: an amazing product at an amazingly large cost. Then I stumbled upon a product called Simplexo and downloaded it; it comes from an open source startup company in the UK offering an enterprise search solution. With all this in mind, hey wait! I asked myself, what is Yahoo's equivalent of the Google Search Appliance? These giants compete with each other in a big way. The answer came when I googled for "Free enterprise search" and the results took me to a product called IBM OmniFind Yahoo! Edition. There was nothing surprising at first, because it was yet another product that I had stumbled upon. I downloaded it and deployed it on Linux. The installation took just three clicks. Never before had there been a search product so easy to install. The next day I made OmniFind crawl 4000 pages from Wikipedia, which was more like downloading the whole web site. I tried some search queries and the results were perfect, very relevant to whatever I searched for. The next day I crawled many thousands of pages and, surprisingly, after indexing 8000 pages the collection occupied only 400MB. I found this incredible. I couldn't wait to tell Ajai sir what I had found. I could keep talking about its features, there are a lot more, and it's free! After the Onam holidays, I gave a demo to Ajai sir and he also found it awesome. We deployed it on one of Vidya's server class machines and freed a 25GB partition to store the index of the crawled data.

On day one, just 12 hours after installation, it had indexed 15,000 documents on Vidya. That includes all the content inside the PDF books and not just the file names. That's when I found this server coming alive. It acted like a living being (my pet!!), always working and crawling files, then staying idle for a while when it re-organizes the data to optimize storage size. I looked at the crawler statistics whenever I got time, looking forward to the launch of the search engine: something that could make Vidya's user experience better than ever and that a lot of people will find useful. It took two days to crawl 200,000+ documents, and it is still crawling more files. I really love the way this search engine works; it's easy to manage and gives good quality search results.

Here I am today, on Gandhi Jayanti, having spent the whole day with Ajai sir working on one more thing that I had promised him along with the search engine: bringing blogging into Vidya for sharing knowledge. It took me a week to figure out how to work in an unfamiliar environment, which was Windows Server 2003. Today I finally used everything I had learned about the server, and carefully and successfully installed WordPress, a blogging software and a server-side program. The blog's theme, after we customized it for Vidya, is beautiful and aesthetically pleasing. The blog is meant to be a platform for knowledge sharing.

This evening, while walking with Ajai sir after a day's work, I had a sense of satisfaction from a job well done. I could figure out that Ajai sir has felt the same every day since 2001, when he started building Vidya. I would say Vidya is "built with passion and love" for knowledge sharing. I wonder what the joy of doing a great job with such passion must have been like. It was exciting to work with him, for an array of reasons. He is a friend, a teacher, and sometimes he takes a parental perspective. His openness is like this: "If you have knowledge, let others light their candles in it". A lot of learning happened when unexpected problems popped up during installation; in fact, he was also learning new things with me. I had never worked on such a large data set until we deployed the search engine on Vidya. I owe thanks to Ajai sir for trusting me and giving me the opportunity to work on Vidya's server.

So it's finally launch day tomorrow: we're taking the search engine and blog public on Vidya. I'm thrilled to see the search engine in action. You can reach it at http://vidya/search .

Have a good day. Happy Searching!

"Imagination is more important than knowledge. For while knowledge defines all we currently know and understand, imagination points to all we might yet discover and create." - Albert Einstein
