Full text search in Ruby

Full text search is a technique for searching documents or a database stored on a computer. A full text search engine examines all the words in every stored document to find matches for the keywords the user searched for. Many websites and applications provide full text search capabilities. There are quite a few options for adding full text search to a Ruby on Rails application. The choice can be based on the language the search engine is written in or on the scalability the application needs.

Acts As Indexed, being a pure Ruby implementation, is fully portable and suitable for almost any application that requires full text search. Search queries support several standard operators, namely exclusion of a term with '-' and phrase matching with quotation marks. It is useful for a simple site that needs a basic search implemented very quickly. Ferret is a full text search engine library written for Ruby; the Acts As Ferret plugin integrates it into a Rails application.
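The query syntax described for Acts As Indexed (a leading '-' to exclude a term, quotation marks to match a phrase) can be illustrated with a small pure Ruby sketch. The helpers parse_query and matches? are hypothetical names made up for this example; they are not part of the Acts As Indexed API and do not reflect how the plugin is implemented.

```ruby
# Hypothetical helper: split a query into required terms, excluded terms
# and quoted phrases. Illustration only, not the Acts As Indexed parser.
def parse_query(query)
  parsed = { terms: [], excluded: [], phrases: [] }
  # Try a quoted phrase first, otherwise take a run of non-space characters.
  query.downcase.scan(/"([^"]+)"|(\S+)/) do |phrase, word|
    if phrase
      parsed[:phrases] << phrase
    elsif word.start_with?("-") && word.length > 1
      parsed[:excluded] << word[1..-1]
    else
      parsed[:terms] << word
    end
  end
  parsed
end

# Hypothetical helper: true when the text contains every required term and
# phrase and none of the excluded terms.
def matches?(text, parsed)
  normalized = text.downcase
  words = normalized.scan(/[a-z0-9]+/)
  parsed[:terms].all? { |term| words.include?(term) } &&
    parsed[:phrases].all? { |phrase| normalized.include?(phrase) } &&
    parsed[:excluded].none? { |term| words.include?(term) }
end

query = parse_query('ruby "full text" -java')
p matches?("Full text search in Ruby", query)          # => true
p matches?("Full text search in Ruby and Java", query) # => false
p matches?("Ruby text in full", query)                 # => false
```

The last call returns false because the words "full" and "text" are present but not as the adjacent phrase the quotation marks ask for.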

Ferret is inspired by the Apache Lucene Java project. The first step in implementing search is to build an index; the index is then searched for documents containing the keyword. One of the more useful features, especially in a web scenario, is highlighting the matched words, which Index’s highlight method makes trivial. It is also possible to use Ferret as a more general purpose data store. Xapian is written in C++, with bindings that allow use from Perl, Python, PHP, Java, Tcl, C# and Ruby.
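The workflow described for Ferret (build an index, search it for a keyword, highlight the matched words) can be sketched in plain Ruby. The TinyIndex class below is a hypothetical toy written for this article. It does not use Ferret, and its search and highlight methods only imitate the idea, not Ferret's actual signatures or behaviour.

```ruby
require "set"

# A toy in-memory index (hypothetical; not Ferret's API or implementation).
class TinyIndex
  def initialize
    @documents = {}
    @postings = Hash.new { |hash, word| hash[word] = Set.new }
  end

  # Step 1: build the index by recording which documents contain each word.
  def add(doc_id, text)
    @documents[doc_id] = text
    tokenize(text).each { |word| @postings[word] << doc_id }
    self
  end

  # Step 2: look the keyword up in the index instead of scanning every document.
  def search(keyword)
    word = tokenize(keyword).first
    return [] if word.nil?
    @postings.fetch(word, Set.new).to_a.sort
  end

  # Wrap whole-word, case-insensitive matches of the keyword in tags.
  def highlight(keyword, doc_id, pre_tag: "<b>", post_tag: "</b>")
    text = @documents.fetch(doc_id)
    word = tokenize(keyword).first
    return text if word.nil?
    text.gsub(/\b#{Regexp.escape(word)}\b/i) { |match| "#{pre_tag}#{match}#{post_tag}" }
  end

  private

  # Lowercase the text and split it into alphanumeric words.
  def tokenize(text)
    text.downcase.scan(/[a-z0-9]+/)
  end
end

index = TinyIndex.new
index.add(1, "Ruby makes full text search simple")
index.add(2, "Searching a database with SQL")
index.add(3, "Search engines written in Ruby")

p index.search("search")          # => [1, 3]
puts index.highlight("search", 3) # prints: <b>Search</b> engines written in Ruby
```

Document 2 is not returned because this sketch does no stemming, so "Searching" and "search" are indexed as different words.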

An important feature of Xapian is ranked probabilistic search: important words get more weight than unimportant ones, so more relevant results appear at the top. It also supports synonyms as an automatic form of query expansion and can suggest spelling corrections for user-supplied queries. It offers a full range of structured boolean search operators (“stock NOT market”, etc.). Sphinx is written in C++. A Rails plugin built on it, such as Thinking Sphinx, is the most logical successor to Ultrasphinx, since both use Sphinx as the search server.
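The idea behind ranked search in Xapian, that important words count for more than unimportant ones, can be illustrated with a short pure Ruby sketch. It substitutes plain TF-IDF with a smoothed IDF for Xapian's probabilistic weighting, so it shows the principle rather than Xapian's actual scoring. The tokenize and rank helpers are hypothetical names, and no Xapian bindings are used.

```ruby
# Lowercase the text and split it into alphanumeric words.
def tokenize(text)
  text.downcase.scan(/[a-z0-9]+/)
end

# Rank documents (a Hash of id => text) against a query using TF-IDF.
# Words found in few documents weigh more than words found in most of them.
# Assumption: TF-IDF stands in here for Xapian's probabilistic weighting.
def rank(documents, query)
  tokens = documents.transform_values { |text| tokenize(text) }
  total = documents.size.to_f

  scores = tokens.map do |id, words|
    score = tokenize(query).sum do |term|
      containing = tokens.count { |_, ws| ws.include?(term) }
      next 0.0 if containing.zero? || words.empty?
      tf = words.count(term).to_f / words.size
      idf = Math.log(total / containing) + 1.0
      tf * idf
    end
    [id, score]
  end

  # Drop non-matching documents; best score first, ties broken by id.
  scores.select { |_, score| score > 0 }
        .sort_by { |id, score| [-score, id] }
        .map(&:first)
end

docs = {
  1 => "search search search tips",
  2 => "xapian search engine",
  3 => "search the web"
}

p rank(docs, "xapian search") # => [2, 1, 3]
```

Document 2 comes first even though document 1 repeats "search" more often, because "xapian" occurs in only one document and therefore carries more weight than "search", which occurs in all three.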

Sphinx works by reading information out of the database to build the search index. Communication with the Sphinx server occurs by sharing C “objects” over sockets. A variety of text processing features enable fine-tuning Sphinx for application requirements, and a number of relevance functions ensure that search quality can be tweaked as well. Sunspot is a Ruby library for expressive, powerful interaction with the Solr search engine, a Java search server built on the Lucene search library.

Sunspot provides robust, flexible full text search with no boolean queries and no string programming. Solr servers can be clustered, and since they manage the index, Sunspot can automatically update the index when model objects change. There is no need to run a cron job to reindex the data or to set up delta indexing as with Sphinx.

Thus full text search has come a long way since the early days of Ferret. The incompatibility of Ultrasphinx, once the most preferred option, with Rails 3 resulted in the emergence of Sphinx and Sunspot as favourites. Solr is a compelling alternative to Sphinx, since some of the most scalable web apps (Facebook, Twitter) use Java behind the UI layer. Xapian can be considered the best option whenever ranked probabilistic search is required. Acts As Indexed, written entirely in Ruby, works well and is very easy to implement, with automatic indexing (i.e., no cron jobs are needed to keep the index up to date).