Switching to Solr
This week we have migrated our search engine from Sphinx to Solr. The main reasons for doing this are:
- Too strong of an interdependency between Delayed Job and Thinking Sphinx
- Full-text searching of documents stored in ftopia
Both Sphinx and Solr are full-text search engines. Sphinx has been designed for performance, relevance, and ease of integration. It is written in C++ and runs on most systems. Solr is based on the Lucene engine; its main strengths are its relevance and its extensibility.
Sphinx and Thinking Sphinx
The ruby gem that we have been using with Sphinx is Thinking Sphinx. Thinking Sphinx has many upsides:
- Easy integration of indexing with models
- Many search options and good pagination management
- Can be coupled with Delayed Job in order to index new data
- Fast searching and low memory footprint
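To give an idea of that integration, a Thinking Sphinx index is declared directly in the model. A minimal sketch, assuming the classic `define_index` DSL from the thinking-sphinx gem (the `Post` model and its fields are hypothetical and require a Rails app to actually run):

```ruby
# Hypothetical model; requires the thinking-sphinx gem inside a Rails app.
class Post < ActiveRecord::Base
  define_index do
    # Full-text fields
    indexes title, :sortable => true
    indexes content

    # Attributes available for filtering and sorting
    has blog_id, published_at
  end
end
```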
However, some of these advantages turn out to be not so great when data volume or load increases:
- The engine is too tightly linked to Delayed Job; when using this update mechanism instead of an asynchronous update, all the Delayed Job workers have to run on the same server where Sphinx sits. This is a major obstacle to horizontal scalability and high availability.
- Sphinx cannot filter on strings, so string attributes have to be stored as integers – this is a problem when filtering on textual data that doesn’t change much, such as the name of a class for an STI model, for instance (http://freelancing-god.github.com/ts/en/common_issues.html#string_filters).
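The workaround documented by Thinking Sphinx is to store a CRC32 checksum of the string as an integer attribute and filter on that. A quick illustration using Ruby’s standard library (the class names and the `:class_crc` attribute are just examples, not code from our app):

```ruby
require 'zlib'

# Sphinx filters on integers, not strings, so a string attribute
# such as an STI class name gets stored as its CRC32 checksum.
article_crc = Zlib.crc32('Article')
page_crc    = Zlib.crc32('Page')

# Filtering by class then means comparing checksums, e.g.:
#   Post.search 'pizza', :with => {:class_crc => Zlib.crc32('Article')}
puts article_crc
puts page_crc
```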
Solr and Sunspot
Solr relies on the Apache Lucene engine and is written in Java. This might not sound that cool to a number of coders, but Solr includes scripts and tutorials that make installing the server a breeze. In a dev env, a simple script is enough to boot the server.
The Ruby counterpart for Solr is Sunspot. The pairing is very similar to Sphinx and Thinking Sphinx, but without Sphinx’s main drawbacks:
- Indexing new data is done by sending content directly to the Solr server, whereas Thinking Sphinx stores pending updates in a database before a Delayed Job worker can process them. Performance is similar as long as the Solr server’s response time is equivalent to the database’s, and indexing through the Solr server avoids unnecessary coupling with Delayed Job.
- Stored data are not converted to integers, so Solr can index recurring textual data much more easily.
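With Sunspot, the index is described in a `searchable` block in the model, and documents are sent straight to Solr. A minimal sketch (the `Post` model and its fields are hypothetical; this assumes the sunspot_rails gem in a Rails app):

```ruby
# Hypothetical model; requires the sunspot_rails gem inside a Rails app.
class Post < ActiveRecord::Base
  searchable do
    # Full-text fields
    text :title, :content

    # Typed attributes for filtering, ordering, and faceting
    integer :blog_id
    integer :author_id
    integer :category_ids, :multiple => true
    time    :published_at
  end
end
```

The typed attributes declared here are exactly what the search DSL later filters, orders, and facets on.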
Sunspot also has a nice upside for the developer: since classes are reloaded upon each request, the code that describes the index is executed upon each request too. In our app with Thinking Sphinx, this code takes more than 1000ms to run in a dev env – it’s much faster with Sunspot, about 40ms with the same indexes! In production, classes are cached upon the first execution, therefore there’s almost no difference between the two search engines.
It’s also worth mentioning that Sunspot takes a slightly different approach to writing searches: instead of a traditional method call with search params, Sunspot uses a Ruby block describing an elegant DSL. Much better for code readability:
```ruby
Post.search do
  fulltext 'best pizza'
  with :blog_id, 1
  with(:published_at).less_than Time.now
  order_by :published_at, :desc
  paginate :page => 2, :per_page => 15
  facet :category_ids, :author_id
end
```
Performance and next steps
Compared to Sphinx, Solr’s main drawback is search speed, but Solr’s performance can be greatly improved by clustering Solr servers.
We are much more confident about the scalability of our architecture now that the search engine runs independently of our asynchronous queue management system.
The next step is to exploit more of Solr capabilities and apply full-text indexing to all the textual content stored on ftopia.