Sphinx and Thinking Sphinx

Presentation on basic usage of Sphinx with Thinking Sphinx given to Well Railed on 24th September 2008

Jeremy Olliver

September 24, 2008

Transcript

  1. About Sphinx

    Sphinx is a full-text search engine written in C++.
    It integrates directly with either MySQL or PostgreSQL (or an XML pipe mechanism).
    Sphinx is an acronym for SQL Phrase Index.

    Ultrasphinx is an alternative Rails interface to Sphinx. I chose Thinking Sphinx, which is written by Pat Allan.

    Loosely coupled to Rails

    Sphinx is very loosely coupled with your Rails app. After your initial configuration, all further tasks are run outside of Rails. This means that if your search daemon dies, or in the unlikely event that your indexes become corrupt, only searching will fail in your app.

    The loose coupling exists because Sphinx interacts with your database directly, so any text you want to search needs to be in the database. If you want to search on modified data, it needs to be either the result of an SQL function or stored in the database via a before_save callback.
  2. Installation

    Sphinx: http://www.sphinxsearch.com/downloads.html (Windows binaries and source)

    Thinking Sphinx:

        script/plugin install git://github.com/freelancing-god/thinking-sphinx.git

    Great guide from creator Pat Allan:
    http://ts.freelancing-gods.com/quickstart.html
    http://ts.freelancing-gods.com/usage.html
  3. Define Searchables

        class User < ActiveRecord::Base
          # ...
          belongs_to :role
          belongs_to :address
          has_many :posts

          define_index do
            # fields
            indexes [:first_name, :last_name], :as => :name, :sortable => true
            indexes login, :sortable => true
            indexes email
            indexes role.name, :as => :role
            indexes [
              address.street_address, address.city,
              address.state, address.country, address.postcode
            ], :as => :address
            indexes posts.subject, :as => :post_subjects
            indexes posts.content, :as => :post_contents

            # attributes
            has created_at, role_id
            has posts(:id), :as => :post_ids
          end
          # ...
        end
  4. Define Searchables: Fields

    Fields are indexed text that will be searched using the query you pass to Sphinx. They cannot be used for sorting or filtering. Sphinx just searches for the words you specify and returns the document ids that provided the best matches.

    • Multi-value fields

        indexes [:first_name, :last_name], :as => :name, :sortable => true

      Combines two columns into one searchable field.

    • Association columns

        indexes posts.subject, :as => :post_subjects

      Can go as deep as you like. Make sure you validate the connecting associations.
  5. Define Searchables: Attributes

    Attributes are non-text values assigned to each document in Sphinx. They refine the results that come back from searching the indexed fields, and can be used to do things like sort results and filter out a subset.

        has created_at, role_id
        has posts(:id), :as => :post_ids

    Again, you can use association columns. Here posts(:id) means posts.id; writing it as a method call with a symbol works around the fact that id is already a method Ruby defines on objects.
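
To make the split between fields and attributes concrete, here is a minimal sketch in plain Ruby. It is a toy in-memory model, not how Sphinx or Thinking Sphinx is implemented, and every name in it (ToyIndex, add, search, the sample users) is hypothetical. Field text is broken into words that can be matched; attributes are kept alongside each document id and are only used to filter and sort the matches.

```ruby
# Toy in-memory model of the field/attribute split.
# All names here are hypothetical and exist only for illustration.
class ToyIndex
  def initialize
    @postings   = Hash.new { |hash, word| hash[word] = [] } # word => [document ids]
    @attributes = {}                                         # document id => attributes
  end

  # fields: text that can be searched; attributes: values for filtering/sorting
  def add(id, fields, attributes)
    fields.values.join(" ").downcase.scan(/\w+/).uniq.each do |word|
      @postings[word] << id
    end
    @attributes[id] = attributes
  end

  # Match on field text first, then refine the matches using attributes.
  def search(word, with: {}, order: nil)
    ids = @postings.fetch(word.downcase, [])
    ids = ids.select { |id| with.all? { |name, value| @attributes[id][name] == value } }
    ids = ids.sort_by { |id| @attributes[id][order] } if order
    ids
  end
end

index = ToyIndex.new
index.add(1, { name: "Jeremy Olliver", post_subjects: "Sphinx talk" },
          { role_id: 2, created_at: 20080924 })
index.add(2, { name: "Jeremy Smith", post_subjects: "Rails tips" },
          { role_id: 1, created_at: 20080101 })

index.search("jeremy")                        # both documents match on the name field
index.search("jeremy", with: { role_id: 2 })  # attribute filter keeps document 1 only
index.search("jeremy", order: :created_at)    # sorted by attribute, oldest first
```

Searching for a role_id value as a word finds nothing in this model, because attributes never enter the word index; that mirrors why a field you want to sort by needs :sortable => true.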
  6. Setting Up

        rake thinking_sphinx:configure

    Uses Thinking Sphinx's defaults, then overrides them with anything you define in config/sphinx.yml (the layout of the yml is similar to database.yml, with environments). It also gets the connection information for your database from your database.yml.

        rake thinking_sphinx:index

    Thinking Sphinx checks your database and builds its references to the fields and attributes you specified in your models.

        rake thinking_sphinx:start

    Starts up searchd, the search daemon. Unsurprisingly, there are also :stop and :restart tasks.

    Now we're ready to start searching for content.
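
The "defaults, then overrides per environment" behaviour of the configure step can be sketched in plain Ruby as below. This is an illustration of the idea only, not the plugin's code: the DEFAULTS values, the setting names (port, morphology, pid_file), the file contents and the settings_for helper are all assumptions made for the example.

```ruby
require "yaml"

# Hypothetical defaults; the plugin's real defaults may differ.
DEFAULTS = { "port" => 3312, "morphology" => "stem_en" }

# Example sphinx.yml contents, laid out by environment like database.yml.
# The setting names and paths are assumptions for illustration.
SPHINX_YML = <<~YAML
  development:
    port: 3312
  production:
    port: 3313
    pid_file: /var/run/searchd.pid
YAML

# Start from the defaults, then override with the chosen environment's section.
def settings_for(environment, yaml_text, defaults = DEFAULTS)
  per_environment = YAML.load(yaml_text) || {}
  defaults.merge(per_environment.fetch(environment, {}))
end

production = settings_for("production", SPHINX_YML)
# production keeps the default morphology but takes port and pid_file from the file
```

An environment with no section in the file simply ends up with the defaults.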
  7. Searching

    Model Specific Search

        User.search "Jeremy", :order => :name
        => [#<User: id: 1, name: "Jeremy", email: "[email protected]">]

    You can only sort by something that is either an attribute or a field marked :sortable => true (which adds an attribute in the back end).

    You can use :conditions => "account_id is not null" just like any SQL query, but only when performing a single-model search like this; otherwise Thinking Sphinx won't know what columns are available.
  8. Searching

    Multi-model Search

        ThinkingSphinx::Search.search "Jeremy", :without => { :role => nil }

    Returns a collection containing matches from all models. Don't use :conditions => "string" here. For filtering, use:

        :with => { :account_id => [2, 3] }, :without => { :role => nil }

    The :with and :without options expect a hash of attributes used to filter the results down to a subset. Multi-model filtering works best with attributes that are defined for all models.
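
The semantics of :with and :without can be sketched in plain Ruby as follows. This models the filtering on an array of attribute hashes and does not call Thinking Sphinx; matches_value?, filter_documents and the sample documents are hypothetical names and data. A document is kept when every :with attribute matches (an array means "any of these values") and no :without attribute matches.

```ruby
# Plain-Ruby model of :with / :without filtering; not Thinking Sphinx code.
def matches_value?(actual, expected)
  case expected
  when Array, Range then expected.include?(actual) # any of several values
  else actual == expected                          # a single value
  end
end

def filter_documents(documents, with: {}, without: {})
  documents.select do |attributes|
    with.all? { |name, expected| matches_value?(attributes[name], expected) } &&
      without.none? { |name, excluded| matches_value?(attributes[name], excluded) }
  end
end

# Hypothetical documents, each reduced to its attributes.
documents = [
  { id: 1, account_id: 2, role: "admin" },
  { id: 2, account_id: 3, role: nil },
  { id: 3, account_id: 7, role: "member" }
]

kept = filter_documents(documents,
                        with:    { account_id: [2, 3] },
                        without: { role: nil })
kept_ids = kept.map { |attributes| attributes[:id] }  # => [1]
```

Document 2 passes the :with filter but is dropped by :without, and document 3 fails the :with filter, which is why only document 1 remains.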
  9. Customisation

    Pagination

    All search results are paginated by default. You don't really want all of the thousands of results loaded in by your controller or displayed in your view. Integrates seamlessly with the will_paginate plugin:

        @results = User.search "Jeremy", :page => params[:page], :per_page => 10
        <%= will_paginate @results %>

    Tweaking fields and attributes

    If it takes you a while to decide what should or shouldn't be indexed, don't forget: you'll need to run rake thinking_sphinx:index to rebuild your indexes, and then rake thinking_sphinx:restart to restart searchd, so it'll know about the changes.
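
The arithmetic behind :page and :per_page is simple enough to sketch in plain Ruby. The page_window helper below is hypothetical and only illustrates how a page number maps onto an offset and a limit; it is not part of Thinking Sphinx or will_paginate.

```ruby
# Hypothetical helper showing the arithmetic behind :page / :per_page.
def page_window(total_entries, page, per_page)
  page        = [page.to_i, 1].max                   # params[:page] may be nil or a string
  offset      = (page - 1) * per_page                # results to skip
  total_pages = (total_entries.to_f / per_page).ceil # last page may be partly filled
  { offset: offset, limit: per_page, total_pages: total_pages }
end

page_window(1000, "3", 10)  # => { offset: 20, limit: 10, total_pages: 100 }
page_window(95, nil, 10)    # => { offset: 0, limit: 10, total_pages: 10 }
```

Only the ten results in the requested window need to be loaded by the controller, however many documents matched.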
  10. Customisation

    File locations

    Thinking Sphinx puts all its generated indexes, config files and pids in various places inside the app's root directory. If you're using Capistrano or something similar for deployment, you may want to set custom settings in sphinx.yml to change the location of these (especially the pid for searchd).

    Indexing of Characters

    You can ignore characters, which makes Sphinx simply skip over them in its indexing as if they weren't there. You can also define a custom character set, which is useful when working with non-English characters, or to define umlaut equivalents (a, A, å, Å...). Leave characters out of the character set to treat them as delimiters, for example making a character like '_' essentially a space, so you can find results for "my_file.png" by searching for either "my" or "file".
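
The difference between ignored characters and delimiters can be sketched with a toy tokenizer in plain Ruby. This is only a model of the behaviour described on the slide, not Sphinx's tokenizer; the tokenize function and its charset and ignore parameters are hypothetical names. Characters in the character set build up words, ignored characters are dropped as if they were never there, and anything else ends the current word.

```ruby
# Toy tokenizer; a model of the behaviour, not Sphinx's implementation.
def tokenize(text, charset: /[a-z0-9]/, ignore: "")
  words = [String.new]
  text.downcase.each_char do |char|
    if ignore.include?(char)
      next                    # ignored: skipped as if it were not there
    elsif char =~ charset
      words.last << char      # in the character set: part of the current word
    else
      words << String.new     # anything else is a delimiter: start a new word
    end
  end
  words.reject(&:empty?)
end

tokenize("my_file.png")                        # => ["my", "file", "png"]
tokenize("my_file.png", charset: /[a-z0-9_]/)  # => ["my_file", "png"]
tokenize("e-mail", ignore: "-")                # => ["email"]
```

With '_' left out of the character set, a search for "my" or "file" can match "my_file.png"; with '_' included, only the whole word "my_file" is indexed.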
  11. Stemming

    Stemming is set to stem_en by default for Thinking Sphinx. Stemming strips all terms back to their base English form:

        'Swimming' => 'swim', 'swims' => 'swim'

    This works wonders for matching singular/plural and alternate word forms to the same set of results.

    Watch out for exceptions:

        'Business' => 'busy'

    An incorrectly stemmed word won't stop you finding what you want. Instead, searches for "Business" and "Busy" will return the same set of results, so you may get irrelevant results in these few cases.

    Stemming can be turned off, which will result in literal matches.
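
Why stemming makes different word forms return the same results can be shown with a deliberately tiny sketch in plain Ruby. The toy_stem function below is hypothetical and handles only two suffix rules; Sphinx's stem_en is far more thorough, and this toy does not reproduce the 'Business'/'busy' collision mentioned above. The point is only that indexed words and query words go through the same function, so they meet at the same key.

```ruby
# Deliberately tiny suffix stripper, for illustration only.
def toy_stem(word)
  stem = word.downcase
  if stem.length > 5 && stem.end_with?("ing")
    stem = stem[0...-3]                              # swimming -> swimm
    stem = stem[0...-1] if stem =~ /([^aeiou])\1\z/  # swimm -> swim
  elsif stem.length > 3 && stem.end_with?("s") && !stem.end_with?("ss")
    stem = stem[0...-1]                              # swims -> swim
  end
  stem
end

# Index time: every word is stored under its stem.
stem_index = Hash.new { |hash, key| hash[key] = [] }
{ 1 => "Swimming", 2 => "swims", 3 => "swim" }.each do |id, word|
  stem_index[toy_stem(word)] << id
end

# Query time: the search term is stemmed the same way before lookup.
stem_index[toy_stem("SWIMS")]  # => [1, 2, 3]
```

Turning stemming off corresponds to using the lowercased word itself as the key, in which case "swims" would only find document 2.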