Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Postgres Full Text Search Is Good Enough

Avatar for Rach Belaid Rach Belaid
January 21, 2015

Postgres Full Text Search Is Good Enough

Introduction to the full text search features of Postgres to build a small search engine

Avatar for Rach Belaid

Rach Belaid

January 21, 2015
Tweet

More Decks by Rach Belaid

Other Decks in Technology

Transcript

  1. About me • I m not a DBA but an

    application developer • Co-founder of Lost Property (service company) • Python developer fiddling with Erlang and Haskell • Love using Postgres • Running the Pyramid London Meetup Twitter: @rachbelaid
  2. Why full-text search? The magnifying glass is something that we

    now add to wireframes without even knowing yet what we are going to search.
  3. Existing Solutions • Elasticsearch & SOLR : solution base on

    Lucene • Xapian: C++ with binding in most languages • Sphinx: Well integrated with RDMS
  4. Search Core Concepts • Document based • Never just looking

    through a string • Stemming and lexemes • N-gram • Relevance
  5. Terminology • Engine • Documents (what we are searching in)

    • Stopword (eg: ‘and’, ‘a’, ‘the’, ‘but’) • Stemming (Finding the root of a word) • Relevance (Algo used to rank the results of the query) • Boost (enhancing the relevance of certains document based on condition) • Position (Position of a token inside the document)
  6. Steps • Building our search document • Tokenize and stemming

    the documents • Build query to find our document
  7. Build our document SELECT post.title || ' ' || post.content

    || ' ' || author.name || ' ' || coalesce((string_agg(tag.name, ' ')), '') as document FROM post JOIN author ON author.id = post.author_id JOIN posts_tags ON posts_tags.post_id = posts_tags.tag_id JOIN tag ON tag.id = posts_tags.tag_id GROUP BY post.id;
  8. Tokenization: tsvector Let's try to convert a simple string into

    a tsvector. ! SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value'); ! Result: ! to_tsvector ---------------------------------------------------------- 'becom':4,13 'man':6,15 'rather':10 'success':8 'tri':1,11 'valu':17 ! ! • less words than in the original sentence, • some of the words are different (try became tri) • they are all followed by numbers. (position)
  9. Tokenization: tsvector A tsvector value is a sorted list of

    distinct lexemes which are words that have been normalized to make variants of the same word look alike. ! For example: ! - folding upper-case letters to lower-case - removal of suffixes (such as 's', 'es' or 'ing' in English). ! This allows searches to find variant forms of the same word without tediously entering all the possible variants. ! The numbers represent the position of the lexeme in the original string.
  10. Build Tokenized Documents CREATE VIEW documents AS( SELECT post.id as

    post_id,
 to_tsvector(post.title) || to_tsvector(post.content) || to_tsvector(author.name) || to_tsvector(coalesce((string_agg(tag.name, ' ‘)), ‘')) as content FROM post JOIN author ON author.id = post.author_id JOIN posts_tags ON posts_tags.post_id = posts_tags.tag_id JOIN tag ON tag.id = posts_tags.tag_id GROUP BY post.id );
  11. Querying SELECT to_tsvector('If you can dream it, you can do

    it') @@ ‘dream' as result; ! result ---------- t ! SELECT to_tsvector('It''s kind of fun to do the impossible') @@ ‘impossible' as result; ! result ---------- f
  12. Querying The second query returns false because we need to

    build a tsquery which creates the same lexemes and, using the operator @@, casts the string into a tsquery. SELECT ‘impossible’::tsquery, 
 to_tsquery('impossible'); tsquery | to_tsquery
 -------------+------------
 'impossible' | 'imposs'
  13. Querying we will use to_tsquery for querying documents ! SELECT

    to_tsvector('It''s kind of fun to do the impossible') @@ to_tsquery(’impossible’) as result; ! result ---------- t
  14. Querying A tsquery value stores lexemes that are to be

    searched for, and combines them honoring the Boolean operators: & (AND) | (OR) ! (NOT). ! Parentheses can be used to enforce grouping of the operators
  15. Querying SELECT to_tsvector('If the facts don't fit the theory, change

    the facts') @@ to_tsquery('! fact’); -> FALSE ! SELECT to_tsvector('If the facts don''t fit the theory, change the facts') @@ to_tsquery('theory & !fact’); -> FALSE ! SELECT to_tsvector('If the facts don''t fit the theory, change the facts.') @@ to_tsquery('fiction | theory’); -> TRUE
  16. Querying our document SELECT post.id, post.title FROM documents JOIN post

    ON post.id = document.post_id WHERE document.content @@ to_tsquery('Endangered & Species');
  17. Language support Postgres provides built-in text search for many languages:

    
 
 Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portugese, Romanian, Russian, Spanish, Swedish, Turkish
 ...and extensions for others.
  18. Language support: Stemming SELECT to_tsvector('english', 'We are running'); ‘run':3 !

    SELECT to_tsvector('french', 'We are running'); 'are':2 'running':3 'we':1
  19. CREATE VIEW documents AS( SELECT post.id as post_id,
 to_tsvector(post.language::regconfig, post.title)

    || to_tsvector(post.language::regconfig, post.content) || to_tsvector(‘simple’, author.name) || to_tsvector( ‘simple’,
 coalesce((string_agg(tag.name, ‘’)), ‘')
 ) as content FROM post JOIN author ON author.id = post.author_id JOIN posts_tags ON posts_tags.post_id = posts_tags.tag_id JOIN tag ON tag.id = posts_tags.tag_id GROUP BY post.id ); Language support: 
 Build new documents
  20. Accented Character Supporting query of document in different languages require

    to normalise accents.
 Eg: difficult to search for french accent with an english keyboard. ! CREATE EXTENSION unaccent;
 SELECT unaccent('èéêë');
 
 unaccent
 ----------
 eeee
 (1 row)
  21. Custom dictionary We can unaccent our document or create a

    new dictionary: CREATE TEXT SEARCH CONFIGURATION fr ( COPY = french );
 ALTER TEXT SEARCH CONFIGURATION fr ALTER MAPPING
 FOR hword, hword_part, word WITH unaccent, french_stem;
 
 When we are using this new text search config, we can see the lexemes: SELECT to_tsvector('french', 'il était une fois');
 to_tsvector
 -------------
 'fois':4
 SELECT to_tsvector('fr', 'il était une fois');
 to_tsvector
 ——————————
 'etait':2 'fois':4
  22. Ranking / Boost • When you build a search engine

    you want to be able to get search results ordered by relevance. • To order our results by revelance PostgreSQL provides a few functions but we will be using only 2 of them : ts_rank() and setweight(). • The function setweight() allows us to assign a weight value to a tsvector(); the value can be 'A', 'B', 'C' or 'D'
  23. Ranking / Boost: Relevancy SELECT ts_rank(to_tsvector('This is an example of

    document'), to_tsquery('example | document')) as relevancy; -> 0.0607927
 SELECT ts_rank(to_tsvector('This is an example of document'), to_tsquery('example | unkown')) as relevancy; -> 0.0303964
 SELECT ts_rank(to_tsvector('This is an example of document'), to_tsquery('example & document')) as relevancy; -> 0.0985009 
 SELECT ts_rank(to_tsvector('This is an example of document'), to_tsquery('example & unknown')) as relevancy; -> 1e-20
  24. Ranking / Boost: Documents CREATE VIEW documents AS( SELECT post.id

    as post_id, post.title as title, setweight(to_tsvector(post.language::regconfig, 
 post.title), 'A') || setweight(to_tsvector(post.language::regconfig, post.content), 'B') || setweight(to_tsvector('simple', author.name), 'C') || setweight(to_tsvector(‘simple', coalesce(string_agg(tag.name, ' '))), 'B') as document FROM post JOIN author ON author.id = post.author_id JOIN posts_tags ON posts_tags.post_id = posts_tags.tag_id JOIN tag ON tag.id = posts_tags.tag_id GROUP BY post.id );
  25. Ranking / Boost: Querying SELECT post_id, title FROM documents WHERE

    
 documents.content @@ to_tsquery(‘english', QUERY ) ORDER BY 
 ts_rank(document.content, to_tsquery(‘english', QUERY)) DESC; ! ! QUERY = - ‘Endangered & Species’ - ‘Endangered | Species’
  26. Indexing: single table document PostgreSQL supports function based indexes so

    you can simply create a GIN index around the tsvector() function. (one table only) CREATE INDEX idx_fts_post 
 ON post 
 USING gin( 
 setweight(
 to_tsvector(language, title),’A’)||
 setweight(
 to_tsvector(language, content),'B')
 );
  27. Indexing: 
 multiple tables document In our schema example; the

    document is spread accross multiple tables with different weights. 
 For a better performance it's necessary to denormalize the data : • MATERIALIZED VIEW (if can have indexing delay) • TRIGGERS • Application logic • ?? suggestions ??
  28. Compare to ES/SOLR Keep in mind that FTS is a

    feature of Postgres but ElasticSearch and SOLR are tool dedicated to do FTS: • PG required to have the tsquery build properly where ES make it easier to do query with custom analyser on part of document • PG make it easier to not think about data sync or canonical datastore problem • ES offers great languages (and better out of the box) than PG. Lucene • PG is rock solid and easy to run • ES better unicode case folding