Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Postgres Full Text Search Is Good Enough

Rach Belaid
January 21, 2015

Postgres Full Text Search Is Good Enough

Introduction to the full text search features of Postgres to build a small search engine

Rach Belaid

January 21, 2015
Tweet

More Decks by Rach Belaid

Other Decks in Technology

Transcript

  1. About me • I m not a DBA but an

    application developer • Co-founder of Lost Property (service company) • Python developer fiddling with Erlang and Haskell • Love using Postgres • Running the Pyramid London Meetup Twitter: @rachbelaid
  2. Why full-text search? The magnifying glass is something that we

    now add to wireframes without even knowing yet what we are going to search.
  3. Existing Solutions • Elasticsearch & SOLR : solution base on

    Lucene • Xapian: C++ with binding in most languages • Sphinx: Well integrated with RDMS
  4. Search Core Concepts • Document based • Never just looking

    through a string • Stemming and lexemes • N-gram • Relevance
  5. Terminology • Engine • Documents (what we are searching in)

    • Stopword (eg: ‘and’, ‘a’, ‘the’, ‘but’) • Stemming (Finding the root of a word) • Relevance (Algo used to rank the results of the query) • Boost (enhancing the relevance of certains document based on condition) • Position (Position of a token inside the document)
  6. Steps • Building our search document • Tokenize and stemming

    the documents • Build query to find our document
  7. Build our document SELECT post.title || ' ' || post.content

    || ' ' || author.name || ' ' || coalesce((string_agg(tag.name, ' ')), '') as document FROM post JOIN author ON author.id = post.author_id JOIN posts_tags ON posts_tags.post_id = posts_tags.tag_id JOIN tag ON tag.id = posts_tags.tag_id GROUP BY post.id;
  8. Tokenization: tsvector Let's try to convert a simple string into

    a tsvector. ! SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value'); ! Result: ! to_tsvector ---------------------------------------------------------- 'becom':4,13 'man':6,15 'rather':10 'success':8 'tri':1,11 'valu':17 ! ! • less words than in the original sentence, • some of the words are different (try became tri) • they are all followed by numbers. (position)
  9. Tokenization: tsvector A tsvector value is a sorted list of

    distinct lexemes which are words that have been normalized to make variants of the same word look alike. ! For example: ! - folding upper-case letters to lower-case - removal of suffixes (such as 's', 'es' or 'ing' in English). ! This allows searches to find variant forms of the same word without tediously entering all the possible variants. ! The numbers represent the position of the lexeme in the original string.
  10. Build Tokenized Documents CREATE VIEW documents AS( SELECT post.id as

    post_id,
 to_tsvector(post.title) || to_tsvector(post.content) || to_tsvector(author.name) || to_tsvector(coalesce((string_agg(tag.name, ' ‘)), ‘')) as content FROM post JOIN author ON author.id = post.author_id JOIN posts_tags ON posts_tags.post_id = posts_tags.tag_id JOIN tag ON tag.id = posts_tags.tag_id GROUP BY post.id );
  11. Querying SELECT to_tsvector('If you can dream it, you can do

    it') @@ ‘dream' as result; ! result ---------- t ! SELECT to_tsvector('It''s kind of fun to do the impossible') @@ ‘impossible' as result; ! result ---------- f
  12. Querying The second query returns false because we need to

    build a tsquery which creates the same lexemes and, using the operator @@, casts the string into a tsquery. SELECT ‘impossible’::tsquery, 
 to_tsquery('impossible'); tsquery | to_tsquery
 -------------+------------
 'impossible' | 'imposs'
  13. Querying we will use to_tsquery for querying documents ! SELECT

    to_tsvector('It''s kind of fun to do the impossible') @@ to_tsquery(’impossible’) as result; ! result ---------- t
  14. Querying A tsquery value stores lexemes that are to be

    searched for, and combines them honoring the Boolean operators: & (AND) | (OR) ! (NOT). ! Parentheses can be used to enforce grouping of the operators
  15. Querying SELECT to_tsvector('If the facts don't fit the theory, change

    the facts') @@ to_tsquery('! fact’); -> FALSE ! SELECT to_tsvector('If the facts don''t fit the theory, change the facts') @@ to_tsquery('theory & !fact’); -> FALSE ! SELECT to_tsvector('If the facts don''t fit the theory, change the facts.') @@ to_tsquery('fiction | theory’); -> TRUE
  16. Querying our document SELECT post.id, post.title FROM documents JOIN post

    ON post.id = document.post_id WHERE document.content @@ to_tsquery('Endangered & Species');
  17. Language support Postgres provides built-in text search for many languages:

    
 
 Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portugese, Romanian, Russian, Spanish, Swedish, Turkish
 ...and extensions for others.
  18. Language support: Stemming SELECT to_tsvector('english', 'We are running'); ‘run':3 !

    SELECT to_tsvector('french', 'We are running'); 'are':2 'running':3 'we':1
  19. CREATE VIEW documents AS( SELECT post.id as post_id,
 to_tsvector(post.language::regconfig, post.title)

    || to_tsvector(post.language::regconfig, post.content) || to_tsvector(‘simple’, author.name) || to_tsvector( ‘simple’,
 coalesce((string_agg(tag.name, ‘’)), ‘')
 ) as content FROM post JOIN author ON author.id = post.author_id JOIN posts_tags ON posts_tags.post_id = posts_tags.tag_id JOIN tag ON tag.id = posts_tags.tag_id GROUP BY post.id ); Language support: 
 Build new documents
  20. Accented Character Supporting query of document in different languages require

    to normalise accents.
 Eg: difficult to search for french accent with an english keyboard. ! CREATE EXTENSION unaccent;
 SELECT unaccent('èéêë');
 
 unaccent
 ----------
 eeee
 (1 row)
  21. Custom dictionary We can unaccent our document or create a

    new dictionary: CREATE TEXT SEARCH CONFIGURATION fr ( COPY = french );
 ALTER TEXT SEARCH CONFIGURATION fr ALTER MAPPING
 FOR hword, hword_part, word WITH unaccent, french_stem;
 
 When we are using this new text search config, we can see the lexemes: SELECT to_tsvector('french', 'il était une fois');
 to_tsvector
 -------------
 'fois':4
 SELECT to_tsvector('fr', 'il était une fois');
 to_tsvector
 ——————————
 'etait':2 'fois':4
  22. Ranking / Boost • When you build a search engine

    you want to be able to get search results ordered by relevance. • To order our results by revelance PostgreSQL provides a few functions but we will be using only 2 of them : ts_rank() and setweight(). • The function setweight() allows us to assign a weight value to a tsvector(); the value can be 'A', 'B', 'C' or 'D'
  23. Ranking / Boost: Relevancy SELECT ts_rank(to_tsvector('This is an example of

    document'), to_tsquery('example | document')) as relevancy; -> 0.0607927
 SELECT ts_rank(to_tsvector('This is an example of document'), to_tsquery('example | unkown')) as relevancy; -> 0.0303964
 SELECT ts_rank(to_tsvector('This is an example of document'), to_tsquery('example & document')) as relevancy; -> 0.0985009 
 SELECT ts_rank(to_tsvector('This is an example of document'), to_tsquery('example & unknown')) as relevancy; -> 1e-20
  24. Ranking / Boost: Documents CREATE VIEW documents AS( SELECT post.id

    as post_id, post.title as title, setweight(to_tsvector(post.language::regconfig, 
 post.title), 'A') || setweight(to_tsvector(post.language::regconfig, post.content), 'B') || setweight(to_tsvector('simple', author.name), 'C') || setweight(to_tsvector(‘simple', coalesce(string_agg(tag.name, ' '))), 'B') as document FROM post JOIN author ON author.id = post.author_id JOIN posts_tags ON posts_tags.post_id = posts_tags.tag_id JOIN tag ON tag.id = posts_tags.tag_id GROUP BY post.id );
  25. Ranking / Boost: Querying SELECT post_id, title FROM documents WHERE

    
 documents.content @@ to_tsquery(‘english', QUERY ) ORDER BY 
 ts_rank(document.content, to_tsquery(‘english', QUERY)) DESC; ! ! QUERY = - ‘Endangered & Species’ - ‘Endangered | Species’
  26. Indexing: single table document PostgreSQL supports function based indexes so

    you can simply create a GIN index around the tsvector() function. (one table only) CREATE INDEX idx_fts_post 
 ON post 
 USING gin( 
 setweight(
 to_tsvector(language, title),’A’)||
 setweight(
 to_tsvector(language, content),'B')
 );
  27. Indexing: 
 multiple tables document In our schema example; the

    document is spread accross multiple tables with different weights. 
 For a better performance it's necessary to denormalize the data : • MATERIALIZED VIEW (if can have indexing delay) • TRIGGERS • Application logic • ?? suggestions ??
  28. Compare to ES/SOLR Keep in mind that FTS is a

    feature of Postgres but ElasticSearch and SOLR are tool dedicated to do FTS: • PG required to have the tsquery build properly where ES make it easier to do query with custom analyser on part of document • PG make it easier to not think about data sync or canonical datastore problem • ES offers great languages (and better out of the box) than PG. Lucene • PG is rock solid and easy to run • ES better unicode case folding