Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Adequate Full Text Search
Search
Sponsored
·
SiteGround - Reliable hosting with speed, security, and support you can count on.
→
Florian Gilcher
November 25, 2014
Programming
1
160
Adequate Full Text Search
given at Elasticsearch UG in November 2014
Florian Gilcher
November 25, 2014
Tweet
Share
More Decks by Florian Gilcher
See All by Florian Gilcher
A new contract with users
skade
1
520
Using Rust to interface with my dive computer
skade
0
280
async/.await with async-std
skade
1
800
Training Rust
skade
1
130
Internet of Streams - IoT in Rust
skade
0
110
How DevRel is failing communities
skade
0
110
The power of the where clause
skade
0
670
Three Years of Rust
skade
1
210
Rust as a CLI language
skade
1
230
Other Decks in Programming
See All in Programming
IFSによる形状設計/デモシーンの魅力 @ 慶應大学SFC
gam0022
1
310
Fluid Templating in TYPO3 14
s2b
0
130
dchart: charts from deck markup
ajstarks
3
1k
Data-Centric Kaggle
isax1015
2
780
Package Management Learnings from Homebrew
mikemcquaid
0
230
コマンドとリード間の連携に対する脅威分析フレームワーク
pandayumi
1
470
OCaml 5でモダンな並列プログラミングを Enjoyしよう!
haochenx
0
150
AIで開発はどれくらい加速したのか?AIエージェントによるコード生成を、現場の評価と研究開発の評価の両面からdeep diveしてみる
daisuketakeda
1
2.5k
CSC307 Lecture 07
javiergs
PRO
1
560
AIによる開発の民主化を支える コンテキスト管理のこれまでとこれから
mulyu
3
510
AI Agent の開発と運用を支える Durable Execution #AgentsInProd
izumin5210
7
2.3k
24時間止められないシステムを守る-医療ITにおけるランサムウェア対策の実際
koukimiura
1
130
Featured
See All Featured
AI Search: Where Are We & What Can We Do About It?
aleyda
0
7k
The AI Revolution Will Not Be Monopolized: How open-source beats economies of scale, even for LLMs
inesmontani
PRO
3
3.1k
技術選定の審美眼(2025年版) / Understanding the Spiral of Technologies 2025 edition
twada
PRO
117
110k
What the history of the web can teach us about the future of AI
inesmontani
PRO
1
440
実際に使うSQLの書き方 徹底解説 / pgcon21j-tutorial
soudai
PRO
196
71k
Reflections from 52 weeks, 52 projects
jeffersonlam
356
21k
Rebuilding a faster, lazier Slack
samanthasiow
85
9.4k
Leading Effective Engineering Teams in the AI Era
addyosmani
9
1.6k
The Curse of the Amulet
leimatthew05
1
8.7k
職位にかかわらず全員がリーダーシップを発揮するチーム作り / Building a team where everyone can demonstrate leadership regardless of position
madoxten
58
50k
Build your cross-platform service in a week with App Engine
jlugia
234
18k
HU Berlin: Industrial-Strength Natural Language Processing with spaCy and Prodigy
inesmontani
PRO
0
230
Transcript
None
$ cat .profile GIT_AUTHOR_NAME=Florian Gilcher
[email protected]
TM_COMPANY=Asquera GmbH TWITTER_HANDLE=argorak GITHUB_HANDLE=skade
• Backend developer • Focused on infrastructure and databases
• Elasticsearch Usergroup • mrgn.in meetup • Rust Usergroup (co-org)
• organizer alumni eurucamp • organizer alumni JRubyConf.EU • Ruby Berlin board member
Adequate Full Text Search
The evaluation problem
Given almost no time and an unknown problem space, how
do I evaluate "fitness for purpose"?
You can't
Given almost no time and only a glimpse of the
problem space, how do I evaluate "fitness for purpose"?
How much of a glimpse do I need?
In this talk, I’ll present: • a solution unfit for
purpose • a solution fit for purpose, but only in cer- tain boundaries • a comparison to a fully fledged solution
To the daily practitioners: I’ll gloss over a lot of
points.
• Elasticsearch • PostgreSQL • MongoDB
Issue 1 Search systems are not binary. Faults in the
system degrade the quality of the system, rarely break it.
Issue 2 Full text searchers are far more focused on
inputs then on output.
Building Block 1 An inverted index
doc id content 0 "Überlin ist auf Twitter" 1 "Ich
bin auf Twitter" 2 "Ich folge Überlin"
terms document ids uberlin 0,2 twitter 0,1 bin 1 ich
1,2 auf 0,1
Initial search rules are easy: if one or more of
the terms to the left is searched for, find the document that matches. Count the matches.
Building Block 2 Textual input
Full text searchers generally work on real world text. Get
hold of as many samples as possible. If necessary, write some on your own.
Don’t use an random generator. Or spend your next weeks
writing a sophisticated one.
Your system should bring capabilities handling real world text.
Analysis
Analysis determines which terms end up at the left side
of the table in the first place.
analysis result "ich folge Überlin" whitespace "ich" "folge" "Überlin" lowercase
"ich" "folge" "überlin" normalize "ich" "folge" "uberlin" stemming "ich" "folg" "uberlin"
analysis result "ich folge ueberlin" whitespace "ich" "folge" "ueberlin" lowercase
"ich" "folge" "ueberlin" normalize "ich" "folge" "ueberlin" stemming "ich" "folg" "uberlin"
This step happens both on indexing and queries.
Manipulating analysis is the basis for manipulating matches.
Can I manipulate analysis?
MongoDB Only choose between language presets PostgreSQL Analysis happens through
normal PL/SQL functions Elasticsearch Analyser configura- tion with a wide vari- ety of choice
Ü
Does your system comfortably speak Unicode?
doc id field value 1 Test 2 test 3 Überlin
token doc ids test 1,2 uberlin 3
MongoDB
search term no. matches Test 2 test 2 Überlin 1
überlin 0
token doc ids test 1,2 Überlin 3
input result überlin überlin Überlin Überlin
MongoDB fails at the simplest case, lowercasing german umlauts, in
german settings.
The exact analysis behaviour is not user-controllable, for simplicities sake.
The suggestion is to preprocess yourself.
None
Further down the Unicode
How well does you system handle "creative" codes?
"\u0055\u0308" "\u0075\u0308"
"\u0055\u0308" #=> Ü "\u0075\u0308" #=> ü
PostgreSQL
postgres=# SELECT unaccent(U&’\0075\0308’); unaccent ———- ü (1 row)
PostgreSQL handles UCS-2 level 1, not UTF.
No combining chars.
“ we should really reject combining chars, but can’t do
that w/o breaking BC.”
sigh, Software
If you use PostgreSQL and text manipulation, you probably have
a bug in the hiding there.
UCS-2 for all textual data is a doable constraint, though.
input result überlin überlin Überlin überlin \u0055 \u0308 Invalid input
\u0075 \u0308 Invalid input
Elasticsearch
Elasticsearch can handle all those cases and then some, using
the analysis-icu plugin.
Install it and use it.
curl -XGET ’localhost:9200/_analyze?\ tokenizer=\ icu_tokenizer\ &token_filters=\ icu_folding,icu_normalizer’\ -d ’Überlin’
input result überlin uberlin Überlin uberlin \u0055 \u0308 uberlin \u0075
\u0308 uberlin
The way the system supports you in safely inserting textual
input is of paramount importance!
Find the worst shenanegans of you language, try it out.
l’elision, c’est magnifique
Building Block 3 Scoring
Search is all about relevance and combinations thereof.
Was the match in the title or the body of
a document?
How many options do I have?
All three systems can weight matches on fields differently.
When can I decide those weights?
database index time query time MongoDB yes no PostgreSQL yes
no Elasticsearch yes yes
Weights during index time need a rebuild of the index
every time you change them.
If in doubt, choose query time weights.
Can I influence the scoring/ranking further?
database MongoDB no PostgreSQL yes, using PL/SQL functions Elasticsearch yes,
in many fashions (geo, distance, etc.)
Building Block 4 Documentation
I glossed over a lot of details.
How well is the process documented, internally and interface-wise?
database interface internal MongoDB good almost non-existent PostgreSQL great great
Elasticsearch great great
Can I grow beyond?
And this is where the fun starts and we stop.
What’s adequate?
• Allows to manipulate analysis • Assists with real world
input • Allows you to build combined, extensible queries • Good documentation
MongoDB is not fit for purpose with holes that can
only be fixed by careful preparation of that data.
That preparation needs lots of detail knowledge you probably don’t
want to aquire.
PostgreSQL is adequate and in the PostgreSQL tradition of stable,
well-documented features. It doesn’t win prices, but is workable and reliable.
A good solution if search is just a bystander. A
thousand times better than LIKE.
Elasticsearch is based on Lucene and comes with all the
goodies and also has great documentations and guides.
If search is at the core of your product, use
a proper search engine.
References on the meetup group tomorrow.
Thank you!
None
COURSES
Elasticsearch for managers: http://esmanagers2014.asquera.de/
None
December 2nd
Getting started workshop: http://purchases.elastic- search.com/class/elasticsearch/elk-work- shop/berlin-germany/2014-12-15
None
December 15th