Given almost no time and an
unknown problem space, how do
I evaluate "fitness for purpose"?
Slide 8
Slide 8 text
You can't
Slide 9
Slide 9 text
Given almost no time and only a
glimpse of the problem space, how
do I evaluate "fitness for purpose"?
Slide 10
Slide 10 text
How much of a glimpse do I need?
Slide 11
Slide 11 text
In this talk, I’ll present:
• a solution unfit for purpose
• a solution fit for purpose, but only within certain boundaries
• a comparison to a fully fledged solution
Slide 12
Slide 12 text
To the daily practitioners: I’ll
gloss over a lot of points.
Slide 13
Slide 13 text
• Elasticsearch
• PostgreSQL
• MongoDB
Slide 14
Slide 14 text
Issue 1
Search systems are not binary.
Faults in the system degrade the
quality of the system, rarely break it.
Slide 15
Slide 15 text
Issue 2
Full-text searchers are far more
focused on inputs than on output.
Slide 16
Slide 16 text
Building Block 1
An inverted index
Slide 17
Slide 17 text
doc id content
0 "Überlin ist auf Twitter"
1 "Ich bin auf Twitter"
2 "Ich folge Überlin"
Slide 18
Slide 18 text
terms document ids
uberlin 0,2
twitter 0,1
bin 1
ich 1,2
auf 0,1
Slide 19
Slide 19 text
Initial search rules are easy: if one or more of
the terms on the left is searched for, find the
documents that match. Count the matches.
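The table and the search rule above can be sketched in a few lines of Python. This is only an illustration, not how any of the three systems implements it: the `analyze` function (lowercase, strip accents, split on whitespace) and the names `build_index` and `search` are assumptions made for this example. Unlike the slide, the sketch also indexes "ist" and "folge".

```python
import unicodedata
from collections import Counter, defaultdict

def analyze(text):
    """Assumed minimal analysis: lowercase, strip accents, split on whitespace."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return folded.split()

def build_index(docs):
    """Map every term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, content in docs.items():
        for term in analyze(content):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Count, per document, how many query terms match; best match first."""
    hits = Counter()
    for term in analyze(query):
        for doc_id in sorted(index.get(term, ())):
            hits[doc_id] += 1
    return hits.most_common()

docs = {
    0: "Überlin ist auf Twitter",
    1: "Ich bin auf Twitter",
    2: "Ich folge Überlin",
}
index = build_index(docs)

print(sorted(index["uberlin"]))           # [0, 2]
print(sorted(index["twitter"]))           # [0, 1]
print(search(index, "Überlin Twitter"))   # document 0 ranks first with 2 matches
```

Counting matches is the crudest possible ranking. It is enough to show that a query never touches the documents themselves, only the terms that analysis put into the index.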
Slide 20
Slide 20 text
Building Block 2
Textual input
Slide 21
Slide 21 text
Full text searchers generally work on real world
text. Get hold of as many samples as possible.
If necessary, write some on your own.
Slide 22
Slide 22 text
Don’t use a random generator. Otherwise you will spend
the next few weeks writing a sophisticated one.
Slide 23
Slide 23 text
Your system should bring capabilities
for handling real-world text.
Slide 24
Slide 24 text
Analysis
Slide 25
Slide 25 text
Analysis determines which terms end up on
the left side of the table in the first place.
Manipulating analysis is the
basis for manipulating matches.
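A small sketch of what "manipulating analysis" means in practice: the same document run through two analysis chains yields different terms, and therefore different matches. The filter names (`lowercase`, `fold_accents`) and the `make_analyzer` helper are hypothetical and do not correspond to any particular product's configuration.

```python
import unicodedata

def lowercase(tokens):
    return [t.lower() for t in tokens]

def fold_accents(tokens):
    """Decompose each token and drop the combining marks (Ü -> U, ü -> u)."""
    folded = []
    for token in tokens:
        decomposed = unicodedata.normalize("NFKD", token)
        folded.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return folded

def make_analyzer(*filters):
    """Build an analyzer: whitespace tokenizer followed by the given filters."""
    def analyzer(text):
        tokens = text.split()
        for token_filter in filters:
            tokens = token_filter(tokens)
        return tokens
    return analyzer

plain = make_analyzer(lowercase)
folding = make_analyzer(lowercase, fold_accents)

doc = "Ich folge Überlin"
print(plain(doc))    # ['ich', 'folge', 'überlin']
print(folding(doc))  # ['ich', 'folge', 'uberlin']

# A user typing "uberlin" only finds the document under the folding chain.
print("uberlin" in plain(doc), "uberlin" in folding(doc))  # False True
```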
Slide 30
Slide 30 text
Can I manipulate analysis?
Slide 31
Slide 31 text
system analysis
MongoDB Only choose between language presets
PostgreSQL Analysis happens through normal PL/pgSQL functions
Elasticsearch Analyser configuration with a wide variety of choices
Slide 32
Slide 32 text
Ü
Slide 33
Slide 33 text
Does your system comfortably speak Unicode?
Slide 34
Slide 34 text
doc id field value
1 Test
2 test
3 Überlin
Slide 35
Slide 35 text
token doc ids
test 1,2
uberlin 3
Slide 36
Slide 36 text
MongoDB
Slide 37
Slide 37 text
search term no. matches
Test 2
test 2
Überlin 1
überlin 0
Slide 38
Slide 38 text
token doc ids
test 1,2
Überlin 3
Slide 39
Slide 39 text
input result
überlin überlin
Überlin Überlin
Slide 40
Slide 40 text
MongoDB fails at the simplest
case, lowercasing German
umlauts, in German settings.
Slide 41
Slide 41 text
The exact analysis behaviour is not
user-controllable, for simplicity's sake.
Slide 42
Slide 42 text
The suggestion is to preprocess yourself.
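One way to read "preprocess yourself" is to normalize and case-fold the text in application code, then store the result in a separate field next to the original and run the same function over every query. The sketch below works on plain Python dictionaries only; the field name `search_value` and the helper names are made up for the example, and no database driver is involved.

```python
import unicodedata

def preprocess(text):
    """Compose to NFC, then case-fold; Python's case mapping handles Ü -> ü."""
    composed = unicodedata.normalize("NFC", text)
    return unicodedata.normalize("NFC", composed.casefold())

# Keep the original value for display, add a preprocessed copy for matching.
records = [
    {"value": value, "search_value": preprocess(value)}
    for value in ["Test", "test", "Überlin"]
]

def count_matches(term):
    """Preprocess the query exactly like the stored values, then compare."""
    needle = preprocess(term)
    return sum(1 for record in records if record["search_value"] == needle)

for term in ["Test", "test", "Überlin", "überlin"]:
    print(term, count_matches(term))
# Test 2
# test 2
# Überlin 1
# überlin 1
```

Note that `casefold()` is more aggressive than lowercasing (it turns "ß" into "ss", for example), which is usually what is wanted for matching but not for display.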
Slide 44
Slide 44 text
Further down
the Unicode
Slide 45
Slide 45 text
How well does your system
handle "creative" codes?
Slide 46
Slide 46 text
"\u0055\u0308"
"\u0075\u0308"
Slide 47
Slide 47 text
"\u0055\u0308" #=> Ü
"\u0075\u0308" #=> ü
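The two strings above are a base letter followed by U+0308 (COMBINING DIAERESIS). They render like the precomposed Ü and ü but compare differently until they are normalized. A short Python check using only the standard `unicodedata` module:

```python
import unicodedata

decomposed = "\u0055\u0308"   # "U" followed by COMBINING DIAERESIS
precomposed = "\u00dc"        # "Ü" as a single code point

print(unicodedata.name("\u0308"))          # COMBINING DIAERESIS
print(len(decomposed), len(precomposed))   # 2 1
print(decomposed == precomposed)           # False

# NFC composes the pair into one code point, NFD splits it again.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```

Any system that lowercases, strips accents, or compares text code point by code point without normalizing first will treat the two spellings as different terms.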
Slide 48
Slide 48 text
PostgreSQL
Slide 49
Slide 49 text
postgres=# SELECT unaccent(U&'\0075\0308');
 unaccent
----------
ü
(1 row)
Slide 50
Slide 50 text
PostgreSQL handles UCS-2 level 1, not UTF.
Slide 51
Slide 51 text
No combining chars.
Slide 52
Slide 52 text
“we should really reject combining chars,
but can’t do that w/o breaking BC.”
Slide 53
Slide 53 text
sigh,
Software
Slide 54
Slide 54 text
If you use PostgreSQL and text manipulation,
you probably have a bug hiding in there.
Slide 55
Slide 55 text
UCS-2 for all textual data is a
doable constraint, though.
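A minimal sketch of enforcing that constraint in application code before text reaches the database. It assumes "UCS-2 level 1" is read as: only code points in the Basic Multilingual Plane, and no combining characters. The function name `to_ucs2_level1` is hypothetical.

```python
import unicodedata

def to_ucs2_level1(text):
    """Compose to NFC, then reject anything that still violates the constraint."""
    composed = unicodedata.normalize("NFC", text)
    for ch in composed:
        if ord(ch) > 0xFFFF:
            raise ValueError(f"outside the BMP: U+{ord(ch):04X}")
        if unicodedata.combining(ch):
            raise ValueError(f"combining character left after NFC: U+{ord(ch):04X}")
    return composed

print(to_ucs2_level1("\u0075\u0308") == "\u00fc")  # True: u + diaeresis composes to ü

# "q" + diaeresis has no precomposed form, so it is rejected.
try:
    to_ucs2_level1("q\u0308")
except ValueError as error:
    print(error)  # combining character left after NFC: U+0308
```

Composing first keeps the common European cases working. What gets rejected are the sequences with no single-code-point equivalent, which is exactly the input that would otherwise slip through unprocessed.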