Slide 1

Slide 1 text

jsonb Deep Dive San Francisco PostgreSQL user group Peter Geoghegan [email protected] Twitter: @sternocera 1 Wednesday, June 25, 14

Slide 2

Slide 2 text

About me • Work for Heroku • Like working on performance features • Also like working on things that people want • Only started working on jsonb a month or so prior to commit 2 Wednesday, June 25, 14

Slide 3

Slide 3 text

What is jsonb? • A PostgreSQL data type. Binary storage format (based on hstore) • Accepts all valid JSON on input, and only outputs valid JSON • Earlier json datatype is text based • Not BSON. Not in competition with it either - not a new interchange format. 3 Wednesday, June 25, 14

Slide 4

Slide 4 text

Oleg & Teodor 4 Wednesday, June 25, 14

Slide 5

Slide 5 text

Andrew Dunstan, Alexander Korotkov 5 Hat tip: Ines Sombra of Engine Yard for supporting Oleg, Teodor, and Alexander Wednesday, June 25, 14

Slide 6

Slide 6 text

-- Simple scalar/primitive values contain only the identical value: SELECT ’”foo”'::jsonb @> '"foo"'::jsonb; -- The array on the right side is contained within the one on the left: SELECT '[1, 2, 3]'::jsonb @> '[1, 3]'::jsonb; -- The object with a single pair on the right side is contained -- within the object on the left side: SELECT '{"product": "PostgreSQL", "version": 9.4, "jsonb":true}'::jsonb @> '{"version":9.4}'::jsonb; -- The array on the right side is not considered contained within the -- array on the left, even though a similar array is nested within it: SELECT '[1, 2, [1, 3]]'::jsonb @> '[1, 3]'::jsonb; -- yields false -- But with a layer of nesting, it is contained: SELECT '[1, 2, [1, 3]]'::jsonb @> '[[1, 3]]'::jsonb; -- Similarly, containment is not reported here: SELECT '{"foo": {"bar": "baz"}}'::jsonb @> '{"bar": "baz"}'::jsonb; -- yields false 6 At a glance Wednesday, June 25, 14

Slide 7

Slide 7 text

postgres=# SELECT * FROM test WHERE j @> '{"tags":[{"term":"PostgreSQL"}, {"term":"Database"}, {"term":"plpgsql"}, {"term":"Cheatsheet"}]}'; -[ RECORD 1 ]- j | {"id": "http://delicious.com/url/ b7044a690899e0702b998db23335d3f3#chris.mcmillan", "link": "http:// www.postgresonline.com/journal/index.php?/archives/87-PostgreSQL-8.3-PLPGSQL- Cheatsheet-Overview.html", "tags": [{"term": "Cheatsheet", "label": null, "scheme": "http://delicious.com/chris.mcmillan/"}, {"term": "Database", "label": null, "scheme": "http://delicious.com/chris.mcmillan/"}, {"term": "plpgsql", "label": null, "scheme": "http://delicious.com/chris.mcmillan/"}, {"term": "PostgreSQL", "label": null, "scheme": "http://delicious.com/chris.mcmillan/"}, {"term": "Reference", "label": null, "scheme": "http://delicious.com/chris.mcmillan/"}, {"term": "SQL", "label": null, "scheme": "http://delicious.com/chris.mcmillan/"}], "links": [{"rel": "alternate", "href": "http://www.postgresonline.com/journal/ index.php?/archives/87-PostgreSQL-8.3-PLPGSQL-Cheatsheet-Overview.html", "type": "text/html"}], "title": "PostgreSQL 8.3 PL/PGSQL Cheatsheet Overview - Postgres OnLine Journal", "author": "chris.mcmillan", "source": {}, "updated": "Wed, 09 Sep 2009 16:05:31 +0000", "comments": "http://delicious.com/url/ b7044a690899e0702b998db23335d3f3", "guidislink": false, "title_detail": {"base": "http://feeds.delicious.com/v2/rss/recent?min=1&count=100", "type": "text/plain", "value": "PostgreSQL 8.3 PL/PGSQL Cheatsheet Overview - Postgres OnLine Journal", "language": null}, "wfw_commentrss": "http://feeds.delicious.com/v2/rss/url/ b7044a690899e0702b998db23335d3f3"} 7 Containment (“reaching in” to a jsonb) Wednesday, June 25, 14

Slide 8

Slide 8 text

json and jsonb: Why have both? • Not just an accident of history that we have both • jsonb has “shadow” types • jsonb deduplicates pairs on input • jsonb removes whitespace • jsonb object key order simple, fixed. Cheap binary searches rely on this. • You might not like that, but the RFC says it’s okay 8 Wednesday, June 25, 14

Slide 9

Slide 9 text

-- “Number” JSON primitive type behaves like core numeric type, mostly: SELECT ‘5’::jsonb = ‘5’::jsonb -- So, has arbitrary precision: SELECT '[”a”, “b”, 77]'::jsonb != '[”a”, “b”, 77.00000000000000000000001]'::jsonb; -- Like numeric, remembers trailing zeroes without making them semantically significant: SELECT '66'::jsonb = '66.000'::jsonb; -- String primitive type behaves like Postgres text type, using default -- database collation: SELECT j FROM jdocs ORDER BY j->‘name’ LIMIT 10; 9 “Shadow” type system, maps core types Wednesday, June 25, 14

Slide 10

Slide 10 text

json and jsonb: Why have both? • Parallel set of facilities for both, with extra stuff for jsonb • jsonb much faster in typical cases. Has indexable operator classes (unlike json). Very effective GIN indexing is possible. • Realistically, vast majority of users won’t care about deduplication, etc. Just use jsonb when you can. 10 Wednesday, June 25, 14

Slide 11

Slide 11 text

postgres=# SELECT * FROM jsonb_each('{"a":"foo", "b":"bar"}'); key | value -----+------- a | foo b | bar postgres=# SELECT * FROM json_object_keys('{"f1":"abc","f2":{"f3":"a", "f4":"b"}}'); json_object_keys ------------------ f1 f2 11 jsonb manipulation Wednesday, June 25, 14

Slide 12

Slide 12 text

12 7aae96c 7aae96c Indexing Wednesday, June 25, 14

Slide 13

Slide 13 text

Operator classes • Operator classes describe contract between datatype and system for indexing • Default B-Tree and hash operator classes available. Those two are actually not all that interesting for indexing as such. • GIN operator classes are very interesting for indexing. 13 Wednesday, June 25, 14

Slide 14

Slide 14 text

GIN Operator Classes • jsonb_path_ops and default • Both use hashing, but jsonb_path_ops more so • Indexes are very small for jsonb_path_ops • jsonb_path_ops does not support existence operator, so a bit less flexible. 14 Wednesday, June 25, 14

Slide 15

Slide 15 text

15 @> jsonb operator (containment operator) Does the left JSON value contain within it the right value? '{"a":1, "b":2}'::jsonb @> '{"b":2}'::jsonb <@ jsonb operator (containment operator’s commutator) Is the left JSON value contained within the right value? '{"b":2}'::jsonb <@ '{"a":1, "b":2}'::jsonb ? text operator Does the key/element string exist within the JSON value? '{"a":1, "b":2}'::jsonb ? 'b' ?| text[] operator Do any of these key/element strings exist? '{"a":1, "b":2, "c":3}'::jsonb ?| array['b', 'c'] ?& text[] operator Do all of these key/element strings exist? '["a", "b"]'::jsonb ?& array['a', 'b'] Wednesday, June 25, 14

Slide 16

Slide 16 text

When to use each • jsonb_path_ops works best with nested containment. Sophisticated containment very quick. • Default GIN opclass better suited to old hstore use- case. Existence of key only at top level not otherwise all that interesting. • You can use expression indexes to test nested existence. • Beware of queries that don’t contain jsonb_path_ops GIN keys. 16 Wednesday, June 25, 14

Slide 17

Slide 17 text

What is GIN, anyway? • Extensible infrastructure. B-Tree of B- Trees. “Posting tree” used when there are many duplicates. • Classic use-case is Full-text search. Array indexing another good example. • Always contains “keys”, stored in a format decided by the opclass. (Classic B-Tree opclasses don’t allow even this) 17 Wednesday, June 25, 14

Slide 18

Slide 18 text

GIN and jsonb • jsonb_path_ops stores only 32-bit integers - hash values • Default opclass stores text • GIN is good at handling many duplicates • Generally number of GIN keys will vary between two opclasses for document, reflecting differing strategies 18 Wednesday, June 25, 14

Slide 19

Slide 19 text

What the GIN opclasses really do • jsonb_path_ops stores hash of nested primitive JSON values, with any keys encountered along the way mixed in • Default (text-based) opclass generally stores full text copy of each JSON key and value separately (as text GIN keys). Even number primitives stored as text. 19 Wednesday, June 25, 14

Slide 20

Slide 20 text

SELECT ’{"id": "http://delicious.com/url/b7044a690899e0702b998db23335d3f3#chris.mcmillan", "link": "http://www.postgresonline.com/journal/index.php?/archives/87-PostgreSQL-8.3-PLPGSQL-Cheatsheet- Overview.html", "tags": [{"term": "Cheatsheet", "label": null, "scheme": "http://delicious.com/chris.mcmillan/"}, {"term": "Database", "label": null, "scheme": "http://delicious.com/chris.mcmillan/"}, {"term": "plpgsql", "label": null, "scheme": "http://delicious.com/chris.mcmillan/"}, {"term": "PostgreSQL", "label": null, "scheme": "http://delicious.com/chris.mcmillan/"}, {"term": "Reference", "label": null, "scheme": "http:// delicious.com/chris.mcmillan/"}, {"term": "SQL", "label": null, "scheme": "http://delicious.com/ chris.mcmillan/"}], "links": [{"rel": "alternate", "href": "http://www.postgresonline.com/journal/index.php?/archives/87- PostgreSQL-8.3-PLPGSQL-Cheatsheet-Overview.html", "type": "text/html"}], "title": "PostgreSQL 8.3 PL/PGSQL Cheatsheet Overview - Postgres OnLine Journal", "author": "chris.mcmillan", "source": {}, "updated": "Wed, 09 Sep 2009 16:05:31 +0000", "comments": "http://delicious.com/url/b7044a690899e0702b998db23335d3f3", "guidislink": false, "title_detail": {"base": "http://feeds.delicious.com/v2/rss/recent?min=1&count=100", "type": "text/plain", "value": "PostgreSQL 8.3 PL/PGSQL Cheatsheet Overview - Postgres OnLine Journal", "language": null}, "wfw_commentrss": "http://feeds.delicious.com/v2/rss/url/b7044a690899e0702b998db23335d3f3"}’::jsonb 20 What is a GIN key (depends on opclass)? jsonb_path_ops GIN keys, jsonb_ops GIN keys Wednesday, June 25, 14

Slide 21

Slide 21 text

What GIN knows about jsonb • GIN only knows what opclass tells it • Fundamentally, GIN only knows “here are some keys, and here is where each points to in table”. So, no direct representation of nesting level, for example. • When querying, process of breaking up indexable operator’s rhs value (to storage format) occurs within opclass support functions 21 Wednesday, June 25, 14

Slide 22

Slide 22 text

Structure • Even GIN opclasses do not know about structure of jsonb • Only return rows that might be matches (and very probably are) • Bitmap index scans must recheck to ensure correctness - otherwise, false positives possible • This is not an inherent property of GIN 22 Wednesday, June 25, 14

Slide 23

Slide 23 text

Which opclass should I use? • In my opinion, “containment” is usually the compelling way of querying jsonb • Containment can “reach in” at multiple points • Existence is more for the classic hstore use-case (it only considers top-level JSON object keys) - heterogenous structure. So, I’d pick jsonb_path_ops almost every time for semi- structured data. 23 Wednesday, June 25, 14

Slide 24

Slide 24 text

Why jsonb_path_ops is fast • Very selective - only nested values stored as GIN keys. Also, less to store initially, and less to physically query later. • Cheap int comparisons for B-Tree index scans. • Since only hashes stored, length irrelevant. • Edge-case: Be sue you have some object value (or array element) in “jsonb @>” predicate! 24 Wednesday, June 25, 14

Slide 25

Slide 25 text

GIN is better generally in 9.4 • Lots of work on GIN in Postgres 9.4. • More or less incidental to jsonb, but will help a lot • Much smaller indexes due to new posting list compression • Multi-key skipping optimization • “Rare and frequent” optimization likely particularly beneficial with default GIN opclass 25 Wednesday, June 25, 14

Slide 26

Slide 26 text

Delicious sample data benchmarks • Bookmark data, publicly available • Table size when stored as jsonb: ~1.3Gb • jsonb_ops (default) - 636 Mb (with no compression, 815Mb) • jsonb_path_ops - 295 Mb • jsonb_path_ops (tags) - 44 Mb USING gin((jb->'tags')) with jsonb_path_ops • jsonb_path_ops (tags.term) - 1.6 Mb 26 Wednesday, June 25, 14

Slide 27

Slide 27 text

Storage overhead versus json, text types 27 =# \dt+ List of relations Schema | Name | Type | Owner | Size | Description --------+------+-------+----------+---------+------------- public | jb | table | postgres | 1374 MB | overhead is < 4% public | js | table | postgres | 1322 MB | public | tx | table | postgres | 1322 MB | Wednesday, June 25, 14

Slide 28

Slide 28 text

“Delicious” sequential scan performance 28 EXPLAIN ANALYZE SELECT count(*) FROM jb WHERE jb @> '{"tags":[{"term":"NYC"}]}'::jsonb; QUERY PLAN --------------------------------------------------------------------------------------- Aggregate (cost=191521.30..191521.31 rows=1 width=0) (actual time=1263.201..1263.201 rows=1 loops=1) -> Seq Scan on jb (cost=0.00..191518.16 rows=1253 width=0) (actual time=0.007..1263.065 rows=285 loops=1) Filter: (jb @> '{"tags": [{"term": "NYC"}]}'::jsonb) Rows Removed by Filter: 1252688 Planning time: 0.065 ms Execution runtime: 1263.225 ms (6 rows) Wednesday, June 25, 14

Slide 29

Slide 29 text

Default opclass performance 29 CREATE INDEX gin_jb_idx ON jb USING gin(jb); EXPLAIN ANALYZE SELECT count(*) FROM jb WHERE jb @> '{"tags":[{"term":"NYC"}]}'::jsonb; QUERY PLAN --------------------------------------------------------------------------------------- Aggregate (cost=4772.72..4772.73 rows=1 width=0) (actual time=8.486..8.486 rows=1 loops=1) -> Bitmap Heap Scan on jb (cost=73.71..4769.59 rows=1253 width=0) (actual time=8.049..8 rows=285 loops=1) Recheck Cond: (jb @> '{"tags": [{"term": "NYC"}]}'::jsonb) Heap Blocks: exact=285 -> Bitmap Index Scan on gin_jb_idx (cost=0.00..73.40 rows=1253 width=0) (actual time=8.014..8.014 rows=285 loops=1) Index Cond: (jb @> '{"tags": [{"term": "NYC"}]}'::jsonb) Planning time: 0.115 ms Execution runtime: 8.515 ms (8 rows) Wednesday, June 25, 14

Slide 30

Slide 30 text

jsonb_path_ops performance 30 CREATE INDEX gin_jb_path_idx ON jb USING gin(jb jsonb_path_ops); EXPLAIN ANALYZE SELECT count(*) FROM jb WHERE jb @> '{"tags":[{"term":"NYC"}]}'::jsonb; QUERY PLAN --------------------------------------------------------------------------------------- Aggregate (cost=4732.72..4732.73 rows=1 width=0) (actual time=0.644..0.644 rows=1 loops=1) -> Bitmap Heap Scan on jb (cost=33.71..4729.59 rows=1253 width=0) (actual time=0.102..0.620 rows=285 loops=1) Recheck Cond: (jb @> '{"tags": [{"term": "NYC"}]}'::jsonb) Heap Blocks: exact=285 -> Bitmap Index Scan on gin_jb_path_idx (cost=0.00..33.40 rows=1253 width=0) (actual time=0.062..0.062 rows=285 loops=1) Index Cond: (jb @> '{"tags": [{"term": "NYC"}]}'::jsonb) Planning time: 0.056 ms Execution runtime: 0.668 ms (8 rows) Wednesday, June 25, 14

Slide 31

Slide 31 text

What I think of jsonb • Makes Postgres quite competitive with document DBs. jsonb development mostly about what semantics make sense. • However, doesn’t represent change of direction. Postgres is object relational database. • http://pgeoghegan.blogspot.com/2014/03/what-i-think-of-jsonb.html • hstore went in in 2006. Have native MAC address type, polygon type, XML type, etc. PostGIS is more or less just a set of datatypes. • Very rich support for custom datatypes (including things like custom selectivity estimators) a major goal of Postgres all along 31 Wednesday, June 25, 14

Slide 32

Slide 32 text

Futures • “vodka” AM - like GIN, but main tree is Suffix Tree • “jsquery” query language • Statistics • Pretty printing jsonb • Concatenate operator. Richer manipulation of jsonb. • Delta-based row versioning? Some talk of this, but not obviously all that valuable to common cases. Document DBs generally have atomic documents baked into their data model, so there is in general a large incentive to just not have massive documents - lock contention. 32 Wednesday, June 25, 14

Slide 33

Slide 33 text

Thanks! Questions? 33 Wednesday, June 25, 14