jsonb Deep Dive

jsonb Deep Dive

Peter Geoghegan, one of the major developers of the new "JSONB" binary, indexable JSON type for PostgreSQL 9.4, will be in town and will guide SFPUG members in a "deep dive" into the new technology, including:

• Both the new JSONB type and the old JSON type input and output JSON, so what's the difference?
• What new features does it offer?
• How is the new data type structured, and how does it work?
• How do you index JSONB?
• What things remain unimplemented?

Before the main event, we will have a Lightning Talk by Eric Ongerth: Running PostgreSQL in a Docker Container
Food and Drink, as well as Peter's travel, are sponsored by Heroku. It is being hosted by SwitchFly.

F9a2dba12b94d0c204d846a29da56bf5?s=128

Peter Geoghegan

June 25, 2014
Tweet

Transcript

  1. 1.

    jsonb Deep Dive San Francisco PostgreSQL user group Peter Geoghegan

    pg@heroku.com Twitter: @sternocera 1 Wednesday, June 25, 14
  2. 2.

    About me • Work for Heroku • Like working on

    performance features • Also like working on things that people want • Only started working on jsonb a month or so prior to commit 2 Wednesday, June 25, 14
  3. 3.

    What is jsonb? • A PostgreSQL data type. Binary storage

    format (based on hstore) • Accepts all valid JSON on input, and only outputs valid JSON • Earlier json datatype is text based • Not BSON. Not in competition with it either - not a new interchange format. 3 Wednesday, June 25, 14
  4. 5.

    Andrew Dunstan, Alexander Korotkov 5 Hat tip: Ines Sombra of

    Engine Yard for supporting Oleg, Teodor, and Alexander Wednesday, June 25, 14
  5. 6.

    -- Simple scalar/primitive values contain only the identical value: SELECT

    ’”foo”'::jsonb @> '"foo"'::jsonb; -- The array on the right side is contained within the one on the left: SELECT '[1, 2, 3]'::jsonb @> '[1, 3]'::jsonb; -- The object with a single pair on the right side is contained -- within the object on the left side: SELECT '{"product": "PostgreSQL", "version": 9.4, "jsonb":true}'::jsonb @> '{"version":9.4}'::jsonb; -- The array on the right side is not considered contained within the -- array on the left, even though a similar array is nested within it: SELECT '[1, 2, [1, 3]]'::jsonb @> '[1, 3]'::jsonb; -- yields false -- But with a layer of nesting, it is contained: SELECT '[1, 2, [1, 3]]'::jsonb @> '[[1, 3]]'::jsonb; -- Similarly, containment is not reported here: SELECT '{"foo": {"bar": "baz"}}'::jsonb @> '{"bar": "baz"}'::jsonb; -- yields false 6 At a glance Wednesday, June 25, 14
  6. 7.

    postgres=# SELECT * FROM test WHERE j @> '{"tags":[{"term":"PostgreSQL"}, {"term":"Database"},

    {"term":"plpgsql"}, {"term":"Cheatsheet"}]}'; -[ RECORD 1 ]- j | {"id": "http://delicious.com/url/ b7044a690899e0702b998db23335d3f3#chris.mcmillan", "link": "http:// www.postgresonline.com/journal/index.php?/archives/87-PostgreSQL-8.3-PLPGSQL- Cheatsheet-Overview.html", "tags": [{"term": "Cheatsheet", "label": null, "scheme": "http://delicious.com/chris.mcmillan/"}, {"term": "Database", "label": null, "scheme": "http://delicious.com/chris.mcmillan/"}, {"term": "plpgsql", "label": null, "scheme": "http://delicious.com/chris.mcmillan/"}, {"term": "PostgreSQL", "label": null, "scheme": "http://delicious.com/chris.mcmillan/"}, {"term": "Reference", "label": null, "scheme": "http://delicious.com/chris.mcmillan/"}, {"term": "SQL", "label": null, "scheme": "http://delicious.com/chris.mcmillan/"}], "links": [{"rel": "alternate", "href": "http://www.postgresonline.com/journal/ index.php?/archives/87-PostgreSQL-8.3-PLPGSQL-Cheatsheet-Overview.html", "type": "text/html"}], "title": "PostgreSQL 8.3 PL/PGSQL Cheatsheet Overview - Postgres OnLine Journal", "author": "chris.mcmillan", "source": {}, "updated": "Wed, 09 Sep 2009 16:05:31 +0000", "comments": "http://delicious.com/url/ b7044a690899e0702b998db23335d3f3", "guidislink": false, "title_detail": {"base": "http://feeds.delicious.com/v2/rss/recent?min=1&count=100", "type": "text/plain", "value": "PostgreSQL 8.3 PL/PGSQL Cheatsheet Overview - Postgres OnLine Journal", "language": null}, "wfw_commentrss": "http://feeds.delicious.com/v2/rss/url/ b7044a690899e0702b998db23335d3f3"} 7 Containment (“reaching in” to a jsonb) Wednesday, June 25, 14
  7. 8.

    json and jsonb: Why have both? • Not just an

    accident of history that we have both • jsonb has “shadow” types • jsonb deduplicates pairs on input • jsonb removes whitespace • jsonb object key order simple, fixed. Cheap binary searches rely on this. • You might not like that, but the RFC says it’s okay 8 Wednesday, June 25, 14
  8. 9.

    -- “Number” JSON primitive type behaves like core numeric type,

    mostly: SELECT ‘5’::jsonb = ‘5’::jsonb -- So, has arbitrary precision: SELECT '[”a”, “b”, 77]'::jsonb != '[”a”, “b”, 77.00000000000000000000001]'::jsonb; -- Like numeric, remembers trailing zeroes without making them semantically significant: SELECT '66'::jsonb = '66.000'::jsonb; -- String primitive type behaves like Postgres text type, using default -- database collation: SELECT j FROM jdocs ORDER BY j->‘name’ LIMIT 10; 9 “Shadow” type system, maps core types Wednesday, June 25, 14
  9. 10.

    json and jsonb: Why have both? • Parallel set of

    facilities for both, with extra stuff for jsonb • jsonb much faster in typical cases. Has indexable operator classes (unlike json). Very effective GIN indexing is possible. • Realistically, vast majority of users won’t care about deduplication, etc. Just use jsonb when you can. 10 Wednesday, June 25, 14
  10. 11.

    postgres=# SELECT * FROM jsonb_each('{"a":"foo", "b":"bar"}'); key | value -----+-------

    a | foo b | bar postgres=# SELECT * FROM json_object_keys('{"f1":"abc","f2":{"f3":"a", "f4":"b"}}'); json_object_keys ------------------ f1 f2 11 jsonb manipulation Wednesday, June 25, 14
  11. 13.

    Operator classes • Operator classes describe contract between datatype and

    system for indexing • Default B-Tree and hash operator classes available. Those two are actually not all that interesting for indexing as such. • GIN operator classes are very interesting for indexing. 13 Wednesday, June 25, 14
  12. 14.

    GIN Operator Classes • jsonb_path_ops and default • Both use

    hashing, but jsonb_path_ops more so • Indexes are very small for jsonb_path_ops • jsonb_path_ops does not support existence operator, so a bit less flexible. 14 Wednesday, June 25, 14
  13. 15.

    15 @> jsonb operator (containment operator) Does the left JSON

    value contain within it the right value? '{"a":1, "b":2}'::jsonb @> '{"b":2}'::jsonb <@ jsonb operator (containment operator’s commutator) Is the left JSON value contained within the right value? '{"b":2}'::jsonb <@ '{"a":1, "b":2}'::jsonb ? text operator Does the key/element string exist within the JSON value? '{"a":1, "b":2}'::jsonb ? 'b' ?| text[] operator Do any of these key/element strings exist? '{"a":1, "b":2, "c":3}'::jsonb ?| array['b', 'c'] ?& text[] operator Do all of these key/element strings exist? '["a", "b"]'::jsonb ?& array['a', 'b'] Wednesday, June 25, 14
  14. 16.

    When to use each • jsonb_path_ops works best with nested

    containment. Sophisticated containment very quick. • Default GIN opclass better suited to old hstore use- case. Existence of key only at top level not otherwise all that interesting. • You can use expression indexes to test nested existence. • Beware of queries that don’t contain jsonb_path_ops GIN keys. 16 Wednesday, June 25, 14
  15. 17.

    What is GIN, anyway? • Extensible infrastructure. B-Tree of B-

    Trees. “Posting tree” used when there are many duplicates. • Classic use-case is Full-text search. Array indexing another good example. • Always contains “keys”, stored in a format decided by the opclass. (Classic B-Tree opclasses don’t allow even this) 17 Wednesday, June 25, 14
  16. 18.

    GIN and jsonb • jsonb_path_ops stores only 32-bit integers -

    hash values • Default opclass stores text • GIN is good at handling many duplicates • Generally number of GIN keys will vary between two opclasses for document, reflecting differing strategies 18 Wednesday, June 25, 14
  17. 19.

    What the GIN opclasses really do • jsonb_path_ops stores hash

    of nested primitive JSON values, with any keys encountered along the way mixed in • Default (text-based) opclass generally stores full text copy of each JSON key and value separately (as text GIN keys). Even number primitives stored as text. 19 Wednesday, June 25, 14
  18. 20.

    SELECT ’{"id": "http://delicious.com/url/b7044a690899e0702b998db23335d3f3#chris.mcmillan", "link": "http://www.postgresonline.com/journal/index.php?/archives/87-PostgreSQL-8.3-PLPGSQL-Cheatsheet- Overview.html", "tags": [{"term": "Cheatsheet", "label":

    null, "scheme": "http://delicious.com/chris.mcmillan/"}, {"term": "Database", "label": null, "scheme": "http://delicious.com/chris.mcmillan/"}, {"term": "plpgsql", "label": null, "scheme": "http://delicious.com/chris.mcmillan/"}, {"term": "PostgreSQL", "label": null, "scheme": "http://delicious.com/chris.mcmillan/"}, {"term": "Reference", "label": null, "scheme": "http:// delicious.com/chris.mcmillan/"}, {"term": "SQL", "label": null, "scheme": "http://delicious.com/ chris.mcmillan/"}], "links": [{"rel": "alternate", "href": "http://www.postgresonline.com/journal/index.php?/archives/87- PostgreSQL-8.3-PLPGSQL-Cheatsheet-Overview.html", "type": "text/html"}], "title": "PostgreSQL 8.3 PL/PGSQL Cheatsheet Overview - Postgres OnLine Journal", "author": "chris.mcmillan", "source": {}, "updated": "Wed, 09 Sep 2009 16:05:31 +0000", "comments": "http://delicious.com/url/b7044a690899e0702b998db23335d3f3", "guidislink": false, "title_detail": {"base": "http://feeds.delicious.com/v2/rss/recent?min=1&count=100", "type": "text/plain", "value": "PostgreSQL 8.3 PL/PGSQL Cheatsheet Overview - Postgres OnLine Journal", "language": null}, "wfw_commentrss": "http://feeds.delicious.com/v2/rss/url/b7044a690899e0702b998db23335d3f3"}’::jsonb 20 What is a GIN key (depends on opclass)? jsonb_path_ops GIN keys, jsonb_ops GIN keys Wednesday, June 25, 14
  19. 21.

    What GIN knows about jsonb • GIN only knows what

    opclass tells it • Fundamentally, GIN only knows “here are some keys, and here is where each points to in table”. So, no direct representation of nesting level, for example. • When querying, process of breaking up indexable operator’s rhs value (to storage format) occurs within opclass support functions 21 Wednesday, June 25, 14
  20. 22.

    Structure • Even GIN opclasses do not know about structure

    of jsonb • Only return rows that might be matches (and very probably are) • Bitmap index scans must recheck to ensure correctness - otherwise, false positives possible • This is not an inherent property of GIN 22 Wednesday, June 25, 14
  21. 23.

    Which opclass should I use? • In my opinion, “containment”

    is usually the compelling way of querying jsonb • Containment can “reach in” at multiple points • Existence is more for the classic hstore use-case (it only considers top-level JSON object keys) - heterogenous structure. So, I’d pick jsonb_path_ops almost every time for semi- structured data. 23 Wednesday, June 25, 14
  22. 24.

    Why jsonb_path_ops is fast • Very selective - only nested

    values stored as GIN keys. Also, less to store initially, and less to physically query later. • Cheap int comparisons for B-Tree index scans. • Since only hashes stored, length irrelevant. • Edge-case: Be sue you have some object value (or array element) in “jsonb @>” predicate! 24 Wednesday, June 25, 14
  23. 25.

    GIN is better generally in 9.4 • Lots of work

    on GIN in Postgres 9.4. • More or less incidental to jsonb, but will help a lot • Much smaller indexes due to new posting list compression • Multi-key skipping optimization • “Rare and frequent” optimization likely particularly beneficial with default GIN opclass 25 Wednesday, June 25, 14
  24. 26.

    Delicious sample data benchmarks • Bookmark data, publicly available •

    Table size when stored as jsonb: ~1.3Gb • jsonb_ops (default) - 636 Mb (with no compression, 815Mb) • jsonb_path_ops - 295 Mb • jsonb_path_ops (tags) - 44 Mb USING gin((jb->'tags')) with jsonb_path_ops • jsonb_path_ops (tags.term) - 1.6 Mb 26 Wednesday, June 25, 14
  25. 27.

    Storage overhead versus json, text types 27 =# \dt+ List

    of relations Schema | Name | Type | Owner | Size | Description --------+------+-------+----------+---------+------------- public | jb | table | postgres | 1374 MB | overhead is < 4% public | js | table | postgres | 1322 MB | public | tx | table | postgres | 1322 MB | Wednesday, June 25, 14
  26. 28.

    “Delicious” sequential scan performance 28 EXPLAIN ANALYZE SELECT count(*) FROM

    jb WHERE jb @> '{"tags":[{"term":"NYC"}]}'::jsonb; QUERY PLAN --------------------------------------------------------------------------------------- Aggregate (cost=191521.30..191521.31 rows=1 width=0) (actual time=1263.201..1263.201 rows=1 loops=1) -> Seq Scan on jb (cost=0.00..191518.16 rows=1253 width=0) (actual time=0.007..1263.065 rows=285 loops=1) Filter: (jb @> '{"tags": [{"term": "NYC"}]}'::jsonb) Rows Removed by Filter: 1252688 Planning time: 0.065 ms Execution runtime: 1263.225 ms (6 rows) Wednesday, June 25, 14
  27. 29.

    Default opclass performance 29 CREATE INDEX gin_jb_idx ON jb USING

    gin(jb); EXPLAIN ANALYZE SELECT count(*) FROM jb WHERE jb @> '{"tags":[{"term":"NYC"}]}'::jsonb; QUERY PLAN --------------------------------------------------------------------------------------- Aggregate (cost=4772.72..4772.73 rows=1 width=0) (actual time=8.486..8.486 rows=1 loops=1) -> Bitmap Heap Scan on jb (cost=73.71..4769.59 rows=1253 width=0) (actual time=8.049..8 rows=285 loops=1) Recheck Cond: (jb @> '{"tags": [{"term": "NYC"}]}'::jsonb) Heap Blocks: exact=285 -> Bitmap Index Scan on gin_jb_idx (cost=0.00..73.40 rows=1253 width=0) (actual time=8.014..8.014 rows=285 loops=1) Index Cond: (jb @> '{"tags": [{"term": "NYC"}]}'::jsonb) Planning time: 0.115 ms Execution runtime: 8.515 ms (8 rows) Wednesday, June 25, 14
  28. 30.

    jsonb_path_ops performance 30 CREATE INDEX gin_jb_path_idx ON jb USING gin(jb

    jsonb_path_ops); EXPLAIN ANALYZE SELECT count(*) FROM jb WHERE jb @> '{"tags":[{"term":"NYC"}]}'::jsonb; QUERY PLAN --------------------------------------------------------------------------------------- Aggregate (cost=4732.72..4732.73 rows=1 width=0) (actual time=0.644..0.644 rows=1 loops=1) -> Bitmap Heap Scan on jb (cost=33.71..4729.59 rows=1253 width=0) (actual time=0.102..0.620 rows=285 loops=1) Recheck Cond: (jb @> '{"tags": [{"term": "NYC"}]}'::jsonb) Heap Blocks: exact=285 -> Bitmap Index Scan on gin_jb_path_idx (cost=0.00..33.40 rows=1253 width=0) (actual time=0.062..0.062 rows=285 loops=1) Index Cond: (jb @> '{"tags": [{"term": "NYC"}]}'::jsonb) Planning time: 0.056 ms Execution runtime: 0.668 ms (8 rows) Wednesday, June 25, 14
  29. 31.

    What I think of jsonb • Makes Postgres quite competitive

    with document DBs. jsonb development mostly about what semantics make sense. • However, doesn’t represent change of direction. Postgres is object relational database. • http://pgeoghegan.blogspot.com/2014/03/what-i-think-of-jsonb.html • hstore went in in 2006. Have native MAC address type, polygon type, XML type, etc. PostGIS is more or less just a set of datatypes. • Very rich support for custom datatypes (including things like custom selectivity estimators) a major goal of Postgres all along 31 Wednesday, June 25, 14
  30. 32.

    Futures • “vodka” AM - like GIN, but main tree

    is Suffix Tree • “jsquery” query language • Statistics • Pretty printing jsonb • Concatenate operator. Richer manipulation of jsonb. • Delta-based row versioning? Some talk of this, but not obviously all that valuable to common cases. Document DBs generally have atomic documents baked into their data model, so there is in general a large incentive to just not have massive documents - lock contention. 32 Wednesday, June 25, 14