Slide 1

Slide 1 text

schema liberation with json and plv8 selena deckelmann @selenamarie [email protected]

Slide 2

Slide 2 text

Schema design and JSON 3 before plv8 & JSON datatype:

Slide 3

Slide 3 text

What do I mean by schema liberation?

Slide 4

Slide 4 text

Data in MongoDB has a flexible schema. ... Schema flexibility means that you can model your documents in MongoDB so that they can closely resemble and reflect application-level objects. -The MongoDB Documentation

Slide 5

Slide 5 text

Data in MongoDB has a flexible schema. ... Schema flexibility means that you can model your documents in MongoDB so that they can closely resemble and reflect application-level objects. -The MongoDB Documentation

Slide 6

Slide 6 text

Example: Bird sightings Example: Bird sightings

Slide 7

Slide 7 text

create table birds (id serial, name text); create table locations (id serial, state text, city text); create table sightings (id serial, seen_on date DEFAULT now()::date, bird_id int, location_id int); insert into table birds (name) values(‘yellow throated warbler’); insert into table locations (state, city) values(‘FL’, ‘Amelia Island’); insert into sightings (bird_id, location_id) ( with b_id as (select id from birds where name = ‘painted bunting’), with l_id as (select id from location where state = ‘FL’ and city = ‘Amelia Island’) select b_id.id, l_id.id from b_id, l_id );

Slide 8

Slide 8 text

vs. create table birds (id serial, name text); create table locations (id serial, state text, city text); create table sightings (id serial, seen_on date DEFAULT now()::date, bird_id int, location_id int); insert into table birds (name) values(‘yellow throated warbler’); insert into table locations (state, city) values(‘FL’, ‘Amelia Island’); insert into sightings (bird_id, location_id) ( with b_id as (select id from birds where name = ‘painted bunting’), with l_id as (select id from location where state = ‘FL’ and city = ‘Amelia Island’) select b_id.id, l_id.id from b_id, l_id );

Slide 9

Slide 9 text

{"sightings": [{"name": "yellow throated warbler", "where": "Amelia Island, FL", "when": "2013-05-28"}] } vs. create table birds (id serial, name text); create table locations (id serial, state text, city text); create table sightings (id serial, seen_on date DEFAULT now()::date, bird_id int, location_id int); insert into table birds (name) values(‘yellow throated warbler’); insert into table locations (state, city) values(‘FL’, ‘Amelia Island’); insert into sightings (bird_id, location_id) ( with b_id as (select id from birds where name = ‘painted bunting’), with l_id as (select id from location where state = ‘FL’ and city = ‘Amelia Island’) select b_id.id, l_id.id from b_id, l_id );

Slide 10

Slide 10 text

Can Postgres be flexible?

Slide 11

Slide 11 text

ingredients 9.3 Postgres* plv8 * 9.2 Postgres + json_enhancements works too

Slide 12

Slide 12 text

v8: JS engine plv8:JS engine embedded in Postgres https://code.google.com/p/plv8js/wiki/PLV8

Slide 13

Slide 13 text

reports

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

postgres# \d reports Table "public.reports" Column | Type | Modifiers ---------------------+--------------------------+------------------------------------------------------ id | integer | not null default nextval('reports_id_seq'::regclass) client_crash_date | timestamp with time zone | date_processed | timestamp with time zone | uuid | character varying(50) | not null product | character varying(30) | version | character varying(16) | build | character varying(30) | signature | character varying(255) | install_age | integer | last_crash | integer | uptime | integer | cpu_name | character varying(100) | cpu_info | character varying(100) | reason | character varying(255) | address | character varying(20) | os_name | character varying(100) | os_version | character varying(100) | user_id | character varying(50) | started_datetime | timestamp with time zone | completed_datetime | timestamp with time zone | success | boolean | truncated | boolean | processor_notes | text | user_comments | character varying(1024) | app_notes | character varying(1024) | distributor | character varying(20) | distributor_version | character varying(20) | topmost_filenames | text | addons_checked | boolean | flash_version | text | hangid | text | process_type | text | release_channel | text | productid | text | exploitability | text | flash_process_dump | text | url | text | email | text | (38 columns)

Slide 16

Slide 16 text

postgres# \d reports Table "public.reports" Column | Type | Modifiers ---------------------+--------------------------+------------------------------------------------------ id | integer | not null default nextval('reports_id_seq'::regclass) client_crash_date | timestamp with time zone | date_processed | timestamp with time zone | uuid | character varying(50) | not null product | character varying(30) | version | character varying(16) | build | character varying(30) | signature | character varying(255) | install_age | integer | last_crash | integer | uptime | integer | cpu_name | character varying(100) | cpu_info | character varying(100) | reason | character varying(255) | address | character varying(20) | os_name | character varying(100) | os_version | character varying(100) | user_id | character varying(50) | started_datetime | timestamp with time zone | completed_datetime | timestamp with time zone | success | boolean | truncated | boolean | processor_notes | text | user_comments | character varying(1024) | app_notes | character varying(1024) | distributor | character varying(20) | distributor_version | character varying(20) | topmost_filenames | text | addons_checked | boolean | flash_version | text | hangid | text | process_type | text | release_channel | text | productid | text | exploitability | text | flash_process_dump | text | url | text | email | text | (38 columns) Roseate spoonbill says: “Yuck!”

Slide 17

Slide 17 text

Let’s convert to JSON!

Slide 18

Slide 18 text

select * from reports;

Slide 19

Slide 19 text

SELECT row_to_json(reports) FROM reports;

Slide 20

Slide 20 text

CREATE TABLE liberated AS ( );

Slide 21

Slide 21 text

CREATE TABLE liberated AS ( SELECT row_to_json(reports) FROM reports );

Slide 22

Slide 22 text

postgres# \d liberated Table "public.liberated" Column | Type | Modifiers -------------+---------+----------- row_to_json | json |

Slide 23

Slide 23 text

postgres# create table liberated as (select row_to_json(row) from (select * from reports_20130506) as row); SELECT 2695824 Time: 110470.716 ms

Slide 24

Slide 24 text

postgres# create table liberated as (select row_to_json(row) from (select * from reports_20130506) as row); SELECT 2695824 Time: 110470.716 ms Took about 2 minutes on an older dev system.

Slide 25

Slide 25 text

postgres# select pg_size_pretty(pg_table_size('repo rts_20130506')) as “original size”; original size --------------- 2120 MB postgres# select pg_size_pretty(pg_table_size('libe rated')) as “liberated size”; liberated size ---------------- 4076 MB

Slide 26

Slide 26 text

Liberate the database!

Slide 27

Slide 27 text

CREATE SCHEMA liberated; CREATE OR REPLACE FUNCTION public.liberate() RETURNS boolean LANGUAGE plv8 IMMUTABLE AS $function$ var tables = plv8.execute( "SELECT relname FROM pg_catalog.pg_class c" + " LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace" + " WHERE relkind = 'r' AND n.nspname = 'public'"); for (var i = 0; i < tables.length; i++) { plv8.elog(NOTICE, tables[i]['relname']); var code = "CREATE TABLE liberated." + tables[i]['relname'] + " AS ( SELECT row_to_json(" + tables[i]['relname'] + ") AS liberated_json FROM " + tables[i]['relname'] + ")"; plv8.execute(code); } return 1; $function$; https://gist.github.com/selenamarie/5646494

Slide 28

Slide 28 text

CREATE SCHEMA liberated; CREATE OR REPLACE FUNCTION public.liberate() RETURNS boolean LANGUAGE plv8 IMMUTABLE AS $function$ var tables = plv8.execute( "SELECT relname FROM pg_catalog.pg_class c" + " LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace" + " WHERE relkind = 'r' AND n.nspname = 'public'"); for (var i = 0; i < tables.length; i++) { plv8.elog(NOTICE, tables[i]['relname']); var code = "CREATE TABLE liberated." + tables[i]['relname'] + " AS ( SELECT row_to_json(" + tables[i]['relname'] + ") AS liberated_json FROM " + tables[i]['relname'] + ")"; plv8.execute(code); } return 1; $function$; https://gist.github.com/selenamarie/5646494

Slide 29

Slide 29 text

CREATE SCHEMA liberated; -- A namespace for our new tables! -- a cluster contains databases -- a database contains schemas -- a schema contains tables

Slide 30

Slide 30 text

CREATE OR REPLACE FUNCTION liberate() RETURNS boolean -- Say what we’re returning LANGUAGE plv8 -- Say we’ll be using plv8 AS $function$ -- Function goes here! $function$;

Slide 31

Slide 31 text

-- Find all the names of -- all postgres tables! SELECT relname AS table_name FROM pg_catalog.pg_class c LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace WHERE relkind = 'r' -- meaning a table! and n.nspname = 'public';

Slide 32

Slide 32 text

CREATE TABLE liberated AS ( SELECT row_to_json(reports) FROM reports );

Slide 33

Slide 33 text

CREATE SCHEMA liberated; CREATE OR REPLACE FUNCTION public.liberate() RETURNS boolean LANGUAGE plv8 IMMUTABLE AS $function$ var tables = plv8.execute( "SELECT relname FROM pg_catalog.pg_class c" + " LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace" + " WHERE relkind = 'r' AND n.nspname = 'public'"); for (var i = 0; i < tables.length; i++) { plv8.elog(NOTICE, tables[i]['relname']); var code = "CREATE TABLE liberated." + tables[i]['relname'] + " AS ( SELECT row_to_json(" + tables[i]['relname'] + ") AS liberated_json FROM " + tables[i]['relname'] + ")"; plv8.execute(code); } return 1; $function$; https://gist.github.com/selenamarie/5646494

Slide 34

Slide 34 text

https://gist.github.com/selenamarie/5646494

Slide 35

Slide 35 text

First class datatype create table birds (sighting json);

Slide 36

Slide 36 text

postgres# create table birds (sighting json); CREATE TABLE postgres# insert into birds VALUES( '{"name": "Great Egret", "where": "Amelia Island, FL", "when":"2013-05-28"}' ); INSERT 0 1 postgres# select * from birds; sighting -------------------------------------------------------------------- {"name": "Great Egret", "where": "Amelia Island, FL", "when": "2013-05-28"} (1 row)

Slide 37

Slide 37 text

postgres# select json_object_field_text(sighting, 'name') from birds; json_object_field_text ------------------------ Great Egret postgres# select json_object_field(sighting, 'name') from birds; json_object_field ------------------- "Great Egret" postgres# select json_object_keys(sighting) from birds; json_object_keys ------------------ name where when And we can create indexes, compare, join against and use JSON columns like any other Postgres column.

Slide 38

Slide 38 text

plv8 is trusted. can be run by users.

Slide 39

Slide 39 text

plv8 automatically parses JSON input.

Slide 40

Slide 40 text

create or replace function json_value( j json, key text) returns text AS $$ if (typeof j != 'object') return NULL; return JSON.stringify(j[key]); $$ language plv8;

Slide 41

Slide 41 text

plv8 supports running raw SQL, prepared statements and cursors: rows = plv8.execute(...) plan = plv8.prepare(...) rows = plan.execute(...) plan = plv8.prepare(...) cursor = plan.cursor(...) while (row = cursor.fetch()) {}

Slide 42

Slide 42 text

create or replace function inject_js(src text) returns text as $$ return eval( "(function() {" + src + "})")(); $$ LANGUAGE plv8;

Slide 43

Slide 43 text

Ok that’s cute and all.

Slide 44

Slide 44 text

What about in production?

Slide 45

Slide 45 text

No content

Slide 46

Slide 46 text

No content

Slide 47

Slide 47 text

postgres# \d reports Table "public.reports" Column | Type | Modifiers ---------------------+--------------------------+------------------------------------------------------ id | integer | not null default nextval('reports_id_seq'::regclass) client_crash_date | timestamp with time zone | date_processed | timestamp with time zone | uuid | character varying(50) | not null product | character varying(30) | version | character varying(16) | build | character varying(30) | signature | character varying(255) | install_age | integer | last_crash | integer | uptime | integer | cpu_name | character varying(100) | cpu_info | character varying(100) | reason | character varying(255) | address | character varying(20) | os_name | character varying(100) | os_version | character varying(100) | user_id | character varying(50) | started_datetime | timestamp with time zone | completed_datetime | timestamp with time zone | success | boolean | truncated | boolean | processor_notes | text | user_comments | character varying(1024) | app_notes | character varying(1024) | distributor | character varying(20) | distributor_version | character varying(20) | topmost_filenames | text | addons_checked | boolean | flash_version | text | hangid | text | process_type | text | release_channel | text | productid | text | exploitability | text | flash_process_dump | text | url | text | email | text | (38 columns)

Slide 48

Slide 48 text

create table raw_crashes ( crash_id uuid, date_processed timestamptz, raw_crash json );

Slide 49

Slide 49 text

postgres# select pg_size_pretty(pg_table_size('reports _20130513')) as reports_20130513_size; reports_20130513_size ----------------------- 2136 MB postgres# select pg_size_pretty(pg_table_size('raw_cra shes_20130513')) as raw_crashes_20130513_size; raw_crashes_20130513_size --------------------------- 5246 MB

Slide 50

Slide 50 text

Per crash storage JSON: ~2.0k normalized table: ~0.8k

Slide 51

Slide 51 text

What’s the killer advantage?

Slide 52

Slide 52 text

SQL: 24 seconds vs Map Reduce: 30 minutes

Slide 53

Slide 53 text

Why use Postgres?

Slide 54

Slide 54 text

1. Use “bulkbag” schema design + schema evolution JSON to start, normalize to optimize Why use Postgres?

Slide 55

Slide 55 text

1. Use “bulkbag” schema design + schema evolution JSON to start, normalize to optimize 2. Easily scale to multi-terabyte DBs For write- or read-heavy loads, non-cloud storage Why use Postgres?

Slide 56

Slide 56 text

1. Use “bulkbag” schema design + schema evolution JSON to start, normalize to optimize 2. Easily scale to multi-terabyte DBs For write- or read-heavy loads, non-cloud storage 3. Manage your data with a language you love Put the .js for big data jobs with the big data Why use Postgres?

Slide 57

Slide 57 text

Try out 9.3beta today! http://www.postgresql.org/developer/beta/

Slide 58

Slide 58 text

TLDR Datatypes Conditional Indexes Transactional DDL Foreign Data Wrappers Concurrent Index Creation Extensions Common Table Expressions Fast Column Addition Listen/Notify Table Inheritance Per Transaction sync replication Window Functions Materialized Views Updatable Views Momentum

Slide 59

Slide 59 text

liberate your schema! with json and plv8 selena deckelmann @selenamarie [email protected] https://code.google.com/p/plv8js/wiki/PLV8 http://www.postgresql.org/developer/beta/ https://gist.github.com/selenamarie/5646494 http://www.pgxn.org/dist/json_enhancements/doc/ json_enhancements.html

Slide 60

Slide 60 text

Special thanks to: Scott Deckelmann Andrew Dunstan (JSON datatype developer) Hitoshi Harada (plv8 author) Will Leinweber (source) Craig Kerstiens (source) reviewers: Brenda Wallace Aurynn Shaw Doug Gorley Andrew Grangaard George London Jacinta Richardson

Slide 61

Slide 61 text

This work is licensed under the Creative Commons Attribution- ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/ or send a letter to Creative Commons, 444 Castro Street, Suite 900, Mountain View, California, 94041, USA. Photos from slides 1, 3, 6, 8, 9, 10, 12, 13, 22, 31, 39, 40, 48, 49, 51, 53 © Scott Deckelmann, used with permission.