Upgrade to Pro — share decks privately, control downloads, hide ads and more …

PostgreSQL Hearts MongoDB: Querying Mongo with SQL

PostgreSQL Hearts MongoDB: Querying Mongo with SQL

Developers are adopting MongoDB like crazy, but the transition isn't always smooth for other teams at the company. PostgreSQL provides a great solution for bridging the gap by allowing SQL users to query live MongoDB data and process it with the PostgreSQL engine. This is done through a new and rapidly evolving feature of PostgreSQL called foreign data wrappers, which lets anyone write an extension to fetch data from an external source.

David Crawford

April 23, 2013
Tweet

Other Decks in Technology

Transcript

  1. PostgreSQL — MongoDB Experiences with the MongoDB Foreign Data Wrapper

    at Metrica David Crawford, [email protected] (@getmetrica) Tuesday, April 23, 13
  2. Doesn’t SQL suck? • Mongo abandons SQL for good reason

    • Other stakeholders to consider • SQL and NoSQL in harmony Tuesday, April 23, 13
  3. Wanna See? • Demo time! • CREATE EXTENSION mongo_fdw; •

    CREATE SERVER mongo FOREIGN DATA WRAPPER mongo_fdw OPTIONS (address 'localhost', port '27017'); • CREATE FOREIGN TABLE table_name ([columns]) SERVER mongo OPTIONS (database ‘test’, collection ‘blog’); • SELECT * FROM table_name; -- woohoo! Tuesday, April 23, 13
  4. Joins across collections • http://tebros.com/2011/07/using-mongodb-mapreduce-to-join-2- collections/ select country, age, sum("FY2009")

    from life_expectancy le join us_economic_assistance uea on le.country = uea.country_name group by 1, 2; Tuesday, April 23, 13
  5. MongoDB mapreduce join life_expect_map = function() { // Simply emit

    the age and 0 for the dollar amount. // The dollar amount will come from the other collection. emit(this.country, {life_expectancy: this.age, dollars: 0}); } us_econ_map = function() { // The data set contains grant amounts going back to 1946. I // am only interested in 2009 grants. if (this.FY2009 !== undefined && this.FY2009 !== null) { emit(this.country_name, { dollars: this.FY2009, life_expectancy: 0 }); } } Tuesday, April 23, 13
  6. MongoDB mapreduce join r = function(key, values) { var result

    = {dollars: 0, life_expectancy: 0}; values.forEach(function(value) { // Sum up all the money from all the 2009 grants for this // country (key) result.dollars += (value.dollars !== null) ? value.dollars : 0; // Only set life expectancy once if (result.life_expectancy === 0 && value.life_expectancy !== null ) { result.life_expectancy = value.life_expectancy; } }); return result; } res = db.life_expectancy.mapReduce(life_expect_map, r, {out: {reduce: 'joined'}}) res = db.us_economic_assistance.mapReduce(us_econ_map, r, {out: {reduce: 'joined'}}) db.joined.find() Tuesday, April 23, 13
  7. Don’t join, embed • If you embed lists in your

    documents, you usually don’t need to join. • But sometimes you do • MapReduce lets you do it at scale (if you know what you’re doing) Tuesday, April 23, 13
  8. What’s going on? • PostgreSQL Foreign Data Wrappers: read-only* external

    data access SELECT * FROM pg_table JOIN mongo_table ... PG SQL engine PG storage FDW Mongo C Library pg_table mongo_table *for now Tuesday, April 23, 13
  9. How it works • Before query (optional) • Analyze •

    At query time • Planning • Begin, Iterate, End Tuesday, April 23, 13
  10. Foreign Data Wrappers (in 9.2) • http://www.postgresql.org/docs/9.2/static/fdwhandler.html • Minimum API

    void GetForeignPaths (PlannerInfo *root, ...); ForeignScan *GetForeignPlan(PlannerInfo *root, ...); void BeginForeignScan (ForeignScanState *node, int eflags); TupleTableSlot * IterateForeignScan (ForeignScanState *node); void ReScanForeignScan (ForeignScanState *node); void EndForeignScan (ForeignScanState *node); Tuesday, April 23, 13
  11. Mongo_FDW Development • First version released by CitusData in October

    2012 • Metrica contributing • Lots more to do Tuesday, April 23, 13
  12. Performance • Setup: • MongoDB 2.2.3* fresh install on EC2

    micro instance. No indexes. • PostgreSQL 9.2 fresh install on separate EC2 micro instance. Separate availability zone. • Same tests run with both installed on macbook pro with 16gb ram. • Collection of ~500,000 tweets, with full api info (user, retweet users, etc). 1.4GBs according to mongo stats(). *Got memory errors on 2.4.3 Tuesday, April 23, 13
  13. Performance Average number of statuses by each tweeting user MongoDB:

    53 seconds (ec2 micro), 8 seconds (mbp) db.twitter.aggregate({ $group: { _id: 0, avgStatuses: { $avg: "$user.statuses_count" } } }) PostgreSQL: 58 seconds (ec2 micro),11 seconds (mbp) select avg("user.statuses_count") from twitter; 0 14.5 29 43.5 58 mongo pg Tuesday, April 23, 13
  14. Performance Top 5 tweets by user status count MongoDB: 26

    seconds (ec2 micro), 15 seconds (mbp) db.twitter.aggregate([{ $sort: { "user.statuses_count" : -1 }}, { $limit: 5 }, { $project: { "user.id": 1, "user.statuses_count": 1 }}]) PostgreSQL: 53 seconds (ec2 micro), 11 seconds (mbp) select “user.id”, “user.statuses_count” from twitter order by 2 desc limit 5; 0 15 30 45 60 mongo pg Tuesday, April 23, 13
  15. Performance Count of collection MongoDB: 0.5 seconds (ec2 micro), 0.02

    seconds (mbp) db.twitter.count() PostgreSQL: 135 seconds (ec2 micro), 11 seconds (mbp) select count(1) from twitter; 0 37.5 75 112.5 150 mongo pg Tuesday, April 23, 13
  16. Performance Count of tweets with a user MongoDB: 62 seconds

    (ec2 micro), 0.6 seconds (mbp) db.twitter.count({$exists: {user: 1}}) PostgreSQL: 71 seconds (ec2 micro), 11 seconds (mbp) select count(“user.id”) from twitter; 0 17.75 35.5 53.25 71 mongo pg Tuesday, April 23, 13
  17. Performance Count of tweets by language MongoDB: 80 seconds (ec2

    micro), 5 seconds (mbp) db.twitter.aggregate({ $group: { _id: "$lang", count: { $sum: 1 } } }) PostgreSQL: 57 seconds (ec2 micro), 11 seconds (mbp) select lang, count(1) from twitter group by 1; 0 20 40 60 80 mongo pg Tuesday, April 23, 13
  18. In the trenches • Mongo users get fast and loose

    with types { “_id”: ObjectID(5de12a323f4), “last_login”: 1366551505, “total_logins”: “25” } • In SQL, you want them cleaned up CREATE FOREIGN TABLE users ( _id varchar(32), last_login TIMESTAMP, total_logins INT )... Tuesday, April 23, 13
  19. In the trenches • Embedded arrays db.blogs.findOne() { "_id" :

    ObjectId("510385792aa3ec6d3c97f4ae"), "published" : ISODate("2013-01-01T08:01:00Z"), ... "comments" : [ { "name" : "John Snow", "comment" : "Back in my day..." "time" : ISODate("2013-01-01T08:21:03Z") } ] } Tuesday, April 23, 13
  20. In the trenches • Embedded arrays -- Count number of

    posts commented on by commenter select name, count(distinct “parent._id”) as posts from blog_comments group by name; Tuesday, April 23, 13
  21. In the trenches • In MongoDB: db.blog.aggregate([ {$unwind: “$comments”}, {$group:

    {_id: {name: “$comments.name”, post: “$_id”}}}, {$group: {_id: “$_id.name”, posts: {$sum: 1}}} ]); Tuesday, April 23, 13
  22. In the trenches • Mongo documents tend to be very

    large (especially when flattened) { "_id" : ObjectId("5169bf9a56a88c438334cdf5"), "created_at" : "Sat Apr 13 20:27:03 +0000 2013", "id" : NumberLong("323170489868623872"), "id_str" : "323170489868623872", "text" : "RT @U_KAUS: New joint #POISONEDMIND coming soon yall!..pls RT!", "source" : "<a href=\"http://blackberry.com/twitter\" rel=\"nofollow\">Twitter for BlackBerry®</a>", "truncated" : false, "in_reply_to_status_id" : null, "in_reply_to_status_id_str" : null, "in_reply_to_user_id" : null, "in_reply_to_user_id_str" : null, "in_reply_to_screen_name" : null, "user" : { "id" : 226802658, "id_str" : "226802658", "name" : "WICKEDEST DJ DINO", "screen_name" : "WickedestDjDino", "location" : "Lagos Nigeria", "url" : null, "description" : "Dare Ezekiel ojo aka WICKEDEST DJ DINO is smooth/battle dj/radio(88.9)brilafm dj/club dj.contact:0805917/375 or 08101675527/[email protected].", "protected" : false, "followers_count" : 1160, "friends_count" : 203, "listed_count" : 2, "created_at" : "Wed Dec 15 03:21:53 +0000 2010", "favourites_count" : 8, "utc_offset" : 3600, "time_zone" : "Amsterdam", "geo_enabled" : false, "verified" : false, "statuses_count" : 3263, "lang" : "en", "contributors_enabled" : false, "is_translator" : false, "profile_background_color" : "C0DEED", "profile_background_image_url" : "http://a0.twimg.com/profile_background_images/ 786695620/57f50498413a3d2055d8e12d4c1e033b.jpeg", "profile_background_image_url_https" : "https://si0.twimg.com/profile_background_images/786695620/57f50498413a3d2055d8e12d4c1e033b.jpeg", "profile_background_tile" : true, "profile_image_url" : "http://a0.twimg.com/profile_images/3090057327/796c56e6957599da4f5328adc0025945_normal.jpeg", "profile_image_url_https" : "https://si0.twimg.com/profile_images/ 3090057327/796c56e6957599da4f5328adc0025945_normal.jpeg", "profile_banner_url" : "https://si0.twimg.com/profile_banners/226802658/1351766661", "profile_link_color" : "0084B4", "profile_sidebar_border_color" : "FFFFFF", "profile_sidebar_fill_color" : "DDEEF6", "profile_text_color" : "333333", "profile_use_background_image" : true, "default_profile" : false, "default_profile_image" : false, "following" : null, "follow_request_sent" : null, "notifications" : null }, "geo" : null, "coordinates" : null, "place" : null, "contributors" : null, "retweeted_status" : { "created_at" : "Sat Apr 13 20:17:57 +0000 2013", "id" : NumberLong("323168198759440385"), "id_str" : "323168198759440385", "text" : "New joint #POISONEDMIND coming soon yall!..pls RT!", "source" : "<a href=\"http://blackberry.com/twitter\" rel=\"nofollow\">Twitter for BlackBerry®</a>", "truncated" : false, "in_reply_to_status_id" : null, "in_reply_to_status_id_str" : null, "in_reply_to_user_id" : null, "in_reply_to_user_id_str" : null, "in_reply_to_screen_name" : null, "user" : { "id" : 145641091, "id_str" : "145641091", "name" : "U-KAUS iz a JEJE Boi", "screen_name" : "U_KAUS", "location" : "Backyard!", "url" : "http://www.facebook.com/pages/U-Kaus/165654578040?ref=hl", "description" : "Artiste/Producer/Entertainer certified JEJE Boi!!.. #POiSONED M!ND# Contact: [email protected] or +2348023239608. Thanks!", "protected" : false, "followers_count" : 1515, "friends_count" : 1148, "listed_count" : 0, "created_at" : "Wed May 19 13:25:03 +0000 2010", "favourites_count" : 137, "utc_offset" : -32400, "time_zone" : "Alaska", "geo_enabled" : true, "verified" : false, "statuses_count" : 30519, "lang" : "en", "contributors_enabled" : false, "is_translator" : false, "profile_background_color" : "EDECE9", "profile_background_image_url" : "http://a0.twimg.com/images/themes/theme3/bg.gif", "profile_background_image_url_https" : "https://si0.twimg.com/images/themes/theme3/bg.gif", "profile_background_tile" : false, "profile_image_url" : "http://a0.twimg.com/profile_images/ 3349715815/153a7a84dddd4568bfa6f4c4db91d679_normal.jpeg", "profile_image_url_https" : "https://si0.twimg.com/profile_images/3349715815/153a7a84dddd4568bfa6f4c4db91d679_normal.jpeg", "profile_link_color" : "088253", "profile_sidebar_border_color" : "D3D2CF", "profile_sidebar_fill_color" : "E3E2DE", "profile_text_color" : "634047", "profile_use_background_image" : true, "default_profile" : false, "default_profile_image" : false, "following" : null, "follow_request_sent" : null, "notifications" : null }, "geo" : { "type" : "Point", "coordinates" : [ 6.60471, 3.38571 ] }, "coordinates" : { "type" : "Point", "coordinates" : [ 3.38571, 6.60471 ] }, "place" : null, "contributors" : null, "retweet_count" : 1, "favorite_count" : 0, "entities" : { "hashtags" : [ { "text" : "POISONEDMIND", "indices" : [ 10, 23 ] } ], "urls" : [ ], "user_mentions" : [ ] }, "favorited" : false, "retweeted" : false, "lang" : "en" }, "retweet_count" : 0, "favorite_count" : 0, "entities" : { "hashtags" : [ { "text" : "POISONEDMIND", "indices" : [ 22, 35 ] } ], "urls" : [ ], "user_mentions" : [ { "screen_name" : "U_KAUS", "name" : "U-KAUS iz a JEJE Boi", "id" : 145641091, "id_str" : "145641091", "indices" : [ 3, 10 ] } ] }, "favorited" : false, "retweeted" : false, "filter_level" : "low" } Tuesday, April 23, 13
  23. In the trenches • Mongo keys are often formulaic {

    "_id" : ObjectId("510385792aa3ec6d3c97f4ae"), "map00:00": 1, "map01:00": 0, "map01:01": 1 } Tuesday, April 23, 13
  24. In the trenches • Postgres has you covered • Fancy

    data types: HSTORE, JSON, array columns • http://www.postgresql.org/docs/current/static/hstore.html • http://www.postgresql.org/docs/current/static/datatype-json.html • http://www.postgresql.org/docs/current/static/functions-json.html Tuesday, April 23, 13
  25. FDWs with Python 11 def google(search): 12 """Retrieves results from

    google using the json api""" 13 query = urllib.urlencode({'q': search}) 14 url = ('http://ajax.googleapis.com/ajax/' 15 'services/search/web?v=1.0&%s' % query) 16 response = urllib.urlopen(url) 17 results = response.read() 18 results = json.loads(results) 19 data = results['responseData'] 20 hits = data['results'] 21 for hit in hits: 22 yield {'url': hit['url'].encode("utf-8"), 23 'title': hit["titleNoFormatting"].encode("utf-8"), 24 'search': search.encode("utf-8")} 25 26 27 class GoogleFdw(ForeignDataWrapper): 28 """A Google search foreign data wrapper. 29 30 Parses the quals to find anything ressembling a search criteria, and 31 returns the google search result for it. 32 Available columns are: url, title, search. 33 34 """ 35 36 def execute(self, quals, columns): 37 if not quals: 38 return ("No search specified",) 39 for qual in quals: 40 if qual.field_name == "search" or qual.operator == "=": 41 return google(qual.value) http://multicorn.org/ Tuesday, April 23, 13
  26. How can you help? • Better planning • More ways

    to map data models • Instrumentation (especially integration w/ pg_stat_activity) • Query cancelation • Back a table with a query • Use it and file bug reports Tuesday, April 23, 13
  27. Contact us • http://github.com/citusdata/mongo_fdw (official version) • http://github.com/davidcrawford/mongo_fdw (my fork,

    will have important functionality contributed to citusdata’s version) • David Crawford, CEO Metrica - [email protected] • @getmetrica on Twitter Tuesday, April 23, 13