Upgrade to Pro — share decks privately, control downloads, hide ads and more …

PgREST: Node.js in the Database

唐鳳
April 20, 2013

PgREST: Node.js in the Database

OSDC.tw 2013 talk on plv8x, g0v, 3du, and https://moedict.tw/

唐鳳

April 20, 2013
Tweet

More Decks by 唐鳳

Other Decks in Technology

Transcript

  1. PgREST is… • JSON document store • Running inside PostgreSQL

    • Working with existing relational data • Capable of loading Node.js modules • Compatible with MongoLab’s REST API • = LiveScript + PLV8 + plv8x + OneJS
  2. JSON { "title": "萌", "bopomofo": "ㄇㄥˊ", "pinyin": "méng", "definitions": [

    { "type": "名", "def": "草木初生的芽。" }, { "type": "名", "def": "事物發生的開端或徵兆。" }, { "type": "名", "def": "人民。" } ] }
  3. PostgreSQL CREATE TABLE moe ( "entry" JSON ); INSERT INTO

    moe VALUES ($$ { "title":" 萌 ", "bopomofo": " ㄇㄥˊ", "pinyin": "méng", "definitions": [ { "type": " 名 ", "def": " 草木初生的芽。 " }, { "type": " 名 ", "def": " 事物發生的開端或徵兆。 " }, { "type": " 名 ", "def": " 人民。 " } ] } $$); INSERT INTO moe VALUES (' 這不是 ㄓㄟ ㄙㄣ ˇ'); -- type error
  4. PLV8 CREATE EXTENSION plv8; CREATE FUNCTION get_json_key(obj JSON, key TEXT)

    returns JSON AS $$ return JSON.stringify( obj[key] ); $$ LANGUAGE plv8; SELECT get_json_key(entry, 'bopomofo') FROM moe; -- " ㄇㄥˊ"
  5. plv8x: Operators SELECT entry |> 'this.bopomofo' FROM moe; -- "ㄇㄥˊ"

    SELECT entry ~> '@bopomofo' FROM moe; -- "ㄇㄥˊ" SELECT '@bopomofo' <~ entry FROM moe; -- "ㄇㄥˊ" SELECT ~> 'new Date'; -- "2013-04-17T12:31:57.523Z"
  6. plv8x: Command Line npm i -g plv8x export PLV8XCONN=dbname plv8x

    -r script.ls # .js works too plv8x -E 'plv8.execute("SELECT entry FROM moe").0.entry.definitions' # [ { type: ' 名 ', def: ' 草木初生的芽。 ' }, # { type: ' 名 ', def: ' 事物發生的開端或徵兆。 ' }, # { type: ' 名 ', def: ' 人民。 ' } ]
  7. plv8x: Modules npm i -g uax11 plv8x -i uax11 plv8x

    -E 'require "uax11" .toFullwidth "méng"' # meng SELECT entry ~> 'require "uax11" .toFullwidth @pinyin' FROM moe; -- "meng"
  8. plv8x: Functions plv8x -f 'text fullwidth(text)=uax11:toFullwidth' plv8x -f 'text PINYIN(json)=:&0.pinyin.toUpperCase!'

    SELECT fullwidth('Ingy döt Net'); -- Ingy dot Net SELECT fullwidth( PINYIN(entry) ) FROM moe; -- MENG
  9. Summary • V8 : JavaScript engine • PLV8 : Stored

    procedures in JavaScript • plv8x : Package manager for PLV8 • Turns NPM modules into SQL functions • JSON expressions with ~> and <~ • Code reuse for browser + server + database !
  10. Cutting out the Middleware • Serve JSON API from SQL

    • Shared models & validation code • Put Business Logic into DB • Perfect fit for Medium Data™
  11. The Good • 160,000+ entries • Official, high quality sources

    • Rich etymology and historical usage • Full text search with regular expressions • Still frequently updated!
  12. The Bad • Results are not bookmarkable • Requires N

    clicks to get to a definition • Rare characters become low-res bitmaps • Difficult to use on mobile devices • ”Optimized for IE 5.0 and Netscape 4.7+”!?
  13. ⧸/葉平〈還文於民〉 Yeh’s Ping, 2013.1.26. 所以我要 響應 零時政府 g0v.tw 的活動,來做 3du.tw,把字、詞、成語、定義、例句等等正

    體中文資料,用開放的文字 API 釋放出來,加 上索引和搜尋的功能,讓任何想加值的個人或 公司都可以使用。 “
  14. g0v hackath1n, 2013.1.27. • Scrape 2741 idioms as HTML (@TonyQ,

    @MnO2) • Scrape 3000 characters as raw HTML (@au) • Design JSON schema from samples (@pingooo) • Design SQL schema from samples (@albb0920) • Parse HTML into JSON & SQLite (@kcwu) • …and for those 24x24 bitmaps…
  15. Finished in 24 hours! Thanks to: Favonia, Jun-Yuan Yan, Yao

    Wei, Yaoting Huang, Poka, Caasi Huang, Daniel Liang, Grey Lee, Irvin Chen, Gugod, Schee…
  16. Applications • XUL Desktop App (@racklin) • OS X Dictionary

    (@yllan) • Windows 8 App (@wenpei) • iOS Client (@tomjpsun, @jamessa, @pct) • iOS Offline App (@zonble)
  17. Integrations • Rails API server (@albb0920) • AngularJS Client+Server (@viirya)

    • Chrome Extension (@tonytonyjan) • Sublime Text plugin (@zonble) • WinRT Component (@eriksk)
  18. 5 Stars of Open Data 1. ⊙☉ Open License 2.

    Structured Data 3. Non-Proprietary Format 4. ✧ Each Item has an URI 5. ✩ Linking between Items
  19. URI Endpoints • https://moedict.tw/#文字 • 3 APIs (for non-Unicode characters):

    • /raw/文字.json {[8ff0]} • /uni/文字.json ⿰亻壯 • /pua/文字.json U+F8FF0
  20. Web Fonts for Private-Use Area • Initially based on Hán

    Nôm font (@YaoWei) • Subset everything outside Big5 range • Hand-drawn PUA chars like ⿰亻壯 • Later on, switched to Hanazono 花園明朝 font • 75,619 + 8,236 glyphs • From 花園大学国際禅学研究所
  21. Reaching the Fifth Star 1. ⊙☉ Open License 2. Structured

    Data 3. Non-Proprietary Format 4. ✧ Each Item has an URI 5. ✩ Linking between Items
  22. Chinese Segmentation • Therearenowhitespacesbetweenwords • Lots of heuristic algorithms •

    Naive solution: Longest-token match • Requires a large dictionary • …wait, we just got one here
  23. Worked well, but… • Freezes IE8, crashes IE7 • Broken

    on Android 2.x, too • So let’s pre-segment on server • Needs a tool to move JS into DB • …wait, we just got one here
  24. Let’s PhoneGap it! • Freezes XCode, crashes Eclipse • Solution:

    Pack into 1024 .txt files • Take the first character, mod 1024 • Related words share the same bucket • Great success!
  25. User-Driven Development • Wildcard and part-of-word searching (@esor) • Two-column

    layout for tablets (@hlb) • Toggle between Pinyin and Bopomofo (@matic) • Volume key on Android resizes fonts (@ivan) • Top Request: Taiwanese Bân-lâm-gi
  26. Personal Motivation • My main caretakers were my grandparents •

    Grandma from Lo̍k-káng, Taiwan • Grandpa from Sì-chuān, China • Raised biligually as a pre-schooler • But only Mandarin had a writing system • Editing her memoir brought back memories
  27. Good Parts • Unified Romanization system (TL) • Standardized Ideographic

    characters (RHC) • Full text search with Mandarin, TL & RHC • MP3 pronounciations of all entries • Licensed under CC-BY-ND 3.0
  28. Not-so-good Parts • Entries are in non-bookmarkable <iframe>s • No

    equivalent Mandarin field for entries • Still uses bitmaps for Ext-B+ fonts • Easy to scrape but hard to parse • …as discovered by @happyman_eric
  29. Database received, 2013.3.27. • 詞目總檔.xls 詞目總檔.屬性對照.xls • 釋義.xls 釋義.詞性對照.xls •

    又音.xls 又音.屬性對照.xls • 近義詞對應.xls 反義詞對應.xls • 詞彙方言差.xls 語音方言差.xls • 例句.xls
  30. Data Cleanup, 2013.3.30. • Convert all .xsl to .csv with

    LibreOffice 4 • 3 stars: Non-Proprietary Format • Replace PUA characters with mapped Unicode • Add x-造字.csv and x-華語對照表.csv • Time to put PgREST to work!
  31. PgREST: MongoLab API Server • GET /collections/table_or_view • q=&c=true&f=&fo=true&s=&sk=&l= curl

    $LY/collections/bills?q={"proposal.0":" 吳育昇 "} curl $MOE/collections/entries?q={" 部首 ":" 一 "}&c=1 • PUT /collections/table_or_view
  32. PgREST: Import/Export pgrest dbname export MOE=http://127.0.0.1:3000 curl -i -X PUT

    -H "Content-Type: text/csv" \ --data-binary @uni/ 詞目總檔 .csv $MOE/collections/entries curl $MOE/collections/entries # [{" 主編號 ","1"," 屬性 ":"1"," 詞目 ":" 一 "," 音讀 ":"ts i ̍ t", # " 文白俗替 ":" 替 "," 部首 ":" 一 "," 部首序 ":"001-00-01"," 方言差 ":""}]
  33. Lessons Learned • Open Data is a beginning, not an

    end • Keep conversations with all participants • Turn detractors into collaborators • Keep a kind heart • Assume the best intentions
  34. — Aaron Swartz, «Open Government» When is Transparency Useful? 眾人為了共同目標聚在一起,才能做出改變,

    科技人很難獨力完成。 衡量成功的標準,可以是有多少人的生命因你 獲得改善,而不只是有多少人看你架的網站。 “