Slide 1

Slide 1 text

PgREST Node.js in the Database Audrey Tang

Slide 2

Slide 2 text

只講程式 不講故事

Slide 3

Slide 3 text

PgREST is… • JSON document store • Running inside PostgreSQL • Working with existing relational data • Capable of loading Node.js modules • Compatible with MongoLab’s REST API • = LiveScript + PLV8 + plv8x + OneJS

Slide 4

Slide 4 text

JSON { "title": "萌", "bopomofo": "ㄇㄥˊ", "pinyin": "méng", "definitions": [ { "type": "名", "def": "草木初生的芽。" }, { "type": "名", "def": "事物發生的開端或徵兆。" }, { "type": "名", "def": "人民。" } ] }

Slide 5

Slide 5 text

PostgreSQL CREATE TABLE moe ( "entry" JSON ); INSERT INTO moe VALUES ($$ { "title":" 萌 ", "bopomofo": " ㄇㄥˊ", "pinyin": "méng", "definitions": [ { "type": " 名 ", "def": " 草木初生的芽。 " }, { "type": " 名 ", "def": " 事物發生的開端或徵兆。 " }, { "type": " 名 ", "def": " 人民。 " } ] } $$); INSERT INTO moe VALUES (' 這不是 ㄓㄟ ㄙㄣ ˇ'); -- type error

Slide 6

Slide 6 text

PLV8 CREATE EXTENSION plv8; CREATE FUNCTION get_json_key(obj JSON, key TEXT) returns JSON AS $$ return JSON.stringify( obj[key] ); $$ LANGUAGE plv8; SELECT get_json_key(entry, 'bopomofo') FROM moe; -- " ㄇㄥˊ"

Slide 7

Slide 7 text

plv8x: Operators SELECT entry |> 'this.bopomofo' FROM moe; -- "ㄇㄥˊ" SELECT entry ~> '@bopomofo' FROM moe; -- "ㄇㄥˊ" SELECT '@bopomofo' <~ entry FROM moe; -- "ㄇㄥˊ" SELECT ~> 'new Date'; -- "2013-04-17T12:31:57.523Z"

Slide 8

Slide 8 text

plv8x: Command Line npm i -g plv8x export PLV8XCONN=dbname plv8x -r script.ls # .js works too plv8x -E 'plv8.execute("SELECT entry FROM moe").0.entry.definitions' # [ { type: ' 名 ', def: ' 草木初生的芽。 ' }, # { type: ' 名 ', def: ' 事物發生的開端或徵兆。 ' }, # { type: ' 名 ', def: ' 人民。 ' } ]

Slide 9

Slide 9 text

plv8x: Modules npm i -g uax11 plv8x -i uax11 plv8x -E 'require "uax11" .toFullwidth "méng"' # meng SELECT entry ~> 'require "uax11" .toFullwidth @pinyin' FROM moe; -- "meng"

Slide 10

Slide 10 text

plv8x: Functions plv8x -f 'text fullwidth(text)=uax11:toFullwidth' plv8x -f 'text PINYIN(json)=:&0.pinyin.toUpperCase!' SELECT fullwidth('Ingy döt Net'); -- Ingy dot Net SELECT fullwidth( PINYIN(entry) ) FROM moe; -- MENG

Slide 11

Slide 11 text

Summary • V8 : JavaScript engine • PLV8 : Stored procedures in JavaScript • plv8x : Package manager for PLV8 • Turns NPM modules into SQL functions • JSON expressions with ~> and <~ • Code reuse for browser + server + database !

Slide 12

Slide 12 text

Cutting out the Middleware • Serve JSON API from SQL • Shared models & validation code • Put Business Logic into DB • Perfect fit for Medium Data™

Slide 13

Slide 13 text

@clkao++

Slide 14

Slide 14 text

3du.tw

Slide 15

Slide 15 text

The Revised MoE Dictionary (1994)

Slide 16

Slide 16 text

The Good • 160,000+ entries • Official, high quality sources • Rich etymology and historical usage • Full text search with regular expressions • Still frequently updated!

Slide 17

Slide 17 text

The Bad • Results are not bookmarkable • Requires N clicks to get to a definition • Rare characters become low-res bitmaps • Difficult to use on mobile devices • ”Optimized for IE 5.0 and Netscape 4.7+”!?

Slide 18

Slide 18 text

⧸/教育部國語推行委員會〈有關授權〉 The Sad 本會非常歡迎各位來連結「國語辭典」,但是 本會目前只開放以超連結 (hyperlink) 的方式與 國語辭典 首頁 連結,至於其他方式本會並未對 外開放授權。若還有疑問或建議,歡迎來信。 “

Slide 19

Slide 19 text

.…and the Very Crazy • 不需登入的網頁,會自動把你登出!

Slide 20

Slide 20 text

⧸/葉平〈還文於民〉 Yeh’s Ping, 2013.1.26. 所以我要 響應 零時政府 g0v.tw 的活動,來做 3du.tw,把字、詞、成語、定義、例句等等正 體中文資料,用開放的文字 API 釋放出來,加 上索引和搜尋的功能,讓任何想加值的個人或 公司都可以使用。 “

Slide 21

Slide 21 text

Hackpad for 3du.tw

Slide 22

Slide 22 text

零時黑客 集體砍站事件

Slide 23

Slide 23 text

g0v hackath1n, 2013.1.27. • Scrape 2741 idioms as HTML (@TonyQ, @MnO2) • Scrape 3000 characters as raw HTML (@au) • Design JSON schema from samples (@pingooo) • Design SQL schema from samples (@albb0920) • Parse HTML into JSON & SQLite (@kcwu) • …and for those 24x24 bitmaps…

Slide 24

Slide 24 text

← Big-5 → UTF-8

Slide 25

Slide 25 text

Crowd-OCR for 1000+ glyphs

Slide 26

Slide 26 text

Finished in 24 hours! Thanks to: Favonia, Jun-Yuan Yan, Yao Wei, Yaoting Huang, Poka, Caasi Huang, Daniel Liang, Grey Lee, Irvin Chen, Gugod, Schee…

Slide 27

Slide 27 text

粗略的共識 會動的程式

Slide 28

Slide 28 text

Applications • XUL Desktop App (@racklin) • OS X Dictionary (@yllan) • Windows 8 App (@wenpei) • iOS Client (@tomjpsun, @jamessa, @pct) • iOS Offline App (@zonble)

Slide 29

Slide 29 text

Integrations • Rails API server (@albb0920) • AngularJS Client+Server (@viirya) • Chrome Extension (@tonytonyjan) • Sublime Text plugin (@zonble) • WinRT Component (@eriksk)

Slide 30

Slide 30 text

Fair Use 為非營利之教育目的,依著作權法第 50 條, 「以中央或地方機關或公法人之名義公開發表 之著作,在合理範圍內,得重製、公開播送或 公開傳輸。」此處轉換格式、重新編排的編輯 著作權(如果有的話)由 @kcwu 以 CC0 釋出。 “

Slide 31

Slide 31 text

CC0: Public Domain 除前述資料檔之外,本目錄下的所有其他檔 案,由作者 唐鳳 在法律許可的範圍內,拋棄該 著作依著作權法所享有之權利,包括所有相關 與鄰接的法律權利,並宣告將該著作貢獻至公 眾領域。 “

Slide 32

Slide 32 text

moedict.tw

Slide 33

Slide 33 text

5 Stars of Open Data 1. ⊙☉ Open License 2. Structured Data 3. Non-Proprietary Format 4. ✧ Each Item has an URI 5. ✩ Linking between Items

Slide 34

Slide 34 text

URI Endpoints • https://moedict.tw/#文字 • 3 APIs (for non-Unicode characters): • /raw/文字.json {[8ff0]} • /uni/文字.json ⿰亻壯 • /pua/文字.json U+F8FF0

Slide 35

Slide 35 text

Web Fonts for Private-Use Area • Initially based on Hán Nôm font (@YaoWei) • Subset everything outside Big5 range • Hand-drawn PUA chars like ⿰亻壯 • Later on, switched to Hanazono 花園明朝 font • 75,619 + 8,236 glyphs • From 花園大学国際禅学研究所

Slide 36

Slide 36 text

科技始終 來自於佛性

Slide 37

Slide 37 text

Live Demo

Slide 38

Slide 38 text

Reaching the Fifth Star 1. ⊙☉ Open License 2. Structured Data 3. Non-Proprietary Format 4. ✧ Each Item has an URI 5. ✩ Linking between Items

Slide 39

Slide 39 text

Chinese Segmentation • Therearenowhitespacesbetweenwords • Lots of heuristic algorithms • Naive solution: Longest-token match • Requires a large dictionary • …wait, we just got one here

Slide 40

Slide 40 text

In-browser Implementation {"4":"一(丁不識|不小心|不扭眾|不拗眾|世之雄|世英名|丘一壑|丘之 貉|串驪珠|之為甚|之謂甚|乾二淨|了心願|了百了|了百當|事無成|五 一十|人之交|介不取|仍舊貫|代宗匠|代宗臣|代巨擘|代楷模|代風流| 代鼎臣|以當十|以貫之|來一往|來二去|依舊式|個勁兒|個子兒|個樣 兒|倡三歎|倡百和|偏之見|傅眾咻|償宿願|元大武|元復始|兵一卒|刀 一割|刀兩斷|刀兩段|分一毫|切從簡|切現成|切眾生|刻千金|力承當| 勇之夫|勞久逸|勞永逸|匡天下|去不返|反常態|口價兒|口兩匙|口咬 定|口咬死|古腦兒|名半職|吐為快|吹一唱|呼再諾|呼百應|呼百諾|命 嗚呼|哄而上|哄而散|哄而起|哄而集|唱一和|唱三歎|唱百和|喫一添|

Slide 41

Slide 41 text

Worked well, but… • Freezes IE8, crashes IE7 • Broken on Android 2.x, too • So let’s pre-segment on server • Needs a tool to move JS into DB • …wait, we just got one here

Slide 42

Slide 42 text

/a/文字.json {"h":[{"b":"ㄨㄣˊ ㄗˋ","d":[{"f":"`人類~`用來~`表示~`觀念~、 `記錄~`語言~`的~`符號~。","s":"`筆墨~,`翰墨~"},{"f":"`文書 ~。","q":["`五代史~`平話~.`梁~`史~.`卷~`上~:「`您~`去~`攻 破~`宋~`州~,`為我~`奪取~`張~`節使~`歸~`娘~。`才~`得~,`便~` 發文~`字~`來~`報~`我~。」","`警世通言~.`卷~`十~`三~.`三~`現 身~`包龍圖~`斷~`冤~:「`有~`甚事~`煩惱~?`想~`是~`縣~`裡~`有 ~`甚~`文字~`不了~。」"]}],"p":"wén zì"}],"t":"`文~`字~"}

Slide 43

Slide 43 text

Live Demo, part II

Slide 44

Slide 44 text

Materialized View: 160k .json files (@obra++)

Slide 45

Slide 45 text

Let’s PhoneGap it! • Freezes XCode, crashes Eclipse • Solution: Pack into 1024 .txt files • Take the first character, mod 1024 • Related words share the same bucket • Great success!

Slide 46

Slide 46 text

Google Play & App Store

Slide 47

Slide 47 text

User-Driven Development • Wildcard and part-of-word searching (@esor) • Two-column layout for tablets (@hlb) • Toggle between Pinyin and Bopomofo (@matic) • Volume key on Android resizes fonts (@ivan) • Top Request: Taiwanese Bân-lâm-gi

Slide 48

Slide 48 text

No content

Slide 49

Slide 49 text

Personal Motivation • My main caretakers were my grandparents • Grandma from Lo̍k-káng, Taiwan • Grandpa from Sì-chuān, China • Raised biligually as a pre-schooler • But only Mandarin had a writing system • Editing her memoir brought back memories

Slide 50

Slide 50 text

Taiwan Bân-lâm-gi Common Dictionary (MoE, 2011)

Slide 51

Slide 51 text

Good Parts • Unified Romanization system (TL) • Standardized Ideographic characters (RHC) • Full text search with Mandarin, TL & RHC • MP3 pronounciations of all entries • Licensed under CC-BY-ND 3.0

Slide 52

Slide 52 text

Not-so-good Parts • Entries are in non-bookmarkable s • No equivalent Mandarin field for entries • Still uses bitmaps for Ext-B+ fonts • Easy to scrape but hard to parse • …as discovered by @happyman_eric

Slide 53

Slide 53 text

g0v hackath2n, 2013.3.23.

Slide 54

Slide 54 text

Crowd-OCR for 154 glyphs, 2013.3.25.

Slide 55

Slide 55 text

Finished over lunch! Thanks to: @happyman, @Irvin, @hit1205, @MissleTW, @YuerLee, @YuanChao, @clkao, @MGDesigner, @gontera…

Slide 56

Slide 56 text

Database received, 2013.3.27. • 詞目總檔.xls 詞目總檔.屬性對照.xls • 釋義.xls 釋義.詞性對照.xls • 又音.xls 又音.屬性對照.xls • 近義詞對應.xls 反義詞對應.xls • 詞彙方言差.xls 語音方言差.xls • 例句.xls

Slide 57

Slide 57 text

.…What about that extra request? 您好: 資料匯入目前大致無誤。不過,twblg 網頁上 的「華語檢索」,可以用「一乾二淨」找到閩 語典的「離離」條目,這個對照表似乎沒有在 Excel 檔中看到? “

Slide 58

Slide 58 text

Well… 語言之間的對譯,不能盡然以詞彙對應,對不 夠深入了解的使用者來說,會讓他誤以為A語 言的X詞等於B語言的Y詞(並且這種呈現,會 被民眾認知為「教育部的辭典說的」)。 “

Slide 59

Slide 59 text

However… 因此華語對應這個欄位,我們是藏在系統中。 如果是民間的辭典編輯,會比較沒有這個負 擔,因此我這裡確實不能給,非常希望你們能 有辦法解決。 “

Slide 60

Slide 60 text

.…it’s all good. 好的,感謝您的提醒和協助。 目前從網頁以 Big5 範圍取出的華語條目, 共有 26274 筆對映。 在應用上,這部份我們會註明不屬於教育部 CC-BY-ND 的授權範圍。 “

Slide 61

Slide 61 text

Data Cleanup, 2013.3.30. • Convert all .xsl to .csv with LibreOffice 4 • 3 stars: Non-Proprietary Format • Replace PUA characters with mapped Unicode • Add x-造字.csv and x-華語對照表.csv • Time to put PgREST to work!

Slide 62

Slide 62 text

PgREST: MongoLab API Server • GET /collections/table_or_view • q=&c=true&f=&fo=true&s=&sk=&l= curl $LY/collections/bills?q={"proposal.0":" 吳育昇 "} curl $MOE/collections/entries?q={" 部首 ":" 一 "}&c=1 • PUT /collections/table_or_view

Slide 63

Slide 63 text

PgREST: Import/Export pgrest dbname export MOE=http://127.0.0.1:3000 curl -i -X PUT -H "Content-Type: text/csv" \ --data-binary @uni/ 詞目總檔 .csv $MOE/collections/entries curl $MOE/collections/entries # [{" 主編號 ","1"," 屬性 ":"1"," 詞目 ":" 一 "," 音讀 ":"ts i ̍ t", # " 文白俗替 ":" 替 "," 部首 ":" 一 "," 部首序 ":"001-00-01"," 方言差 ":""}]

Slide 64

Slide 64 text

PgREST: 3du.tw JSON in 48 lines https://github.com/g0v/moedict-data-twblg/blob/master/gen.ls “

Slide 65

Slide 65 text

Live Demo, part III

Slide 66

Slide 66 text

Lessons Learned • Open Data is a beginning, not an end • Keep conversations with all participants • Turn detractors into collaborators • Keep a kind heart • Assume the best intentions

Slide 67

Slide 67 text

宅心仁厚 仁者無敵

Slide 68

Slide 68 text

阿宅無敵

Slide 69

Slide 69 text

— Aaron Swartz, «Open Government» When is Transparency Useful? 眾人為了共同目標聚在一起,才能做出改變, 科技人很難獨力完成。 衡量成功的標準,可以是有多少人的生命因你 獲得改善,而不只是有多少人看你架的網站。 “

Slide 70

Slide 70 text

開站一時 開源一輩子

Slide 71

Slide 71 text

Thank you!

Slide 72

Slide 72 text

⧸/北島〈回答〉 Thank you! 新的轉機和閃閃星斗, 正在綴滿沒有遮攔的天空。 那是五千年的象形文字, 那是未來人們凝視的眼睛。 “