Summary
• V8 : JavaScript engine
• PLV8 : Stored procedures in JavaScript
• plv8x : Package manager for PLV8
• Turns NPM modules into SQL functions
• JSON expressions with ~> and <~
• Code reuse for browser + server + database!
Slide 12
Cutting out the Middleware
• Serve JSON API from SQL
• Shared models & validation code
• Put Business Logic into DB
• Perfect fit for Medium Data™
Slide 13
@clkao++
Slide 14
3du.tw
Slide 15
The Revised MoE Dictionary (1994)
Slide 16
The Good
• 160,000+ entries
• Official, high quality sources
• Rich etymology and historical usage
• Full text search with regular expressions
• Still frequently updated!
Slide 17
The Bad
• Results are not bookmarkable
• Requires N clicks to get to a definition
• Rare characters become low-res bitmaps
• Difficult to use on mobile devices
• “Optimized for IE 5.0 and Netscape 4.7+”!?
Slide 18
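The Sad
“The Committee warmly welcomes everyone to link to the Mandarin Dictionary. However, the Committee currently permits only hyperlinks to the dictionary's home page; no other form of use has been licensed to outside parties. If you have further questions or suggestions, please write to us.”
(National Languages Committee, Ministry of Education, “On Licensing”)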
g0v hackath1n, 2013.1.27.
• Scrape 2741 idioms as HTML (@TonyQ, @MnO2)
• Scrape 3000 characters as raw HTML (@au)
• Design JSON schema from samples (@pingooo)
• Design SQL schema from samples (@albb0920)
• Parse HTML into JSON & SQLite (@kcwu)
• …and for those 24x24 bitmaps…
Slide 24
← Big-5
→ UTF-8
Slide 25
Crowd-OCR for 1000+ glyphs
Slide 26
Finished in 24 hours!
Thanks to: Favonia, Jun-Yuan Yan, Yao Wei, Yaoting Huang, Poka,
Caasi Huang, Daniel Liang, Grey Lee, Irvin Chen, Gugod, Schee…
Slide 27
Rough consensus
Running code
Slide 28
Applications
• XUL Desktop App (@racklin)
• OS X Dictionary (@yllan)
• Windows 8 App (@wenpei)
• iOS Client (@tomjpsun, @jamessa, @pct)
• iOS Offline App (@zonble)
Slide 29
Integrations
• Rails API server (@albb0920)
• AngularJS Client+Server (@viirya)
• Chrome Extension (@tonytonyjan)
• Sublime Text plugin (@zonble)
• WinRT Component (@eriksk)
Web Fonts for Private-Use Area
• Initially based on Hán Nôm font (@YaoWei)
• Subset everything outside Big5 range
• Hand-drawn PUA chars like ⿰亻壯
• Later on, switched to the Hanazono Mincho (花園明朝) font
• 75,619 + 8,236 glyphs
• From the International Research Institute for Zen Buddhism, Hanazono University (花園大学国際禅学研究所)
Slide 36
Technology always comes
from Buddha-nature
Slide 37
Live Demo
Slide 38
Reaching the Fifth Star
1. Open License
2. Structured Data
3. Non-Proprietary Format
4. Each Item has a URI
5. Linking between Items
Slide 39
Chinese Segmentation
• Therearenowhitespacesbetweenwords
• Lots of heuristic algorithms
• Naive solution: Longest-token match
• Requires a large dictionary
• …wait, we just got one here
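The naive longest-token match above can be sketched in a few lines of Python. This is an illustrative sketch only, not the code used on 3du.tw; the function name `segment` and the tiny word set are assumptions made up for the example. Characters with no dictionary hit fall through as single-character tokens.

```python
def segment(text, dictionary):
    """Greedy forward longest-match segmentation (illustrative sketch)."""
    # The longest dictionary entry bounds how far ahead we need to look.
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the dictionary hits.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback token.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical word list, standing in for the real dictionary headwords.
words = {"there", "are", "no", "white", "spaces", "whitespaces", "between", "words"}
print(segment("therearenowhitespacesbetweenwords", words))
# ['there', 'are', 'no', 'whitespaces', 'between', 'words']
```

Because the match is greedy, a larger dictionary can also produce wrong splits (a longer headword may swallow the start of the next word), which is one reason the slide calls this the naive solution.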
Worked well, but…
• Freezes IE8, crashes IE7
• Broken on Android 2.x, too
• So let’s pre-segment on server
• Needs a tool to move JS into DB
• …wait, we just got one here
Let’s PhoneGap it!
• Freezes Xcode, crashes Eclipse
• Solution: Pack into 1024 .txt files
• Take the first character, mod 1024
• Related words share the same bucket
• Great success!
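The bucketing scheme above might look like the following Python sketch. It assumes "take the first character, mod 1024" means the Unicode code point of the first character modulo 1024; the function names and the placeholder entries are hypothetical, and the actual on-disk file format is not described in the slides.

```python
from collections import defaultdict

NUM_BUCKETS = 1024


def bucket_of(word):
    """Bucket index: code point of the first character, mod 1024 (assumed reading)."""
    return ord(word[0]) % NUM_BUCKETS


def pack(entries):
    """Group {headword: definition} pairs by bucket index."""
    buckets = defaultdict(dict)
    for word, definition in entries.items():
        buckets[bucket_of(word)][word] = definition
    return buckets


# Placeholder entries; the definitions are dummies, not dictionary data.
entries = {"萌": "definition A", "萌芽": "definition B", "一": "definition C"}
for index, group in sorted(pack(entries).items()):
    # Each bucket would become one file, e.g. "<index>.txt" (hypothetical naming).
    print(index, sorted(group))
```

Words that start with the same character always land in the same bucket, so looking up a word and its related compounds touches a single small file.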
Slide 46
Google Play & App Store
Slide 47
User-Driven Development
• Wildcard and part-of-word searching (@esor)
• Two-column layout for tablets (@hlb)
• Toggle between Pinyin and Bopomofo (@matic)
• Volume key on Android resizes fonts (@ivan)
• Top Request: Taiwanese Bân-lâm-gí
Slide 49
Personal Motivation
• My main caretakers were my grandparents
• Grandma from Lo̍k-káng, Taiwan
• Grandpa from Sì-chuān, China
• Raised bilingually as a pre-schooler
• But only Mandarin had a writing system
• Editing her memoir brought back memories
Slide 50
Taiwan Bân-lâm-gí Common Dictionary
(MoE, 2011)
Slide 51
Good Parts
• Unified Romanization system (TL)
• Standardized Ideographic characters (RHC)
• Full text search with Mandarin, TL & RHC
• MP3 pronunciations of all entries
• Licensed under CC-BY-ND 3.0
Slide 52
Not-so-good Parts
• Entries are in non-bookmarkable s
• No equivalent Mandarin field for entries
• Still uses bitmaps for Ext-B+ fonts
• Easy to scrape but hard to parse
• …as discovered by @happyman_eric
Slide 53
g0v hackath2n, 2013.3.23.
Slide 54
Crowd-OCR for 154 glyphs, 2013.3.25.
Slide 55
Finished over lunch!
Thanks to: @happyman, @Irvin, @hit1205, @MissleTW, @YuerLee,
@YuanChao, @clkao, @MGDesigner, @gontera…
Data Cleanup, 2013.3.30.
• Convert all .xls to .csv with LibreOffice 4
• 3 stars: Non-Proprietary Format
• Replace PUA characters with mapped Unicode
• Add x-造字.csv and x-華語對照表.csv
• Time to put PgREST to work!
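The PUA replacement step can be illustrated with a short Python sketch. The mapping below is entirely made up for the example (the real mapping lives in the project's CSV files, whose layout is not shown in the slides), and the function names are hypothetical.

```python
# Hypothetical mapping: Private-Use Area code point -> standard Unicode text.
# Both pairings are invented for illustration only.
PUA_MAP = {
    0xF8F0: "\U00020000",  # a single CJK Extension B character
    0xF8F1: "⿰亻壯",       # an ideographic description sequence as a stand-in
}


def replace_pua(text, mapping=PUA_MAP):
    """Replace every mapped PUA character; unmapped characters are left alone."""
    return text.translate(mapping)


def remaining_pua(text):
    """List characters still in the BMP Private-Use Area (U+E000 to U+F8FF)."""
    return [ch for ch in text if 0xE000 <= ord(ch) <= 0xF8FF]


cleaned = replace_pua("a\uf8f1b\ue001")
print(cleaned)                 # the mapped character is replaced, U+E001 is not
print(remaining_pua(cleaned))  # anything listed here still needs a mapping
```

Checking for leftover PUA characters after the pass is a cheap way to find glyphs that still need a mapping entry.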
Slide 62
PgREST: MongoLab API Server
• GET /collections/table_or_view
• q=&c=true&f=&fo=true&s=&sk=&l=
curl $LY/collections/bills?q={"proposal.0":"吳育昇"}
curl $MOE/collections/entries?q={"部首":"一"}&c=1
• PUT /collections/table_or_view
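As a rough illustration of the query-string shape above, the following Python sketch builds a `GET /collections/...` URL with a JSON-encoded `q` parameter. The helper name `collection_url` and the base URL are hypothetical, and reading `c` as "count" and `l` as "limit" is an assumption based on the MongoLab-style parameter names on the slide.

```python
import json
from urllib.parse import urlencode


def collection_url(base, collection, query=None, count=False, limit=None):
    """Build a MongoLab-style collection URL (illustrative sketch)."""
    params = {}
    if query is not None:
        # q carries the filter as compact JSON; urlencode percent-encodes it.
        params["q"] = json.dumps(query, ensure_ascii=False, separators=(",", ":"))
    if count:
        params["c"] = "true"    # assumed meaning: return a count
    if limit is not None:
        params["l"] = str(limit)  # assumed meaning: limit the result size
    url = f"{base}/collections/{collection}"
    return f"{url}?{urlencode(params)}" if params else url


# Placeholder base URL standing in for $MOE on the slide.
print(collection_url("http://localhost:3000", "entries", {"部首": "一"}, count=True))
```

Percent-encoding the JSON this way also sidesteps shell quoting problems with `{`, `"`, and `&` when the URL is passed to a command-line client.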
PgREST: 3du.tw JSON in 48 lines
https://github.com/g0v/moedict-data-twblg/blob/master/gen.ls
Slide 65
Live Demo, part III
Slide 66
Lessons Learned
• Open Data is a beginning, not an end
• Keep conversations with all participants
• Turn detractors into collaborators
• Keep a kind heart
• Assume the best intentions
Slide 67
Kind at heart (宅心仁厚)
The benevolent are invincible (仁者無敵)
Slide 68
Geeks are invincible (阿宅無敵)
Slide 69
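When is Transparency Useful?
“Change happens when people come together around a common goal; technologists can rarely achieve it alone. Success can be measured by how many lives you have improved, not just by how many people visit the website you built.”
(Aaron Swartz, «Open Government»)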