Data Parsing and Visualization

April 24, 2014

Talk for NTU CCSP Course


Deck by tkirby


  1. var comicName = "怠蔣敹!";
    var nextVolume = "/HTML/Naruto/476/";
    var preVolume = "/HTML/Naruto/474/";
    var picCount = 17;
    var picAy = new Array();
    var hosts = ["http://hotpic.sfacg.com", "http://hotpic.sfacg.…", "http://ltpic.sfacg.com"];
    picAy[0] = "/Pic/OnlineComic1/Naruto/475/001_1924.png";
    picAy[1] = "/Pic/OnlineComic1/Naruto/475/002_1209.png";
    picAy[2] = "/Pic/OnlineComic1/Naruto/475/003_17512.png";
    picAy[3] = "/Pic/OnlineComic1/Naruto/475/004_13346.png";
    picAy[4] = "/Pic/OnlineComic1/Naruto/475/005_6797.png";
    picAy[5] = "/Pic/OnlineComic1/Naruto/475/006_16563.png";
    picAy[6] = "/Pic/OnlineComic1/Naruto/475/007_4992.png";
    picAy[7] = "/Pic/OnlineComic1/Naruto/475/008_5900.png";
    picAy[8] = "/Pic/OnlineComic1/Naruto/475/009_30082.png";
    picAy[9] = "/Pic/OnlineComic1/Naruto/475/010_18438.png";
    picAy[10] = "/Pic/OnlineComic1/Naruto/475/011_19255.png";
    picAy[11] = "/Pic/OnlineComic1/Naruto/475/012_17436.png";
    picAy[12] = "/Pic/OnlineComic1/Naruto/475/013_14834.png";
    picAy[13] = "/Pic/OnlineComic1/Naruto/475/014_16148.png";
    . . . . . . (rest omitted)
  2. HTML
    • html: a document with tags

    <html>
      <body>
        <img src="blah"/>
        <h1 id="main"> Header </h1>
        <p class="hot">hot!</p>
        <p> this is a book … </p>
        <a href="http://…">back</a>
      </body>
    </html>
  3. CSS Selector
    • html: a document with tags
    • tag selector: p, div
    • class selector: .title, .desc
    • id selector: #content, #dialog
    • relations: h1 > a (child), br ~ p (following sibling), p div (descendant)
    • pseudo: a:first-letter, h2:first-line
    • attr: a[href="#"]
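To make the three basic selector types concrete, here is a rough sketch that matches them against elements modeled as plain objects. `matchesSimple` and `select` are hypothetical helpers written for this illustration; they are not part of cheerio or any browser API, and they ignore relations, pseudo selectors and attribute selectors.

```javascript
// Elements are modeled as plain objects; this is an illustration, not a real DOM.
var elements = [
  { tag: "h1", id: "main", classes: [] },
  { tag: "p", id: "", classes: ["hot"] },
  { tag: "p", id: "", classes: [] }
];

// Decide whether one element satisfies a single tag, .class or #id selector.
function matchesSimple(el, selector) {
  if (selector[0] === "#") return el.id === selector.slice(1);
  if (selector[0] === ".") return el.classes.indexOf(selector.slice(1)) !== -1;
  return el.tag === selector;
}

// Return every element that the selector matches.
function select(selector) {
  return elements.filter(function (el) { return matchesSimple(el, selector); });
}

console.log(select("p").length);     // 2
console.log(select(".hot").length);  // 1
console.log(select("#main")[0].tag); // h1
```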
  4. cheerio
    A jQuery subset for the server side.

    npm install cheerio

    var cheerio = require("cheerio");
    var $ = cheerio.load(YourHtmlSnippet);

    var result = $(".r-ent .hl").text();
  5. request + cheerio

    npm install cheerio request

    var request = require("request");
    var cheerio = require("cheerio");

    var ptt = "http://ptt.cc/bbs/Gossiping/index.html";

    // e: error, r: response, b: body (the page HTML)
    request(ptt, function(e, r, b) {
      if (e) return;
      var $ = cheerio.load(b);
      var result = $(".r-ent .hl").text();
    });
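  6. request + cheerio

    // Fetch pages one after another. Here ptt(idx) is assumed to be a
    // function that returns the URL of index page number idx.
    var result = {}, idx = 1;
    function download() {
      request(ptt(idx), function(e, r, b) {
        if (e) return; // stop at the first error
        var $ = cheerio.load(b);
        result[idx] = $(".r-ent .hl").text();
        idx++;
        setTimeout(download, 0); // schedule the next page
      });
    }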
  8. https://cris.hpa.gov.tw/ (Online Cancer Registry System)

    curl 'https://cris.hpa.gov.tw/pagepub/Home.aspx?itemNo=cr.q.40' \
      -H 'Cookie: ASP.NET_SessionId=zdmkqgelubtzswj0majv3aiw; _ga=GA1.3.1312976917.1398148250' \
      -H 'Origin: https://cris.hpa.gov.tw' \
      -H 'Accept-Encoding: gzip,deflate,sdch' \
      -H 'Accept-Language: zh-TW,zh;q=0.8,en-US;q=0.6,en;q=0.4,ja;q=0.2,zh-CN;q=0.2' \
      -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.116 Safari/537.36' \
      -H 'Content-Type: multipart/form-data; boundary=----WebKitFormBoundaryuuaKlSwaG0BKt1db' \
      -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' \
      -H 'Cache-Control: max-age=0' \
      -H 'Referer: https://cris.hpa.gov.tw/pagepub/Home.aspx?itemNo=cr.q.40' \
      -H 'Connection: keep-alive' \
      --data-binary $'------WebKitFormBoundaryuuaKlSwaG0BKt1db\r\nContent-Disposition: form-data; name="__EVENTTARGET"\r\n\r\n\r\n------WebKitFormBoundaryuuaKlSwaG0BKt1db\r\nContent-Disposition: form-data; name="__EVENTARGUMENT"\r\n\r\n\r\n------WebKitFormBoundaryuuaKlSwaG0BKt1db\r\nContent-Disposition: form-data; name="__VIEWSTATE_ID"\r\n\r\n871165\r\n------WebKitFormBoundaryuuaKlSwaG0BKt1db\r\nContent-Disposition: form-data; name="__VIEWSTATE"\r\n\r\n\r\n------WebKitFormBoundaryuuaKlSwaG0BKt1db\r\nContent-Disposition: form-data; name="__EVENTVALIDATION"\r\n\r\nXOBXdAUfZ/+c9MQLjFuAu4ua2oATUYv7Ky61orP3bnuQBqTBl/he3+CGfCqTF6lo5pzEGdQnpmzPocNVZ+EjiTjjj/0tLHu8PFynQ0i1IVvE3EEyJvHguowAYIThKE4kvfrG7A==\r\n------WebKitFormBoundaryuuaKlSwaG0BKt1db\r\nContent-Disposition: form-data; …
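The `--data-binary` payload in the curl command is a multipart/form-data body: every form field becomes one part, introduced by `--` plus the boundary named in the `Content-Type` header. The sketch below shows how such a body could be assembled. `buildMultipart` is a hypothetical helper, the field names are taken from the slide, and only the fields whose values are visible there are included.

```javascript
// Build a multipart/form-data body from a plain object of text fields.
function buildMultipart(fields, boundary) {
  var body = "";
  Object.keys(fields).forEach(function (name) {
    body += "--" + boundary + "\r\n";
    body += 'Content-Disposition: form-data; name="' + name + '"\r\n\r\n';
    body += fields[name] + "\r\n";
  });
  return body + "--" + boundary + "--\r\n"; // closing delimiter
}

// Boundary and field names as shown on the slide.
var boundary = "----WebKitFormBoundaryuuaKlSwaG0BKt1db";
var body = buildMultipart({
  __EVENTTARGET: "",
  __EVENTARGUMENT: "",
  __VIEWSTATE_ID: "871165"
}, boundary);

console.log(body);
```

A real request to an ASP.NET page would also need the current `__EVENTVALIDATION` value, which changes per session, so it has to be read from a freshly fetched page first.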
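  9. phantomjs
    A headless WebKit, scriptable with a JavaScript API.

    var page = require("webpage").create();
    page.open(SomeURL, function() {
      // evaluate() runs inside the page, so it can read the page's own variables
      var comicID = page.evaluate(function() {
        return comicCounterID;
      });
      page.render("snapshot.png");
      phantom.exit(); // without this the script keeps running
    });

    http://github.com/hcchien/doh-cancer/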
  10. Summary
    * CSS Selector
    * Cheerio + request
    * HTTP Header
    * wget / curl
    * PhantomJS
    * Tools like OpenCV
  11. var comicName = "怠蔣敹!";
    var nextVolume = "/HTML/Naruto/476/";
    var preVolume = "/HTML/Naruto/474/";
    var picCount = 17;
    var picAy = new Array();
    var hosts = ["http://hotpic.sfacg.com", "http://hotpic.sfacg.…", "http://ltpic.sfacg.com"];
    picAy[0] = "/Pic/OnlineComic1/Naruto/475/001_1924.png";
    picAy[1] = "/Pic/OnlineComic1/Naruto/475/002_1209.png";
    picAy[2] = "/Pic/OnlineComic1/Naruto/475/003_17512.png";
    picAy[3] = "/Pic/OnlineComic1/Naruto/475/004_13346.png";
    picAy[4] = "/Pic/OnlineComic1/Naruto/475/005_6797.png";
    picAy[5] = "/Pic/OnlineComic1/Naruto/475/006_16563.png";
    picAy[6] = "/Pic/OnlineComic1/Naruto/475/007_4992.png";
    picAy[7] = "/Pic/OnlineComic1/Naruto/475/008_5900.png";
    picAy[8] = "/Pic/OnlineComic1/Naruto/475/009_30082.png";
    picAy[9] = "/Pic/OnlineComic1/Naruto/475/010_18438.png";
    picAy[10] = "/Pic/OnlineComic1/Naruto/475/011_19255.png";
    picAy[11] = "/Pic/OnlineComic1/Naruto/475/012_17436.png";
    picAy[12] = "/Pic/OnlineComic1/Naruto/475/013_14834.png";
    picAy[13] = "/Pic/OnlineComic1/Naruto/475/014_16148.png";
    . . . . . . (rest omitted)
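When the data sits inside a script like this one instead of in HTML tags, a pattern match over the raw text is one way to get at it. The sketch below runs on a shortened copy of the script. Joining the first host with each path to form an image URL is an assumption made for illustration; the slide does not say how the site combines them.

```javascript
// A shortened copy of the page script from the slide.
var source =
  'var picCount = 17; var picAy = new Array(); ' +
  'picAy[0] = "/Pic/OnlineComic1/Naruto/475/001_1924.png"; ' +
  'picAy[1] = "/Pic/OnlineComic1/Naruto/475/002_1209.png"; ' +
  'picAy[2] = "/Pic/OnlineComic1/Naruto/475/003_17512.png";';

// Capture the index and the quoted path of every picAy[n] = "..." assignment.
var pattern = /picAy\[(\d+)\]\s*=\s*"([^"]+)"/g;
var host = "http://hotpic.sfacg.com"; // assumption: first entry of `hosts`
var urls = [];
var m;
while ((m = pattern.exec(source)) !== null) {
  urls[Number(m[1])] = host + m[2];
}

console.log(urls.length); // 3
console.log(urls[0]);     // http://hotpic.sfacg.com/Pic/OnlineComic1/Naruto/475/001_1924.png
```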
  12. a.c matches: aac aBc a:c a-c a1c a2c ….
    a[a-z]c matches: aac abc acc adc aec ….
    ^a[a-z]c$ matches: aac abc acc adc aec ….
    a[^A-Z]c
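The four patterns can be tried against a small word list to see how the dot, a character range, the anchors and a negated class differ. The word list is made up for this sketch, and `matching` is a hypothetical helper.

```javascript
var words = ["aac", "aBc", "a:c", "a1c", "abc", "xabcx"];

// Return the words in which the pattern is found.
function matching(re) {
  return words.filter(function (w) { return re.test(w); });
}

console.log(matching(/a.c/));       // all six words: "." accepts any single character
console.log(matching(/a[a-z]c/));   // [ 'aac', 'abc', 'xabcx' ]
console.log(matching(/^a[a-z]c$/)); // [ 'aac', 'abc' ]: anchors require the whole word to match
console.log(matching(/a[^A-Z]c/));  // every word except 'aBc'
```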
  13. + * ? . [ ] ^ $ ( )
    \w \W \d \D \s \S
    \\ \. \+ \? \* …
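A few one-liners show what the quantifiers, the shorthand classes and the backslash escapes do in practice. The input strings are invented for this sketch.

```javascript
// + : one or more, * : zero or more, ? : zero or one
var plus = /^ab+c$/.test("abbbc");     // true
var star = /^ab*c$/.test("ac");        // true
var optional = /^ab?c$/.test("abbc");  // false, "?" allows at most one b

// \d digit, \w word character, \s whitespace (the uppercase forms negate them)
var digits = "page 475 of 17".match(/\d+/g); // [ '475', '17' ]
var fields = "a b\tc".split(/\s/);           // [ 'a', 'b', 'c' ]

// A backslash turns a metacharacter into a literal: \. matches only a dot.
var literalDot = /001\.png/.test("001.png"); // true
var notADot = /001\.png/.test("001xpng");    // false

console.log(plus, star, optional, digits, fields, literalDot, notADot);
```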
  14. Regular Expression (Chinese: 正規表達式)
    A sequence of characters that forms a search pattern, mainly for use in pattern matching with strings…
  15. var re = {
      push: /hl f3">(\d+)</,
      title: /title"><a href="([^"]+)">([^<]+)</,
      date: /"date">([^<]+)</,
      author: /"author">([^<]+)</
    };
    for (i = 0; i < lines.length; i++) {
      ret = re.push.exec(lines[i]);
      if (ret) { push = ret[1]; }
      else …
    }
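Here is a runnable sketch of the same line-by-line extraction with capture groups. The sample lines are hypothetical markup shaped only to fit the patterns on the slide, including the made-up `href`; the real page may differ.

```javascript
// Hypothetical markup, one fragment per line.
var lines = [
  '<span class="hl f3">12</span>',
  '<div class="title"><a href="/bbs/Gossiping/M.123.html">Hello</a></div>',
  '<div class="date">4/24</div>',
  '<div class="author">someone</div>'
];

var re = {
  push: /hl f3">(\d+)</,
  title: /title"><a href="([^"]+)">([^<]+)</,
  date: /"date">([^<]+)</,
  author: /"author">([^<]+)</
};

// Try each pattern in turn; ret[1], ret[2] hold the captured groups.
var post = {};
lines.forEach(function (line) {
  var ret;
  if ((ret = re.push.exec(line))) { post.push = Number(ret[1]); }
  else if ((ret = re.title.exec(line))) { post.href = ret[1]; post.title = ret[2]; }
  else if ((ret = re.date.exec(line))) { post.date = ret[1]; }
  else if ((ret = re.author.exec(line))) { post.author = ret[1]; }
});

console.log(post);
```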
  16. xls → csv → json
    doc → txt → json
    pdf → txt → json
    html → txt → json
    shp → json
    xml → json
    ……
    Always Search for Existing Tools
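To show the shape of the last step in a chain such as xls → csv → json, here is a minimal sketch that turns simple CSV text into an array of objects. It assumes a header row and no quoted or escaped fields; `csvToJson` and the sample data are hypothetical. For real files an existing CSV parser is the safer choice, in line with the slide's advice.

```javascript
// Convert simple CSV text (header row first, no quoted fields) into an array of objects.
function csvToJson(text) {
  var rows = text.trim().split(/\r?\n/).map(function (line) {
    return line.split(",");
  });
  var header = rows[0];
  return rows.slice(1).map(function (cells) {
    var record = {};
    header.forEach(function (name, i) { record[name] = cells[i]; });
    return record;
  });
}

var csv = "name,km,minutes\nA,1.75,34\nB,0.4,7\n";
console.log(JSON.stringify(csvToJson(csv)));
```

All values stay strings here; converting `km` and `minutes` to numbers would be a separate step.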
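  17. Vim: a command-driven text editor
    h j k l - left, down, up, right
    command mode / insert mode
    a o i <ESC> - append, open a new line, insert, leave insert mode
    c d r x - change, delete, replace, delete one character
    number keys - repeat count
    y p - copy, paste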
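  18. 100i2<ESC> - insert 100 copies of "2"
    ddp - swap the current line with the next one
    100dd100jp - move the next 100 lines to after line 200
    d100w - delete 100 words
    qad100wq - record "d100w" under the name "a"
    @a - replay the action recorded as "a"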
  19. Automated text editing: examples
    g0v.tw chat room logs, HTML files ( http://logbot.g0v.tw/ )
    List of Taipei City hillside hiking trails ( http://zbryikt.github.io/visualize/hiking/ ) ( http://www.tcge.taipei.gov.tw/MP_106051.html )
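  20. MRT Guandu Station 1.75K → approx. 34 min Sculpture Park 0.4K → approx. 7 min Stamp Station 0.6K → approx. 16 min Zhongyi Mountain Stamp Station 0.8K → approx. 25 min Xingtian Temple Trailhead

    MRT Zhongyi Station 1.1K → approx. 15 min Xingtian Temple Trailhead 0.8K → approx. 30 min Zhongyi Mountain Stamp Station 1.34K → approx. 43 min Lane 30, Sec. 4, Zhongyang N. Rd.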
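Flat text like this trail listing can be turned into structured records with a single pattern. The sketch below uses a sample string modeled on the first route, with place names rendered in English. It assumes each "distance → time" pair describes the leg between the place before it and the place after it, which is a reading of the slide rather than something it states.

```javascript
// Sample text modeled on the first route of the trail listing.
var text = "MRT Guandu Station 1.75K → approx. 34 min Sculpture Park " +
           "0.4K → approx. 7 min Stamp Station";

// split() keeps captured groups, so the result alternates:
// place, km, minutes, place, km, minutes, place
var parts = text.split(/\s*([\d.]+)K → approx\. (\d+) min\s*/);

var segments = [];
for (var i = 0; i + 3 < parts.length; i += 3) {
  segments.push({
    from: parts[i],
    to: parts[i + 3],
    km: Number(parts[i + 1]),
    minutes: Number(parts[i + 2])
  });
}

console.log(segments);
```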
  21. Summary
    • Regular Expression
    • grep, sed, sort, …
    • xlsjs, xls2csv, …
    • Vim, Sublime Text, Emacs
    • Define rules, repeat operations, build on the work of others
  22. Useful Junk? The Effects of Visual Embellishment on Comprehension and Memorability of Charts
    Scott Bateman, University of Saskatchewan
  23. 7 6 5 2 8 7 6 5 2 8

    7 6 5 2 8 7 6 5 2 8 7 6 5 2 8 7 6 5 2 8 7 6 5 2 8 7 6 5 2 8 7 6 5 2 8 7 6 5 2 8 2 8 7 6 5 2 2 2 8 2
  24. 7 6 5 2 8 7 6 5 2 8

    7 6 5 2 8 7 6 5 2 8 7 6 5 2 8 7 6 5 2 8 7 6 5 2 8 7 6 5 2 8 7 6 5 2 8 7 6 2 8 2 8 7 6 5 2 2 2 8 2
  25. 7 6 5 2 8 7 6 5 2 8

    7 6 5 2 8 7 6 5 2 8 7 6 5 2 8 7 6 5 2 8 7 6 5 2 8 7 6 5 2 8 7 6 5 2 8 7 6 2 8 2 8 6 5 2 2 2 8 2 7
  26. <div></div>
    width: 10px
    height: 10px
    border-bottom: 20px solid #000
    border-left: 20px solid transparent
    border-right: 20px solid transparent
    border-radius: 50%
  27. <path d="....">
    M (x,y)+ - Move to (+ lineto)
    Z - Close Path
    L (x,y)+ - Line to
    C (x1,y1,x2,y2,x,y)+ - Curve to (x1,y1 is ctrl1; x2,y2 is ctrl2)
  28. <path d="...."> (lowercase commands take relative coordinates)
    m (x,y)+ - Move to (+ lineto)
    z - Close Path
    l (x,y)+ - Line to
    c (x1,y1,x2,y2,x,y)+ - Curve to (x1,y1 is ctrl1; x2,y2 is ctrl2)
    q (x1,y1,x,y)+ - Quadratic Bezier
    a (rx,ry,a,b,c,x,y)+ - Elliptical Arc
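As a small illustration of the path commands, the sketch below builds a `d` string from a list of points using the uppercase (absolute) forms M, L and Z. `polygonPath` is a hypothetical helper and the triangle coordinates are arbitrary.

```javascript
// Build an absolute-coordinate path string: M to the first point,
// L to each later point, and an optional Z to close the shape.
function polygonPath(points, close) {
  var d = points.map(function (p, i) {
    return (i === 0 ? "M" : "L") + p[0] + "," + p[1];
  }).join(" ");
  return close ? d + " Z" : d;
}

var triangle = polygonPath([[10, 90], [50, 10], [90, 90]], true);
console.log(triangle); // M10,90 L50,10 L90,90 Z
console.log('<path d="' + triangle + '"/>');
```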
  29. <ellipse rx="1" ry="8">
      <animate attributeName="rx" from="1" to="10" dur="1s"/>
      <animateTransform attributeName="transform" type="rotate" from="0" to="180" dur="0.5s" repeatCount="indefinite"/>
    </ellipse>
  30. Work with HTML

    <html>
      <head> </head>
      <body>
        <h1>Hello World!</h1>
        <svg viewBox="0 0 100 100">
          <circle cx="50" cy="50" r="25"/>
        </svg>
      </body>
    </html>
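  31. Tips and Pitfalls
    * SVG is XML: remember to close tags (</>)
    * Some attributes can be written as CSS
    * Too many features to list here, including:
      * images / patterns / gradients
      * links / groups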
  32. D3JS
    A JS library for manipulating documents based on data.

    <script type="text/javascript" src="d3.v3.min.js"></script>
    <script type="text/javascript">
      // In d3 v3 the callback receives the error first, then the data.
      d3.json("data.json", function(error, data) {
        min = d3.min(data);
      });
    </script>
  33. data, selection, enter, exit

    var sel = d3.selectAll("div").data(data);
    sel.enter().append("div"); // enter: data that has no element yet
    sel.exit().remove();       // exit: elements that have no data
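The enter and exit sets can be worked out by hand. When no key function is given, D3 pairs data with elements by index, so the split depends only on the two lengths. The sketch below reproduces that bookkeeping in plain JavaScript; `joinByIndex` is a hypothetical helper, not a D3 function, and the element names are placeholders.

```javascript
// Split a join into update / enter / exit by index.
function joinByIndex(elements, data) {
  var n = Math.min(elements.length, data.length);
  return {
    update: data.slice(0, n), // data that already has an element
    enter: data.slice(n),     // data still waiting for a new element
    exit: elements.slice(n)   // elements left without data
  };
}

var more = joinByIndex(["div0", "div1"], [7, 6, 5, 2]);
console.log(more.enter); // [ 5, 2 ]
console.log(more.exit);  // []

var fewer = joinByIndex(["div0", "div1", "div2"], [8]);
console.log(fewer.enter); // []
console.log(fewer.exit);  // [ 'div1', 'div2' ]
```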
  34. Pack Layout

    data = { children: [ {value: 1}, {value: 2}, …… ]}
    pack = d3.layout.pack();
    pack.nodes(data);
  35. pack = d3.layout.pack();
    nodes = pack.nodes(data); // array of nodes, each with x, y and r computed

    d3.select("svg")
      .selectAll("circle")
      .data(nodes)
      .enter()
      .append("circle")
      .attr({
        "cx": function(it) { return it.x; },
        "cy": function(it) { return it.y; },
        "r": function(it) { return it.r; }
      });
  36. Summary
    * Purpose of Visualization
    * Various Charts: Pitfalls
    * CSS and SVG Visualization
    * D3JS basics
    * Examples
    * Resources
    * Infographics? That should be a separate course! :p
  37. links
    http://comic.sfacg.com/HTML/Naruto/
    http://campaign-finance.g0v.ctiml.tw/
    http://ptt.cc/bbs/Gossiping/
    http://zbryikt.github.io/visualize/ptt-user/
    https://cris.hpa.gov.tw/
    http://g0v.github.io/nsc-projects/v2.html
    http://g0v.github.io/nsc-projects/index.html
    http://zbryikt.github.io/visualize/kirby/
    http://mbostock.github.io/d3/talk/20111018/collision.html
    http://zbryikt.github.io/visualize/dorling/
    http://bl.ocks.org/zbryikt/raw/4696905/
    http://bl.ocks.org/zbryikt/raw/4248542
    http://zbryikt.github.io/visualize/banana/
    http://zbryikt.github.io/visualize/mrt/
    http://g0v.github.io/cancer/web/
    http://zbryikt.github.io/visualize/crossfilter/
    http://zbryikt.github.io/visualize/jobless/
    http://zbryikt.github.io/visualize/nsc-2/