Data Parsing and Visualization

tkirby
April 24, 2014

Talk for NTU CCSP Course

Transcript

  1. var comicName = "怠蔣敹!";

    var nextVolume = "/HTML/Naruto/476/";
    var preVolume = "/HTML/Naruto/474/";
    var picCount = 17;
    var picAy = new Array();
    var hosts = ["http://hotpic.sfacg.com", "http://hotpic.sfacg.…", "http://ltpic.sfacg.com"];
    picAy[0] = "/Pic/OnlineComic1/Naruto/475/001_1924.png";
    picAy[1] = "/Pic/OnlineComic1/Naruto/475/002_1209.png";
    picAy[2] = "/Pic/OnlineComic1/Naruto/475/003_17512.png";
    picAy[3] = "/Pic/OnlineComic1/Naruto/475/004_13346.png";
    picAy[4] = "/Pic/OnlineComic1/Naruto/475/005_6797.png";
    picAy[5] = "/Pic/OnlineComic1/Naruto/475/006_16563.png";
    picAy[6] = "/Pic/OnlineComic1/Naruto/475/007_4992.png";
    picAy[7] = "/Pic/OnlineComic1/Naruto/475/008_5900.png";
    picAy[8] = "/Pic/OnlineComic1/Naruto/475/009_30082.png";
    picAy[9] = "/Pic/OnlineComic1/Naruto/475/010_18438.png";
    picAy[10] = "/Pic/OnlineComic1/Naruto/475/011_19255.png";
    picAy[11] = "/Pic/OnlineComic1/Naruto/475/012_17436.png";
    picAy[12] = "/Pic/OnlineComic1/Naruto/475/013_14834.png";
    picAy[13] = "/Pic/OnlineComic1/Naruto/475/014_16148.png";
    . . . . . . (rest omitted)
  2. HTML • html: document with tags

    <html>
      <body>
        <img src="blah"/>
        <h1 id="main"> Header </h1>
        <p class="hot">hot!</p>
        <p> this is a book … </p>
        <a href="http://…">back</a>
      </body>
    </html>
  3. CSS Selector

    • html: document with tags
    • tag selector: p, div
    • class selector: .title, .desc
    • id selector: #content, #dialog
    • relations: h1 > a, br ~ p, p div
    • pseudo: a:first-letter, h2:first-line
    • attr: a[href="#"]
  4. cheerio: a jQuery subset for the server side

    npm install cheerio

    var cheerio = require("cheerio");
    var $ = cheerio.load(YourHtmlSnippet);  // parse an HTML string
    var result = $(".r-ent .hl").text();    // combined text of every match
  5. request + cheerio

    npm install cheerio request

    var request = require("request");
    var cheerio = require("cheerio");
    var ptt = "http://ptt.cc/bbs/Gossiping/index.html";

    // e = error, r = response, b = body (the HTML text)
    request(ptt, function(e, r, b) {
      if (e) return;
      var $ = cheerio.load(b);
      var result = $(".r-ent .hl").text();
    });
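  6. request + cheerio

    // ptt(idx) is assumed to return the URL of index page number idx.
    var result = {}, idx = 1;
    function download() {
      request(ptt(idx), function(e, r, b) {
        if (e) return;                  // the loop stops only when a request fails
        var $ = cheerio.load(b);
        result[idx] = $(".r-ent .hl").text();
        idx++;
        setTimeout(download, 0);        // schedule the next page
      });
    }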
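The page-by-page loop above can be tried without touching the network. This is a minimal sketch, not the talk's code: `crawlPages`, `fakeFetch` and `fakePages` are hypothetical names, and `fakeFetch` stands in for `request` with the same callback-last shape. Unlike the slide, it takes an explicit last page so the loop has a stopping condition.

```javascript
// Fetch pages firstPage..lastPage one after another, collecting the bodies.
// fetchPage(idx, callback) is any function that calls callback(err, body).
function crawlPages(fetchPage, firstPage, lastPage, done) {
  const result = {};
  let idx = firstPage;

  function download() {
    if (idx > lastPage) return done(null, result);   // explicit stop condition
    fetchPage(idx, function (err, body) {
      if (err) return done(err, result);             // report what was collected so far
      result[idx] = body;
      idx++;
      download();                                    // next page only after this one finished
    });
  }

  download();
}

// Hypothetical stand-in for request(): "fetches" a page from memory.
const fakePages = { 1: "page one", 2: "page two", 3: "page three" };
function fakeFetch(idx, callback) {
  if (idx in fakePages) callback(null, fakePages[idx]);
  else callback(new Error("no such page: " + idx));
}

let crawled = null;
crawlPages(fakeFetch, 1, 3, function (err, result) {
  crawled = result;
});
console.log(crawled);
```

With a real HTTP client the callback fires later from the event loop, so the direct call to `download()` does not grow the stack; the slide's `setTimeout(download, 0)` reaches the same result by a different route.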
  8. https://cris.hpa.gov.tw/ Online Cancer Registry System (癌症線上登記系統)

    curl 'https://cris.hpa.gov.tw/pagepub/Home.aspx?itemNo=cr.q.40' \
      -H 'Cookie: ASP.NET_SessionId=zdmkqgelubtzswj0majv3aiw; _ga=GA1.3.1312976917.1398148250' \
      -H 'Origin: https://cris.hpa.gov.tw' \
      -H 'Accept-Encoding: gzip,deflate,sdch' \
      -H 'Accept-Language: zh-TW,zh;q=0.8,en-US;q=0.6,en;q=0.4,ja;q=0.2,zh-CN;q=0.2' \
      -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.116 Safari/537.36' \
      -H 'Content-Type: multipart/form-data; boundary=----WebKitFormBoundaryuuaKlSwaG0BKt1db' \
      -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' \
      -H 'Cache-Control: max-age=0' \
      -H 'Referer: https://cris.hpa.gov.tw/pagepub/Home.aspx?itemNo=cr.q.40' \
      -H 'Connection: keep-alive' \
      --data-binary $'------WebKitFormBoundaryuuaKlSwaG0BKt1db\r\nContent-Disposition: form-data; name="__EVENTTARGET"\r\n\r\n\r\n------WebKitFormBoundaryuuaKlSwaG0BKt1db\r\nContent-Disposition: form-data; name="__EVENTARGUMENT"\r\n\r\n\r\n------WebKitFormBoundaryuuaKlSwaG0BKt1db\r\nContent-Disposition: form-data; name="__VIEWSTATE_ID"\r\n\r\n871165\r\n------WebKitFormBoundaryuuaKlSwaG0BKt1db\r\nContent-Disposition: form-data; name="__VIEWSTATE"\r\n\r\n\r\n------WebKitFormBoundaryuuaKlSwaG0BKt1db\r\nContent-Disposition: form-data; name="__EVENTVALIDATION"\r\n\r\nXOBXdAUfZ/+c9MQLjFuAu4ua2oATUYv7Ky61orP3bnuQBqTBl/he3+CGfCqTF6lo5pzEGdQnpmzPocNVZ+EjiTjjj/0tLHu8PFynQ0i1IVvE3EEyJvHguowAYIThKE4kvfrG7A==\r\n------WebKitFormBoundaryuuaKlSwaG0BKt1db\r\nContent-Disposition: form-data;
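  9. phantomjs: a headless WebKit scriptable with a JavaScript API

    var page = require("webpage").create();
    page.open(SomeURL, function() {
      // page.evaluate runs the function inside the loaded page,
      // so it can read the page's own globals such as comicCounterID.
      var comicID = page.evaluate(function() {
        return comicCounterID;
      });
      page.render("snapshot.png");  // save a screenshot of the page
    });

    http://github.com/hcchien/doh-cancer/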
  10. Summary

    * CSS Selector
    * Cheerio + request
    * HTTP Header
      * wget / curl
    * PhantomJS
    * Tools like OpenCV
  11. var comicName = "怠蔣敹!";

    var nextVolume = "/HTML/Naruto/476/";
    var preVolume = "/HTML/Naruto/474/";
    var picCount = 17;
    var picAy = new Array();
    var hosts = ["http://hotpic.sfacg.com", "http://hotpic.sfacg.…", "http://ltpic.sfacg.com"];
    picAy[0] = "/Pic/OnlineComic1/Naruto/475/001_1924.png";
    picAy[1] = "/Pic/OnlineComic1/Naruto/475/002_1209.png";
    picAy[2] = "/Pic/OnlineComic1/Naruto/475/003_17512.png";
    picAy[3] = "/Pic/OnlineComic1/Naruto/475/004_13346.png";
    picAy[4] = "/Pic/OnlineComic1/Naruto/475/005_6797.png";
    picAy[5] = "/Pic/OnlineComic1/Naruto/475/006_16563.png";
    picAy[6] = "/Pic/OnlineComic1/Naruto/475/007_4992.png";
    picAy[7] = "/Pic/OnlineComic1/Naruto/475/008_5900.png";
    picAy[8] = "/Pic/OnlineComic1/Naruto/475/009_30082.png";
    picAy[9] = "/Pic/OnlineComic1/Naruto/475/010_18438.png";
    picAy[10] = "/Pic/OnlineComic1/Naruto/475/011_19255.png";
    picAy[11] = "/Pic/OnlineComic1/Naruto/475/012_17436.png";
    picAy[12] = "/Pic/OnlineComic1/Naruto/475/013_14834.png";
    picAy[13] = "/Pic/OnlineComic1/Naruto/475/014_16148.png";
    . . . . . . (rest omitted)
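One way to pull the image paths out of a script like this is a single global regular expression. The sketch below is an illustration only: `scriptText` is a shortened, hand-made copy of the slide's script, and `extractPicPaths` is a hypothetical helper name.

```javascript
// Shortened sample of the inline script shown on the slide.
const scriptText = [
  'var picCount = 17;',
  'var picAy = new Array();',
  'picAy[0] = "/Pic/OnlineComic1/Naruto/475/001_1924.png";',
  'picAy[1] = "/Pic/OnlineComic1/Naruto/475/002_1209.png";',
  'picAy[2] = "/Pic/OnlineComic1/Naruto/475/003_17512.png";'
].join("\n");

// Collect every `picAy[n] = "path";` assignment into an array indexed by n.
function extractPicPaths(text) {
  // \[ and \] are escaped because brackets are regex metacharacters.
  const pattern = /picAy\[(\d+)\]\s*=\s*"([^"]+)"/g;
  const paths = [];
  let match;
  while ((match = pattern.exec(text)) !== null) {
    paths[Number(match[1])] = match[2];
  }
  return paths;
}

const picPaths = extractPicPaths(scriptText);
console.log(picPaths.length);
console.log(picPaths[0]);
```

Each path can then be joined onto one of the entries in `hosts` to form a full image URL.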
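  12. a.c matches aac aBc a:c a-c a1c a2c ….

    a[a-z]c matches aac abc acc adc aec ….
    ^a[a-z]c$ matches aac abc acc adc aec …. (and nothing before or after)
    a[^A-Z]c matches any middle character except an uppercase letter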
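The four patterns on this slide can be checked against a handful of strings. A small sketch: the candidate list, including the longer string `xabcx` added to show what the anchors do, is made up for illustration.

```javascript
const candidates = ["aac", "aBc", "a:c", "a1c", "abc", "xabcx"];

// Return the candidates in which the pattern matches somewhere.
function matching(pattern) {
  return candidates.filter((s) => pattern.test(s));
}

const anyChar = matching(/a.c/);          // . is any single character
const lowercase = matching(/a[a-z]c/);    // [a-z] is one lowercase letter
const anchored = matching(/^a[a-z]c$/);   // ^ and $ pin the match to the whole string
const notUpper = matching(/a[^A-Z]c/);    // [^A-Z] is anything but an uppercase letter

console.log(anyChar, lowercase, anchored, notUpper);
```

Only the anchored pattern rejects `xabcx`, because the other three are free to match in the middle of a string.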
  13. + * ? . [ ] ^ $ () \w

    \W \d \D \s \S \\ \. \+ \? \* …
  14. Regular Expression (正規表達式)

    a sequence of characters that forms a search pattern, mainly for use in pattern matching with strings…
  15. var re = {

      push:   /hl f3">(\d+)</,
      title:  /title"><a href="([^"]+)">([^<]+)</,
      date:   /"date">([^<]+)</,
      author: /"author">([^<]+)</
    };
    // Test each line against the patterns; capture groups hold the values.
    for (var i = 0; i < lines.length; i++) {
      var ret = re.push.exec(lines[i]);
      if (ret) { push = ret[1]; } else …
    }
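To see the line-by-line matching run end to end, here is a sketch that applies the slide's four patterns to one list entry. The HTML lines are invented to fit the patterns, not copied from the real site, and `parseEntry` is a hypothetical helper name.

```javascript
// The four patterns from the slide, written as regex literals.
const re = {
  push: /hl f3">(\d+)</,
  title: /title"><a href="([^"]+)">([^<]+)</,
  date: /"date">([^<]+)</,
  author: /"author">([^<]+)</
};

// Made-up markup shaped to fit the patterns above.
const lines = [
  '<div class="r-ent">',
  '<div class="nrec"><span class="hl f3">12</span></div>',
  '<div class="title"><a href="/bbs/Gossiping/M.123.A.html">Example post</a></div>',
  '<div class="date">4/24</div>',
  '<div class="author">someone</div>',
  '</div>'
];

// Scan the lines once and keep whatever each pattern captures.
function parseEntry(entryLines) {
  const entry = {};
  for (const line of entryLines) {
    let m;
    if ((m = re.push.exec(line))) entry.push = Number(m[1]);
    else if ((m = re.title.exec(line))) { entry.href = m[1]; entry.title = m[2]; }
    else if ((m = re.date.exec(line))) entry.date = m[1];
    else if ((m = re.author.exec(line))) entry.author = m[1];
  }
  return entry;
}

const entry = parseEntry(lines);
console.log(entry);
```

Patterns this literal break as soon as the markup changes, which is the trade-off against the selector-based approach from the first part of the talk.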
  16. xls → csv → json

    doc → txt → json
    pdf → txt → json
    html → txt → json
    shp → json
    xml → json
    ……
    Always Search for Existing Tools
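The last hop of the first chain, csv → json, is short enough to sketch by hand. This is a deliberately naive version with made-up sample data: it assumes no quoted fields or embedded commas, which is exactly the kind of detail an existing CSV library already handles.

```javascript
// Naive CSV to JSON: first row is the header, every value stays a string.
// Assumes no quoted fields and no commas inside values.
function csvToRecords(csvText) {
  const rows = csvText.trim().split(/\r?\n/).map((line) => line.split(","));
  const header = rows[0];
  return rows.slice(1).map((cells) => {
    const record = {};
    header.forEach((name, i) => {
      record[name] = cells[i];
    });
    return record;
  });
}

// Made-up sample data.
const csv = "name,distance_km\nTrail A,1.75\nTrail B,0.4\n";
const records = csvToRecords(csv);
console.log(JSON.stringify(records, null, 2));
```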
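  17. Vim: a command-driven text editor

    h j k l: left, down, up, right
    command mode / insert mode
    a o i <ESC>: append, open a new line, insert, leave insert mode
    c d r x: change, delete, replace, delete one character
    number keys: repeat count
    y p: copy (yank), paste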
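  18. 100i2<ESC>: insert one hundred 2s

    ddp: swap the current line with the next one
    100dd100jp: move the next 100 lines to after line 200
    d100w: delete 100 words
    qad100wq: record "d100w" as macro "a"
    @a: replay the action just recorded as "a"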
  19. Automated text editing: examples

    g0v.tw chat room log HTML files ( http://logbot.g0v.tw/ )
    List of Taipei City hiking trails ( http://zbryikt.github.io/visualize/hiking/ ) ( http://www.tcge.taipei.gov.tw/MP_106051.html )
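  20. MRT Guandu Station 1.75K → approx. 34 min Sculpture Park 0.4K → approx. 7 min Stamp Stand 0.6K → approx. 16 min Zhongyi Mountain Stamp Stand 0.8K → approx. 25 min Xingtian Temple Trailhead

    MRT Zhongyi Station 1.1K → approx. 15 min Xingtian Temple Trailhead 0.8K → approx. 30 min Zhongyi Mountain Stamp Stand 1.34K → approx. 43 min Lane 30, Sec. 4, Zhongyang N. Rd.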
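Trail text like this has a regular shape (place, distance, time, place, and so on), so one split on the distance-and-time pattern turns it into structured legs. A sketch under assumptions: `route` is an English rendering of the first route on the slide, and `parseRoute` is a hypothetical helper name.

```javascript
// English rendering of the first route on the slide.
const route =
  "MRT Guandu Station 1.75K → approx. 34 min Sculpture Park " +
  "0.4K → approx. 7 min Stamp Stand 0.6K → approx. 16 min " +
  "Zhongyi Mountain Stamp Stand 0.8K → approx. 25 min Xingtian Temple Trailhead";

// Split on "<km>K → approx. <n> min". Because the pattern has capture groups,
// split() keeps the captured km and minutes between the place names.
function parseRoute(text) {
  const parts = text.trim().split(/\s+([\d.]+)K → approx\. (\d+) min\s+/);
  const legs = [];
  for (let i = 0; i + 3 < parts.length; i += 3) {
    legs.push({
      from: parts[i],
      to: parts[i + 3],
      km: Number(parts[i + 1]),
      minutes: Number(parts[i + 2])
    });
  }
  return legs;
}

const legs = parseRoute(route);
const totalMinutes = legs.reduce((sum, leg) => sum + leg.minutes, 0);
const totalKm = legs.reduce((sum, leg) => sum + leg.km, 0);
console.log(legs.length, totalMinutes, totalKm.toFixed(2));
```

The same pattern works as a Vim substitution or a sed expression; the point is that the rule is written once and applied to every line.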
  21. Summary

    • Regular Expression
    • grep, sed, sort, …
    • xlsjs, xls2csv, …
    • vim, Sublime Text, Emacs
    • Define rules, repeat operations, build on the work of others
  22. Useful Junk? The Effects of Visual Embellishment on Comprehension and Memorability of Charts

    Scott Bateman, University of Saskatchewan
  23. 7 6 5 2 8 7 6 5 2 8

    7 6 5 2 8 7 6 5 2 8 7 6 5 2 8 7 6 5 2 8 7 6 5 2 8 7 6 5 2 8 7 6 5 2 8 7 6 5 2 8 2 8 7 6 5 2 2 2 8 2
  24. 7 6 5 2 8 7 6 5 2 8

    7 6 5 2 8 7 6 5 2 8 7 6 5 2 8 7 6 5 2 8 7 6 5 2 8 7 6 5 2 8 7 6 5 2 8 7 6 2 8 2 8 7 6 5 2 2 2 8 2
  25. 7 6 5 2 8 7 6 5 2 8

    7 6 5 2 8 7 6 5 2 8 7 6 5 2 8 7 6 5 2 8 7 6 5 2 8 7 6 5 2 8 7 6 5 2 8 7 6 2 8 2 8 6 5 2 2 2 8 2 7
  26. <div></div>

    width: 10px;
    height: 10px;
    border-bottom: 20px solid #000;
    border-left: 20px solid transparent;
    border-right: 20px solid transparent;
    border-radius: 50%;
  27. <path d="....">

    M (x,y)+               - Move to (+ lineto)
    Z                      - Close Path
    L (x,y)+               - Line to
    C (x1,y1,x2,y2,x,y)+   - Curve to (ctrl1 = x1,y1; ctrl2 = x2,y2)
  28. <path d="...."> (lowercase commands take relative coordinates)

    m (x,y)+               - Move to (+ lineto)
    z                      - Close Path
    l (x,y)+               - Line to
    c (x1,y1,x2,y2,x,y)+   - Curve to (ctrl1 = x1,y1; ctrl2 = x2,y2)
    q (x1,y1,x,y)+         - Quadratic Bezier
    a (rx,ry,a,b,c,x,y)+   - Elliptical Arc
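As a quick illustration of the M, L and Z commands, the sketch below builds the `d` attribute for a closed polygon from a list of points. `polygonPath` is a hypothetical helper and the triangle coordinates are arbitrary.

```javascript
// Build a path string: "M" for the first point, "L" for the rest, "Z" to close.
function polygonPath(points) {
  const commands = points.map(([x, y], i) => (i === 0 ? "M" : "L") + " " + x + " " + y);
  return commands.join(" ") + " Z";
}

const triangle = polygonPath([[10, 10], [90, 10], [50, 80]]);
const svg = '<svg viewBox="0 0 100 100"><path d="' + triangle + '"/></svg>';

console.log(triangle);
console.log(svg);
```

Uppercase commands take absolute coordinates; swapping in lowercase `l` would make each point an offset from the previous one.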
  29. <ellipse rx="1" ry="8">

      <animate attributeName="rx" from="1" to="10" dur="1s" repeatCount="indefinite"/>
      <animateTransform attributeName="transform" type="rotate" from="0" to="180" dur="0.5s" repeatCount="indefinite"/>
    </ellipse>
  30. Work with HTML

    <html>
      <head> </head>
      <body>
        <h1>Hello World!</h1>
        <svg viewBox="0 0 100 100">
          <circle cx="50" cy="50" r="25"/>
        </svg>
      </body>
    </html>
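  31. Tips and Pitfalls

    * SVG is XML: remember to close every tag (</>)
    * Some attributes can also be written as CSS
    * Too many features to list them all, including
      * images / patterns / gradients
      * links / groups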
  32. D3JS: a js library for manipulating documents based on data

    <script type="text/javascript" src="d3.v3.min.js"></script>
    <script type="text/javascript">
      // In D3 v3 the callback receives (error, data).
      d3.json("data.json", function(error, data) {
        if (error) return;
        var min = d3.min(data);
      });
    </script>
  33. data, selection, enter, exit

    var sel = d3.selectAll("div").data(data);  // join the data to the selection
    sel.enter().append("div");                 // data items with no element yet
    sel.exit().remove();                       // elements with no data item left
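The enter / update / exit split can be modeled in a few lines without D3. This is a simplified illustration, not D3's implementation: `joinByIndex` is a hypothetical function that pairs elements and data by position, which is how a data join behaves when no key function is given.

```javascript
// Pair existing elements with data by index and report the three groups.
function joinByIndex(elements, data) {
  const shared = Math.min(elements.length, data.length);
  return {
    // elements that already have a matching datum
    update: data.slice(0, shared).map((d, i) => ({ element: elements[i], datum: d })),
    // data with no element yet: these need to be appended
    enter: data.slice(shared),
    // elements with no datum left: these get removed
    exit: elements.slice(shared)
  };
}

const growing = joinByIndex(["div0", "div1"], [10, 20, 30]);
const shrinking = joinByIndex(["div0", "div1", "div2"], [10]);

console.log(growing.enter, shrinking.exit);
```

With more data than elements the extra items land in `enter`; with fewer, the leftover elements land in `exit`.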
  34. Pack Layout

    data = { children: [ {value: 1}, {value: 2}, …… ]}
    pack = d3.layout.pack();
    nodes = pack.nodes(data);  // D3 v3: returns the nodes with x, y and r filled in
  35. pack = d3.layout.pack();

    nodes = pack.nodes(data);  // one node per circle, with x, y and r computed

    d3.select("svg")
      .selectAll("circle")
      .data(nodes)
      .enter()
      .append("circle")
      .attr({
        "cx": function(it) { return it.x; },
        "cy": function(it) { return it.y; },
        "r":  function(it) { return it.r; }
      });
  36. Summary

    * Purpose of Visualization
    * Various Charts: Pitfalls
    * CSS and SVG Visualization
    * D3JS basics
    * Examples
    * Resources
    * Infographics? That should be a separate course! :p
  37. links

    http://comic.sfacg.com/HTML/Naruto/
    http://campaign-finance.g0v.ctiml.tw/
    http://ptt.cc/bbs/Gossiping/
    http://zbryikt.github.io/visualize/ptt-user/
    https://cris.hpa.gov.tw/
    http://g0v.github.io/nsc-projects/v2.html
    http://g0v.github.io/nsc-projects/index.html
    http://zbryikt.github.io/visualize/kirby/
    http://mbostock.github.io/d3/talk/20111018/collision.html
    http://zbryikt.github.io/visualize/dorling/
    http://bl.ocks.org/zbryikt/raw/4696905/
    http://bl.ocks.org/zbryikt/raw/4248542
    http://zbryikt.github.io/visualize/banana/
    http://zbryikt.github.io/visualize/mrt/
    http://g0v.github.io/cancer/web/
    http://zbryikt.github.io/visualize/crossfilter/
    http://zbryikt.github.io/visualize/jobless/
    http://zbryikt.github.io/visualize/nsc-2/