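From Big Data to Artificial Intelligence

To realize the value of big data, do not overlook machine learning and artificial intelligence
Sheng-Wei Chen (陳昇瑋)
Chairman, Taiwan Data Science Association
Research Fellow, Institute of Information Science, Academia Sinica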

Sheng-Wei Chen / From Big Data to Artificial Intelligence http://ds.sinica.edu.tw/

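Academia Sinica AI Month events

Institute of Information Science, Academia Sinica
Staff: 40 research fellows, 30 postdoctoral researchers, 300 research assistants
Research areas: algorithms, data science, intelligent agents, speech processing, Chinese language cognition, multimedia, bioinformatics, systems technology, machine learning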

Data Insights Research Lab

Area 1: Quality of Experience
Using emotion measurement techniques to predict the success or failure of online games [1]
[1] Jing-Kai Lou, Kuan-Ta Chen, Hwai-Jung Hsu, and Chin-Laung Lei, Forecasting Online Game Addictiveness, IEEE/ACM NetGames 2012.

Area 2: Multimedia Systems

Area 3: Computational Social Science
“The emerging intersection of the social and computational sciences, an intersection that includes analysis of web-scale observational data, virtual lab–style experiments, and computational modeling” [1].
[1] Duncan J. Watts, Computational Social Science: Exciting Progress and Future Directions, Frontiers of Engineering, Winter 2013.

The road of data analysis
Since 2002 (my first PhD year) …
PhD dissertation: based on a 20-hour game packet trace
Collaboration & consulting: manufacturing; telecommunications; social networks / games; banking / life insurance / electronic ticketing; central / local government

OK, on to the main topic

Times are changing

Evolving Sciences
• Thousand years ago: science was empirical, describing natural phenomena
• Last few hundred years: a theoretical branch, using models and generalizations
• Last few decades: a computational branch, simulating complex phenomena

The Fourth Paradigm: Data-driven science
Scientists are overwhelmed with datasets from different sources:
• Data captured by instruments
• Data generated by simulations
• Data collected by sensor networks
New methodologies are needed to deal with the data

What’s Data Science?

Definition of “Science”
Science is a systematic enterprise that builds and organizes knowledge in the form of general, measurable and verifiable explanations and predictions about the universe. In modern usage, "science" most often refers to a way of pursuing knowledge, not only to the knowledge itself. Over the course of the 19th century, the word "science" became increasingly associated with the scientific method itself.

(Photo credit: Brian Harrington Spier)

3 Major Trends in Data Science
• Big Data
• Deep Learning
• Beyond Prediction

3V Explained

Why Big Data?
• Massive number of Internet users (generating data)
• Collecting and storing data is much cheaper now
• New types of sensors, widely deployed
• Advances in machine learning (esp. for analyzing unstructured data)

Google patent covers using vehicle sensors to detect road quality

SurroundSense: Mobile Phone Localization via Ambience Fingerprinting

See through walls with WiFi! Applies to 8” concrete walls, 6” hollow walls, and 1.75” solid wooden doors.

Computer Vision Matters: Safety, Health, Security, Comfort, Access, Fun (Slide Credit: Jia-Bin Huang)

Food Recognition

Computer vision in sports: SportVision, improving viewer experiences (Slide Credit: Jia-Bin Huang)

Computer vision in sports: player tracking (Slide Credit: Jia-Bin Huang)

Computer vision in sports: Second Spectrum, visual analytics (Slide Credit: Jia-Bin Huang)

Computer vision in sports: Replay Technologies, improving viewer experiences (Slide Credit: Jia-Bin Huang)

Computer vision for healthcare: video magnification
https://www.youtube.com/watch?v=QbXgEbeceJI (Credit: Jia-Bin Huang)

#2 DEEP LEARNING

“Deep Learning” search trend: May 2015, May 2016, May 2017

Machine Learning
A type of algorithm that gives computers the ability to learn from data, rather than being explicitly programmed.
It seems impossible to write a program for speech recognition: try to find the common patterns in four waveforms that all say “你好” (“hello”). You quickly get lost in the exceptions and special cases. (Slide Credit: Hung-Yi Lee)

Let the machine learn by itself
Given a large amount of audio data labeled with what was said, such as “你好” (“hello”), “大家好” (“hello, everyone”), and “人帥真好” (“it's great to be handsome”), the machine derives rules from the dataset and can then answer: You said “你好”. You only have to write the learning algorithm ONCE. (Slide Credit: Hung-Yi Lee)

Patterns learned by machine

Multi-layer patterns learned from faces

Edge & Blob (like receptive fields in V1 neurons) http://vision03.csail.mit.edu/cnn_art/data/single_layer.png

Texture http://vision03.csail.mit.edu/cnn_art/data/single_layer.png

Object Parts http://vision03.csail.mit.edu/cnn_art/data/single_layer.png

Object Classes http://vision03.csail.mit.edu/cnn_art/data/single_layer.png

http://technews.tw/2017/05/05/updating-google-maps-with-deep-learning-and-street-view/

Deep learning can be highly flexible
• Speech Recognition: $f(\text{audio}) = $ “Morning”
• Handwritten Recognition: $f(\text{image}) = $ “2”
• Playing Go: $f(\text{board}) = $ “5-5” (step)
• Dialogue System: $f(\text{“Hi”}) = $ “Hello” (what the user said → system response)
(Slide Credit: Hung-Yi Lee)

Human Brains (Slide Credit: Hung-Yi Lee)

An Artificial Neuron
Each neuron is a function. It takes inputs $x_1, x_2, \ldots, x_N$ with weights $w_1, w_2, \ldots, w_N$ and a bias $b$, and computes
$z = w_1 x_1 + w_2 x_2 + \cdots + w_N x_N + b$
$a = \sigma(z)$
where the activation function $\sigma$ is the sigmoid function $\sigma(z) = \frac{1}{1 + e^{-z}}$. (Slide Credit: Hung-Yi Lee)
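The neuron on this slide can be sketched in a few lines of Python. This is a minimal illustration only: the input values, weights, and bias below are hypothetical numbers chosen for the example, not values from the talk.

```python
import math


def sigmoid(z: float) -> float:
    """Sigmoid activation: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(x, w, b):
    """Weighted sum of the inputs plus the bias, passed through the sigmoid."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)


# Hypothetical inputs, weights, and bias (illustration only).
# z = 1*1 + (-2)*(-1) + 1 = 4, so the activation is sigmoid(4).
a = neuron(x=[1.0, -1.0], w=[1.0, -2.0], b=1.0)
print(round(a, 3))
```

Changing the weights or the bias changes which function the neuron computes, which is what training adjusts.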

Artificial Neural Network: a network of connected neurons, each computing $\sigma(z)$ (Slide Credit: Hung-Yi Lee)

Fully Connected Feedforward Network
Input layer ($x_1, x_2, \ldots, x_N$) → hidden layers (Layer 1, Layer 2, …, Layer L, each made of neurons) → output layer ($y_1, y_2, \ldots, y_M$)
Deep means many hidden layers. (Slide Credit: Hung-Yi Lee)
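A forward pass through such a network is just the neuron computation repeated layer by layer. The sketch below assumes sigmoid activations everywhere and uses a hypothetical 2-2-1 network; the layer sizes and all weights are made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights, biases):
    """One fully connected layer: each row of `weights` feeds one neuron."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def feedforward(x, layers):
    """Pass x through every (weights, biases) pair in order."""
    for weights, biases in layers:
        x = layer(x, weights, biases)
    return x


# Hypothetical network: 2 inputs -> 2 hidden neurons -> 1 output neuron.
network = [
    ([[1.0, -2.0], [-1.0, 1.0]], [1.0, 0.0]),  # hidden layer
    ([[2.0, -1.0]], [0.0]),                    # output layer
]
y = feedforward([1.0, -1.0], network)
print(y)
```

Adding more `(weights, biases)` pairs to `network` makes it deeper without changing `feedforward`.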

Visual Question Answering, source: http://visualqa.org/ (Slide Credit: Hung-Yi Lee)

Word Embedding

Word Vector, source: http://www.slideshare.net/hustwj/cikm-keynotenov2014 (Slide Credit: Hung-Yi Lee)

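Word Vector Characteristics
$V(\text{hotter}) - V(\text{hot}) \approx V(\text{bigger}) - V(\text{big})$
$V(\text{Rome}) - V(\text{Italy}) \approx V(\text{Berlin}) - V(\text{Germany})$
$V(\text{king}) - V(\text{queen}) \approx V(\text{uncle}) - V(\text{aunt})$
Solving analogies: Rome : Italy = Berlin : ?
Compute $V(\text{Berlin}) - V(\text{Rome}) + V(\text{Italy})$
Find the word $w$ with the closest $V(w)$ (Slide Credit: Hung-Yi Lee)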
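The analogy procedure above can be sketched as follows. The three-dimensional vectors are hypothetical toy values built so that the arithmetic works out exactly; real word vectors are learned from text and have hundreds of dimensions. Cosine similarity is assumed here as the measure of "closest".

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Solve a : b = c : ? by finding the word closest to V(c) - V(a) + V(b)."""
    target = [vc - va + vb
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Hypothetical toy embedding: (is a capital, Italy-ness, Germany-ness).
toy_vectors = {
    "Rome": [1.0, 1.0, 0.0],
    "Italy": [0.0, 1.0, 0.0],
    "Berlin": [1.0, 0.0, 1.0],
    "Germany": [0.0, 0.0, 1.0],
    "Paris": [1.0, 0.5, 0.5],
}
answer = analogy("Rome", "Italy", "Berlin", toy_vectors)
print(answer)  # Germany
```

The query words themselves are excluded from the candidates, which is a common convention when evaluating analogies.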

Machine Reading
The machine learns the meaning of words by reading a lot of documents without supervision.
The machine learns to understand netizens by reading the posts on PTT. (Slide Credit: Hung-Yi Lee)

If you want to "deeply learn deep learning"
• “Neural Networks and Deep Learning”, written by Michael Nielsen: http://neuralnetworksanddeeplearning.com/
• “Deep Learning”, written by Yoshua Bengio, Ian J. Goodfellow and Aaron Courville: http://www.iro.umontreal.ca/~bengioy/dlbook/
• Course: “Machine learning and having it deep and structured”: http://speech.ee.ntu.edu.tw/~tlkagk/courses_MLSD15_2.html
(Slide Credit: Hung-Yi Lee)

#3 BEYOND PREDICTION MODELING

Time series forecasting

Shape clustering using DTW (dynamic time warping)
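For readers unfamiliar with DTW, the distance it computes can be sketched with the classic dynamic-programming recurrence. This is a generic textbook version using absolute difference as the local cost, not necessarily the variant used in the talk; the sample series are hypothetical.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # repeat b[j-1]
                                 cost[i][j - 1],      # repeat a[i-1]
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]


# Hypothetical series: the same peak shape, shifted by one step, vs. a flat line.
peak = [0, 1, 2, 1, 0]
shifted_peak = [0, 0, 1, 2, 1, 0]
flat = [1, 1, 1, 1, 1]
print(dtw_distance(peak, shifted_peak), dtw_distance(peak, flat))
```

Because warping absorbs the time shift, the two peaks come out as identical in shape while the flat line does not, which is why DTW distances are a natural input to shape clustering.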

Generative Adversarial Networks
Some useful applications:
• Dimension reduction
• Simulating possible futures for reinforcement learning
• Can be trained with missing data
• Multi-modal outputs

CIFAR-10: Which one is machine-generated? https://openai.com/blog/generative-models/ (Slide Credit: Hung-Yi Lee)

Anime Girl Face Generation
Source of images: https://zhuanlan.zhihu.com/p/24767059
DCGAN: https://github.com/carpedm20/DCGAN-tensorflow (Slide Credit: Hung-Yi Lee)

Anime Girl Face Generation: 100 rounds (Slide Credit: Hung-Yi Lee)

Anime Girl Face Generation: 1000 rounds (Slide Credit: Hung-Yi Lee)

Anime Girl Face Generation: 50,000 rounds (Slide Credit: Hung-Yi Lee)

WGAN – Poem Generation
NN Generator v1 → Discriminator v1 → NN Generator v2 → Discriminator v2 → NN Generator v3 → Discriminator v3
Real poems: 床前明月光，疑似地上霜，舉頭望明月，低頭思故鄉。
Generator outputs: 哈哈哈哈哈… 低頭吃便當… 春眠不覺曉…
(Slide Credit: Hung-Yi Lee)

Experimental results provided by student 李仲翊:
• 升雲白遲丹齋取，此酒新巷市入頭。黃道故海歸中後，不驚入得韻子門。
• 據口容章蕃翎翎，邦貸無遊隔將毬。外蕭曾臺遶出畧，此計推上呂天夢。
• 新來寳伎泉，手雪泓臺蓑。曾子花路魏，不謀散薦船。
• 功持牧度機邈爭，不躚官嬉牧涼散。不迎白旅今掩冬，盡蘸金祇可停。 • 玉十洪沄爭春風，溪子風佛挺橫鞋。盤盤稅焰先花齋，誰過飄鶴一丞幢。 • 海人依野庇，為阻例沉迴。座花不佐樹，弟闌十名儂。 • 入維當興日世瀕，不評皺。頭醉空其杯，駸園凋送頭。 • 鉢笙動春枝，寶叅潔長知。官爲宻爛去，絆粒薛一靜。 • 吾涼腕不楚，縱先待旅知。楚人縱酒待，一蔓飄聖猜。 • 折幕故癘應韻子，徑頭霜瓊老徑徑。尚錯春鏘熊悽梅，去吹依能九將香。 • 通可矯目鷃須浄，丹迤挈花一抵嫖。外子當目中前醒，迎日幽筆鈎弧前。 • 庭愛四樹人庭好，無衣服仍繡秋州。更怯風流欲鴂雲，帛陽舊據畆婷儻。 Randomly generated (Slide Credit: Hung-Yi Lee) WGAN – Poem Generation

Conditional GAN – Text to Image: "red flower with black center" (Slide Credit: Hung-Yi Lee)

Conditional GAN – Text to Image
Experimental results provided by student 曾柏翔: "Black hair, blue eyes"; "Blue hair, green eyes"; "Red hair, long hair" (Slide Credit: Hung-Yi Lee)

Image-to-image Translation
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, Alexei A. Efros, “Image-to-Image Translation with Conditional Adversarial Networks”, arXiv preprint, 2016 (Slide Credit: Hung-Yi Lee)

Interactive Image Translation with pix2pix-tensorflow https://affinelayer.com/pixsrv/

Image to Image Translation: CycleGAN

CycleGAN: Horse <-> Zebra

Auto Coloring https://paintschainer.preferred.tech/index_zh.html

Types of Machine Learning Methods
• Supervised: task driven (regression / classification)
• Unsupervised: data driven (clustering)
• Reinforcement: learning by reacting to feedback

Why Supervised Learning is Not Enough
"The brain has about 10^14 synapses and we only live for about 10^9 seconds. So we have a lot more parameters than data. This motivates the idea that we must do a lot of unsupervised learning since the perceptual input (including proprioception) is the only place we can get 10^5 dimensions of constraint per second." (Geoffrey Hinton, https://www.reddit.com/r/MachineLearning/comments/2lmo0l/ama_geoffrey_hinton/)
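The arithmetic behind the quote is a single division: spreading roughly $10^{14}$ parameters over roughly $10^{9}$ seconds of life gives the number of constraints that would have to arrive every second.

```latex
\[
\frac{10^{14}\ \text{synapses}}{10^{9}\ \text{seconds}}
  = 10^{5}\ \text{constraints needed per second}
\]
```

Labels alone cannot plausibly supply information at that rate, which is the motivation the quote gives for unsupervised learning from raw perceptual input.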

Learning to play Go: Supervised vs. Reinforcement (Slide Credit: Hung-Yi Lee)

Approaches To Reinforcement Learning
• Policy-based RL: search directly for the optimal policy, i.e. the policy achieving maximum future reward
• Value-based RL: estimate the optimal value function, i.e. the maximum value achievable under any policy
• Model-based RL: build a transition model of the environment and plan (e.g. by lookahead) using the model
Of course you can combine any of the above.
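As a concrete illustration of combining these approaches, the sketch below runs value iteration on a hypothetical four-state chain: it uses a known transition model (model-based) to estimate the optimal value function (value-based), then reads off a greedy policy. The environment, reward, and discount factor are all assumptions made up for this example.

```python
GAMMA = 0.9          # discount factor (assumed)
N_STATES = 4         # states 0..3; state 3 is the terminal goal
ACTIONS = (-1, +1)   # move left or move right


def step(state, action):
    """Deterministic transition model: returns (next_state, reward)."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward


def value_iteration(sweeps=50):
    """Repeatedly back up each state's value from its best action."""
    values = [0.0] * N_STATES
    for _ in range(sweeps):
        for s in range(N_STATES - 1):  # the terminal state keeps value 0
            values[s] = max(r + GAMMA * values[n]
                            for n, r in (step(s, a) for a in ACTIONS))
    return values


def greedy_policy(values):
    """Pick, in each non-terminal state, the action with the best backup."""
    policy = []
    for s in range(N_STATES - 1):
        def backup(a):
            n, r = step(s, a)
            return r + GAMMA * values[n]
        policy.append(max(ACTIONS, key=backup))
    return policy


values = value_iteration()
policy = greedy_policy(values)
print(values, policy)
```

The values fall off by a factor of `GAMMA` per step away from the goal, and the greedy policy always moves right.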

Typical Applications of RL
• Play games: Atari, poker, Go, ...
• Explore worlds: 3D worlds, Labyrinth, ...
• Control physical systems: manipulate, walk, swim, ...
• Interact with users: recommend, optimize, personalize, ...
(Slide credit: David Silver)

DeepMind A3C https://youtu.be/M40rN7afngY?t=21

DeepMind A3C https://youtu.be/0xo1Ldx3L5Q?t=22
SL: Mimic excellent drivers
RL: Learn from failures

More RL Applications
• Flying a helicopter
• Driving
• Google Cuts Its Giant Electricity Bill With DeepMind-Powered AI
• Parameter tuning in manufacturing lines
• Text generation
  Hongyu Guo, “Generating Text with Deep Reinforcement Learning”, NIPS, 2015
  Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, Wojciech Zaremba, “Sequence Level Training with Recurrent Neural Networks”, ICLR, 2016
(Slide Credit: Hung-Yi Lee)

Reinforcement Learning Resources
• Textbook: Reinforcement Learning: An Introduction https://webdocs.cs.ualberta.ca/~sutton/book/the-book.html
• Lectures of David Silver: http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html (10 lectures, 1:30 each); http://videolectures.net/rldm2015_silver_reinforcement_learning/ (Deep Reinforcement Learning)
• Lectures of John Schulman: https://youtu.be/aUrX
(Slide Credit: Hung-Yi Lee)

Common Myths

MYTH 1. DATA SCIENCE = BIG DATA

Data Science vs. Big Data
Data Science is a superset of Big Data. However, the rise of Big Data draws people’s attention to Data Science.
(Diagram labels: Data Science, Big Data, Machine Learning, Data Mining, Deep Learning)

Data Science is More Than …
• Big data (small data works too)
• Statistics / machine learning
• Data analysis languages (e.g., R, Python)
• Data infrastructure (e.g., NoSQL, Hadoop, Spark)
• Data visualization

MYTH 2. DATA VISUALIZATION = DATA ANALYTICS

Data Visualization vs. Data Analysis
Visualization is the act or process of interpreting in visual terms or of putting into visible form.
Analysis indicates a careful study of something to learn about its parts, what they do, and how they are related to each other.
(Merriam-Webster's Dictionary)

(Source: Harvard Business Review, May 2017 issue)

MYTH 3. AI IS INDEPENDENT FROM DATA SCIENCE

Big data vs. Machine learning vs. AI
• Big data: 3Vs
• Machine learning: “A type of algorithm that gives computers the ability to learn from data, rather than being explicitly programmed.”
• Artificial intelligence: Turing test

AI is a Product of Data Analytics
(Diagram labels: Data Science, Big Data, Machine Learning, Data Mining, Deep Learning)

MYTH 4. IT TEAM IS RESPONSIBLE FOR DATA SCIENCE/ANALYTICS

(Diagram labels: Computer Science, Statistical Skills, Domain Expertise; Data Engineer, Data Analyst, Data Scientist)

MYTH 5. DATA ANALYTICS CAN BE DONE BY USING XXX PLATFORM

Descriptive Analytics, Diagnostic Analytics, Predictive Analytics, Prescriptive Analytics (Slide Credit: Jyun-Yu Jiang)

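Online book-buying big data: insights for publishers

Three levels
• Descriptive analysis: reader profile analysis (basic profile and purchasing power; differential reader profile analysis; alternative book rankings)
• Explanatory analysis: book sales performance analysis (product attributes and presentation; title keywords)
• Predictive analysis: book sales performance prediction (building and interpreting prediction models)

Night-owl readers prefer same-sex romance novels

Before age 25, readers buy career-planning books; after 30, they buy books on getting rich

As income rises, more readers buy books on extramarital affairs/divorce and on family/parenting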

Men's favorites: 把妹達人 / 正妹心理學 / 心理學家的專業把妹術 / 搭訕聖經 / 正妹沒告訴你的事
Women's favorites: 寫給女人的生命啟動書 / 法國女人：寫給女人的30天愛自己計劃 / 下一站，幸福：女孩的必修12堂課 / 真愛絕非運氣，被愛是種實力！：女人受益一生的12堂幸福課 / 貼心的女人，幸福無敵：改變男人的賀爾蒙，從貼心做起

Men's favorites / women's favorites (travel): 台南80攤：徐天麟帶你吃遍道地台南美食 / 漫漫首爾 / 澎湖＋金門玩全攻略 / 慢遊濟州島：不走尋常路的祕境風景 / 台灣單車環島遊 / 在英國遇見小狐狸 / 台灣．用騎的最美 ~和她騎出屬於自己的單車故事 / 首爾日歸小旅行 / 鐵道‧祕境：30座魅力小站╳5種經典樂趣，看見最浪漫的台灣鐵道故事 / 歐巴，我來了！BIGBANG、EXO、SHINee、Super Junior等韓國6大人氣男團99個首爾追星蹲點x撞星美食全攻略

Budget-conscious readers' favorites / wealthy readers' favorites (travel): 超實用澳洲打工度假武林祕笈 / 西雅圖 / 東台灣小日子 / 普羅旺斯，慢慢走 / 用心遊台灣 / 華盛頓DC自助超簡單 / 首爾打工度假：從申辦、住宿到當地找工作、遊玩的第一手資訊 / 東京五星級魚食名店：日本名美食家岸朝子精選84家私料理 / 30歲前都能實現的哈日遊學夢：日本打工度假全攻略 / 早安巴黎。午安倫敦：歐洲之星雙城記

General public's favorites / intellectuals' favorites (language learning): 3分鐘立即說越南語 / 2015－2017 iBT托福閱讀題庫 / GEPT全民英檢初級]閱讀測驗-最新增訂版 / 活用學術字彙：跨出論文寫作的第一步 / 不學可惜的韓語單字書 / GMAT字彙紅寶書 / 如何活用日常英文單字？ / TOEFL iBT聽說讀寫Barron’s最新第13版 / 就要這樣學 KK音標 / TOEFL iBT階段式托福寫作

General public's favorites / intellectuals' favorites (medical and health care): 中藥材養生密碼 / 失智怎麼伴？24位名人陪伴失智親人的故事 / 穴位按摩圖典 / 宇宙健康法：莊淑旂的養生智慧 / 實用青草藥圖鑑 / 提昇癌症治療效果的遠紅外線溫熱療法 / 從頭到腳推拿技巧 / 台灣心臟外科第一人：洪啟仁的生命故事 / 圖解偏方祕方大全 / 瑜珈拉筋解剖痠痛拉筋解剖書

Pan-Green readers' favorites / Pan-Blue readers' favorites (social science): 小英的故事：蔡英文的翻轉人生攻略 / 中國，從天下到民族國家 / 曾國藩兵法 / 我們人民：憲法根基 / 犯罪心理學新論 / 十個詞彙裡的中國 / 災難管理與社會工作實務手冊 / 圖解中國「十三五規劃」建議 / 一直同在Together & Forever：我們和小英一起走過的旅程 / 兩岸最前線：從海陸大戰到海陸休兵

Early birds' favorites / night owls' favorites (religion and fortune-telling): 我的菩提路第二輯 / 2016唐立淇星座運勢大解析+2016占星文本 / 南懷瑾談歷史與人生 / 塔羅占卜全書 / 生命終點的盼望：生命與死亡的藝術 / 萊德．偉特塔羅牌 / 禪修地圖 / 天使與惡魔人性相談室 / 禪是喝茶吃飯：千年禪宗教你不煩惱的生活智慧 / 心理占星學全書

放棄的力量 / 正面迎對，思考的力量 / 決斷2秒間：擷取關鍵資訊，發揮不假思索的力量 / 道歉：比你想像中的力量還要大 / 腦內革命：驚人的潛意識力量

閩南諺語的處世智慧 / 汲取名人智慧：閱讀人生，改變生命的大好機會 / 聽南懷瑾講淡定智慧 / 有一種智慧叫以退為進 / 魯蛇翻身做董事：躍升人生勝利組不可不知的32個街頭智慧

優氧：改善微循環，優化身體氧氣，增強自癒力 / 我也曾憂鬱：一位精神科醫師用心靈自癒力，不吃藥改善頭痛、失眠、憂鬱。 / 阿茲海默症有救了！椰子油生酮體，改善大腦退化的救星 / 改善脖子僵硬，身體90%的疼痛都會消失：醫學博士教你躺五分鐘即可見效的「脖子矯正法」 / 養腰活腿，身體就輕鬆：關鍵穴位、飲食、運動有效改善36種痠痛的中醫自療書

樂齡養生抗老智慧 / 流傳千年的本草養生智慧 / 五臟養生飲食祕方 / 入世養生 / 長命養生．健康食療術

輕鬆說英語 / 輕鬆搞定三語情境單字 / 英文速成班：文法30天輕鬆搞定 / 太神奇了MEGA日語輕鬆學 / 30堂漫畫成語課：外國人也能輕鬆開口說 / 輕鬆投資股票期貨 / 輕鬆快樂開家咖啡店 / 省小錢輕鬆存下100萬 / 賣夢想，輕鬆提升100%業績 / 為什麼出布容易贏？從球賽、股市到選擇題，在未知中輕鬆致勝的22個預測練習

練習坐，找到心 / 佛陀陪你練習不生氣 / 青春，飛翔練習：看世界的角度決定你的高度 / 佛陀教你強心術：有效改變人生的練習 / 不反應的練習：消除煩惱，清理內心的思考法 / 韻文發音練習本 / 日語50音練習字貼 / 朗文國中英語會考聽力練習 / 科技英文閱讀練習完全攻略 / 英檢初級文法及練習109：國中文法大全

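Book sales performance prediction model
Inputs: market conditions before release; title keywords; book and product presentation features (700+ factors)
Prediction model building. The prediction currently does not use the text content of the books.

Product attributes and presentation: title, whether the book is a translation, discount ratio, number of preview images, cover image

Content emphasis markup

Product page synopsis: detecting synonyms of recommendation words, "moving" (emotional) words, and so on
Word types: recommendation words, moving words, authority words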

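Decoding a bestseller: literary fiction
蟑螂哲學 (category: literary fiction > romance novels; publisher: 城邦原創; author: 菌菌; list price: NT$240; bestseller index: 98.55%)
Feature | Median | This book | Change in bestseller index
Share of literary fiction in the publisher's output | 0.82 | 0.74 | 16.36%
Relative average sales of same second-level-category books over the past four weeks | 1.00 | 1.67 | 8.73%
Author bio length (characters) | 132 | 213 | 6.91%
Proportion of digits in the author bio | 0 | 0.0093 | 6.18%
Mean hue in the cover ROI | 0.056 | 0.003 | 5.45%
Publishing diversity of the publisher | 0.80 | 0.83 | 4.73%
Mean saturation in the cover ROI | 0.0017 | 0.0001 | 3.64%
Share of light grey in the greyscale cover (grey intensity histogram 3) | 0.25 | 0.08 | 3.27%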

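Decoding a bestseller: psychology and self-help
我想傾聽你：懂得傾聽，學會不過度涉入，讓我們用更自在的陪伴豐富彼此 (categories: psychology and self-help > gender and family relationships > family/parent-child relationships; parenting > everyday parenting; publisher: 遠流; author: 洪仲清; list price: NT$300; bestseller index: 94.33%)
Feature | Median | This book | Change in bestseller index
Title length (characters) | 13 | 30 | 7.18%
Percentile rank of list price among same top-level-category books over the past four weeks | 0.58 | 0.99 | -5.67%
Proportion of question marks in the preface | 0.0003 | 0.0024 | 5.33%
Synopsis length (characters) | 634 | 438 | -4.67%
Has an OKAPI link | No | Yes | 4.67%
Author bio length (characters) | 254 | 595 | 4.33%
Price per page | 1.13 | 0.89 | -3.33%
Percentile rank of list price among same top-level-category books over the past week | 0.52 | 0.99 | -2.67%
Depth of field of the cover saturation image | 0.34 | 0.53 | 2.67%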

Data Science in Taiwan

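http://twconf.data-sci.org/ https://www.facebook.com/twdsconf
2 days x 800 people, 8/30 – 8/31, 2014

Data scientists x 17

4 days x 1300 people, 8/20 – 8/23, 2015

Data scientists x 24

2015/11/14 (Sat)

Introduction to Machine Learning, 2015/12/12 (Sat)

4 days x 1718 people

Courses:
• The First Lesson in Data Science: Mindset, Case Studies, and Team Building
• From Data to Knowledge: Data Mining from Scratch
• Data Engineering and Mining in Practice with R
• Understanding Deep Learning in One Day
• Hands-on Data Analysis in Practice with R

Data science experts x 50

Industry, government, and academia matchmaking and networking sessions x 10: talent matchmaking; like-minded networking; social network analysis; e-commerce, retail, and online marketing; data visualization; information security; health and medical care; education big data; finance; transportation for future cities; artificial intelligence / machine learning / deep learning; open data and personal data protection

1,718 attendees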

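Co-organizers: Research Center for Information Technology Innovation, Academia Sinica; Institute of Statistical Science, Academia Sinica; Department of Information Technology Services, Academia Sinica (中央研究院資訊服務處); Computational Intelligence Technology Center, Industrial Technology Research Institute (工業技術研究院巨量資訊科技中心); Data Analytics Technology & Applications Research Institute, Institute for Information Industry (資訊工業策進會數據科技與應用研究所); Committee on Data for Science and Technology (CODATA), Taiwan committee (國際科學與技術資料委員會中華民國委員會); Sudo Recruit; Chinese Institute of Probability and Statistics (中華機率統計學會); FINDIT; Association for Computational Linguistics and Chinese Language Processing (中華民國計算語言學學會); 台大智慧聯網創新研究中心 (National Taiwan University); National Center for High-performance Computing (國家高速網路與計算中心)

Taiwan Data Science Conference event series

What problems can data analysis solve?

A First Look at Computational Social Science: When Computer Scientists Meet Social Science
Sheng-Wei Chen (陳昇瑋), Chairman, Taiwan Data Science Association; Research Fellow, Institute of Information Science, Academia Sinica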

PEOPLE

The Favorite Major for US College Athletes (Source: USA Today, http://usatoday30.usatoday.com/sports/college/2008-11-18-majors-graphic_N.htm)

Social Science

Social Life is Hard to See
We can interview friends, but we cannot interview a friendship
• Fleeting interaction
• In private
• Tedious to record over time, especially in large groups

Bigger Problems
Social phenomena involve many individuals interacting to produce collective entities: firms, markets, cultures, political parties, social movements, audiences
• “Micro-Macro” problem (aka “Emergence”)
Micro-macro problems are hard to study empirically
• Difficult to collect observational data about individuals, networks, and populations at the same time
• Even more difficult to do “macro” scale experiments

1890 US Census
First time Hollerith machines were used to tabulate US Census data (population: 62,947,714)

The Era of Big Data
Past: government data, national survey data
Today: a variety of new data sources
• Economic data: trade, finance, e-cash / e-wallet, ...
• GIS data: satellite, GPS loggers, laser scanning cars, …
• Sensor data: video surveillance, smart phones, wearable devices, mobile apps, beacons, …

New kinds of data

Computer vision for healthcare: video magnification
https://www.youtube.com/watch?v=QbXgEbeceJI (Credit: Jia-Bin Huang)

Engagement and Exploration
• Standing face-to-face?
• Physical distance
• Hand gesture, posture
• Conversation patterns
• Frequency of interruptions

Web as a Record of Social Interaction
• Public: web pages / discussions; Twitter, Facebook, blogs, news groups, wikis, MMOGs, Instagram, LastFM, Flickr, Spotify
• Private: email, WhatsApp, LINE, Slack
• Text, images, sounds: speeches, commercials

Computational Social Science
The science that investigates social phenomena through the medium of computing and statistical data processing.

Instrument-enabled scientific disciplines
• microbiology: microscope
• radio astronomy: radar
• nanoscience: electron microscope

Technical Challenges
Computational infrastructures for dealing with
• More data: analyzing large amounts of data
• Fuzzy data: cleaning up imprecise and noisy data
• New kinds of data: processing real-time sensor streams and web data
Need for new substantive ideas
Need for new statistical methods (WHY in addition to WHAT and HOW)

3 Common Approaches
#1 Macroscope
#2 Virtual Lab
#3 Empirical Modeling

WE ARE WHAT WE SAY (Linguistics; approach: Macroscope)
Schwartz, H. Andrew, et al. "Personality, gender, and age in the language of social media: The open-vocabulary approach." PloS one 8.9 (2013): e73791.

Dataset
• 700 million words, phrases, and topic instances collected from 75,000 volunteers’ FB posts
• Recorded users’ personality (5-factor), gender, and age

What Words Do You Use? (male vs. female)

How Old Are You? (#1): ages 13 - 18 and 19 - 22

How Old Are You? (#2): ages 23 - 29 and 30 - 65

Personality Traits: Extraversion vs. Introversion

Topics Across 4 Age-groups

Warm and Negative Words

Usage of “I” & “We”
Huge-volume data + simple analysis → crystal clear language use patterns

Scaling up the Lab
Social science experiments have been heavily constrained by scale and speed
• Unit of analysis was individuals or small groups
• Experiments took months to design and run
Potentially, “virtual labs” lift both constraints
• State of the art is ~5,000 workers, but in principle one could construct a subject panel of ~100K – 1M
• Could shrink the hypothesis-testing cycle to days or hours

MOOD CONTAGION (& MANIPULATION) ON FACEBOOK (Social Psychology; approach: Virtual Lab)
Kramer, Adam DI, Jamie E. Guillory, and Jeffrey T. Hancock. "Experimental evidence of massive-scale emotional contagion through social networks." Proceedings of the National Academy of Sciences 111.24 (2014): 8788-8790.

Facebook Mood Contagion
• 0.7 million (~ 0.04%) users on Facebook
• 3 million posts manipulated in one week
• Hide some “positive” or “negative” emotional posts from users (in the experimental group)

Observations (conditions: negative posts hidden; positive posts hidden)
• People who see more positive posts tend to post more positively, and vice versa.
• Facebook users’ emotions can be easily manipulated by changing ALGORITHMS

Ethical Issues (!)
• Unethical experiment because it was conducted without users’ consent
• Serious invasion of users’ perceptions about their friend circles (and society)
• Well, Facebook's data use policy states that users' information will be used "for internal operations, including troubleshooting, data analysis, testing, research and service improvement," meaning that any user can become a lab rat.

FACEBOOK “I VOTED” BUTTON (Social Psychology & Politics; approach: Virtual Lab)
Bond, Robert M., et al. "A 61-million-person experiment in social influence and political mobilization." Nature 489.7415 (2012): 295-298.

“I Voted” Button
Direct messages to 61 million users on FB
• Informational message: received by 1% of users
• Social message: received by 98% of users
• Control group: 1% (no message received)

Effect of Manipulation
Ratio of friends who voted vs. probability that the user claimed to have voted

2% more likely to click the “I voted” button, 0.3% more likely to seek information about a polling place, and 0.4% more likely to head to the polls.

Real-world Consequence (!)
In total, turnout rose by about 60,000 votes directly and by an estimated 280,000 votes indirectly (out of 61 million users)
What if Facebook did not randomize the control/experimental groups?

Empirical Modeling
Traditional mathematical or computational modeling
• Tends to rely on many, often unrealistic, assumptions
• Not generally tested in detail against data
• Result is a proliferation of models that exist in parallel and are often incompatible with each other
New sources/scales of data make it possible both to learn/test models and to calibrate them
Observations → Models → Lab → Field → Observations

PREDICTION OF COUNTY-LEVEL HEART DISEASE MORTALITY

Datasets
Heart disease: arteriosclerotic heart disease mortality rates during 2009 - 2010
Predictors:
• 826 million tweets collected between June 2009 and March 2010
• Socioeconomic (income and education)
• Demographic (percentages of Black, Hispanic, married, and female residents)
• Health status (diabetes, obesity, smoking, and hypertension)

Prediction Accuracy

Language Use in Tweets

Social media opens up a new window on what humans actually feel and think

YOU ARE WHAT YOU LIKE (Social Psychology; approach: Empirical Modeling)
Kosinski, Michal, David Stillwell, and Thore Graepel. "Private traits and attributes are predictable from digital records of human behavior." Proceedings of the National Academy of Sciences 110.15 (2013): 5802-5805.

Personality Prediction: predicted traits and attributes
• Gender, age, relationship status, # friends
• Sexual orientation, ethnicity, religion, political inclination
• Addictive substances (alcohol, drugs, cigarettes), parental separation
• IQ, 5-Factor model, satisfaction with life

Data Collection
9,939,220 Likes (55,814 unique ones) from 58,466 Facebook volunteers
Categories: sports, music, books, restaurants, popular websites

Ground truth
• Political inclination: Democrat (Democratic, Democratic Party) vs. Republican (GOP (Grand Old Party), Republican Party), coded 1 / 0
• Sexual orientation: homosexual vs. heterosexual, coded 1 / 0

Ground truth: 5-Factor Model (Openness, Conscientiousness, Extraversion, Agreeableness, Stability)

Ground truth: Satisfaction with Life (SWL)

Methodology
• User-Like matrix dimension reduction: Singular Value Decomposition (SVD)
• Prediction models: Logistic Regression & Linear Regression
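The two-step pipeline named above (reduce the user-Like matrix, then fit a regression on the reduced features) can be sketched on toy data. Everything below is an assumption made for illustration: the 6 x 4 matrix and labels are hypothetical, the columns are mean-centered first, only the leading component is kept, and it is found by power iteration instead of a full SVD routine. The study itself kept many more components.

```python
import math


def leading_component_scores(rows, iters=200):
    """Project mean-centered rows onto the leading right singular vector,
    found by power iteration on A^T A."""
    n_cols = len(rows[0])
    means = [sum(r[j] for r in rows) / len(rows) for j in range(n_cols)]
    centered = [[r[j] - means[j] for j in range(n_cols)] for r in rows]
    v = [float(j + 1) for j in range(n_cols)]  # arbitrary non-zero start
    for _ in range(iters):
        scores = [sum(a * b for a, b in zip(row, v)) for row in centered]
        v = [sum(centered[i][j] * scores[i] for i in range(len(rows)))
             for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return [sum(a * b for a, b in zip(row, v)) for row in centered]


def fit_logistic(xs, ys, lr=0.5, epochs=500):
    """One-feature logistic regression trained by batch gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x
            grad_b += p - y
        w -= lr * grad_w / len(xs)
        b -= lr * grad_b / len(xs)
    return w, b


# Hypothetical user-Like matrix: rows are users, columns are pages (1 = liked).
likes = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 1],
]
labels = [1, 1, 1, 0, 0, 0]  # hypothetical binary trait

scores = leading_component_scores(likes)
w, b = fit_logistic(scores, labels)
predictions = [1 if w * x + b > 0 else 0 for x in scores]
print(predictions)
```

Reducing thousands of sparse Like columns to a handful of dense components is what makes a simple regression model workable on this kind of data; for a numeric trait the logistic step would be swapped for linear regression.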

Prediction Results
Solid: Pearson correlation coefficient between predicted and actual values
Transparent: baseline accuracy of the questionnaire, in terms of test-retest reliability

Discriminative Likes (#1)

Computer vs. Humans
Correlating participants’ scores with judgments made by humans and computer models. (Example Likes: meditation, TED)

Discriminative Likes by gender (male / female): 小茉莉-陳瑀希 Catworld小舖 Garena《英雄聯盟 LOL》 EYESCREAM Inc.
波多野結衣HatanoYui LOVFEE 解婕翎 OB嚴選豆花妹蔡黃汝 grace gift Nono_辜莞允 BEVY C. 張景嵐 Joyceshopstyle FHM Taiwan 男人幫國際中文版 QUEEN FASHION SHOP 潮物blog - 街頭潮流男著 SweeSa水莎張小筑Ya Chu Lulus

Discriminative Likes by body shape (heavier / slimmer): 一休陪你一起愛瘦身《HITO 本舖》 iFit 愛瘦身潮物部落格
小甜甜張可昀 PAZZO FB減肥達人輕鬆教你瘦 Image 美樂蒂 Melody 《 OneBoy 》 BEMAX UNO STORE Woma RockSteady OB嚴選 SweeSa水莎鍾欣凌高高-流行服飾Store 杜詩梅Tu Shih Mei Maxy

Discriminative Likes by height, women (taller / shorter): 瑞秋空姐教室王子邱勝翊航空資訊站最新考訊及航空動態
東京著衣 Janet Hsieh 謝怡芬衣芙日系空姐瘋林彥君 151 H&M 窈窕比例學院凱渥 CatWalk Chu me 日系精品服飾空姐報報Emily Post 終極x宿舍 Choies.com 陳子玄 ZARA 三立藝能中心 Rima 瑞瑪席丹唯舞獨尊(臉書版)_首款社群平台音樂遊戲

Discriminative Likes by income, age ≥ 40 (higher / lower): 商業周刊李亮瑾 Andy老爹
連靜雯joanne lien 背包客棧綜藝大集合 citiesocial 旗山天后宮 relux 連靜雯專屬後援會台灣賓士授權經銷商-中華賓士三條崙海清宮閻羅天子包公祖廟李開復 Kai-Fu Lee 楊丞琳 RainieYang Mobile01 郭靜 Claire Mercedes-Benz Taiwan 台灣賓士九族文化村天下雜誌寶島神很大

Discriminative Likes by personality, "family-oriented" (selected / not selected): 巨蟹座 06/22~07/22 Duncan 方文琳
Cherng 東森氣象主播王淑麗 Byebyechuchu 新聞主播陳海茵 H.H先生雨揚樂活家族谷阿莫 AmoGood 北港朝天宮 R-chord ☆巨蟹座★ Undine 魏華萱 Dorothy 巨蟹座男生 Joyceshopstyle 連靜雯專屬後援會 Lu's

Discriminative Likes by sociability (enjoys socializing / enjoys being alone): 柯震東 Kai Ko 音速語言學習(日語) 黑人陳建州
PanSci 科學新聞網 Futun World 卡卡洛普★宅宅新聞玖壹壹王可樂的日語教室羅志祥 SHOW 國家地理雜誌 Look Happy 博客來 Mimi Dancing Club 哈日劇敖小犬小敖 Lailai & Chichi 頑童MJ116 辛卡米克 Gon Word nagee Can we have real privacy on social media? Unprecedented opportunity to observe individuals in a society

How can data analysis help us understand donors better?

x 3,518 in 10.5 years (since May 2003)

AppleDaily Charity Case Dataset
3000+ cases along with detailed descriptions and donation records

Distribution of donation amounts (per case family)

DATA COLLECTION

Crawling http://search.appledaily.com.tw/charity/projlist/

Web page parsing

# donors

# donors w/ linear fitting

Adjusted Time Series

ANNOTATION

Online manual coding platform

http://bountyworkers.net/

Manual coding results: 431 coders, 6,532 coding sessions, 255 hours, 8,436 family members, 1,590 cases

Sample Annotations

Variables we got (290+)

Methodology
• Predict # donors and donation amount
• Feature selection based on mutual information
• Using libsvm to do 2-class classification
• Classifying the top 25% and bottom 25% of cases, removing the middle 50%
• 10-fold cross validation
• Find out significant factors that determine the dependent variable(s)
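The data preparation in this methodology (keep only the extreme quartiles, then split into ten folds) can be sketched as below. This is an illustration under assumptions: the case records are hypothetical `(features, outcome)` pairs, the folds are assigned round-robin without shuffling, and the classifier itself is left out, so no libsvm call is shown.

```python
def top_and_bottom_quartiles(cases):
    """Label the top 25% of cases 1 and the bottom 25% 0 by outcome,
    dropping the middle 50%."""
    ranked = sorted(cases, key=lambda case: case[1])
    q = len(ranked) // 4
    bottom = [(features, 0) for features, _ in ranked[:q]]
    top = [(features, 1) for features, _ in ranked[len(ranked) - q:]]
    return bottom + top


def k_fold_indices(n, k=10):
    """Yield (train_indices, test_indices) for k roughly equal folds."""
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        yield train, test


# Hypothetical cases: (feature tuple, number of donors).
cases = [((i,), float(i)) for i in range(40)]
labeled = top_and_bottom_quartiles(cases)
folds = list(k_fold_indices(len(labeled), k=10))
print(len(labeled), len(folds))
```

Dropping the middle half turns a noisy regression target into a cleaner two-class problem, at the cost of saying nothing about average cases. In practice the labeled cases would be shuffled before folding.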

Factor Categories: Subject, Structure, Finance, Member, Presentation, Meta

Factor – Members (categories: Subject & Member)
• Age, gender, marital status
• Disability, disease, accident, habit, status

Factor – Structure (category: Structure)
• Count and ratio of particular types of family members
• Relationships between members

Factor – Finance (category: Finance)
• Is the family below the poverty line?
• Regular income & expense

Factor – Presentation
• Currently, only title and images are evaluated
• Subjective ratings from human subjects

Title & picture rating http://mmnet.iis.sinica.edu.tw/~cslin/rating/welcome.php

Factor – Meta Information
• Information unrelated to the family & its situation
• E.g., the article writer and when the article was published

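Willingness to donate is highly correlated with timing

The day of the week matters (Sun, Mon, Tue, Wed, Thu, Fri)

The month matters too (January through December)

The interviewee's body shape (heavier or slimmer) affects donation decisions

Who receives more donations?

Donors treat different diseases and disabilities differently

Causes beyond the family's control draw more sympathy

Accident, unemployment, divorce, imprisonment, human-caused accident, dropping out of school

Donations are inversely related to fixed expenses (axes: a case family's fixed expenses vs. donation amount)

Donors expect to see "hope"

CASE STUDY

Successful Case

Less Successful Case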

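TEXT MINING APPROACH

Introduction to C-LIWC
• Developed from James Pennebaker's LIWC (Linguistic Inquiry and Word Count)
• Compiled by psychology teams at 台科大 (National Taiwan University of Science and Technology) and 台大 (National Taiwan University), who added and removed categories and words to fit the characteristics of Chinese
• 88 categories in total, 6,862 words and word stems
• Linguistic features and writing style reflect personal traits to some degree and influence how readers feel
• This text analysis method is increasingly widely used in psychology research, e.g., apology and forgiveness, lie detection, language change during therapy, psychological displacement
• C-LIWC website: http://cliwc.weebly.com/

Chinese Linguistic Inquiry and Word Count dictionary (C-LIWC)

Family words, death words, health words
Correlation: family, death, and health words are broadly positively correlated with donations
Inference: cases whose topic matches traditional values attract donations more easily
(r, p-value) | Family words | Death words | Health words
log(total donations) | r=0.148, p=0.000 | r=0.101, p=0.000 | r=0.056, p=0.026
Number of donors | r=0.131, p=0.000 | r=0.113, p=0.000 | r=0.058, p=0.021
Average donation per donor | r=0.129, p=0.000 | r=0.084, p=0.001 | r=0.007, p=0.771
Examples | 母親、婆婆、阿公、家屬、堂妹、繼父、雙親 (mother, mother-in-law, grandfather, family members, cousin, stepfather, parents) | 火化、死者、自殺、告別式、往生、致死 (cremation, the deceased, suicide, funeral service, passing away, fatal) | 中風、糖尿病、結石、住院、安眠藥 (stroke, diabetes, stones, hospitalization, sleeping pills)

Total word count of the article
Correlation: the total word count is positively correlated with donations
Inference: the more detailed the description of the case, the easier it is to raise money
(r, p-value) | Total word count
log(total donations) | r=0.101, p=0.000
Number of donors | r=0.056, p=0.027
Average donation per donor | r=0.143, p=0.000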
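The r values in these tables are Pearson correlation coefficients. A minimal sketch of how such a coefficient is computed follows; the word counts and donation totals are hypothetical numbers invented for the example, and the p-values reported on the slides are not reproduced here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical articles: total word count and total donations received.
word_counts = [300, 450, 500, 620, 800]
donations = [120000, 90000, 200000, 180000, 260000]

# Donation totals are log-transformed, as in the first table row.
r = pearson_r(word_counts, [math.log(d) for d in donations])
print(round(r, 3))
```

Taking the logarithm of the donation total keeps a few very large cases from dominating the coefficient.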

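Work words, achievement words, money words
Correlation: work, achievement, and money words are broadly negatively correlated with donations
Inference: work-related topics raise money less easily
(r, p-value) | Work words | Achievement words | Money words
log(total donations) | r=-0.079, p=0.002 | r=-0.064, p=0.011 | r=-0.072, p=0.004
Number of donors | r=-0.099, p=0.000 | r=-0.085, p=0.000 | r=-0.025, p=0.319
Average donation per donor | r=-0.022, p=0.380 | r=-0.020, p=0.001 | r=-0.101, p=0.000
Examples | 勞工、契約、付費、裁員、生意、員工、職業 (laborer, contract, payment, layoff, business, employee, occupation) | 升遷、職權、權威、嘉獎、能幹、高層、榮耀 (promotion, official authority, authority, commendation, capable, senior management, glory) | 帳戶、租金、商店、現金、消費、捐贈 (account, rent, shop, cash, spending, donation)

Others
Negation words
• Examples: 不滿、不幸、不能、無關、不料、不須 (dissatisfied, unfortunate, cannot, unrelated, unexpectedly, need not)
• Correlation: negatively correlated with average donation per donor (r=-0.063, p=0.013)
• Inference: positive descriptions work better
Adverbs
• Examples: 真的、終於、確實、一定、一向、不管、全然 (really, finally, indeed, certainly, always, regardless, entirely)
• Correlation: negatively correlated with average donation per donor (r=-0.084, p=0.001)
• Inference: a plain description is enough; exaggeration or excess verbiage tends to backfire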

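ONGOING WORK

Only with sufficient information can you donate smartly. (Taiwan Data Science Association)

Smart Donor information platform (聰明公益資訊平台)
Aims to solve the problem of fragmented and opaque information http://www.smartdonor.tw/

Taiwan has more than two thousand social welfare organizations. How many do you know?

Search and filter criteria

Map view

Analysis features

NPO information overview (1/4)

NPO information overview (2/4)

NPO information overview (3/4)

NPO information overview (4/4)

Crowd participation (1/2)
Wikipedia model: after logging in with a Facebook or Google account, anyone can edit any information about any NPO.

Crowd participation (2/2)
But there is no need to worry: all edit history is kept, so if someone makes trouble or maliciously enters false information, it can be reported, and an administrator then restores the correct version.

NPO information editing (1/2)

NPO information editing (2/2)

Quantifying information transparency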

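Information transparency: weights and calculation
The key to computing information transparency is estimating the weight of each piece of information. We adopt the common IDF (Inverse Document Frequency) principle: the more common a piece of information, the lower its weight; the rarer, the higher.
A field that more NPOs fill in is easier to obtain or provide, so its weight is low; a field that fewer NPOs provide costs more to obtain and is usually more valuable, so its weight is high. For example:
• Founding date: provided by 100% of NPOs, weight 1.0
• Total registered assets: provided by 64% of NPOs, weight 4.19
• Public donation-disclosure lookup (公開徵信查詢): provided by only 5% of NPOs, weight 14.91
http://www.smartdonor.tw/transparency.php

Suppose there are $N$ NPOs in total and a field $f$ is filled in by $n(f)$ of them. The base weight of field $f$ is $\sqrt{N / n(f)}$; the base weights are then normalized so that the weights of all fields sum to 100, which gives the final weights. For example, the platform currently has 2404 NPOs, of which 121 provide a "public donation-disclosure lookup" link, so the base weight of that field is $\sqrt{2404 / 121}$; after normalization, its weight is 14.91. The square root narrows the differences between field weights so that the score is not decided by a few important fields. The weights are not fixed: as NPOs fill in more data on the platform, the weights are adjusted continuously. If one day every NPO provided a public donation-disclosure lookup, the weight of that field would become 1.0.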
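The weighting rule described above can be sketched in a few lines. Only N = 2404 and the count of 121 come from the text; the field names and the other two counts are hypothetical, so the normalized numbers printed here will not match the platform's published weights, which are computed over all of its fields.

```python
import math


def transparency_weights(n_total, filled_counts):
    """Base weight sqrt(N / n(f)) per field, normalized to sum to 100."""
    base = {field: math.sqrt(n_total / count)
            for field, count in filled_counts.items()}
    scale = 100.0 / sum(base.values())
    return {field: weight * scale for field, weight in base.items()}


# 2404 and 121 are from the text; the field names and other counts are
# hypothetical stand-ins for illustration.
counts = {
    "founding_date": 2404,
    "registered_assets": 1539,
    "public_disclosure_lookup": 121,
}
weights = transparency_weights(2404, counts)
for field, weight in weights.items():
    print(field, round(weight, 2))
```

Because of the normalization step, a field's final weight depends on how many other fields exist and how often they are filled in, not only on its own count.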

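http://smartdonor.tw/npo.php?npo=1034

Our vision
From the donor's perspective:
• All NPO information at a glance
• Searchable, sortable, comparable, and analyzable
• No need to dig through each NPO's website; all the information is on one page
• Become a smart donor
From the charitable organization's perspective:
• Let potential donors see its efforts
• Let large NPOs present their results quantitatively
• Give small NPOs a better chance of being seen. Even with limited staff, a small NPO can let well-meaning members of the public help maintain its public information.

Acknowledgements

Q&A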

446 Large-scale Facebook Fan Page Network Analysis

452 CONCLUSION & OUTLOOK

453 WE ARE STILL AT THE VERY START

454 LOTS of Big Questions
The polarization of global economic inequality
What explains the success of social movements?
The emergence of pro-social behavior
The causality of video gaming and propensity for violence?
The politics of censorship
The causality of social selection and social influence? …

455 The Data Divide
Social scientists have good questions but… IT tools are not part of their toolkits; it is not clear that they will or should make the investment.
Computer scientists have powerful methods but… they are trained to resolve technical problems; there seem to be fewer “methodological” contributions.

456 The Challenges Education and habits of social and computer
scientists Different ways of thinking Different methodologies Differences in framing questions and defining contributions Data access and fragmentation issue Data privacy issue Ethics issue Organizational issue

458 Institutional Innovations New platforms and protocols for data management
Better coordination of data collection, storage, sharing Recruitment and management of subject pools, field panels Integrated research designs Coordination across theoretical, experimental and observational studies Collaborative interdisciplinary teams For a given data set, often unclear what the most interesting question is For a given question, often unclear how to collect the right data

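Q&A

When Academic Researchers Meet Online Games (當學術研究者遇見線上遊戲). Sheng-Wei Chen (陳昇瑋), Institute of Information Science, Academia Sinica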

陳昇瑋 / 當學術研究者遇見線上遊戲 462 US$ 42 billion US$ 35 billion
US$ 63 billion Video games Movie Music US$ 27 billion Book http://vgsales.wikia.com/wiki/Video_game_industry Entertainment Market Size (worldwide) No. 1 No. 2 No. 3 No. 4


陳昇瑋 / 當學術研究者遇見線上遊戲 467 Game Research: My Own Reasons As
A PC Gamer … As A Programmer … As A Researcher …

As A PC Gamer (1) 1988 1989 1990 1991

As A PC Gamer (2) 1990 1992 1993 1998

As A Programmer (1)
Age 10: wrote a football game in ROM BASIC
Junior high school: wrote a fighting game with dBASE & Pascal
Senior high school: wrote an RPG with C & Assembly
Richard Garriott 1980

My Role Model in 1990

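As A Programmer (2)
1999 – 2002: taught training courses at the Institute for Information Industry (資策會) (C/C++, Winsock Programming, Delphi, C++Builder), slipping game design material into the classes
1999 – 2001: columnist for《遊戲設計大師》
2000: published《Delphi 深度歷險》
2002: published《C++Builder 深度歷險》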

陳昇瑋 / 當學術研究者遇見線上遊戲 476 As A Researcher A killer application
35% Internet users & larger business than movie & music An emerging field E.g., IEEE Transactions on AI and CI in Games since Sep 2008 Asia-based researchers have some niches Large user base (50%) Lots of local game companies It’s fun!

陳昇瑋 / 當學術研究者遇見線上遊戲 479 Security Topics Game Bot Detection

480 Game Bots
Game bots: automated AI programs that can perform certain tasks in place of gamers. Popular in MMORPG and FPS games.
MMORPGs (Role Playing Games): accumulate rewards 24 hours a day → break the balance of power and economies in game
FPS games (First-Person Shooting Games): a) improve aiming accuracy only, b) fully automated → achieve high ranking without proficient skills and efforts

陳昇瑋 / 當學術研究者遇見線上遊戲 481 Bot Detection Detecting whether a character
is controlled by a bot is difficult since a bot obeys the game rules perfectly No general detection methods are available today State of practice is identifying via human intelligence Detect by “bots may show regular patterns or peculiar behavior” Confirm by “bots cannot talk like humans” Labor-intensive and may annoy innocent players

陳昇瑋 / 當學術研究者遇見線上遊戲 482 CAPTCHA in a Japanese Online Game
(Completely Automated Public Turing test to tell Computers and Humans Apart)

陳昇瑋 / 當學術研究者遇見線上遊戲 483 Our Goal of Bot Detection Solutions
Passive detection  No intrusion in players’ gaming experience No client software support is required Generalizable schemes (for other games and other game genres)

陳昇瑋 / 當學術研究者遇見線上遊戲 484 Our Solution I: Traffic Analysis Game
client Game server Traffic stream Q: Whether a bot is controlling a game client given the traffic stream it generates? A: Yes or No

陳昇瑋 / 當學術研究者遇見線上遊戲 485 Case Study: Ragnarok Online (Figure courtesy
of www.Ragnarok.co.kr)

陳昇瑋 / 當學術研究者遇見線上遊戲 486 DreamRO -- A screen shot World
Map View scope Character Status

487 Trace Collection
Category        Tr#   ID                      Avg. period   Avg. pkt rate     Network
Human players   8     A, B, C, D              2.6 hr        1.0 / 3.2 pkt/s   ADSL, Cable Modem, Campus Network
Bots            11    K (Kore), R (DreamRO)   17 hr         1.0 / 2.2 pkt/s
207 hours and 3.8 million packets were traced in total.
Heterogeneity in player skills and network conditions:
Category        Participants            Client pkt rate   Avg. RTT      Avg. loss rate
Human players   2 rookies, 2 experts    0.8 ~ 1.2 pkt/s   45 ~ 192 ms   0.01% ~ 1.73%
Bots            2 bots                  0.5 ~ 1.7 pkt/s   33 ~ 97 ms    0.004% ~ 0.2%

488 Command Timing
Client response time (response time): the time difference between a client packet's departure time and the most recent server packet's arrival time.
Observation: bots often issue their commands based on arrivals of server packets, which carry the latest status of the character and environment (state update → command after a certain time t).
We expect the following patterns: a large number of small response times (bots respond to server packets immediately), and regularity in response times.
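The response-time definition above translates directly into code. The sketch below assumes packet timestamps are available as plain numbers in seconds; the function names and the small-response threshold are hypothetical, not taken from the study.

```python
import bisect

def client_response_times(server_arrivals, client_departures):
    """For each client packet, time elapsed since the most recent server packet arrival."""
    server_arrivals = sorted(server_arrivals)
    response_times = []
    for departure in sorted(client_departures):
        # Index of the first server arrival strictly after this departure.
        i = bisect.bisect_right(server_arrivals, departure)
        if i == 0:
            continue  # no server packet has arrived yet; response time undefined
        response_times.append(departure - server_arrivals[i - 1])
    return response_times

def fraction_below(response_times, threshold):
    """Share of response times smaller than a threshold (in seconds)."""
    if not response_times:
        return 0.0
    return sum(rt < threshold for rt in response_times) / len(response_times)

# Toy timestamps in seconds.
times = client_response_times([0.0, 1.0, 2.0], [0.25, 1.5, 2.0])
small_share = fraction_below(times, 0.3)
```

A trace in which most response times fall below a very small threshold would match the first expected pattern; regularity would show up as peaks in the histogram of `times`.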

陳昇瑋 / 當學術研究者遇見線上遊戲 489 CDF of Client Response Times Kore:
Zigzag pattern (multiples of a certain value) DreamRO: > 50% response times are very small

陳昇瑋 / 當學術研究者遇見線上遊戲 490 Histograms of Response Times 1 ms
multiple peaks 1 ms multiple peaks

491 Periodograms of Histograms of Response times Player 1 Player
2

陳昇瑋 / 當學術研究者遇見線上遊戲 492 Examining the Trend of Traffic Burstiness

陳昇瑋 / 當學術研究者遇見線上遊戲 496 An Integrated Classifier Conservative approach (10000
packets): false positive rate ≈ 0% and 90% correct rate Progressive approach (2000 packets): false negative rate < 1% and 95% correct rate

497 Robustness against Counter Attacks
Counter attack: adding random delays to the release time of client commands. The command timing scheme will be ineffective, but schemes based on traffic burstiness and human reaction to network conditions are robust:
→ Adding random delay to command timing will not eliminate the regularity unless the added delay is heavy-tailed or longer than the updating interval by orders of magnitude.
→ However, adding such long delays will make the bots incompetent, as this will slow down the character’s speed by orders of magnitude.

陳昇瑋 / 當學術研究者遇見線上遊戲 498 The IDC of the original packet
arrival process and that of intentionally-delayed versions
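The slides do not define the IDC, so the sketch below assumes the standard definition of the index of dispersion for counts: the variance of the number of arrivals per window divided by its mean, at a given window length. Function and parameter names are illustrative.

```python
def index_of_dispersion(arrival_times, window):
    """IDC at one time scale: Var(count per window) / Mean(count per window)."""
    if window <= 0:
        raise ValueError("window must be positive")
    if not arrival_times:
        raise ValueError("need at least one arrival")
    start = min(arrival_times)
    end = max(arrival_times)
    n_windows = int((end - start) // window) + 1
    counts = [0] * n_windows
    for t in arrival_times:
        counts[int((t - start) // window)] += 1
    mean = sum(counts) / n_windows
    # Population variance of the per-window counts.
    variance = sum((c - mean) ** 2 for c in counts) / n_windows
    return variance / mean

regular = index_of_dispersion([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], window=2)
bursty = index_of_dispersion([0, 0.1, 0.2, 0.3, 9], window=2)
```

Evaluating the function over a range of window lengths gives an IDC curve; perfectly regular arrivals score 0, and burstier traffic scores higher.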

陳昇瑋 / 當學術研究者遇見線上遊戲 499 Our Solution II: Movement Trajectory Based
on the avatar’s movement trajectory in game Applicable for all genres of games where players control the avatar’s movement directly Avatar’s trajectory is high-dimensional (both in time and spatial domain)

陳昇瑋 / 當學術研究者遇見線上遊戲 500 The Rationale behind Our Scheme The
trajectory of the avatar controlled by a human player is hard to simulate for two reasons: Complex context information: Players control the movement of avatars based on their knowledge, experience, intuition, and a great deal of environmental information in game. Human behavior is not always logical and optimal How to model and simulate realistic movements (for game agents) is still an open question in the AI field.

陳昇瑋 / 當學術研究者遇見線上遊戲 501 Bot Detection: A Decision Problem Q:
Whether a bot is controlling a game client given the movement trajectory of the avatar? A: Yes / No?

陳昇瑋 / 當學術研究者遇見線上遊戲 502 User Movement Trails

陳昇瑋 / 當學術研究者遇見線上遊戲 503 3D Path Visualization Tool

陳昇瑋 / 當學術研究者遇見線上遊戲 504 Case Study: Quake 2

陳昇瑋 / 當學術研究者遇見線上遊戲 505 Data Collection Human traces downloaded from
fan sites including GotFrag Quake, Planet Quake, Demo Squad, and Revilla Quake Site Bot traces collected on our own Quake server CR BOT 1.14 Eraser Bot 1.01 ICE Bot 1.0 Totally 143.8 hours of traces were collected

陳昇瑋 / 當學術研究者遇見線上遊戲 506 Data Representation (X, Y) (X, Y)
t (X, Y) (X, Y)

陳昇瑋 / 當學術研究者遇見線上遊戲 507 Aggregate View of Trails (Human &
3 Bots) Human CR Bot Eraser ICE Bot

陳昇瑋 / 當學術研究者遇見線上遊戲 508 Trails of Human Players

陳昇瑋 / 當學術研究者遇見線上遊戲 509 Trails of Eraser Bot

陳昇瑋 / 當學術研究者遇見線上遊戲 510 Trails of ICE Bot

陳昇瑋 / 當學術研究者遇見線上遊戲 511 Movement Trail Analysis Activity mean/sd of
ON/OFF periods Pace speed/offset in each time period teleportation frequency Path linger frequency/length smoothness detourness Turn frequency of mild turn, U-turn, …

陳昇瑋 / 當學術研究者遇見線上遊戲 512 Bot Detection Performance

513 Step 1. Pace Vector Construction
For each trace s_n, we compute the pace (the distance moved) in each successive two-second interval. We then compute the distribution (histogram) of paces with a fixed bin size, where B is the number of bins in the distribution.

陳昇瑋 / 當學術研究者遇見線上遊戲 514 Pace Vector: An Example B is
set to 200 (dimensions) in this work
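Step 1 can be sketched as follows. The formulas on the original slide are not available here, so this is an assumed reading: positions are sampled once per second, the pace is the straight-line distance between positions two seconds apart, and the histogram is normalized to sum to 1. The bin size is a hypothetical parameter.

```python
import math

def pace_vector(positions, samples_per_step=2, num_bins=200, bin_size=10.0):
    """Normalized histogram of distances moved in successive fixed-length intervals.

    positions: list of (x, y) samples at a fixed rate (assumed one per second).
    """
    paces = [
        math.dist(positions[i], positions[i + samples_per_step])
        for i in range(0, len(positions) - samples_per_step, samples_per_step)
    ]
    histogram = [0] * num_bins
    for pace in paces:
        # Paces beyond the last bin edge are clamped into the final bin.
        index = min(int(pace // bin_size), num_bins - 1)
        histogram[index] += 1
    if not paces:
        return [0.0] * num_bins
    return [count / len(paces) for count in histogram]

# Toy trace: a slow move of 3 units, then a jump of 40 units.
trace = [(0, 0), (1, 0), (3, 0), (3, 0), (3, 40)]
vector = pace_vector(trace, num_bins=5, bin_size=10.0)
```

With `num_bins=200`, as on the slide, each trace becomes a 200-dimensional vector regardless of its length, which is what lets traces of different durations be compared.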

515 Step 2. Dimension Reduction with Isomap
We adopt Isomap for nonlinear dimension reduction, for better classification accuracy and lower computation overhead in classification.
Isomap assumes the data points lie on a manifold:
1. Construct the neighborhood graph by kNN (k-nearest neighbor)
2. Compute the shortest geodesic path for each pair of points
3. Reconstruct data by MDS (multidimensional scaling)
Manifold: a mathematical space in which every point has a neighborhood which resembles Euclidean space, but in which the global structure may be more complicated. (Wikipedia)
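The first two Isomap steps can be sketched with the standard library alone. This sketch covers only the neighborhood graph and the geodesic (shortest-path) distances; step 3, classical MDS on the resulting distance matrix, needs an eigendecomposition and is left out. The Floyd-Warshall shortest-path routine is a choice made here for brevity, not necessarily what the study used.

```python
import math

def geodesic_distances(points, k):
    """Isomap steps 1-2: kNN graph, then all-pairs shortest paths along it."""
    n = len(points)
    dist = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0.0
        # Step 1: connect each point to its k nearest neighbors (symmetric edges).
        neighbors = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )[:k]
        for d, j in neighbors:
            dist[i][j] = min(dist[i][j], d)
            dist[j][i] = min(dist[j][i], d)
    # Step 2: Floyd-Warshall shortest paths over the neighborhood graph.
    for m in range(n):
        for i in range(n):
            for j in range(n):
                via = dist[i][m] + dist[m][j]
                if via < dist[i][j]:
                    dist[i][j] = via
    return dist

# Toy U-shaped "manifold": the two ends are close in space but far along the curve.
u_shape = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (3, 2), (3, 1), (3, 0)]
geo = geodesic_distances(u_shape, k=2)
```

On the toy curve the straight-line distance between the two ends is 3, while the geodesic distance follows the U; that difference is what lets Isomap unfold structure that a linear method such as PCA would miss.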

陳昇瑋 / 當學術研究者遇見線上遊戲 516 A Graphic Representation of Isomap

陳昇瑋 / 當學術研究者遇見線上遊戲 518 PCA (Linear) vs. Isomap (Nonlinear)

519 Five Methods for Comparison
Method                    Data input
kNN                       Original 200-dimension pace vectors
Linear SVM                Original 200-dimension pace vectors
Nonlinear SVM             Original 200-dimension pace vectors
Isomap + kNN              Isomap-reduced pace vectors
Isomap + Nonlinear SVM    Isomap-reduced pace vectors

陳昇瑋 / 當學術研究者遇見線上遊戲 520 Evaluation Results Error Rate False Positive
Rate False Negative Rate

522 Evaluation Results: Error Rate, False Positive Rate, False Negative Rate

陳昇瑋 / 當學術研究者遇見線上遊戲 523 User Behavior Topics Game-Play Time Prediction

524 Unsubscription Prediction
Game improvement: players’ unsubscription → low satisfaction. Surveys can be conducted to determine the causes of player dissatisfaction and improve the game accordingly; useful comments are more likely to be received before players quit.
Prevent VIP players’ quitting (maintain revenue): for the “item mall” model, users’ contribution to revenue is heavy-tailed, so losing VIP players may significantly harm revenue.
Network/system planning and diagnosis: by predicting “which” players tend to leave the game → investigate whether there is any problem with network resource planning, network congestion, or server arrangement.

陳昇瑋 / 當學術研究者遇見線上遊戲 525 Unsubscription Prediction: Our Proposal Rationale: players’
satisfaction / enthusiasm / addiction to a game is embedded in her game play history Quit in 30 days? Quit Stay Login history Jan Feb Mar Apr May Jun July Aug Sep Oct Nov Dec 2007 Subscription time

陳昇瑋 / 當學術研究者遇見線上遊戲 527 World of Warcraft The most popular
MMOG for now

陳昇瑋 / 當學術研究者遇見線上遊戲 528 Data Collection Methodology Create a game
character Use the command ‘\who’ The command asks the game server to reply with a list of players who are currently online Write a specialized data-collection program (using C#, VBScript, and Lua)

陳昇瑋 / 當學術研究者遇見線上遊戲 530 Trace Summary

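531 The mystery of the Great God Fox (福克斯大神)?? (1) ref. http://forum.gamebase.com.tw/content.jsp?no=4715&cno=47150002&sno=75201947 ref. http://www.wings-of-narnia.com/viewtopic.php?t=3012
User A: I wonder whether Horde players on the 聖光之願 server have noticed that behind the shaman trainer in the starting village there is always a hunter player called 「福克斯大神」 standing there! I saw him in the starting village when I settled on this server half a year ago, and he is still stationed at that spot today… He never goes AFK, and you can inspect him = =" Should this kind of thing be reported to a GM? Seeing him when you create a new character feels so creepy 囧
User B: me too. The moment I saw him I suddenly got goosebumps.....
User C: The lingering grudge (a vengeful ghost @@) of a player who has "departed"? Or someone in a sad love story, waiting forlornly for their beloved? QQ
User D: Ha, lots of people are watching right now, a big crowd has gathered around @@ It's a tourist attraction XD

532 The mystery of the Great God Fox (福克斯大神)?? (2)
User E: I just went to take a look too. I created an orc warrior named “聽說有鬼” ("I heard there's a ghost"), sat on the barrel in front of him, and kept staring at him~ Suddenly! <AFK> 福克斯大神 crouched down... a minute later.. he vanished =ˇ=" .. .. Now I'm creeped out too..
User F: What a fierce ghost!!!!!! The Great God's power is terrifying, a bunch of followers died in front of him!!!!!!
User G: I went over to look last time and even ran into two fellow enthusiasts. It was truly unbelievable to watch... It could be listed among the ten great wonders of the World of Warcraft!

533 The Great God Fox (福克斯大神) and his followers -_-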

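536 Questionnaire
Shares (in slide order): 37% 19% 16% 12% 4% 4% 3% 2% 2% 1%
Games (in slide order): WoW, 天堂 (Lineage), RO, 楓之谷 (MapleStory), 石器時代 (Stone Age), LUNA, 神州, Others, 洛汗 (Rohan), 萬王之王 (King of Kings)
# samples: 1,747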

陳昇瑋 / 當學術研究者遇見線上遊戲 537 Reasons for User Unsubscription

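538 Trend of Game Playing Time
Shares (in slide order): 37% 28% 20% 9% 6%
Answers (in slide order): no particular trend, it depends on the situation at the time; play sessions get shorter and the number of login days keeps falling; no obvious change; played more in the later period; varies periodically with the month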

539 Logistic Regression Model for Unsubscription Prediction
Significant features (out of > 20 features): avg. session time; daily session count; variation of the login hour (when the player starts playing a game each day); variation of daily play time (number of hours). A naive logistic regression model achieves approximately 75% prediction accuracy.

陳昇瑋 / 當學術研究者遇見線上遊戲 540 Unsubscription Prediction Result

Q&A

Forecasting Online Game Addictiveness, NetGames 2012. Jing-Kai Lou (National Taiwan University), Kuan-Ta Chen (Academia Sinica), Hwai-Jung Hsu (Academia Sinica), Chin-Laung Lei (National Taiwan University)

World of Warcraft by Blizzard

World of Warcraft by Blizzard: 4.5 years and $63M USD for development before release in 2004*; > $37M USD for upkeep and expansions from 2004 to 2010**
*http://digitalbattle.com/2006/06/15/world-of-warcraft-cost-63-million/ **http://online.wsj.com/article/SB10001424052748703467304575383443343071562.html?mod=googlenews_wsj

Grand Theft Auto V (by Rockstar Games)

Grand Theft Auto V (by Rockstar Games) $137M USD for
development and 100M for marketing Hit $1 billion in 3 Days

Witcher 3 (by CDPR)

Witcher 3 (by CDPR) $81M USD for development and marketing
3.5 years with 240 staff Net Profit $62.5M in 6 weeks

Online Game Industry is Competitive $1M to $200M USD dev
cost per game* > 200 game titles each year** *http://www.gamesetwatch.com/2007/04/mmo_production_costs_how_low_c.php *http://www.gamespot.com/news/star-wars-the-old-republic-cost-200-million-to-develop-6348959 **http://www.gamespot.com/

The Terrifying Truth Most of them survived only 4--9 months.
http://www.slideshare.net/TomSente/casualconnect2012-honeytracks-game-lifecycle-kpis Usually long before a game’s investment could ever be paid off…

Are All Games Equally Cloud-Gaming-Friendly? / Kuan-Ta Chen 551 The
Question Is a game’s lifetime predictable?

Are All Games Equally Cloud-Gaming-Friendly? / Kuan-Ta Chen 552 In
other words … Is a game’s addictiveness predictable? addictiveness [noun]: the ability to retain players active in the game for a long time.

Significance STOP developing hopeless games SUGGEST better design decisions during development CHOOSE better games to publish (for game publishers)

554 State-of-the-Practice: intuition of game designers; feedback from focus groups; psychologically inspired methods, e.g., the think-aloud method

555 Our rationale: why does a player become addicted to an online game? - Being entertained - Having various emotions aroused, e.g., joy, excitement, tension

Approach Published games Emotion measurements Market performance Prediction Model Predicted market performance for unpublished game X Emotion measurements Unpublished game X

Are All Games Equally Cloud-Gaming-Friendly? / Kuan-Ta Chen 558 GROUNDTRUTH
DATASET DESCRIPTION

Collaborator Gamania, a top game company in Taiwan Gamania released player session information (every player’s login and logout events) of 11 games to us

Are All Games Equally Cloud-Gaming-Friendly? / Kuan-Ta Chen 560 Overview
of Games 4 ACT 2 FPS 5 RPG

Are All Games Equally Cloud-Gaming-Friendly? / Kuan-Ta Chen 561 Account
Activity Records (AAR) AAR Format Dataset Overview

Are All Games Equally Cloud-Gaming-Friendly? / Kuan-Ta Chen 562 QUANTIFYING
GAME ADDICTIVENESS

563 Attempt #1: Subscription period
Intuition: a game is more addictive if its gamers tend to play it as much as they can.
Subscription period: the time span (in days) between a player’s first and last game sessions.
Issue: the actual time players spend in the game is not considered.

564 Attempt #2: Ratio of Presence
Ratio of presence (RoP): the number of days on which the gamer enters the game at least once, divided by the length of the subscription period (in days). E.g., entering the game on 20 days of a 100-day subscription period → RoP = 20/100 = 0.2.
Issue: bias toward games with short subscription periods. E.g., an average of 4 online days over 5 subscribed days gives RoP = 0.8.
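A minimal sketch of the RoP computation from a player's login dates is shown below. Counting the subscription period inclusively (first and last day both count) is an assumption made here; the slides do not say how the endpoints are handled.

```python
from datetime import date

def ratio_of_presence(login_dates):
    """Days with at least one login divided by the subscription period in days."""
    active_days = set(login_dates)
    if not active_days:
        raise ValueError("need at least one login")
    first, last = min(active_days), max(active_days)
    subscription_days = (last - first).days + 1  # inclusive span (assumption)
    return len(active_days) / subscription_days

# Toy login history: two sessions on Jan 2 count as one active day.
logins = [date(2007, 1, 1), date(2007, 1, 2), date(2007, 1, 2), date(2007, 1, 5)]
rop = ratio_of_presence(logins)
```

A player who logs in on a single day gets RoP = 1.0 under this definition, which illustrates the bias toward short subscription periods noted above.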

Are All Games Equally Cloud-Gaming-Friendly? / Kuan-Ta Chen 565 Subscription
period and RoP

Are All Games Equally Cloud-Gaming-Friendly? / Kuan-Ta Chen 566 RoP(OP)
RoP with a certain observation period RoP curve The curve formed by RoPs over a range of OP RoP Generalization

RoP curve of FPS2 RoP curves follow a power-law relationship with OP.

Are All Games Equally Cloud-Gaming-Friendly? / Kuan-Ta Chen 568

Are All Games Equally Cloud-Gaming-Friendly? / Kuan-Ta Chen 569 Defining
Addictiveness Index β The decline rate of RoP over time genre-independent
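If RoP follows a power law in the observation period, RoP(OP) ≈ c · OP^(−β), then β can be estimated as the negated slope of a straight-line fit in log-log space. The sketch below assumes that fitting procedure and sign convention; the slides state only that β is the decline rate of RoP over time.

```python
import math

def fit_power_law(observation_periods, rops):
    """Least-squares fit of log(RoP) = log(c) - beta * log(OP); returns (c, beta)."""
    xs = [math.log(op) for op in observation_periods]
    ys = [math.log(r) for r in rops]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic RoP curve generated with c = 0.9 and beta = 0.5.
periods = [1, 2, 4, 8, 16, 32, 64]
curve = [0.9 * op ** -0.5 for op in periods]
c_hat, beta_hat = fit_power_law(periods, curve)
```

Under this convention a smaller β means RoP decays more slowly, i.e. players keep coming back for longer.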

Are All Games Equally Cloud-Gaming-Friendly? / Kuan-Ta Chen 571 MEASURING
PLAYER EMOTION

572 Corrugator supercilii muscle groups: frowning, negative emotion

573 Zygomaticus major muscle groups: smiling, positive emotion

574 Facial EMG approach (EMG: electromyography) 1. Continuous emotion measures (can be at a rate of 1000 Hz or even higher) 2. Does not disturb game play 3. Objective, since the emotional indicators are directly measured rather than told by subjects

575 Facial EMG Measurement Setup: corrugator supercilii muscle (negative emotions), zygomaticus major muscle (positive emotions)

Are All Games Equally Cloud-Gaming-Friendly? / Kuan-Ta Chen 576 Measurement
Devices PowerLab 16/30 Electrodes Wires

Are All Games Equally Cloud-Gaming-Friendly? / Kuan-Ta Chen 577 Measuring
Facial EMG during game play

Are All Games Equally Cloud-Gaming-Friendly? / Kuan-Ta Chen 578 Experiment
Design 84 subjects are asked to play the 11 games A subject must be new to the games he played Each game session lasts >= 45 minutes continuously

579 Quantifying the Measurement
EMG samples are taken at 1,000 Hz, so a 45-minute trace comprises 45 × 60 × 1,000 = 2,700,000 samples.
Given a time series of electrical potential samples P = {p1, p2, …, pn}, the average absolute difference between adjacent samples is taken as the representative index.
CS: corrugator supercilii muscles → negative emotion
ZM: zygomaticus major muscles → positive emotion
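The representative index described above, the mean of |p(i+1) − p(i)| over adjacent samples, is a one-liner. The function name is illustrative.

```python
def emg_activity_index(samples):
    """Average absolute difference between adjacent samples of a potential series."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    total = sum(abs(b - a) for a, b in zip(samples, samples[1:]))
    return total / (len(samples) - 1)

# Toy potential series; a 45-minute trace at 1,000 Hz has 2,700,000 samples.
index = emg_activity_index([0, 1, -1, 2])
samples_per_trace = 45 * 60 * 1000
```

Applied once to the corrugator channel and once to the zygomaticus channel, this would give the CS and ZM values used in the later model.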

Are All Games Equally Cloud-Gaming-Friendly? / Kuan-Ta Chen 581 FORECASTING
GAME ADDICTIVENESS

Are All Games Equally Cloud-Gaming-Friendly? / Kuan-Ta Chen 582 Emotion
vs. Addictiveness

583 Modeling Game Addictiveness
ES: the emotional strength, ES = CS + ZM (the combined strength of the emotions aroused)
β = ω0 + ω1∙CS + ω2∙ZM + ω3∙CS:ZM + ω4∙CS:ES + ω5∙ZM:ES
Adj. R2 = 0.94

584 Leave-One-Out Validation: Pearson cor: 0.86, Kendall cor: 0.78, Avg. error rate: 11%

Are All Games Equally Cloud-Gaming-Friendly? / Kuan-Ta Chen 585 Applications
of the model Early evaluation of game design Market value assessment before publishing 1. Optimize the odds of successful investments 2. Target more accurately the provision of better entertaining experience.

Are All Games Equally Cloud-Gaming-Friendly? / Kuan-Ta Chen 586 Ongoing
Work & Future Plan More sophisticated modelings and more validations Game addictiveness may change over a game’s lifetime Develop models that can explain WHY a game’s lifetime is longer than another? Due to particular game designs? Due to commercial promotions or others?

587 It Is Just The Beginning
We are now digging into psychophysiology, exploring various possibilities to read one’s emotion: brain activity, eye movement, heart activity, respiration, sweat secretion, and so on. Also the mechanisms related to fun and addiction: reward process …

588 Sweat Secretion • Apocrine – Hormonal change – Active for stress and sexual excitement • Eccrine – Thermoregulation – Excretion – Protection – Reflection of emotion change • Palms and soles

Q&A

What to do about calls from unknown numbers? Sheng-Wei Chen (陳昇瑋), Institute of Information Science, Academia Sinica

An everyday annoyance… 591

Sheng-Wei Chen (陳昇瑋) / A Data Scientist's Previously Unpublished Security Research Casebook (資料科學家未曾公開之資安研究事件簿) Available Solutions 592

陳昇瑋 / 資料科學家未曾公開之資安研究事件簿 Technologies Adopted Yellow pages HiPage, YP.com Yelp,
Google Places 104.com.tw, 好評網 Users’ address books Google search (!) 593

陳昇瑋 / 資料科學家未曾公開之資安研究事件簿 Search Phone Numbers on Google 02-2311-3731 02-27883799
0933-555770 0277064034 0987772305 … 595

陳昇瑋 / 資料科學家未曾公開之資安研究事件簿 Available Solutions in Taiwan 596

陳昇瑋 / 資料科學家未曾公開之資安研究事件簿 Real-time caller ID identification based on Google
search and user reports / tags 597

陳昇瑋 / 資料科學家未曾公開之資安研究事件簿 THUS I BECAME A USER OF TWO
YEARS AGO… 598

陳昇瑋 / 資料科學家未曾公開之資安研究事件簿 BUT, … GOOGLE SEARCH AND USER TAGGING
ARE NOT ENOUGH 599

陳昇瑋 / 資料科學家未曾公開之資安研究事件簿 Frequently, I still see the following screen:
600 e.g., 0910889139

陳昇瑋 / 資料科學家未曾公開之資安研究事件簿 THUS, I WROTE AN EMAIL TO WHOSCALL
CUSTOMER SERVICE 602

603 "Whoscall is very useful, but for an unknown number that Google search cannot find and that nobody has reported, you are out of luck, right? :P" "That's right XD" "Then let me help build that feature… :D" "Sure :)"

A JOINT RESEARCH PROJECT 604

The research problem: for an unknown phone number with no Google results (or no useful information), no user tags / reports, and which is not a Whoscall user, can we determine if it’s a malicious number? A telemarketing call? A scam call? A sex-related call? A wrong number? 605

陳昇瑋 / 資料科學家未曾公開之資安研究事件簿 Rationale We believe it’s possible to identify
a malicious number because of … Whoscall userbase ( = potential sensors) 4 million installations 1 million active users (daily) 10 million phone calls (daily) So, when a phone number reaches a Whoscall user, we could possibly determine whether the number is malicious or not based on its previous call behavior. 606

陳昇瑋 / 資料科學家未曾公開之資安研究事件簿 The Scenario 607 ?

Our Steps
1. Recruit a group of voluntary Whoscall users as our sensors
2. Collect phone call logs from these sensors for a month
3. Compare these phone call logs with user reports (block records, 封鎖記錄)
4. Use machine learning techniques to build a predictor for unknown phone numbers 608

Privacy Concerns: user privacy is given the highest priority. Phone numbers are stored as MD5 hash codes (therefore unable to be reversed) 609
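Storing hashed rather than raw numbers can be sketched with Python's `hashlib`. The normalization step and the function names are assumptions. The keyed HMAC variant is also an addition beyond the slide: phone numbers form a small input space, so a plain unsalted hash of one can be recovered by enumerating candidates, and a secret key prevents that.

```python
import hashlib
import hmac

def normalize_number(raw):
    """Keep digits only so that '02-2788-3799' and '0227883799' hash identically."""
    return "".join(ch for ch in raw if ch.isdigit())

def hash_number(raw):
    """MD5 hex digest of the normalized number, as described on the slide."""
    return hashlib.md5(normalize_number(raw).encode("ascii")).hexdigest()

def keyed_hash_number(raw, secret_key):
    """Assumed hardening, not from the slide: HMAC-SHA256 with a secret key."""
    digest = hmac.new(secret_key, normalize_number(raw).encode("ascii"), hashlib.sha256)
    return digest.hexdigest()

plain = hash_number("02-2788-3799")
keyed = keyed_hash_number("02-2788-3799", b"hypothetical-secret-key")
```

Either digest still works as a join key: call logs and user reports about the same number map to the same string, so they can be matched without storing the number itself.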

陳昇瑋 / 資料科學家未曾公開之資安研究事件簿 User reports ㄧ接就掛斷一打來就掛掉一接對方馬上掛斷一接就掛一接就掛掉
一接就掛斷一接就掛斷的吵人電話一接就掛電話一接聽就掛掉一接起來就掛斷電話一接起來，就說打錯一直傳廣告簡訊一直打錯一直打錯電話一直收到沒顯示的APP 一直狂打錯電話一聲一聲不響，就掛掉，有問題一聲就掛一聲掛斷一聽收線一響即掛一響就掛 610 嚴重騷擾國外莫名來電國際電話偽裝台北區碼??? 地下期貨公司地下錢莊地下錢莊推銷地下非法期公司地下非法期貨公司地產垃圾垃圾件垃圾廣告垃圾簡訊垃圾訊息垃圾電話城市理財基隆美髮填問卷壽險外勞外崎砂斗美多次接聽冇人回應，數秒後夜半打給不認識的在那亂色情交友色情交友電話色情人肉市場色情仲介色情傳播色情垃圾簡訊色情外送色情妹妹電話色情媒介色情宣傳色情干擾色情廣告色情廣告擾人色情廣告簡訊色情拉客妹色情按摩色情推銷色情推銷廣告色情推銷簡訊色情推銷電話色情援交外送色情敗類色情服務色情業廣告摩門撥了馬上掛掉擾亂電話擾人電話收數收視率調查放款簡訊放款電話政府宣導政府立案單身敲一聲而已整人電話新光保全日制日産フィナンシャル日豐車行Sales 星展星展借貸星展推消星展銀行星展銀行推廣星展銀行貸款淫媒仲介

Data Summary 611 Telemarketing calls, scam calls

陳昇瑋 / 資料科學家未曾公開之資安研究事件簿 Call Pattern Observation 612 Normal Spam

陳昇瑋 / 資料科學家未曾公開之資安研究事件簿 Two Modes of Spam Calls 613 Type
1 Type 2

陳昇瑋 / 資料科學家未曾公開之資安研究事件簿 A Side-by-Side Comparison 614

陳昇瑋 / 資料科學家未曾公開之資安研究事件簿 615 (calls / day)

陳昇瑋 / 資料科學家未曾公開之資安研究事件簿 616 (calls / day) (calls / day)


陳昇瑋 / 資料科學家未曾公開之資安研究事件簿 620 (seconds)

陳昇瑋 / 資料科學家未曾公開之資安研究事件簿 621 (minutes)

陳昇瑋 / 資料科學家未曾公開之資安研究事件簿 622 (minutes) (minutes)

陳昇瑋 / 資料科學家未曾公開之資安研究事件簿 Dimension Reduction 46 dimensions => 2 dimensions
using classical MDS (multi-dimensional scaling) 627

陳昇瑋 / 資料科學家未曾公開之資安研究事件簿 Feature selection Using 2-norm SVM (support vector
machine) 629

陳昇瑋 / 資料科學家未曾公開之資安研究事件簿 Feature Selection (cont) 630

Done? 631 Did you think we could call it a day here? Not so fast…

Our Goal: predict whether a number is malicious as EARLY as possible, in order to prevent further victims… → Our goal: accurate and FAST detection 632

陳昇瑋 / 資料科學家未曾公開之資安研究事件簿 # calls observed each day 633

Observation time: Month → Day 634

陳昇瑋 / 資料科學家未曾公開之資安研究事件簿 # calls observed each hour 635

Observation time: Day → Hour 636

Dynamic observation period: when do we need a malicious-number prediction? Answer: at the time a phone call reaches a Whoscall user. 637 (Timeline: phone call, phone call, phone call, phone call, ?; the observation window covers the calls made before the query.)

陳昇瑋 / 資料科學家未曾公開之資安研究事件簿 Observation time: The last 5 calls 638

陳昇瑋 / 資料科學家未曾公開之資安研究事件簿 Prediction based on the last N calls
640
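The "last N calls" observation window can be sketched as a feature extractor that runs at the moment a call reaches a user. The record layout and the features are hypothetical: `dur_call_med` echoes the `dur.call.med` column named later in the R demo, and the other features are illustrative stand-ins for the 46 dimensions mentioned earlier.

```python
from statistics import median

def last_n_call_features(calls, now, n=5):
    """Summarize a caller's n most recent calls made before time `now`.

    calls: list of (start_time_s, duration_s, callee_id) tuples for one caller.
    Returns None when the caller has no history yet.
    """
    history = sorted((c for c in calls if c[0] < now), key=lambda c: c[0])
    recent = history[-n:]
    if not recent:
        return None
    starts = [c[0] for c in recent]
    gaps = [later - earlier for earlier, later in zip(starts, starts[1:])]
    return {
        "n_calls": len(recent),
        "dur_call_med": median(c[1] for c in recent),
        "gap_med": median(gaps) if gaps else None,
        "distinct_callee_ratio": len({c[2] for c in recent}) / len(recent),
    }

# Toy history: one short call per minute, each to a different callee.
toy_calls = [(0, 5, "a"), (60, 7, "b"), (120, 3, "c"), (180, 4, "d"),
             (240, 6, "e"), (300, 2, "f"), (1000, 9, "g")]
features = last_n_call_features(toy_calls, now=500, n=5)
```

Because the window is defined by call count rather than wall-clock time, a prediction is available as soon as a number has made a handful of calls, which fits the goal of early detection.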

Can we really call it a day now? 641

陳昇瑋 / 資料科學家未曾公開之資安研究事件簿 Work in Progress Feature selection Anti-countermeasures Online
learning Personalized penalty setting Crowdsourced tag correction mechanisms And much more… 642

陳昇瑋 / 資料科學家未曾公開之資安研究事件簿 A SHORT PROMO ON R (Tools I
used in this project: awk + PHP + R) 643

Why ? [1] IEEE Spectrum: The Top Programming Languages in
2015 http://spectrum.ieee.org/computing/software/the-2015-top-ten-programming-languages

陳昇瑋 / 資料科學家未曾公開之資安研究事件簿 Starting R 646 Introduction to R (Alex
Storer, IQSS) 1/20/12

陳昇瑋 / 資料科學家未曾公開之資安研究事件簿 Learning R....

陳昇瑋 / 資料科學家未曾公開之資安研究事件簿 Once you become an R expert…

陳昇瑋 / 資料科學家未曾公開之資安研究事件簿 Demo of R Basics examine data.frame table,
hist, ecdf, color plot, cor barplot, boxplot on dur.call.med 649

Final Words of Warning: “Using R is a bit akin to smoking. The beginning is difficult, one may get headaches and even gag the first few times. But in the long run, it becomes pleasurable and even addictive. Yet, deep down, for those willing to be honest, there is something not fully healthy in it.” --Francois Pinard

TW.R community & MLDM Monday − Meetup time: every Monday at 7:30 pm − Venue: 政大創立方 − Registration: http://www.meetup.com/Taiwan-R/
− FB: https://www.facebook.com/Tw.R.User − Youtube: http://www.youtube.com/user/TWuseRGroup

Q&A

Is someone secretly using your Facebook account? Sheng-Wei Chen (陳昇瑋), Institute of Information Science, Academia Sinica

陳昇瑋 /資料科學家未曾公開之資安研究事件簿 The Prevalence of Social Network Services 654

陳昇瑋 /資料科學家未曾公開之資安研究事件簿 Sensitive Info on SNS: A LOT! Personal info
Photos, Diary, Schedule Groups, Pages, Likes Connections with friends Friends’ information Friends’ photos, demographics, and so on Interactions with friends Conversations Messages

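Is that information safe? 656 Stealthy use of SNS accounts is commonly seen.

Stealthy Use: Tips 1, 2, 3!! People let browsers manage their passwords. Entering a password on mobile devices is cumbersome. People leave SNS logged on when they’re temporarily away. 657

Existing Measures of Facebook • Records the IP address, operating system, and browser type • Device registration: the device is verified with a code sent back by SMS • However, none of these methods can tell whether a registered computer is currently being operated by the registrant, so misuse cannot be detected in real time.

Our Approach • Rationale: when different people use the same account, their browsing behavior differs. • Do they pay special attention to a particular friend's information? • How much time do they spend browsing new information? • How do they browse older information? • Use machine learning to judge whether the browsing behavior comes from the account owner. • When abnormal behavior is detected, notify the account owner by mobile phone or email to keep the account secure.

The Loophole of the User Identity Process: over the whole duration of using SNS (log-in to log-out), the account is protected only by the log-in authentication process. We need continual authentication to ensure security for the whole duration of using SNS.

The 3 Different Roles of a Subject

User Studies 1. Subjects must take part in pairs of people who know each other; the relationship may be family, friends, couples, colleagues, or classmates. 2. Each user logs in to their own Facebook account and browses their friend list. 3. The experiment then runs in three phases of about 30 minutes each, with seats assigned at random, so a subject may end up using an account that is not their own. 4. Every HTTP request and response exchanged with the Facebook servers is recorded. 5. After the experiment, users fill in basic personal information, including age, gender, and relationship to their partner.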

陳昇瑋 /資料科學家未曾公開之資安研究事件簿 Experiment Procedures 66 /78

陳昇瑋 /資料科學家未曾公開之資安研究事件簿 Data Collection: HTTP Spying • Intercept all HTTP
communications (including AJAX req. and resp.) between the subject’s PC and Facebook servers

陳昇瑋 /資料科學家未曾公開之資安研究事件簿 /78 66 Trace Summary

18 Different Actions on Facebook
• We define 18 common actions on Facebook and categorize them into 2 groups: interactive actions and page-switching actions. • Interactive actions are actions in which users interact with a certain target person. • Page-switching actions are those that lead the browser to another Facebook page.

陳昇瑋 /資料科學家未曾公開之資安研究事件簿 18 Browsing Actions 66 /78

陳昇瑋 /資料科學家未曾公開之資安研究事件簿 Example Action Logs 66 /78

陳昇瑋 /資料科學家未曾公開之資安研究事件簿 The Evidence Of General Diversity 67 /78 
Stalkers pay more attention to reading or searching the interesting or earlier information hidden in expandable pages.

陳昇瑋 /資料科學家未曾公開之資安研究事件簿 The Evidence Of General Diversity (Con’t) 67 /78
 Stalkers tend not to do the trackable action like adding comment or pressing the like button.

陳昇瑋 /資料科學家未曾公開之資安研究事件簿 What Stalkers Do Not Care 67 /78 
Stalkers tend to ignore most of the newsfeeds, and show less interest in expanding comments, groups/fans pages, or who likes the post.

陳昇瑋 /資料科學家未曾公開之資安研究事件簿 What Acquainted Stalkers Care 67 /78  Acquainted
stalkers are usually interested in accounts’ friend list, message pages, and profile cards.

陳昇瑋 /資料科學家未曾公開之資安研究事件簿 What Stranger Stalkers Care 67 /78  Stranger
stalkers are interested in account owners’ profiles and photos. Also they are more willing to check nonfriends’ pages and external links.

The Flowchart of Our Detection Scheme

Detection Performance • We randomly permute the data points 20 times and run 10-fold cross validation each time, then record the mean and standard deviation of the accuracies.
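The evaluation protocol above (20 random permutations, each followed by 10-fold cross validation) can be sketched independently of the classifier. The nearest-class-mean classifier below is a hypothetical placeholder, not the model used in the study. Pooling correct predictions over the folds of one permutation is an assumption about how each accuracy is recorded.

```python
import random
from statistics import mean, stdev

def repeated_kfold_accuracy(samples, labels, train_and_predict,
                            repeats=20, folds=10, seed=0):
    """Mean and standard deviation of accuracy over repeated shuffled k-fold CV."""
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    accuracies = []
    for _ in range(repeats):
        rng.shuffle(indices)  # a new random permutation for every repetition
        correct = 0
        for fold in range(folds):
            test_idx = indices[fold::folds]
            held_out = set(test_idx)
            train_idx = [i for i in indices if i not in held_out]
            predictions = train_and_predict(
                [samples[i] for i in train_idx],
                [labels[i] for i in train_idx],
                [samples[i] for i in test_idx],
            )
            correct += sum(p == labels[i] for p, i in zip(predictions, test_idx))
        accuracies.append(correct / len(samples))
    return mean(accuracies), stdev(accuracies)

def nearest_mean_classifier(train_x, train_y, test_x):
    """Placeholder 1-D classifier: predict the class whose training mean is closest."""
    centers = {c: mean(x for x, y in zip(train_x, train_y) if y == c)
               for c in set(train_y)}
    return [min(centers, key=lambda c: abs(x - centers[c])) for x in test_x]

# Toy, clearly separable 1-D data.
xs = list(range(20)) + list(range(100, 120))
ys = [0] * 20 + [1] * 20
acc_mean, acc_sd = repeated_kfold_accuracy(xs, ys, nearest_mean_classifier)
```

Repeating the permutation shows how much the accuracy depends on the particular fold split, which is what the reported standard deviation captures.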

Important Features for Early Detection • Within the first 7 minutes, we count the features with the 3 most positive and the 3 most negative weights w, which gives us hints on how to modify the early detection model.

Q&A

2011/3/17 679 Data Mining and Machine Learning Lab.

 Authors:  Anh Le, Athina Markopoulou (University of California,
Irvine)  Michalis Faloutsos (University of California, Riverside)  Source:  to appear in IEEE INFOCOM 2011 Mini Conference, Shanghai, China, April 10-15, 2011. (poster, tech report) 2011/3/17 680 Data Mining and Machine Learning Lab.

 Introduction  Dataset and Feature Extraction  Classification Algorithms
 Evaluation Results  System Deployment  Conclusion 2011/3/17 681 Data Mining and Machine Learning Lab.

 “How well can one detect phishing URLs using only
lexical features compared to using full features?”  PhishDef Properties:  High accuracy:  96%-97%  Light-weight:  Low latency  Imposes a modest overhead  Proactive approach  As opposed to reactively relying on blacklist  Resilience to noise  95%-86% accuracy when there is 5%-45% noise 2011/3/17 682 Data Mining and Machine Learning Lab.

 Dataset  Malicious URLs  PhishTank  MalwarePatrol 
Legitimate URLs  Yahoo Directory  Open Directory (DMOZ)  External Feature Collection  WHOIS  Team Cymru 2011/3/17 683 Data Mining and Machine Learning Lab.

 Feature Extraction  Automatically selected features  Delimiters: ‘/’,
’?’, ‘.’, ‘=‘, ‘_’, ‘&’ and ‘-’.  Four parts:  Domain Name  Directory  File Name  Argument  Obfuscation-resistant lexical features  Four different URL obfuscation techniques  Five categories of hand-selected lexical features 2011/3/17 684 Data Mining and Machine Learning Lab.

 (I) Obfuscating the host with an IP address 
(II) Obfuscating the host with another domain  (III) Obfuscating with large host names  (IV) Domain unknown or misspelled 2011/3/17 685 Data Mining and Machine Learning Lab.

Phishing URLs characteristics PhishScore: Hacking Phishers‘ Minds – Samuel Marchal
7 / 16 www.paypal.creasconsultores.com/www.paypal.com/Resolutioncenter.php shevkun.org/css/paypal.com/cgi-bin/cmd%3D_login-submit/css/websc.php us-mg6.mail.yahoo.com.dwarkamaigroup.com/Yahoo.html emailoans.hostingventure.com.au/bankofamerica.com nitkowski.pl/components/wellsfargo/questions.php The registered domain has no relationship with the rest of the URL • Most parts of URLs can be freely defined • Except the registered domain: main level domain + public suffix 4ld.3ld. http:// mld.ps /path1/path2?key1=value1&key2=value2

Features related to the full URL
- Length of the URL (Type II)
- Number of dots in the URL (Type II)
- Blacklisted words (Type IV): confirm, account, banking, secure, ebayisapi, webscr, login and signin; Paypal, free, lucky and bonus
Features related to the domain name
- Length of the domain name (Type III)
- IP or port number is used in the domain name (Type I)
- Number of tokens of the domain name (Type III)
- Number of hyphens used in the domain name (Type III)
- The length of the longest token (Type III)
Features related to the directory
- Length of the directory (Type II)
- Number of sub-directory tokens (Type II)
- Length of the longest sub-directory token (Type II)
- Maximum number of dots and other delimiters used in a sub-directory token (Type II)

Features related to the file name
- Length of the file name (Type II)
- Number of dots and other delimiters used in the file name (Type II)
Features related to the argument part
- Length of the argument part
- Number of variables
- Length of the longest variable value
- The maximum number of delimiters used in a value
Summary of dataset
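A minimal sketch of how several of the hand-selected lexical features listed above might be computed. The function name `lexical_features`, the dictionary keys, and the simple IPv4 pattern are assumptions made for illustration; the slides do not give the exact definitions used in the study.

```python
import re
from urllib.parse import urlsplit

# Blacklisted words from the slide, lower-cased for matching.
BLACKLIST = ["confirm", "account", "banking", "secure", "ebayisapi",
             "webscr", "login", "signin", "paypal", "free", "lucky", "bonus"]
IPV4 = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")  # assumption: dotted IPv4 only

def lexical_features(url):
    raw = url
    if "://" not in url:
        url = "http://" + url  # assumption: scheme-less URLs are HTTP
    parts = urlsplit(url)
    host = parts.hostname or ""
    domain_tokens = [t for t in host.split(".") if t]
    directory, _, file_name = parts.path.rpartition("/")
    dir_tokens = [t for t in directory.split("/") if t]
    values = [kv.partition("=")[2] for kv in parts.query.split("&") if kv]
    lowered = raw.lower()
    return {
        "url_length": len(raw),
        "url_dots": raw.count("."),
        "blacklisted_words": sum(w in lowered for w in BLACKLIST),
        "domain_length": len(host),
        "domain_is_ip_or_has_port": bool(IPV4.match(host)) or parts.port is not None,
        "domain_tokens": len(domain_tokens),
        "domain_hyphens": host.count("-"),
        "longest_domain_token": max(map(len, domain_tokens), default=0),
        "directory_length": len(directory),
        "directory_tokens": len(dir_tokens),
        "longest_directory_token": max(map(len, dir_tokens), default=0),
        "file_length": len(file_name),
        "argument_length": len(parts.query),
        "num_variables": len(values),
        "longest_value": max(map(len, values), default=0),
    }

# One of the phishing URLs shown earlier in the deck.
f = lexical_features(
    "www.paypal.creasconsultores.com/www.paypal.com/Resolutioncenter.php")
```

The resulting dictionary is a plain numeric feature vector that any of the classifiers discussed in the deck could consume.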

- Batch learning: Support Vector Machine (SVM)
- Online learning: Online Perceptron (OP), Confidence Weighted (CW), Adaptive Regularization of Weights (AROW)
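To make the batch vs. online distinction concrete, here is a minimal sketch of the simplest of the online learners named above, the online perceptron. It sees each labelled URL once and updates the weights only on a mistake. The toy feature vectors are hypothetical; CW and AROW extend the same loop by also tracking a confidence (variance) for each weight, which is not shown here.

```python
def train_online_perceptron(stream, n_features):
    """One pass over (features, label) pairs; label is +1 (phishing) or -1."""
    w = [0.0] * n_features
    bias = 0.0
    mistakes = 0
    for x, y in stream:
        score = sum(wi * xi for wi, xi in zip(w, x)) + bias
        if y * score <= 0:  # misclassified, or exactly on the boundary
            w = [wi + y * xi for wi, xi in zip(w, x)]
            bias += y
            mistakes += 1
    return w, bias, mistakes

def predict(w, bias, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + bias > 0 else -1

# Hypothetical two-feature examples, arriving one at a time.
stream = [([1.0, 0.0], 1), ([0.0, 1.0], -1), ([1.0, 0.2], 1), ([0.1, 1.0], -1)]
w, bias, mistakes = train_online_perceptron(stream, n_features=2)
```

Because the model is updated one example at a time, it can keep learning as new phishing URLs arrive, without retraining on the whole dataset as a batch SVM would.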

Batch-based vs. online algorithms: SVM vs. AROW (Yahoo-Phish)

Lexical features vs. full features: OP, CW and AROW (Yahoo-Phish)

Obfuscation-resistant (OR) lexical features: performance of AROW with/without OR features after the last URL

Q&A

陳昇瑋 / When an Academic Researcher Meets Online Games: How Data Science Can Support Virtual-Item Sales in Online Games

陳昇瑋 / 當學術研究者遇見線上遊戲 (Photo credit: FluffyLtd)

Which one sold best?

Sales differences between items
- One item: total sales 93,945; first-week sales 55,947
- Another item: total sales 1,268; first-week sales 992

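What does a data analysis team usually do?
- Player level: DAU, WAU, MAU; time online; average spending
- Item level: transaction volume of each item; how each item's transaction volume evolves over time
- Players vs. items: players' preferences for specific items; the relationship between player attributes (gender, age, level, class, VIP status), purchase period and items
- Marketing: using a recommender system to make personalized item recommendations to players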

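There is actually one question we really want answered…

Using data analysis to help design virtual items
- Quantify the factors that determine how well a virtual item sells: subjective factors and image-signal factors
- Provide design guidelines that designers can refer to
- Build a systematic method that gives guidelines for adjusting virtual-item design for games operated in different regions and countries

Goal: design best-selling virtual items

Everything is DECOMPOSABLE
- One item: total sales 93,945; first-week sales 55,947
- Another item: total sales 1,268; first-week sales 992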

Feature Engineering
A feature is a piece of information that might be useful for prediction. Any attribute could be a feature, as long as it is useful to the model.
"…some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." (Pedro Domingos, "A Few Useful Things to Know about Machine Learning")

陳昇瑋 / 當學術研究者遇見線上遊戲 http://jobs.netflix.com/jobs.php?id=NFX01466

Netflix Taggers
- Dedicated staff are hired to watch and tag videos following an SOP (36 pages)
- 555 tags, 76,897 combinations (January 2014)
- A video recommender system is built on these tags

陳昇瑋 / 當學術研究者遇見線上遊戲 Netflix Micro-genres for Videos

陳昇瑋 / 當學術研究者遇見線上遊戲 Feature extraction based on object detection 709
https://pjreddie.com/darknet/yolo/ https://youtu.be/VOC3huqHrss?t=8

陳昇瑋 / 當學術研究者遇見線上遊戲科技三箭？ 710

陳昇瑋 / 當學術研究者遇見線上遊戲 Crowdsourcing = Crowd + Outsourcing “soliciting solutions
via open calls to large-scale communities”

陳昇瑋 / 當學術研究者遇見線上遊戲 A more formal definition “Crowdsourcing is the
act of taking a job traditionally performed by a designated agent (usually an employee) and outsourcing it to an undefined, generally large group of people in the form of an open call.” [1] [1] Howe, Jeff. Crowdsourcing: A Definition, http://crowdsourcing.typepad.com/

陳昇瑋 / 當學術研究者遇見線上遊戲 Image Semantics Reward: 0.04 USD / task
main theme? key objects? unique attributes?

陳昇瑋 / 當學術研究者遇見線上遊戲 0.02 USD/ task find out photos of
revolvers!

陳昇瑋 / 當學術研究者遇見線上遊戲 0.01 USD/ task Human Skeleton

陳昇瑋 / 當學術研究者遇見線上遊戲 0.01 USD/ task Photo Orientation

陳昇瑋 / 當學術研究者遇見線上遊戲 Perspectives for 3D Objects Thi Phuong Nghiem,
Axel Carlier, Geraldine Morin, and Vincent Charvillat, "Enhancing online 3D products through crowdsourcing," ACM CrowdMM'12.

陳昇瑋 / 當學術研究者遇見線上遊戲 Web Site Classifier 12 USD / hour
Panos Ipeirotis, “Crowdsourcing using Mechanical Turk: Quality Management and Scalability,” Invited Talk at CSDM 2011.

陳昇瑋 / 當學術研究者遇見線上遊戲 Photographers’ Intention to support a task? to
capture a bad feeling? to preserve a good feeling? to recall later on? to publish it online? to show it to friends and family? Mathias Lux, Mario Taschwer, and Oge Marques, “A Closer Look at Photographers’ Intentions: a Test Dataset,” ACM CrowdMM’12.

陳昇瑋 / 當學術研究者遇見線上遊戲 Linguistic Affective Judgement Affective response (Snow et
al. 2008) USD 0.4 to label 20 headlines (140 labels) “Closing and cancellations top advice on flu outbreak”

陳昇瑋 / 當學術研究者遇見線上遊戲 A Lot More Examples Document relevance evaluation
Document rating collection Noun compound paraphrasing Person name resolution Among others...

陳昇瑋 / 當學術研究者遇見線上遊戲 http://bountyworkers.net/

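Recruiting experiment participants is a chore

Problems that urgently need solving
- Participants are scattered and take effort to recruit (PTT questionnaire board, Facebook groups)
- Managing the experiment workflow is time- and labor-intensive: controlling the total number of participants, excluding repeated participation, no way to communicate with participants in real time
- Reward distribution is cumbersome and hurts participants' motivation: cash or gift vouchers sent by mail or collected in person (payment is not immediate); prize draws (not everyone gets a reward); P coins (usable only on PTT)
- Participants often have to fill in their personal information repeatedly (reducing willingness to participate)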

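Solving these problems with Bounty Worker
- Steady user growth: gathers volunteers who are willing to take part in experiments, so there is no need to search everywhere
- Systematic management of the experiment workflow: cap on task participation, limit on how many times the same person may repeat a task, pause a task at any time
- Instant messaging with participants: reminders for task completion and review
- Trusted third-party payment: workers who complete tasks earn platform points, which can be redeemed for rewards such as convenience-store vouchers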

How Bounty Worker works
- Requester: create task, review reports, obtain task results
- Worker: browse tasks, accept a task, perform the task and report, receive the reward

Platform status (December 2015)
- 2,500+ users in total
- 3,700+ tasks successfully completed
- 11,3000+ 元 paid out in rewards
- Contact e-mail: [email protected]

陳昇瑋 / 當學術研究者遇見線上遊戲侍者、變裝、僕從、小妹、遐想、貓女、短裙、萌萌、長腿、長襪、女僕、俏麗、甜美、奪目、可愛、幫傭、女侍、女佣、服從、服務、迷裙

陳昇瑋 / 當學術研究者遇見線上遊戲女角衣服的風格標籤俏皮暗紅撩人溫婉魔女和風
裸露辣妹可愛火焰管家華麗東洋誘惑媚惑學生蓬裙火辣性感淘氣萌萌制服彩衣艷麗冷豔惡魔女傭夢幻狂野神聖女僕飄逸野性青春古典甜美日式迷你裙

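First-week sales tell the story
- First-week sales account for half of an item's total sales
- The correlation coefficient between first-week sales and total sales is > 0.9

Sales volume vs. number of active players: correlation coefficient 0.83

Virtual-item Sales Index (SI)
To compare how well equipment released at different times sells:
(1) remove the effect of release time
(2) remove the effect of the length of the sales period
(3) remove the effect of players' purchasing power
The Sales Index (SI) of each piece of equipment is defined as its sales quantity normalized by (1), (2), and (3).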

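Style-tag scores of two outfits
- Outfit with SI 0.0621: 彩衣 (colorful costume) 0.667, 誘惑 (seductive) 0.143, 俏皮 (playful) 0.048, 火辣 (hot) 0
- Outfit with SI 0.0013: 冷豔 (cool and glamorous) 0.548, 夢幻 (dreamy) 0.161, 俏皮 (playful) 0.065, 裸露 (revealing) 0

Correlation coefficients between style tags and SI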

Predicting high/low SI of female outfits from style tags

                  Actual high   Actual low   Total
Predicted high         19            2         21
Predicted low           2           14         16
Total                  21           16

Accuracy: 89.2%; sensitivity: 90.5%; specificity: 90.5%; AUC: 0.890

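Image signal analysis

Distinguishing high/low SI of female outfits from image signals

                  Actual high   Actual low   Total
Predicted high         16            2         18
Predicted low           5           19         24
Total                  21           21

Accuracy: 83.3%; sensitivity: 88.9%; specificity: 76.2%; AUC: 0.833 (slightly lower than with style tags)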
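As a cross-check, the summary figures can be recomputed from the counts in the two tables. The sketch below assumes that rows are predictions and columns are actual classes, which is a guess about the flattened slide layout; the function name `metrics` is illustrative.

```python
def metrics(tp, fp, fn, tn):
    """Standard confusion-matrix metrics, with 'high SI' as the positive class."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "sensitivity": tp / (tp + fn),  # share of actual-high items found
        "specificity": tn / (tn + fp),  # share of actual-low items found
        "precision": tp / (tp + fp),    # share of predicted-high that are high
    }

style_tags = metrics(tp=19, fp=2, fn=2, tn=14)
image_signals = metrics(tp=16, fp=2, fn=5, tn=19)
```

Under this reading the accuracies match the slides (89.2% and 83.3%). The 88.9% and 76.2% quoted for the image-signal model coincide with precision (16/18) and sensitivity (16/21), while specificity by the standard definition would be 19/21 = 90.5%; for the style-tag model, standard specificity would be 14/16 = 87.5%. The slides may therefore be using different definitions, or the transcribed counts may be arranged differently from the original table.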

Predicting the SI of female outfits from style tags + image signals: R^2 = 0.669

Q&A

Cultivating Data Science Talent
陳昇瑋, Institute of Information Science, Academia Sinica

陳昇瑋 / 資料科學人才的養成 750

陳昇瑋 / 資料科學人才的養成 Major Roles in a Data Team 752
Data Project Manager Data Scientist Data Analyst Data Engineer Visual Designer

Technical background
- Data scientist / analyst: statistics; statistical packages (e.g., R, Python); machine learning; domain-specific data mining techniques; data visualization
- Data engineer: UN*X / Web programming; DBMS; data crawling / parsing; data cleansing; data visualization techniques (e.g., d3.js)

陳昇瑋 / 資料科學人才的養成 755 Computer Science Statistical Skills Data Engineer
Data Analyst Data Scientist Domain Expertise

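The data scientist of your dreams
- Data analyst: statistical analysis and modeling; reports and visual presentation of data; machine learning
- Reliable consultant: good communication and interpersonal skills; knows how to ask questions, quickly grasps the core of a problem and assesses feasibility
- Scientist: scientific thinking; explores the unknown and defines problems; designs experiments and tests hypotheses
- Business expert: understands how a company operates and makes money; has clear views on where data analysis and big data should be applied
- Big-data analyst: can analyze unstructured data such as text, video or images; knows how to bring in external data and combine it
- Hacker: can program; has a command of big-data technology architectures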

Difference between Engineers and Scientists
Engineers imagine and realize things. Scientists conjecture and verify them.
"Scientists discover the world that exists; engineers create the world that never was." (Theodore von Kármán)

759 http://www.kdnuggets.com/2016/03/data-science-process-rediscovered.html Data Analytics Process Software Dev. Process

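Data literacy: understanding the (potential) value of data

Deceptively simple hard questions
- How can the status of women in India be raised?
- Abortion vs. crime rate?
- More guns, less crime?
- Do sharks or elephants kill more people?
- In a soccer penalty kick, which direction is most likely to score?
- Are child car seats safer, or seat belts?
- Is being drunk dangerous only when driving?
Pretending to know: fallacies of conventional thinking / moral compass / herd behavior and bias

Pretending to know
The future is actually hard to predict: across more than 6,000 predictions made by stock-market experts over several years, overall accuracy was only 47.4%.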

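Can you really taste an expensive wine?
A blind wine tasting at Harvard's Society of Fellows, with four decanters:
- Decanter 1: expensive wine A
- Decanter 2: expensive wine B
- Decanter 3: cheap wine C
- Decanter 4: expensive wine A
Result: the four decanters received similar average ratings, and the largest rating gap was between decanter 1 and decanter 4!

Can you really taste an expensive wine? Robin Goldstein's experiment
- Ran 17 double-blind wine tastings across the United States over several months
- More than 500 participants, including beginners, sommeliers and wine merchants
- Tested 523 wines, priced from USD 1.65 to USD 150 per bottle
Results:
- More expensive wines did not score higher
- On average, expensive wines scored slightly lower than cheap ones
- 12% of the participants in the sample had wine-tasting training; they did not prefer the cheap wines, but showed no marked preference for the expensive ones either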

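Why do people commit suicide?
- In recent years U.S. homicide and traffic-fatality rates have both hit new lows, yet the suicide rate has barely changed; over several decades the suicide rate among 15- to 24-year-olds has even tripled
- David Lester, a psychologist at Richard Stockton College in New Jersey, has explored in more than 2,500 academic publications how suicide relates to other things: alcohol, anger, antidepressants, astrological signs, biochemistry, blood type, body type, depression, drug abuse, gun control, happiness, holidays, Internet use, IQ, mental illness, migraines, the moon, music, national-anthem lyrics, personality type, smoking, spirituality, TV watching, and wide-open spaces
- After all that research, we still do not know why people commit suicide
- David Lester's conclusion: no particular thing can be blamed

Only measurement can confirm the truth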

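Correlation ≠ causation
X and Y are correlated:
- Does X cause Y?
- Does Y cause X?
- Or does another variable cause both X and Y?
Example: chocolate consumption vs. number of Nobel laureates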
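The third possibility, a hidden variable driving both X and Y, can be illustrated with a small simulation. Everything here is synthetic: `z` plays the role of the common cause, the noise levels are arbitrary, and `pearson` is a hand-written helper rather than a library call.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(0)
z = [rng.gauss(0, 1) for _ in range(5000)]   # hidden common cause
x = [zi + rng.gauss(0, 0.5) for zi in z]     # X depends only on Z
y = [zi + rng.gauss(0, 0.5) for zi in z]     # Y depends only on Z
r_xy = pearson(x, y)

# Remove Z's contribution: the leftover noise terms are independent.
r_resid = pearson([xi - zi for xi, zi in zip(x, z)],
                  [yi - zi for yi, zi in zip(y, z)])
```

X and Y never influence each other in this setup, yet `r_xy` comes out strongly positive (about 0.8 in expectation), while the correlation of what remains after removing Z, `r_resid`, is close to zero.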

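Does money help win elections?
- Candidates who spend more do win more often
- Is it money that wins elections, or does charisma attract both donations and votes?
- How can a candidate's appeal be quantified?
- Look at U.S. congressional elections since 1972 in which the same two candidates faced each other twice in a row
- There are about 1,000 such consecutive A vs. B contests; with candidate appeal relatively stable, the effect of money can be measured
  - A winner who cuts spending in half loses only 1% of the vote
  - A loser who doubles spending gains only about 1% of the vote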

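Parents' influence on children's grades?
- Early Childhood Longitudinal Study: in the late 1990s, more than 20,000 schoolchildren were selected from across the United States; their backgrounds were surveyed in detail and their academic progress was measured from kindergarten through fifth grade
- Regression analysis: does having many books at home make a child do well at school? Do children with many books at home do better than children with none?
  - Children with many books at home do score higher than children without books
  - But books at home may merely reflect parents' income; other variables may be driving the scores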

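Back to parents' influence on children's grades
Which family factors are strongly correlated with test scores?
- The parents are highly educated
- The family is close-knit
- The parents have high socioeconomic status
- The family recently moved to a better neighborhood
- The mother was 30 or older when she had her first child
- The child had low birth weight
- The child attended a preschool program
- The mother did not work between the child's birth and kindergarten
- The parents speak English at home
- The parents regularly take the child to museums
- The child is adopted
- The child is frequently spanked
- The parents take part in the PTA
- The child frequently watches TV
- There are many books at home
- The parents read to the child almost every day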

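Parents' influence on children's grades
What matters is who the parents "are", not what the parents "do".

Who the parents are (strongly correlated):
- Highly educated
- High socioeconomic status
- Mother was 30 or older when she had her first child
- Child had low birth weight
- English is spoken at home
- Child is adopted
- Parents take part in the PTA
- Many books at home

What the parents do (weakly correlated):
- Close-knit family
- Recently moved to a better neighborhood
- Mother did not work between the child's birth and kindergarten
- Child attended a preschool program
- Regular museum visits with the child
- Child is frequently spanked
- Child frequently watches TV
- Parents read to the child almost every day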

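The influence of adoptive parents?
- Adoptive parents are usually smarter, better educated and better paid than the birth parents, but these advantages usually do not contribute to the adopted child's school grades
- Yet adopted children are more likely to attend college, hold well-paid jobs and marry as adults, showing that as adults they can escape the life trajectory predicted by IQ alone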

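Why are Black Americans more likely to suffer from cardiovascular disease?
- Black Americans are 50% more likely than whites to have hypertension
- The obvious triggers (diet, smoking, poverty and so on) cannot explain it
- Blacks in the Caribbean also have higher rates of hypertension, but for Blacks now living in Africa the rate is statistically no different from that of white Americans

Why are Black Americans more likely to suffer from cardiovascular disease?
The observation of Harvard economist Roland Fryer

Why are Black Americans more likely to suffer from cardiovascular disease?
- Selection during the slave trade may be the root cause of the higher rate of cardiovascular disease among Black Americans
- Slaves shipped from Africa to the Americas often died on the way, mainly of dehydration
- People with high "salt sensitivity" dehydrate less easily: a body that retains salt retains more water
- Traders identified highly salt-sensitive slaves (by licking their faces) to reduce their risk
- This salt-sensitive constitution is a highly heritable trait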

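Using data to support incentive design

Incentives
"Morality does not change people's behavior; prices do!" (Austan Goolsbee, economic adviser to President Obama)
Basic steps in solving a problem:
- Understand the incentives of everyone involved in the specific situation
- Do not listen to what people say; watch what they do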

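Listen to what they say, watch what they do
Why do California residents save energy?
Telephone survey: when you decide to save energy, how important is each of the following factors?
- Saving money
- Protecting the environment
- Benefiting society
- Many other people are doing it

Listen to what they say, watch what they do (cont.)
Field experiment: door-to-door visits, hanging a small placard on residents' doors
 - Save energy (control group)
 - Save energy, protect the environment (moral motive)
 - Do your part to save energy for future generations (social responsibility)
 - Saving energy also saves money (financial motive)
 - Join your neighbors in saving energy (herd mentality)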

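Only observation reveals the truth
Among ten words commonly used in U.S. real-estate ads, which are strongly positively correlated with the final sale price?
- Fantastic
- Spacious
- Corian
- Charming
- Maple
- Granite
- State-of-the-Art
- "!"
- Gourmet
- Great neighborhood
Analysis of 100,000 home-sale records from the Chicago suburbs, 3,000 of which were real-estate agents selling their own homes: after controlling for variables such as location and condition, agents' own homes stayed on the market 10 days longer on average and sold for 3% more than homes in the same condition.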

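Incentives: a bad example
- Warning sign in Petrified Forest National Park, Arizona, USA: "Your natural heritage is being damaged every day: 14 tons of petrified wood are stolen from here each year, mostly one small piece at a time."
- Trails with the warning sign had three times the theft rate of trails without it!
- Wrong behavior gets rationalized because many people are doing it.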

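The cobra effect
- In colonial India, the British government offered a bounty for killed cobras in order to reduce the local cobra population
- A similar case occurred in Vietnam under French rule: reducing the rat problem
- Mexico and Bogotá: to ease traffic congestion, the government ruled that only cars with certain license-plate numbers may be on the road each day, so as to reduce traffic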

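Charity fundraising: I will only bother you once
Smile Train
- Founded in 1999; by 2007 it had provided free treatment to 380,000 children with cleft lip and palate in 76 countries, focusing on China and India
- Strategy: "Make one gift now and we will never ask you for another donation"
- Fundraisers normally want to cultivate repeat donors; why sacrifice long-term donations for short-term income?

Why should I only bother you once?

Charity fundraising: I will only bother you once
Options on the Smile Train reply card:
- "This will be my only gift. Please send me a tax receipt and do not ask me for another donation."
- "I would like to receive two communications a year from Smile Train. Please respect my wishes and limit the mail you send me."
- "Keep me informed about the latest progress of Smile Train's work and send me regular updates."
Results:
- First-time donations were twice as likely as with a regular direct-mail letter, and the average first gift was larger
- The overall donation rate actually rose by 46%!
1/3 2/3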

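Summary
- Keep a childlike mind; do not bring in your own assumptions or biases
- Make good use of data and statistical tools
- Natural experiments cannot be had on demand; when necessary, design experiments to test hypotheses
- You will slowly get a little closer to the "truth"…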

陳昇瑋 / 資料科學人才的養成 792 創意人的訓練 - 創意的產生是有方法可循的

陳昇瑋 / 中央研究院如何成為創意人

陳昇瑋 / 資料科學人才的養成創意的發生

陳昇瑋 / 資料科學人才的養成讀太多書會限制創意嗎？

陳昇瑋 / 資料科學人才的養成魔島理論 by James Webb Young (楊傑美)

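Magic Island theory: what others say
- "Creativity seems unable to go beyond a person's own experience."
- "A creative person is like a dairy cow: without eating grass, it produces no milk."
- "Ninety-nine percent effort plus one percent inspiration."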

陳昇瑋 / 資料科學人才的養成但，靈感怎麼來的？一定還有些因素決定它冒不冒出海面……

陳昇瑋 / 資料科學人才的養成創意的形式 (1) 拼圖遊戲不相干事物的「相干性」隨身聽：走路 + 音樂
果汁汽水：果汁 + 汽水論文主題產生器？改變用途不龜手之藥：染布工人軍隊心理學、社會學廣告業（爭取消費者）政治

陳昇瑋 / 資料科學人才的養成創意的形式 (2) 階段再定義眼光是新的，東西就是新的創意不見得是改變東西，有時候只是改變自己「認知的改變」是重要的創新來源情勢律
(Law of Situation)  年代影視：製作者  提供者  規劃者  窗帘  調節光線  影印機  辦公室自動化  大賣場  商品訊息 / 遊戲休閒  手機, Google, Facebook, …

陳昇瑋 / 資料科學人才的養成創意的自我訓練 TRAINING

陳昇瑋 / 資料科學人才的養成巴黎司機訓練法強迫自己觀察，直到觀察成為生活的一部分上班休息時間，觀察每一位同事打電話的姿勢用餐時間，觀察每一位食客吃飯的細節當你看的東西與人不同，你想的東西也就與眾不同

陳昇瑋 / 資料科學人才的養成杜拉克式問句簡化問題，並集中精神於真正的問題上問題要淺問題要清楚判斷問題的重要性「好的問題，就等於答對了一半」我來說一個故事
…

陳昇瑋 / 資料科學人才的養成 “What if …” 訓練法給自己大膽的假設，試想各種可能的狀況如果…會怎麼樣… 如果台灣持續乾旱，我們生活用水該怎麼辦？
如果不小心睡過頭了，上班遲到會怎麼樣？

陳昇瑋 / 資料科學人才的養成反分析訓練法分析與綜合，要彼此互相支援分析是「同中求異」，把看起來相同的東西說成不相干綜合是「異中求同」，把看起來不同的東西說成相關運用「分析」的能力，將東西拆成不同成分，再運用「綜合」的能力，將這些成分重新排列組合試著找出兩個(看似)不相干事物的共同之處
戒指 vs. 仙人掌音響 vs. 茶杯信用卡 vs. 早餐皮夾 vs. 螞蟻鉛筆 vs. 溜滑梯

陳昇瑋 / 資料科學人才的養成重新定義訓練法創意的來源，有時只是「認知的改變」如果解釋是新的，舊的東西也能變成新的漸距推遠賣豆漿的人 
供應早餐的人  供應外出人士方便快速用早餐的人平行重定義百貨公司擁有者建築物的地主賣東西給消費者的商店為消費者選擇生活用品的人

陳昇瑋 / 資料科學人才的養成創意的產出

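Three stages
Preparation → incubation → illumination (Helmholtz, German philosopher)

Three stages (cont.)
Preparation → incubation → inspiration → verification against the facts (Hoshe F. Rubinstein, USC)
1. Gather the raw material
2. Chew over the material in your mind
3. Put the subject aside as completely as you can and forget the problem entirely
4. Out of nowhere, the idea pops up
5. Put your newborn idea into practice and see whether it works
(James Young, advertising man)

Three stages (cont.)
"Immerse yourself in the project you are working on until you are saturated, and then wait. I do not mean stopping to rest or to watch television for a week; I mean forget it and go do other work." (Lloyd Morgan)
"All discoveries and inventions in the research lab appear as 'inspiration' in a relaxed moment, after a period of intense thinking and gathering of data." (C.G. Suits, GE)

Three stages (cont.)
"When you are stuck on one project, go on to the next and let the unconscious do its work. When you come back to it, you will be surprised to find that nine times out of ten the problem has been solved, and you will not even know how." (Carl Sagan, astronomer)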

陳昇瑋 / 資料科學人才的養成 Incubation http://dictionary.reference.com/browse/incubation (noun.) 1610s, "brooding," from Latin
incubationem (nominative incubatio) "a laying upon eggs," noun of action from past participle stem of incubare "to hatch," literally "to lie on, rest on," from in- "on" (see in- (2)) + cubare "to lie" (see cubicle ). The literal sense of "sitting on eggs to hatch them" first recorded in English 1640s.

陳昇瑋 / 資料科學人才的養成如何產出好構想？

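Quality out of quantity
"All else being equal, a person who can produce many ideas per unit of time has a better chance of getting a good idea than others." (J. P. Guilford)
"The best way to get a good idea is to have a lot of ideas." (Linus Pauling, Nobel Laureate)

Like a pearl diver gathering oysters
- Edison
- 5,000 product names (70 people)
- 610 book titles
- 100 editorial headlines
- 3,800 bridge names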

陳昇瑋 / 資料科學人才的養成自由運轉 “Free wheeling” (coined by Dr. J.
R. Killian Jr.)

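Free-wheeling mode
- Do not step on the brake and the accelerator at the same time: pass no judgment on any idea
- Come up with as many ideas as you can and list them as fast as possible
- Hitchhike: build on top of others’ ideas
- Think in reverse
- The only goal: "Quantity, quantity, and more quantity!"
"If reason examines ideas too closely, creative ideas will go into hiding." (Friedrich von Schiller)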

陳昇瑋 / 資料科學人才的養成再十個點子再去吃飯！瑞士刀的戶外看板廣告一天的時間夠不夠？午休時間夠不夠？

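Facing the unknown
"When you are not sure whether a problem has an answer, finding one is hard; when you know there are many answers, finding one or two becomes much easier." (Émile Coué, French psychologist)
"When a scientist faces a problem and is certain it has an answer, his attitude changes; that amounts to having already found 50% of the answer." (Norbert Wiener, mathematician)

Other forms of "quality through quantity"
"When I was young, I found that nine out of every ten things I did failed. I did not want to be a failure, so I always did ten times as much." (George Bernard Shaw)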

陳昇瑋 / 資料科學人才的養成我的點子筆記本

陳昇瑋 / 資料科學人才的養成創意的絆腳石

陳昇瑋 / 資料科學人才的養成血統主義「這是不可能的；大家都不這麼做。」抗拒新元素的加入，每一件事都應遵循既有的規則我們的創造力因而受到阻礙，破壞我們思考的流暢性和彈性

陳昇瑋 / 資料科學人才的養成逆變心理習慣領域「從前，我有一次…」 Comfort zone 我們得學習與「改變」一起生活

陳昇瑋 / 資料科學人才的養成不介意加入新元素，但總以為新元素加入的變化都是循著直線前進歷史往往不是直線發展的電腦大小便宜快速多功能電視尺寸成像品質高傳真直線主義

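How to ruin a brainstorming meeting
- Let the boss speak first: once the boss speaks first the session is doomed, because everyone will tend to guess at, and voice, the direction the boss likes
- Have everyone speak in turn: it will be over after one or two rounds
- Let only experts or technical staff speak: a brainstorming session is best made up of different kinds of people from various fields; the ideal size is about 5 to 8 people, and if members include experts on the topic they should make up less than half, because bringing together people from different fields does more to broaden the ideas
- Get away from the office: ideas thought up on the beach usually stray too far off topic
- Allow no silly ideas: if every idea has to be workable before it may be raised, I guarantee the session will be ice cold
- Record the meeting word for word: it is enough to record the key points and suggestions, and the facilitator should not be the note-taker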

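Twenty rules of creativity #1, by Charles Thompson
1. You only need an idea that is fifteen minutes ahead of its time, not one that is light-years ahead.
2. The best way to get a great idea is to come up with lots and lots of ideas, then throw out the bad ones.
3. Do not look for just one right answer.
4. If nothing comes to mind for now… take a break.
5. As soon as you have an idea, write it down so you do not forget it.
6. If everyone thinks you are wrong, you are one step ahead of them; if everyone laughs at your idea, you are two steps ahead.
7. When you ask a dumb question, you usually get a smart answer.
8. Every problem has an answer; ask the right question and the answer reveals itself.
9. Never solve a problem from the most basic point of view.
10. Before the problem is solved, picture what things look like once the difficulty is resolved.

Twenty rules of creativity #2, by Charles Thompson
11. Successful creative people often use proof by contradiction to solve problems or generate ideas.
12. Challenging conventional thinking can turn a disadvantage into an opportunity.
13. If putting on a different pair of shoes does not work, try looking at things from a helicopter or a spaceship.
14. Looking at a goal or problem from nature's point of view greatly widens your horizon and yields different solutions.
15. Borrow other people's first-rate creative principles and keep refining them.
16. The penalty for failure must never be heavier than the penalty for doing nothing!
17. It is usually an idea's interesting qualities that lead to innovation, not positive or negative judgments of it.
18. Writing your ideas down is like putting money in the bank.
19. Before a sixty-minute meeting, do a one-minute mental warm-up.
20. Treat bathing as a pleasure! Inspiration may well arrive while you scrub and hum.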

陳昇瑋 / 資料科學人才的養成個人建議

陳昇瑋 / 資料科學人才的養成建議 #1－大量閱讀先找一些名著墊底要把閱讀範圍延伸到專業之外應立足于個人靜讀讀書卡片不宜多做書中真正深切觸動你的內容，想丟也丟不掉，對此你要有更多的洒脫和自信
「早歲讀書無甚解，晚年省事有奇功」 by 蘇轍有空到書店走走，逛逛圖書館也很好 [1] 余秋雨〈青年人的閱讀〉

陳昇瑋 / 資料科學人才的養成建議 #2－不放過所有的發想隨時可記錄，從來不刪除的筆記方式定時瀏覽記錄，重新檢視所有的發想一有機會就讓旁人幫忙驗證

陳昇瑋 / 資料科學人才的養成建議 #3－杜拉克問句的練習一答接一問，在答案中起問題追根究柢務必追到問題核心很棒的附加價值－再也不怕參加社交活動！做學問要於不疑處有疑，待人要於有疑處
不疑。 “ ” - 胡適

陳昇瑋 / 資料科學人才的養成建議 #4－獨處與熱情舒適的獨處，快樂的孤寂對原創性思想有多麼大的幫忙啊。 “ ” 熱情會提高我們的知覺力，讓我們能體會
最細微的表現。就像一個戀人，每天在他的愛身上，都可以發現新的事物。 “ ”

陳昇瑋 / 資料科學人才的養成延伸閱讀 Advice to a Young Scientist P.B.
Medawar, BasicBooks, 1979. 科學之路：科學家的心路歷程貝弗里奇著/ 楊新北譯, 長堤出版社, 1984. 善用你的思考風格哈里森(Harrison, A. F.), 布朗森(Bramson, R. M.) 著/廖立文譯, 遠流出版公司, 1985. 創造與人生 Robert Olson 著/ 呂勝瑛, 翁淑緣譯, 遠流出版公司, 1985.

陳昇瑋 / 資料科學人才的養成延伸閱讀應用想像力 Osborn, Alex Fraickney 著/ 卲一杭
譯, 協志工業叢書, 1987. The Grace of Great Things Robert Grudin, Ticknor, Fields, 1990. 如何撰寫零錯誤程式 Steve Maguire 著/ 施威銘研究室譯, 旗標, 1994. The Craft of Scientific Writing Michael Alley, Springer, 1996.

陳昇瑋 / 資料科學人才的養成延伸閱讀 A Whack on the Side of
the Head Roger von Oech, Warner Books, 1998. 創意人：創意思考的自我訓練詹宏志, 臉譜文化, 1998. The Elements of Style William Strunk Jr., Longman, 1918. 文案自動販賣機：第一本本土廣告文案寫作指南楊梨鶴,商周出版,2000.

陳昇瑋 / 資料科學人才的養成延伸閱讀如何撰寫學術論文與報告 Janice R. Matthews, John M.
Bowen, Robert W. Matthews 著/ 蔡東龍譯, 合記圖書出版社, 2002. 如何閱讀一本書 Mortimer J. Adler, Charles Van Doren 著/ 郝明義, 朱衣譯, 台灣商務印書館, 1972. The Craft of Scientific Presentations Michael Alley, Springer, 2003. Adios, Strunk and White Gray, Glynis Hoffman, Verve press, 2003.

陳昇瑋 / 資料科學人才的養成延伸閱讀英語論文寫作技巧崎村耕二著/ 張嘉容譯, 眾文圖書公司,
2003. 傑出學者給年輕學子的67封信李遠哲, 蕭新煌, 天下文化, 2003. 問對問題，找答案：批判性思考的智慧學 M. Neil Browne, Stuart M. Keeley, 商智文化, 2006. 英文科學論文寫作 R. Lewis, N. Whitby, E. Whitby, 眾文圖書公司, 2007.

陳昇瑋 / 資料科學人才的養成延伸閱讀你會說話嗎 Nick Morgan 著/ 蔡櫻素譯,
臉譜文化, 2006. 研究科學的第一步：給年輕探索者的建議 Santiago Ramon y Cajal, 究竟出版社, 2007. 撰寫論文的第一本書周春塘, 書泉出版社, 2007. 英語論文﹝句型、片語﹞表現集小田麻里子, 味園真紀著/ 馮慧瑛譯, 眾文圖書公司, 2007.

陳昇瑋 / 資料科學人才的養成延伸閱讀英文研究論文寫作文法指引廖柏森, 眾文圖書公司, 2007.. 創意的生成楊傑美
著/ 許晉福譯, 經濟新潮社, 2009. 語言與人生 S.I. Hayakawa 著/ 鄧海珠譯, 遠流出版公司, 1994

陳昇瑋 / 資料科學人才的養成建議閱讀

Q&A

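Building a Data Science Team
陳昇瑋, Institute of Information Science, Academia Sinica

陳昇瑋 / Building a Data Science Team
What we will talk about today:
- The composition of a data science team
- Set up the kitchen first, or start cooking first?
- How to divide the work and collaborate
- Organization and culture
- Social physics and business management

What if you cannot find experienced experts?
- Three starting points: computing, mathematics and statistics, and the problem domain
- Being strong in one is already good; being strong in two is rare
- Do not wait for the perfect person to appear
- Personal traits: meticulous yet creative; communication skills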

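Minimum team composition
- Ideal initial team size: two is not too few; get something in place first, then improve it
- But do not overlook the Data Project Manager: a PM with a grasp of data analysis techniques, workflow and goal setting
- Roles: PM, Data Scientist, Data Engineer, Data Engineer, Visual Designer

Set up the kitchen first, or start cooking first?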

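A "big" data processing platform?
- For many organizations, "big" is not the most important property.
- According to a 2012 survey by New Vantage Partners of fifty executives at large organizations, the problems large companies deal with are more about "data lacking structure" than about "data being too large".
- For 30%, the main big-data problem is "having to analyze data from multiple sources"
- 22% of respondents focus mainly on "analyzing new types of data"
- 12% mainly "analyze dynamic data streams"
- Only 28% of respondents mainly analyze datasets larger than 1 TB, and 13% of those deal with datasets between 1 TB and 100 TB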

陳昇瑋 / 資料科學團隊的建立 853

陳昇瑋 / 資料科學團隊的建立 854

陳昇瑋 / 資料科學團隊的建立有時候這是真的 855

Proof of Concept
Prove your hypotheses work with small datasets.

Proof of Concept: How?
- ALWAYS start from small samples
- Random sampling is very helpful
- A workstation + R/Python is normally enough
Post-PoC stages:
- Deployment of big data infrastructures
- Verification using FULL datasets
Exception: deep learning and similar methods
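One way to draw a small random sample from a dataset too large to load at once is reservoir sampling, sketched below. This technique is offered as an illustration of "start from small samples"; the slides do not say which sampling method was used, and the function name and the `range` stand-in for a real data source are assumptions.

```python
import random

def reservoir_sample(rows, k, seed=None):
    """Uniform random sample of k items from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)       # fill the reservoir first
        else:
            j = rng.randint(0, i)    # inclusive on both ends
            if j < k:
                sample[j] = row      # replace with probability k / (i + 1)
    return sample

# Stand-in for streaming records from a large log file or database cursor.
sample = reservoir_sample(range(100_000), k=1000, seed=42)
```

The sample fits comfortably on a workstation, so hypotheses can be tried in R or Python before any big-data infrastructure is deployed.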

Start with simple data analysis, then move on to more complex ones

陳昇瑋 / 資料科學團隊的建立 Draft Zero http://video.eyny.com/index.php/video/index/215908.html

陳昇瑋 / 資料科學團隊的建立 Draft Zero http://video.eyny.com/index.php/video/index/115310.html

陳昇瑋 / 資料科學團隊的建立 861 It’s not how much data you
process, it’s about how much insight you draw.

陳昇瑋 / 資料科學團隊的建立如何分工合作？ 862

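Data science team ≠ data warehouse team
Data warehouse team:
- Manages and integrates data
- Handles the data and report requests of the marketing, sales and management teams
- Databases, fields and report formats change, but most questions are defined in advance
Data science team:
- A "customer" of the data warehouse team
- After company leadership sets the direction, the data science team (together with domain experts) defines the questions and finds the answers, then interacts with leadership or feeds the analysis results into existing systems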

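Data team and domain experts
- Domain experts are responsible for asking questions (or pointing out a direction); asking the right, important questions is the hardest part
- The data team is responsible for redefining the problem and finding answers; the form of a question sometimes determines whether it can be solved
- Remove human assumptions and focus on the questions with the greatest payoff
- e.g., how do we raise profit? Improve product quality, improve packaging, strengthen marketing, cut production costs, raise work efficiency, find the right people, increase the repeat-purchase rate, squeeze competitors XD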

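Data science team ≠ report generator
- Empower the team: separate report writing and basic data processing from the data scientists' work so that they can concentrate on more effective work
- Cultivate a culture of curiosity about data: teach all employees to use tools (such as dashboards), break down data silos, spark their curiosity, and show each of them how to make better use of data. This helps change the mindset that treats statistical reports as ad hoc requests, and frees up the data scientists.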

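How data scientists are made: by hanging out
- "Have a meal with the people who run the business twice a week, at least twice; that is your KPI."
- Business sense is acquired by hanging out with people; it does not appear out of thin air
- More generally, the data team should spend time with the business team often: not just in meetings, but over tea and meals

Protecting and nurturing talent
- Data scientists who are regulated too tightly will not perform well
- They should build relationships with the senior executives responsible for products and services, rather than with those who oversee business functions
- They should spend more time taking part in technical communities and conferences and sharing technical knowledge
- The greatest value they add to the company lies not in writing reports or presenting to senior executives, but in innovating on customer-facing products and processes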

陳昇瑋 / 資料科學團隊的建立企業組織及文化 869

陳昇瑋 / 資料科學團隊的建立 870 If you want to build a
data organization, everybody has to first believe in data.

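Data must be a first-class citizen
- Data is not a supporting actor: it is not just for programmers to debug with, nor collected and kept merely to satisfy regulators
- Collecting, storing and providing data is also part of the system specification
- The data science team reviews the completeness and quality of data collection in advance, but a separate data warehouse team should be responsible for integrating and maintaining the data

Make data a company asset
- A company asset, not a departmental asset, and not an orphan nobody owns…
- Ideal approach: code and data are transparent and shared
- Feasible approach: all data is managed by a single team; the data team is a strategic unit with full support from top management
- Real-world cases: none of the above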

陳昇瑋 / 資料科學團隊的建立 875 Build an environment that can support
quick experiments.

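KPIs for a data science team
- Quantifying performance is usually not free: it needs extra investment and time to accumulate
- A/B testing is our good friend
- Only then can effects show up truthfully, e.g., # users, # session time, # transactions
"What you have not measured, you cannot manage." (W. Edwards Deming)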
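As a rough illustration of how an A/B test turns a KPI change into evidence, the sketch below runs a two-proportion z-test on conversion counts. The numbers are hypothetical, the choice of a pooled z-test is an assumption (the slide only names A/B testing), and the function name is illustrative.

```python
from math import erf, sqrt

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """z statistic and two-sided p-value for a difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical experiment: variant B converts 250/5000 users vs. 200/5000 for A.
z, p = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=250, n_b=5000)
```

With these made-up counts the p-value falls below the conventional 0.05 threshold, so the uplift would usually be treated as unlikely to be noise; with smaller samples the same 1-point difference might not be.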

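Sharing KPIs: set up a performance-sharing scheme among
- The data collection team / data warehouse team
- The team where the problem arose or that raised it
- The team that implements the data product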

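Why is it so hard to bring in a data team?

Paradigm shift
- Private → open
- Definition → exploration
- Experience → measurement

Summary
- Putting together a data science team is challenging
- But providing a good working environment in which the team can perform requires even greater change and transformation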

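Social physics

Social physics (Alex “Sandy” Pentland)
- Reality mining: a new science that uses massive behavioral data to explain social behavior
- Not just complex mathematics and quantitative prediction, but a practical science that can be applied in real-world settings
- Social learning
- Idea flow in social networks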

陳昇瑋 / 資料科學團隊的建立 eToro + OpenBook 883 www.tradermaker.com/wp-content/themes/tradermarket/images/reviewscreens/etoro-openbook.jpg

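eToro + OpenBook
- Users can view and copy other users' trades, portfolios and performance records, but cannot see whose trades other users are copying
- Investment performance analysis: data collected on nearly 10 million USD/EUR trades by 1.6 million users in 2011

Evidence of social learning
- Going it alone
- Echo chamber: the same ideas keep recurring
- Sweet spot: users in this zone earn returns 30% higher than everyone else
hbr.org/2012/04/the-new-science-of-building-great-teams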

陳昇瑋 / 資料科學團隊的建立 Idea Flow vs. RoI 886 sites.nationalacademies.org/cs/groups/pgasite/documents/webpage/pga_082159.pdf 回音室
單打獨鬥

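Quantifying collective intelligence
Why are some companies more innovative than others? What determines group performance?
- Expertise? Cohesion? Sense of achievement? Salary? Leadership style? Culture?

Sociometric badge (Sociometer)
- Who interacts with whom, and how
- Tone of voice
- Whether the interaction is face to face (distance)
- Amount of gesturing
- How often one listens and interrupts (or is interrupted) in conversation
- How evenly "turn-taking" is distributed
www.bostonglobe.com/business/2013/11/02/breakthrough-management-tool-big-brother-workplace/WKMDFFieBC9M98EWUPbFZL/story.html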

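Server sales company
- One month, 23 people, about 1,900 hours of interaction observed
- Dispatching of customized-order tasks: the exact start and end time of each task was recorded, measuring exactly how long each sales assistant spent on each task
- Employees ranked in the top third for engagement were 10% more productive than the average employee

Bank of America call center
- Six weeks of behavioral data from call-center agents: 4 teams of 20 people each
- Efficiency metric: average handling time per case; cutting it by 5% saves USD $1M per year
- Improving customer service from an idea-flow perspective: switch from individually staggered breaks to team-based break rotation, to increase interaction and engagement among agents
- Raising engagement by 30% improved average efficiency by 8% (20% for the previously worst case)
- Estimated benefit of USD $15M (given 3,000 agents)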

陳昇瑋 / 資料科學團隊的建立調整輪休時間後工作效率提昇 892 sites.nationalacademies.org/cs/groups/pgasite/documents/webpage/pga_082159.pdf

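Quantifying team engagement and exploration
- Engagement: interaction within the team
- Exploration: exchange across teams
hbr.org/2012/04/the-new-science-of-building-great-teams

Engagement and exploration behavior
- Star network: brings in new idea flow from outside the team
- Dense interconnection: rich interaction that helps vet new ideas and fold them into the team's norms and habits
(Alex "Sandy" Pentland, Social Physics)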

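A typical company structure
- One month, 5 teams, 22 employees, 2,200 hours of data, plus monitoring of e-mail traffic (880 e-mails in total)
- Figure: e-mail vs. face-to-face interaction among the management, development, sales, technical support and customer service teams
sites.nationalacademies.org/cs/groups/pgasite/documents/webpage/pga_082159.pdf

A typical company structure (cont.)
- The team designing new marketing campaigns oscillates between exploration and engagement
- The production team does not: it mostly interacts internally, and new ideas rarely flow in
- Idea-flow black hole: other departments rarely talk face to face with customer service
- Possible fix: change the seating plan so that everyone is inside the interaction loop, which eases coordination problems between departments

A failed project
Monitoring a 20-day project from its start shows how idea flow changes over time and reveals unhealthy, low-interaction idea-flow patterns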

陳昇瑋 / 資料科學團隊的建立專案初始：意念流由管理團隊發出 898 hbr.org/resources/images/article_assets/hbr/1204/R1204C_B_LG.gif

陳昇瑋 / 資料科學團隊的建立僅銷售和支援部門有較多當面溝通 899 hbr.org/resources/images/article_assets/hbr/1204/R1204C_B_LG.gif

陳昇瑋 / 資料科學團隊的建立接近結案期限，面對面互動量大幅降低 900 hbr.org/resources/images/article_assets/hbr/1204/R1204C_B_LG.gif

陳昇瑋 / 資料科學團隊的建立交貨發生問題後，部門間開始大量溝通 901 hbr.org/resources/images/article_assets/hbr/1204/R1204C_B_LG.gif

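Improving idea flow in work teams
- A beer party at 4:30 pm on Friday afternoons?
- Replace the square tables in the staff cafeteria with long tables?

Beyond observation: aiming for improvement
- Real-time meeting feedback system: sociometric badges + interaction visualization
- Real-time visual feedback encourages balanced, high engagement across the group
- Figure: high engagement vs. domination by particular individuals
alumni.media.mit.edu/~taemie/research.htm
vismod.media.mit.edu//tech-reports/TR-623.pdf

High performance comes from good interaction patterns
- Many ideas: contributions are short, rather than a few long speeches
- Dense interaction: quick, short comments (supporting or opposing) help build consensus
- Diversity of ideas: individuals take part in the interaction to a relatively even degree
(Alex "Sandy" Pentland, Social Physics)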

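The "Bell Stars" study: outstanding vs. ordinary
- Diversity of personal networks
- Preparatory exploration
http://www.thestevensmithblog.com/153/how-can-reaching-out-to-others-build-a-community-and-solve-business-issues/

Finding charismatic connectors
Charismatic connectors:
- Collect ideas, are full of curiosity and actively ask questions
- Are energetic and drive conversations
- Interact with others systematically; rather than dominating the discussion, they encourage healthy idea-flow patterns
- Let ideas travel across group boundaries
Party animals:
- Talk eloquently but never get to the point
- Care about appearances and follow fads
- Like to show off and be the center of attention

Data science applied to business management
- Spot future rising stars
- Spot small teams that may have trouble getting along
- Spot newcomers who cannot fit into the group
- More accurate interviewing methods
- Predict resignations
- Predict person-to-person and person-to-team fit
- Predict the effect of decisions (e.g., prediction markets)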

Q&A

陳昇瑋 / 從大數據走向人工智慧 1. Machine Learning is Key to Uncover
Hidden Information 2. Unstructured Data Can Be Highly Valuable 3. Small Data May Contain BIG Values

911 原文網址

陳昇瑋 / 從大數據走向人工智慧 918 Google Search Frequency on “Big Data”
and “Data Science” Data Science Big Data

陳昇瑋 / 從大數據走向人工智慧 920 Google Search Frequency on “Machine Learning”
and “Deep Learning” Deep Learning Machine Learning

陳昇瑋 / 從大數據走向人工智慧 923 Google Search Frequency on “Machine Learning”
normalized by “Data Science” Deep Learning Machine Learning

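Just as Pokémon need trainers to bring out their abilities, once we have big data we also need many, many machine learning experts (call them "AI trainers") before the big data in our hands can truly deliver its value.

Read or download the 700 slides online: http://www.iis.sinica.edu.tw/~swc/talk/data_science_overview.html

2017 Taiwan Data Science Conference, 2017.11.9 (Thu) ~ 2017.11.12 (Sun), Humanities and Social Sciences Building, Academia Sinica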

930 Data Insights Research Lab

Sales prediction based on book covers 932

Book cover generation based on titles 933

Semi-supervised private traits prediction
Figure: names placed along a high-income (收入高) to low-income (收入低) axis: 商業周刊, 李亮瑾 Andy老爹, 連靜雯joanne lien, 背包客棧, 綜藝大集合, citiesocial, 旗山天后宮, relux, 連靜雯專屬後援會, 台灣賓士授權經銷商-中華賓士, 三條崙海清宮閻羅天子包公祖廟, 李開復 Kai-Fu Lee, 楊丞琳 RainieYang, Mobile01, 郭靜 Claire, Mercedes-Benz Taiwan 台灣賓士, 九族文化村, 天下雜誌, 寶島神很大

935 Semi-supervised private traits prediction

Inferring traits from personal faces 936

No longer simple rules #1 937 https://www.theatlantic.com/health/archive/2014/10/the-introverted-face/381697/

No longer simple rules #2 938

Eye contact in video conferencing 939

Enabling eye contact in conferencing 940

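AI development strategy recommendations: AI-assisted Manufacturing
- A golden opportunity for Taiwan: the strongest manufacturing know-how; first-hand, one-of-a-kind data that keeps flowing in
- From order taking, material preparation and production to inventory and shipping, every step has room for AI-assisted optimization
- Automated visual defect inspection
- Automatic parameter tuning to optimize yield
- AI Associate (AA): every piece of equipment has its own AI assistant that continuously monitors hardware status, adjusts parameters and logs maintenance requests, in order to save energy, reduce breakdowns and maintain product quality at all times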

人工智慧發展策略建議 942

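AI development strategy recommendations: yield problems in PCB alkaline etching
Symptoms:
- Etching rate drops
- Precipitate appears in the etchant
- The metal etch-resist plating is attacked
- The copper surface turns black and cannot be etched
- Residual copper on the board surface
- Clearly different etching results on the two sides of the board
- Uneven etching leaves residual copper in places
- Severe undercut of the traces found after etching
- Boards drift sideways as they travel on the conveyor
- Copper on the traces is not fully etched, leaving residual copper along some edges
- Etching on the two sides is out of step
- Excessive crystallization of the alkaline etchant
- Photoresist (dry film or ink) peels off
- Over-etching thins the traces
- Under-etching leaves too large a foot
Troubleshooting steps:
1. Check the relationship between copper thickness and the conveyor speed of the etcher, and find the best operating conditions through process trials.
2. Measure the pH of the etchant; when it falls below 8.0, raise it, for example by adding ammonia water, speeding up replenisher feed, or reducing exhaust ventilation.
3. Measure the specific gravity of the etchant, and add more replenisher to bring it down into the range specified by the process.
4. Check whether the replenisher feed system has failed.
5. Check whether the heater is working abnormally.
6. Check the spray pressure, which should be adjusted to its optimum.
7. If the level in the reservoir tank is too low, the pump runs dry; check the operating sequence of the level control, replenishment and discharge pumps.
Further remedies:
1. Adjust the pH to the specified value, or reduce exhaust ventilation appropriately.
2. Reduce exhaust ventilation appropriately.
3. Drain part of the high-specific-gravity solution and, after analysis, add an aqueous solution of ammonium chloride and ammonia to bring the etchant's specific gravity within the range allowed by the process.
https://tw.wxwenku.com/d/100127553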

Obstacles to Strong AI
- Machines need to learn / understand how the world works: physical world, digital world, people, … They need to acquire some level of common knowledge
- Machines need to perceive the state of the world, so as to make accurate predictions and plans
- Machines need to update and remember estimates of the state of the world: pay attention to important events, remember relevant events
- Machines need to reason and plan: predict which sequences of actions will lead to a desired state of the world
(Credit: Yann LeCun, “Deep Learning and the Path to AI”)

Common sense is the ability to fill in blanks
- Infer the state of the world from partial information
- Infer the future from the past and present
- Infer past events from the present state
- Fill in occluded images
- Fill in missing segments in text, missing words in speech
- Predict the consequences of actions
- Predict the sequence of actions leading to a result
Predicting any part of the past, present, or future percepts from whatever information is available → predictive learning
(Credit: Yann LeCun, “Deep Learning and the Path to AI”)

Elements of Strong AI Common Sense Perception Prediction Memory Reason
/ Planning 947

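Embrace data, and above all do not miss out on AI
陳昇瑋, Taiwan Data Science Association; Institute of Information Science, Academia Sinica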
