Slide 1

Slide 1 text

Pitfalls and Opportunities in the Post-Big-Data Era Mosky

Slide 2

Slide 2 text

Data scientist is the sexiest job of the 21st century.

Slide 3

Slide 3 text

"I thought I would be modeling, but most of the time I am debugging" 11/19  "Quality checks, debugging, and fixes take at least 65% of the time."

Slide 4

Slide 4 text

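"AI wrongly removes too many videos; YouTube reinstates human reviewers" 9/22  "Of the more than 11 million videos taken down, 320,000 were appealed, and nearly half of those were reinstated after review."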

Slide 5

Slide 5 text

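"Nature paper faces serious criticism: the experimental method is fundamentally flawed" 2019/6/29  "The algorithm performs far better on the test set than on the training set. Isn't that data leakage?"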

Slide 6

Slide 6 text

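"Science: some AI fields have made no real progress in years" 5/29  "A paper claimed a huge performance gain, when in fact the baseline it was compared against had lower accuracy."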

Slide 7

Slide 7 text

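"Recent mass layoffs: why do so many companies cut data first?" 5/2  "The extra impact produced by being data-driven, minus team salaries and data collection costs, is the return."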
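The return described in the quote above is simple arithmetic. A minimal sketch, where the function name and all figures are hypothetical and chosen only for illustration:

```python
def data_team_return(extra_impact, team_salaries, data_collection_cost):
    """Return = extra impact from being data-driven, minus what it cost to get it."""
    return extra_impact - team_salaries - data_collection_cost


# Hypothetical yearly figures: the team adds value, yet the return is negative.
example_return = data_team_return(
    extra_impact=1_000_000,
    team_salaries=700_000,
    data_collection_cost=400_000,
)
print(example_return)  # -100000
```

A positive impact is not enough on its own; the return only turns positive once the impact exceeds the combined costs.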

Slide 8

Slide 8 text

https://speakerdeck.com/mosky/the-post-data-era

Slide 9

Slide 9 text

Backend Lead / Backend Engineer Mosky

Slide 10

Slide 10 text

2014: Graph-Tool 2017: Data Science With Python 2018: Hypothesis Testing With Python 2019: Statistical Regression With Python Mosky

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

No content

Slide 16

Slide 16 text

→ 2020's

Slide 17

Slide 17 text

Technology is always like this: it heats up, and it cools down.

Slide 18

Slide 18 text

No content

Slide 19

Slide 19 text

Keep creating value and you will survive.

Slide 20

Slide 20 text

What are the landmines?

Slide 21

Slide 21 text

Overfitting
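As a rough illustration of overfitting, the sketch below compares a model that memorizes the training set (1-nearest-neighbor) with a fixed threshold rule. The data generator, the 20% label noise, and the function names are all assumptions made for this example, not material from the talk.

```python
import random

random.seed(0)


def make_data(n):
    """Points on [0, 1]; the true label is x > 0.5, flipped 20% of the time."""
    data = []
    for _ in range(n):
        x = random.random()
        label = x > 0.5
        if random.random() < 0.2:
            label = not label
        data.append((x, label))
    return data


train, test = make_data(200), make_data(200)


def predict_1nn(x):
    """Memorizing model: copy the label of the closest training point."""
    return min(train, key=lambda point: abs(point[0] - x))[1]


def predict_threshold(x):
    """Fixed rule standing in for a simpler model: one cut at 0.5."""
    return x > 0.5


def accuracy(predict, data):
    """Fraction of points whose predicted label matches the recorded label."""
    return sum(predict(x) == label for x, label in data) / len(data)


for name, model in [("1-NN", predict_1nn), ("threshold", predict_threshold)]:
    print(f"{name}: train={accuracy(model, train):.2f} test={accuracy(model, test):.2f}")
```

The memorizing model scores perfectly on the data it has seen because it also memorizes the flipped labels, and that is exactly what costs it accuracy on unseen data.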

Slide 22

Slide 22 text

Data Leakage

Slide 23

Slide 23 text

Husky Wolf Spurious Relationship

Slide 24

Slide 24 text

Stationarity
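One simple way to see why stationarity matters: if the mean of a series drifts over time, a model fitted on the early part sees a different distribution than the later part. The sketch below uses synthetic series and a hypothetical helper, `half_gap`, to compare the two halves of each series; it is an illustration, not a formal stationarity test.

```python
import random
from statistics import mean

random.seed(0)
n = 200

# Stationary: the mean stays at 0 the whole time.
noise = [random.gauss(0, 1) for _ in range(n)]
# Non-stationary: the mean drifts upward by 0.1 per step.
trend = [0.1 * t + random.gauss(0, 1) for t in range(n)]


def half_gap(series):
    """Mean of the second half minus mean of the first half."""
    mid = len(series) // 2
    return mean(series[mid:]) - mean(series[:mid])


# First differences remove a linear trend.
diffed = [b - a for a, b in zip(trend, trend[1:])]

print(f"noise gap:  {half_gap(noise):.2f}")
print(f"trend gap:  {half_gap(trend):.2f}")
print(f"diffed gap: {half_gap(diffed):.2f}")
```

The trending series shows a large gap between its halves, while the white-noise series and the differenced series do not.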

Slide 25

Slide 25 text

Model-Market Fit

Slide 26

Slide 26 text

How to dodge them safely?

Slide 27

Slide 27 text

MUST USE Cross Validation, Pipeline, etc. and don't get them wrong.
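A common way to get cross validation wrong is to fit a preprocessing step on the full dataset before splitting, which leaks information from the validation folds. The sketch below assumes scikit-learn and NumPy are installed and uses purely random data, so any score well above chance comes from leakage alone. The dataset shape and the choice of `SelectKBest` with `LogisticRegression` are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))  # pure noise features
y = rng.integers(0, 2, size=100)  # labels unrelated to X

# Wrong: pick the "best" features using every row, then cross-validate.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_score = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y, cv=5
).mean()

# Right: feature selection lives inside the pipeline,
# so it is refit on the training part of each fold only.
honest_model = make_pipeline(
    SelectKBest(f_classif, k=20),
    LogisticRegression(max_iter=1000),
)
honest_score = cross_val_score(honest_model, X, y, cv=5).mean()

print(f"leaky:  {leaky_score:.2f}")
print(f"honest: {honest_score:.2f}")
```

Because the labels are random, the honest estimate should hover around chance level, while the leaky one looks deceptively good.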

Slide 28

Slide 28 text

Statistics vs. Machine Learning
➤ Statistics constructs more solid inferences.
➤ Machine learning constructs more interesting predictions.
➤ Machine Learning ⊃ Deep Learning
➤ The models may be the same, but the focuses are different.
➤ Good predictions usually need good inferences about the dataset.

Slide 29

Slide 29 text

No content

Slide 30

Slide 30 text

No content

Slide 31

Slide 31 text

Study Designs
• RCT (A/B testing)
• Cohort Study: Group by exposure.
• Case-Control Study: Diff to find the exposure.
• Case Series
• Case Report
Source: Oxford CEBM 2009, Study Designs
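For the RCT (A/B testing) design at the top of the list, one common analysis is a two-proportion z-test on conversion rates. The sketch below uses only the Python standard library. The function name and the conversion counts are hypothetical, and it assumes users were randomly assigned to the two groups.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rate between A and B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: 100 of 1,000 users convert in A, 150 of 1,000 in B.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

Random assignment is what separates this design from the observational ones lower in the list: it lets the difference be attributed to the treatment rather than to how the groups were formed.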

Slide 32

Slide 32 text

MUST HAVE Domain Knowledge to detect the issues.

Slide 33

Slide 33 text

Role Matters
Science, Analysis, Scientist, and Engineering
➤ Data Engineering / Data Engineer
    ➤ Prepare the data infrastructure so that others can work with it.
➤ Data Analysis / Data Analyst
    ➤ Analyze to help the company's decisions.
➤ Data Scientist
    ➤ Create software to optimize the company's operations.

Slide 34

Slide 34 text

Teamwork Helps

Slide 35

Slide 35 text

Delight People With Fast Releases, e.g., every two weeks

Slide 36

Slide 36 text

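• Many subtle technical problems arise naturally → you need solid fundamentals
• Explainability is not enough; you also need to understand the data and the model → statistics and research methods offer a rich set of tools
• Many subtle non-technical problems arise naturally as well → it takes domain knowledge to spot them
• One person's time is limited → define your role, sharpen your collaboration skills, and keep at it, e.g., project management and product management
• Creating value? Make people happy! Not only users, but colleagues and bosses too.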

Slide 37

Slide 37 text

Image Credits
• “NoSQL”: https://www.reddit.com/r/ProgrammerHumor/comments/2mk8sb/history_of_nosql/
• “NoSQL Databases”: https://www.tech2shout.com/nosql-database-solutions-5-types/
• “Hype Cycle”: https://en.wikipedia.org/wiki/Hype_cycle#/media/File:Hype-Cycle-General.png
• “Overfitting”: https://en.wikipedia.org/wiki/Overfitting#/media/File:Overfitting.svg
• “Data Leakage”: https://www.kaggle.com/dansbecker/data-leakage
• “Husky”: https://en.wikipedia.org/wiki/Husky
• “Wolf”: https://en.wikipedia.org/wiki/Wolf#/media/File:Front_view_of_a_resting_Canis_lupus_ssp.jpg
• “Stationarity”: https://en.wikipedia.org/wiki/Expected_value#/media/File:Largenumbers.svg
• “Non-Stationarity”: https://en.wikipedia.org/wiki/Stationary_process#/media/File:Stationarycomparison.png
• “Houses”: https://unsplash.com/photos/vZEPXDQHR4s
• “Linear PCA vs. Nonlinear Principal Manifolds”: https://en.wikipedia.org/wiki/Principal_component_analysis#/media/File:Elmap_breastcancer_wiki.png
• “Teamwork”: https://unsplash.com/photos/g1Kr4Ozfoac
• “Smile”: https://unsplash.com/photos/4K2lIP0zc_k