The Post–Data Era

Mosky
November 29, 2020

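"Data scientist is the sexiest job of the 21st century." But if you work in data, do you actually feel sexy? The speaker, a software engineer (the previous generation's sexiest job) and now a backend team lead, has lived through several waves of tech hype and wants to share the undercurrents of the data field with you.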

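We all love data and hope to use it to create value and benefit people. I hope this talk helps you steer clear of the landmines and use data to solve problems more efficiently!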


Transcript

  1. Pitfalls and Opportunities in the Post-Big-Data Era (Mosky)

  2. Data scientist is the sexiest job of the 21st century.

  3. "I thought I would be modeling, but most of the time I am debugging" (11/19): "Quality checks, debugging, and fixes take at least 65% of the time."

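  4. "AI wrongly took down too many videos, so YouTube brings back human reviewers" (9/22): "Of the more than 11 million videos taken down, 320,000 were appealed, and nearly half of those were reinstated after review."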

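  5. "Nature paper seriously questioned: the experimental method is fundamentally flawed" (2019/6/29): "The algorithm performs far better on the test set than on the training set. Isn't that data leakage?"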

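  6. "Science: some AI fields have made no real progress for years" (5/29): "A paper claimed a huge performance gain, when in fact the baseline it compared against had low accuracy."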

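  7. "Recent mass layoffs: why do so many companies cut data teams first?" (5/2): "The extra impact produced by being data-driven, minus team salaries and data collection costs, is the return."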

  8. https://speakerdeck.com/mosky/the-post-data-era

  9. Backend Lead / Backend Engineer Mosky

  10. Mosky: 2014: Graph-Tool; 2017: Data Science With Python; 2018: Hypothesis Testing With Python; 2019: Statistical Regression With Python
  11. None
  12. None
  13. None
  14. None
  15. None
  16. → 2020's

  17. Technology is always like this: it heats up, and it cools down.

  18. None
  19. Keep creating value and you will survive.

  20. What are the landmines?

  21. Overfitting

  22. Data Leakage

  23. Spurious Relationship (image labels: Husky, Wolf)

  24. Stationarity
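
The slide gives only the title, so the following is a crude sketch of one way to notice that a series is not stationary: compare the mean of its first half with the mean of its second half. The helper name `half_mean_gap` and both toy series are assumptions made for illustration; this is a heuristic, not a formal stationarity test.

```python
from statistics import mean


def half_mean_gap(series):
    """Absolute difference between the means of the first and second half."""
    mid = len(series) // 2
    return abs(mean(series[:mid]) - mean(series[mid:]))


# Toy data (hypothetical): one series oscillates around 0, the other keeps rising.
steady = [(-1) ** i for i in range(100)]
trending = list(range(100))

print(half_mean_gap(steady))    # no shift between the two halves
print(half_mean_gap(trending))  # large shift: the past is a poor guide to the future
```

A large gap suggests that a model fitted on the earlier window is describing a distribution that no longer holds in the later one.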

  25. Model-Market Fit

  26. How do you dodge them safely?

  27. MUST USE Cross Validation, Pipeline, etc. and don't get them wrong.
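
One common way to get cross validation wrong is to fit preprocessing on the full dataset before splitting, which leaks test-fold information into training. The sketch below uses only the standard library and toy data to show the safe order: inside each fold, the scaler and the model are fitted on the training part only. The function names, the midpoint-threshold classifier, and the data are all assumptions for illustration; in practice scikit-learn's `Pipeline` combined with `cross_val_score` packages the same pattern.

```python
import random
from statistics import mean, pstdev


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(xs, ys, k=5):
    """Return one accuracy per fold; every fitted quantity comes from train data only."""
    scores = []
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]

        # Step 1 of the "pipeline": fit the scaler on the training fold only.
        train_x = [xs[i] for i in train_idx]
        mu, sigma = mean(train_x), pstdev(train_x) or 1.0

        def scale(x):
            return (x - mu) / sigma

        # Step 2: fit a toy classifier (midpoint between class means) on scaled train data.
        m0 = mean(scale(xs[i]) for i in train_idx if ys[i] == 0)
        m1 = mean(scale(xs[i]) for i in train_idx if ys[i] == 1)
        threshold = (m0 + m1) / 2

        def predict(x):
            above = scale(x) > threshold
            return int(above) if m1 > m0 else int(not above)

        # Step 3: score on the held-out fold, which never influenced steps 1 and 2.
        hits = sum(predict(xs[i]) == ys[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return scores


# Toy, well-separated data (hypothetical): class 0 near 0-5, class 1 near 10-15.
xs = [i / 10 for i in range(50)] + [10 + i / 10 for i in range(50)]
ys = [0] * 50 + [1] * 50
print(cross_validate(xs, ys))
```

The design point is that `mu`, `sigma`, and `threshold` are recomputed inside the loop. Moving any of them above the loop would reuse statistics that include the held-out fold.
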
  28. Statistics vs. Machine Learning
    ➤ Statistics constructs more solid inferences.
    ➤ Machine learning constructs more interesting predictions.
    ➤ Machine Learning ⊃ Deep Learning
    ➤ The models may be the same, but the focuses are different.
    ➤ Good predictions usually need good inferences on the dataset.
  29. None
  30. None
  31. Study Designs
    • RCT (A/B testing)
    • Cohort Study: Group by exposure.
    • Case-Control Study: Diff to find the exposure.
    • Case Series
    • Case Report
    Source: Oxford CEBM 2009, Study Designs
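
As an illustration of the first design on the list, here is a minimal sketch of an A/B test: users are randomly assigned to two groups, then the conversion rates are compared with a two-proportion z-test. The function names and the conversion counts are hypothetical and not taken from the talk; the test is the standard pooled z-test, which assumes reasonably large samples.

```python
import random
from math import sqrt
from statistics import NormalDist


def assign_groups(user_ids, seed=0):
    """Randomly split users into control (A) and treatment (B)."""
    shuffled = list(user_ids)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Nobody or everybody converted: the data show no difference at all.
        return 0.0, 1.0
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


group_a, group_b = assign_groups(range(2000))

# Hypothetical outcome: 100 of 1000 convert in A, 130 of 1000 convert in B.
z, p = two_proportion_z_test(100, 1000, 130, 1000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

Random assignment is what separates this design from the cohort and case-control designs: it balances unobserved factors across groups on average, so a difference in outcome can be attributed to the treatment.
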
  32. MUST HAVE Domain Knowledge to detect the issues.

  33. Role Matters: Science, Analysis, Scientist, and Engineering
    ➤ Data Engineering / Data Engineer: Prepare the data infrastructure so that others can work with the data.
    ➤ Data Analysis / Data Analyst: Analyze to help the company's decisions.
    ➤ Data Scientist: Create software to optimize the company's operations.
  34. Teamwork Helps

  35. Delight People With Fast Releases, e.g., every two weeks

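  36. Summary
    • Many subtle technical problems arise naturally → you need solid fundamentals.
    • Explainability is not enough; you also need to understand the data and the model → statistics and research methods offer a rich set of tools.
    • Many subtle non-technical problems arise naturally too → it takes domain knowledge to spot them.
    • One person's time is limited → define your role, hone your collaboration skills (e.g., project management, product management), and keep at it.
    • Creating value? Make people happy! Not only users, but colleagues and your boss as well.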
  37. Image Credits
    • "NoSQL": https://www.reddit.com/r/ProgrammerHumor/comments/2mk8sb/history_of_nosql/
    • "NoSQL Databases": https://www.tech2shout.com/nosql-database-solutions-5-types/
    • "Hype Cycle": https://en.wikipedia.org/wiki/Hype_cycle#/media/File:Hype-Cycle-General.png
    • "Overfitting": https://en.wikipedia.org/wiki/Overfitting#/media/File:Overfitting.svg
    • "Data Leakage": https://www.kaggle.com/dansbecker/data-leakage
    • "Husky": https://en.wikipedia.org/wiki/Husky
    • "Wolf": https://en.wikipedia.org/wiki/Wolf#/media/File:Front_view_of_a_resting_Canis_lupus_ssp.jpg
    • "Stationarity": https://en.wikipedia.org/wiki/Expected_value#/media/File:Largenumbers.svg
    • "Non-Stationarity": https://en.wikipedia.org/wiki/Stationary_process#/media/File:Stationarycomparison.png
    • "Houses": https://unsplash.com/photos/vZEPXDQHR4s
    • "Linear PCA vs. Nonlinear Principal Manifolds": https://en.wikipedia.org/wiki/Principal_component_analysis#/media/File:Elmap_breastcancer_wiki.png
    • "Teamwork": https://unsplash.com/photos/g1Kr4Ozfoac
    • "Smile": https://unsplash.com/photos/4K2lIP0zc_k