The Post–Data Era

Mosky Liu
November 29, 2020

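“Data scientist is the sexiest job of the 21st century.” But do you, working in the data field, actually feel sexy? A software engineer (the sexiest job of the previous generation) who is now a backend team lead and has lived through several waves of tech hype would like to share the undercurrents of the data field with you.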

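We all love data and hope to use it to create value and benefit people. I hope this talk helps everyone steer clear of the minefields and use data to solve problems more efficiently!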

Transcript

  1. Pitfalls and Opportunities in the Post-Big-Data Era
    Mosky

  2. Data scientist
    is the sexiest job of the 21st century.

  3. “I thought I was building models,
     but most of the time I am debugging”
    11/19
     “Quality checks, debugging, and fixes take at least 65% of the time.”

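  4. “AI wrongly removes too many videos;
     YouTube brings back human reviewers”
    9/22
     “Of the more than 11 million videos taken down, 320,000 were appealed, and nearly half of those were reinstated after review.”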

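  5. “Nature paper comes under serious question:
     the experimental method is fundamentally flawed”
    2019/6/29
     “The algorithm performs far better on the test set than on the training set. Isn't that data leakage?”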

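  6. “Science:
     some AI fields have made no real progress for years”
    5/29
     “One paper claimed a huge performance gain, when in fact the baseline it compared against had low accuracy.”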

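  7. “Amid the recent mass layoffs,
     why are so many companies cutting data teams first?”
    5/2
     “The extra impact generated by being data-driven, minus team salaries and data collection costs, is the actual return.”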

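The return described in the slide 7 quote is plain subtraction. A minimal sketch follows; the function name and every figure in it are hypothetical and serve only to illustrate the arithmetic, not to describe any real team.

```python
def data_team_return(extra_impact, team_salaries, data_collection_cost):
    """Extra impact from being data-driven, minus what the data work costs."""
    return extra_impact - team_salaries - data_collection_cost


# Hypothetical yearly figures, all in the same currency unit.
net = data_team_return(
    extra_impact=1_000_000,
    team_salaries=700_000,
    data_collection_cost=400_000,
)
print(net)  # -100000: in this made-up case the costs exceed the added impact
```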

  8. https://speakerdeck.com/mosky/the-post-data-era

  9. Backend Lead /
    Backend Engineer
    Mosky

  10. 2014: Graph-Tool
    2017: Data Science With Python
    2018: Hypothesis Testing With Python
    2019: Statistical Regression With Python
    Mosky

  11-15. (no text)

  16. → 2020's

  17. Technology is always like this: it heats up, and it cools down.

  18. (no text)

  19. Keep creating value and you will survive.

  20. What are the landmines?

  21. Overfitting

  22. Data Leakage

  23. Spurious Relationship
    Husky, Wolf

  24. Stationarity

  25. Model-Market Fit

  26. How do we dodge them safely?

  27. MUST USE
    Cross Validation, Pipeline, etc.
    and don't get them wrong.

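One common way to get cross validation wrong, in the sense of slide 27, is to fit a preprocessing step on all the data before splitting, which leaks information from the held-out fold. The sketch below is an illustration under assumptions, not the speaker's code: it uses only the Python standard library, synthetic one-dimensional data, and a toy nearest-centroid classifier, and all function names are hypothetical. The point is where fitting happens, not the model itself.

```python
import random
import statistics


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_scaler(xs):
    """Learn mean and standard deviation from the given values only."""
    mean = statistics.fmean(xs)
    std = statistics.pstdev(xs) or 1.0  # fall back to 1.0 if all values are equal
    return mean, std


def fit_centroids(xs, ys):
    """Toy nearest-centroid classifier: one mean per class label."""
    return {
        label: statistics.fmean(x for x, y in zip(xs, ys) if y == label)
        for label in set(ys)
    }


def predict(centroids, x):
    """Return the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def cross_validate(xs, ys, k=5):
    """Return per-fold accuracy; every fitted step sees the training fold only."""
    scores = []
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]

        # Pipeline step 1: fit the scaler on the training fold only.
        mean, std = fit_scaler([xs[i] for i in train_idx])

        def scale(x, mean=mean, std=std):
            return (x - mean) / std

        # Pipeline step 2: fit the model on the scaled training fold.
        centroids = fit_centroids(
            [scale(xs[i]) for i in train_idx], [ys[i] for i in train_idx]
        )

        # Evaluate on the held-out fold, reusing the training-fold scaler.
        hits = sum(predict(centroids, scale(xs[i])) == ys[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return scores


# Synthetic data, made up for illustration: two classes with different means.
rng = random.Random(42)
xs = [rng.gauss(0, 1) for _ in range(50)] + [rng.gauss(3, 1) for _ in range(50)]
ys = [0] * 50 + [1] * 50
scores = cross_validate(xs, ys)
print(scores, sum(scores) / len(scores))
```

Libraries such as scikit-learn bundle the same idea into a pipeline object that is refitted inside each fold; the hand-written loop above only makes the order of operations visible.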

  28. Statistics vs. Machine Learning
    ➤ Statistics constructs more solid inferences.
    ➤ Machine learning constructs more interesting predictions.
    ➤ Machine Learning ⊃ Deep Learning
    ➤ The models may be the same, but the focus differs.
    ➤ Good predictions usually need good inferences about the dataset.

  29-30. (no text)

  31. Study Designs
    • RCT (A/B testing)
    • Cohort Study: Group by exposure.
    • Case-Control Study: Diff to find the exposure.
    • Case Series
    • Case Report
    Source: Oxford CEBM 2009, Study Designs

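Slide 31 puts the RCT (A/B testing) at the top of the list. Because assignment to groups is random in an A/B test, one way to check whether an observed difference could be chance is a permutation test. The slides do not prescribe a specific test, so the choice of test, the function name, and the outcome data below are all assumptions made for illustration.

```python
import random


def permutation_test(a, b, n_permutations=5_000, seed=0):
    """Two-sided permutation test for a difference in means between a and b."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        # Re-deal the pooled outcomes into two groups of the original sizes.
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(perm_a) / len(perm_a) - sum(perm_b) / len(perm_b))
        if diff >= observed:
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Made-up outcomes: 1 = converted, 0 = did not convert.
control = [1] * 20 + [0] * 180  # 10% conversion
variant = [1] * 40 + [0] * 160  # 20% conversion
p_value = permutation_test(control, variant)
print(p_value)
```

A small p-value here only says the difference is unlikely under random relabeling; whether the difference matters is still a question for domain knowledge.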

  32. MUST HAVE
    Domain Knowledge
    to detect the issues.

  33. Role Matters
    Science, Analysis, Scientist, and Engineering
    ➤ Data Engineering / Data Engineer
      ➤ Prepare the data infrastructure so that others can work with it.
    ➤ Data Analysis / Data Analyst
      ➤ Analyze data to support the company's decisions.
    ➤ Data Scientist
      ➤ Create software to optimize the company's operations.

  34. Teamwork Helps

  35. Delight People
    With Fast Releases
    e.g., every two weeks

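  36. • Many subtle technical problems arise naturally
    → You need solid fundamentals
    • Explainability is not enough; you also have to understand the data and the model
    → Statistics and research methods offer a rich set of tools
    • Many subtle non-technical problems also arise naturally
    → You need domain knowledge to spot them
    • One person's time is limited
    → Define your role, sharpen your collaboration skills, and keep at it
      e.g., project management, product management
    • Creating value? Make people happy! Not only users, but also your colleagues and your boss.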

  37. Image Credits
    • “NoSQL”: https://www.reddit.com/r/ProgrammerHumor/comments/2mk8sb/history_of_nosql/
    • “NoSQL Databases”: https://www.tech2shout.com/nosql-database-solutions-5-types/
    • “Hype Cycle”: https://en.wikipedia.org/wiki/Hype_cycle#/media/File:Hype-Cycle-General.png
    • “Overfitting”: https://en.wikipedia.org/wiki/Overfitting#/media/File:Overfitting.svg
    • “Data Leakage”: https://www.kaggle.com/dansbecker/data-leakage
    • “Husky”: https://en.wikipedia.org/wiki/Husky
    • “Wolf”: https://en.wikipedia.org/wiki/Wolf#/media/File:Front_view_of_a_resting_Canis_lupus_ssp.jpg
    • “Stationarity”: https://en.wikipedia.org/wiki/Expected_value#/media/File:Largenumbers.svg
    • “Non-Stationarity”: https://en.wikipedia.org/wiki/Stationary_process#/media/File:Stationarycomparison.png
    • “Houses”: https://unsplash.com/photos/vZEPXDQHR4s
    • “Linear PCA vs. Nonlinear Principal Manifolds”: https://en.wikipedia.org/wiki/Principal_component_analysis#/media/File:Elmap_breastcancer_wiki.png
    • “Teamwork”: https://unsplash.com/photos/g1Kr4Ozfoac
    • “Smile”: https://unsplash.com/photos/4K2lIP0zc_k
