A Random Walk in Data Science and Machine Learning in Practice - Use Cases Seminar, MS Biz Analytics, CEU - Budapest, May 2019

szilard

May 08, 2019
Transcript

  1. A Random Walk in Data Science and Machine Learning in Practice. Szilard Pafka, PhD, Chief Scientist, Epoch (USA). CEU, Business Analytics Masters, Budapest, May 2019
  2. None
  3. Disclaimer: I am not representing my employer (Epoch) in this talk. I can neither confirm nor deny whether Epoch is using any of the methods, tools, results, etc. mentioned in this talk.
  4. None
  5. None
  6. CRISP-DM, 1999

  7.–46. None (image-only slides)
  47. Best Practices for Using Machine Learning in Businesses in 2018. Szilárd Pafka, PhD, Chief Scientist, Epoch (USA). Budapest BI Forum Conference, November 2018
  48. None
  49. Disclaimer: I am not representing my employer (Epoch) in this talk. I can neither confirm nor deny whether Epoch is using any of the methods, tools, results, etc. mentioned in this talk.
  50. https://twitter.com/baroquepasa/

  51.–55. None (image-only slides)
  56. y = f (x1, x2, ... , xn) Source: Hastie et al., ESL 2nd ed.
  57. y = f (x1, x2, ... , xn)

  58. None
  59. None
  60. Source: Yann LeCun

  61. None
  62. None
  63. 2018?

  64. 2018?

  65. #1 Use the Right Algo

  66. Source: Andrew Ng

  67.–83. None (image-only slides)
  84. *

  85. #2 Use Open Source

  86.–90. None (image-only slides)
  91. in 2006: - cost was not a factor! - data.frame - [800] packages
  92.–96. None (image-only slides)
  97. #3 Simple > Complex

  98. None
  99. 10x

  100.–107. None (image-only slides)
  108. #4 Incorporate Domain Knowledge. Do Feature Engineering (Still). Explore Your Data. Clean Your Data.
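
A minimal sketch in base R of the kind of exploration and cleaning this slide calls for; the data set and all column names are hypothetical, invented for illustration:

    # hypothetical transactional data; columns invented for illustration
    d <- read.csv("transactions.csv")
    str(d)                                   # dimensions and column types
    summary(d$amount)                        # spot outliers / impossible values
    table(d$country, useNA = "ifany")        # categorical levels and missing data
    d$amount[d$amount < 0] <- NA             # clean: negative amounts are errors here
    d$country <- as.factor(d$country)
    d$dow <- as.factor(weekdays(as.Date(d$date)))   # domain-driven feature: day of week
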
  109.–119. None (image-only slides)
  120. #5 Do Proper Validation Avoid: Overfitting, Data Leakage
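
A sketch of one common guard against the overfitting and leakage mentioned here: a time-based split instead of a random one (data and column names hypothetical):

    # order by time and hold out the newest records; a random split can leak
    # future information into training when records are time-dependent
    d <- d[order(d$date), ]
    n_train <- floor(0.8 * nrow(d))
    d_train <- d[1:n_train, ]
    d_test  <- d[(n_train + 1):nrow(d), ]
    # compute derived features (means, encodings, scalers) on d_train only,
    # then apply the fitted transformations to d_test
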

  121.–134. None (image-only slides)
  135. #6 Batch or Real-Time Scoring?
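
Slide 138 below links a Python/Flask deployment example; purely as an illustration (not from the talk), a real-time scoring endpoint in R could use the plumber package. The model file and request fields are placeholders:

    # plumber.R: expose a pre-trained model as a REST endpoint
    library(plumber)
    md <- readRDS("model.rds")        # placeholder: any model with a predict() method

    #* @post /predict
    function(req) {
      newdata <- as.data.frame(jsonlite::fromJSON(req$postBody))
      list(score = predict(md, newdata))
    }

    # run with: plumber::pr("plumber.R") |> plumber::pr_run(port = 8000)
    # batch scoring is the simpler alternative: a scheduled job scoring a file/table
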

  136. None
  137. https://medium.com/@HarlanH/patterns-for-connecting-predictive-models-to-software-products-f9b6e923f02d

  138. https://medium.com/@dvelsner/deploying-a-simple-machine-learning-model-in-a-modern-web-application-flask-angular-docker-a657db075280 your app

  139. None
  140. None
  141. R/Python: - Slow(er) - Encoding of categ. variables
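
The categorical-variable issue: the encoding used at scoring time must match training exactly. A base-R illustration (column names hypothetical):

    # factor levels at scoring time must match those seen in training, otherwise
    # predict() can fail or silently produce a wrong encoding
    train_levels <- levels(d_train$country)
    new_obs$country <- factor(new_obs$country, levels = train_levels)
    # unseen levels become NA here; decide explicitly how to handle them
    predict(md, new_obs)
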

  142. #7 Do Online Validation as Well
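
Offline AUC is not the end of the story; online, one compares the business metric between users scored by the model and a control group, e.g. with a simple two-proportion test (the counts below are made up):

    # hypothetical A/B results: conversions out of users exposed
    conversions <- c(control = 532, model = 601)
    exposed     <- c(control = 10000, model = 10000)
    prop.test(conversions, exposed)   # is the observed lift distinguishable from noise?
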

  143. None
  144. https://www.oreilly.com/ideas/evaluating-machine-learning-models/page/2/orientation

  145. https://www.oreilly.com/ideas/evaluating-machine-learning-models/page/2/orientation

  146. https://www.oreilly.com/ideas/evaluating-machine-learning-models/page/2/orientation https://www.slideshare.net/FaisalZakariaSiddiqi/netflix-recommendations-feature-engineering-with-time-travel

  147. #8 Monitor Your Models

  148. None
  149. https://www.retentionscience.com/blog/automating-machine-learning-monitoring-rs-labs/

  150. https://www.retentionscience.com/blog/automating-machine-learning-monitoring-rs-labs/
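
A sketch of what model monitoring can look like: recompute the evaluation metric on fresh labeled data over time (pROC is my choice of package here; the data frame and columns are placeholders):

    library(pROC)
    # scored: one row per prediction, with eventual outcome y and model score
    scored$week <- format(as.Date(scored$date), "%Y-%U")
    weekly_auc <- sapply(split(scored, scored$week),
                         function(s) as.numeric(auc(s$y, s$score)))
    plot(weekly_auc, type = "b")      # a sustained drop flags drift: investigate/retrain
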

  151. None
  152. 20% 80% (my guess)

  153. 20% 80% (my guess)

  154. #9 Business Value Seek / Measure / Sell

  155.–159. None (image-only slides)
  160. #10 Make it Reproducible

  161.–169. None (image-only slides)
  170. Cloud (servers)

  171. ML training: lots of CPU cores, lots of RAM, limited time
  172. ML training: lots of CPU cores, lots of RAM, limited time. ML scoring: separated servers
  173. ML (cloud) services (MLaaS)

  174. None
  175. “people that know what they’re doing just use open source [...] the same open source tools that the MLaaS services offer” - Bradford Cross
  176. Kaggle

  177. None
  178. already pre-processed data; less domain knowledge (or deliberately hidden); AUC 0.0001 increases "relevant"; no business metric; no actual deployment; models too complex; no online evaluation; no monitoring; data leakage
  179. Tuning and Auto ML

  180. Ben Recht, Kevin Jamieson: http://www.argmin.net/2016/06/20/hypertuning/
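
The linked posts argue that plain random search is a hard-to-beat baseline for hyperparameter tuning. A hedged sketch of random search with the h2o package (one option among many; the frame and feature names are placeholders):

    library(h2o)
    h2o.init()
    grid <- h2o.grid("gbm",
        x = predictors, y = "y", training_frame = dx_train,
        hyper_params = list(max_depth  = c(4, 6, 10, 16),
                            learn_rate = c(0.01, 0.03, 0.1),
                            ntrees     = c(100, 300, 1000)),
        search_criteria = list(strategy = "RandomDiscrete", max_models = 20))
    h2o.getGrid(grid@grid_id, sort_by = "auc", decreasing = TRUE)
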

  181. GPUs

  182. Aggregation: 100M rows, 1M groups; Join: 100M rows x 1M rows; time [s] (benchmark charts)
  183. Aggregation: 100M rows, 1M groups; Join: 100M rows x 1M rows; time [s] “Motherfucka!”
  184. None
  185. API and GUIs

  186. None
  187. None
  188. AI?

  189.–191. None (image-only slides)
  192. How to Start?

  193. None
  194. None
  195. Better than Deep Learning: Gradient Boosting Machines (GBM). Szilard Pafka, PhD, Chief Scientist, Epoch (USA). DataWorks Summit, Barcelona, Spain, March 2019
  196. None
  197. Disclaimer: I am not representing my employer (Epoch) in this talk. I can neither confirm nor deny whether Epoch is using any of the methods, tools, results, etc. mentioned in this talk.
  198. Source: Andrew Ng

  199. Source: Andrew Ng

  200. Source: Andrew Ng

  201.–204. None (image-only slides)
  205. Source: https://twitter.com/iamdevloper/

  206. None
  207. None
  208. ...

  209. None
  210. None
  211. http://www.cs.cornell.edu/~alexn/papers/empirical.icml06.pdf http://lowrank.net/nikos/pubs/empirical.pdf

  212.–226. None (image-only slides)
  227. Source: Hastie et al., ESL 2nd ed.

  228. Source: Hastie et al., ESL 2nd ed.

  229. Source: Hastie et al., ESL 2nd ed.

  230. Source: Hastie et al., ESL 2nd ed.

  231. None
  232. I usually use other people’s code [...] I can find open source code for what I want to do, and my time is much better spent doing research and feature engineering -- Owen Zhang
  233.–238. None (image-only slides)
  239. 10x

  240. None
  241. None
  242. 10x

  243.–259. None (image-only slides)
  260. http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf

  261. http://www.argmin.net/2016/06/20/hypertuning/

  262.–267. None (image-only slides)
  268. no-one is using this crap

  269. None
  270. None
  271. More:

  272. None
  273. Machine Learning Software in Practice: Quo Vadis? Szilárd Pafka, PhD, Chief Scientist, Epoch. KDD Conference, Applied Data Science Track, Invited Talk. August 2017, Halifax, Canada
  274. Machine Learning Software in Practice: Quo Vadis? Szilárd Pafka, PhD, Chief Scientist, Epoch. KDD Conference, Applied Data Science Track, Invited Talk. August 2017, Halifax, Canada. SOME OF
  275. None
  276. None
  277. ML Tools Mismatch: - What practitioners wish for - What they truly need
  278. ML Tools Mismatch: - What practitioners wish for - What they truly need - What’s available - What’s advertised - What developers/researchers focus on
  279. This talk is mostly in the context of (binary) classification

  280. Warning: This talk is a series of rants observations with the aim to provoke encourage thinking and constructive discussions about topics of impact on our industry.
  281. Warning: This talk is a series of rants observations with the aim to provoke encourage thinking and constructive discussions about topics of impact on our industry. Rantometer:
  282. Our tools are optimized for what use cases?

  283. Is building this the best allocation of our developer resources?

  284. Efficiency for users during usage?

  285. None
  286. None
  287. Big Data

  288.–305. None (image-only slides)
  306. Machine Learning Tools Speed, Memory, Accuracy

  307. None
  308. I usually use other people’s code [...] I can find open source code for what I want to do, and my time is much better spent doing research and feature engineering -- Owen Zhang
  309. binary classification, 10M records numeric & categorical features, non-sparse

  310. http://www.cs.cornell.edu/~alexn/papers/empirical.icml06.pdf http://lowrank.net/nikos/pubs/empirical.pdf

  311. http://www.cs.cornell.edu/~alexn/papers/empirical.icml06.pdf http://lowrank.net/nikos/pubs/empirical.pdf

  312.–317. None (image-only slides)
  318. EC2

  319. n = 10K, 100K, 1M, 10M, 100M; training time, RAM usage, AUC, CPU % by core; read data, pre-process, score test data
  320. n = 10K, 100K, 1M, 10M, 100M; training time, RAM usage, AUC, CPU % by core; read data, pre-process, score test data
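
The measurement loop behind such benchmarks can be sketched like this, with h2o as an example tool (d, predictors and dx_test are placeholders; RAM and CPU usage were tracked outside R in the actual benchmarks):

    for (n in c(1e4, 1e5, 1e6, 1e7)) {
      dx <- as.h2o(d[1:n, ])
      t  <- system.time(md <- h2o.gbm(x = predictors, y = "y", training_frame = dx))
      auc <- h2o.auc(h2o.performance(md, newdata = dx_test))
      cat(sprintf("n=%g  time=%.1fs  AUC=%.4f\n", n, t[["elapsed"]], auc))
    }
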
  321.–327. None (image-only slides)
  328. 10x

  329.–333. None (image-only slides)
  334. http://datascience.la/benchmarking-random-forest-implementations/#comment-53599

  335. None
  336. None
  337. Best linear: 71.1

  338. None
  339. None
  340. learn_rate = 0.1, max_depth = 6, n_trees = 300; learn_rate = 0.01, max_depth = 16, n_trees = 1000
  341.–343. None (image-only slides)
  344. Deep Learning AI Oh my... OUT

  345. Distributed ML OUT

  346. Multicore ML

  347. None
  348. None
  349. 1M: CPU cache effects

  350. (lightgbm 10M)

  351. 16 cores vs 1: 16 cores:

  352. GPUs

  353. None
  354. Aggregation: 100M rows, 1M groups; Join: 100M rows x 1M rows; time [s] (benchmark charts)
  355. None
  356. Benchmarks

  357. None
  358. None
  359. Wishlist: - more datasets (10-100, structure, size) - automation: upgrading tools, re-running ($$)
  360. Wishlist: - more datasets (10-100, structure, size) - automation: upgrading tools, re-running ($$) - more algos, more tools (OS/commercial?) - (even) more tuning of parameters
  361. Wishlist: - more datasets (10-100, structure, size) - automation: upgrading tools, re-running ($$) - more algos, more tools (OS/commercial?) - (even) more tuning of parameters - BaaS? crowdsourcing (data, tools/tuning)? - other ML problems (recsys, NLP…)
  362. so far we discussed performance + (some) system architecture, but for training only
  363. None
  364. APIs (and GUIs) OUT

  365. Cloud (MLaaS) OUT

  366. Real-Time Scoring

  367. None
  368. R/Python: - Slow(er) - Encoding of categ. variables

  369. Kaggle OUT

  370. Tuning & AutoML OUT

  371. Model Understanding, Accountability

  372. Evaluation Metrics OUT

  373. Machine Learning with H2O.ai. Szilárd Pafka, PhD, Chief Scientist, Epoch. LA H2O Meetup @ AT&T, January 2017
  374. Machine Learning with H2O.ai. Szilárd Pafka, PhD, Chief Scientist, Epoch. LA H2O Meetup @ AT&T, January 2017. SOME OF
  375. None
  376. Supervised Learning: y = f(x). train: “learn” f from data X (n*p), y (n). score: f(x’). algos: k-NN, LR, NB, RF, GBM, SVM, NN, DL… goal: max accuracy measure (on new data). f ∈ F(θ); min_θ ( L(y, f(x,θ)) + R(θ) ) on the train set; evaluate on a separate test set / cross validation
  377. Structure/Hyperparameters λ: min_θ ( L(y, f(x,θ[,λ])) + R(θ,λ) ); often λ ~ capacity/complexity
  378. Model selection: vary λ and keep the model with the best accuracy on a validation set; evaluate the final model on a test set / cross validation
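
Slides 376-378 state the whole procedure compactly; written out in LaTeX with the same notation:

    \hat{\theta}(\lambda) = \arg\min_{\theta} \; L\big(y, f(x; \theta, \lambda)\big) + R(\theta, \lambda)
    % \lambda (structure/hyperparameters) is held fixed while training \theta;
    % model selection: choose \lambda by accuracy on a validation set,
    % then evaluate the chosen model once on a separate test set (or cross-validate)
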
  379. overfitting

  380. http://datascience.la/meetup-summary-winning-data-science-competitions/

  381. http://www.cs.cornell.edu/~alexn/papers/empirical.icml06.pdf http://lowrank.net/nikos/pubs/empirical.pdf

  382. http://www.cs.cornell.edu/~alexn/papers/empirical.icml06.pdf http://lowrank.net/nikos/pubs/empirical.pdf

  383. http://datascience.la/meetup-summary-winning-data-science-competitions/

  384. None
  385. Gradient Boosting Machines (chart: data size [M] vs. training time [s], 10x)

  386. None
  387. Disclaimer: I’m not affiliated with H2O.ai. It’s just that in my opinion H2O is a machine learning tool with several advantages. There are many other good tools (and many more awful ones).
  388. - high-performance implementation of best algos (RF, GBM, NN etc.) - R, Python etc. interfaces, easy to use API
  389. - high-performance implementation of best algos (RF, GBM, NN etc.) - R, Python etc. interfaces, easy to use API - open source - advisors: Hastie, Tibshirani
  390. - high-performance implementation of best algos (RF, GBM, NN etc.) - R, Python etc. interfaces, easy to use API - open source - advisors: Hastie, Tibshirani - Java, but C-style memalloc, by Java gurus - distributed, “big data”
  391. - high-performance implementation of best algos (RF, GBM, NN etc.) - R, Python etc. interfaces, easy to use API - open source - advisors: Hastie, Tibshirani - Java, but C-style memalloc, by Java gurus - distributed, “big data” - many knobs/tuning, model evaluation, cross validation, model selection (hyperparameter search)
  392. - high-performance implementation of best algos (RF, GBM, NN etc.) - R, Python etc. interfaces, easy to use API - open source - advisors: Hastie, Tibshirani - Java, but C-style memalloc, by Java gurus - distributed, “big data” - many knobs/tuning, model evaluation, cross validation, model selection (hyperparameter search)
  393. install.packages("h2o") http://www.h2o.ai/
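
A minimal quickstart along the lines of the linked gist; the file paths and target column name are placeholders, and the hyperparameters are the first setting from slide 340:

    library(h2o)
    h2o.init(max_mem_size = "8g")              # local; the same code runs on a cluster
    dx_train <- h2o.importFile("train.csv")    # placeholder paths
    dx_test  <- h2o.importFile("test.csv")
    md <- h2o.gbm(x = setdiff(names(dx_train), "y"), y = "y",
                  training_frame = dx_train,
                  ntrees = 300, max_depth = 6, learn_rate = 0.1)
    h2o.auc(h2o.performance(md, newdata = dx_test))
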

  394. https://gist.github.com/szilard/b87233bbf41a4b366c26eede7bb1a0f3 Laptop / 1 server / cluster

  395.–397. None (image-only slides)
  398. No need for manual 1-hot encoding of categorical variables
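
What this means in code: H2O consumes factor (enum) columns directly, whereas many R modeling tools require an explicit one-hot step (column names hypothetical):

    # H2O: just mark the column as categorical
    dx_train$country <- as.factor(dx_train$country)
    md <- h2o.gbm(x = c("amount", "country"), y = "y", training_frame = dx_train)
    # compare: elsewhere in R one often needs X <- model.matrix(~ . - 1, data = d)
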

  399.–415. None (image-only slides)
  416. https://gist.github.com/szilard/b87233bbf41a4b366c26eede7bb1a0f3

  417. None
  418. Some Updates

  419.–423. None (image-only slides)
  424. A Few More Thoughts

  425.–437. None (image-only slides)