Data Science at Scale @ Barricade.io

This talk describes the challenges with data science and how we run data analysis at scale at https://Barricade.io

David Coallier

November 04, 2015

Transcript

  1. Data Science @ Scale

  2. @davidcoallier Part of an amazing team at Barricade.io

  3. Data Science is Hard

  4. Data Hacking is “Easy”

  5. Data Analysis is “Easy”

  6. Data Expertise is “Easy”

  7. Got all? Having the three is real hard!

  9. Is that it? Well don’t forget your purpose.

  10. You are not an economist. /ɪˈkɒnəmɪst/: Someone with all the answers, and none of the questions.
  11. The Data Scientific Method

  12. Find a question.

  13. Use the data you have

  14. Features & Tests

  15. Analyse Results You will be sad.

  16. Conversate Talk about your findings.

  17. Good Chats Imply egoless and collaborative data scientists.

  18. Recap.

  19. 1. Hacking 2. Maths & Stats 3. Expertise

  20. And

  21. 1. Question 2. Be Pragmatic 3. Features 4. Analyse 5. Share.
  22. A team! Rarely a single-person effort.

  23. An Example Fraud Prevention — Business Prevention

  24. I knew better. Obviously… duh

  25. We didn’t share. Science has historically been shared.

  26. Not with p-values

  27. Empathise. Use human language, not lingo.

  28. For us at Barricade

  30. Doing this at scale is hard.

  31. We’re still small About a billion data points a day.

  32. Humble Beginnings Typically… a Queue and an API.

  33. This had issues. Hard to scale, hard to decouple, etc.

  34. Enter the Lambda Architecture.

  39. Speed Layer

  41. Batch Layer

  43. Speed Layer: U new behaviour from new data
    Batch Layer: All classified behaviour since T
  44. Serving Layer

  46. Speed Layer: U new behaviour from new data
    Batch Layer: All classified behaviour since T
    Serve Layer: Batch Layer U Speed Layer
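As a rough illustration of the three layers on slide 46, the sketch below reads "Batch Layer U Speed Layer" as a set union. The event shape `(timestamp, label)`, the function names, and the `last_batch_run` cut-off are all assumptions made for this example; the talk does not show Barricade's actual implementation.

```python
from typing import List, Set, Tuple

# Hypothetical event shape: (timestamp, behaviour label).
Event = Tuple[int, str]


def batch_view(events: List[Event], t: int, last_batch_run: int) -> Set[str]:
    """Batch layer: all classified behaviour since T, as of the last batch run."""
    return {label for ts, label in events if t <= ts <= last_batch_run}


def speed_view(events: List[Event], last_batch_run: int) -> Set[str]:
    """Speed layer: new behaviour from data the batch run has not seen yet."""
    return {label for ts, label in events if ts > last_batch_run}


def serving_view(events: List[Event], t: int, last_batch_run: int) -> Set[str]:
    """Serving layer: union of the batch view and the speed view."""
    return batch_view(events, t, last_batch_run) | speed_view(events, last_batch_run)


if __name__ == "__main__":
    # Made-up events, purely for illustration.
    events = [(1, "login"), (5, "sql-injection probe"), (9, "port scan")]
    print(sorted(serving_view(events, t=0, last_batch_run=5)))
```

In this reading, each batch run moves `last_batch_run` forward, so the speed layer only ever has to cover the short window the batch layer has not caught up on.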
  47. Cache Layer

  49. On Amazon AWS

  50. Identifying an Attack.

  52. Ahh! What’s that?

  53. Kafka Queue.
    Distributed messaging system
    Append-only log
    Consumers have offsets
    Partition for parallelism
    Replicate for redundancy
    Message order guaranteed, per-partition
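The properties listed on the Kafka slide can be made concrete with a small in-memory model. This is a toy sketch and not the Kafka client API: `PartitionedLog` and `Consumer` are hypothetical names, CRC32 stands in for Kafka's own key hashing, and replication is left out entirely.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple


class PartitionedLog:
    """Toy append-only log split into partitions (no replication)."""

    def __init__(self, num_partitions: int) -> None:
        self.partitions: List[List[str]] = [[] for _ in range(num_partitions)]

    def append(self, key: str, message: str) -> Tuple[int, int]:
        # The same key always maps to the same partition, so messages
        # for one key keep their order within that partition.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append(message)
        return partition, len(self.partitions[partition]) - 1


class Consumer:
    """Tracks its own read position (offset) per partition."""

    def __init__(self, log: PartitionedLog) -> None:
        self.log = log
        self.offsets: Dict[int, int] = defaultdict(int)

    def poll(self, partition: int) -> List[str]:
        records = self.log.partitions[partition]
        new_messages = records[self.offsets[partition]:]
        # Advance the offset; the log itself is never modified by reads.
        self.offsets[partition] = len(records)
        return new_messages


if __name__ == "__main__":
    log = PartitionedLog(num_partitions=4)
    partition, _ = log.append("site-a", "request 1")
    log.append("site-a", "request 2")
    consumer = Consumer(log)
    print(consumer.poll(partition))
```

Because each consumer keeps its own offsets, several consumers can read the same partition independently, and adding partitions lets more consumers work in parallel at the cost of ordering only being guaranteed within a partition.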
  59. Barricade Customer

  61. Questions?

  62. @davidcoallier @barricadeio