Behavioral Analytics - Understanding the "why" and "how" of your users.

Behavioral Analytics - Understanding the "why" and "how" of your users.

The slides for my talk at the Boulder/Denver Big Data Meetup on April 24th, 2013.

Video of the talk can be found on YouTube: http://www.youtube.com/watch?v=STtjSm-5A-8

6c76488dff9b5d9a872dff88f008f88e?s=128

benbjohnson

April 19, 2013
Tweet

Transcript

  1. Behavioral Analytics U!"#r$%&!"'!( %)# H*w & W)+ *f U$#r$ Ben

    Johnson, Skyland Labs
  2. My Background

  3. Former Oracle DBA Data Visualization Behavioral Analytics

  4. W)&% '$ B#)&v'*r&, A!&,+%'-$?

  5. W)&% '$ B#)&v'*r&, A!&,+%'-$? Let’s start with an example.

  6. Our Example: You are a SaaS company with a web

    app.
  7. Our Example: You are a SaaS company with a web

    app. SQL database?
  8. Our Example: You are a SaaS company with a web

    app. SQL database? check.
  9. Our Example: You are a SaaS company with a web

    app. SQL database? check. Tons of Logs?
  10. Our Example: You are a SaaS company with a web

    app. SQL database? check. Tons of Logs? hell yeah!
  11. Let’s use SQL to understand our users!

  12. What are users doing?

  13. What are users doing? SELECT url, count(*) FROM page_views GROUP

    BY url;
  14. What are users doing? 0 12.5 25 37.5 50 /index.html

    /about.html /pricing.html /signup.html
  15. Who is doing it?

  16. Who is doing it? SELECT url, gender, count(*) FROM page_views

    INNER JOIN users GROUP BY url, gender;
  17. Who is doing it? 0 15 30 45 60 /index.html

    /about.html /pricing.html /signup.html Men Women
  18. Where are they doing it?

  19. Where are they doing it? SELECT url, city, count(*) FROM

    page_views INNER JOIN users GROUP BY url, city;
  20. Where are they doing it? 0 15 30 45 60

    /index.html /about.html /pricing.html /signup.html San Francisco Denver London
  21. When are they doing it?

  22. When are they doing it? SELECT url, DATE_FORMAT(‘%Y-%m), count(*) FROM

    page_views GROUP BY url, ...;
  23. When are they doing it? 0 10 20 30 40

    /index.html /about.html /pricing.html /signup.html Jan 2013 Feb 2013 Mar 2013 April 2013
  24. SQL works great for these questions:

  25. SQL works great for these questions: WHAT

  26. SQL works great for these questions: WHAT WHO

  27. SQL works great for these questions: WHO WHEN WHAT

  28. SQL works great for these questions: WHEN WHERE WHO WHAT

  29. SQL analyzes state.

  30. How do we understand actions?

  31. How are they doing it?

  32. How are they doing it? “H*w” .#&!$ /!*w'!( w)&% )&pp#!#"

    "0r'!( &! &-%'*! *r %&$/.
  33. How are they doing it? “H*w” .#&!$ /!*w'!( w)&% )&pp#!#"

    "0r'!( &! &-%'*! *r %&$/. 1&.p,#: H*w "* 0$#r$ -)#-/*0%?
  34. How are they doing it? SELECT t1.url, t2.url AS next_url,

    count(*) FROM page_views t1 LEFT JOIN ( SELECT url FROM page_views WHERE id > t1.id AND user_id = t1.user_id ORDER BY id ASC LIMIT 1) AS t2 WHERE t1.url = ‘/checkout.html’ GROUP BY t1.url, t2.url;
  35. How are they doing it? 0 37.5 75 112.5 150

    /enter_cc_info.html /product.html /index.html NULL
  36. How are they doing it? 0 37.5 75 112.5 150

    /enter_cc_info.html /product.html /index.html NULL Ab&!"*!#" S)*pp'!( C&r%$!
  37. Why are they doing it?

  38. Why are they doing it? “W)+” .#&!$ /!*w'!( w)&% -&.#

    b#f*r# &! &-%'*!.
  39. Why are they doing it? “W)+” .#&!$ /!*w'!( w)&% -&.#

    b#f*r# &! &-%'*!. 1&.p,#: W)+ "* 0$#r$ -&!-#, &--*0!%$?
  40. Why are they doing it? SELECT t1.url, t2.url AS prev_url,

    count(*) FROM page_views t1 LEFT JOIN ( SELECT url FROM page_views WHERE id < t1.id AND user_id = t1.user_id ORDER BY id DESC LIMIT 1) AS t2 WHERE t1.url = ‘/cancel_account.html’ GROUP BY t1.url, t2.url;
  41. Why are they doing it? 0 20 40 60 80

    /help.html /index.html /404.html /contact_us.html
  42. Why are they doing it? 0 20 40 60 80

    /help.html /index.html /404.html /contact_us.html Y*0r “)#,p” p&(#$ "*!’% )#,p.
  43. Why are they doing it? 0 20 40 60 80

    /help.html /index.html /404.html /contact_us.html U$#r$ -*.'!( %* +*0r $'%# %* q0'%
  44. Why are they doing it? 0 20 40 60 80

    /help.html /index.html /404.html /contact_us.html Y*0r w#b $'%# '$ br*/#!
  45. Why are they doing it? 0 20 40 60 80

    /help.html /index.html /404.html /contact_us.html Y*0r -0$%*.#r $0pp*r% $0-/$.
  46. B#)&v'*r&, &!&,+%'-$ '$ &b*0% 0!"#r$%&!"'!( %)# “)*w” &!" %)# “w)+”

  47. Now let’s use Hadoop to understand our users

  48. Building Funnel Analysis in Hadoop

  49. Building Funnel Analysis in Hadoop Step 1: Parse logs

  50. Building Funnel Analysis in Hadoop Step 1: Parse logs Step

    2: Group by user
  51. Building Funnel Analysis in Hadoop Step 1: Parse logs Step

    2: Group by user Step 3: Sessionize
  52. Building Funnel Analysis in Hadoop Step 1: Parse logs Step

    2: Group by user Step 3: Sessionize Step 4: Apply Pattern Matching
  53. Building Funnel Analysis in Hadoop Step 1: Parse logs Step

    2: Group by user Step 3: Sessionize Step 4: Apply Pattern Matching Step 5: Aggregate
  54. Building Funnel Analysis in Hadoop Step 1: Parse logs Step

    2: Group by user Step 3: Sessionize Step 4: Apply Pattern Matching Step 5: Aggregate Step 6: Pull hair repeatedly.
  55. Why not make a system & language for describing behavior?

  56. Behavior is...

  57. A-%'*!$

  58. A-%'*!$ & S%&%#

  59. A-%'*!$ + 2.# & S%&%#

  60. You need a way to describe time:

  61. “WHEN”

  62. “WHEN” Let’s see some examples

  63. Simple Hit Count H*w .&!+ p#*p,# -,'-/ *! .+ )*.#

    p&(#?
  64. Simple Hit Count WHEN action == ‘/index.html’ THEN SELECT count();

    END H*w .&!+ p#*p,# -,'-/ *! .+ )*.# p&(#?
  65. Simple Hit Count H*w .&!+ p#*p,# -,'-/ *! .+ )*.#

    p&(#? 0 25 50 75 100 /index.html
  66. Simple Hit Count w/ Demographics H*w .&!+ .#! & w*.#!

    -,'-/ *! .+ )*.# p&(#?
  67. Simple Hit Count w/ Demographics WHEN action == ‘/index.html’ THEN

    SELECT count() GROUP BY gender; END H*w .&!+ .#! & w*.#! -,'-/ *! .+ )*.# p&(#?
  68. Simple Hit Count w/ Demographics 0 25 50 75 100

    Men Women H*w .&!+ .#! & w*.#! -,'-/ *! .+ )*.# p&(#?
  69. Answering the “How” H*w "* 0$#r$ -)#-/*0%?

  70. Answering the “How” WHEN action == ‘/checkout.html’ THEN WHEN WITHIN

    1 STEP THEN SELECT count() GROUP BY action; END END H*w "* 0$#r$ -)#-/*0%?
  71. 0 25 50 75 100 /enter_cc.html /product.html /index.html NULL Answering

    the “How” H*w "* 0$#r$ -)#-/*0%?
  72. Answering the “How” H*w "* 0$#r$ -)#-/*0% (%w* $%#p$ '!)?

  73. Answering the “How” WHEN action == ‘/checkout.html’ THEN WHEN action

    == ‘/enter_cc.html’ WITHIN 1 STEP THEN WHEN WITHIN 1 STEP THEN SELECT count() GROUP BY action; END END END H*w "* 0$#r$ -)#-/*0% (%w* $%#p$ '!)?
  74. 0 25 50 75 100 /confirm.html NULL Answering the “How”

    H*w "* 0$#r$ -)#-/*0% (%w* $%#p$ '!)?
  75. Combine these to make a funnel:

  76. F0!!#, Q0#r+:

  77. WHEN action == ‘/checkout.html’ THEN SELECT count() INTO “step0”; WHEN

    action == ‘/enter_cc.html’ WITHIN 1 STEP THEN SELECT count() INTO “step1”; WHEN action == ‘/confirm.html’ WITHIN 1 STEP THEN SELECT count() INTO “step2”; END END F0!!#, Q0#r+:
  78. F0!!#, O0%p0%:

  79. { “step0”:{“count”:100}, “step1”:{“count”:40}, “step2”:{“count”:32} } F0!!#, O0%p0%:

  80. 0 25 50 75 100 /checkout.html /enter_cc.html /confirm.html Answering the

    “How” H*w "* 0$#r$ -)#-/*0%? (step0) (step1) (step2)
  81. 0 25 50 75 100 /checkout.html /enter_cc.html /confirm.html Answering the

    “How” H*w "* 0$#r$ -)#-/*0%? (step0) (step1) (step2)
  82. 0 25 50 75 100 /checkout.html /enter_cc.html /confirm.html Answering the

    “How” H*w "* 0$#r$ -)#-/*0%? (step0) (step1) (step2)
  83. Answering the “Why” W)+ "* 0$#r$ -&!-#, &--*0!%$?

  84. Answering the “Why” WHEN action == ‘/cancel_account’ THEN WHEN WITHIN

    -1 STEP THEN SELECT count() GROUP BY action; END END W)+ "* 0$#r$ -&!-#, &--*0!%$?
  85. Answering the “Why” WHEN action == ‘/cancel_account’ THEN WHEN WITHIN

    -1 STEP THEN SELECT count() GROUP BY action; END END W)+ "* 0$#r$ -&!-#, &--*0!%$?
  86. What can we do with “within”?

  87. Within +/- n Steps

  88. Within 1 Step

  89. Within 2 Steps

  90. Within 1..3 Steps

  91. Within -1 Steps

  92. Within -2 Steps

  93. Within +/- Sessions

  94. Within +/- Sessions

  95. Within 0 Sessions (current session only)

  96. Within 1 Session

  97. Within -1 Session

  98. W'%)'! &,$* w*r/$ *! "&%#$ Days, Weeks, Months, etc.

  99. S/+ An open source, behavioral analytics database.

  100. Why Build a Database?

  101. Data Model

  102. Performance Data Model

  103. Performance Scaling Data Model

  104. The Data Model

  105. The Data Model Relational databases look like this:

  106. The Data Model Relational databases look like this: ID name

    gender state 1 Bob M CO 2 Susy F CA users
  107. The Data Model Relational databases look like this: ID name

    gender state 1 Bob M CO 2 Susy F CA users User ID url date 1 /home ... 1 /signup ... 2 /home ... page_views
  108. The Data Model Logs look like this:

  109. The Data Model Logs look like this: User ID url

    date 1 /home ... 2 /home ... 2 /signup ... 1 /product/123 ... 3 /login ... 1 /checkout ...
  110. The Data Model Logs look like this: User ID url

    date 1 /home ... 2 /home ... 2 /signup ... 1 /product/123 ... 3 /login ... 1 /checkout ...
  111. The Data Model We think of behavior in terms of

    timelines
  112. The Data Model We think of behavior in terms of

    timelines Bob
  113. The Data Model We think of behavior in terms of

    timelines Bob /home
  114. The Data Model We think of behavior in terms of

    timelines Bob /home /product/123
  115. The Data Model We think of behavior in terms of

    timelines Bob /home /product/123 /checkout
  116. The Data Model We think of behavior in terms of

    timelines Bob /home /product/123 /checkout Susy
  117. The Data Model We think of behavior in terms of

    timelines Bob /home /product/123 /checkout Susy /signup
  118. The Data Model We think of behavior in terms of

    timelines Bob /home /product/123 /checkout Susy /signup /welcome
  119. The Data Model So that’s how Sky stores it. Bob

    /home /product/123 /checkout Susy /signup /welcome
  120. Performance

  121. Organize data how you’ll analyze it Performance

  122. Organize data how you’ll analyze it Performance Minimize data movement

  123. Organize data how you’ll analyze it Performance Minimize data movement

    Tightly pack data (so more stays in memory)
  124. Expect to query about Performance 10,000,000 #v#!%$ / -*r# /

    $#-*!"
  125. Expect to query about Performance 10,000,000 #v#!%$ / -*r# /

    $#-*!" (with linear scaling)
  126. Scaling

  127. Timelines are analyzed in isolation Scaling

  128. Timelines are analyzed in isolation Automatically shards data across cores

    Scaling
  129. Timelines are analyzed in isolation Automatically shards data across cores

    Automatically shards across nodes (v0.3.1) Scaling
  130. S/+ I!%#r!&,$ A quick peek under the hood

  131. S/+ I!%#r!&,$

  132. S/+ I!%#r!&,$ Recently ported to Go (originally C)

  133. S/+ I!%#r!&,$ Recently ported to Go (originally C) Query Engine

    still written in C
  134. S/+ I!%#r!&,$ Recently ported to Go (originally C) Query Engine

    still written in C Storage handled with LevelDB
  135. S/+ I!%#r!&,$ Recently ported to Go (originally C) Query Engine

    still written in C Storage handled with LevelDB RESTful JSON over HTTP protocol
  136. S/+ I!%#r!&,$ Loose Schema, No Limit On “Columns”

  137. S/+ I!%#r!&,$ Loose Schema, No Limit On “Columns” Types: Strings,

    Integers, Floats & Booleans
  138. S/+ I!%#r!&,$ Loose Schema, No Limit On “Columns” Types: Strings,

    Integers, Floats & Booleans Possible Future Types: Map, Array, Dates, Lat/Long
  139. S/+ I!%#r!&,$ Loose Schema, No Limit On “Columns” Types: Strings,

    Integers, Floats & Booleans Possible Future Types: Map, Array, Dates, Lat/Long Flexible Query System Built On LuaJIT
  140. O%)#r R&" S%0ff

  141. O%)#r R&" S%0ff (C*.'!( S**!)

  142. SQL-like Query Language (Currently uses a JSON query interface)

  143. Multi-node Distribution (coming in next version, v0.3.1)

  144. Predictive Behavioral Analytics

  145. Landmark

  146. Landmark Hosted Behavioral Analytics

  147. Landmark Hosted Behavioral Analytics Sign up at http://landmark.io

  148. Demo

  149. Questions? http://skydb.io Google Group: skydb Twitter: @benbjohnson ben@skylandlabs.com

  150. Image Attribution Database designed by Sergey Shmidt from The Noun

    Project Line Graph designed by Cris Dobbins from The Noun Project Computer designed by Olivier Guin from The Noun Project Database designed by Cees de Vries from The Noun Project Document designed by Piotrek Chuchla from The Noun Project Worker designed by Bart Laugs from The Noun Project People designed by Studio Het Mes from The Noun Project Clock designed by Infinity Kim from The Noun Project Question designed by Greg Pabst from The Noun Project Hammer designed by John Caserta from The Noun Project Rocket designed by James Fenton from The Noun Project Expand designed by Dmitry Baranovskiy from The Noun Project Rock n Roll designed by Cengiz SARI from The Noun Project Robot designed by Simon Child from The Noun Project Profanity designed by Juan Pablo Bravo from The Noun Project