Introduction to Data Science for PHP Users

PHP Conference 2013: "Introduction to Data Science for PHPers" #phpcon2013

Sotaro Karasawa

September 14, 2013

Transcript

  1. Crocos, Inc. Sotaro Karasawa @sotarok http://facebook.com/sotarok Introduction to Data Science for PHPers #phpcon2013 PHP Conference 2013

  2. About me: Sotaro Karasawa (@sotarok), d.hatena.ne.jp/sotarok, Crocos Inc. PHP, Git, TD, Red Bull

  3. "Perfect PHP" (Gijutsu-Hyohron). Of course everyone already owns a copy, right!? ←

  4. Data science

  5. For the details, see "Data Scientist Yosei Dokuhon" (Data Scientist Training Reader), Gijutsu-Hyohron. http://www.amazon.co.jp/dp/…

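  6. Data science: business understanding, data understanding, data extraction, data processing, modeling, effect verification, service implementation. Source: "Data Scientist Yosei Dokuhon", chapter "The Data Science Process"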

  7. Data science: analyze and model accumulated data to obtain the metrics that matter for running the business, and repeat

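  8. Data science: analyze and model accumulated data to obtain the metrics that matter for running the business, and repeat. There is a lot you have to do; the range of knowledge required is broad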

  9. Start from the bare minimum, from wherever it is easy to begin: let's take the first step

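  10. Data science: business understanding, data understanding, data extraction, data processing, modeling, effect verification, service implementation. Source: "Data Scientist Yosei Dokuhon", chapter "The Data Science Process"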

  11. For PHPers and web applications, what is "data"?

  12. For PHPers and web applications, what is "data"? Databases. Logs.

  13. This talk is about logs

  14. How to collect large volumes of application logs, and how to aggregate them

  15. With that in mind, today's agenda: the pains of log collection and analysis; log collection in PHP applications; analysis

  16. The pains of log collection and analysis

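  17. The never-ending pains of log collection and analysis. Large volumes of data: how to collect it, where to store it, how to retrieve it, how to aggregate it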

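  18. The never-ending pains of log collection and analysis. Large volumes of data: how to collect it, where to store it, how to retrieve it, how to aggregate it. Network bandwidth, disk capacity, big-data processing systems, processing time
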
  19. http://www.treasure-data.com

  20. [Architecture diagram; labels only] TD, Web Server, Web Server, fluentd, S3, Hadoop, Client, Hive, MySQL etc..., Result

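  21. [Architecture diagram; labels only] TD, Web Server, Web Server, fluentd, S3, Hadoop, Client, Hive, MySQL etc..., Result. Data accumulates on their side; when you submit a query, Hadoop starts up over there and returns the result
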
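  22. When moving ahead with log analysis: the troublesome parts (data collection, storage, and processing) → TD takes care of them. You can commit to the essential work: designing and implementing what data to collect and how to aggregate it!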

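  23. How Crocos uses logs: • application logs • analysis based on Facebook attribute information • execution counts and execution times of key actions • transaction counts, by attribute and by referral path • event logs • shares to social networks • opening and closing of modals, etc. • various other things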

  24. Log collection in PHP applications

  25. What kind of application logs; basic log design

  26. What logs are you collecting?

  27. Web server logs

  28. When people say "logs", they mean web server logs. Even the Treasure Data tutorial uses Apache logs http://docs.treasure-data.com/articles/quickstart

  29. But what we really want is

  30. What kind of user? On what device? From where? When did they do what? Which button did they click, or tap?

  31. Application logs

  32. What kind of user? → user registration data. On what device? From where? → UA, GEO. When did they do what? → URI, action

  33. How to collect application logs

  34. Before that, a quick word about schemaless logs

  35. What is a schemaless log? A log without a schema

  36. Log schemas until now → for example, TSV

  37. Schema: column 1 time, column 2 status, column 3 uri, column 4 user_id (sample row: … hoge …)

  38. foreach (file('app.log') as $line) { $column = explode("\t", trim($line)); $time = $column[0]; $status = $column[1]; ... } * In practice nobody wants to do this in PHP, so you would use sed or awk

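  39. Fields are hard to understand. Schema changes are difficult. Accidents caused by analysts and collectors understanding the data differently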

  40. TD's logs (or rather, fluentd's): JSON { "time":1373876885, "status":200, "uri":"/52495/facebook", "session_id":"kn6avn2fuh21r25a65mgm3rjh3", "fb_id":"7c40c5dd2e55cde37a8c40ed80e1", ... }

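
To make the contrast with the positional TSV parsing on slide 38 concrete, here is a minimal sketch of reading the same kind of record when each log line is one JSON object. The helper name `parseLogLine` and the shortened sample line are assumptions for illustration, not code from the talk.

```php
<?php
// Hypothetical helper: decode one JSON log line into an associative array.
// Fields are looked up by name, so adding or reordering fields does not
// break existing readers (unlike positional TSV columns).
function parseLogLine(string $line): ?array
{
    $record = json_decode(trim($line), true);
    return is_array($record) ? $record : null;
}

// Sample line modeled on the record on slide 40 (values shortened).
$line = '{"time":1373876885,"status":200,"uri":"/52495/facebook"}';
$record = parseLogLine($line);

echo $record['status'], "\n";             // field accessed by name
echo $record['fb_id'] ?? '(none)', "\n";  // a field this record lacks is simply absent
```

A reader written this way keeps working when a later record carries extra keys, which is the property the transaction slides rely on.
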
  41. POSTing logs

  42. fluent php logger: use Fluent\Logger\FluentLogger; $logger = new FluentLogger("localhost","24224"); $logger->post( "debug.test", array("hello"=>"world") ); https://github.com/fluent/fluent-logger-php

  43. Basic log design

  44. 1. Record so that one access produces one record

  45. Hook into the response. Most frameworks have a hook point for the response event, right? In Symfony it is onKernelResponse

  46. tags: - { name: kernel.event_listener, event: kernel.response } public function onKernelResponse(FilterResponseEvent $event) { $request = $event->getRequest(); $response = $event->getResponse(); // build some array $data = $this->onAccess($request, $response); // log data $this->logger->post("access",$data); } * In practice it is set up so that multiple Listeners and Loggers can be registered

  47. 2. Decide on a basic schema

  48. Schemaless or not, you need to know what kind of logs you are handling: if a field means something different in every record, it is meaningless

  49. Decide on a basic schema: time, status, uri, ua, referrer, ... Matching LTSV-style names may make it easier to understand

  50. Not only what is in the web server log: app, route, controller, process_time, device, ... Things like the routing name and controller name inside the framework (even if the uri is noisy, you can aggregate by routing name)

  51. 3. Denormalize the attributes the application knows about and include them in the record

  52. A denormalized record: session_id, user_id, gender, age, device, ...

  53. Why denormalize: so you can apply aggregate functions without a JOIN. Hadoop can do JOINs too, but this way there are fewer steps, so it is fast and simple

  54. By the way, it is a good idea to hash user_id, session_id, and the like. * Out of consideration for privacy, just in case
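
The three design points so far (one record per access, a basic schema, denormalized and hashed attributes) can be sketched as a single record builder. Everything here is an assumption for illustration: the function names `hashId` and `buildAccessRecord`, the plain arrays standing in for framework request and user objects, and the choice of salted SHA-256 (the talk only says to hash the ids).

```php
<?php
// Hypothetical salt; a real application would load this from configuration.
const LOG_HASH_SALT = 'change-me';

// Hash an identifier so the raw value never reaches the log store.
function hashId(string $id): string
{
    return hash('sha256', LOG_HASH_SALT . $id);
}

// Build one denormalized record for a single access.
function buildAccessRecord(array $request, array $user): array
{
    return [
        // basic schema (slide 49)
        'time'         => $request['time'],
        'status'       => $request['status'],
        'uri'          => $request['uri'],
        'ua'           => $request['ua'],
        'referrer'     => $request['referrer'],
        // application-level fields (slide 50)
        'route'        => $request['route'],
        'process_time' => $request['process_time'],
        // denormalized user attributes (slide 52), ids hashed (slide 54)
        'session_id'   => hashId($request['session_id']),
        'user_id'      => hashId($user['id']),
        'gender'       => $user['gender'],
        'age'          => $user['age'],
    ];
}

$record = buildAccessRecord(
    ['time' => 1373876885, 'status' => 200, 'uri' => '/52495/facebook',
     'ua' => 'ExampleBrowser/1.0', 'referrer' => '', 'route' => 'crocos_index',
     'process_time' => 0.12, 'session_id' => 'example-session'],
    ['id' => '42', 'gender' => 'female', 'age' => 30]
);
```

Because gender and age travel with every record, a query such as the GROUP BY on slide 58 needs no JOIN against a user table.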

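  55. To summarize: 1. Record so that one access produces one record. 2. Decide on a basic schema. 3. Denormalize the attributes the application knows about and include them in the record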

  56. Once you get this far, analysis is already possible

  57. Analysis example: SELECT AVG(v['process_time']) FROM access WHERE v['route'] = 'crocos_index'

  58. Analysis example: SELECT v['gender'], COUNT(*) FROM access GROUP BY v['gender'] Glad we denormalized!

  59. Analysis example, also useful for investigating errors: SELECT v['route'], v['status'], v['ua'] FROM access WHERE v['user_id'] = 'xxx'

  60. * Date handling is omitted because it would get long. In reality you would GROUP BY day or narrow the range with a WHERE clause
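
As a rough illustration of the per-day grouping that the slides leave out, the following sketch counts records per UTC day and gender over an in-memory array. It is a local stand-in written in PHP, not the Hive query used in the talk; the function name `countByDayAndGender` and the sample records are assumptions.

```php
<?php
// Count records per (UTC day, gender): a local stand-in for a per-day GROUP BY.
function countByDayAndGender(array $records): array
{
    $counts = [];
    foreach ($records as $r) {
        $day = gmdate('Y-m-d', $r['time']);      // Unix timestamp -> UTC date
        $key = $day . "\t" . $r['gender'];       // composite grouping key
        $counts[$key] = ($counts[$key] ?? 0) + 1;
    }
    ksort($counts);                              // stable, readable ordering
    return $counts;
}

// Hypothetical sample records; the second day is exactly 86400 seconds later.
$counts = countByDayAndGender([
    ['time' => 1373876885, 'gender' => 'female'],
    ['time' => 1373876900, 'gender' => 'female'],
    ['time' => 1373876885 + 86400, 'gender' => 'male'],
]);
```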

  61. Example use of schemaless logs: transactions

  62. So, logs with a basic schema have started to accumulate

  63. We want to record things that carry special meaning, such as the success of an action

  64. Transactions. uri and route: they tell you that a request arrived. But whether it really succeeded is something only the application knows

  65. This is where schemaless comes in

  66. Basic schema: time, status, uri, ua, referrer, ... Additional schema: this and that. You can give specific records a special meaning, and without affecting any other records

  67. Transactions: key_action, key_attr_*

  68. Transactions: key_action. shop:buy:completed = app:action:state. * This example means "purchase completed"

  69. Transactions: key_attr_*. Put in any additional information related to the transaction. The schema differs for each key_action

  70. Transaction example: key_action = shop:buy:completed, key_attr_item_id = xxxxx, key_attr_ref = fb_share
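
A minimal sketch of how such transaction fields might be attached to an ordinary access record before it is posted. The helper name `withTransaction` and the base record (including its uri) are hypothetical; only the key_action / key_attr_* naming comes from the slides.

```php
<?php
// Hypothetical helper: attach transaction fields to an access record.
// Attribute names are prefixed with "key_attr_", so each key_action can
// carry its own extra fields without touching any other record.
function withTransaction(array $record, string $keyAction, array $attrs): array
{
    $record['key_action'] = $keyAction;
    foreach ($attrs as $name => $value) {
        $record['key_attr_' . $name] = $value;
    }
    return $record;
}

// Hypothetical base record for a request that completed a purchase.
$base = ['time' => 1373876885, 'status' => 200, 'uri' => '/shop/buy'];
$record = withTransaction($base, 'shop:buy:completed', [
    'item_id' => 'xxxxx',
    'ref'     => 'fb_share',
]);
```

Records without a transaction keep only the basic schema, so existing queries over time, status, or uri are unaffected.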

  71. Transaction analysis example: SELECT item_id, ref, COUNT(*) FROM access WHERE key_action = 'shop:buy:completed' GROUP BY item_id, ref * v[] is omitted to save characters

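  72. Transaction analysis, use case: record the access source for each campaign, then find the most effective campaign from the number of successful transactions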

  73. NEXT STEP

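  74. From the aggregated results: • statistical analysis methods • modeling. Compute the metrics that are critical to the business and establish an improvement process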

  75. Summary

  76. Collecting and analyzing logs is hard → use Fluentd and Hadoop → use Treasure Data. What logs should you collect? → denormalized logs, one record per access → design the log format itself → make use of schemaless

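  77. Finally: We are hiring! Authors of "Perfect PHP", a former PHP Conference chair, a former himote (unpopular guy), a former dora-musume (conference gong girl): Crocos is the only place you can work with them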

  78. None