Technologies for Developing and Operating Data-Centric Web Services

Livesense Inc.
June 26, 2018

2018/6/23 [CAMPHOR x LIVESENSE] Data Engineering in Web Service Development and Operations

Transcript

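  1. About me
     Shotaro Tanaka (@yubessy)
     Technological Marketing Department, Data Platform Group
     • 2010 - 2014 Kyoto University, Faculty of Engineering, School of Informatics and Mathematical Science, Computer Science Course
     • 2014 - 2016 Kyoto University, Graduate School of Informatics, Department of Social Informatics
     • 2016 - Livesense Inc.
       • Development and operation of the data analytics platform, Livesense Analytics
       • Development and operation of the machine learning platform, Livesense Brain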
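  2. Works
     • How to grow a data platform that stays close to the business
     • Reliability engineering for machine learning systems
     • The problems with "convenience columns" that support data analysis, and how to solve them
     • Loosely coupling machine learning algorithms and applications with a multi-container architecture
     • Operational challenges of machine learning systems and what container orchestration brings
     • The three APIs of Apache Spark: from RDD and DataFrame to Dataset
     • Trying faiss, Facebook's fast vector computation library, in a recommendation API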
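  3. For example: how Mach Baito works
     Pay-per-listing (other companies' services)
     1. An employer wants to hire part-time staff
     2. The employer posts a job ← fee charged
     Pay-per-hire (Mach Baito)
     1. An employer wants to hire part-time staff
     2. The employer posts a job ← free
     3. An applicant is hired and works the first shift ← fee charged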
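  4. When you hear "data"...
     • Big data, real-time data, ...
     • Data analysis, data science, ...
     • Artificial intelligence, collective intelligence, ...
     • Machine learning, natural language processing, ...
     • Deep learning, ...
     • Hadoop, Spark, ...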
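  5. Data handled by a Web service
     Content: information the service provides to users
     • Part-time job postings
     • Reviews of part-time workplaces
     User data: information the service receives from users
     • Desired conditions (job type, work location, ...)
     • Search keywords
     • Page access logs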
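  6. The arrival of user data
     The old Web: users followed links to reach content on their own
     • Sitemaps and link collections
     • Yahoo! directory search
     The arrival of Google Search
     • Once content is linked, Google Bot crawls it
     • Users give Google the information they want as search keywords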
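  7. Growing attention to user data
     Users now provide many kinds of data
     • Google, Twitter, Instagram, LINE, ...
     Technologies for using data have matured
     • Storing large volumes of data: distributed processing, cloud, ...
     • Extracting useful information: machine learning, natural language processing, ...
     → We want to use user data to deliver content well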
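  8. (Recap) How Mach Baito works
     Pay-per-listing (other companies' services)
     1. An employer wants to hire part-time staff
     2. The employer posts a job ← fee charged
     Pay-per-hire (Mach Baito)
     1. An employer wants to hire part-time staff
     2. The employer posts a job ← free
     3. An applicant is hired and works the first shift ← fee charged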
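  9. Apply rate and hire rate
     Apply rate = applications / views
     Hire rate = hires / applications
     Value of a job posting ∝ apply rate × hire rate
     • High apply rate → the posting is popular with users
     • High hire rate → the company is eager to hire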
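
The two ratios on this slide can be sketched in a few lines of Python. The `JobStats` container, the function names, and the sample counts are hypothetical and only illustrate the formulas; the proportionality constant is taken as 1.

```python
from dataclasses import dataclass


@dataclass
class JobStats:
    """Raw counts for one job posting (hypothetical container)."""
    views: int
    applications: int
    hires: int


def apply_rate(stats: JobStats) -> float:
    # apply rate = applications / views (0.0 when there are no views yet)
    return stats.applications / stats.views if stats.views else 0.0


def hire_rate(stats: JobStats) -> float:
    # hire rate = hires / applications (0.0 when there are no applications yet)
    return stats.hires / stats.applications if stats.applications else 0.0


def job_value(stats: JobStats) -> float:
    # The slide only states proportionality; the constant is taken as 1 here.
    return apply_rate(stats) * hire_rate(stats)


# Made-up counts for illustration only.
example = JobStats(views=1000, applications=50, hires=5)
print(apply_rate(example), hire_rate(example), job_value(example))
```

On observed counts the product reduces to hires / views. Keeping the two factors separate matters when each rate is predicted on its own, as on the next slide.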
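  10. Predicting apply rate and hire rate
     • Algorithm: logistic regression
     • Features: job type, work location, number of stores, days off, number of shifts, ...
     • Labels: views, applications, and hires for a large number of past job postings ← user data (behavior logs)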
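
A minimal sketch of this setup, assuming one training row per view, one-hot encoded categorical features, and a label of 1 when the view led to an application. The feature names, the data, and the plain batch gradient descent are all made up for illustration; the slides do not say how the production model is trained.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def build_vocabulary(rows):
    # One slot per distinct (field, value) pair seen in training.
    pairs = sorted({(k, v) for row in rows for k, v in row.items()})
    return {pair: i for i, pair in enumerate(pairs)}


def encode(row, vocab):
    # One-hot vector with a trailing constant 1.0 for the bias term.
    x = [0.0] * (len(vocab) + 1)
    for pair in row.items():
        if pair in vocab:
            x[vocab[pair]] = 1.0
    x[-1] = 1.0
    return x


def train(rows, labels, epochs=500, lr=0.5):
    # Batch gradient descent on the mean log loss.
    vocab = build_vocabulary(rows)
    xs = [encode(r, vocab) for r in rows]
    w = [0.0] * (len(vocab) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(xs, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj
        w = [wi - lr * g / len(xs) for wi, g in zip(w, grad)]
    return vocab, w


def predict(row, vocab, w):
    # Predicted probability that a view of this posting leads to an application.
    x = encode(row, vocab)
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


# Made-up behavior log: each row is one view, label 1 means the user applied.
cafe = {"job_type": "cafe", "area": "kyoto"}
warehouse = {"job_type": "warehouse", "area": "osaka"}
rows = [cafe] * 4 + [warehouse] * 4
labels = [1, 1, 1, 0, 0, 0, 0, 1]

vocab, weights = train(rows, labels)
print(predict(cafe, vocab, weights), predict(warehouse, vocab, weights))
```

The same shape would apply to the hire rate, with one row per application and a label of 1 when the applicant was hired.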
  11. Ranking search results
     1. Fetch the list of job postings that match the user's search keywords
     2. Re-sort the list by the apply rate and hire rate predicted in advance

         public void addDocs(TopFieldDocs mainDocs, int reSortDocsNum)
                 throws JsenPageLimitExceedingException, JsenInvalidReSortUnitException {
             // Position of the dhondtUnit field within the sort fields; it holds the score.
             int sortUnitIndex = getIndexOf(dhondtUnit, mainDocs.fields);
             this.docCountTable = new DocCountTable(mainDocs.fields, reSortUnits, reSortDocsNum);
             java.util.ListIterator<FieldDoc> iter = fieldDocsIter(mainDocs.scoreDocs, reSortDocsNum);
             HashMap<FieldDoc, Float> reSortedDocs = new HashMap<>(reSortDocsNum);
             while (iter.hasNext()) {
                 FieldDoc doc = iter.next();
                 float score = ((Integer) (doc.fields[sortUnitIndex])).floatValue();
                 // Damp the score by the count docCountTable currently reports for this document.
                 int count = docCountTable.getCountOf(doc);
                 reSortedDocs.put(doc, score / (1 + count * dhondtCoef));
                 docCountTable.add(doc);
             }
             // Sort by damped score, highest first.
             ArrayList<Entry<FieldDoc, Float>> priorityList = new ArrayList<>(reSortedDocs.entrySet());
             Collections.sort(priorityList, (obj1, obj2) -> obj2.getValue().compareTo(obj1.getValue()));
             this.allDocs = priorityList.stream().map(Entry::getKey).toArray(FieldDoc[]::new);
         }
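
The identifiers `dhondtUnit` and `dhondtCoef` in the method above suggest a D'Hondt-style re-sort: each document's score is divided by a factor that grows with the number of earlier documents in the same group. The Python sketch below shows that idea in isolation. It is one reading of the slide's code, not the actual implementation; the `(doc_id, group, score)` layout and the sample values are hypothetical.

```python
from collections import Counter


def dhondt_resort(docs, coef=1.0):
    """Re-sort docs given as (doc_id, group, score) tuples in their original order.

    Each score is divided by (1 + seen * coef), where `seen` is the number of
    earlier documents from the same group, so one group cannot fill the top
    of the list on raw score alone.
    """
    seen = Counter()
    damped = []
    for doc_id, group, score in docs:
        damped.append((score / (1.0 + seen[group] * coef), doc_id))
        seen[group] += 1
    # Highest damped score first; ties keep their original order.
    damped.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in damped]


# Made-up results: three postings from group "A" and one from group "B".
results = [("a1", "A", 100), ("a2", "A", 90), ("b1", "B", 80), ("a3", "A", 70)]
print(dhondt_resort(results))
```

With `coef=1.0` the damped scores are 100, 45, 80, and about 23.3, so `b1` moves ahead of `a2`.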
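  12. High system uncertainty
     Ordinary systems = the logic does not change when the data changes
     • The logic is determined by the code alone
     • Reading the code tells you the output for a given input
     Machine learning = the logic changes when the data changes
     • The same code yields a different model if the training data differs
     • Model parameters are hard for humans to interpret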
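
A toy illustration of this contrast, using a deliberately simple hypothetical learner (a threshold placed midway between the two class means) rather than any model from the talk:

```python
def fixed_rule(x):
    # Ordinary code: the behavior is fully determined by the source.
    return 1 if x >= 5 else 0


def fit_threshold(samples):
    """Learn a decision threshold as the midpoint between the class means.

    samples: list of (value, label) pairs with label 0 or 1.
    Returns a function that classifies a value.
    """
    pos = [x for x, label in samples if label == 1]
    neg = [x for x, label in samples if label == 0]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x >= threshold else 0


# Same training code, different (made-up) training data.
model_a = fit_threshold([(0, 0), (2, 0), (4, 1), (6, 1)])   # learned threshold: 3.0
model_b = fit_threshold([(4, 0), (6, 0), (8, 1), (10, 1)])  # learned threshold: 7.0

print(fixed_rule(5), model_a(5), model_b(5))
```

Both models come from identical source code, yet they disagree on the input 5. Reading `fit_threshold` alone does not tell you which answer a deployed model will give.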
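  13. A wide range of expertise is required
     Typical Web development
     • Server side, front end, infrastructure, apps
     Data-driven Web development
     • Server side, front end, infrastructure, apps
     • Machine learning engineers, data engineers
     • Analysts, data scientists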
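  14. Livesense's data analytics platform
     Known as: Livesense Analytics
     • Aggregates data from each service, including Mach Baito, Tenshoku Kaigi, and Tenshoku Navi
     • Enables joined analysis of content (job postings etc.) and user data (behavior logs etc.)
     Backend: Amazon Redshift
     • A columnar, distributed data warehouse specialized for analytics
     • A query engine that prioritizes throughput over latency
     • 500+ tables / 10 billion+ records / 100+ internal users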
  15. Queries issued every day

         SELECT
           DATE(n_cpl.pull_at) AS dt,
           CASE
             WHEN n_cpl.pull_condition_id = 47 THEN '1'
             WHEN n_cpl.pull_condition_id = 48 THEN '2'
             WHEN n_cpl.pull_condition_id = 49 THEN '3'
             ELSE 'other'
           END AS list,
           COUNT(DISTINCT n_c.id) AS contacts,
           COUNT(DISTINCT CASE WHEN service_start.contact_id IS NOT NULL THEN n_c.id ELSE NULL END) AS agreements,
           COUNT(DISTINCT CASE WHEN application_proxy.summarizable_id IS NOT NULL THEN n_c.id ELSE NULL END) AS agencies,
           COUNT(DISTINCT CASE WHEN cv.id IS NOT NULL THEN n_c.id ELSE NULL END) AS conversions
         FROM nkd.contact_pull_logs AS n_cpl
         INNER JOIN nkd.contacts AS n_c
           ON n_cpl.contact_id = n_c.id
         LEFT JOIN (
           SELECT contact_id, staff_id, action_at
           FROM nkd.histories AS n_h
           INNER JOIN nkd.histories_history_actions AS n_hha
             ON n_h.id = n_hha.history_id
           WHERE n_hha.history_action_id = 1
             AND n_h.history_result_id = 1
             AND n_h.created_at >= '2017-12-19'
         ) AS service_start
           ON n_c.id = service_start.contact_id
         LEFT JOIN (
           SELECT *
           FROM nkd.kpi_raw_data
           WHERE name = 'application_proxy'
             AND created_at >= '2017-12-19'
         ) AS application_proxy
           ON n_c.member_id = application_proxy.member_id
         LEFT JOIN (
           SELECT id
           FROM nkd.applications
           WHERE recruit_state IN ('hire', 'pre_hire_confirm_admin')
         ) AS cv
           ON application_proxy.summarizable_id = cv.id
         WHERE n_cpl.pull_at >= '2017-12-19'
         GROUP BY dt, list
         ;
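  16. The diversity of data technologies
     The right technology depends on the problem, the goal, and each person's strengths
     Ideally, everyone uses the technology that lets them work at 100% of their ability
     • Web development: Ruby etc. ⇔ data work: Python etc.
     • Even within data work, R, Python, and Julia each have strengths and weaknesses
     → Build an execution environment that allows flexible technology choices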
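  17. Livesense's machine learning platform
     Known as: Livesense Brain
     • Runs each service's machine learning systems, such as prediction and recommendation
     • Containerizes the environment so that a variety of systems can be deployed easily
     Backend: Google Kubernetes Engine
     • An orchestration service that runs and coordinates multiple containers
     • Lets different systems share resources and use them efficiently
     • 16+ CPU cores / 100+ GB RAM / 15+ kinds of containers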
  18. Implementing a recommendation algorithm in Julia

         function bpmf(
                 users::Vector{Int}, items::Vector{Int}, rates::Vector{Float64},
                 N::Int, M::Int, D::Int, n_sample::Int)
             rateidx = sparse(users, items, 1:length(rates))
             U = zeros(n_sample, N, D)
             V = zeros(n_sample, M, D)
             for t in 1:(n_sample-1)
                 # Sample the hyperparameters of the user factors
                 SS = (1. / N) * ΣXXt(U[t, :, :])
                 SU = (1. / N) * sum(U[t, :, :], 1)[:]
                 W0ast = inv(inv(W0) + N * SS + (β0 * N) / (β0 + N) * (μ0 - SU) * (μ0 - SU)')
                 λ_u = rand(Wishart(ν0 + N, symmetric(W0ast)))
                 μ0ast = (β0 * μ0 + N * SU) / (β0 + N)
                 μ_u = rand(MvNormal(μ0ast, symmetric(inv((β0 + N) * λ_u))))
                 ...
                 # Sample each user's factor vector from the items that user rated
                 for i in 1:N
                     js = find(rateidx[i, :])
                     SV_i = ΣXXt(V[t, js, :])
                     λast_inv = inv(λ_u + α * SV_i)
                     SVR_i = sum(V[t, js, :] .* rates[rateidx[i, js].nzval], 1)[:]
                     μast = λast_inv * (α * SVR_i + λ_u * μ_u)
                     U[t+1, i, :] = rand(MvNormal(μast, symmetric(λast_inv)))
                 end
                 ...
             end
             return U, V
         end
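
For reference, the visible part of the sampler can be written as equations read directly off the code. The symbols follow the variable names (`SU` as \bar{U}, `SS` as \bar{S}, `λ_u` as \Lambda_U, `rates` as r_{ij}), and J_i denotes the items rated by user i (`js`). The helper `ΣXXt` is not shown on the slide; the equations assume it returns the sum of outer products of the rows of its argument. The elided parts (`...`) are left out.

```latex
\begin{align*}
\bar{U} &= \frac{1}{N}\sum_{i=1}^{N} U_i, \qquad
\bar{S} = \frac{1}{N}\sum_{i=1}^{N} U_i U_i^{\top} \\
W_0^{*} &= \Bigl( W_0^{-1} + N\bar{S}
  + \frac{\beta_0 N}{\beta_0 + N}\,(\mu_0 - \bar{U})(\mu_0 - \bar{U})^{\top} \Bigr)^{-1} \\
\Lambda_U &\sim \mathcal{W}\bigl(\nu_0 + N,\; W_0^{*}\bigr) \\
\mu_0^{*} &= \frac{\beta_0 \mu_0 + N\bar{U}}{\beta_0 + N}, \qquad
\mu_U \sim \mathcal{N}\bigl(\mu_0^{*},\; ((\beta_0 + N)\,\Lambda_U)^{-1}\bigr) \\
\Lambda_i^{*} &= \Lambda_U + \alpha \sum_{j \in J_i} V_j V_j^{\top}, \qquad
\mu_i^{*} = (\Lambda_i^{*})^{-1}\Bigl( \alpha \sum_{j \in J_i} r_{ij} V_j + \Lambda_U \mu_U \Bigr) \\
U_i &\sim \mathcal{N}\bigl(\mu_i^{*},\; (\Lambda_i^{*})^{-1}\bigr)
\end{align*}
```

Here \mathcal{W}(\nu, W) is a Wishart distribution with \nu degrees of freedom and scale matrix W, matching the argument order of `Wishart` in the code.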
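  19. Data work and expertise
     There are a great many fields of expertise related to data
     • Analysts, data scientists, marketers, ...
     • Machine learning engineers, data engineers, ...
     There is no silver bullet of the form "if you can do this, you can do everything"
     • Being able to do machine learning in Python is enough → ❌
     • Being able to do data analysis in R is enough → ❌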
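  20. Summer internship
     • Period: negotiable
     • Who: people with experience in machine learning, data analysis, or server-side development
     • Pay: from 1,200 yen per hour (transportation expenses paid; accommodation arranged for those traveling from far away)
     • Example projects:
       • Developing new features with natural language processing on user reviews
       • Developing recommendations with algorithms such as Factorization Machine
       • Developing an A/B testing and bandit platform on GCP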