web scraping with polite package

bk
December 05, 2020


Transcript

  1. Observing web scraping etiquette with "polite": an introduction to the polite package

  2. Contents
    • Audience and purpose: p.3
    • What is web scraping?: p.4-7
    • What is the polite package?: p.8-12
    • How to use the polite package: p.13-30
    • Summary: p.31
    • Other points to note: p.32-33
    • References: p.34-36
    • Appendix: p.37-40

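  3. Audience and purpose
    Target audience:
    • People who have a rough idea of what web scraping is
    • People who feel uneasy about the rules and regulations around web scraping
    Purpose:
    • Learn the basics of web scraping etiquette through the polite package
    Not covered:
    • The scraping operations themselves, using rvest and similar tools
    • Restrictions on how the collected information may be used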
  4. What is web scraping?

  5. What is web scraping? Web scraping (also called web data extraction, screen scraping, or web data harvesting) is a computer software technique for extracting information from websites. https://www.octoparse.jp/blog/web-scraping/

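  6. What is web scraping? Four cases in which scraping becomes illegal:
    ✕ Placing a heavy load on the server (Penal Code Art. 233, obstruction of business by fraudulent means; Art. 234, obstruction of business by damaging a computer)
    ✕ Collecting, publishing, or trading personal information without consent (violation of the Act on the Protection of Personal Information)
    ✕ Violating the terms of service of the website being scraped
    ✕ Using or reproducing content in disregard of copyright (Copyright Act Art. 21, etc.)
    Source: PigData, "Ask an IT lawyer: is scraping by a company illegal?", https://services.sms-datatech.co.jp/pig-data/2019/07/03/scrapinglaw/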
  7. Be Polite!!!
  8. What is the polite package?

  9. What is the polite package?
    Release date: r
    Developer: Dmytro Perepolkin
    URL: https://github.com/dmi3kno/polite
    "The goal of polite is to promote responsible web etiquette." Be Nice on the Web
    https://www.rdocumentation.org/packages/polite/versions/0.1.1
  10. What is the polite package? "The three pillars of a polite session are seeking permission, taking slowly and never asking twice." Be Nice on the Web
    https://www.rdocumentation.org/packages/polite/versions/0.1.1
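  11. What is the polite package? The scraping etiquette that polite puts into practice:
    • Leave an interval between requests (taking slowly)
    • Check that you have permission (seeking permission)
    • Do not repeat the check (never asking twice)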
  12. What is the polite package? polite does not let you avoid every legal issue. Personal information, terms of service, and copyright still need attention.
  13. How to use the polite package

  14. How to use the polite package: the workflow
    1. bow: "Introduce yourself to the host"
    2. scrape: "Scrape the content of authorized page/API"
    (3. rvest: "helps you scrape information from web pages")
    4. nod: "Agree Modification Of Session Path With The Host"
    https://www.rdocumentation.org/packages/polite/versions/0.1.1
    https://www.rdocumentation.org/packages/rvest/versions/0.3.6
  15. bow

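  16. How to use the polite package: bow. Conveys information about yourself to the host.
    Main arguments:
    • url: the target URL
    • user_agent: the user-agent string (even better if it includes your contact details)
    • delay: the interval between scraping requests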
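As a minimal sketch of the three arguments above, the snippet below wraps `polite::bow()` in a small helper. The helper name `greet_host`, the user-agent text, and the contact address are illustrative assumptions and do not come from the slides. The call to the host is guarded so that nothing is requested unless the code is run interactively with the polite package installed.

```r
# Hypothetical helper (not part of polite): makes the three main
# arguments of bow() explicit.
greet_host <- function(url, contact, delay = 5) {
  polite::bow(
    url        = url,
    # Placeholder user-agent string that carries a contact address.
    user_agent = paste0("my-scraper (contact: ", contact, ")"),
    delay      = delay  # seconds to wait between requests
  )
}

# Only contact the host in an interactive session.
if (interactive()) {
  session <- greet_host("https://www.cheese.com/", contact = "me@example.com")
  print(session)  # summarises the session that bow() created
}
```

Putting a contact address in the user agent gives the site operator a way to reach you before resorting to blocking.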

  18. How to use the polite package: bow. The delay argument is what leaves an interval between requests (taking slowly).
  20. How to use the polite package: bow. What is robots.txt? A text file that holds instructions for crawlers. *1
    The robots.txt of the website used in this talk (https://www.cheese.com/robots.txt): *2
        User-agent: *
        Sitemap: https://www.cheese.com/sitemap.xml
    *1 https://wacul-ai.com/blog/seo/internal-seo/seo-robots-txt/
    *2 https://www.rdocumentation.org/packages/robotstxt/versions/0.7.13
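To make the robots.txt content quoted above concrete, here is a deliberately simplified base-R sketch that splits each line into a field and a value and lists any `Disallow` rules. It is an illustration only: it ignores per-bot grouping and wildcards, and it is not how polite performs the check (the slides point to the robotstxt package for that). The variable names are assumptions made for this example.

```r
# The robots.txt text quoted on the slide, one directive per element.
robots_txt <- c(
  "User-agent: *",
  "Sitemap: https://www.cheese.com/sitemap.xml"
)

# Split every "Field: value" line at the first colon.
field <- tolower(trimws(sub(":.*$", "", robots_txt)))
value <- trimws(sub("^[^:]*:", "", robots_txt))

# Paths that crawlers are asked not to visit, and any listed sitemaps.
disallowed <- value[field == "disallow" & nzchar(value)]
sitemaps   <- value[field == "sitemap"]

print(disallowed)  # empty: this file contains no Disallow rules
print(sitemaps)
```

Because the file contains no `Disallow` lines, nothing in it forbids crawling any path on the site.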
  21. How to use the polite package: bow. When scraping is not permitted

  22. How to use the polite package: bow. When scraping is not permitted: checking for permission (seeking permission)

  23. scrape

  24. How to use the polite package: scrape. Fetches the content of the target page.
    Main arguments:
    • bow: the object created by bow
    • query: parameters to append to the URL given to bow
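The two arguments above could be used as in the following sketch. The helper name `fetch_with_query` and the query values (`t`, `per_page`) are illustrative assumptions and are not taken from the slides. The request itself only runs in an interactive session with polite installed.

```r
# Hypothetical helper: bow to the host once, then fetch one page,
# passing extra URL parameters through scrape()'s query argument.
fetch_with_query <- function(url, query) {
  session <- polite::bow(url)             # introduce ourselves to the host
  polite::scrape(session, query = query)  # query is a named list of parameters
}

# Only contact the host in an interactive session.
if (interactive()) {
  page <- fetch_with_query(
    "https://www.cheese.com/by_type",
    query = list(t = "semi-soft", per_page = 100)
  )
  print(class(page))
}
```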

  25. rvest

  26. How to use the polite package: rvest
    Release date: r
    Developer: Hadley Wickham
    URL: https://github.com/tidyverse/rvest
    "Easily Harvest (Scrape) Web Pages"
    https://www.rdocumentation.org/packages/rvest/versions/0.3.6
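  27. How to use the polite package: rvest. Extract the information you need from the HTML document fetched by scrape.
    References:
    • "Intro to {polite} Web Scraping of Soccer Data with R!" (https://ryo-n7.github.io/2020-05-14-webscrape-soccer-data-with-R/)
    • "Rによるスクレイピング入門" (Introduction to Scraping with R) (https://www.amazon.co.jp/dp/486354216X)
    • "RユーザのためのRStudio[実践]入門" (A Practical Introduction to RStudio for R Users: The World of Modern Analysis Workflows with tidyverse) (https://www.amazon.co.jp/dp/4774198536)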
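As a rough illustration of the extraction step, the sketch below pulls heading text out of an HTML document with rvest. To stay offline it parses a small hand-written HTML string instead of a page fetched by scrape. The markup, the `h3` selector, and the helper name `extract_headings` are assumptions made for this example.

```r
# Hypothetical helper: return the trimmed text of every <h3> element.
extract_headings <- function(page) {
  nodes <- rvest::html_nodes(page, "h3")
  rvest::html_text(nodes, trim = TRUE)
}

# Run the demo only when rvest and xml2 are installed.
if (requireNamespace("rvest", quietly = TRUE) &&
    requireNamespace("xml2", quietly = TRUE)) {
  # Stand-in for the document that scrape() would return.
  page <- xml2::read_html(
    "<div id='main-body'><h3> Cheddar </h3><h3>Gouda</h3></div>"
  )
  print(extract_headings(page))
}
```

In a real session the same helper would be applied to the document returned by scrape.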
  28. nod

  29. How to use the polite package: nod. Changes the path to fetch from.
    Main arguments:
    • bow: the object created by bow
    • path: the new path

  30. How to use the polite package: nod. Changing the path with nod is how polite avoids repeating the check (never asking twice).
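A sketch of how nod might be used to visit several paths on one host while reusing a single session from bow. The helper name `scrape_paths` and the example paths are assumptions for illustration. The requests only run in an interactive session with polite installed.

```r
# Hypothetical helper: reuse one bow() session for several paths.
scrape_paths <- function(session, paths) {
  lapply(paths, function(path) {
    # nod() points the existing session at a new path on the same host.
    page_session <- polite::nod(session, path = path)
    polite::scrape(page_session)
  })
}

# Only contact the host in an interactive session.
if (interactive()) {
  session <- polite::bow("https://www.cheese.com/")
  pages   <- scrape_paths(session, c("by_type", "alphabetical"))
  print(length(pages))
}
```

Bowing once and nodding per path keeps the introduction to the host in one place instead of repeating it for every page.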
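  31. Summary
    The polite workflow:
    1. bow: checks robots.txt, announces the user agent, and sets the delay.
    2. scrape: fetches the content.
    (3. rvest: extracts the information you need.)
    4. nod: changes the target path.
    The three main functions of polite:
    • Leave an interval between requests (taking slowly) → set in bow; the default is 5 seconds.
    • Check that you have permission (seeking permission) → bow obtains it from robots.txt.
    • Do not repeat the check (never asking twice) → handled by nod.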
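  32. Other points to note
    Terms of service: scraping a site whose terms prohibit scraping may lead to claims for damages and the like. (This applies only where the user has agreed to the terms, for example through membership registration.)
    Personal information: when collecting personal information, the purpose of use must be made clear to the person concerned.
    Copyright: copying or saving a copyrighted work without the right holder's consent constitutes copyright infringement. (However, reproduction for the purpose of information analysis may be carried out without the right holder's consent.)
    Sources:
    • PigData, "Ask an IT lawyer: is scraping by a company illegal?", https://services.sms-datatech.co.jp/pig-data/2019/07/03/scrapinglaw/
    • Hidetoshi Nakano (lawyer specializing in IT law, AI, and fintech law), "Scraping and the law: a lawyer explains what is OK and what is not", https://it-bengosi.com/blog/scraping/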
  33. Other points to note: please check these points for yourself.
  34. References
    • R Documentation, "polite package", https://www.rdocumentation.org/packages/polite/versions/0.1.1
    • R Documentation, "rvest package", https://www.rdocumentation.org/packages/rvest/versions/0.3.6
    • Blog of the access-analysis tool "AI Analyst", "What is robots.txt? A detailed explanation from its meaning to how to set it up", https://wacul-ai.com/blog/seo/internal-seo/seo-robots-txt/
    • Octoparse, "What is web scraping? From definition to applications", https://www.octoparse.jp/blog/web-scraping/
    • PigData, "Ask an IT lawyer: is scraping by a company illegal?", https://services.sms-datatech.co.jp/pig-data/2019/07/03/scrapinglaw/
  35. References
    • Agency for Cultural Affairs, "Cases in which copyrighted works may be used freely", https://www.bunka.go.jp/seisaku/chosakuken/seidokaisetsu/gaiyo/chosakubutsu_jiyu.html
    • Stimulator, "Rules for web scraping and reading site terms with Python", https://vaaaaaanquish.hatenablog.com/entry/2017/12/01/064227
    • Hidetoshi Nakano (lawyer specializing in IT law, AI, and fintech law), "Scraping and the law: a lawyer explains what is OK and what is not", https://it-bengosi.com/blog/scraping/
    • PigData, "Scraping: five service patterns that do not break the law", https://services.sms-datatech.co.jp/pig-data/2020/01/15/scrapinglaw3/
    • R Documentation, "robotstxt package", https://www.rdocumentation.org/packages/robotstxt/versions/0.7.13
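  36. References
    • Motohiro Ishida, Daisuke Ichikawa, Shinya Uryu, Hiroaki Yutani, "Rによるスクレイピング入門" (Introduction to Scraping with R), C&R Institute, https://www.amazon.co.jp/dp/486354216X
    • Yuya Matsumura, Hiroaki Yutani, Kazuhiro Maeda, Yasunori Kinosada, "RユーザのためのRStudio[実践]入門" (A Practical Introduction to RStudio for R Users: The World of Modern Analysis Workflows with tidyverse), Gijutsu-Hyoron, https://www.amazon.co.jp/dp/4774198536
    • "Intro to {polite} Web Scraping of Soccer Data with R!", https://ryo-n7.github.io/2020-05-14-webscrape-soccer-data-with-R/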
  37. Appendix

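  38. Appendix: What is etiquette? Etiquette (from the French étiquette) means manners. In French the word originally meant a tag or label, such as a price tag or a luggage tag. It also referred to the pass that told people invited to court how to behave, and from this it came to mean court ceremony. By extension it now refers to the various manners of everyday social life. https://kotobank.jp/word/%E3%82%A8%E3%83%81%E3%82%B1%E3%83%83%E3%83%88-445121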

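  39. Appendix: The Okazaki Municipal Central Library incident. The Okazaki Municipal Central Library incident is a case in which, around March 2010, access failures occurred in the catalogue search system of the Okazaki municipal library and one of its users was arrested. (...) What the man had actually been doing was running a crawler he had written himself, because he was not satisfied with the usability of the catalogue search system, in order to retrieve book information from it. (...) After 20 days of detention and questioning, prosecution was suspended on June 14 on the grounds that no strong intent to obstruct business could be found. Experts and engineers, however, have questioned whether the lengthy detention was justified and, before that, whether an arrest was necessary at all. (Wikipedia) https://ja.wikipedia.org/wiki/%E5%B2%A1%E5%B4%8E%E5%B8%82%E7%AB%8B%E4%B8%AD%E5%A4%AE%E5%9B%B3%E6%9B%B8%E9%A4%A8%E4%BA%8B%E4%BB%B6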
  40. Appendix: What is a user agent? A user agent is the program used to access a website, or the string that identifies such a program. *1
    Example: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36 *2
    *1 https://www.irep.co.jp/knowledge/glossary/detail/id=10210/
    *2 https://qiita.com/nightyknite/items/b2590a69f2e0135756dc
  41. enjoy!!