web scraping with polite package

bk
December 05, 2020

Transcript

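  1. Audience and purpose

    Target audience: people who have a rough idea of what web scraping is, and who feel uneasy about the rules and restrictions surrounding it.
    Purpose: learn the basics of web scraping etiquette through the polite package.
    Not covered: the scraping operations themselves (using rvest and similar tools), and restrictions on how the retrieved information may be used.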
  2. What is the polite package?

    "Be Nice on the Web." The three pillars of a polite session are seeking permission, taking slowly and never asking twice.
    https://www.rdocumentation.org/packages/polite/versions/0.1.1
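  3. What is the polite package? The scraping etiquette that polite puts into practice

    • Confirm that you are allowed to fetch the content (seeking permission)
    • Leave an interval between requests (taking slowly)
    • Do not repeat the same check (never asking twice)
    polite does not let you sidestep every legal issue: personal information, terms of service, and copyright still need attention.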
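The first point, seeking permission, comes down to consulting the site's robots.txt before fetching anything. The sketch below shows that check on its own, using the robotstxt package listed in the references. It assumes robotstxt is installed; the helper name `is_path_allowed` and the example domain are hypothetical and not from the deck.

```r
# Hypothetical helper: ask robots.txt whether `path` on `domain` may be
# fetched by the given bot. Requires the robotstxt package and network access.
is_path_allowed <- function(path, domain, bot = "*") {
  robotstxt::paths_allowed(paths = path, domain = domain, bot = bot)
}

# Usage (performs a network request, so it is left commented out):
# is_path_allowed("/", "www.example.com")
```

The package is called with `::` so that defining the helper does not require attaching it; the request only happens when the helper is called.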
  4. How to use the polite package

    1, bow: "Introduce yourself to the host"
    2, scrape: "Scrape the content of authorized page/API"
    (3, rvest: "helps you scrape information from web pages")
    4, nod: "Agree Modification Of Session Path With The Host"
    https://www.rdocumentation.org/packages/polite/versions/0.1.1
    https://www.rdocumentation.org/packages/rvest/versions/0.3.6
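The first three steps can be sketched as a single function. This is an illustration under assumptions rather than code from the deck: the wrapper name `fetch_headings`, the CSS selector, and the user-agent text are hypothetical, and the polite and rvest packages are assumed to be installed.

```r
# Hypothetical wrapper around the bow -> scrape -> rvest steps.
# Requires the polite and rvest packages and network access.
fetch_headings <- function(url, css = "h1",
                           user_agent = "example-bot (contact: you@example.com)",
                           delay = 5) {
  # 1, bow: introduce yourself to the host and set the request delay
  session <- polite::bow(url, user_agent = user_agent, delay = delay)
  # 2, scrape: fetch the content of the authorized page
  page <- polite::scrape(session)
  # Assumption: scrape() yields NULL when fetching is not permitted or fails
  if (is.null(page)) return(character(0))
  # 3, rvest: extract the pieces you need from the HTML document
  rvest::html_text(rvest::html_nodes(page, css), trim = TRUE)
}

# Usage (performs network requests, so it is left commented out):
# fetch_headings("https://www.example.com")
```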
  5. bow

  6. How to use the polite package: rvest

    rvest: "Easily Harvest (Scrape) Web Pages"
    Release date: r
    Developer: Hadley Wickham
    URL: https://github.com/tidyverse/rvest
    https://www.rdocumentation.org/packages/rvest/versions/0.3.6
  7. How to use the polite package: rvest

    Extract the information you need from the HTML document retrieved with scrape.
    References:
    Intro to {polite} Web Scraping of Soccer Data with R! (https://ryo-n7.github.io/2020-05-14-webscrape-soccer-data-with-R/)
    Introduction to Scraping with R (https://www.amazon.co.jp/dp/486354216X)
    A Practical Introduction to RStudio for R Users: The World of Modern Analysis Workflows with tidyverse (https://www.amazon.co.jp/dp/4774198536)
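The extraction step can be tried without any network access by parsing a small inline HTML string in place of a scraped page. The markup, class name, and variable names below are made up for illustration. `html_nodes()` and `html_text()` are used because they are available in the rvest version the deck links to.

```r
library(rvest)

# Inline HTML standing in for a document returned by scrape()
page <- read_html(
  "<html><body>
     <h1>Results</h1>
     <ul>
       <li class='team'>Team A</li>
       <li class='team'>Team B</li>
     </ul>
   </body></html>"
)

# Select nodes with a CSS selector, then extract their text
teams <- html_text(html_nodes(page, "li.team"), trim = TRUE)
title <- html_text(html_nodes(page, "h1"), trim = TRUE)

print(title)
print(teams)
```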
  8. nod
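A rough sketch of how nod fits in: bow once to the host, then nod to each new path before scraping it, so the same session is reused. The helper name `scrape_paths`, the user-agent text, and the example paths are hypothetical; the polite package is assumed to be installed.

```r
# Hypothetical helper: reuse one bow() session and nod() to each path in turn.
# Requires the polite package and network access.
scrape_paths <- function(host_url, paths,
                         user_agent = "example-bot (contact: you@example.com)") {
  # Introduce yourself to the host once
  session <- polite::bow(host_url, user_agent = user_agent)
  lapply(paths, function(path) {
    # nod: agree the new path with the host, keeping the same session
    current <- polite::nod(session, path = path)
    polite::scrape(current)
  })
}

# Usage (performs network requests, so it is left commented out):
# pages <- scrape_paths("https://www.example.com", c("/page/1", "/page/2"))
```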

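  9. Summary

    Steps in polite:
    1, bow: check robots.txt, announce the User-Agent, and set the delay.
    2, scrape: fetch the content.
    (3, rvest: extract the information you need.)
    4, nod: change the target path.
    The three main features of polite:
    • Leave an interval between requests (taking slowly) → set in bow; the default is 5 seconds.
    • Confirm that fetching is allowed (seeking permission) → bow obtains this from robots.txt.
    • Do not repeat the same check (never asking twice) → handled by nod.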
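  10. Other points to note

    Terms of service: scraping a site whose terms prohibit scraping can lead to claims for damages, among other things. (This applies only where the user has agreed to the terms, for example by registering as a member.)
    Personal information: when collecting personal information, the purpose of use must be made clear to the person concerned.
    Copyright: copying or saving a copyrighted work without the rights holder's consent constitutes copyright infringement. (Reproduction for the purpose of information analysis, however, can be carried out without the rights holder's consent.)
    Sources:
    PigData, "Ask an IT lawyer: is scraping by a company illegal?", https://services.sms-datatech.co.jp/pig-data/2019/07/03/scrapinglaw/
    Hidetoshi Nakano (lawyer specializing in IT legal affairs, AI, and Fintech law), "[Scraping and the law] A lawyer explains what is legally OK and what is OUT when scraping", https://it-bengosi.com/blog/scraping/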
  11. Other points to note (continued)

    Please check each of these points for yourself.
  12. References

    R Documentation, "polite package", https://www.rdocumentation.org/packages/polite/versions/0.1.1
    R Documentation, "rvest package", https://www.rdocumentation.org/packages/rvest/versions/0.3.6
    "AI Analyst" access analytics tool blog, "What is robots.txt? A detailed explanation from its meaning to how to set it up", https://wacul-ai.com/blog/seo/internal-seo/seo-robots-txt/
    Octoparse, "What is web scraping? An explanation from definition to applications", https://www.octoparse.jp/blog/web-scraping/
    PigData, "Ask an IT lawyer: is scraping by a company illegal?", https://services.sms-datatech.co.jp/pig-data/2019/07/03/scrapinglaw/
  13. References

    Agency for Cultural Affairs, "Cases in which copyrighted works can be used freely", https://www.bunka.go.jp/seisaku/chosakuken/seidokaisetsu/gaiyo/chosakubutsu_jiyu.html
    Stimulator, "Rules for web scraping and reading site terms with Python", https://vaaaaaanquish.hatenablog.com/entry/2017/12/01/064227
    Hidetoshi Nakano (lawyer specializing in IT legal affairs, AI, and Fintech law), "[Scraping and the law] A lawyer explains what is legally OK and what is OUT when scraping", https://it-bengosi.com/blog/scraping/
    PigData, "[Scraping] Five service patterns that do not become illegal", https://services.sms-datatech.co.jp/pig-data/2020/01/15/scrapinglaw3/
    R Documentation, "robotstxt package", https://www.rdocumentation.org/packages/robotstxt/versions/0.7.13
  14. References

    Motohiro Ishida, Daisuke Ichikawa, Shinya Uryu, Hiroaki Yutani, "Introduction to Scraping with R", C&R Institute, https://www.amazon.co.jp/dp/486354216X
    Yuya Matsumura, Hiroaki Yutani, Kazuhiro Maeda, Yasunori Kinosada, "A Practical Introduction to RStudio for R Users: The World of Modern Analysis Workflows with tidyverse", Gijutsu-Hyoron-sha, https://www.amazon.co.jp/dp/4774198536
    "Intro to {polite} Web Scraping of Soccer Data with R!", https://ryo-n7.github.io/2020-05-14-webscrape-soccer-data-with-R/
  15. Appendix: What is a User-Agent?

    A user agent is the program used to access a website, or the string that identifies such a program. *1
    Example: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36 *2
    *1 https://www.irep.co.jp/knowledge/glossary/detail/id=10210/
    *2 https://qiita.com/nightyknite/items/b2590a69f2e0135756dc
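For a scraper, the user agent is simply a string you choose and announce to the host. The sketch below builds one that names the bot and gives a contact point. The `name/version (+contact)` layout is a common convention assumed here, not something the deck prescribes, and the helper name and values are hypothetical.

```r
# Hypothetical helper: build a User-Agent string that names the bot and
# gives the site owner a way to get in touch.
make_user_agent <- function(bot_name, version, contact) {
  sprintf("%s/%s (+%s)", bot_name, version, contact)
}

ua <- make_user_agent("example-bot", "0.1", "https://example.com/bot-info")
print(ua)

# The string can then be announced to the host when bowing (network access
# and the polite package required, so this is left commented out):
# session <- polite::bow("https://www.example.com", user_agent = ua)
```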