Slide 1

Slide 1 text

Keeping Web Scraping Etiquette with "polite": An Introduction to the polite Package

Slide 2

Slide 2 text

Contents
• Audience and goals (slide 3)
• What is web scraping? (slides 4-7)
• What is the polite package? (slides 8-12)
• How to use the polite package (slides 13-30)
• Summary (slide 31)
• Other cautions (slides 32-33)
• References (slides 34-36)
• Appendix (slides 37-40)

Slide 3

Slide 3 text

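Audience and goals

Target audience
• People who have a rough idea of what web scraping is
• People who feel uneasy about the regulations around web scraping

Goal
• Learn the basics of web scraping etiquette through the polite package

Not covered
• The scraping operations themselves, using rvest and similar tools
• Restrictions on how the retrieved information may be used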

Slide 4

Slide 4 text

What is web scraping?

Slide 5

Slide 5 text

What is web scraping?
Web scraping (also called web data extraction, screen scraping, or web data collection) is a computer software technique for extracting information from websites.
https://www.octoparse.jp/blog/web-scraping/

Slide 6

Slide 6 text

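What is web scraping?
Four cases in which scraping becomes illegal
✕ Placing a load on the server (Penal Code Article 233, obstruction of business by fraudulent means; Article 234, obstruction of business by damaging a computer)
✕ Acquiring, publishing, or trading personal information without consent (violation of the Act on the Protection of Personal Information)
✕ Violating the terms of service of the website being scraped
✕ Using or reproducing content in disregard of copyright (Copyright Act Article 21, among others)
Source: PigData, "Ask an IT lawyer: is scraping by a company illegal?", https://services.sms-datatech.co.jp/pig-data/2019/07/03/scrapinglaw/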

Slide 7

Slide 7 text

What is web scraping?
Four cases in which scraping becomes illegal: Be Polite!!!

Slide 8

Slide 8 text

What is the polite package?

Slide 9

Slide 9 text

What is the polite package?
Release date: r
Developer: Dmytro Perepolkin
URL: https://github.com/dmi3kno/polite
Be Nice on the Web
"The goal of polite is to promote responsible web etiquette."
https://www.rdocumentation.org/packages/polite/versions/0.1.1

Slide 10

Slide 10 text

What is the polite package?
Be Nice on the Web
"The three pillars of a polite session are seeking permission, taking slowly and never asking twice."
https://www.rdocumentation.org/packages/polite/versions/0.1.1

Slide 11

Slide 11 text

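What is the polite package?
Scraping etiquette that polite puts into practice
• Leave an interval between requests (taking slowly)
• Check for permission to fetch (seeking permission)
• Do not repeat the check (never asking twice)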

Slide 12

Slide 12 text

What is the polite package?
polite does not let you avoid every legal issue:
• Personal information
• Terms of service
• Copyright

Slide 13

Slide 13 text

How to use the polite package

Slide 14

Slide 14 text

How to use the polite package
Steps for using polite
1. bow: Introduce yourself to the host
2. scrape: Scrape the content of authorized page/API
(3. rvest: helps you scrape information from web pages)
4. nod: Agree modification of session path with the host
https://www.rdocumentation.org/packages/polite/versions/0.1.1
https://www.rdocumentation.org/packages/rvest/versions/0.3.6

Slide 15

Slide 15 text

bow

Slide 16

Slide 16 text

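How to use the polite package: bow
Tell the host who you are.
Main arguments
• url: the target URL
• user_agent: the user-agent (UA) string (even better if it includes your contact details)
• delay: the interval between scraping requests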
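The three arguments above can be sketched as a single call. This is a minimal illustration, not the speaker's code: the helper name `make_session`, the contact address, and the target path are assumptions made for the example. Because `bow()` contacts the host to read its robots.txt, the actual call is left for an interactive session.

```r
# Wrap bow() so the politeness settings live in one place.
# `make_session` is a hypothetical helper name used only for illustration.
make_session <- function(url,
                         contact = "me@example.com",  # placeholder contact address
                         delay = 5) {                 # seconds between requests
  polite::bow(
    url        = url,
    user_agent = paste0("polite-demo (contact: ", contact, ")"),
    delay      = delay
  )
}

# Needs network access and the polite package installed.
if (interactive()) {
  session <- make_session("https://www.cheese.com/by_type")
  print(session)  # prints a summary of the session
}
```

Putting a contact address in the user-agent string gives the site operator a way to reach you before resorting to blocking.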

Slide 17

Slide 17 text

How to use the polite package: bow
• delay: the interval between scraping requests

Slide 18

Slide 18 text

How to use the polite package: bow
• delay: the interval between scraping requests
→ Leave an interval between requests (taking slowly)

Slide 19

Slide 19 text

How to use the polite package: bow

Slide 20

Slide 20 text

How to use the polite package: bow
What is robots.txt?
A text file that describes instructions for crawlers. *1

robots.txt of the website targeted in this talk (https://www.cheese.com/robots.txt): *2
User-agent: *
Sitemap: https://www.cheese.com/sitemap.xml

*1 https://wacul-ai.com/blog/seo/internal-seo/seo-robots-txt/
*2 https://www.rdocumentation.org/packages/robotstxt/versions/0.7.13
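As a rough sketch of how such a file can be inspected from R, the robotstxt package referenced on this slide can evaluate rules supplied as text. The helper name `check_paths` and the paths being checked are assumptions for illustration; the robots.txt text is the one shown on the slide.

```r
# robots.txt content as shown on the slide for www.cheese.com
robots_text <- paste(
  "User-agent: *",
  "Sitemap: https://www.cheese.com/sitemap.xml",
  sep = "\n"
)

# Returns TRUE/FALSE per path for the given bot name.
# `check_paths` is a hypothetical helper name used only for illustration.
check_paths <- function(text, paths, bot = "*") {
  rules <- robotstxt::robotstxt(text = text)  # parse supplied text, no download
  rules$check(paths = paths, bot = bot)
}

# Needs the robotstxt package installed.
if (interactive()) {
  # No Disallow rules are listed, so these paths should come back as allowed.
  print(check_paths(robots_text, c("/", "/by_type")))
}
```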

Slide 21

Slide 21 text

How to use the polite package: bow
When scraping is not permitted

Slide 22

Slide 22 text

How to use the polite package: bow
When scraping is not permitted
→ Check for permission to fetch (seeking permission)

Slide 23

Slide 23 text

scrape

Slide 24

Slide 24 text

How to use the polite package: scrape
Fetch the content of the target page.
Main arguments
• bow: the object created by bow
• query: parameters to append to the URL specified in bow
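A minimal sketch of the two arguments in use. The helper name `fetch_page` and the query values are illustrative assumptions, and the NULL check reflects an assumption about how `scrape()` behaves when a path is not permitted rather than something stated on the slides.

```r
# Fetch one page through an existing polite session.
# `fetch_page` is a hypothetical helper; the query values below are placeholders.
fetch_page <- function(session, query = NULL) {
  page <- polite::scrape(session, query = query)
  if (is.null(page)) {
    # Assumption: scrape() gives back NULL when nothing could be fetched,
    # for example because robots.txt disallows the path.
    stop("nothing was scraped; check the session's permissions")
  }
  page
}

# Needs network access and the polite package installed.
if (interactive()) {
  session <- polite::bow("https://www.cheese.com/by_type")
  page <- fetch_page(session, query = list(t = "semi-soft", per_page = 100))
}
```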

Slide 25

Slide 25 text

rvest

Slide 26

Slide 26 text

How to use the polite package: rvest
Release date: r
Developer: Hadley Wickham
URL: https://github.com/tidyverse/rvest
Easily Harvest (Scrape) Web Pages
https://www.rdocumentation.org/packages/rvest/versions/0.3.6

Slide 27

Slide 27 text

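How to use the polite package: rvest
Extract the information you need from the HTML document obtained with scrape.
References:
• "Intro to {polite} Web Scraping of Soccer Data with R!" (https://ryo-n7.github.io/2020-05-14-webscrape-soccer-data-with-R/)
• "Rによるスクレイピング入門" (Introduction to Scraping with R) (https://www.amazon.co.jp/dp/486354216X)
• "RユーザのためのRStudio[実践]入門" (A Practical Introduction to RStudio for R Users: The World of Modern Analysis Workflows with tidyverse) (https://www.amazon.co.jp/dp/4774198536)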
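The extraction step can be tried without touching any website. In this sketch a small inline HTML string stands in for the document that scrape would return; the markup and the CSS selector are made up for the example.

```r
# A small inline document stands in for the result of scrape().
html <- '<div id="main-body">
  <h3>Cheddar</h3>
  <h3>Gouda</h3>
  <p>not a heading</p>
</div>'

doc <- xml2::read_html(html)

# CSS selector: every <h3> inside the element with id "main-body".
nodes <- rvest::html_nodes(doc, "#main-body h3")
cheese_names <- rvest::html_text(nodes)

print(cheese_names)
# [1] "Cheddar" "Gouda"
```

Testing selectors against a saved or inline copy of the page is itself a polite habit, since it avoids repeated requests while you work out the extraction.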

Slide 28

Slide 28 text

nod

Slide 29

Slide 29 text

How to use the polite package: nod
Change the path to fetch from.
Main arguments
• bow: the object created by bow
• path: the new path

Slide 30

Slide 30 text

How to use the polite package: nod
→ Do not repeat the check (never asking twice)
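A sketch of how nod lets one session cover several pages on the same host. The helper name `scrape_paths` and the two paths are assumptions for illustration only.

```r
# Visit several paths on one host, reusing a single bow() session.
# `scrape_paths` is a hypothetical helper name; the paths below are placeholders.
scrape_paths <- function(session, paths) {
  pages <- lapply(paths, function(path) {
    current <- polite::nod(session, path = path)  # change the path, keep the session
    polite::scrape(current)
  })
  names(pages) <- paths
  pages
}

# Needs network access and the polite package installed.
if (interactive()) {
  session <- polite::bow("https://www.cheese.com")
  pages <- scrape_paths(session, c("by_type", "alphabetical"))
}
```

Calling bow once and nod per page keeps the introduction to the host in one place instead of repeating it for every URL.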

Slide 31

Slide 31 text

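Summary
Steps for using polite
1. bow: check robots.txt, announce the user agent, set the delay
2. scrape: fetch the content
(3. rvest: extract the information you need)
4. nod: change the target path

Three main features of polite
• Leave an interval between requests (taking slowly) → set in bow; the default is 5 seconds
• Check for permission to fetch (seeking permission) → bow obtains it from robots.txt
• Do not repeat the check (never asking twice) → handled by nod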

Slide 32

Slide 32 text

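Other cautions

Terms of service
Scraping a site whose terms prohibit scraping can lead to claims for damages and the like. (This applies only where the user has agreed to the terms, for example by registering as a member.)

Personal information
When acquiring personal information, the purpose of use must be made clear to the person concerned.

Copyright
Copying or saving a copyrighted work without the rights holder's consent constitutes copyright infringement. (However, reproduction for the purpose of information analysis may be done without the rights holder's consent.)

Sources:
• PigData, "Ask an IT lawyer: is scraping by a company illegal?", https://services.sms-datatech.co.jp/pig-data/2019/07/03/scrapinglaw/
• Hidetoshi Nakano (lawyer specializing in IT law, AI, and Fintech), "[Scraping and the law] A lawyer explains what is legally OK and what is not in scraping", https://it-bengosi.com/blog/scraping/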

Slide 33

Slide 33 text

Other cautions
Terms of service, personal information, copyright: please check each of these for yourselves.

Slide 34

Slide 34 text

References
• R Documentation, "polite package", https://www.rdocumentation.org/packages/polite/versions/0.1.1
• R Documentation, "rvest package", https://www.rdocumentation.org/packages/rvest/versions/0.3.6
• AI Analyst (access analysis tool) blog, "What is robots.txt? A detailed explanation from its meaning to how to configure it", https://wacul-ai.com/blog/seo/internal-seo/seo-robots-txt/
• Octoparse, "What is web scraping? From definition to applications", https://www.octoparse.jp/blog/web-scraping/
• PigData, "Ask an IT lawyer: is scraping by a company illegal?", https://services.sms-datatech.co.jp/pig-data/2019/07/03/scrapinglaw/

Slide 35

Slide 35 text

References
• Agency for Cultural Affairs, "Cases in which copyrighted works may be used freely", https://www.bunka.go.jp/seisaku/chosakuken/seidokaisetsu/gaiyo/chosakubutsu_jiyu.html
• Stimulator, "Rules for web scraping and reading terms of service with Python", https://vaaaaaanquish.hatenablog.com/entry/2017/12/01/064227
• Hidetoshi Nakano (lawyer specializing in IT law, AI, and Fintech), "[Scraping and the law] A lawyer explains what is legally OK and what is not in scraping", https://it-bengosi.com/blog/scraping/
• PigData, "[Scraping] Five service patterns that are not illegal", https://services.sms-datatech.co.jp/pig-data/2020/01/15/scrapinglaw3/
• R Documentation, "robotstxt package", https://www.rdocumentation.org/packages/robotstxt/versions/0.7.13

Slide 36

Slide 36 text

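References
• Motohiro Ishida, Daisuke Ichikawa, Shinya Uryu, Hiroaki Yutani, "Rによるスクレイピング入門" (Introduction to Scraping with R), C&R Institute, https://www.amazon.co.jp/dp/486354216X
• Yuya Matsumura, Hiroaki Yutani, Kazuhiro Maeda, Yasunori Kinosada, "RユーザのためのRStudio[実践]入門" (A Practical Introduction to RStudio for R Users: The World of Modern Analysis Workflows with tidyverse), Gijutsu-Hyoron, https://www.amazon.co.jp/dp/4774198536
• "Intro to {polite} Web Scraping of Soccer Data with R!", https://ryo-n7.github.io/2020-05-14-webscrape-soccer-data-with-R/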

Slide 37

Slide 37 text

Appendix

Slide 38

Slide 38 text

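Appendix: What is etiquette?
Etiquette (from the French étiquette): manners. In French the word originally meant a tag or label, such as a price tag or a luggage tag. It also referred to the ticket that told those invited to court how to behave, from which it came to mean court ceremony. By extension it refers to the various manners of everyday social life.
https://kotobank.jp/word/%E3%82%A8%E3%83%81%E3%82%B1%E3%83%83%E3%83%88-445121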

Slide 39

Slide 39 text

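Appendix: The Okazaki Municipal Central Library incident
"The Okazaki Municipal Central Library incident is a case in which, around March 2010, access failures occurred in the catalog search system of the Okazaki municipal library and one of its users was arrested. (...) What the man had actually been doing was running a crawler he had written himself, because he was dissatisfied with the usability of the catalog search system, in order to retrieve book information from it. (...) After 20 days of detention and questioning, on June 14 prosecution was suspended on the grounds that no strong intent to obstruct business could be found, but experts and engineers have questioned the legitimacy of the long detention and whether an arrest was necessary in the first place." (Wikipedia)
https://ja.wikipedia.org/wiki/%E5%B2%A1%E5%B4%8E%E5%B8%82%E7%AB%8B%E4%B8%AD%E5%A4%AE%E5%9B%B3%E6%9B%B8%E9%A4%A8%E4%BA%8B%E4%BB%B6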

Slide 40

Slide 40 text

Appendix: What is a user agent?
A user agent is the program used to access a website, or the string used to identify such a program. *1

Example: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36 *2

*1 https://www.irep.co.jp/knowledge/glossary/detail/id=10210/
*2 https://qiita.com/nightyknite/items/b2590a69f2e0135756dc

Slide 41

Slide 41 text

enjoy!!