Scraping is a thorny road (スクレイピングは茨の道)
cottondesu
December 17, 2018
Programming
A follow-up to my previous lightning talk (LT), "Scraping with Python + Selenium + Beautiful Soup":
1. Recap
2. Learning from the failure
3. Fixing the source
4. Conclusion
Transcript
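December 15, 2018, kanazawa.rb meetup #76: Scraping is a thorny road
@cotton_desu: a laid-back, easygoing maintenance engineer. Work is split between new development and business-process improvement on one side and maintenance (feature additions, feature changes, bug fixes, etc.) on the other.
Agenda: (1) Recap of the previous LT, (2) Learning from the previous LT's failure, (3) Fixing the source, (4) Conclusion
Recap of the previous LT
Scraping with Python + Selenium + Beautiful Soup
What is scraping?
Scraping
- A software technique for extracting information from websites
- Libraries exist for Perl, Python, Ruby, and JavaScript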
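To make "extracting information from a website" concrete, here is a minimal sketch. It deliberately swaps Beautiful Soup for Python's standard-library `html.parser` so that it runs without third-party packages. The names `CellTextParser`, `extract_cells`, and `SAMPLE_HTML` are made up for this illustration and do not come from the talk.

```python
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Collect the text of every <td> cell in an HTML document."""

    def __init__(self):
        super().__init__()
        self.in_cell = False  # True while the parser is inside a <td>
        self.cells = []       # extracted cell texts, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self.in_cell:
            self.cells[-1] += data


def extract_cells(html):
    """Return the stripped text of each <td> cell found in html."""
    parser = CellTextParser()
    parser.feed(html)
    parser.close()
    return [cell.strip() for cell in parser.cells]


# Hypothetical page fragment, invented for this example.
SAMPLE_HTML = """
<table>
  <tr><td>Fund A</td><td>1,000</td></tr>
  <tr><td>Fund B</td><td>2,500</td></tr>
</table>
"""

cells = extract_cells(SAMPLE_HTML)
print(cells)
```

In a real scraper the HTML would come from the browser or an HTTP response rather than a string literal; the extraction step itself has the same shape.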
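Overall architecture
[Diagram labels: folder (direnv, venv, .py file), Selenium WebDriver, Chrome, login page]
The demo failed spectacularly with an error
Learning from the previous LT's failure
Reviewing the code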
    import os

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options


    def main():
        # Selenium 3 style API, as used on the slide.
        options = Options()
        # Path to Chrome (should become unnecessary once --headless works in the Stable channel)
        options.binary_location = '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome'
        # Enable headless mode (comment out the next line to show the browser window).
        options.add_argument('--headless')
        # Create the Chrome WebDriver object.
        driver = webdriver.Chrome(
            os.environ["CHROMEDRIVER"], chrome_options=options)
        # Login site URL
        driver.get(os.environ["URL"])
        # Login ID
        driver.find_element_by_xpath(
            "//div[@class='sub-content']/dl/dd/div/input").send_keys(
            os.environ["LOGIN_USER_ID"])
        # Login password
        driver.find_element_by_xpath(
            "//div[@class='sub-content']/dl/dd[2]/div/input").send_keys(
            os.environ["LOGIN_USER_PASS"])
        # Press the login button
        driver.find_element_by_name("login").click()
        # Navigate to account management
        driver.find_element_by_xpath("//div[@id='link']/ul/li[10]").click()
        html = driver.page_source
        soup = BeautifulSoup(html, 'html.parser')

        # Labels for units held, acquisition price, base price, and valuation gain/loss
        item_name = ['Units held |', 'Acquisition price |', 'Base price |',
                     'Valuation gain/loss |']
        # Get the investment trust names
        investmentname = []
        investments = soup.find_all('td', class_='mbody', colspan="3")
        for investment in investments:
            investmentname.append(investment.a.text)
        inves_num = 0
        row_num = 0
        various_values = soup.find_all('tr', bgcolor='#eaf4e8', align="right")
        for various_value in various_values:
            # Print the investment trust name
            print(investmentname[inves_num])
            # Print the units held
            print(item_name[row_num], various_value.td.text)
            for other in various_value.td.find_next_siblings("td"):
                row_num = row_num + 1
                # Print the values other than units held
                print(item_name[row_num], other.text)
            row_num = 0
            inves_num = inves_num + 1
        # Delete all cookies.
        # Note: if any step above raises, these cleanup lines never run.
        driver.delete_all_cookies()
        driver.quit()


    if __name__ == "__main__":
        main()
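The display loop above tracks positions by hand with `inves_num` and `row_num`. As a sketch of one alternative, the same pairing can be written with `zip`. Plain lists stand in for the scraped cells here; `ITEM_NAMES`, `format_holdings`, and the sample values are hypothetical and invented for this illustration.

```python
# Hypothetical labels, mirroring the four columns printed by the script.
ITEM_NAMES = ["Units held", "Acquisition price", "Base price", "Valuation gain/loss"]


def format_holdings(fund_names, rows):
    """Pair each fund name with its row, and each label with its cell."""
    lines = []
    for fund_name, row in zip(fund_names, rows):
        lines.append(fund_name)
        for label, value in zip(ITEM_NAMES, row):
            lines.append(f"{label} | {value}")
    return lines


# Made-up sample data standing in for the scraped table.
fund_names = ["Fund A", "Fund B"]
rows = [
    ["100", "10,000", "10,500", "+500"],
    ["50", "20,000", "19,000", "-1,000"],
]
for line in format_holdings(fund_names, rows):
    print(line)
```

Because `zip` stops at the shorter input, a row with fewer cells than labels is silently truncated instead of raising `IndexError`. Whether that is preferable to failing loudly depends on how much the page layout is trusted.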
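On abnormal termination the process keeps running; because the browser is headless, this is hard to notice
Solution
The process must be terminated even when the script ends abnormally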
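The requirement above can be illustrated without a real browser. The sketch below uses a hypothetical `FakeDriver` class (not part of Selenium) whose element lookup always fails, and shows that a `finally` block still shuts the driver down.

```python
class FakeDriver:
    """Hypothetical stand-in for a browser driver; not part of Selenium."""

    def __init__(self):
        self.running = True

    def find_element(self, name):
        # Simulate a page whose layout changed: the element is never found.
        raise LookupError(f"element not found: {name}")

    def quit(self):
        self.running = False


def scrape(driver):
    """Run one scraping step and always shut the driver down afterwards."""
    try:
        driver.find_element("login")
        return True
    except LookupError as exc:
        print(f"scraping failed: {exc}")
        return False
    finally:
        # Runs on success, on a handled error, and on an unhandled one.
        driver.quit()


driver = FakeDriver()
ok = scrape(driver)
print(ok, driver.running)
```

The `finally` clause also runs when an exception type that is not listed in `except` propagates out of `scrape`, which is the case that leaves a headless browser behind when no cleanup is in place.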
Fixing the source
    import os
    import traceback

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.chrome.options import Options


    def main():
        # Selenium 3 style API, as used on the slide.
        driver = None
        try:  # Changed part: the whole flow now runs inside try
            options = Options()
            # Path to Chrome (should become unnecessary once --headless works in the Stable channel)
            options.binary_location = '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome'
            # Enable headless mode (comment out the next line to show the browser window).
            options.add_argument('--headless')
            # Create the Chrome WebDriver object.
            driver = webdriver.Chrome(
                os.environ["CHROMEDRIVER"], chrome_options=options)
            # Login site URL
            driver.get(os.environ["URL"])
            # Login ID
            driver.find_element_by_xpath(
                "//div[@class='sub-content']/dl/dd/div/input").send_keys(
                os.environ["LOGIN_USER_ID"])
            # Login password
            driver.find_element_by_xpath(
                "//div[@class='sub-content']/dl/dd[2]/div/input").send_keys(
                os.environ["LOGIN_USER_PASS"])
            # Press the login button
            driver.find_element_by_name("login").click()
            # Navigate to account management
            driver.find_element_by_xpath("//div[@id='link']/ul/li[10]").click()
            html = driver.page_source
            soup = BeautifulSoup(html, 'html.parser')

            # Labels for units held, acquisition price, base price, and valuation gain/loss
            item_name = ['Units held |', 'Acquisition price |', 'Base price |',
                         'Valuation gain/loss |']
            # Get the investment trust names
            investmentname = []
            investments = soup.find_all('td', class_='mbody', colspan="3")
            for investment in investments:
                investmentname.append(investment.a.text)
            inves_num = 0
            row_num = 0
            various_values = soup.find_all('tr', bgcolor='#eaf4e8', align="right")
            for various_value in various_values:
                # Print the investment trust name
                print(investmentname[inves_num])
                # Print the units held
                print(item_name[row_num], various_value.td.text)
                for other in various_value.td.find_next_siblings("td"):
                    row_num = row_num + 1
                    # Print the values other than units held
                    print(item_name[row_num], other.text)
                row_num = 0
                inves_num = inves_num + 1
        except NoSuchElementException:  # Changed part: report the Selenium error
            print("An error occurred while operating Selenium.")
            traceback.print_exc()
        finally:  # Changed part: cleanup now runs on every exit path
            # driver is still None if Chrome itself failed to start.
            if driver is not None:
                # Delete all cookies
                driver.delete_all_cookies()
                driver.quit()


    if __name__ == "__main__":
        main()
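Conclusion
- Scraping does not always succeed
- Exception handling (try / except / finally) is required
- You always need cleanup that lets the script shut down properly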
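One way to make the "always clean up" rule hard to forget is to wrap the driver in a context manager. This is a sketch under stated assumptions, not the approach shown in the talk: `FakeDriver`, `managed_driver`, and `make_driver` are hypothetical names, and the fake driver only records which cleanup calls were made.

```python
from contextlib import contextmanager


class FakeDriver:
    """Hypothetical stand-in for a browser driver; not part of Selenium."""

    def __init__(self):
        self.events = []

    def delete_all_cookies(self):
        self.events.append("cookies deleted")

    def quit(self):
        self.events.append("quit")


@contextmanager
def managed_driver(factory):
    """Create a driver, hand it to the caller, and always clean it up."""
    driver = factory()  # if this raises, there is nothing to clean up yet
    try:
        yield driver
    finally:
        try:
            driver.delete_all_cookies()
        finally:
            # quit() runs even if deleting cookies fails.
            driver.quit()


created = []


def make_driver():
    """Build a fake driver and remember it so it can be inspected later."""
    driver = FakeDriver()
    created.append(driver)
    return driver


try:
    with managed_driver(make_driver) as driver:
        raise ValueError("page layout changed")  # simulated scraping failure
except ValueError as exc:
    print(f"scraping failed: {exc}")

print(created[0].events)
```

With a real browser, the factory would be whatever function constructs the WebDriver object; the scraping code inside the `with` block then needs no cleanup logic of its own.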