Slide 1

Slide 1 text

Web Crawlers from the Very Beginning: Back to the Basic
Junbum Lee (이준범), jun@beomi.net

Slide 2

Slide 2 text

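Speaker introduction
Junbum Lee (이준범) ( jun@beomi.net )
 - DjangoGirls Seoul
 - Writes the series <나만의 웹 크롤러 만들기> (Building My Own Web Crawler)
 - PyCon tutorial: <나만의 웹 크롤러 만들기> (Building My Own Web Crawler)
 - Python + Django = <3
 - Intern at 우아한 테크캠프 (Woowa Tech Camp, Woowa Brothers)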

Slide 3

Slide 3 text

Crawling?

Slide 4

Slide 4 text

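Today's agenda
• The development environment for crawling
• Making good use of Chrome; what is a CSS Selector?
• Crawling the PyCon talk sessions
• Logging in to ONOFFMIX and crawling
• Crawling with […]
• ( ) Crawling without a screen
• Crawling periodically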

Slide 5

Slide 5 text

What environment do we work in?
• Python 3.6.x (3.4.x at the very lowest; using Python 2 will be painful)
• Requests 2.18.x (any 2.1x.x or later version is fine)
• beautifulsoup4 4.6.x (any version from 4.5 up works)
• selenium 3.4.x (always use the latest version of selenium)
• Chrome v60 or later (Headless mode is supported from version 60 on)

    pip install requests bs4 selenium

Slide 6

Slide 6 text

What is Requests?
Python HTTP Requests for Humans

    >>> import requests
    >>> r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
    >>> r.status_code
    200
    >>> r.headers['content-type']
    'application/json; charset=utf8'
    >>> r.encoding
    'utf-8'
    >>> r.text
    '{"type":"User"...'

Slide 7

Slide 7 text

What is BeautifulSoup?
Designed for quick turnaround projects like screen-scraping
• Turns the data fetched with Requests into objects that Python understands
• Keeps the HTML DOM structure as it is!

    from bs4 import BeautifulSoup

Slide 8

Slide 8 text

HTML DOM?
Sample page text: "Title", "The biggest heading", "PyCon 2017 KR"
DOM tree labels: TITLE, HEAD, DIV, H1, BODY, HTML
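To make the tree idea concrete, here is a minimal sketch that uses only Python's standard-library `html.parser`. The sample markup is an assumption pieced together from the tag labels on the slide, and `TreeRecorder` and `dom_outline` are hypothetical helper names, not part of any library used in the talk.

```python
from html.parser import HTMLParser

# Assumed sample document, built from the tag names labelled on the slide.
SAMPLE_HTML = (
    "<html><head><title>Title</title></head>"
    "<body><div><h1>The biggest heading</h1></div></body></html>"
)


class TreeRecorder(HTMLParser):
    """Hypothetical helper: records each opening tag with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.nodes = []  # (depth, tag) pairs in document order

    def handle_starttag(self, tag, attrs):
        self.nodes.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


def dom_outline(html):
    """Return the (depth, tag) outline of an HTML string."""
    recorder = TreeRecorder()
    recorder.feed(html)
    return recorder.nodes


# Print the outline as an indented tree.
for depth, tag in dom_outline(SAMPLE_HTML):
    print("  " * depth + tag)
```

Each level of indentation in the printed outline corresponds to one parent-child step in the DOM, which is the relationship CSS selectors walk along.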

Slide 9

Slide 9 text

Using the Chrome Inspector

Slide 10

Slide 10 text

What is a CSS Selector?

    body > div.frontpage > div.onsky > nav

Labels on the slide: HTML TAG (body, nav) and CSS Class (.frontpage, .onsky)
• A CSS class is written .~~~
• An ID is written #~~~
• A direct child is written >
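One way to read such a selector is to split it at each `>` and look at the tag, ID, and classes of every step. The sketch below does that with the standard library only; `describe_selector` is a hypothetical helper that handles just the subset shown on the slide (tag names, `.class`, `#id`, and `>`), not the full CSS selector grammar.

```python
import re


def describe_selector(selector):
    """Split a child-combinator selector into tag / id / classes steps.

    Handles only tag names, `.class`, `#id`, and the `>` combinator.
    """
    steps = []
    for part in selector.split(">"):
        part = part.strip()
        tag = re.match(r"[A-Za-z][A-Za-z0-9]*", part)
        ids = re.findall(r"#([\w-]+)", part)
        classes = re.findall(r"\.([\w-]+)", part)
        steps.append({
            "tag": tag.group(0) if tag else None,
            "id": ids[0] if ids else None,
            "classes": classes,
        })
    return steps


for step in describe_selector("body > div.frontpage > div.onsky > nav"):
    print(step)
```

Reading the steps left to right gives the path from the outermost element down to the one you want to select.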

Slide 11

Slide 11 text

Crawling a simple web page

Slide 12

Slide 12 text

Let's fetch the PyCon session list
https://www.pycon.kr/2017/program/list/

Slide 13

Slide 13 text


Slide 14

Slide 14 text

    body > div.container > div:nth-child(1) > div.col-md-9.content > ul:nth-child(3) > li:nth-child(16) > a

nth-child is not supported by BeautifulSoup's select() (in the 4.6.x release used in this talk).

    body > div.container > div:nth-of-type(1) > div.col-md-9.content > ul:nth-of-type(2) > li:nth-of-type(16) > a

Slide 15

Slide 15 text

    div > div.col-md-9.content > ul > li > a

    body > div.container > div:nth-of-type(1) > div.col-md-9.content > ul:nth-of-type(2) > li:nth-of-type(16) > a

Slide 16

Slide 16 text

nth-child vs nth-of-type
• NotImplementedError: Only the following pseudo-classes are implemented: nth-of-type.
• div:nth-child(5): the 5th element among those that share the same parent. If that element is not a div, nothing matches.
• div:nth-of-type(5): the 5th div among those that share the same parent.
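The difference between the two pseudo-classes can be sketched without any HTML library by treating a parent's children as a plain list of tag names. The sibling list below is a hypothetical example, and `nth_child` / `nth_of_type` are illustrative helpers rather than BeautifulSoup functions.

```python
def nth_child(siblings, tag, n):
    """Index matched by `tag:nth-child(n)` among siblings, or None."""
    # The n-th child counts every element; it matches only if it is `tag`.
    if 1 <= n <= len(siblings) and siblings[n - 1] == tag:
        return n - 1
    return None


def nth_of_type(siblings, tag, n):
    """Index matched by `tag:nth-of-type(n)` among siblings, or None."""
    # Only siblings with the same tag name are counted.
    same_tag = [i for i, name in enumerate(siblings) if name == tag]
    if 1 <= n <= len(same_tag):
        return same_tag[n - 1]
    return None


# Hypothetical children of one parent: a heading followed by two lists.
siblings = ["h3", "ul", "ul"]

print(nth_child(siblings, "ul", 3))    # third child, which is a ul
print(nth_of_type(siblings, "ul", 2))  # second ul, the same element
print(nth_child(siblings, "ul", 1))    # first child is an h3, so no match
```

In this layout `ul:nth-child(3)` and `ul:nth-of-type(2)` point at the same element, which is the kind of index change needed when rewriting a copied selector from nth-child to nth-of-type.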

Slide 17

Slide 17 text

    import requests
    from bs4 import BeautifulSoup as bs

    # Fetch the page source with an HTTP GET request.
    req = requests.get('https://www.pycon.kr/2017/program/list/')
    # Take the HTTP body as text (not the headers).
    html = req.text
    # Parse the HTML into a Soup object that Python can work with.
    soup = bs(html, 'html.parser')
    # Select every matching element with a CSS selector (the result is iterable).
    session_list = soup.select(
        'body > div.container > div:nth-of-type(1) '
        '> div.col-md-9.content > ul:nth-of-type(1) '
        '> li:nth-of-type(16) > a'
    )
    for session in session_list:
        # Print only the text content of each HTML DOM element.
        print(session.text)

Slide 18

Slide 18 text

    import requests
    from bs4 import BeautifulSoup as bs

    # Fetch the page source with an HTTP GET request.
    req = requests.get('https://www.pycon.kr/2017/program/list/')
    # Take the HTTP body as text (not the headers).
    html = req.text
    # Parse the HTML into a Soup object that Python can work with.
    soup = bs(html, 'html.parser')
    # Select every matching element with a CSS selector (the result is iterable).
    session_list = soup.select('div > div.col-md-9.content > ul > li > a')
    for session in session_list:
        # Print only the text content of each HTML DOM element.
        print(session.text)

Slide 19

Slide 19 text


Slide 20

Slide 20 text

When login is required

Slide 21

Slide 21 text

Fetching the ONOFFMIX registration list

Slide 22

Slide 22 text

How do we log in?

Slide 23

Slide 23 text

Taking the ONOFFMIX login apart
http://onoffmix.com/account/login

Slide 24

Slide 24 text

Using the Session in requests
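A Session matters for login because the server identifies a logged-in user by a cookie, and the session object sends that cookie back on every later request. The sketch below imitates only that one idea with the standard library. `TinyCookieJar` and the `sessionid` cookie are hypothetical; the real `requests.Session` also tracks domains, paths, and expiry, which this sketch ignores.

```python
from http.cookies import SimpleCookie


class TinyCookieJar:
    """Hypothetical stand-in for the cookie bookkeeping a Session does."""

    def __init__(self):
        self.cookies = {}

    def store(self, set_cookie_header):
        """Remember the cookie from a response's Set-Cookie header."""
        parsed = SimpleCookie()
        parsed.load(set_cookie_header)
        for name, morsel in parsed.items():
            self.cookies[name] = morsel.value

    def cookie_header(self):
        """Build the Cookie header to send with the next request."""
        return "; ".join(
            "{}={}".format(name, value) for name, value in self.cookies.items()
        )


jar = TinyCookieJar()
# Assumed header a server might send after a successful login POST.
jar.store("sessionid=abc123; Path=/; HttpOnly")
# Every following request carries the cookie, so the server sees a logged-in user.
print(jar.cookie_header())
```

Without this bookkeeping, a separate `requests.get()` after the login POST would arrive with no cookie and be treated as logged out.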

Slide 25

Slide 25 text

    import requests
    from bs4 import BeautifulSoup as bs


    def onoffmix():
        # The Session keeps cookies between requests, so the login persists.
        with requests.Session() as s:
            # Send the User-Agent of a regular browser.
            s.headers = {
                'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) '
                              'AppleWebKit/537.36 (KHTML, like Gecko) '
                              'Chrome/59.0.3071.115 Safari/537.36'
            }
            # Submit the login form (placeholder credentials).
            login = s.post('https://onoffmix.com/account/login', data={
                'email': 'usermail@gmail.com',
                'pw': 'mypassword1234',
                'proc': 'login'
            })
            # Fetch a page that is only available after login.
            html = s.get('http://onoffmix.com/account/event')
            soup = bs(html.text, 'html.parser')
            event_list = soup.select('#eventListHolder > div > ul > li.title > a')
            for event in event_list:
                print(event.text)


    if __name__ == '__main__':
        onoffmix()

Slide 26

Slide 26 text

Fetching data after login

    #eventListHolder > div:nth-child(1) > ul > li.title > a

    #eventListHolder > div > ul > li.title > a

Slide 27

Slide 27 text


Slide 28

Slide 28 text


Slide 29

Slide 29 text

Logging in is too hard
Can't we just use a browser?

Slide 30

Slide 30 text

Let's crawl with Chrome
Selenium + Chrome (v60)

    pip install selenium

https://sites.google.com/a/chromium.org/chromedriver/downloads

Slide 31

Slide 31 text

    /Users/<username>/Downloads/chromedriver

Slide 32

Slide 32 text

    from selenium import webdriver

    driver = webdriver.Chrome('/Users/beomi/Downloads/chromedriver')
    # Wait up to 3 seconds for an element to appear whenever one is looked up.
    driver.implicitly_wait(3)
    # Open the Naver front page.
    driver.get('https://naver.com')

Slide 33

Slide 33 text


Slide 34

Slide 34 text

If you get an error like this

    selenium.common.exceptions.WebDriverException: Message: 'chromedriver' executable needs to be in PATH. Please see https://sites.google.com/a/chromium.org/chromedriver/home

This happens because 'chromedriver' is not registered on PATH.
https://sites.google.com/a/chromium.org/chromedriver/home
Download the latest driver from the address above, unzip it, and specify the exact path to the extracted file.

Slide 35

Slide 35 text

Trying a Naver login

    from selenium import webdriver

    driver = webdriver.Chrome('/Users/username/Downloads/chromedriver')
    driver.implicitly_wait(3)
    driver.get('https://naver.com')
    # Locate the ID field, the password field, and the submit button.
    input_id = driver.find_element_by_css_selector('#id')
    input_pw = driver.find_element_by_css_selector('#pw')
    login_button = driver.find_element_by_css_selector(
        '#frmNIDLogin > fieldset > span > input[type="submit"]')
    # Type the placeholder credentials and submit the form.
    input_id.send_keys('someid')
    input_pw.send_keys('mypassword1234!')
    login_button.click()

Slide 36

Slide 36 text

Crawling the Naver point purchase history

    #_listContentArea > ul > li:nth-child(1) > div > div.item_content > div.info_space > p

    #_listContentArea > ul > li > div > div.item_content > div.info_space > p

https://order.pay.naver.com/home?tabMenu=POINT_TOTAL

Slide 37

Slide 37 text

Checking in the Chrome console beforehand
You can crawl without writing out the whole CSS Selector!
But the Chrome console is JS

Slide 38

Slide 38 text

Let's fetch it

    # Continuing from the code above
    login_button.click()
    driver.get('https://order.pay.naver.com/home?tabMenu=POINT_TOTAL')
    point_list = driver.find_elements_by_css_selector('div.info_space > p')
    for point in point_list:
        print(point.text)
    driver.quit()

Slide 39

Slide 39 text

I don't want a browser window to open

    driver.set_window_position(-10000,0)

But this is not what we really want.
Headless Browser, like Headless Chrome

Slide 40

Slide 40 text

Facebook screenshot
Check your Chrome version at chrome://settings/help
Headless Chrome is available from v60 on.

    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument('headless')
    options.add_argument('window-size=1920x1080')
    driver = webdriver.Chrome('chromedriver', chrome_options=options)
    driver.get('https://facebook.com')
    driver.implicitly_wait(3)
    email = driver.find_element_by_css_selector('input[type=email]')
    password = driver.find_element_by_css_selector('input[type=password]')
    login = driver.find_element_by_css_selector('input[type="submit"]')
    email.send_keys('username@mail.com')
    password.send_keys('ilovepython')
    login.click()
    driver.get_screenshot_as_file('facebook.png')
    driver.quit()

Slide 41

Slide 41 text

I want to keep crawling running 24 hours a day
Cron job + deploying to a server
(We won't do this today.)

Slide 42

Slide 42 text

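I want to crawl once every 10 minutes

    crontab -e

    */10 * * * * /usr/bin/python3 /home/beomi/parser.py

Fields, in order: minute, hour, day, month, weekday, Python location, Python file location
An exact 'time' is a number; 'every N minutes/hours' is */number
Separate each field with one space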
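To see which minutes a field such as `*/10` selects, the rule above can be sketched in a few lines of Python. `field_matches` and `minutes_that_run` are hypothetical helpers that cover only the three forms mentioned on the slide (`*`, a plain number, and `*/N`); real cron also accepts ranges and lists, which are left out here.

```python
def field_matches(field, value):
    """Return True if one cron field matches `value`.

    Supports only `*`, a plain number, and `*/N`.
    """
    if field == "*":
        return True
    if field.startswith("*/"):
        # `*/N` matches every value that is a multiple of N.
        return value % int(field[2:]) == 0
    return int(field) == value


def minutes_that_run(minute_field):
    """List the minutes of an hour at which the job would start."""
    return [m for m in range(60) if field_matches(minute_field, m)]


print(minutes_that_run("*/10"))
print(minutes_that_run("30"))
```

With `*/10` in the minute field and `*` everywhere else, the job starts six times an hour, every hour of every day.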

Slide 43

Slide 43 text

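Appendix: crawling a bit more like a human
• Take breaks. If you pause with time.sleep(3), the server won't get angry.
• Randomize. Pause for a random time, as in time.sleep(2 + random.random() * 4).
• Send the User-Agent of a regular browser, not of a bot.
• On pages with a lot of JavaScript, requests can get caught quickly.
• Respect robots.txt. Server resources are not unlimited.
• Mobile pages are usually easier to crawl than PC pages.
  (For one thing, there is no Flash or ActiveX.)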
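Two of these tips, random pauses and respecting robots.txt, can be combined in a small sketch using only the standard library. The function names, the `USER_AGENT` string, the example.com URLs, and the robots.txt rules are all hypothetical; the actual fetching is passed in as a function so the sketch does not assume any particular HTTP library.

```python
import random
import time
from urllib.robotparser import RobotFileParser

# Hypothetical User-Agent string for this sketch.
USER_AGENT = "Mozilla/5.0 (compatible; example-crawler)"


def polite_delay(base=2.0, jitter=4.0):
    """Seconds to pause: `base` plus a random extra in [0, jitter)."""
    return base + random.random() * jitter


def allowed_urls(robots_lines, urls, user_agent=USER_AGENT):
    """Keep only the URLs that the given robots.txt lines allow."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def crawl_politely(robots_lines, urls, fetch, sleep=time.sleep):
    """Fetch each allowed URL, pausing a random time after every request."""
    for url in allowed_urls(robots_lines, urls):
        fetch(url)
        sleep(polite_delay())


# Assumed robots.txt content and URLs, for illustration only.
rules = ["User-agent: *", "Disallow: /account/"]
urls = [
    "https://example.com/program/list/",
    "https://example.com/account/event",
]
print(allowed_urls(rules, urls))
```

In a real crawler, `fetch` would be something like a function wrapping `requests.get`, and the robots.txt lines would come from the target site rather than a hard-coded list.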

Slide 44

Slide 44 text

QnA

Slide 45

Slide 45 text

Thank you :D