Python Study 9 - Etc Library

AhnSeongHyun (@SH84AHN) Introduction to other useful libraries: requests, bs4, scrapy, pycrypto 1

Today's topics 2
Requests, Scrapy, pycrypto, BS4, and an introduction to other sites

Requests 3
urllib's functionality was judged to be fragmented, so requests was built as a more convenient URL request library.
http://docs.python-requests.org/en/latest
There is a function for each HTTP method; just pass it a URL.
> pip install requests
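
A minimal sketch of the one-function-per-method idea, assuming the requests package is installed and Python 3. The requests are only prepared, never sent, so nothing touches the network. The example.com URL is a placeholder and does not come from the slides.

```python
import requests

# Placeholder URL for illustration only (an assumption, not from the slides).
URL = "http://example.com/resource"

# Each HTTP method has a same-named helper that sends the request right away:
#   requests.get(URL), requests.post(URL), requests.put(URL), ...
method_helpers = [requests.get, requests.post, requests.put,
                  requests.delete, requests.head, requests.options]

# To stay offline, build the equivalent requests without sending them.
prepared = [requests.Request(helper.__name__.upper(), URL).prepare()
            for helper in method_helpers]
methods = [p.method for p in prepared]
print(methods)
```

In normal use you would call the helper directly, for example `r = requests.get(URL)`.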

Requests 4
When parameters need to be sent, pass them as a dictionary (dict(), {}) through the params argument.
http://apis.daum.net/search/blog?q=daum&apikey=DAUM_SEARCH_DEMO_APIKEY
For POST, pass a dictionary through the data parameter.
Adding headers
{'API_VERSION': '2.0', 'Content-Length': '0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'User-Agent': 'python-requests/2.3.0 CPython/2.7.6 Windows/7'}
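
The following sketch shows how the dictionaries end up in the request, assuming requests is installed. It prepares the requests without sending them. The POST field and the choice to send API_VERSION as a custom header are illustrative assumptions.

```python
import requests

# GET: the dict passed as params becomes the query string.
get_req = requests.Request(
    "GET",
    "http://apis.daum.net/search/blog",
    params={"q": "daum", "apikey": "DAUM_SEARCH_DEMO_APIKEY"},
).prepare()
print(get_req.url)

# POST: the dict passed as data becomes a form-encoded body.
# The field name and the custom header are illustrative only.
post_req = requests.Request(
    "POST",
    "http://apis.daum.net/search/blog",
    data={"q": "daum"},
    headers={"API_VERSION": "2.0"},
).prepare()
print(post_req.body)
print(post_req.headers["API_VERSION"])
```

When sending for real, the same keyword arguments go to the helpers: `requests.get(url, params=...)` and `requests.post(url, data=..., headers=...)`.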

Requests 5
r = Response Object
<class 'requests.models.Response'>
['__attrs__', '__bool__', '__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__getstate__', '__hash__', '__init__', '__iter__', '__module__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_content', '_content_consumed', 'apparent_encoding', 'close', 'connection', 'content', 'cookies', 'elapsed', 'encoding', 'headers', 'history', 'is_redirect', 'iter_content', 'iter_lines', 'json', 'links', 'ok', 'raise_for_status', 'raw', 'reason', 'request', 'status_code', 'text', 'url']

Requests 6
UTF-8
{'content-length': '6049', 'content-language': 'ko-KR', 'server': 'nginx/1.4.1', 'connection': 'keep-alive', 'cache-control': 'no-cache', 'date': 'Thu, 26 Jun 2014 05:02:18 GMT', 'content-type': 'text/json;charset=UTF-8'}
200
[]
False
r.text: the response body as text, decoded with the response encoding.
r.content: the response body as bytes; use it when the body is binary data.
r.json(): when the response body is JSON, parses it into a dict.
r.raw: the raw socket response; requires stream=True in the request call.
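
A small offline sketch of these accessors, assuming requests is installed. The Response is built by hand and its private `_content` attribute is set directly, which is a testing shortcut rather than normal usage. The JSON body is made up for illustration.

```python
import requests

# Build a Response by hand so the example runs offline. Setting the private
# _content attribute is a shortcut for demonstration, not normal usage.
r = requests.models.Response()
r.status_code = 200
r.encoding = "utf-8"
r._content = '{"channel": {"title": "검색"}}'.encode("utf-8")  # made-up body

print(r.status_code)    # HTTP status code
print(r.ok)             # True when the status code is below 400
print(type(r.content))  # bytes: the undecoded body
print(type(r.text))     # str: the body decoded with r.encoding
print(r.json())         # dict parsed from the JSON body
```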

Requests 7
Specifying a timeout
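
One way to use the timeout argument is sketched below, assuming requests is installed. The helper name `fetch_or_none`, the injectable `get` parameter, and the `always_slow` stand-in are hypothetical and exist only so the example runs without a network.

```python
import requests

def fetch_or_none(url, seconds=3.0, get=requests.get):
    """Return the response, or None if the server does not answer in time.

    `get` is injectable only so the example can be exercised offline.
    """
    try:
        return get(url, timeout=seconds)
    except requests.exceptions.Timeout:
        return None

def always_slow(url, timeout):
    # Stand-in for a slow server: behaves as if the timeout expired.
    raise requests.exceptions.Timeout("no response within %s s" % timeout)

result = fetch_or_none("http://example.com/", seconds=0.5, get=always_slow)
print(result)
```

Without a timeout, a request can wait indefinitely on an unresponsive server, so passing one is usually worthwhile.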

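Requests + Auth 8
HTTP Basic Auth is supported out of the box.
Pass auth=HTTPBasicAuth(id, pw) to the request function. HTTPBasicAuth can be omitted by passing an (id, pw) tuple instead.
The older approach (urllib2) required creating an HTTPBasicAuthHandler() and installing it, which is cumbersome.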
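
The two spellings can be compared offline, assuming requests is installed. The URL and credentials below are placeholders.

```python
import requests
from requests.auth import HTTPBasicAuth

URL = "http://example.com/protected"  # placeholder URL, not from the slides

# Explicit auth object and the tuple shorthand, prepared but not sent.
explicit = requests.Request("GET", URL, auth=HTTPBasicAuth("user", "pass")).prepare()
shorthand = requests.Request("GET", URL, auth=("user", "pass")).prepare()

print(explicit.headers["Authorization"])
print(explicit.headers["Authorization"] == shorthand.headers["Authorization"])
```

Both forms produce the same Authorization header, so `requests.get(URL, auth=("user", "pass"))` is enough in everyday code.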

Requests + Auth 9
Various other authentication schemes are provided.
- Digest authentication
- OAuth1 authentication
- OAuth2 is not officially supported yet.

BS4(BeautifulSoup4) 10
HTML parsing library
http://coreapython.hosting.paran.com/etc/beautifulsoup4.html
- Problem with existing libraries: the way you search. HTMLParser is event-based, which is inconvenient.
> pip install beautifulsoup4
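
To illustrate why the event-based style feels inconvenient, here is a sketch that pulls a page title with the standard library parser. It uses Python 3's `html.parser` (the slides date from the Python 2 era, where the module was called `HTMLParser`). The class name `TitleParser` is made up for this example.

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text inside <title> by tracking parser events."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Text arrives as a separate event, so state must be kept by hand.
        if self.in_title:
            self.title += data

parser = TitleParser()
parser.feed("<html><head><title>The Dormouse's story</title></head></html>")
print(parser.title)
```

Three callbacks and a state flag are needed for one value. With BeautifulSoup the same lookup is a single attribute access.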

BS4(BeautifulSoup4) 11
<title>The Dormouse's story</title>
title
The Dormouse's story
[u'title']
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
[[u'title'], [u'story'], [u'story']]
- Access the tag you are looking for as a member variable (attribute).
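
The slide's code is not in the transcript. The sketch below is a guess at calls that would produce output of this shape, assuming beautifulsoup4 is installed. The HTML is a shortened version of the sample document from the Beautiful Soup documentation, and Python 3 prints the lists without the `u` prefix.

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a> and
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>.</p>
<p class="story">...</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

print(soup.title)           # the whole <title> tag
print(soup.title.name)      # the tag name
print(soup.title.string)    # the text inside the tag
print(soup.p["class"])      # class attribute of the first <p>
print(soup.a)               # the first <a> tag
print([p["class"] for p in soup.find_all("p")])  # class of every <p>
```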

BS4(BeautifulSoup4) 12
- Creating from an HTML string
- Creating from a URL: fetch the HTML at the URL with urllib, requests, etc.
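
Both ways of creating a soup can be sketched as follows, assuming requests and beautifulsoup4 are installed. The function names are made up for this example, and `soup_from_url` is defined but not called because it needs network access.

```python
import requests
from bs4 import BeautifulSoup

def soup_from_string(html):
    """Parse HTML that is already in memory."""
    return BeautifulSoup(html, "html.parser")

def soup_from_url(url):
    """Download the page first, then parse it (requires network access)."""
    response = requests.get(url, timeout=10)
    return BeautifulSoup(response.text, "html.parser")

soup = soup_from_string("<p>hello</p>")
print(soup.p.string)
```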

BS4(BeautifulSoup4) 13
BeautifulSoup object: soup, tag, name, property, NavigableString
- One soup holds many Tag-typed variables.
- The name of a Tag variable is the tag name itself: <b> => soup.b
- Each Tag has one name, which is the tag name.
- Each Tag can have several attributes: class, id, style…
- NavigableString refers to the string inside a tag.
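
The object model above can be checked with a short sketch, assuming beautifulsoup4 is installed. The one-line HTML snippet is made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

soup = BeautifulSoup('<b class="boldest" id="x1">Extremely bold</b>', "html.parser")

tag = soup.b                     # <b> => soup.b
print(isinstance(tag, Tag))      # the member is a Tag object
print(tag.name)                  # its name is the tag name
print(tag.attrs)                 # all attributes as a dict
print(tag["id"])                 # one attribute by key
print(isinstance(tag.string, NavigableString))  # the text inside the tag
```

Note that `class` can hold several values, so BeautifulSoup returns it as a list.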

BS4(BeautifulSoup4) 14

BS4(BeautifulSoup4) 15
- Access by tag name: convenient, but it returns only the first tag with that name!
  Gets the first img tag and the first a tag in the current document.
- To get every matching tag, use find_all('tag name').
  Gets all a tags (links) in the current document.
find_all(name, attrs, recursive, text, limit, **kwargs)
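
The difference between attribute access and find_all can be sketched like this, assuming beautifulsoup4 is installed. The HTML fragment is invented for the example.

```python
from bs4 import BeautifulSoup

html = """
<div>
  <a href="/first">first</a> <img src="/a.png">
  <a href="/second">second</a> <img src="/b.png">
  <a href="/third">third</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a["href"]        # attribute access: first <a> only
first_image = soup.img["src"]      # attribute access: first <img> only
all_links = [a["href"] for a in soup.find_all("a")]            # every <a>
two_links = [a["href"] for a in soup.find_all("a", limit=2)]   # stop after two

print(first_link, first_image)
print(all_links)
print(two_links)
```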

BS4(BeautifulSoup4) 16
- Print the meta tags inside the head tag that have both a name and a content attribute.
HTML, result
- Keyword arguments search by attribute; the text argument searches by string.
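
A possible version of that query is sketched below, assuming a reasonably recent beautifulsoup4. The HTML is invented. Because `name` is already the first positional parameter of find_all, the attribute filter goes through `attrs`. The string search uses the `string` argument, the newer name for `text` in current Beautiful Soup releases.

```python
from bs4 import BeautifulSoup

html = """
<html><head>
<meta charset="utf-8">
<meta name="description" content="A sample page">
<meta name="viewport">
<meta property="og:title" content="Sample">
<title>Sample</title>
</head><body><p>Sample</p><p>Other</p></body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# True means "the attribute must be present", whatever its value.
metas = soup.head.find_all("meta", attrs={"name": True, "content": True})
pairs = [(m["name"], m["content"]) for m in metas]
print(pairs)

# Any other keyword is treated as an attribute filter.
og = soup.find_all(property="og:title")
print(len(og))

# Searching by string returns the matching text nodes, not tags.
strings = soup.find_all(string="Sample")
print(len(strings))
```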

BS4(BeautifulSoup4) Example 17
- Example: getting a post's title, body, and images

BS4(BeautifulSoup4) Example 18
- Use a Firefox tool to collect the tag information for each area of the page

BS4(BeautifulSoup4) Example 19
- Getting the post title
<h1 class="entry-title"> <a href="/137"> 라디오스타, 김유정 너무 일찍 철 들어 버리다 </a> </h1>

BS4(BeautifulSoup4) Example 20
- Getting the post body
<div class="tt_article_useless_p_margin">
<p style="TEXT-ALIGN: justify; LINE-HEIGHT: 2"> <span style="FONT-SIZE: 12pt; FONT-FAMILY: NanumGothic"> 예능 프로그램 진짜사나이에 헨리가 나왔을 때 저는 솔직히 … </span> </p>
<p style="TEXT-ALIGN: justify; LINE-HEIGHT: 2"></p>
<p style="TEXT-ALIGN: justify; LINE-HEIGHT: 2"></p>
….
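
Given markup like the title and body fragments shown in this example, the extraction might look like the sketch below, assuming beautifulsoup4 is installed. The HTML is stitched together from the two fragments and trimmed. The slide's actual code is not in the transcript, so the variable names and the choice to join paragraphs with newlines are assumptions.

```python
from bs4 import BeautifulSoup

html = """
<h1 class="entry-title"> <a href="/137"> 라디오스타, 김유정 너무 일찍 철 들어 버리다 </a> </h1>
<div class="tt_article_useless_p_margin">
<p style="TEXT-ALIGN: justify; LINE-HEIGHT: 2"><span style="FONT-SIZE: 12pt; FONT-FAMILY: NanumGothic"> 예능 프로그램 진짜사나이에 헨리가 나왔을 때 저는 솔직히 … </span></p>
<p style="TEXT-ALIGN: justify; LINE-HEIGHT: 2"></p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Title: the link text inside <h1 class="entry-title">.
title = soup.find("h1", class_="entry-title").a.get_text(strip=True)

# Body: text of every non-empty <p> inside the article container.
body_div = soup.find("div", class_="tt_article_useless_p_margin")
paragraphs = [p.get_text(strip=True) for p in body_div.find_all("p")]
body = "\n".join(text for text in paragraphs if text)

print(title)
print(body)
```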

BS4(BeautifulSoup4) Example 21
- Saving the images in the body
{'content-length': '142068', 'via': '1.1 Wcache(1.1)', 'content-disposition': 'inline; filename="10225.jpg"', 'age': '36908', 'expires': 'Fri, 25 Jul 2014 22:52:11 GMT', 'server': 'Apache', 'last-modified': 'Wed, 25 Jun 2014 22:26:01 GMT', 'connection': 'keep-alive', 'date': 'Wed, 25 Jun 2014 22:52:11 GMT', 'content-type': 'image/jpeg'}
{'content-length': '139046', 'via': '1.1 Wcache(1.1)', 'content-disposition': 'inline; filename="12335.jpg"', 'age': '36985', 'expires': 'Fri, 25 Jul 2014 22:50:55 GMT', 'server': 'Apache', 'last-modified': 'Wed, 25 Jun 2014 22:26:02 GMT', 'connection': 'keep-alive', 'date': 'Wed, 25 Jun 2014 22:50:54 GMT', 'content-type': 'image/jpeg'}
{'content-length': '781', 'via': '1.1 Wcache(1.1)', 'accept-ranges': 'bytes', 'expires': 'Fri, 27 Jun 2014 02:30:54 GMT', 'server': 'dws', 'last-modified': 'Mon, 03 Nov 2008 07:05:34 GMT', 'connection': 'keep-alive', 'etag': '"1f04b02-30d-45ac392bfab80"', 'date': 'Thu, 26 Jun 2014 02:30:55 GMT', 'content-type': 'image/gif', 'age': '23785'}
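
The saving step can be sketched with the standard library alone. The helper names, the fallback file name, and the fake image bytes are assumptions. In the real flow the headers and bytes would come from something like `r = requests.get(img["src"])`, using `r.headers` and `r.content`.

```python
import os
import re
import tempfile

def filename_from_headers(headers, default="image.bin"):
    """Pick a file name from a Content-Disposition header, if present."""
    match = re.search(r'filename="([^"]+)"', headers.get("content-disposition", ""))
    return os.path.basename(match.group(1)) if match else default

def save_image(headers, content, directory):
    """Write the raw bytes to directory and return the resulting path."""
    path = os.path.join(directory, filename_from_headers(headers))
    with open(path, "wb") as f:  # binary mode: image data is bytes
        f.write(content)
    return path

# Offline stand-in for a downloaded image: header values copied from the
# first response above, with a few fake bytes in place of r.content.
headers = {"content-disposition": 'inline; filename="10225.jpg"',
           "content-type": "image/jpeg"}
saved = save_image(headers, b"\xff\xd8fake-jpeg-bytes", tempfile.mkdtemp())
print(os.path.basename(saved))
```

The third response above has no content-disposition header, which is why a fallback name is needed.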

Scrapy 22
A web crawling framework: it crawls websites and extracts data from their pages.
http://doc.scrapy.org/en/latest/topics/leaks.html
> pip install Scrapy
- The steps for one crawl

Scrapy 23
- Creating a project
scrapy startproject [project_name]

Scrapy 24
- Designing items.py
- An Item is a container for scraped data, similar to a Python dictionary.
- Item: scrapy.Item
- Item attributes: scrapy.Field
Modeling
- Choose the data to collect and model the item accordingly.
- Get title, link, and description from dmoz.org.
/tutorial/tutorial/items.py

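Scrapy 25
- Creating a Spider
- Spider: a user-written class used to scrape information from a domain.
- It defines the list of URLs to download, how to follow links, and how to extract content from the pages.
How to write one
- Subclass scrapy.Spider.
- Three required attributes/methods:
name: the identifier; it must be unique, so give each spider a different name.
start_urls: the list of URLs.
parse(): called with the Response object for each URL. It parses the response data and extracts data, processing the Response and converting the scraped data into item objects.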

Scrapy 26
- Example

Scrapy 27
- Crawling
From the project directory: scrapy crawl [spider_name]
- How it works internally:
  scrapy creates one Request(callback=parse) for each entry in start_urls.
  Each Request is executed.
  A Response object is returned and passed to parse().

Scrapy 28
ash84 at ubuntu in ~/study/scrapy_test/tutorial
$ scrapy crawl dmoz
2014-06-27 11:30:17+0900 [dmoz] INFO: Spider opened
2014-06-27 11:30:17+0900 [dmoz] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-06-27 11:30:17+0900 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2014-06-27 11:30:17+0900 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2014-06-27 11:30:18+0900 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
filename : Resources
2014-06-27 11:30:18+0900 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
filename : Books
2014-06-27 11:30:18+0900 [dmoz] INFO: Closing spider (finished)
2014-06-27 11:30:18+0900 [dmoz] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 516, 'downloader/request_count': 2, 'downloader/request_method_count/GET': 2, 'downloader/response_bytes': 16515, 'downloader/response_count': 2, 'downloader/response_status_count/200': 2, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2014, 6, 27, 2, 30, 18, 119539), 'log_count/DEBUG': 4, 'log_count/INFO': 7, 'response_received_count': 2, 'scheduler/dequeued': 2, 'scheduler/dequeued/memory': 2, 'scheduler/enqueued': 2, 'scheduler/enqueued/memory': 2, 'start_time': datetime.datetime(2014, 6, 27, 2, 30, 17, 59434)}
2014-06-27 11:30:18+0900 [dmoz] INFO: Spider closed (finished)
- The part where the dmoz spider we wrote actually runs
- Crawling is performed on the specified URLs

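Scrapy 29
We have succeeded in fetching the web pages we want. What next? Keep only what we need from what was fetched.
Extracting data from web pages
Scrapy Selector: based on XPath and CSS
Expression: description
/html/head/title: the title tag inside head inside html
/html/head/title/text(): the text inside the title tag inside head inside html
//td: all td tags in the whole document
//div[@class="mine"]: only the div tags whose class is mine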

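Scrapy 30
- The Selector class
- It performs the actual parsing and also represents a node.
- Four basic methods:
xpath(): returns a list of Selectors chosen by the XPath expression.
css(): returns a list of Selectors chosen by the CSS expression.
extract(): returns the selected data as unicode strings.
re(): returns a list of unicode strings extracted by the regular expression.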

Scrapy 31
Extracting data
- Use XPath on response.body to extract what is needed.
- To work out the XPath, a person has to look at the HTML. Use a Firefox extension.
Example) link title, link URL, text

Scrapy 32
In the parse() function of the existing dmoz_spider.py, replace the code that saves the HTML file with parsing code that uses Selector.

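Scrapy 33
Putting data into Item objects
- An Item object is a custom, dictionary-like Python object.
- Access fields like item['title'].
Convert the data extracted in the spider into Item objects and return them.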

Scrapy 34
scrapy crawl dmoz
{"desc": [], "link": ["/Computers/"], "title": ["Computers"]},
{"desc": [], "link": ["/Computers/Programming/"], "title": ["Programming"]},
{"desc": [], "link": ["/Computers/Programming/Languages/"], "title": ["Languages"]},
{"desc": [], "link": ["/Computers/Programming/Languages/Python/"], "title": ["Python"]},
scrapy crawl dmoz -o items.json
- Extracted data => item => items.json file
- For more complex projects, use an Item Pipeline.

pycrypto 35
A package that bundles secure hash functions (SHA256, etc.) and various encryption algorithms (AES, DES, RSA, ...).
https://launchpad.net/products/pycrypto
> pip install pycrypto

pycrypto 36

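Summary 37
- Use requests rather than urllib2. OAuth2 is not supported yet; OAuth1, Basic, and Digest authentication are supported.
- HTML parsing: choose the tool according to the parsing target and its nature.
  The targets have different structures: BS4. ex) parsing data from each of several shopping malls
  The targets share the same structure: Scrapy. ex) parsing pages within a particular blog system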

http://www.pythonweekly.com 38

https://twitter.com/pypi 39

https://www.facebook.com/groups/pythonkorea 40

http://ask.python.kr/questions 41

End 42
