Slide 1

Slide 1 text

LEARNING PYTHON FROM DATA
Mosky

Slide 2

Slide 2 text

THIS SLIDE
• The online version is at https://speakerdeck.com/mosky/learning-python-from-data.
• The examples are at https://github.com/moskytw/learning-python-from-data-examples.

Slide 8

Slide 8 text

MOSKY
• I work at Pinkoi.
• I've taught Python for 100+ hours.
• A speaker at COSCUP 2014, PyCon SG 2014, PyCon APAC 2014, OSDC 2014, PyCon APAC 2013, COSCUP 2014, ...
• The author of the Python packages: MoSQL, Clime, ZIPCodeTW, ...
• http://mosky.tw/

Slide 15

Slide 15 text

SCHEDULE
• Warm-up
• Packages - Install the packages we need.
• CSV - Download a CSV from the Internet and handle it.
• HTML - Parse HTML source code and write a Web crawler.
• SQL - Save data into a SQLite database.
• The End

Slide 16

Slide 16 text

FIRST OF ALL,

Slide 18

Slide 18 text

PYTHON IS AWESOME!

Slide 23

Slide 23 text

2 OR 3?
• Use Python 3!
• But it actually depends on the libs you need.
• https://python3wos.appspot.com/
• We will go ahead with Python 2.7, but I will also introduce the changes in Python 3.

Slide 26

Slide 26 text

THE ONLINE RESOURCES
• The Python Official Doc
  • http://docs.python.org
  • The Python Tutorial
  • The Python Standard Library
• My Past Slides
  • Programming with Python - Basic
  • Programming with Python - Adv.

Slide 30

Slide 30 text

THE BOOKS
• Learning Python by Mark Lutz
• Programming in Python 3 by Mark Summerfield
• Python Essential Reference by David Beazley

Slide 34

Slide 34 text

PREPARATION
• Have you said "hello" to Python?
  • If not, visit http://www.slideshare.net/moskytw/programming-with-python-basic.
  • If yes, open your Python shell.

Slide 35

Slide 35 text

WARM-UP
The things you must know.

Slide 36

Slide 36 text

MATH & VARS
    2 + 3
    2 - 3
    2 * 3
    2 / 3, -2 / 3

    (1+10)*10 / 2

    2.0 / 3

    2 % 3

    2 ** 3

    x = 2
    y = 3
    z = x + y
    print z

    '#' * 10
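Division and print are two places where Python 3 differs from the Python 2.7 code above. A minimal sketch, assuming a Python 3 interpreter; the variable names are only for illustration:

```python
# Python 3: `/` is always true division, `//` is floor division.
true_div = 2 / 3             # a float in Python 3; Python 2 gives 0 for two ints
floor_div = 2 // 3           # 0 in both versions
neg_floor = -2 // 3          # -1: floor division rounds toward negative infinity
total = (1 + 10) * 10 // 2   # 55, kept as an int

x = 2
y = 3
z = x + y
print(z)                     # print is a function in Python 3

bar = '#' * 10               # repeating a string works the same way
print(bar)
```

If you need the Python 2 result of `2 / 3`, write `2 // 3` explicitly.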

Slide 37

Slide 37 text

FOR
    for i in [0, 1, 2, 3, 4]:
        print i

    items = [0, 1, 2, 3, 4]
    for i in items:
        print i

    for i in range(5):
        print i

    chars = 'SAHFI'
    for i, c in enumerate(chars):
        print i, c

    words = ('Samsung', 'Apple', 'HP', 'Foxconn', 'IBM')
    for c, w in zip(chars, words):
        print c, w
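In Python 3, `range`, `enumerate` and `zip` all return lazy objects, so wrap them in `list` or `dict` when you want to look at the pairs. A small sketch with the same data as above; `indexed`, `paired` and `initial_to_word` are illustrative names:

```python
chars = 'SAHFI'
words = ('Samsung', 'Apple', 'HP', 'Foxconn', 'IBM')

# enumerate yields (index, item) pairs.
indexed = list(enumerate(chars))

# zip walks both sequences in step and stops at the shorter one.
paired = list(zip(chars, words))

# A list of pairs converts directly into a dict.
initial_to_word = dict(zip(chars, words))

for i, c in indexed:
    print(i, c)

for c, w in paired:
    print(c, w)
```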

Slide 38

Slide 38 text

IF
    for i in range(1, 10):
        if i % 2 == 0:
            print '{} is divisible by 2'.format(i)
        elif i % 3 == 0:
            print '{} is divisible by 3'.format(i)
        else:
            print '{} is not divisible by 2 nor 3'.format(i)

Slide 39

Slide 39 text

WHILE
    while 1:
        n = int(raw_input('How big a pyramid do you want? '))
        if n <= 0:
            print 'It must be greater than 0: {}'.format(n)
            continue
        break

Slide 40

Slide 40 text

TRY
    while 1:

        try:
            n = int(raw_input('How big a pyramid do you want? '))
        except ValueError as e:
            print 'It must be a number: {}'.format(e)
            continue

        if n <= 0:
            print 'It must be greater than 0: {}'.format(n)
            continue

        break
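The same loop can be wrapped in a function, which makes it easy to try without typing. A sketch in Python 3, where `raw_input` is renamed to `input`; the `ask_size` name and its `read` parameter are assumptions made for this example, not part of the original code:

```python
def ask_size(read=input):
    """Keep asking until `read` returns a positive integer, then return it."""
    while True:
        try:
            n = int(read('How big a pyramid do you want? '))
        except ValueError as e:
            print('It must be a number: {}'.format(e))
            continue

        if n <= 0:
            print('It must be greater than 0: {}'.format(n))
            continue

        return n


# Simulate a user who types a word, then a negative number, then 4.
answers = iter(['abc', '-3', '4'])
size = ask_size(lambda prompt: next(answers))
print(size)
```

Calling `ask_size()` with no argument reads from the keyboard as the slide does.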

Slide 41

Slide 41 text

LOOP ... ELSE
    for n in range(2, 100):
        for i in range(2, n):
            if n % i == 0:
                break
        else:
            print '{} is a prime!'.format(n)
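The `else` branch of a loop runs only when the loop ends without hitting `break`. A small sketch that puts the inner loop into a function; `is_prime` is an illustrative name and the code assumes Python 3:

```python
def is_prime(n):
    """Return True if no i in 2..n-1 divides n."""
    if n < 2:
        return False
    for i in range(2, n):
        if n % i == 0:
            break          # found a divisor, so the else branch is skipped
    else:
        return True        # the loop finished without break
    return False


primes = []
for n in range(2, 30):
    if is_prime(n):
        primes.append(n)

print(primes)
```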

Slide 42

Slide 42 text

A PYRAMID
             *
            ***
           *****
          *******
         *********
        ***********
       *************
      ***************
     *****************
    *******************

Slide 43

Slide 43 text

A FATTER PYRAMID
             *
           *****
         *********
       *************
    *******************
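One possible way to draw both pyramids, if you want to compare with your own answer. It is only a sketch in Python 3: `pyramid_rows` and its `step` parameter are made-up names, and `str.center` does the padding:

```python
def pyramid_rows(n, step=1):
    """Return n centered rows; each row is 2 * step stars wider than the last."""
    base = 1 + 2 * step * (n - 1)       # width of the bottom row
    rows = []
    for i in range(n):
        stars = '*' * (1 + 2 * step * i)
        rows.append(stars.center(base))
    return rows


rows = pyramid_rows(3)
fat_rows = pyramid_rows(3, step=2)

print('\n'.join(pyramid_rows(10)))
```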

Slide 44

Slide 44 text

YOUR TURN!

Slide 45

Slide 45 text

LIST COMPREHENSION
    [
        n for n in range(2, 100)
        if not any(n % i == 0 for i in range(2, n))
    ]

Slide 46

Slide 46 text

PACKAGES
import is important.

Slide 51

Slide 51 text

GET PIP - UN*X
• Debian family
  • # apt-get install python-pip
• Red Hat family
  • # yum install python-pip
• Mac OS X
  • # easy_install pip

Slide 55

Slide 55 text

GET PIP - WIN *
• Follow the steps in http://stackoverflow.com/questions/4750806/how-to-install-pip-on-windows.
• Or just use easy_install to install it. The easy_install should be found at C:\Python27\Scripts\.
• Or find the Windows installer on the Python Package Index.

Slide 60

Slide 60 text

3RD-PARTY PACKAGES
• requests - Python HTTP for Humans
• lxml - Pythonic XML processing library
• uniout - Print the object representation in readable chars.
• clime - Convert module into a CLI program w/o any config.

Slide 61

Slide 61 text

YOUR TURN!

Slide 62

Slide 62 text

CSV
Let's start by making an HTTP request!

Slide 63

Slide 63 text

HTTP GET
    import requests

    #url = 'http://stats.moe.gov.tw/files/school/101/u1_new.csv'
    url = 'https://raw.github.com/moskytw/learning-python-from-data-examples/master/sql/schools.csv'

    print requests.get(url).content

    #print requests.get(url).text

Slide 64

Slide 64 text

FILE
    save_path = 'school_list.csv'

    with open(save_path, 'w') as f:
        f.write(requests.get(url).content)

    with open(save_path) as f:
        print f.read()

    with open(save_path) as f:
        for line in f:
            print line,

Slide 65

Slide 65 text

DEF
    from os.path import basename

    def save(url, path=None):

        if not path:
            path = basename(url)

        with open(path, 'w') as f:
            f.write(requests.get(url).content)
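In Python 3 the downloaded content is `bytes`, so the file has to be opened in binary mode (`'wb'`). The sketch below also lets you try `save` without a network. The `fetch` parameter is an assumption added for this example, and the fallback uses the standard library's `urllib.request.urlopen` instead of requests:

```python
import tempfile
from os.path import basename, join
from urllib.request import urlopen


def save(url, path=None, fetch=None):
    """Write the bytes returned by fetch(url) to path and return the path."""
    if fetch is None:
        # Fallback: download with the standard library.
        def fetch(u):
            with urlopen(u) as response:
                return response.read()

    if not path:
        path = basename(url)   # e.g. 'schools.csv'

    with open(path, 'wb') as f:   # bytes need binary mode in Python 3
        f.write(fetch(url))

    return path


# Try it with a fake fetcher and a temporary directory (no network needed).
demo_dir = tempfile.mkdtemp()
demo_url = 'http://example.com/files/schools.csv'
demo_path = save(demo_url, join(demo_dir, 'schools.csv'),
                 fetch=lambda u: b'id,name\n1,The First\n')
print(demo_path)
```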

Slide 66

Slide 66 text

CSV
    import csv
    from os.path import exists

    if not exists(save_path):
        save(url, save_path)

    with open(save_path) as f:
        for row in csv.reader(f):
            print row

Slide 67

Slide 67 text

+ UNIOUT
    import csv
    from os.path import exists
    import uniout # You want this!

    if not exists(save_path):
        save(url, save_path)

    with open(save_path) as f:
        for row in csv.reader(f):
            print row

Slide 68

Slide 68 text

NEXT
    with open(save_path) as f:
        next(f) # skip the unwanted lines
        next(f)
        for row in csv.reader(f):
            print row

Slide 69

Slide 69 text

DICT READER
    with open(save_path) as f:
        next(f)
        next(f)
        for row in csv.DictReader(f):
            print row

    # We now have a great output. :)
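To see what `next` and `csv.DictReader` do without downloading anything, you can feed them an in-memory file. A sketch in Python 3; the sample rows are invented for this example and are not the real school data:

```python
import csv
import io

# Two junk lines, then a header line, then the data.
sample = io.StringIO(
    'Exported list\n'
    'Updated yearly\n'
    'School Name,County\n'
    'Alpha School,North\n'
    'Beta School,South\n'
)

next(sample)   # skip the unwanted lines
next(sample)

# DictReader takes the next line as the header and maps every row to it.
rows = [dict(row) for row in csv.DictReader(sample)]

for row in rows:
    print(row)
```

With a real file in Python 3, the csv documentation recommends opening it with `open(path, newline='')`.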

Slide 70

Slide 70 text

DEF AGAIN
    def parse_to_school_list(path):
        school_list = []
        with open(path) as f:
            next(f)
            next(f)
            for school in csv.DictReader(f):
                school_list.append(school)

        return school_list[:-2]

Slide 71

Slide 71 text

+ COMPREHENSION
    def parse_to_school_list(path='schools.csv'):
        with open(path) as f:
            next(f)
            next(f)
            school_list = [school for school in csv.DictReader(f)][:-2]

        return school_list

Slide 72

Slide 72 text

+ PRETTY PRINT
    from pprint import pprint

    pprint(parse_to_school_list(save_path))

    # AWESOME!

Slide 73

Slide 73 text

PYTHONIC
    school_list = parse_to_school_list(save_path)

    # hmmm ...

    for school in school_list:
        print school['School Name']

    # It is more Pythonic! :)

    print [school['School Name'] for school in school_list]

Slide 74

Slide 74 text

GROUP BY
    from itertools import groupby

    # You MUST sort it.
    keyfunc = lambda school: school['County']
    school_list.sort(key=keyfunc)

    for county, schools in groupby(school_list, keyfunc):
        for school in schools:
            print '%s %r' % (county, school)
        print '---'
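Why must you sort first? `groupby` only merges neighbours that share a key, so unsorted input gives the same county more than once. A small sketch in Python 3 with invented schools:

```python
from itertools import groupby

school_list = [
    {'School Name': 'Alpha', 'County': 'North'},
    {'School Name': 'Beta', 'County': 'South'},
    {'School Name': 'Gamma', 'County': 'North'},
]

keyfunc = lambda school: school['County']

# Without sorting, 'North' shows up as two separate groups.
unsorted_groups = [county for county, _ in groupby(school_list, keyfunc)]

# After sorting by the same key, each county forms exactly one group.
school_list.sort(key=keyfunc)
names_by_county = {
    county: [school['School Name'] for school in schools]
    for county, schools in groupby(school_list, keyfunc)
}

print(unsorted_groups)
print(names_by_county)
```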

Slide 75

Slide 75 text

DOCSTRING
    '''It contains some useful functions for parsing data from government.'''

    def save(url, path=None):
        '''It saves data from `url` to `path`.'''
        ...

    --- shell ---

    $ pydoc csv_docstring

Slide 76

Slide 76 text

CLIME
    if __name__ == '__main__':
        import clime.now

    --- shell ---

    $ python csv_clime.py
    usage: basename
       or: parse-to-school-list
       or: save [--path]

    It contains some useful functions for parsing data from government.

Slide 77

Slide 77 text

DOC TIPS
    help(requests)

    print dir(requests)

    print '\n'.join(dir(requests))

Slide 78

Slide 78 text

YOUR TURN!

Slide 79

Slide 79 text

HTML
Have fun with the final crawler. ;)

Slide 80

Slide 80 text

LXML
    import requests
    from lxml import etree

    content = requests.get('http://clbc.tw').content
    root = etree.HTML(content)

    print root

Slide 81

Slide 81 text

CACHE
    from os.path import exists

    cache_path = 'cache.html'

    if exists(cache_path):
        with open(cache_path) as f:
            content = f.read()
    else:
        content = requests.get('http://clbc.tw').content
        with open(cache_path, 'w') as f:
            f.write(content)

Slide 82

Slide 82 text

SEARCHING
    head = root.find('head')
    print head

    head_children = head.getchildren()
    print head_children

    metas = head.findall('meta')
    print metas

    title_text = head.findtext('title')
    print title_text

Slide 83

Slide 83 text

XPATH
    titles = root.xpath('/html/head/title')
    print titles[0].text

    title_texts = root.xpath('/html/head/title/text()')
    print title_texts[0]

    as_ = root.xpath('//a')
    print as_
    print [a.get('href') for a in as_]
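If lxml is not installed yet, you can practise the searching calls with the standard library. This sketch swaps in `xml.etree.ElementTree`, which shares `find`, `findall` and `findtext` with lxml but has no `xpath` method and no forgiving HTML parser, so the invented page below is well-formed on purpose:

```python
import xml.etree.ElementTree as ET

content = (
    '<html><head>'
    '<meta charset="utf-8"/>'
    '<meta name="author" content="someone"/>'
    '<title>Example Page</title>'
    '</head><body>'
    '<a href="/about">About</a><a>No link</a>'
    '</body></html>'
)

root = ET.fromstring(content)

head = root.find('head')                 # first matching child
head_children = list(head)               # all direct children
metas = head.findall('meta')             # every matching child
title_text = head.findtext('title')      # text of the first match

# ElementTree supports a small XPath subset through findall.
as_ = root.findall('.//a')
hrefs = [a.get('href') for a in as_ if a.get('href') is not None]

print(title_text)
print(hrefs)
```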

Slide 84

Slide 84 text

MD5
    from hashlib import md5

    message = 'There should be one-- and preferably only one --obvious way to do it.'

    print md5(message).hexdigest()

    # Actually, it has nothing to do with HTML.
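In Python 3, `md5` accepts only bytes, so a text string has to be encoded first. A short sketch, assuming UTF-8 is the encoding you want:

```python
from hashlib import md5

message = 'There should be one-- and preferably only one --obvious way to do it.'

# str -> bytes before hashing; passing a str raises TypeError in Python 3.
digest = md5(message.encode('utf-8')).hexdigest()
print(digest)

# The digest is always 32 hex characters, whatever the input length.
empty_digest = md5(b'').hexdigest()
print(empty_digest)
```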

Slide 85

Slide 85 text

DEF GET
    from os import makedirs
    from os.path import exists, join

    def get(url, cache_dir_path='cache/'):

        if not exists(cache_dir_path):
            makedirs(cache_dir_path)

        cache_path = join(cache_dir_path, md5(url).hexdigest())

        ...
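One way the rest of such a function could look, combining the CACHE and MD5 ideas. This is a guess at the idea, not the original code: `cached_get` and its `fetch` parameter are names made up here so the example runs in Python 3 without a network:

```python
import os
import tempfile
from hashlib import md5


def cached_get(url, fetch, cache_dir_path='cache/'):
    """Return the content of url, calling fetch(url) only on a cache miss."""
    os.makedirs(cache_dir_path, exist_ok=True)

    # The MD5 of the URL gives a safe, fixed-length file name.
    name = md5(url.encode('utf-8')).hexdigest()
    cache_path = os.path.join(cache_dir_path, name)

    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as f:
            return f.read()

    content = fetch(url)
    with open(cache_path, 'wb') as f:
        f.write(content)
    return content


# A fake fetcher that records how often it is called.
calls = []

def fake_fetch(url):
    calls.append(url)
    return b'<html></html>'


cache_dir = tempfile.mkdtemp()
first = cached_get('http://example.com/', fake_fetch, cache_dir)
second = cached_get('http://example.com/', fake_fetch, cache_dir)
print(len(calls))
```

The second call reads the file instead of calling `fake_fetch` again.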

Slide 86

Slide 86 text

DEF FIND_URLS
    def find_urls(content):
        root = etree.HTML(content)
        return [
            a.attrib['href'] for a in root.xpath('//a')
            if 'href' in a.attrib
        ]

Slide 87

Slide 87 text

BFS 1/2
    NEW = 0
    QUEUED = 1
    VISITED = 2

    def search_urls(url):

        url_queue = [url]
        url_state_map = {url: QUEUED}

        while url_queue:

            url = url_queue.pop(0)
            print url

Slide 88

Slide 88 text

BFS 2/2
            # continue the previous page
            try:
                found_urls = find_urls(get(url))
            except Exception, e:
                url_state_map[url] = e
                print 'Exception: %s' % e
            except KeyboardInterrupt, e:
                return url_state_map
            else:
                for found_url in found_urls:
                    if not url_state_map.get(found_url, NEW):
                        url_queue.append(found_url)
                        url_state_map[found_url] = QUEUED
                url_state_map[url] = VISITED

Slide 89

Slide 89 text

DEQUE
    from collections import deque
    ...

    def search_urls(url):
        url_queue = deque([url])
        ...
        while url_queue:

            url = url_queue.popleft()
            print url
            ...

Slide 90

Slide 90 text

YIELD
    ...

    def search_urls(url):
        ...
        while url_queue:

            url = url_queue.pop(0)
            yield url
            ...
            except KeyboardInterrupt, e:
                print url_state_map
                return
            ...
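The BFS, `deque` and `yield` pieces can be tried together on a tiny in-memory "web". A sketch in Python 3: the `links` dict stands in for `find_urls(get(url))` and is an assumption of this example, so no pages are downloaded:

```python
from collections import deque
from itertools import islice

NEW = 0
QUEUED = 1
VISITED = 2


def search_urls(url, links):
    """Yield URLs in breadth-first order; links maps a URL to the URLs it contains."""
    url_queue = deque([url])
    url_state_map = {url: QUEUED}

    while url_queue:
        url = url_queue.popleft()      # O(1), unlike list.pop(0)
        yield url

        for found_url in links.get(url, []):
            if not url_state_map.get(found_url, NEW):
                url_queue.append(found_url)
                url_state_map[found_url] = QUEUED
        url_state_map[url] = VISITED


links = {'a': ['b', 'c'], 'b': ['a', 'd'], 'c': ['d'], 'd': []}

all_urls = list(search_urls('a', links))

# A generator is lazy: asking for two URLs visits only two pages.
first_two = list(islice(search_urls('a', links), 2))

print(all_urls)
print(first_two)
```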

Slide 91

Slide 91 text

YOUR TURN!

Slide 92

Slide 92 text

SQL
How about saving the CSV file into a db?

Slide 93

Slide 93 text

TABLE
    CREATE TABLE schools (
        id TEXT PRIMARY KEY,
        name TEXT,
        county TEXT,
        address TEXT,
        phone TEXT,
        url TEXT,
        type TEXT
    );

    DROP TABLE schools;

Slide 94

Slide 94 text

CRUD
    INSERT INTO schools (id, name) VALUES ('1', 'The First');
    INSERT INTO schools VALUES (...);

    SELECT * FROM schools WHERE id='1';
    SELECT name FROM schools WHERE id='1';

    UPDATE schools SET id='10' WHERE id='1';

    DELETE FROM schools WHERE id='10';

Slide 95

Slide 95 text

COMMON PATTERN
    import sqlite3

    db_path = 'schools.db'
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()

    cur.execute('''CREATE TABLE schools (
        ...
    )''')
    conn.commit()

    cur.close()
    conn.close()

Slide 96

Slide 96 text

ROLLBACK
    ...

    try:
        cur.execute('...')
    except:
        conn.rollback()
        raise
    else:
        conn.commit()

    ...
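To watch a rollback happen, you can use an in-memory SQLite database and force an error with a duplicate primary key. A sketch in Python 3; the two-column table and the `insert_all` helper are simplifications made up for this example:

```python
import sqlite3

conn = sqlite3.connect(':memory:')   # nothing is written to disk
cur = conn.cursor()
cur.execute('CREATE TABLE schools (id TEXT PRIMARY KEY, name TEXT)')
conn.commit()


def insert_all(rows):
    """Insert every row or none of them."""
    try:
        for row in rows:
            cur.execute('INSERT INTO schools VALUES (?, ?)', row)
    except sqlite3.Error:
        conn.rollback()      # undo the rows inserted before the failure
        raise
    else:
        conn.commit()


insert_all([('1', 'The First')])

try:
    # The second row reuses id '1', so the whole batch is rolled back.
    insert_all([('2', 'The Second'), ('1', 'Duplicate id')])
except sqlite3.IntegrityError as e:
    print('Rolled back: {}'.format(e))

count = cur.execute('SELECT COUNT(*) FROM schools').fetchone()[0]
print(count)
```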

Slide 97

Slide 97 text

PARAMETERIZED QUERY
    ...

    rows = ...

    for row in rows:
        cur.execute('INSERT INTO schools VALUES (?, ?, ?, ?, ?, ?, ?)', row)

    conn.commit()

    ...

Slide 98

Slide 98 text

EXECUTEMANY
    ...

    rows = ...

    cur.executemany('INSERT INTO schools VALUES (?, ?, ?, ?, ?, ?, ?)', rows)

    conn.commit()

    ...
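The `?` placeholders let the driver handle quoting, so a value containing a quote character is stored as-is. A runnable sketch in Python 3 with an in-memory database; the shortened two-column table and the sample rows are assumptions for this example:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE schools (id TEXT PRIMARY KEY, name TEXT)')

rows = [
    ('1', 'The First'),
    ('2', "O'Brien Academy"),   # the quote needs no manual escaping
]

# One call inserts every row; each tuple fills the placeholders in order.
cur.executemany('INSERT INTO schools VALUES (?, ?)', rows)
conn.commit()

# Parameters work in SELECT too; note the one-item tuple.
cur.execute('SELECT name FROM schools WHERE id = ?', ('2',))
name = cur.fetchone()[0]
print(name)
```

Building the SQL with string formatting instead would break on the quote and open the door to SQL injection.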

Slide 99

Slide 99 text

FETCH
    ...
    cur.execute('select * from schools')

    print cur.fetchone()

    # or
    print cur.fetchall()

    # or
    for row in cur:
        print row
    ...

Slide 100

Slide 100 text

TEXT FACTORY
    # SQLite only: Lets you pass 8-bit strings as parameters.

    ...

    conn = sqlite3.connect(db_path)
    conn.text_factory = str

    ...

Slide 101

Slide 101 text

ROW FACTORY
    # SQLite only: Lets you convert a tuple into a dict. It is `DictCursor` in some other connectors.

    def dict_factory(cursor, row):
        d = {}
        for idx, col in enumerate(cursor.description):
            d[col[0]] = row[idx]
        return d

    ...
    conn.row_factory = dict_factory
    ...
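Here is the row factory at work on an in-memory database. A sketch in Python 3; the two-column table is a shortened stand-in for the real `schools` table:

```python
import sqlite3


def dict_factory(cursor, row):
    # cursor.description holds one tuple per column; item 0 is the column name.
    return {col[0]: row[idx] for idx, col in enumerate(cursor.description)}


conn = sqlite3.connect(':memory:')
conn.row_factory = dict_factory      # set it before creating the cursor
cur = conn.cursor()

cur.execute('CREATE TABLE schools (id TEXT PRIMARY KEY, name TEXT)')
cur.execute('INSERT INTO schools VALUES (?, ?)', ('1', 'The First'))
conn.commit()

row = cur.execute('SELECT * FROM schools').fetchone()
print(row)
print(row['name'])
```

The standard library also ships `sqlite3.Row`, which you can assign to `row_factory` to get access by column name without writing your own function.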

Slide 107

Slide 107 text

MORE
• Python DB API 2.0
• MySQLdb - MySQL connector for Python
• Psycopg2 - PostgreSQL adapter for Python
• SQLAlchemy - the Python SQL toolkit and ORM
• MoSQL - Build SQL from common Python data structures.

Slide 115

Slide 115 text

THE END
• You learned how to ...
  • make an HTTP request
  • load a CSV file
  • parse an HTML file
  • write a Web crawler
  • use SQL with SQLite
  • and a lot of techniques today. ;)