Slide 1

Slide 1 text

Python for Business Intelligence Štefan Urbánek ■ @Stiivi ■ [email protected] ■ October 2012 sobota, 27. októbra 12

Slide 2

Slide 2 text

python business intelligence ) sobota, 27. októbra 12

Slide 3

Slide 3 text

Q/A and articles with Java solution references (not listed here) Results sobota, 27. októbra 12

Slide 4

Slide 4 text

sobota, 27. októbra 12

Slide 5

Slide 5 text

Why? sobota, 27. októbra 12

Slide 6

Slide 6 text

Overview ■ Traditional Data Warehouse ■ Python and Data ■ Is Python Capable? ■ Conclusion sobota, 27. októbra 12

Slide 7

Slide 7 text

Business Intelligence sobota, 27. októbra 12

Slide 8

Slide 8 text

people technology processes sobota, 27. októbra 12

Slide 9

Slide 9 text

Data Governance Analysis and Presentation Extraction, Transformation, Loading Data Sources Technologies and Utilities sobota, 27. októbra 12

Slide 10

Slide 10 text

Traditional Data Warehouse sobota, 27. októbra 12

Slide 11

Slide 11 text

■ Extracting data from the original sources ■ Quality assuring and cleaning data ■ Conforming the labels and measures in the data to achieve consistency across the original sources ■ Delivering data in a physical format that can be used by query tools, report writers, and dashboards. Source: Ralph Kimball – The Data Warehouse ETL Toolkit sobota, 27. októbra 12

Slide 12

Slide 12 text

Source Systems Staging Area Operational Data Store Datamarts structured documents databases APIs Temporary Staging Area staging relational dimensional L0 L1 L2 sobota, 27. októbra 12

Slide 13

Slide 13 text

real time = daily sobota, 27. októbra 12

Slide 14

Slide 14 text

Multi-dimensional Modeling sobota, 27. októbra 12

Slide 15

Slide 15 text

sobota, 27. októbra 12

Slide 16

Slide 16 text

aggregation browsing slicing and dicing sobota, 27. októbra 12

Slide 17

Slide 17 text

business / analyst’s point of view regardless of physical schema implementation sobota, 27. októbra 12

Slide 18

Slide 18 text

Facts fact most detailed information measurable fact data cell sobota, 27. októbra 12

Slide 19

Slide 19 text

dimensions location type time sobota, 27. októbra 12

Slide 20

Slide 20 text

■ provide context for facts ■ used to filter queries or reports ■ control scope of aggregation of facts Dimension sobota, 27. októbra 12

Slide 21

Slide 21 text

Pentaho sobota, 27. októbra 12

Slide 22

Slide 22 text

Python and Data community perception* *as of Oct 2012 sobota, 27. októbra 12

Slide 23

Slide 23 text

Scientific & Financial sobota, 27. októbra 12

Slide 24

Slide 24 text

Python sobota, 27. októbra 12

Slide 25

Slide 25 text

Data Governance Analysis and Presentation Extraction, Transformation, Loading Data Sources Technologies and Utilities sobota, 27. októbra 12

Slide 26

Slide 26 text

T1[s] T2[s] T3[s] T4[s] P1 392,55 25,04 367,51 520,27 P2 920,00 776,71 360,03 989,75 P3 877,14 784,61 502,41 786,96 P4 997,65 527,45 447,87 504,76 P5 752,58 237,17 549,78 287,02 n-dimensional array of numbers Scientific Data sobota, 27. októbra 12

Slide 27

Slide 27 text

Assumptions ■ data is mostly numbers ■ data is neatly organized... ■ … in one multi-dimensional array sobota, 27. októbra 12

Slide 28

Slide 28 text

Data Governance Analysis and Presentation Extraction, Transformation, Loading Data Sources Technologies and Utilities sobota, 27. októbra 12

Slide 29

Slide 29 text

Business Data sobota, 27. októbra 12

Slide 30

Slide 30 text

multiple representations of same data multiple snapshots of one source categories are changing sobota, 27. októbra 12

Slide 31

Slide 31 text

❄ sobota, 27. októbra 12

Slide 32

Slide 32 text

Is Python Capable? very basic examples sobota, 27. októbra 12

Slide 33

Slide 33 text

Data Pipes with SQLAlchemy Data Governance Analysis and Presentation Extraction, Transformation, Loading Data Sources Technologies and Utilities sobota, 27. októbra 12

Slide 34

Slide 34 text

■ connection: create_engine ■ schema reflection: MetaData, Table ■ expressions: select(), insert() sobota, 27. októbra 12

Slide 35

Slide 35 text

src_engine = create_engine("sqlite:///data.sqlite") src_metadata = MetaData(bind=src_engine) src_table = Table('data', src_metadata, autoload=True) target_engine = create_engine("postgres://localhost/sandbox") target_metadata = MetaData(bind=target_engine) target_table = Table('data', target_metadata) sobota, 27. októbra 12

Slide 36

Slide 36 text

for column in src_table.columns: target_table.append_column(column.copy()) target_table.create() insert = target_table.insert() for row in src_table.select().execute(): insert.execute(row) clone schema: copy data: sobota, 27. októbra 12

Slide 37

Slide 37 text

magic used: metadata reflection sobota, 27. októbra 12

Slide 38

Slide 38 text

reader = csv.reader(file_stream) columns = reader.next() for column in columns: table.append_column(Column(column, String)) table.create() for row in reader: insert.execute(row) text file (CSV) to table: sobota, 27. októbra 12

Slide 39

Slide 39 text

Simple T from ETL Data Governance Analysis and Presentation Extraction, Transformation, Loading Data Sources Technologies and Utilities sobota, 27. októbra 12

Slide 40

Slide 40 text

transformation = [ ('fiscal_year', {"w function": int, ". field":"fiscal_year"}), ('region_code', {"4 mapping": region_map, ". field":"region"}), ('borrower_country', None), ('project_name', None), ('procurement_type', None), ('major_sector_code', {"4 mapping": sector_code_map, ". field":"major_sector"}), ('major_sector', None), ('supplier', None), ('contract_amount', {"w function": currency_to_number, ". field": 'total_contract_amount'} ] target fields source transformations sobota, 27. októbra 12

Slide 41

Slide 41 text

for row in source: result = transform(row, [ transformation) table.insert(result).execute() Transformation sobota, 27. októbra 12

Slide 42

Slide 42 text

OLAP with Cubes Data Governance Analysis and Presentation Extraction, Transformation, Loading Data Sources Technologies and Utilities sobota, 27. októbra 12

Slide 43

Slide 43 text

cubes dimensions measures levels, attributes, hierarchy Model { “name” = “My Model” “description” = .... “cubes” = [...] “dimensions” = [...] } sobota, 27. októbra 12

Slide 44

Slide 44 text

❄ logical physical sobota, 27. októbra 12

Slide 45

Slide 45 text

workspace.browser(cube) load_model("model.json") create_workspace("sql", model, url="sqlite:///data.sqlite") model.cube("sales") Aggregation Browser backend cubes Application ∑ 1 2 3 4 sobota, 27. októbra 12

Slide 46

Slide 46 text

browser.aggregate(o cell, . drilldown=[9 "sector"]) drill-down sobota, 27. októbra 12

Slide 47

Slide 47 text

q row.label k row.key for row in result.table_rows(“sector”): row.record["amount_sum"] sobota, 27. októbra 12

Slide 48

Slide 48 text

✂ cut = PointCut(9 “date”, [2010]) o cell = o cell.slice(✂ cut) browser.aggregate(o cell, drilldown=[9 “date”]) 2006 2007 2008 2009 2010 Total Jan Feb Mar Apr March April May ... whole cube o cell = Cell(cube) browser.aggregate(o cell) browser.aggregate(o cell, drilldown=[9 “date”]) sobota, 27. októbra 12

Slide 49

Slide 49 text

How can Python be Useful sobota, 27. októbra 12

Slide 50

Slide 50 text

■ saves maintenance resources ■ shortens development time ■ saves your from going insane Language just the sobota, 27. októbra 12

Slide 51

Slide 51 text

Source Systems Staging Area Operational Data Store Datamarts structured documents databases APIs Temporary Staging Area staging relational dimensional L0 L1 L2 faster sobota, 27. októbra 12

Slide 52

Slide 52 text

Data Governance Analysis and Presentation Extraction, Transformation, Loading Data Sources Technologies and Utilities faster advanced understandable, maintainable sobota, 27. októbra 12

Slide 53

Slide 53 text

Conclusion sobota, 27. októbra 12

Slide 54

Slide 54 text

people technology processes BI is about… sobota, 27. októbra 12

Slide 55

Slide 55 text

don’t forget metadata sobota, 27. októbra 12

Slide 56

Slide 56 text

who is going to fix your COBOL Java tool if you have only Python guys around? Future sobota, 27. októbra 12

Slide 57

Slide 57 text

is capable, let’s start sobota, 27. októbra 12

Slide 58

Slide 58 text

Thank You Twitter: @Stiivi DataBrewery blog: blog.databrewery.org Github: github.com/Stiivi [t\ sobota, 27. októbra 12