Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Business Intelligence in Python

Business Intelligence in Python

What is the state of BI and python.

Stefan Urbanek

October 27, 2012
Tweet

More Decks by Stefan Urbanek

Other Decks in Technology

Transcript

  1. Overview ▪ Traditional Data Warehouse ▪ Python and Data ▪

    Is Python Capable? ▪ Conclusion sobota, 27. októbra 12
  2. ▪ Extracting data from the original sources ▪ Quality assuring

    and cleaning data ▪ Conforming the labels and measures in the data to achieve consistency across the original sources ▪ Delivering data in a physical format that can be used by query tools, report writers, and dashboards. Source: Ralph Kimball – The Data Warehouse ETL Toolkit sobota, 27. októbra 12
  3. Source Systems Staging Area Operational Data Store Datamarts structured documents

    databases APIs Temporary Staging Area staging relational dimensional L0 L1 L2 sobota, 27. októbra 12
  4. ▪ provide context for facts ▪ used to filter queries

    or reports ▪ control scope of aggregation of facts Dimension sobota, 27. októbra 12
  5. T1[s] T2[s] T3[s] T4[s] P1 392,55 25,04 367,51 520,27 P2

    920,00 776,71 360,03 989,75 P3 877,14 784,61 502,41 786,96 P4 997,65 527,45 447,87 504,76 P5 752,58 237,17 549,78 287,02 n-dimensional array of numbers Scientific Data sobota, 27. októbra 12
  6. Assumptions ▪ data is mostly numbers ▪ data is neatly

    organized... ▪ … in one multi-dimensional array sobota, 27. októbra 12
  7. multiple representations of same data multiple snapshots of one source

    categories are changing sobota, 27. októbra 12
  8. Data Pipes with SQLAlchemy Data Governance Analysis and Presentation Extraction,

    Transformation, Loading Data Sources Technologies and Utilities sobota, 27. októbra 12
  9. src_engine = create_engine("sqlite:///data.sqlite") src_metadata = MetaData(bind=src_engine) src_table = Table('data', src_metadata,

    autoload=True) target_engine = create_engine("postgres://localhost/sandbox") target_metadata = MetaData(bind=target_engine) target_table = Table('data', target_metadata) sobota, 27. októbra 12
  10. for column in src_table.columns: target_table.append_column(column.copy()) target_table.create() insert = target_table.insert() for

    row in src_table.select().execute(): insert.execute(row) clone schema: copy data: sobota, 27. októbra 12
  11. reader = csv.reader(file_stream) columns = reader.next() for column in columns:

    table.append_column(Column(column, String)) table.create() for row in reader: insert.execute(row) text file (CSV) to table: sobota, 27. októbra 12
  12. Simple T from ETL Data Governance Analysis and Presentation Extraction,

    Transformation, Loading Data Sources Technologies and Utilities sobota, 27. októbra 12
  13. transformation = [ ('fiscal_year', {"w function": int, ". field":"fiscal_year"}), ('region_code',

    {"4 mapping": region_map, ". field":"region"}), ('borrower_country', None), ('project_name', None), ('procurement_type', None), ('major_sector_code', {"4 mapping": sector_code_map, ". field":"major_sector"}), ('major_sector', None), ('supplier', None), ('contract_amount', {"w function": currency_to_number, ". field": 'total_contract_amount'} ] target fields source transformations sobota, 27. októbra 12
  14. OLAP with Cubes Data Governance Analysis and Presentation Extraction, Transformation,

    Loading Data Sources Technologies and Utilities sobota, 27. októbra 12
  15. cubes dimensions measures levels, attributes, hierarchy Model { “name” =

    “My Model” “description” = .... “cubes” = [...] “dimensions” = [...] } sobota, 27. októbra 12
  16. ✂ cut = PointCut(9 “date”, [2010]) o cell = o

    cell.slice(✂ cut) browser.aggregate(o cell, drilldown=[9 “date”]) 2006 2007 2008 2009 2010 Total Jan Feb Mar Apr March April May ... whole cube o cell = Cell(cube) browser.aggregate(o cell) browser.aggregate(o cell, drilldown=[9 “date”]) sobota, 27. októbra 12
  17. ▪ saves maintenance resources ▪ shortens development time ▪ saves

    your from going insane Language just the sobota, 27. októbra 12
  18. Source Systems Staging Area Operational Data Store Datamarts structured documents

    databases APIs Temporary Staging Area staging relational dimensional L0 L1 L2 faster sobota, 27. októbra 12
  19. Data Governance Analysis and Presentation Extraction, Transformation, Loading Data Sources

    Technologies and Utilities faster advanced understandable, maintainable sobota, 27. októbra 12
  20. who is going to fix your COBOL Java tool if

    you have only Python guys around? Future sobota, 27. októbra 12