Realtime Reporting Platform at Concur

Santosh Sahoo

July 10, 2015

Transcript

  1. About us
    Concur (now part of SAP) provides travel and expense management services to businesses. The Data Insights team is building solutions that give customers access to data, visualization, and reporting.
  2. Numbers
    7K OLTP database sources
    14K OLAP reporting databases
    28K ETL jobs
    300M rows (compacted), 2B row changes
    Only ~20 failures a night
  3. Batch ETL challenges
    Scheduled (high latency)
    Processing time
    Hard to scale
    Not fault tolerant
    Monolithic
    High maintenance
  4. Moving forward
    Scheduled (high latency) -> Streaming, real time
    Hard to scale -> Scalable
    Monolithic -> Modular
    Not fault tolerant -> Fault tolerant
    ACID consistent, normalized -> Eventual consistency
    High maintenance (single tenant) -> Reduced maintenance overhead (multi tenant)
  5. Streaming Data Pipeline
    Stages: Source, Flow Manager, Streaming Processor, Storage, Reporting
    Source: Applications, Mobile Devices, Sensors, IoT (Internet of Things), Database, Log scraping, Alert
    Flow Manager: Message Queues, Kafka, Flume, Azure Event Hub, AWS Kinesis, HDFS
    Streaming Processor: Storm, Spark Streaming, Azure Stream Analytics, Samza, Flink
    Storage: RDBMS, NoSQL, HDFS, Redshift
    Reporting: Custom App, D3, Tableau, Cognos, Excel
  6. Spark Streaming
    What? A data processing framework for building scalable, fault-tolerant streaming applications.
    Why? It lets you reuse the same code for batch processing, join streams against historical data, or run ad-hoc queries on stream state.
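
The "reuse the same code" point on slide 6 can be illustrated without Spark at all. The following is a minimal pure-Python sketch, not the Spark API: the function names (`count_by_type`, `run_batch`, `run_streaming`) and the sample event types are hypothetical. It shows one piece of logic applied both to a complete batch and to a sequence of micro-batches with running state.

```python
from collections import Counter
from typing import Iterable, Iterator, List


def count_by_type(events: Iterable[str]) -> Counter:
    """Business logic written once: count events per type."""
    return Counter(events)


def run_batch(events: List[str]) -> Counter:
    """Batch mode: apply the logic to the whole data set at once."""
    return count_by_type(events)


def run_streaming(micro_batches: Iterable[List[str]]) -> Iterator[Counter]:
    """Streaming mode: apply the same logic to each micro-batch and
    merge the result into running state, yielding a snapshot per batch."""
    state: Counter = Counter()
    for batch in micro_batches:
        state += count_by_type(batch)
        yield Counter(state)  # copy, so callers cannot mutate the state


if __name__ == "__main__":
    events = ["expense", "travel", "expense", "invoice"]
    print(run_batch(events))
    for snapshot in run_streaming([events[:2], events[2:]]):
        print(snapshot)
```

Once every micro-batch has been processed, the last streaming snapshot equals the batch result, which is the property that makes sharing the logic between the two modes safe.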
  7. How it worked
    Diagram labels: Producer, random(), Kafka, Topic (x2), Spark, HDFS, Worker, Redis, Node.js, HTML5, D3.js, SSE, Parquet
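
Slide 7 lists SSE (server-sent events) among the components that reach the HTML5/D3.js front end. The slides do not show how messages were framed, so the sketch below only illustrates the standard SSE wire format; the helper name `format_sse`, the event name `metric`, and the payload shape are assumptions. The deck names Node.js for the server side, and Python is used here purely for illustration.

```python
import json


def format_sse(data: dict, event: str = "message") -> str:
    """Serialize one update as a server-sent event frame.

    A frame is a set of "field: value" lines terminated by a blank line.
    Compact JSON keeps the payload on a single "data:" line.
    """
    payload = json.dumps(data, separators=(",", ":"))
    return f"event: {event}\ndata: {payload}\n\n"


if __name__ == "__main__":
    # Hypothetical payload: one random value, as a producer might emit.
    print(format_sse({"value": 0.5}, event="metric"), end="")
```

A browser client would subscribe with `EventSource` and receive each frame as an event whose `data` field holds the JSON text.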
  8. Kafka - Flow Management
    No-nonsense logging
    100K/s throughput vs. 20K/s for RabbitMQ
    Log compaction
    Durable persistence
    Partition tolerance
    Replication
    Best-in-class integration with Spark
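
Log compaction, listed on slide 8, keeps the most recent record for each key and eventually drops keys whose latest record is a tombstone (a null value). The following is a simplified in-memory sketch of that idea, not Kafka's implementation; the function name `compact` and the sample keys and values are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Record = Tuple[str, Optional[str]]  # (key, value); value None = tombstone


def compact(log: List[Record]) -> List[Record]:
    """Keep only the latest record per key, preserving log order.

    Keys whose latest record is a tombstone are removed entirely.
    """
    latest: Dict[str, Tuple[int, Optional[str]]] = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)  # later offsets overwrite earlier ones

    survivors = sorted(
        (offset, key, value)
        for key, (offset, value) in latest.items()
        if value is not None
    )
    return [(key, value) for _, key, value in survivors]


if __name__ == "__main__":
    log = [
        ("r1", "draft"),
        ("r2", "draft"),
        ("r1", "submitted"),
        ("r2", None),  # tombstone: r2 was deleted
        ("r3", "draft"),
    ]
    print(compact(log))
```

Five changes collapse to two rows here, which is why a compacted topic can stay much smaller than the full change history while still holding the current state of every key.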
  9. Spark Streaming Architecture
    Diagram labels: Master, Driver, Worker (x3), Executor (x3), Receiver, Source, D1, D2, D3, D4, WAL, Replication, Data Store, Task
    DStream: Discretized Stream of RDDs
    RDD: Resilient Distributed Dataset
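
The "Discretized Stream" definition on slide 9 means a continuous stream is cut into small batches, one per batch interval, and each batch is processed as an RDD. The sketch below shows only the discretization step in plain Python, with lists standing in for RDDs; the function name `discretize`, the arrival times, and the one-second interval are assumptions, and the labels D1 to D4 are borrowed from the slide.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple


def discretize(events: List[Tuple[float, Any]], batch_interval: float) -> List[List[Any]]:
    """Group (arrival_time, payload) pairs into consecutive micro-batches.

    Each batch covers one interval of length batch_interval. Intervals
    with no data still produce an empty batch, so the output has no gaps.
    """
    if not events:
        return []

    buckets: Dict[int, List[Any]] = defaultdict(list)
    for arrival, payload in events:
        buckets[int(arrival // batch_interval)].append(payload)

    first, last = min(buckets), max(buckets)
    return [buckets.get(i, []) for i in range(first, last + 1)]


if __name__ == "__main__":
    stream = [(0.2, "D1"), (0.9, "D2"), (1.1, "D3"), (3.5, "D4")]
    for i, batch in enumerate(discretize(stream, batch_interval=1.0)):
        print(i, batch)
```

With a one-second interval, D1 and D2 share the first batch, D3 gets its own, the third interval is empty, and D4 lands in the fourth.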
  10. Next Gen Architecture
    Diagram labels: OLTP, Reporting, Cognos, Tableau, ?, Stream Processor, Spark, HDFS, Import, FTP, HTTP, SMTP, P Protobuf, JSON, Broker, Kafka, Hive/Spark SQL, OLAP, Load balance, Failover, HANA, HANA, OLAP, Replication, Service bus, Normalization, Extract, Compensate, Data {Quality, Correction, Analytics}, Migrate method, API/SQL, Expense, Travel, TTX, API, Reporting, C Tachyon
  11. QnA