

Big Data Spain
November 14, 2013

Developing a Hadoop Based ETL Platform by Esteban Chiner and Ignacio Sales at Big Data Spain 2013

GFT has built an ETL accelerator platform on Hadoop for a large international investment bank. This ETL layer consolidates, enriches and validates financial data gathered from the bank's various source systems and makes it available to the bank's accounting layer in a homogeneous format.
8th Nov 2013
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/2013/conference/developing-a-hadoop-based-etl-platform-for-feed-consolidation




Transcript

  1. Developing a Hadoop Based ETL Platform. Esteban Chiner & Ignacio Sales. Big Data Spain, Madrid, 7th / 8th November.
  2. Agenda
     1 GFT at a glance
     2 Problem description
     3 Design principles
     4 Architecture description
     5 Lessons learnt
     6 Conclusions / Q&A
  3. GFT at a glance: "Big enough to deliver, small enough to care". Focus on the financial services industry:
     - Among the top 10 European IT service providers (FinTech 100 ranking 2012) for the financial services sector, with global reach
     - Long-standing partnerships with more than 15 top-tier institutions
     - About 2,000 employees with extensive industry knowledge
     State-of-the-art consulting, managed services and solutions:
     - Reliable technology services and solutions based on all technologies
     - Delivery teams from onsite and near-shore/offshore projects: excellent quality for the best price
     Commitment to delivery | Passion for technology:
     - We know how to manage risk
     - Our experience enables us to deal with complexity
     - We do co-innovation with our clients
     - We ensure maximum transparency
  4. Problem description. Introduction: an investment bank from 10,000 feet.
     [Diagram: front office, middle office and back office, spanning trade capture, trade validation, trade enrichment, trade settlement, calculation engines, accounting, reconciliations, risk management and reporting]
  5. Problem description. Current state: after a few years, this is how the enterprise architecture looks.
     [Diagram: five sources feed four point-to-point ETLs into four targets, each over its own protocol (FIX / MQ, CSV / SFTP, XML1 / JMS, XML2 / WS, JDBC) and in its own format (Format 1-4)]
  6. Problem description. Future state: a feed consolidation layer. An ESB-pattern implementation for batch and real-time feeds, with unlimited horizontal scalability.
     [Diagram: the same five sources feed a single Feed Consolidation Layer, which feeds the four targets]
  7. Why Hadoop?
     - Horizontal scalability to cope with current and future volumes (non-linear growth)
     - Manage multiple structures of data
     - Develop ETL without being constrained by a rigid relational data model
     - Store all incoming / intermediate / outgoing data: build a data hub for future analytics
  8. Design principles
     - New-feed time-to-market should be reduced to the minimum
     - New mappings and transformations should be easy to develop
     - Horizontal scalability to cope with current and future volumes
     - Avoid vendor lock-in
     - Based on modular components: plug & play
     - Support multiple input formats & delivery mechanisms (batch / real time)
  9. Architecture description. Key design decisions:
     - Use Hadoop in order to provide scalability
     - XML data format
     - XSLT for data transformations (see the sketch after this list)
     - External metadata storage: Oracle
     - Mappings done with an external tool which supports XSLT: Altova MapForce
     - Orchestration divided in two: internal, using Oozie; external, using Tibco BusinessWorks
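     The XSLT decision means each mapping is an ordinary stylesheet that standard Java tooling can run. Below is a minimal sketch, using the JDK's built-in javax.xml.transform API, of how such a mapping might be applied; the file names are hypothetical, and the deck does not show how the platform actually invoked its stylesheets.

        import java.io.File;
        import javax.xml.transform.Transformer;
        import javax.xml.transform.TransformerFactory;
        import javax.xml.transform.stream.StreamResult;
        import javax.xml.transform.stream.StreamSource;

        public class XsltMappingSketch {
            public static void main(String[] args) throws Exception {
                // Compile the mapping stylesheet (e.g. one exported from Altova MapForce).
                Transformer transformer = TransformerFactory.newInstance()
                        .newTransformer(new StreamSource(new File("trade-mapping.xslt")));
                // Apply it to an incoming record and write the canonical XML output.
                transformer.transform(
                        new StreamSource(new File("source-trade.xml")),
                        new StreamResult(new File("canonical-trade.xml")));
            }
        }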
  10. Architecture description. Approach and implementation.
      [Diagram: the ETL layer is a pipeline of pluggable modules (Module 1: filter, Module 2: enrich, Module 3: transform, ... Module n) built on HDFS and MapReduce with Java and XSLT, plus a reference data hub. Data ingestion uses Flume and Sqoop; data delivery uses Sqoop; metadata (setup, process, log) lives in Oracle; orchestration is Oozie internally and Tibco BW externally; monitoring is a GWT application.] A sketch of one such module follows.
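      To make the modular pipeline concrete, here is a minimal sketch of what a filter module might look like as a map-only MapReduce job. The class name, record layout and status check are assumptions for illustration; in the platform described above, the actual filtering rules would come from the external metadata rather than hard-coded strings.

        import java.io.IOException;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.NullWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;

        // Hypothetical "Module 1 (Filter)": pass through only XML records
        // that carry a VALID status; everything else is dropped here.
        public class FilterModuleMapper
                extends Mapper<LongWritable, Text, NullWritable, Text> {

            @Override
            protected void map(LongWritable key, Text record, Context context)
                    throws IOException, InterruptedException {
                if (record.toString().contains("<status>VALID</status>")) {
                    context.write(NullWritable.get(), record);
                }
            }
        }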
  11. Lessons learnt: Big Data in finance.
      - Every record counts
      - No unstructured data: rather, very many diverse structures
      - Provide the right tools for each function: development, testing, production support
      - Learn to move at open-source speed
  12. Lessons learnt: setting up a team.
      - How to ramp up development team skills
      - Good Java programmers make good MapReduce programmers
      - Concepts and the API: easy. MapReduce design: not so easy
      - Working within a framework makes life easier
      - Focus on the "right" tools
  13. Lessons learnt: technical.
      - Handle error records as part of the main workflow: leverage Multiple Outputs capabilities (see the sketch after this list)
      - An MRUnit wrapper to support unit testing of jobs with Multiple Outputs
      - Leverage Hive for testing and production support teams
      - Encrypt and compress data for security and to optimize resource usage
      - Multi-tenancy is a challenge
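      Hadoop's MultipleOutputs class (org.apache.hadoop.mapreduce.lib.output.MultipleOutputs) is the standard mechanism for routing bad records to a side output instead of failing the job. The sketch below shows one plausible shape for this; the validation check and the "errors" output name are assumptions, and the driver must also register the named output via MultipleOutputs.addNamedOutput(job, "errors", TextOutputFormat.class, NullWritable.class, Text.class).

        import java.io.IOException;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.NullWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

        // Hypothetical validating mapper: good records flow to the main
        // output, bad records are diverted to a named "errors" output so
        // they stay part of the workflow rather than killing the job.
        public class ValidatingMapper
                extends Mapper<LongWritable, Text, NullWritable, Text> {

            private MultipleOutputs<NullWritable, Text> mos;

            @Override
            protected void setup(Context context) {
                mos = new MultipleOutputs<>(context);
            }

            @Override
            protected void map(LongWritable key, Text record, Context context)
                    throws IOException, InterruptedException {
                if (record.toString().startsWith("<trade>")) {       // placeholder check
                    context.write(NullWritable.get(), record);       // main workflow
                } else {
                    mos.write("errors", NullWritable.get(), record); // error feed
                }
            }

            @Override
            protected void cleanup(Context context)
                    throws IOException, InterruptedException {
                mos.close(); // flush the side outputs
            }
        }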
  14. Conclusions
      - We have seen how a well-known enterprise integration challenge was solved with Big Data technologies
      - We have examined the particular problems posed by integrating Hadoop into a large financial services organization
      - We have validated that Big Data technology is ready for use in this environment, and that its use is justified
      - This is only the beginning…
  15. © Copyright GFT Technologies AG, 2013. GFT IT Consulting. Esteban Chiner Sanz, Senior Architect. Ignacio Sales Saborit, Senior Architect. Web: www.gft.com