
RUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey Kharlamov

Big Data Spain

December 15, 2016

Transcript

  1. 1

  2. 2 RUNNING A PETABYTE SCALE DATA SYSTEM: Good, Bad, and Ugly Decisions
     Alexey Kharlamov, Nov 14th, 2016
  3. 3 AGENDA
     1 INTRODUCTION • Who? • What? • Why?
     2 MULTITENANCY • Problem statement • Resource management • Workload isolation
     3 CONTINUOUS INTEGRATION • What is different? • Caveats of the conventional approach • Big Data release pipeline
  4. 4 BIG DATA AND DATA SCIENCE PRACTICE
     SERVICES • Data Strategy • Big Data Architecture • Data Science • Big Data DevOps and Support • Solutions and Accelerators
     15+ World-Class Data Architects • 200+ Big Data Engineers & Hadoop DevOps • 10% Hadoop Certified Engineers • 20+ Data Scientists
  5. 5 BIO Alexey Kharlamov is a Solution Architect at EPAM Systems Ltd,
     where he leads the EMEA Big Data Competency Center. He has over 20 years of software engineering experience and has built multiple systems in the area of low-latency and distributed data processing in the financial, e-retail, and advertising industries. During his career, Alexey has designed systems processing millions of messages per second and managing petabytes of stored data. He uses RDBMSs, NoSQL, data grids, and the Big Data toolchain in his daily work to help companies on their Big Data journey.
  6. 7 BIG DATA SYSTEM
     • Data – Machine-generated data from social networks, games, sensors, and ad networks – Large volumes – Allows building fine-grained models of reality
     • Traits – ~1000 USD/TB – Hundreds of servers, thousands of rotational drives (failure is a reality) – High-performance server-to-server network – It takes days to copy the data off a single server
  7. 9 TRADITIONAL (CLASSICAL WEB) APPROACH
     • Multiple environments for different purposes – Local (1 laptop) / Continuous Integration (1 VM) – Quality Assurance (2 hosts) – Production (100+ hosts)
     • The environments are kept in sync – Configuration – Databases
     • Code and test datasets are deployed to the environments to test different aspects of a system
  8. 10 ENVIRONMENT SYNCHRONIZATION OUTCOME: TOTALLY DIFFERENT
     • CI, QA, and PROD are constantly different
     • A test failure on CI or QA does not mean the job will fail in PROD, and vice versa
     • People stop relying on the additional environments to test their jobs
     • The most frequent bugs – Unexpected field value / rubbish – Input data change – Resource issues due to data skew or growth
     • Environments have different hardware – Number of nodes – Generations of servers
     • Hard to synchronize configuration – Reprovisioning takes hours – Engineers tend to forget to copy configuration parameters
     • Hard to synchronize data – Different amounts of disk space and CPU – Copying takes hours
  9. 11 PREVAILING ISSUE TYPES (a defensive-parsing sketch follows below)
     • Unexpected field value / rubbish – Test data do not cover all possible values – Sampled data may miss exactly this error – Need to test on production data
     • Incompatible change in data format – Frequently brought in by third parties and unexpected – Falls through ETL layers – Need to test on production data
     • Resource issues due to data skew or growth – Cause job termination or cluster failure – Must be tested on exactly the same hardware configuration – Need to test on production data
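Rubbish values are easiest to stop at the point of parsing. Below is a minimal, illustrative sketch of the kind of defensive field parsing these bullets argue for; the `FieldGuard` class, the `parsePrice` field, and the bad-record counter are hypothetical names, not code from the talk.

```java
import java.util.Optional;

// Hypothetical guard for one numeric field: never trust upstream data.
public final class FieldGuard {
    private long badRecords = 0;  // counter to surface in monitoring (hypothetical)

    // Parse a field, routing rubbish to a counter instead of failing the job.
    public Optional<Long> parsePrice(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            badRecords++;                  // missing value
            return Optional.empty();
        }
        try {
            long value = Long.parseLong(raw.trim());
            if (value < 0) {               // out of expected range
                badRecords++;
                return Optional.empty();
            }
            return Optional.of(value);
        } catch (NumberFormatException e) {
            badRecords++;                  // unexpected format, e.g. a third-party change
            return Optional.empty();
        }
    }

    public long badRecordCount() { return badRecords; }
}
```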
  10. 13 QA: SINGLE CLUSTER FOR EVERYTHING (a queue-routing sketch follows below)
     • Logical partitions for DEV, QA, and PROD on the cluster – Full processing capacity available – Always up-to-date data and configuration – No environment synchronization needed
     • The cluster becomes multitenant – Partitions must be isolated! – Code must be portable!
     • Developers need more – Faster turnaround times – Easy interactive debugging and cross-process traceability
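One concrete way to express such logical partitions is to give each environment its own YARN queue and route every job accordingly. A minimal sketch, assuming queues named `root.dev`, `root.qa`, and `root.prod` have already been defined in the scheduler configuration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class QueueRouting {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Route this run into the DEV partition of the shared cluster.
        // "root.dev" is an assumed queue name from the scheduler config.
        conf.set("mapreduce.job.queuename", "root.dev");
        Job job = Job.getInstance(conf, "example-job");
        // ... configure mapper, reducer, and input/output paths as usual,
        //     then submit with job.waitForCompletion(true) ...
    }
}
```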
  11. 14 QA: HADOOP MINICLUSTER (a start-up sketch follows below)
     • Full clone of a Hadoop cluster in a single JVM – Job Driver – NameNode – DataNode – Hive – HBase
     • Step into... Hadoop and debug – MapReduce jobs – User Defined Functions – Coprocessors – Queries
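A minimal sketch of bringing up an in-JVM HDFS with `MiniDFSCluster`, assuming the `hadoop-minicluster` test artifact is on the classpath; analogous mini-clusters exist for YARN and HBase, which are not shown here.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.MiniDFSCluster;

public class MiniClusterSmokeTest {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Spin up a NameNode plus one DataNode inside this JVM.
        MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf)
                .numDataNodes(1)
                .build();
        try {
            FileSystem fs = cluster.getFileSystem();
            Path p = new Path("/tmp/smoke.txt");
            fs.create(p).close();   // exercises the real HDFS write path
            System.out.println("exists: " + fs.exists(p));
            // Any breakpoint inside Hadoop is now reachable from the IDE debugger.
        } finally {
            cluster.shutdown();
        }
    }
}
```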
  12. 15 QA: CONTINUOUS QUALITY MONITORING (an invariant-check sketch follows below)
     • Assertion of invariants per data chunk or time period – Number of records – Field data profile – Conversion failures – Missing dictionary/dimension data – Field value ranges
     • Alerting on assertion failure – Too many errors! – Number of records differs!
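A minimal sketch of one such per-chunk invariant check; the record-count bounds, the error-rate threshold, and the `alert` hook are hypothetical placeholders, since the talk does not specify a concrete monitoring stack.

```java
public class ChunkInvariants {
    // Assumed acceptable bounds for one chunk; tune per dataset.
    private static final long MIN_RECORDS = 900_000;
    private static final long MAX_RECORDS = 1_100_000;
    private static final double MAX_ERROR_RATE = 0.001;

    public static void check(long recordCount, long conversionFailures) {
        if (recordCount < MIN_RECORDS || recordCount > MAX_RECORDS) {
            alert("Number of records differs: " + recordCount);
        }
        double errorRate = recordCount == 0
                ? 1.0
                : (double) conversionFailures / recordCount;
        if (errorRate > MAX_ERROR_RATE) {
            alert("Too many errors: rate=" + errorRate);
        }
    }

    // Hypothetical alert hook; in practice this would page or post to monitoring.
    private static void alert(String message) {
        System.err.println("ALERT: " + message);
    }
}
```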
  13. 17 APARTMENT RENTAL
     TENANT • Uses the unit allocated to them, but would always like more • Wants independence from others • Does not want to be bothered by others, but may throw a party from time to time
     LANDLORD • Provides a unit fulfilling the tenant's needs • Fixes broken facilities • Ensures tenants follow the rules • Evicts misbehaving tenants
  14. 18 APPLICATION DOMAIN (a quota sketch follows below)
     • A logical partition of platform resources independently executing a cluster application – Data processing scripts and drivers – Cluster services (workflow managers, query engines) – Bespoke services (REST, Web UI, etc.)
     • Resource management – A YARN resource pool defines the share of resources available to the application – HDFS quotas control data volume
     • Isolation – Linux cgroups enforce CPU/RAM limits – Filesystem ACLs restrict access – Own service instance per domain (Hive, scheduler, etc.) – YARN can preempt tasks running for too long – Watchdog processes terminate runaway jobs
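A minimal sketch of setting the HDFS quotas mentioned above through the `HdfsAdmin` client API; the NameNode URI, the `/domains/analytics` path, and the quota values are hypothetical.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.client.HdfsAdmin;

public class DomainQuota {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        HdfsAdmin admin = new HdfsAdmin(new URI("hdfs://namenode:8020"), conf);

        Path domainRoot = new Path("/domains/analytics");  // hypothetical domain root
        // Namespace quota: maximum number of files and directories under the root.
        admin.setQuota(domainRoot, 1_000_000L);
        // Space quota in bytes; HDFS counts raw bytes across all replicas.
        admin.setSpaceQuota(domainRoot, 50L * 1024 * 1024 * 1024 * 1024);  // 50 TB
    }
}
```

The same limits can also be applied operationally with `hdfs dfsadmin -setQuota` and `hdfs dfsadmin -setSpaceQuota`.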
  15. 19 ELASTIC COMPUTING CAPACITY (Mesosphere)
     • Researchers and developers frequently need a playground
     • Application domains need to dynamically allocate resources – Metal as a Service – Virtualization – Containerization
     • Containers are perfect for portable code bundling – Statelessness encourages externalization of configuration – All dependencies included – Explicit amount of resources allocated – Easy migration between hosts
  16. 20 TAKEAWAYS
     1 USE UNIFIED PLATFORM FOR ALL ACTIVITIES
     2 CREATE ISOLATED DOMAINS FOR TENANTS AND WORKLOADS
     3 AUGMENT HADOOP WITH FLUID COMPUTATIONAL CAPACITY