
Big Data Technologies and the Apache Hadoop Infrastructure

Roman Nikitchenko, Research engineer at V.I.tech
Data volumes in modern information solutions are growing rapidly, and this gives rise to new problems as well as new opportunities. Apache Hadoop, de facto the world's first operating system for data processing (a data OS), offers very broad capabilities for both storing and processing large volumes of information. What exactly are these capabilities, and how can they change the face of the Big Data industry in the near future?


May 17, 2014

Transcript

  1. Agenda: Hadoop causes real big data industry changes. What technology is behind this name? Why is Hadoop such a promising solution?
     • BIG DATA APPROACH
     • HADOOP ENVIRONMENT
     • INDUSTRY FACE IS CHANGING
  2. What is BIG DATA? BIG DATA IS EVERYWHERE:
     • Really BIG things: photo banks, video storage, historical measurements.
     • Intensive data transactions and high distribution: stores (offline or online), banks, advertising networks.
     • Realtime data: measurements and monitoring, gaming.
     • Intensive processing: science, modelling.
     • High volumes of small things: social networks, healthcare.
  3. BIG DATA in just 3 words: indeed, any real big data is just about a DIGITAL LIFE FOOTPRINT.
  4. The WORLD is big data itself. Yet to remember: the WORLD ITSELF CAN BE DIGITIZED TOO.
     • Earth weather and environment: realtime, really big data volumes, high potential for processing, lots of things to be analysed, historical data.
     • Space: unlimited potential for analysis; the ocean is a volume yet unknown.
     • The Internet of things is going to be a digital world in itself.
     • ???
  5. So... BIG DATA is not about the data. It is about OUR ABILITY TO HANDLE IT.
  6. BIG DATA storage requirements: SIMPLE BUT RELIABLE.
     • A really big amount of data is to be stored in a reliable manner.
     • Storage is to be simple, recoverable and cheap.
  7. BIG DATA storage requirements: DECENTRALIZED.
     • No single point of failure.
     • Scalable, as close to linear as possible.
     • No manual actions to recover in case of failures.
  8. BIG DATA processing requirements: SIMPLE TO USE.
     • Complexity is to be buried inside.
     • The interface is to be functional and compatible between versions.
  9. BIG DATA processing requirements: TOOLS TO BE CLOSE TO WORK.
     • Process data on the same nodes it is stored on.
     • Distributed storage means distributed processing.
  10. BIG DATA processing requirements: SHARE LOAD.
     • Work is to be balanced.
     • Data placement is to be appropriate for balanced work.
     • The amount of work is to be balanced in accordance with resources.
  11. Solution requirements in general. What do we finally need?
     • CPU+HDD in one place.
     • A cluster of replaceable nodes.
     • Lots of storage space.
     • A way to control resources and balance load.
     • Everything is to be relatively simple and affordable.
     All of this, multiplied to the maximum, equals BIG DATA.
  12. What is HADOOP?
     • Hadoop is an open source framework for big data, covering both distributed storage and processing.
     • Hadoop is reliable and fault tolerant without relying on hardware for these properties.
     • Hadoop has unique horizontal scalability: currently from a single computer up to thousands of cluster nodes.
  13. Facts and trends
     • 2004: inspired by the Google MapReduce idea. Originally named after the creator's son's toy elephant.
     • On June 13, 2012 Facebook announced that their Hadoop cluster held 100 PB of data. On November 8, 2012 they announced the warehouse was growing by roughly half a PB per day.
     • On February 19, 2008, Yahoo! Inc. launched what it claimed was the world's largest Hadoop production application: the Yahoo! Search Webmap, a Hadoop application running on a Linux cluster with more than 10,000 cores.
  14. Hadoop: the classical picture (historical top view)
     • HDFS serves as the file system layer.
     • MapReduce originally served as the distributed processing framework.
     • The native client API is Java, but there are lots of alternatives.
     • This is only the initial architecture; it is now more complex.
  15. HDFS top view
     • The namenode is the 'management' component. It keeps a 'directory' of which file blocks are stored where.
     • The actual work is performed by the data nodes (see the sketch below).
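A minimal client-side sketch of this split, using the standard Hadoop Java FileSystem API. The namenode address and file path here are hypothetical; in practice fs.defaultFS comes from core-site.xml:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical namenode address; normally read from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        // The FileSystem client asks the namenode for metadata only;
        // the file bytes themselves are streamed to/from datanodes.
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/demo/hello.txt");

        try (FSDataOutputStream out = fs.create(path)) {
            out.writeUTF("Hello, HDFS");
        }
        System.out.println("Replication: " + fs.getFileStatus(path).getReplication());
    }
}
```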
  16. HDFS file handling
     • Files are stored in reasonably large blocks. Every block is replicated to several data nodes.
     • Replication is tracked by the namenode. Clients only locate blocks via the namenode; the actual load is taken by the datanodes (see the sketch below).
     • A datanode failure leads to replication recovery. The namenode can be backed by a standby scheme.
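How a client locates blocks through the namenode can be seen with getFileBlockLocations; the path is a hypothetical example:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/demo/hello.txt"));

        // The namenode answers with block -> datanode mappings;
        // the read itself then goes straight to those datanodes.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset " + block.getOffset()
                    + " length " + block.getLength()
                    + " hosts " + String.join(",", block.getHosts()));
        }
    }
}
```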
  17. HDFS properties. HDFS is...
     • Designed for throughput, not for latency.
     • Blocks are expected to be large; there is an issue with lots of small files.
     • A write-once, read-many-times ideology.
     • Append only, no 'edit' ability.
     • Special tools such as Apache HBase are required to implement OLTP.
  18. MapReduce framework model
     • Two-step data processing: transform (map) and then reduce. Really nice for doing things in a distributed manner (see the sketch below).
     • A large class of jobs can be adapted to this model, but not all of them.
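The canonical illustration of this two-step model is word counting; a sketch with the standard Hadoop Java MapReduce API:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map step: transform each input line into (word, 1) pairs.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce step: sum the counts emitted for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```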
  19. MapReduce service: top view
     • One JobTracker, with redundancy possible.
     • Multiple TaskTrackers doing the actual work.
     • The ideology is similar to HDFS handling.
     • HDFS is usually used as the storage in all phases (see the driver sketch below).
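Submitting such a job to the service looks roughly like this; the input and output paths are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCount.TokenMapper.class);
        job.setCombinerClass(WordCount.SumReducer.class);
        job.setReducerClass(WordCount.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Both input and output live on HDFS, in all phases.
        FileInputFormat.addInputPath(job, new Path("/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/demo/output"));

        // Submits the job to the cluster (JobTracker, or the resource
        // manager under YARN) and waits for the workers to finish.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```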
  20. Technology: the Hadoop 2.0 concept
     • A new component (YARN) forms the resource management layer and completes a real distributed data OS.
     • MapReduce is from now on only one among other YARN applications.
  21. YARN: a notable addition. The YARN service:
     • The resource manager dispatches client requests.
     • Node managers manage node resources.
     • Any application is a set of containers, including an application master (see the sketch below).
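A small sketch of talking to the resource manager through the YARN Java client API; it assumes a yarn-site.xml on the classpath pointing at a real cluster:

```java
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnAppsExample {
    public static void main(String[] args) throws Exception {
        // Connects to the resource manager configured in yarn-site.xml.
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());
        yarn.start();

        // The resource manager tracks every application (each a set of
        // containers led by an application master).
        for (ApplicationReport app : yarn.getApplications()) {
            System.out.println(app.getApplicationId() + " " + app.getName()
                    + " " + app.getYarnApplicationState());
        }
        yarn.stop();
    }
}
```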
  22. YARN: a notable addition. Why is YARN SO important?
     • Better resource balance for heterogeneous clusters and multiple applications.
     • Dynamic applications over static services.
     • A much wider application model than plain MapReduce: things like Spark or Tez.
  23. Hadoop: the current picture
     • HDFS2 is now about storage, and YARN is about processing resources.
     • Lots of things can run on top of this data OS, starting with traditional MapReduce. There are now lots of alternatives.
  24. Just several items around the infrastructure:
     • HBase: scalable structured data storage for large tables (see the sketch below).
     • Hive: a data warehouse infrastructure that provides data summarization and ad hoc querying.
     • Mahout: a scalable machine learning and data mining library.
     • Pig: a high-level data-flow language and execution framework for parallel computation.
     • ZooKeeper: a high-performance distributed coordination service.
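For instance, writing and reading one cell with the HBase Java client of that era (0.94/0.96-style API); the table name, row key and column family here are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // Hypothetical pre-created table 'users' with a column family 'd'.
        HTable table = new HTable(conf, "users");
        try {
            Put put = new Put(Bytes.toBytes("user-42"));
            put.add(Bytes.toBytes("d"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            Result result = table.get(new Get(Bytes.toBytes("user-42")));
            byte[] name = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        } finally {
            table.close();
        }
    }
}
```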
  25. The most important concept: the first ever world DATA OS. A 10,000-node computer... Recent technology changes are focused on higher scale: better resource usage and control, lower MTTR, higher security, redundancy, fault tolerance.
  26. New concepts: DATA LAKE. Take as much data about your business processes as you can. The more data you have, the more value you can get from it.
  27. New concepts: ENTERPRISE DATA HUB. Don't ruin your existing data warehouse; just extend it with new, centralized big data storage through a data migration solution.
  28. Trends: big data is going BIGGER.
     • SSDs are going to be widely used as storage, and memory-based replicas are not a miracle anymore.
     • Memory- and SSD-based caching schemes are going to be more and more aggressive, particularly in HDFS and HBase.
     • Clusters grow. Currently some open source features are targeted at clusters of 1K nodes. How about a 300-node staging cluster in companies like eBay?
     • Production clusters go beyond 4,000 nodes (up to 10K), with a node failure nearly every day.
  29. Trends: HARDWARE IS GETTING CHEAPER.
     • A typical node is expected to include at least 64 GB of memory.
     • Storage starts from 4 x 2 TB drives; 8-16 x 4 TB drives are not so rare. This is for a general 'workload' node.
     • 10 or more CPU cores; 2 CPUs is a normal approach.
     • SSDs are starting to be widely used not only for the OS and caching but for the data itself.
     • The main outcome: the per-node cost model is changing.
  30. For whom the bell tolls?
     The old way:
     • Make assumptions about the data you need.
     • Make assumptions about the data model.
     • Make assumptions about the algorithms you need.
     • Get confirmation of your initial guess about the result. Are you surprised?
     The new way:
     • Get as much data as you can.
     • Detect the data model based on a set of algorithms, with an extensive approach.
     • Cluster your data, detect correlations, clean out anomalies... in every way you can afford, on the whole data set.
     • Get grounded results. You can still miss some fundamental aspects, but isn't it much better in any case?
  31. Major Hadoop distributions
     • HortonWorks is 'barely open source': innovative, but 'running too fast'; most of their key technologies are not so mature yet.
     • Cloudera is stable enough but not stale: Hadoop 2.3 with YARN, HBase 0.96.x. Balance.
     • MapR focuses on per-node performance, but they are slightly outdated in terms of functionality, and their distribution costs money. For cases where node performance is a high priority.
     • Intel is a newcomer to this market. Not for the near future.