
Big Data Technologies and the Apache Hadoop Infrastructure

Roman Nikitchenko, Research engineer at V.I.tech
Data volumes in modern information solutions are growing rapidly, and this gives rise to new problems as well as new opportunities. Apache Hadoop, de facto the world's first operating system for data processing (a data OS), offers very broad capabilities for both storing and processing large volumes of information. What exactly are these capabilities, and how can they change the face of the Big Data industry in the near future?


May 17, 2014

Transcript

  1. Agenda: Hadoop causes real big data industry changes. What technology is behind this name? Why is Hadoop such a promising solution?
     • BIG DATA APPROACH
     • HADOOP ENVIRONMENT
     • INDUSTRY FACE IS CHANGING
  2. What is BIG DATA? BIG DATA IS EVERYWHERE:
     • Really BIG things: photo banks, video storage, historical measurements.
     • Intensive data transactions and high distribution: stores (offline or online), banks, advertising networks.
     • Realtime data: measurements and monitoring, gaming.
     • Intensive processing: science, modelling.
     • High volumes of small things: social networks, healthcare.
  3. BIG DATA in just 3 words: indeed, any real big data is just about a DIGITAL LIFE FOOTPRINT.
  4. The WORLD is big data itself. Yet to remember: the WORLD ITSELF CAN BE DIGITIZED TOO.
     • Earth weather and environment: realtime, really big data volumes, high potential for processing, lots of things to be analysed, historical data.
     • Space: unlimited potential for analysis; the ocean is a volume yet unknown.
     • The Internet of things is going to be a digital world in itself.
     • ???
  5. So... BIG DATA is not about the data. It is about OUR ABILITY TO HANDLE IT.
  6. BIG DATA storage requirements: SIMPLE BUT RELIABLE.
     • A really big amount of data is to be stored in a reliable manner.
     • Storage is to be simple, recoverable and cheap.
  7. BIG DATA storage requirements: DECENTRALIZED.
     • No single point of failure.
     • Scalable, as close to linear as possible.
     • No manual actions to recover in case of failures.
  8. BIG DATA processing requirements: SIMPLE TO USE.
     • Complexity is to be buried inside.
     • The interface is to be functional and compatible between versions.
  9. BIG DATA processing requirements: TOOLS TO BE CLOSE TO WORK.
     • Process data on the same nodes it is stored on.
     • Distributed storage means distributed processing.
  10. BIG DATA processing requirements: SHARE LOAD.
     • Work is to be balanced.
     • Data placement is to be appropriate for balanced work.
     • The amount of work is to be balanced in accordance with resources.
  11. Solution requirements in general. What do we finally need?
     • CPU+HDD in one place.
     • A cluster of replaceable nodes.
     • Lots of storage space.
     • A way to control resources and balance load.
     • Everything is to be relatively simple and affordable.
     All of this, multiplied to the maximum, equals BIG DATA.
  12. What is HADOOP?
     • Hadoop is an open source framework for big data, covering both distributed storage and processing.
     • Hadoop is reliable and fault tolerant without relying on hardware for these properties.
     • Hadoop has unique horizontal scalability: currently from a single computer up to thousands of cluster nodes.
  13. Facts and trends
     • 2004: inspired by the Google MapReduce idea. Originally named after the creator's son's toy elephant.
     • On June 13, 2012 Facebook announced that their Hadoop cluster held 100 PB of data. On November 8, 2012 they announced the warehouse was growing by roughly half a PB per day.
     • On February 19, 2008, Yahoo! Inc. launched what it claimed was the world's largest Hadoop production application: the Yahoo! Search Webmap, a Hadoop application running on a Linux cluster with more than 10,000 cores.
  14. Hadoop: the classical picture (historical top view)
     • HDFS serves as the file system layer.
     • MapReduce originally served as the distributed processing framework.
     • The native client API is Java, but there are lots of alternatives.
     • This is only the initial architecture; it is now more complex.
  15. HDFS top view
     • The namenode is the 'management' component. It keeps a 'directory' of which file blocks are stored where.
     • The actual work is performed by the data nodes (see the sketch below).
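A minimal client-side sketch of this split, using the standard Hadoop Java FileSystem API. The namenode address and file path here are hypothetical; in practice fs.defaultFS comes from core-site.xml:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical namenode address; normally read from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        // The FileSystem client asks the namenode for metadata only;
        // the file bytes themselves are streamed to/from datanodes.
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/demo/hello.txt");

        try (FSDataOutputStream out = fs.create(path)) {
            out.writeUTF("Hello, HDFS");
        }
        System.out.println("Replication: " + fs.getFileStatus(path).getReplication());
    }
}
```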
  16. HDFS file handling
     • Files are stored in reasonably large blocks. Every block is replicated to several data nodes.
     • Replication is tracked by the namenode. Clients only locate blocks via the namenode; the actual load is taken by the datanodes (see the sketch below).
     • A datanode failure leads to replication recovery. The namenode can be backed by a standby scheme.
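How a client locates blocks through the namenode can be seen with getFileBlockLocations; the path is a hypothetical example:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/demo/hello.txt"));

        // The namenode answers with block -> datanode mappings;
        // the read itself then goes straight to those datanodes.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset " + block.getOffset()
                    + " length " + block.getLength()
                    + " hosts " + String.join(",", block.getHosts()));
        }
    }
}
```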
  17. HDFS properties. HDFS is...
     • Designed for throughput, not for latency.
     • Blocks are expected to be large; there is an issue with lots of small files.
     • A write-once, read-many-times ideology.
     • Append only, no 'edit' ability.
     • Special tools such as Apache HBase are required to implement OLTP.
  18. MapReduce framework model
     • Two-step data processing: transform (map) and then reduce. Really nice for doing things in a distributed manner (see the sketch below).
     • A large class of jobs can be adapted to this model, but not all of them.
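The canonical illustration of this two-step model is word counting; a sketch with the standard Hadoop Java MapReduce API:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map step: transform each input line into (word, 1) pairs.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce step: sum the counts emitted for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```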
  19. MapReduce service: top view
     • One JobTracker, with redundancy possible.
     • Multiple TaskTrackers doing the actual work.
     • The ideology is similar to HDFS handling.
     • HDFS is usually used as the storage in all phases (see the driver sketch below).
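Submitting such a job to the service looks roughly like this; the input and output paths are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCount.TokenMapper.class);
        job.setCombinerClass(WordCount.SumReducer.class);
        job.setReducerClass(WordCount.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Both input and output live on HDFS, in all phases.
        FileInputFormat.addInputPath(job, new Path("/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/demo/output"));

        // Submits the job to the cluster (JobTracker, or the resource
        // manager under YARN) and waits for the workers to finish.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```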
  20. Technology: the Hadoop 2.0 concept
     • A new component (YARN) forms the resource management layer and completes a real distributed data OS.
     • MapReduce is from now on only one among other YARN applications.
  21. YARN: a notable addition. The YARN service:
     • The resource manager dispatches client requests.
     • Node managers manage node resources.
     • Any application is a set of containers, including an application master (see the sketch below).
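A small sketch of talking to the resource manager through the YARN Java client API; it assumes a yarn-site.xml on the classpath pointing at a real cluster:

```java
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnAppsExample {
    public static void main(String[] args) throws Exception {
        // Connects to the resource manager configured in yarn-site.xml.
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());
        yarn.start();

        // The resource manager tracks every application (each a set of
        // containers led by an application master).
        for (ApplicationReport app : yarn.getApplications()) {
            System.out.println(app.getApplicationId() + " " + app.getName()
                    + " " + app.getYarnApplicationState());
        }
        yarn.stop();
    }
}
```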
  22. YARN: a notable addition. Why is YARN SO important?
     • Better resource balance for heterogeneous clusters and multiple applications.
     • Dynamic applications over static services.
     • A much wider application model than plain MapReduce: things like Spark or Tez.
  23. Hadoop: the current picture
     • HDFS2 is now about storage, and YARN is about processing resources.
     • Lots of things can run on top of this data OS, starting with traditional MapReduce. There are now lots of alternatives.
  24. Just several items around the infrastructure:
     • HBase: scalable structured data storage for large tables (see the sketch below).
     • Hive: a data warehouse infrastructure that provides data summarization and ad hoc querying.
     • Mahout: a scalable machine learning and data mining library.
     • Pig: a high-level data-flow language and execution framework for parallel computation.
     • ZooKeeper: a high-performance distributed coordination service.
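For instance, writing and reading one cell with the HBase Java client of that era (0.94/0.96-style API); the table name, row key and column family here are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // Hypothetical pre-created table 'users' with a column family 'd'.
        HTable table = new HTable(conf, "users");
        try {
            Put put = new Put(Bytes.toBytes("user-42"));
            put.add(Bytes.toBytes("d"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            Result result = table.get(new Get(Bytes.toBytes("user-42")));
            byte[] name = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        } finally {
            table.close();
        }
    }
}
```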
  25. The most important concept: the first ever world DATA OS. A 10,000-node computer... Recent technology changes are focused on higher scale: better resource usage and control, lower MTTR, higher security, redundancy, fault tolerance.
  26. New concepts: DATA LAKE. Take as much data about your business processes as you can. The more data you have, the more value you can get from it.
  27. New concepts: ENTERPRISE DATA HUB. Don't ruin your existing data warehouse; just extend it with new, centralized big data storage through a data migration solution.
  28. Trends: big data is going BIGGER.
     • SSDs are going to be widely used as storage, and memory-based replicas are not a miracle anymore.
     • Memory- and SSD-based caching schemes are going to be more and more aggressive, particularly in HDFS and HBase.
     • Clusters grow. Currently some open source features are targeted at clusters of 1K nodes. How about a 300-node staging cluster in companies like eBay?
     • Production clusters go beyond 4,000 nodes (up to 10K), with a node failure nearly every day.
  29. Trends: HARDWARE IS GETTING CHEAPER.
     • A typical node is expected to include at least 64 GB of memory.
     • Storage starts from 4 x 2 TB drives; 8-16 x 4 TB drives are not so rare. This is for a general 'workload' node.
     • 10 or more CPU cores; 2 CPUs is a normal approach.
     • SSDs are starting to be widely used not only for the OS and caching but for the data itself.
     • The main outcome: the per-node cost model is changing.
  30. For whom the bell tolls?
     The old way:
     • Make assumptions about the data you need.
     • Make assumptions about the data model.
     • Make assumptions about the algorithms you need.
     • Get confirmation of your initial guess about the result. Are you surprised?
     The new way:
     • Get as much data as you can.
     • Detect the data model based on a set of algorithms, with an extensive approach.
     • Cluster your data, detect correlations, clean out anomalies... in every way you can afford, on the whole data set.
     • Get grounded results. You can still miss some fundamental aspects, but isn't it much better in any case?
  31. Major Hadoop distributions
     • HortonWorks is 'barely open source': innovative, but 'running too fast'; most of their key technologies are not so mature yet.
     • Cloudera is stable enough but not stale: Hadoop 2.3 with YARN, HBase 0.96.x. Balance.
     • MapR focuses on per-node performance, but they are slightly outdated in terms of functionality, and their distribution costs money. For cases where node performance is a high priority.
     • Intel is a newcomer to this market. Not for the near future.