
How It Works - Hadoop


A series of talks on Data Engineering


Yuri Ostapchuk

September 12, 2021


Transcript

  1. WHAT IS HADOOP? Original meaning: the framework (Hadoop Core): HDFS and MapReduce (YARN). Broader meaning: the ecosystem/solution, i.e. a "Hadoop solution" vs. classic RDBMS warehousing, an ecosystem of tools and frameworks that originated from Hadoop or its principles.
  2. HADOOP CHARACTERISTICS: open-source; commodity hardware; [semi/un]structured data, schema-on-read; fault-tolerant, highly available; scalable (thousands of nodes); highly parallel computation; data locality.
  3. HDFS CHARACTERISTICS: almost POSIX (`hdfs dfs -ls -a user/spark..`); URI scheme: hdfs://92.23.23.23/user/spark/..; files are split into 64 MB blocks.
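The block splitting above can be sketched with a little arithmetic. A minimal sketch in plain Java, assuming the 64 MB block size mentioned on the slide (the class and method names here are illustrative, not part of any Hadoop API):

```java
// Sketch: how HDFS divides a file into fixed-size blocks.
// Assumes the 64 MB default block size from the slide.
public class BlockSplit {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB in bytes

    // Number of blocks a file of the given size occupies in HDFS.
    static long blockCount(long fileSizeBytes) {
        if (fileSizeBytes == 0) return 0;
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE; // ceiling division
    }

    public static void main(String[] args) {
        // A 200 MB file spans 4 blocks: 64 + 64 + 64 + 8 MB.
        System.out.println(blockCount(200L * 1024 * 1024)); // 4
        // A 1 KB file still registers as one block in NameNode metadata.
        System.out.println(blockCount(1024)); // 1
    }
}
```

Note the last block may be shorter than 64 MB; only the NameNode metadata treats every block uniformly.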
  4. MAPREDUCE HIGHLIGHTS: any job is represented as a series of map/reduce steps; embedded data locality; high scalability; relatively slow; complex.
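The map/reduce model above can be sketched without Hadoop at all. A minimal in-memory word count in plain Java, showing the map, shuffle (group by key), and reduce phases that a MapReduce job expresses (class and method names are illustrative only):

```java
import java.util.*;
import java.util.stream.*;

// Sketch: the map -> shuffle -> reduce flow of a word-count job, in plain Java.
public class MiniMapReduce {
    // Map phase: each input line emits (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.trim().split("\\s+"))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // Shuffle + reduce: group pairs by key, then sum the values per key.
    static Map<String, Integer> run(List<String> lines) {
        return lines.stream()
                .flatMap(l -> map(l).stream())
                .collect(Collectors.groupingBy(Map.Entry::getKey,
                        Collectors.summingInt(Map.Entry::getValue)));
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = run(List.of("hello world", "hello hadoop"));
        System.out.println(counts); // hello=2, world=1, hadoop=1 (map order unspecified)
    }
}
```

In real MapReduce the shuffle moves data between nodes, which is why the model scales but is relatively slow for short jobs.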
  5. HADOOP LIMITATIONS: a lot of small files (each file still occupies a full 64 MB block entry in NameNode metadata); MapReduce is slow; no ACID, append-only (or rewrite); batch processing only; highly complex (configuration & programming API).
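The small-files limitation above can be made concrete with rough arithmetic. A sketch assuming the commonly cited rule of thumb of roughly 150 bytes of NameNode heap per metadata object (an assumption for illustration, not an exact figure):

```java
// Sketch: why many small files hurt HDFS. Every file and every block is an
// object held in NameNode heap memory.
public class SmallFiles {
    // Rule-of-thumb assumption: ~150 bytes of heap per metadata object.
    static final long BYTES_PER_OBJECT = 150;

    // Heap needed for n single-block files: one file object + one block object each.
    static long namenodeHeapBytes(long files) {
        return files * 2 * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        // 10 million small files -> ~3 GB of NameNode heap for metadata alone.
        System.out.println(namenodeHeapBytes(10_000_000L)); // 3000000000
    }
}
```

The data itself may be tiny; it is the per-file metadata, all resident in one JVM heap, that limits scale.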
  6. EMR FEATURES: EC2 control; configuration & component version management; flexible resource utilization and cost model; fast provisioning; auto-scaling; high availability; integrations: CloudFormation, CloudWatch, S3 (EMRFS), AWS Glue, DynamoDB.
  7. DEMO: Hadoop version 3.1.3.

     core-site.xml:
       <property>
         <name>fs.defaultFS</name>
         <value>hdfs://localhost:9000</value>
       </property>

     hdfs-site.xml:
       <property>
         <name>dfs.replication</name>
         <value>1</value>
       </property>

     mapred-site.xml:
       <property>
         <name>mapreduce.framework.name</name>
         <value>yarn</value>
       </property>
       <property>
         <name>yarn.app.mapreduce.am.env</name>
         <value>HADOOP_MAPRED_HOME=/home/twist/Down 3.1.3</value>
       </property>
  8.   <property>
         <name>mapreduce.map.env</name>
         <value>HADOOP_MAPRED_HOME=/home/twist/Down 3.1.3</value>
       </property>
       <property>
         <name>mapreduce.reduce.env</name>
         <value>HADOOP_MAPRED_HOME=/home/twist/Down 3.1.3</value>
       </property>

     yarn-site.xml:
       <property>
         <name>yarn.nodemanager.aux-services</name>
         <value>mapreduce_shuffle</value>
       </property>
       <property>
         <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
         <value>org.apache.hadoop.mapred.ShuffleHandler</value>
       </property>

     Format HDFS.
  9. Start DFS, start YARN. Word count:

       export JAVA_HOME=/usr/lib/jv
       export PATH=${JAVA_HOME}/bin:${PATH}
       export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
       javac WordCount.java
       jar

     WordCount.java:

       import java.io.IOException;
       import java.util.StringTokenizer;
       import org.apache.hadoop.conf.Configuration;
       import org.apache.hadoop.fs.Path;
       import org.apache.hadoop.io.IntWritable;
       import org.apache.hadoop.io.Text;
       import org.apache.hadoop.mapreduce.Job;
       import org.apache.hadoop.mapreduce.Mapper;
       import org.apache.hadoop.mapreduce.Reducer;
       import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
       import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

       public class WordCount {
         public static class TokenizerMapper