Designing Big Data Solutions Using AWS

By Zakiullah Khan Mohammed
@khan_io | http://www.khanio.com

Survey Question 1
How many of you are already using AWS?
8/2/2013 @khan_io | http://www.khanio.com

Survey Question 2
How many of you are using Hadoop at work?

Survey Question 3
How many of you are already using Amazon EMR?

Big Data Solution Requisites
From a Solution Architect’s PoV
• Volume
  – Can it handle “large” data volumes?
• Variety
  – Can it handle data variety: structured, semi-structured, poly-structured, unstructured?
• Value
  – At what cost can it deliver the required performance?

Optimal Big Data Solution
Characteristics for the right ROI
• Scalable & Reliable
• Operational Ease
• Cost Effective

Amazon Web Services
Quick Introduction
A utility service that provides you with technology resources, managed by experts and available on demand.
• Flexible
• Cost-Effective
• Scalable & Elastic
• Secure
• Experienced

AWS Platform
Overview
• Global Infrastructure
  – Regions
  – Availability Zones
  – Edge Locations
• Foundation Services
  – Compute
  – Storage
  – Database
  – Networking
• Application Platform Services
  – Content Distribution
  – Application Services
  – Parallel Processing
  – Libraries & SDKs
• Management & Administration
  – Identity & Access
  – Web Interface
  – Monitoring
  – Deployment & Automation

AWS Platform
In Scope
• Global Infrastructure
  – Regions
  – Availability Zones
  – Edge Locations
• Foundation Services
  – Compute
  – Storage
  – Database
  – Networking
• Application Platform Services
  – Content Distribution
  – Application Services
  – Parallel Processing
  – Libraries & SDKs
• Management & Administration
  – Identity & Access
  – Web Interface
  – Monitoring
  – Deployment & Automation

Regions & Availability Zones
Customer decides where apps & data reside
• US East (VA)
• US West (CA)
• US West (OR)
• Asia Pacific (Tokyo)
• Asia Pacific (Singapore)
• Asia Pacific (Sydney)
• EU (Ireland)
• South America (Sao Paulo)
Each region is shown with its Availability Zones (A, B and, in some regions, C).
Note: The actual number of Availability Zones may vary.

Compute
Decide the kind & quantity of computational power needed
• Amazon Elastic Compute Cloud (EC2)
• Instances (On-Demand, Reserved & Spot)
• Amazon Machine Image (AMI)

Amazon EC2
Instance Types
• EC2 Instance = Virtual Server
• Available in 16 compute capacities
• EC2 instances can be purchased as On-Demand, Reserved or Spot
• An Amazon Machine Image (AMI) is the building block of an EC2 instance

Storage
Where will we store the data?
• Amazon Simple Storage Service (S3)
• Amazon Elastic Block Store (EBS)
• Amazon Glacier

Amazon S3
Universal Object Store
• A “bucket” is roughly equivalent to a top-level “folder” that holds objects
• Objects range from 1 B to 5 TB; there is no bucket size limit
• Highly available, scalable, reliable, fast and inexpensive
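As a small illustration of the bucket/object model, the sketch below splits an S3 URI into its bucket and object key using only the Python standard library. The helper name `split_s3_uri` is hypothetical; the example URI is the sample-data location quoted on the Data Corpus slide. Anything that looks like a sub-folder inside a bucket is simply a prefix of the object key.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key). Hypothetical helper."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The key is everything after the first slash and may itself contain slashes.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_uri("s3://elasticmapreduce/samples/hive-ads/tables/impressions/")
print(bucket)
print(key)
```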

Amazon EBS
Storage for EC2
• EBS Volume = Virtual Disk
• Storage volume for EC2 instances
  – Create, attach, back up, restore & delete
• Can be used to create a RAID configuration for an EC2 instance
• Building blocks: Volume, Snapshot

Data Portability
How do you get large volumes of data into AWS infrastructure?
• AWS Direct Connect – Dedicated low-latency bandwidth
• AWS Import/Export – Physical media shipping
• Queuing – Highly scalable event buffering (CEP)
• Amazon Storage Gateway – Sync local storage to the cloud

Parallel Processing
Distributed Job Processing
• Amazon Elastic MapReduce cluster
• HDFS cluster

Amazon EMR
Managed Hadoop Infrastructure
• Reduces the complexity of Hadoop management
  – Node provisioning, customization and shutdown
• Runs Hive, Pig, Java and Streaming programs
• Provides tight integration with AWS services
  – Optimized for Amazon S3
  – EC2 integration with automatic provisioning on node failure
  – Cluster monitoring using Amazon CloudWatch

Amazon EMR Internals
Instance Groups
An EMR cluster consists of:
• Master Instance Group
• Core Instance Group
• Task Instance Group

Master Instance Group
Instance Groups
• Manages the job flow: coordinates the distribution of the MapReduce executable and subsets of raw data to the Core Instance Group and Task Instance Group
• The master node runs both the NameNode and JobTracker daemons

Core Instance Group
Instance Groups
• Collection of the core nodes of a job flow
• Core node – an EC2 instance that runs Map & Reduce tasks and stores data using HDFS
• Core nodes run both the DataNode and TaskTracker daemons

Task Instance Group
Instance Groups
• Collection of the task nodes in a job flow
• Non-persistent in nature
• Can be added or removed on demand at any stage of the job lifecycle
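The three instance groups can be written down as plain configuration. The sketch below is illustrative only: the dictionary keys mirror the `InstanceGroups` structure accepted by the boto3 EMR `run_job_flow` call, but nothing is sent to AWS here, and the helper name, the `m1.large` instance type and the node counts are assumptions rather than values from the talk.

```python
def build_instance_groups(core_count, task_count, instance_type="m1.large"):
    """Describe the master, core and (optional) task groups as dictionaries."""
    groups = [
        # Exactly one master node: it runs the NameNode and JobTracker daemons.
        {"Name": "Master", "InstanceRole": "MASTER",
         "InstanceType": instance_type, "InstanceCount": 1},
        # Core nodes run tasks and store HDFS data.
        {"Name": "Core", "InstanceRole": "CORE",
         "InstanceType": instance_type, "InstanceCount": core_count},
    ]
    if task_count > 0:
        # Task nodes only run tasks, so the group can be omitted or resized freely.
        groups.append({"Name": "Task", "InstanceRole": "TASK",
                       "InstanceType": instance_type, "InstanceCount": task_count})
    return groups


for group in build_instance_groups(core_count=4, task_count=5):
    print(group["InstanceRole"], group["InstanceCount"])
```

Leaving the task group optional reflects the point made on the slide: task nodes hold no HDFS data, so they can be added or dropped without affecting stored data.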

Amazon EMR Job Flow
Execution Lifecycle
• Create Job Flow – Cluster created (Master, Core & Task Instance Groups)
• Run Job Step 1 – Processing step (executes scripts such as Java, Python, Hive, Pig, etc.)
• Run Job Step 2 – Chained step (executes a subsequent script using data from the previous step)
• …
• Run Job Step N – Chained step (cascaded steps pipeline)
• End Job Flow – Cluster terminated
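The lifecycle above can be modelled in a few lines of Python. This is a toy simulation under stated assumptions, not the EMR API: `simulate_job_flow` and the two example steps are hypothetical, and each "step" is just a function that receives the previous step's output.

```python
def simulate_job_flow(steps, initial_input):
    """Run steps in order, feeding each step the previous step's output."""
    events = ["cluster created"]
    data = initial_input
    for name, step in steps:
        data = step(data)  # chained: output of one step is input to the next
        events.append(f"step finished: {name}")
    events.append("cluster terminated")
    return data, events


steps = [
    ("parse", lambda lines: [line.split(",") for line in lines]),
    ("count", lambda rows: len(rows)),
]
result, events = simulate_job_flow(steps, ["a,1", "b,2", "c,3"])
print(result)
print(events)
```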

Demo
Running a Hive job (interactive & batch mode) on Amazon EMR

Interactive Query using Hive
Solution Design
Components: User, Amazon EMR, Input S3 Bucket, Output S3 Bucket

Data Corpus
As cluster gets rolled…

Impressions
s3://elasticmapreduce/samples/hive-ads/tables/impressions/dt=$time/$hostname-$time.log

    {
      requestBeginTime: "19191901901",
      requestEndTime: "19089012890",
      browserCookie: "xFHJK21AS6HLASLHAS",
      userCookie: "ajhlasH6JASLHbas8",
      searchPhrase: "digital cameras",
      adId: "jalhdahu789asashja",
      impresssionId: "hjakhlasuhiouasd897asdh",
      referrer: "http://cooking.com/recipe?id=10231",
      hostname: "ec2-12-12-12-12.ec2.amazonaws.com",
      modelId: "asdjhklasd7812hjkasdhl",
      processId: "12901",
      threadId: "112121",
      timers: {
        requestTime: "1910121",
        modelLookup: "1129101"
      },
      counters: {
        heapSpace: "1010120912012"
      }
    }

Clicks
s3://elasticmapreduce/samples/hive-ads/tables/clicks/dt=$time/$hostname-$time.log

    {
      requestBeginTime: "19191901901",
      requestEndTime: "19089012890",
      browserCookie: "xFHJK21AS6HLASLHAS",
      userCookie: "ajhlasH6JASLHbas8",
      adId: "jalhdahu789asashja",
      impresssionId: "hjakhlasuhiouasd897asdh",
      clickId: "ashda8ah8asdp1uahipsd",
      referrer: "http://recipes.com/",
      directedTo: "http://cooking.com/"
    }
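Impression and click records share an impression identifier, so the two logs can be joined on it. The sketch below shows that join in plain Python as a rough stand-in for what a Hive query would do at scale. It rests on several assumptions: each log line is one strict JSON object (the records on the slide use unquoted keys), the sample records and the function name `click_through_rate` are hypothetical, and the join key is passed in as a parameter because its exact spelling in the real logs is not confirmed here.

```python
import json


def click_through_rate(impression_lines, click_lines, join_key):
    """Fraction of impressions that have at least one click with the same join_key."""
    impressions = {json.loads(line)[join_key] for line in impression_lines}
    clicked = {json.loads(line)[join_key] for line in click_lines}
    if not impressions:
        return 0.0
    return len(impressions & clicked) / len(impressions)


# Hypothetical records, reduced to the fields the join needs.
impression_log = [
    '{"impressionId": "imp-1", "adId": "ad-1"}',
    '{"impressionId": "imp-2", "adId": "ad-1"}',
    '{"impressionId": "imp-3", "adId": "ad-2"}',
    '{"impressionId": "imp-4", "adId": "ad-2"}',
]
click_log = ['{"impressionId": "imp-2", "clickId": "clk-1"}']

print(click_through_rate(impression_log, click_log, "impressionId"))
```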

Data Flow within Cluster
Moving Parts
• S3 Bucket
• Master Instance Group
• Core Instance Group
• Task Instance Group

Automated Batch Query using Hive
Solution Design
Components: AWS Data Pipeline, Amazon EMR, Input S3 Bucket, Output S3 Bucket

Time & Cost Savings
A Solution Architect’s Tip
Scenario 1
• Duration: 14 hours
• Total cost: 4 * 14 * 0.50 = $28
Scenario 2
• Duration: 7 hours
• Total cost: 4 * 7 * 0.50 = $14, plus 5 * 7 * 0.25 = $8.75, so $14 + $8.75 = $22.75
Time savings: 50%; cost savings: about 19% (($28 - $22.75) / $28)
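The arithmetic on this slide can be checked with a short script. It assumes each product reads as instances × hours × hourly rate in dollars, which the slide does not state explicitly; the function name `cluster_cost` is hypothetical. Savings are computed relative to Scenario 1.

```python
def cluster_cost(instances, hours, hourly_rate):
    """Cost of running `instances` nodes for `hours` at `hourly_rate` dollars each."""
    return instances * hours * hourly_rate


scenario_1 = cluster_cost(4, 14, 0.50)
# Scenario 2: the same 4 nodes for half the time, plus 5 extra nodes at a lower rate.
scenario_2 = cluster_cost(4, 7, 0.50) + cluster_cost(5, 7, 0.25)

time_savings = (14 - 7) / 14
cost_savings = (scenario_1 - scenario_2) / scenario_1

print(f"Scenario 1: ${scenario_1:.2f}")
print(f"Scenario 2: ${scenario_2:.2f}")
print(f"Time savings: {time_savings:.0%}")
print(f"Cost savings: {cost_savings:.2%}")
```

Measured against the Scenario 1 baseline of $28, the $5.25 difference works out to 18.75%.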

Other Considerations
More SA’s Tips
• Use a VPC for low latency between EMR clusters within a particular AZ.
• Use CloudWatch to monitor metrics and scale based on cluster performance.
• Use HDFS to store intermediate results and S3 for job inputs/outputs; S3 has a cap of 200 transactions per second.

Thank You
Q&A
