Designing Big Data Solutions Using AWS

Zakiullah Khan

August 01, 2013

Transcript

  1. Survey Question 1

    How many of you are already using AWS?

    8/2/2013 @khan_io | http://www.khanio.com
  2. Survey Question 2

    How many of you are using Hadoop at work?
  3. Survey Question 3

    How many of you are already using Amazon EMR?
  4. Big Data Solution Requisites: From a Solution Architect’s PoV

    • Volume – Can it handle “large” data volumes?
    • Variety – Can it handle data variety: structured, semi-structured, poly-structured, unstructured?
    • Value – At what cost can it deliver the required performance?
  5. Optimal Big Data Solution: Characteristics for the Right ROI

    • Scalable & Reliable
    • Operational Ease
    • Cost Effective
  6. Amazon Web Services: Quick Introduction

    A utility service that provides technology resources managed by experts and available on demand.

    • Flexible
    • Cost-Effective
    • Scalable & Elastic
    • Secure
    • Experienced
  7. AWS Platform Overview

    • Global Infrastructure: Regions, Availability Zones, Edge Locations
    • Foundation Services: Compute, Storage, Database, Networking
    • Application Platform Services: Content Distribution, Application Services, Parallel Processing, Libraries & SDKs
    • Management & Administration: Identity & Access, Web Interface, Monitoring, Deployment & Automation
  8. AWS Platform: In Scope

    • Global Infrastructure: Regions, Availability Zones, Edge Locations
    • Foundation Services: Compute, Storage, Database, Networking
    • Application Platform Services: Content Distribution, Application Services, Parallel Processing, Libraries & SDKs
    • Management & Administration: Identity & Access, Web Interface, Monitoring, Deployment & Automation
  9. Regions & Availability Zones: Customer Decides Where Apps & Data Reside

    Regions: US East (VA), US West (CA), US West (OR), Asia Pacific (Tokyo), Asia Pacific (Singapore), Asia Pacific (Sydney), EU (Ireland), South America (Sao Paulo)

    [Map: each region is drawn with two or three Availability Zones, labelled A, B and C]

    Note: The actual number of Availability Zones may vary.
  10. Compute: What kind and quantity of computational power is needed?

    • Amazon Elastic Compute Cloud (EC2)
    • Instances (On-Demand, Reserved & Spot)
    • Amazon Machine Image (AMI)
  11. Amazon EC2 Instance Types

    • EC2 Instance = Virtual Server
    • Available in 16 compute capacities
    • EC2 purchasing options are On-Demand, Reserved and Spot
    • An Amazon Machine Image (AMI) is the building block of an EC2 instance
  12. Storage: Where will we store the data?

    • Amazon Simple Storage Service (S3)
    • Amazon Elastic Block Store (EBS)
    • Amazon Glacier
  13. Amazon S3: Universal Object Store

    • A “Bucket” is roughly equivalent to a “folder”
    • Objects from 1 B to 5 TB, no bucket size limit
    • Highly available, scalable, reliable, fast and inexpensive

    [Diagram: Bucket, Bucket with Objects, Object]
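
The bucket/object model above can be illustrated with a small Python sketch that splits an S3 URI into its bucket and object key and checks a size against the 1 B to 5 TB range quoted on the slide. The helper names `split_s3_uri` and `size_allowed` are hypothetical, and reading "5 TB" as a binary terabyte count is an assumption; the sample URI is the one used later in this deck.

```python
from urllib.parse import urlparse

# Size bounds as quoted on the slide; treating "TB" as 1024**4 bytes is an assumption.
MIN_OBJECT_BYTES = 1
MAX_OBJECT_BYTES = 5 * 1024**4


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The bucket is the authority part; the key is the path without its leading slash.
    return parsed.netloc, parsed.path.lstrip("/")


def size_allowed(num_bytes):
    """True if an object of this size fits the range quoted on the slide."""
    return MIN_OBJECT_BYTES <= num_bytes <= MAX_OBJECT_BYTES


bucket, key = split_s3_uri("s3://elasticmapreduce/samples/hive-ads/tables/impressions/")
print(bucket)  # elasticmapreduce
print(key)     # samples/hive-ads/tables/impressions/
```

Note that the "folders" inside a bucket are only key prefixes; the bucket itself is the single real container.
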
  14. Amazon EBS: Storage for EC2

    • EBS Volume = Virtual Disk
    • Storage volume for EC2 instances – create, attach, back up, restore & delete
    • Can be used to create a RAID configuration for an EC2 instance

    [Diagram: Volume, Snapshot]
  15. Data Portability: How do you get large volumes of data into AWS infrastructure?

    • AWS Direct Connect – Dedicated low-latency bandwidth
    • AWS Import/Export – Physical media shipping
    • Queuing – Highly scalable event buffering (CEP)
    • AWS Storage Gateway – Sync local storage to the cloud
  16. Amazon EMR: Managed Hadoop Infrastructure

    • Reduces complexity of Hadoop management – node provisioning, customization and shutdown
    • Run Hive, Pig, Java and Streaming programs
    • Provides tight integration with AWS services
      – Optimized for Amazon S3
      – EC2 integration with automatic provisioning on node failure
      – Cluster monitoring using Amazon CloudWatch
  17. Amazon EMR Internals: Instance Groups

    • Master Instance Group
    • Core Instance Group
    • Task Instance Group

    [Diagram: EMR Cluster containing the Master, Core and Task Instance Groups]
  18. Instance Groups: Master Instance Group

    • Manages the job flow: coordinates the distribution of the MapReduce executable and subsets of raw data to the Core Instance Group and Task Instance Group
    • The Master Node runs both the NameNode and JobTracker daemons
  19. Instance Groups: Core Instance Group

    • Collection of the Core Nodes of a Job Flow
    • Core Node – an EC2 instance that runs Map & Reduce tasks and stores data using HDFS
    • Core Nodes run both the DataNode and TaskTracker daemons
  20. Instance Groups: Task Instance Group

    • Collection of the Task Nodes of a Job Flow
    • Non-persistent in nature
    • Can be added or removed on demand at any stage of the job lifecycle
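
The three instance groups described on the preceding slides can be summarized as a small data structure. This is only a sketch: the `InstanceGroup` class and its field names are hypothetical, the daemon list for task nodes is an assumption (the slides do not name their daemons), and so are the `stores_hdfs` values for the master and task groups.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceGroup:
    name: str
    daemons: tuple      # Hadoop daemons running on each node of the group
    stores_hdfs: bool   # True if the nodes hold HDFS data
    persistent: bool    # False if nodes may come and go during a job


CLUSTER = (
    # Assumption: the master holds no HDFS data blocks itself.
    InstanceGroup("master", ("NameNode", "JobTracker"), stores_hdfs=False, persistent=True),
    InstanceGroup("core", ("DataNode", "TaskTracker"), stores_hdfs=True, persistent=True),
    # Assumption: task nodes run only a TaskTracker and hold no HDFS data.
    InstanceGroup("task", ("TaskTracker",), stores_hdfs=False, persistent=False),
)


def resizable_groups(cluster):
    """Names of groups whose nodes can be added or removed mid-job."""
    return [group.name for group in cluster if not group.persistent]


def groups_running(cluster, daemon):
    """Names of groups that run the given Hadoop daemon."""
    return [group.name for group in cluster if daemon in group.daemons]


print(resizable_groups(CLUSTER))               # ['task']
print(groups_running(CLUSTER, "TaskTracker"))  # ['core', 'task']
```
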
  21. Amazon EMR Job Flow: Execution Lifecycle

    • Create Job Flow: cluster created (Master, Core & Task Instance Groups)
    • Run Job Step 1: processing step (execute scripts such as Java, Python, Hive, Pig, etc.)
    • Run Job Step 2: chained step (execute the subsequent script using data from the previous step)
    • …
    • Run Job Step N: chained step (cascaded steps pipeline)
    • End Job Flow: cluster terminated
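
The lifecycle above (create the cluster, run chained steps, terminate) can be mimicked locally in plain Python. This is a toy model, not the EMR API: `run_job_flow` is a hypothetical helper, and the two step functions merely stand in for the Hive, Pig or Java programs a real job flow would run.

```python
def run_job_flow(steps, initial_input):
    """Run steps in order, feeding each step the output of the previous one."""
    log = ["cluster created"]
    data = initial_input
    for number, step in enumerate(steps, start=1):
        data = step(data)  # chained: this step consumes the previous step's output
        log.append(f"step {number} done")
    log.append("cluster terminated")
    return data, log


def split_words(lines):
    """Step 1: turn lines of text into a flat list of words."""
    return [word for line in lines for word in line.split()]


def count_words(words):
    """Step 2: count how often each word occurs."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts


result, log = run_job_flow([split_words, count_words], ["big data", "big cluster"])
print(result)  # {'big': 2, 'data': 1, 'cluster': 1}
print(log)     # ['cluster created', 'step 1 done', 'step 2 done', 'cluster terminated']
```
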
  22. Demo: Running a Hive Job (interactive & batch mode) on Amazon EMR
  23. Data Corpus: As cluster gets rolled…

    Impressions
    s3://elasticmapreduce/samples/hive-ads/tables/impressions/dt=$time/$hostname-$time.log

      {
        requestBeginTime: "19191901901",
        requestEndTime: "19089012890",
        browserCookie: "xFHJK21AS6HLASLHAS",
        userCookie: "ajhlasH6JASLHbas8",
        searchPhrase: "digital cameras",
        adId: "jalhdahu789asashja",
        impresssionId: "hjakhlasuhiouasd897asdh",
        referrer: "http://cooking.com/recipe?id=10231",
        hostname: "ec2-12-12-12-12.ec2.amazonaws.com",
        modelId: "asdjhklasd7812hjkasdhl",
        processId: "12901",
        threadId: "112121",
        timers: { requestTime: "1910121", modelLookup: "1129101" },
        counters: { heapSpace: "1010120912012" }
      }

    Clicks
    s3://elasticmapreduce/samples/hive-ads/tables/clicks/dt=$time/$hostname-$time.log

      {
        requestBeginTime: "19191901901",
        requestEndTime: "19089012890",
        browserCookie: "xFHJK21AS6HLASLHAS",
        userCookie: "ajhlasH6JASLHbas8",
        adId: "jalhdahu789asashja",
        impresssionId: "hjakhlasuhiouasd897asdh",
        clickId: "ashda8ah8asdp1uahipsd",
        referrer: "http://recipes.com/",
        directedTo: "http://cooking.com/"
      }
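
To make the log layout concrete, here is a minimal Python sketch that builds a partition path following the `dt=$time/$hostname-$time.log` pattern and computes a per-ad click-through rate from impression and click records. It assumes each log line is a JSON object with quoted keys, which the slide does not confirm; the helper names, the timestamp format and the sample records are all hypothetical. Only the `adId` field, which appears in both record types, is used.

```python
import json
from collections import Counter

BASE = "s3://elasticmapreduce/samples/hive-ads/tables"


def partition_path(table, time, hostname):
    """Build a log path following the dt=$time/$hostname-$time.log layout."""
    return f"{BASE}/{table}/dt={time}/{hostname}-{time}.log"


def click_through_rate(impression_lines, click_lines):
    """Return {adId: clicks / impressions} for every ad that was shown."""
    shown = Counter(json.loads(line)["adId"] for line in impression_lines)
    clicked = Counter(json.loads(line)["adId"] for line in click_lines)
    # A Counter returns 0 for ads that were never clicked.
    return {ad: clicked[ad] / count for ad, count in shown.items()}


# Hypothetical sample records, one JSON object per line.
impressions = ['{"adId": "ad-1"}', '{"adId": "ad-1"}', '{"adId": "ad-2"}']
clicks = ['{"adId": "ad-1"}']

print(partition_path("clicks", "2013-08-01-10-00", "ec2-12-12-12-12.ec2.amazonaws.com"))
print(click_through_rate(impressions, clicks))  # {'ad-1': 0.5, 'ad-2': 0.0}
```

In the demo itself this kind of aggregation would be expressed as a Hive query over the partitioned tables; the Python version only shows the shape of the computation.
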
  24. Data Flow within Cluster: Moving Parts

    [Diagram: S3 Bucket, Master Instance Group, Core Instance Group, Task Instance Group]
  25. Automated Batch Query using Hive: Solution Design

    [Diagram: Amazon EMR, Input S3 Bucket, Output S3 Bucket, AWS Data Pipeline]
  26. Time & Cost Savings: A Solution Architect’s Tip

    [Diagram: EMR Cluster, Availability Zone]

    Scenario 1
    • Duration: 14 hours
    • Total cost: 4 * 14 * 0.50 = $28

    Scenario 2
    • Duration: 7 hours
    • Total cost: 4 * 7 * 0.50 = $14, plus 5 * 7 * 0.25 = $8.75, so $14 + $8.75 = $22.75

    Time savings: 50%. Cost savings: 18.75% ($5.25 of $28).
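
The arithmetic on this slide can be rechecked with a few lines of Python. The sketch assumes each product reads as nodes × hours × hourly rate in dollars, and that the five extra nodes in scenario 2 are billed at the lower 0.25 rate (presumably cheaper capacity such as Spot, which the slide does not state). Recomputed from the slide's own figures, the cost saving relative to scenario 1 comes out at 18.75%.

```python
def cluster_cost(nodes, hours, hourly_rate):
    """Cost in dollars of running `nodes` instances for `hours` at `hourly_rate`."""
    return nodes * hours * hourly_rate


# Scenario 1: 4 nodes for 14 hours at $0.50 per node-hour.
scenario_1 = cluster_cost(4, 14, 0.50)

# Scenario 2: the same 4 nodes plus 5 cheaper nodes, all for 7 hours.
scenario_2 = cluster_cost(4, 7, 0.50) + cluster_cost(5, 7, 0.25)

time_savings = (14 - 7) / 14
cost_savings = (scenario_1 - scenario_2) / scenario_1

print(f"scenario 1: ${scenario_1:.2f}")      # scenario 1: $28.00
print(f"scenario 2: ${scenario_2:.2f}")      # scenario 2: $22.75
print(f"time savings: {time_savings:.0%}")   # time savings: 50%
print(f"cost savings: {cost_savings:.2%}")   # cost savings: 18.75%
```
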
  27. Other Considerations: More SA Tips

    • Use a VPC for low-latency traffic between EMR clusters in the same AZ.
    • Use CloudWatch to monitor metrics and scale based on cluster performance.
    • Use HDFS to store intermediate results and S3 for job inputs/outputs; note the cap of 200 transactions per second on S3.
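
One way to respect a request-rate cap like the one mentioned in the last tip is to pace calls on the client side. The sketch below is a generic throttle, not part of any AWS SDK: the `Throttle` class is hypothetical, the figure of 200 requests per second is taken from the slide as-is (actual S3 limits have changed since 2013), and the fake clock is there only so the example runs instantly.

```python
import time


class Throttle:
    """Space out calls so that at most `rate` of them happen per second."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / rate
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = clock()

    def wait(self):
        """Block until the next call is allowed, then reserve the following slot."""
        now = self.clock()
        if now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.interval


# Demonstration with a fake clock, so nothing actually sleeps.
fake_now = [0.0]
slept = []


def fake_sleep(seconds):
    slept.append(seconds)
    fake_now[0] += seconds


throttle = Throttle(200, clock=lambda: fake_now[0], sleep=fake_sleep)
for _ in range(3):
    throttle.wait()  # a real job would issue one S3 request after each wait

print(len(slept))  # 2: the first call goes through immediately
```

With the default arguments (`Throttle(200)`) the same class paces real calls using `time.monotonic` and `time.sleep`.
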