
Hadoop in AWS


How to use Hadoop to clean up old Docker images? Easy!


Sergey Dzyuban

October 26, 2021


Transcript

  1. SQLSat Kyiv Team

    Eugene Polonichko, Oksana Tkach, Oksana Borysenko, Denis Reznik, Mykola Pobyivovk, Yevhen Nedashkivskyi, Anton Artomov
  2. Sponsor sessions are at 12:30 and 13:00

    Don’t miss them: they may provide interesting and valuable information!

    - 12:30 Congress Hall: DevArt
    - 12:30 Conference Hall: Infopulse
    - 12:30 Room AC: Materialize
    - 13:00 Congress Hall: DB Best
    - 13:00 Conference Hall: Eleks
  3. Session will begin very soon :)

    - Please complete the electronic evaluation form for this session and for the event. Your feedback will help us improve future conferences, and the speakers will appreciate it!
    - Enjoy the conference!
  4. About the speaker

    Sergey Dzyuban, Intapp

    Career timeline (years marked on the slide: 2003, 2007, 2010, 2014, 2016): LAMP admin, Web Developer, .NET Developer, Technical Account Manager, Team Lead, DevOps
  5. Let’s delete something on prod!

    - Review artifacts
    - Analyze space usage
    - Find unused files
    - Delete unused files
  6. Quick review

    - Different types of feeds
    - Artifacts produced by a few CI systems
    - Deployment to different target platforms
    - 2.7 TB total size
  7. Initial analysis

    File info:

    GET /api/storage/libs-release-local/…/lib-ver.pom

    {
      "uri": "http://localhost:8081/.../lib-ver.pom",
      "downloadUri": "http://localhost:8081/.../lib-ver.pom",
      "repo": "libs-release-local",
      "path": "/org/acme/lib/ver/lib-ver.pom",
      "remoteUrl": "http://some-remote-repo/.../lib-ver.pom",
      "created": ISO8601 (yyyy-MM-dd'T'HH:mm:ss.SSSZ),
      "createdBy": "userY",
      "lastModified": ISO8601 (yyyy-MM-dd'T'HH:mm:ss.SSSZ),
      "modifiedBy": "userX",
      "lastUpdated": ISO8601 (yyyy-MM-dd'T'HH:mm:ss.SSSZ),
      "size": "1024", //bytes
      "mimeType": "application/pom+xml",
      "checksums": { "md5" : string, "sha1" : string, "sha256" : string },
      "originalChecksums": { "md5" : string, "sha1" : string, "sha256" : string }
    }

    File statistics:

    GET /api/storage/libs-release-local/…/lib-ver.pom?stats

    {
      "uri": "http://localhost:8081/.../lib-ver.pom",
      "lastDownloaded": Timestamp (ms),
      "downloadCount": 1337,
      "lastDownloadedBy": "user1"
    }
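The two responses above describe the same file, so they can be flattened into one record per artifact before any analysis. The following is a minimal sketch of that step, not the speaker's actual tooling. It assumes the download statistics are served by the same storage endpoint with a `?stats` query parameter; `base_url`, `storage_url`, and `merge_file_record` are hypothetical names, and the sample values in the usage lines are invented for illustration.

```python
from datetime import datetime, timezone
from typing import Optional


def storage_url(base_url: str, repo: str, path: str, stats: bool = False) -> str:
    """Build a storage API URL; `?stats` is assumed to select download statistics."""
    url = f"{base_url.rstrip('/')}/api/storage/{repo}/{path.lstrip('/')}"
    return url + "?stats" if stats else url


def ms_to_date(timestamp_ms: Optional[int]) -> Optional[str]:
    """Convert epoch milliseconds to an ISO date (UTC); None if never downloaded."""
    if not timestamp_ms:
        return None
    return datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc).date().isoformat()


def merge_file_record(info: dict, stats: dict) -> dict:
    """Flatten the file-info and file-statistics responses into one row."""
    return {
        "repo": info["repo"],
        "path": info["path"],
        "size": int(info["size"]),  # the response carries size as a string of bytes
        "created": info["created"],
        "downloads": stats.get("downloadCount", 0),
        "downloaded": ms_to_date(stats.get("lastDownloaded")),
    }


# Illustrative input only: field names follow the slide, values are made up.
info = {
    "repo": "libs-release-local",
    "path": "/org/acme/lib/ver/lib-ver.pom",
    "size": "1024",
    "created": "2016-01-01T00:00:00.000Z",
}
stats = {"downloadCount": 1337, "lastDownloaded": 1546300800000}
row = merge_file_record(info, stats)
print(row)
```

One flat row per file is convenient because it can be written as a line of JSON, which is one of the formats the query engines discussed later in the deck can read.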
  8. Initial results

    - Total count: 360,329
    - Total size: 2.480 TB
    - Docker and time repos: 99% of size
    - Docker: 90% of downloads
    - Half of the files were never downloaded
    - Growth rate: 300 GB/month
    - No extremely large files found
    - No extremely large products found
  9. Data structure

    image
      1.0.0
        sha_xxxxxxxxxxxxxxx_1
        sha_xxxxxxxxxxxxxxx_2
        sha_xxxxxxxxxxxxxxx_3
      1.0.1
        sha_xxxxxxxxxxxxxxx_1
        sha_xxxxxxxxxxxxxxx_2
        sha_xxxxxxxxxxxxxxx_4
        sha_xxxxxxxxxxxxxxx_5

    Metadata boxes, in slide order:

    - created: 01-01-2016, updated: 01-02-2017
    - downloads: 100, size: 100 MB, created: 01-01-2016, updated: 01-02-2017, downloaded: 01-2019
    - downloads: 0, size: 120 MB, created: 01-01-2016, updated: 01-02-2017, downloaded: null
    - downloads: 20, size: 30 MB, created: 01-01-2016, updated: 01-02-2017, downloaded: 05-2019
  10. Aggregations

    image
      1.0.0
        sha_xxxxxxxxxxxxxxx_1
        sha_xxxxxxxxxxxxxxx_2
        sha_xxxxxxxxxxxxxxx_3
      1.0.1
        sha_xxxxxxxxxxxxxxx_1
        sha_xxxxxxxxxxxxxxx_2
        sha_xxxxxxxxxxxxxxx_4

    Aggregated metadata:

    - downloads: sum 120, max 100, min 0
    - size: 250 MB
    - created: 01-01-2016, updated: 01-02-2017, downloaded: 05-2019

    Per-layer metadata, in slide order:

    - downloads: 100, size: 100 MB, created: 01-01-2016, updated: 01-02-2017, downloaded: 01-2019
    - downloads: 0, size: 120 MB, created: 01-01-2016, updated: 01-02-2017, downloaded: null
    - downloads: 20, size: 30 MB, created: 01-01-2016, updated: 01-02-2017, downloaded: 05-2019
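The roll-up on this slide can be reproduced in a few lines. The sketch below is only an illustration of the arithmetic: the `Layer` class and `aggregate` function are hypothetical names, the assignment of the three metadata boxes to layers `_1` to `_3` is an assumption, and the download dates are rewritten as `YYYY-MM` so that plain string comparison orders them correctly.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Layer:
    name: str
    downloads: int
    size_mb: int
    downloaded: Optional[str]  # "YYYY-MM" of the last download, None if never


def aggregate(layers: List[Layer]) -> dict:
    """Roll layer-level numbers up to the node that owns the layers."""
    counts = [layer.downloads for layer in layers]
    dates = [layer.downloaded for layer in layers if layer.downloaded is not None]
    return {
        "downloads_sum": sum(counts),
        "downloads_max": max(counts),
        "downloads_min": min(counts),
        "size_mb": sum(layer.size_mb for layer in layers),
        # most recent download across all layers, None if nothing was ever pulled
        "downloaded": max(dates) if dates else None,
    }


# Values from the slide; the layer names they belong to are assumed.
layers = [
    Layer("sha_xxxxxxxxxxxxxxx_1", downloads=100, size_mb=100, downloaded="2019-01"),
    Layer("sha_xxxxxxxxxxxxxxx_2", downloads=0, size_mb=120, downloaded=None),
    Layer("sha_xxxxxxxxxxxxxxx_3", downloads=20, size_mb=30, downloaded="2019-05"),
]
summary = aggregate(layers)
print(summary)
# {'downloads_sum': 120, 'downloads_max': 100, 'downloads_min': 0, 'size_mb': 250, 'downloaded': '2019-05'}
```

Keeping both the maximum and the minimum is useful for cleanup decisions: a minimum of 0 shows that at least one layer was never pulled, while the maximum shows whether the image as a whole was ever in demand.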
  11. Kibana limitations

    - Slow calculated fields
    - AWS ES has no export feature
    - Specific query language
    - REST API is needed for advanced tasks such as data export
    - Specific query and Group By syntax
  12. AWS Athena

    - ANSI-standard SQL
    - Based on the Presto distributed SQL engine
    - Serverless
    - File formats: JSON, CSV, log files, text with custom delimiters, Apache Parquet, and Apache ORC
    - $5.00 per TB of data scanned
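Because the price is charged per terabyte scanned, the cost of a query can be estimated from the amount of data it reads. The sketch below is a rough calculation only: the $5.00 rate comes from the slide, while the decimal terabyte, the `athena_scan_cost` helper, and the figure of about 1,000 bytes of metadata per artifact record are assumptions for illustration. It ignores any per-query minimum or rounding that real billing may apply.

```python
PRICE_PER_TB_USD = 5.00  # rate quoted on the slide
BYTES_PER_TB = 10**12    # assumption: decimal terabyte


def athena_scan_cost(bytes_scanned: int) -> float:
    """Estimated cost in USD of one query that scans `bytes_scanned` bytes."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD


# Hypothetical workload: one full scan over the metadata of every artifact.
record_count = 360_329            # total file count from the initial results
assumed_bytes_per_record = 1_000  # assumption, not a measured value
full_scan_cost = athena_scan_cost(record_count * assumed_bytes_per_record)
print(f"full metadata scan: ${full_scan_cost:.6f}")
```

Under these assumptions the metadata is tiny compared with the artifacts themselves, so the scan price is dominated by how the data is stored rather than by how many files exist.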
  13. Amazon Elastic MapReduce (EMR)

    - Easy to use
    - Low cost
    - Flexible
    - Elastic
    - Secure
  14. EMR Deployment

    - To save costs, use Spot Instances
    - Nodes are plain EC2 machines
    - Nodes can be accessed via SSH
  15. AWS CLI

    aws emr create-cluster \
      --name "sql-saturday" \
      --profile "emr-dzyuban" \
      --configurations file://presto.config.json \
      --release-label emr-5.20.0 \
      --use-default-roles \
      --ec2-attributes KeyName=Intapp \
      --applications Name=Hadoop Name=Spark Name=Hive Name=PRESTO Name=HUE Name=ZEPPELIN Name=Ganglia Name=Pig Name=Sqoop \
      --instance-fleets=file://fleet.json

    presto.config.json:

    [
      {
        "Classification": "presto-connector-hive",
        "Properties": {
          "hive.metastore": "glue",
          "hive.metastore.glue.datacatalog.enabled": "true"
        }
      },
      {
        "Classification": "spark-hive-site",
        "Properties": {
          "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
        }
      }
    ]
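When the cluster is created repeatedly, it can help to generate the configuration file and the command line from one place so the two stay in sync. The sketch below does that with the standard library only. It reuses the values from the slide; `write_config` and `build_create_cluster_argv` are hypothetical helper names, and the contents of `fleet.json` are not shown in the deck, so the sketch only references that file by path.

```python
import json
import shlex

# Configuration classifications exactly as listed on the slide.
PRESTO_CONFIG = [
    {
        "Classification": "presto-connector-hive",
        "Properties": {
            "hive.metastore": "glue",
            "hive.metastore.glue.datacatalog.enabled": "true",
        },
    },
    {
        "Classification": "spark-hive-site",
        "Properties": {
            "hive.metastore.client.factory.class": (
                "com.amazonaws.glue.catalog.metastore."
                "AWSGlueDataCatalogHiveClientFactory"
            ),
        },
    },
]

APPLICATIONS = ["Hadoop", "Spark", "Hive", "PRESTO", "HUE",
                "ZEPPELIN", "Ganglia", "Pig", "Sqoop"]


def write_config(path: str) -> None:
    """Write the configuration list to `path` as JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(PRESTO_CONFIG, handle, indent=2)


def build_create_cluster_argv(config_path: str = "presto.config.json",
                              fleet_path: str = "fleet.json") -> list:
    """Assemble the `aws emr create-cluster` call as an argument list."""
    argv = [
        "aws", "emr", "create-cluster",
        "--name", "sql-saturday",
        "--profile", "emr-dzyuban",
        "--configurations", f"file://{config_path}",
        "--release-label", "emr-5.20.0",
        "--use-default-roles",
        "--ec2-attributes", "KeyName=Intapp",
        "--applications",
    ]
    argv += [f"Name={app}" for app in APPLICATIONS]
    argv += ["--instance-fleets", f"file://{fleet_path}"]
    return argv


# Print the command for review; nothing is executed here.
print(shlex.join(build_create_cluster_argv()))
```

The argument list can be passed to `subprocess.run` after calling `write_config("presto.config.json")`; printing it first makes it easy to review what would be launched.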
  16. EMR components

    - HDFS (Hadoop Distributed File System): provides high-throughput access to application data, based on commodity hardware
    - YARN (Yet Another Resource Negotiator): a framework for cluster resource management, including job scheduling
    - MapReduce: software framework for parallel processing of large data sets, based on YARN
    - Hive: data warehouse system based on Hadoop
    - Spark: cluster computing framework that utilizes YARN and HDFS
    - ZooKeeper: distributed name registry, synchronization service, and configuration service that is used as a subsystem in Hadoop
  17. EMR URLs

    - YARN ResourceManager: http://ec2-3-86-62-50.compute-1.amazonaws.com:8088/
    - Hadoop HDFS NameNode: http://ec2-3-86-62-50.compute-1.amazonaws.com:50070/
    - Spark HistoryServer: http://ec2-3-86-62-50.compute-1.amazonaws.com:18080/
    - Zeppelin: http://ec2-3-86-62-50.compute-1.amazonaws.com:8890/
    - Hue: http://ec2-3-86-62-50.compute-1.amazonaws.com:8888/
    - Ganglia: http://ec2-3-86-62-50.compute-1.amazonaws.com/ganglia/
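All of these addresses share the master node's public DNS name and differ only in port or path, so the list can be derived for any new cluster. The sketch below is a small illustration: the ports and paths are copied from the slide (shown there for release emr-5.20.0, and they may differ on other releases), and `emr_urls` is a hypothetical helper name.

```python
from typing import Dict, Optional, Tuple

# Service name -> (port, path), as shown on the slide. None means default HTTP port.
EMR_WEB_UIS: Dict[str, Tuple[Optional[int], str]] = {
    "YARN ResourceManager": (8088, "/"),
    "Hadoop HDFS NameNode": (50070, "/"),
    "Spark HistoryServer": (18080, "/"),
    "Zeppelin": (8890, "/"),
    "Hue": (8888, "/"),
    "Ganglia": (None, "/ganglia/"),
}


def emr_urls(master_dns: str) -> Dict[str, str]:
    """Return the web UI address of each service for the given master node."""
    urls = {}
    for name, (port, path) in EMR_WEB_UIS.items():
        host = master_dns if port is None else f"{master_dns}:{port}"
        urls[name] = f"http://{host}{path}"
    return urls


urls = emr_urls("ec2-3-86-62-50.compute-1.amazonaws.com")
for name, url in urls.items():
    print(f"{name}: {url}")
```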
  18. References

    - AWS Athena: https://aws.amazon.com/athena/
    - Presto overview: http://prestodb.github.io/overview.html
    - Funny graphics from Jerry Hargrove: https://www.awsgeek.com/posts/Amazon-Athena/
    - Amazon official EMR page: https://aws.amazon.com/emr/
    - Book: https://www.amazon.com/Hadoop-Dummies-Dirk-deRoos/dp/1118607554
    - Apache Hadoop official page: https://hadoop.apache.org
    - Apache Zeppelin official page: http://zeppelin.apache.org
    - Thanks for the nice photos: https://flightaware.com/