
Getting started with Big Data In Azure HDInsight

The slide deck from the Singapore DataCamp 2019 event held on 2nd March. The session and demo showcased the HDInsight offering and the capabilities of OSS tools such as Sqoop and Hive. Spark was used to query data from Hive and from CSV files using a Jupyter notebook. Finally, Power BI was used to build visualizations on data sourced from the HDInsight cluster using DirectQuery and Spark.

Nilesh Gule

March 02, 2019

Transcript

  1. Getting Started with Big Data in Azure HDInsight
     Dharmendra Keshari, PFE at Microsoft, SG SQL PASS Chapter
     Nilesh Gule, Big Data Architect at Prudential
  2. Agenda
     01 Needs for Distributed Computing • "Big Data!" - How big is Big Data? What are the Big Data system requirements?
     02 Hadoop Intro and Different Technologies • How can Hadoop effectively manage large data? What are the most popular technologies to use with it?
     03 Azure HDInsight • Understand the benefits of Azure HDInsight and the Azure Hadoop technology stack
     04 Demos • HDInsight with Hive, Spark and other cool features • Reference architecture for batch and streaming
  3. Needs for Distributed Computing
     ❖ Google has indexed 60 trillion web pages, serves over 1 billion users in a single month for its search property alone, and handles 2.3 million searches per second.
     Currently stored: 13 exabytes | Processes: 100 PB of data/day | 1B users/month
  4. Needs for Distributed Computing
     ❖ The NSA is set to touch more of the internet than Google does; 1.6% of internet traffic per day passes through the NSA in some way.
     ❖ All your web searches, the websites you visit, phone transactions, health information, financial records, legal information, Skype calls; all of these are potentially monitored by the NSA.
     Currently stored: 5 exabytes | Processes: 30 PB of data/day
  5. Needs for Distributed Computing
     ✓ There are 2.7 billion likes per day; when you click Like, you are one among 2.7 billion in a day.
     ✓ Finally, 300 million photographs are uploaded to Facebook in just a day.
     Currently stored: 300 petabytes | Processes: 600 TB of data/day | 1B users/month log into FB
  6. Needs for Distributed Computing: Big Data System Requirements
     Distributed computing frameworks like Hadoop were developed with the following requirements in mind: Storage, Compute, Scale.
  7. Demo – Provision HDInsight cluster
     • Azure Portal – Apache Spark cluster
     • PowerShell – Hive cluster
  8. Demo - Hive: batch processing using an HDInsight Spark cluster
     • Data ingestion using Sqoop: SQL Server to HDFS
     • Data ingestion using CSV: load CSV data into Hive tables (a PySpark sketch of this step follows below)
     • Data access: Hive view, Beeline, Zeppelin / Jupyter notebook, Spark, Power BI
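     To make the CSV-to-Hive step concrete, the following is a minimal PySpark sketch of the kind that would be run from the cluster's Jupyter notebook. The storage path and table name (wasb:///example/data/sample.csv, default.sample_csv) are illustrative assumptions, not the exact ones used in the demo.

     # Minimal PySpark sketch: load a CSV file into a Hive table on HDInsight.
     # The file path and table name below are placeholders, not the demo's actual ones.
     from pyspark.sql import SparkSession

     spark = (SparkSession.builder
              .appName("csv-to-hive")
              .enableHiveSupport()   # enables saveAsTable/sql against the Hive metastore
              .getOrCreate())

     # Read the CSV file from the cluster's default storage (WASB), using the header row.
     df = (spark.read
           .option("header", "true")
           .option("inferSchema", "true")
           .csv("wasb:///example/data/sample.csv"))

     # Persist it as a Hive-managed table so it is also queryable from Hive view or Beeline.
     df.write.mode("overwrite").saveAsTable("default.sample_csv")

     # Query it back through Spark SQL, as done from the Jupyter notebook in the demo.
     spark.sql("SELECT COUNT(*) AS row_count FROM default.sample_csv").show()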
  9. Hadoop data file compression
     Compression | Extension | Codec | Splittable | Efficiency | Speed
     Deflate | .deflate | org.apache.hadoop.io.compress.DefaultCodec | No | Medium | Medium
     Gzip | .gz | org.apache.hadoop.io.compress.GzipCodec | No | Medium | Medium
     Bzip2 | .bz2 | org.apache.hadoop.io.compress.BZip2Codec | Yes | High | Low
     LZO | .lzo | org.apache.hadoop.io.compress.lzo.LzopCodec | Yes | Low | High
     LZ4 | .lz4 | org.apache.hadoop.io.compress.LZ4Codec | No | Low | High
     Snappy | .snappy | org.apache.hadoop.io.compress.SnappyCodec | No | Low | High
     Sqoop import from SQL Server into Hive with Gzip compression:
     sqoop import \
       --connect 'jdbc:sqlserver:<<connectionstring>>' \
       --table Customer \
       --hive-import \
       --create-hive-table \
       --target-dir 'wasb:///user/adventureworksdb/customer' \
       --compression-codec org.apache.hadoop.io.compress.GzipCodec \
       -m 1 \
       -- --schema SalesLT
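     As a rough companion to the codec table, here is a minimal PySpark sketch showing how a compression codec can be selected when writing data out from Spark on the same cluster; the source table and output paths are assumptions made for illustration, not part of the original demo.

     # Minimal sketch: choosing a compression codec when writing output from Spark.
     # The source table and output paths are placeholders, not from the original demo.
     from pyspark.sql import SparkSession

     spark = SparkSession.builder.appName("compression-demo").enableHiveSupport().getOrCreate()

     df = spark.sql("SELECT * FROM default.sample_csv")  # hypothetical source table

     # CSV output compressed with gzip (not splittable, as noted in the table above).
     df.write.mode("overwrite").option("compression", "gzip").csv("wasb:///example/out/csv-gzip")

     # Parquet output compressed with snappy, a common choice for columnar formats.
     df.write.mode("overwrite").option("compression", "snappy").parquet("wasb:///example/out/parquet-snappy")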
  10. Hive tables
     Managed/internal tables:
     • Metadata and data are managed by Hive
     • By default, data is stored inside the /user/hive/warehouse directory
     • Data security is managed internally by Hive
     • Use internal tables when Hive needs to completely manage the lifecycle of the data
     External tables:
     • Only the metadata is managed by Hive
     • Hive does not move data into its internal warehouse directory
     • Data is accessible to anyone who has access to HDFS
     • Use external tables when the data is also used outside of Hive
     (A short DDL sketch contrasting the two follows below.)
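     The following minimal PySpark sketch contrasts the two table types via Spark SQL on the cluster; the table names, columns, and external location are illustrative assumptions.

     # Minimal sketch: a managed (internal) vs an external Hive table.
     # Table names, columns and the external location are placeholders, not from the demo.
     from pyspark.sql import SparkSession

     spark = SparkSession.builder.appName("hive-tables").enableHiveSupport().getOrCreate()

     # Managed table: Hive owns metadata and data (under /user/hive/warehouse by default);
     # DROP TABLE removes the data as well.
     spark.sql("""
         CREATE TABLE IF NOT EXISTS customer_managed (id INT, name STRING)
         STORED AS ORC
     """)

     # External table: Hive owns only the metadata; the data stays at the given location
     # and survives DROP TABLE, so it can also be used outside of Hive.
     spark.sql("""
         CREATE EXTERNAL TABLE IF NOT EXISTS customer_external (id INT, name STRING)
         ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
         STORED AS TEXTFILE
         LOCATION 'wasb:///example/data/customer'
     """)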
  11. Hive table storage types
     • TEXTFILE: delimited text; splittable. Syntax: STORED AS TEXTFILE
     • SEQUENCEFILE: binary key-value pairs, optimized for MapReduce; splittable. Syntax: STORED AS SEQUENCEFILE
     • PARQUET: columnar storage format (compressed, binary); non-splittable. Syntax: STORED AS PARQUET
     • RCFILE: columnar storage format (compressed, binary); non-splittable. Syntax: ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'
     • ORC: columnar storage (highly compressed, binary), optimized for distinct selections and aggregations; vectorization processes batches of up to 1024 rows, each batch a column vector; non-splittable. Syntax: STORED AS ORC
     • AVRO: serialization system with an evolvable, schema-driven binary format; cross-platform interoperability; non-splittable. Syntax: STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' TBLPROPERTIES ('avro.schema.url'='<schemaFileLocation>')
     (An ORC example follows below.)
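     To tie the storage clauses to the demo flow, here is a minimal PySpark sketch that creates an ORC-backed table and fills it from another table; both table names are hypothetical.

     # Minimal sketch: create an ORC-backed Hive table and populate it from another table.
     # Both table names (customer_text, customer_orc) are hypothetical placeholders.
     from pyspark.sql import SparkSession

     spark = SparkSession.builder.appName("storage-formats").enableHiveSupport().getOrCreate()

     # ORC stores data in a highly compressed, columnar layout that favours aggregations.
     spark.sql("""
         CREATE TABLE IF NOT EXISTS customer_orc (id INT, name STRING)
         STORED AS ORC
     """)

     # Copy rows from a hypothetical text-format source table into the ORC table.
     spark.sql("INSERT OVERWRITE TABLE customer_orc SELECT id, name FROM customer_text")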
  12. Thank you for attending the session! Connect here to send your questions:
     https://sg.linkedin.com/in/dharmendra-keshari-a7043398
     [email protected] | [email protected]
     http://www.dharmendrakeshari.com/
     https://www.facebook.com/dharmendra.keshari.9
     [email protected]
     https://www.handsonarchitect.com/
     https://www.facebook.com/nilesh.gule
     https://www.linkedin.com/in/nileshgule/