
Getting started with Big Data In Azure HDInsight

The slide deck from the Singapore DataCamp 2019 event held on 2nd March. The session and demo showcased the HDInsight offering and the capabilities of OSS tools such as Sqoop and Hive. Spark was used to query data from Hive and from CSV files using a Jupyter notebook. Finally, Power BI was used to build visualizations on data sourced from the HDInsight cluster using DirectQuery and Spark.

Nilesh Gule

March 02, 2019

Transcript

  1. Getting Started with Big Data in Azure HDInsight
     Dharmendra Keshari, PFE at Microsoft, SG SQL PASS Chapter
     Nilesh Gule, Big Data Architect at Prudential
  2. Agenda
     01 Needs for Distributed Computing • "Big Data!" - How big is Big Data? What are the Big Data system requirements?
     02 Hadoop Intro and Different Technologies • How can Hadoop effectively manage large data? What are the most popular technologies to use with it?
     03 Azure HDInsight • Understand the benefits of Azure HDInsight and the Azure Hadoop technology stack
     04 Demos • HDInsight with Hive, Spark and other cool features • Reference architecture for batch and streaming
  3. Needs for Distributed Computing
     ❖ Google has indexed 60 trillion web pages, serves over 1 billion users in a single month for its search property alone, and handles 2.3 million searches per second.
     Currently stored: 13 exabytes | Processes: 100 PB of data/day | 1B users/month
  4. Needs for Distributed Computing
     ❖ The NSA is set to touch more of the internet than Google does; 1.6% of internet traffic per day passes through the NSA in some way.
     ❖ All your web searches, the websites you visit, phone transactions, health information, financial records, legal information, Skype calls; all of these are potentially monitored by the NSA.
     Currently stored: 5 exabytes | Processes: 30 PB of data/day
  5. Needs for Distributed Computing
     ✓ There are 2.7 billion likes per day; when you click Like, you are one among 2.7 billion in a day.
     ✓ Finally, 300 million photographs are uploaded to Facebook in just a day.
     Currently stored: 300 petabytes | Processes: 600 TB of data/day | 1B users/month log into FB
  6. Needs for Distributed Computing: Big Data System Requirements
     Distributed computing frameworks like Hadoop were developed with the following requirements in mind: Storage, Compute, Scale.
  7. Demo – Provision HDInsight cluster
     • Azure Portal – Apache Spark cluster
     • PowerShell – Hive cluster
  8. Demo - Hive: batch processing using an HDInsight Spark cluster
     • Data ingestion using Sqoop: SQL Server to HDFS
     • Data ingestion using CSV: load CSV data into Hive tables (a PySpark sketch of this step follows below)
     • Data access: Hive view, Beeline, Zeppelin / Jupyter notebook, Spark, Power BI
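     To make the CSV-to-Hive step concrete, the following is a minimal PySpark sketch of the kind that would be run from the cluster's Jupyter notebook. The storage path and table name (wasb:///example/data/sample.csv, default.sample_csv) are illustrative assumptions, not the exact ones used in the demo.

     # Minimal PySpark sketch: load a CSV file into a Hive table on HDInsight.
     # The file path and table name below are placeholders, not the demo's actual ones.
     from pyspark.sql import SparkSession

     spark = (SparkSession.builder
              .appName("csv-to-hive")
              .enableHiveSupport()   # enables saveAsTable/sql against the Hive metastore
              .getOrCreate())

     # Read the CSV file from the cluster's default storage (WASB), using the header row.
     df = (spark.read
           .option("header", "true")
           .option("inferSchema", "true")
           .csv("wasb:///example/data/sample.csv"))

     # Persist it as a Hive-managed table so it is also queryable from Hive view or Beeline.
     df.write.mode("overwrite").saveAsTable("default.sample_csv")

     # Query it back through Spark SQL, as done from the Jupyter notebook in the demo.
     spark.sql("SELECT COUNT(*) AS row_count FROM default.sample_csv").show()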
  9. Hadoop data file compression
     Compression | Extension | Codec | Splittable | Efficiency | Speed
     Deflate | .deflate | org.apache.hadoop.io.compress.DefaultCodec | No | Medium | Medium
     Gzip | .gz | org.apache.hadoop.io.compress.GzipCodec | No | Medium | Medium
     Bzip2 | .bz2 | org.apache.hadoop.io.compress.BZip2Codec | Yes | High | Low
     LZO | .lzo | org.apache.hadoop.io.compress.lzo.LzopCodec | Yes | Low | High
     LZ4 | .lz4 | org.apache.hadoop.io.compress.LZ4Codec | No | Low | High
     Snappy | .snappy | org.apache.hadoop.io.compress.SnappyCodec | No | Low | High
     Sqoop import from SQL Server into Hive with Gzip compression:
     sqoop import \
       --connect 'jdbc:sqlserver:<<connectionstring>>' \
       --table Customer \
       --hive-import \
       --create-hive-table \
       --target-dir 'wasb:///user/adventureworksdb/customer' \
       --compression-codec org.apache.hadoop.io.compress.GzipCodec \
       -m 1 \
       -- --schema SalesLT
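     As a rough companion to the codec table, here is a minimal PySpark sketch showing how a compression codec can be selected when writing data out from Spark on the same cluster; the source table and output paths are assumptions made for illustration, not part of the original demo.

     # Minimal sketch: choosing a compression codec when writing output from Spark.
     # The source table and output paths are placeholders, not from the original demo.
     from pyspark.sql import SparkSession

     spark = SparkSession.builder.appName("compression-demo").enableHiveSupport().getOrCreate()

     df = spark.sql("SELECT * FROM default.sample_csv")  # hypothetical source table

     # CSV output compressed with gzip (not splittable, as noted in the table above).
     df.write.mode("overwrite").option("compression", "gzip").csv("wasb:///example/out/csv-gzip")

     # Parquet output compressed with snappy, a common choice for columnar formats.
     df.write.mode("overwrite").option("compression", "snappy").parquet("wasb:///example/out/parquet-snappy")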
  10. Hive tables
     Managed/internal tables:
     • Metadata and data are managed by Hive
     • By default, data is stored inside the /user/hive/warehouse directory
     • Data security is managed internally by Hive
     • Use internal tables when Hive needs to completely manage the lifecycle of the data
     External tables:
     • Only the metadata is managed by Hive
     • Hive does not move data into its internal warehouse directory
     • Data is accessible to anyone who has access to HDFS
     • Use external tables when the data is also used outside of Hive
     (A short DDL sketch contrasting the two follows below.)
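     The following minimal PySpark sketch contrasts the two table types via Spark SQL on the cluster; the table names, columns, and external location are illustrative assumptions.

     # Minimal sketch: a managed (internal) vs an external Hive table.
     # Table names, columns and the external location are placeholders, not from the demo.
     from pyspark.sql import SparkSession

     spark = SparkSession.builder.appName("hive-tables").enableHiveSupport().getOrCreate()

     # Managed table: Hive owns metadata and data (under /user/hive/warehouse by default);
     # DROP TABLE removes the data as well.
     spark.sql("""
         CREATE TABLE IF NOT EXISTS customer_managed (id INT, name STRING)
         STORED AS ORC
     """)

     # External table: Hive owns only the metadata; the data stays at the given location
     # and survives DROP TABLE, so it can also be used outside of Hive.
     spark.sql("""
         CREATE EXTERNAL TABLE IF NOT EXISTS customer_external (id INT, name STRING)
         ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
         STORED AS TEXTFILE
         LOCATION 'wasb:///example/data/customer'
     """)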
  11. Hive table storage types
     • TEXTFILE: delimited text; splittable. Syntax: STORED AS TEXTFILE
     • SEQUENCEFILE: binary key-value pairs, optimized for MapReduce; splittable. Syntax: STORED AS SEQUENCEFILE
     • PARQUET: columnar storage format (compressed, binary); non-splittable. Syntax: STORED AS PARQUET
     • RCFILE: columnar storage format (compressed, binary); non-splittable. Syntax: ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'
     • ORC: columnar storage (highly compressed, binary), optimized for distinct selections and aggregations; vectorization processes batches of up to 1024 rows, each batch a column vector; non-splittable. Syntax: STORED AS ORC
     • AVRO: serialization system with an evolvable, schema-driven binary format; cross-platform interoperability; non-splittable. Syntax: STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' TBLPROPERTIES ('avro.schema.url'='<schemaFileLocation>')
     (An ORC example follows below.)
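     To tie the storage clauses to the demo flow, here is a minimal PySpark sketch that creates an ORC-backed table and fills it from another table; both table names are hypothetical.

     # Minimal sketch: create an ORC-backed Hive table and populate it from another table.
     # Both table names (customer_text, customer_orc) are hypothetical placeholders.
     from pyspark.sql import SparkSession

     spark = SparkSession.builder.appName("storage-formats").enableHiveSupport().getOrCreate()

     # ORC stores data in a highly compressed, columnar layout that favours aggregations.
     spark.sql("""
         CREATE TABLE IF NOT EXISTS customer_orc (id INT, name STRING)
         STORED AS ORC
     """)

     # Copy rows from a hypothetical text-format source table into the ORC table.
     spark.sql("INSERT OVERWRITE TABLE customer_orc SELECT id, name FROM customer_text")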
  12. Thank you for attending the session! Connect here to send your questions:
     https://sg.linkedin.com/in/dharmendra-keshari-a7043398
     [email protected] | [email protected]
     http://www.dharmendrakeshari.com/
     https://www.facebook.com/dharmendra.keshari.9
     [email protected]
     https://www.handsonarchitect.com/
     https://www.facebook.com/nilesh.gule
     https://www.linkedin.com/in/nileshgule/