Testing 3Vs (Volume, Variety and Velocity) of Big Data

Best of the Best Award - 1st Runner Up presentation by Infosys @STC 2012.
Authors - Mahesh Gudipati, Jaya Bhagavathi Bhallamudi & Shanthi Rao

Presentation Abstract

Big Data is one of the most discussed topics in recent times, and Big Data implementation sits at the top of the CIO agenda. Taming Big Data is one of the opportunity areas management in most organizations looks to in order to unearth hidden, valuable information. New technologies such as Hadoop, HDFS and NoSQL databases are evolving to mine and store huge volumes of data, and the Big Data market is expected to grow to $50 billion by 2017.

Big Data is a general term used to describe voluminous amounts of unstructured, structured and semi-structured data. Volume, Variety and Velocity (the 3Vs) are the three dimensions of Big Data. These three characteristics require advanced technologies such as distributed computing, in-memory databases and high-volume storage systems to process the data and make it meaningful.

Testing Big Data involves validating the data extracted from source systems, validating Map Reduce jobs, and validating the load of data into downstream systems. At each of these validation stages, the three Vs of Big Data need to be verified. Volume and Variety require a robust functional testing approach, while the Velocity dimension requires non-functional testing. Validation demands skills in working with distributed file systems such as HDFS and NoSQL databases, and knowledge of Hadoop Map Reduce processing. In this paper we discuss various approaches used for validating the Volume, Variety and Velocity dimensions of Big Data.

About the Authors

Mahesh has more than 9 years of testing experience and has worked on multiple testing projects across different domains. He has strong experience in data warehouse/BI testing, demand-forecasting testing, Big Data testing and product testing. He has implemented automation techniques in multiple ETL/DW testing projects and holds a patent for an end-to-end ETL/DW testing solution. He is a PMP-certified project manager and has managed multiple data warehouse testing projects.

Jaya has over 14 years of experience in the IT industry. She is a Certified Function Point Specialist (CFPS, IFPUG). She specializes in test automation and previously led the Test Automation Services at Infosys. She currently leads research and development of Infosys Validation Service offerings and solutions, focusing on specialized testing disciplines such as data validation, security testing and Agile testing. She contributes to both internal and external thought-leadership papers and to the Infosys blog.

Shanthi has been in the IT industry for 15 years. She has wide-ranging experience across development, maintenance and testing projects. For the past 6 years she has focused exclusively on testing. She is a core member of the specialized testing practice, working with financial services and insurance customers, and is part of the incubation team for new service lines such as Test Data Management and Big Data.

Transcript

  1. What is Big Data
     • Big Data refers to data sets whose size is beyond the ability of commonly used software tools to capture, manage and process the data within a tolerable elapsed time.
     • Big Data is a generic term used to describe the voluminous amount of unstructured, structured and semi-structured data.
  2. Big Data Characteristics
     3 key characteristics of Big Data:
     • Volume: High volume of data created both inside and outside corporations via the web, mobile devices, IT infrastructure and other sources
     • Variety: Data is in structured, semi-structured and unstructured format
     • Velocity: Data is generated at a high speed; high volumes of data need to be processed within seconds
  3. Big Data Processing using Hadoop framework
     ❶ Load source data files into HDFS
     ❷ Perform Map/Reduce operations
     ❸ Extract the output results from HDFS
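     As a rough illustration of these three steps, the sketch below drives them from a Python script using the standard Hadoop shell commands. The local paths, HDFS paths, jar name and main class are illustrative assumptions, not part of the original deck.

        import subprocess

        def run(cmd):
            # Echo and execute a shell command, failing fast on a non-zero exit code.
            print("$ " + " ".join(cmd))
            subprocess.check_call(cmd)

        # (1) Load source data files into HDFS (paths are assumptions).
        run(["hadoop", "fs", "-put", "/data/source/orders.csv", "/input/orders.csv"])

        # (2) Perform the Map/Reduce operation; jar and class names are placeholders.
        run(["hadoop", "jar", "analytics-job.jar", "com.example.OrderAggregator",
             "/input", "/output"])

        # (3) Extract the output results from HDFS for downstream validation.
        run(["hadoop", "fs", "-getmerge", "/output", "/data/results/output.txt"])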
  4. Hadoop Map/Reduce processing – Overview
     Map/Reduce is a distributed computing and parallel processing framework with the advantage of pushing the computation to the data.
     • Distributed computing
     • Parallel computing
     • Based on Map & Reduce tasks
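     To make the Map & Reduce task split concrete, here is a minimal word-count pair written for Hadoop Streaming, which lets plain Python scripts act as the map and reduce tasks; the word-count problem itself is an assumption chosen for illustration.

        # mapper.py -- runs in parallel on the data nodes; emits one
        # tab-separated (word, 1) pair per token read from stdin.
        import sys

        for line in sys.stdin:
            for word in line.split():
                print("%s\t%d" % (word, 1))

        # reducer.py -- Hadoop sorts map output by key, so all counts for a
        # word arrive contiguously and can be summed with a running total.
        import sys

        current, total = None, 0
        for line in sys.stdin:
            word, count = line.rstrip("\n").split("\t")
            if word != current:
                if current is not None:
                    print("%s\t%d" % (current, total))
                current, total = word, 0
            total += int(count)
        if current is not None:
            print("%s\t%d" % (current, total))

     A streaming job would then be submitted with something like hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py -input /input -output /output.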
  5. Hadoop Eco-System
     • HDFS – Hadoop Distributed File System
     • HBase – NoSQL data store (non-relational distributed database)
     • Map/Reduce – Distributed computing framework
     • Sqoop – SQL-to-Hadoop database import and export tool
     • Hive – Hadoop data warehouse
     • Pig – Platform for creating Map/Reduce programs for analyzing large data sets
  6. Testing Opportunities for ‘Independent Testing’
     • Early Validation of the Requirements
     • Early Validation of the Design
     • Preparation of Big Test Data
     • Configuration Testing
     • Incremental Load Testing
     • Functional Testing
  7. Early Validation of the Requirements
     • Enterprise Data Warehouses integrated with Big Data
     • Business Intelligence Systems integrated with Big Data
     • Are the requirements mapped to the right data sources?
     • Are there any data sources that were not considered? Why?
  8. Early Validation of the Design
     • Is the ‘Unstructured Data’ stored in the right place for analytics?
     • Is the ‘Structured Data’ stored in the right place for analytics?
     • Is the data duplicated in multiple storage systems? Why?
     • Are the data synchronization needs adequately identified and addressed?
  9. Preparation of Big Test Data
     • Replicate data, intelligently, with tools
     • How big should the data files be to ensure near-real volumes of data?
     • Create data with incorrect schema
     • Create erroneous data
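     As one way to act on the "incorrect schema" and "erroneous data" points above, the sketch below generates a large CSV test file in which a small, controlled fraction of rows is deliberately malformed. The field layout, error ratios and row count are assumptions.

        import csv
        import random

        def make_row(i):
            roll = random.random()
            if roll < 0.01:
                # Incorrect schema: the amount column is missing entirely.
                return [str(i), "2012-10-%02d" % random.randint(1, 28)]
            if roll < 0.03:
                # Erroneous data: unparseable date and amount values.
                return [str(i), "not-a-date", "NaN"]
            # Well-formed row: id, date, amount.
            return [str(i), "2012-10-%02d" % random.randint(1, 28),
                    "%.2f" % random.uniform(1, 10000)]

        with open("test_orders.csv", "w", newline="") as f:
            writer = csv.writer(f)
            for i in range(1000000):  # scale the count toward near-real volumes
                writer.writerow(make_row(i))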
  10. Cluster Setup Testing
      • Is the system behaving as expected when a node is removed from the cluster?
      • Is the system behaving as expected when a node is added to the cluster?
  11. Volume Testing: Challenges
      • Terabytes and petabytes of data
      • Data stored in HDFS in file formats
      • Data files are split and stored across multiple data nodes
      • 100% coverage cannot be achieved
      • Data consolidation issues
  12. Volume Testing: Approach
      • Use a data sampling strategy; sampling to be done based on data requirements
      • Convert raw data into the expected result format to compare with the actual output data
      • Prepare ‘compare scripts’ to compare the data present in HDFS file storage
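      A minimal sketch of such a ‘compare script’, combining the sampling and comparison steps above: it deterministically samples about 1% of keys by hashing, then diffs the expected results (raw data already converted to the result format) against the actual HDFS output. The file names and the tab-separated key/value layout are assumptions.

         import hashlib

         def sampled(key, rate=0.01):
             # Hash-based sampling: deterministic, so the same keys are
             # picked from both the expected and the actual data sets.
             h = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
             return (h % 10000) < rate * 10000

         def load(path):
             rows = {}
             with open(path) as f:
                 for line in f:
                     key, value = line.rstrip("\n").split("\t", 1)
                     if sampled(key):
                         rows[key] = value
             return rows

         expected = load("expected_results.tsv")  # raw data in expected result format
         actual = load("hdfs_output.tsv")         # e.g. pulled with hadoop fs -getmerge

         for key in sorted(expected.keys() | actual.keys()):
             if expected.get(key) != actual.get(key):
                 print("MISMATCH %s: expected=%r actual=%r"
                       % (key, expected.get(key), actual.get(key)))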
  13. Variety Testing: Challenges
      • Manually validating semi-structured and unstructured data
      • Unstructured data validation issues because there is no defined format
      • A lot of scripting is required to process semi-structured and unstructured data
      • Sampling unstructured data is a challenge
  14. Variety Testing: Approach
      • Structured data: compare data using compare tools and identify the discrepancies
      • Semi-structured data:
        • Convert semi-structured data into structured format
        • Format the converted raw data into expected results
        • Compare expected result data with the actual results
      • Unstructured data:
        • Parse unstructured text data into data blocks and aggregate the computed data blocks
        • Validate the aggregated data against the data output
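      As an illustration of the semi-structured branch, the sketch below flattens line-delimited JSON records into the tab-separated structured layout that a compare script can consume. The input file name and field names are assumptions.

         import json

         FIELDS = ("id", "type", "amount")  # assumed schema of the target layout

         with open("events.json") as src, open("events_structured.tsv", "w") as dst:
             for line in src:  # one JSON document per line
                 doc = json.loads(line)
                 # Missing fields become empty strings instead of aborting
                 # the run, so schema variety surfaces in the compare step.
                 dst.write("\t".join(str(doc.get(f, "")) for f in FIELDS) + "\n")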
  15. Velocity Testing: Challenges
      • Setting up a production-like environment for performance testing
      • Simulating production job runs
      • Test data setup for high-velocity, high-volume streaming
      • Simulating node failures
  16. Velocity Testing: Approach
      Validation Points
      • Performance of Pig/Hive jobs: capture the job completion time and validate it against the benchmark
      • Throughput of the jobs
      • Impact of background processes on the performance of the system
      • Memory and CPU details of the task tracker
      • Availability of the name node and data nodes
      Metrics Captured
      • Job completion time
      • Throughput
      • Memory utilization
      • No. of spills and spilled records
      • Job failure rate
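      One simple way to capture the "job completion time" and "throughput" metrics above is to wrap the job invocation in a timer, as sketched below for a Hive job; the script name, record count and benchmark value are assumptions. Memory utilization and spill counts would instead come from the job's counters and the task tracker, which the deck lists as separate validation points.

         import subprocess
         import time

         RECORDS_PROCESSED = 50000000   # assumed size of the input data set
         BENCHMARK_SECONDS = 600        # assumed agreed completion-time benchmark

         start = time.time()
         subprocess.check_call(["hive", "-f", "daily_aggregation.hql"])
         elapsed = time.time() - start

         print("Job completion time: %.1f s" % elapsed)
         print("Throughput: %.0f records/s" % (RECORDS_PROCESSED / elapsed))
         if elapsed > BENCHMARK_SECONDS:
             print("FAIL: exceeded the %d s benchmark" % BENCHMARK_SECONDS)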
  17. THANK YOU
      www.infosys.com
      The contents of this document are proprietary and confidential to Infosys Limited and may not be disclosed in whole or in part at any time, to any third party, without the prior written consent of Infosys Limited. © 2012 Infosys Limited. All rights reserved. Copyright in the whole and any part of this document belongs to Infosys Limited. This work may not be used, sold, transferred, adapted, abridged, copied or reproduced in whole or in part, in any manner or form, or in any media, without the prior written consent of Infosys Limited.