Testing 3Vs (Volume Variety and Velocity) of Big Data

Best of the Best Award - 1st Runner Up presentation by Infosys @STC 2012.
Authors - Mahesh Gudipati, Jaya Bhagavathi Bhallamudi & Shanthi Rao

Presentation Abstract

Big Data is one of the most discussed topics in recent times, and implementing it is at the top of CIOs' lists. Taming big data is one of the opportunity areas management in most organizations looks to for unearthing hidden, valuable information. New technologies such as Hadoop, HDFS, and NoSQL databases are evolving to mine and store huge volumes of data, and the Big Data market is expected to grow to $50 billion by 2017.

Big Data is a general term used to describe voluminous amounts of unstructured, structured, and semi-structured data. Volume, Variety, and Velocity (the 3Vs) are the three dimensions of Big Data. These three characteristics require advanced technologies such as distributed computing, in-memory databases, and high-volume storage systems to process the data and produce meaningful information.

Testing Big Data involves validating the data extracted from source systems, validating the MapReduce jobs, and validating the load of data into external systems. At each of these validation stages, the three Vs of Big Data need to be validated. Volume and Variety require a robust functional testing approach, while the Velocity dimension requires non-functional testing. Validation calls for skills in working with distributed file systems like HDFS and NoSQL databases, and knowledge of Hadoop MapReduce processing. In this paper we discuss various approaches for validating the Volume, Variety, and Velocity dimensions of Big Data.

About the Authors

Mahesh has more than 9 years of testing experience and has worked on multiple testing projects across different domains. He has strong experience in data warehouse/BI testing, demand forecasting testing, Big Data testing, and product testing. He has implemented automation techniques in multiple ETL/DW testing projects and holds a patent for an end-to-end ETL/DW testing solution. He is a PMP-certified project manager and has managed multiple data warehouse testing projects.

Jaya has over 14 years of experience in the IT industry. She is a Certified Function Point Specialist (CFPS, from IFPUG). She specializes in test automation and previously led the Test Automation Services at Infosys. She currently leads research and development of Infosys Validation Service offerings and solutions, focusing on specialized testing disciplines such as data validation, security testing, and agile testing. She contributes to internal and external thought-leadership papers and to the Infosys blog.

Shanthi has been in the IT industry for 15 years. She has widespread experience across all kinds of projects: development, maintenance, and testing. For the past 6 years she has focused exclusively on testing. She is a core member of the specialized testing practice, working with financial services and insurance customers, and is part of the incubation team for new service lines such as Test Data Management and Big Data.

Transcript

  1. What is Big Data
     • Big Data refers to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time.
     • Big Data is a generic term used to describe the voluminous amount of unstructured, structured, and semi-structured data.
  2. Big Data Characteristics
     The 3 key characteristics of Big Data:
     • Volume: High volume of data created both inside corporations and outside them via the web, mobile devices, IT infrastructure, and other sources
     • Variety: Data is in structured, semi-structured, and unstructured formats
     • Velocity: Data is generated at high speed; high volumes of data need to be processed within seconds
  3. Big Data Processing using the Hadoop framework
     ❶ Load source data files into HDFS
     ❷ Perform Map/Reduce operations
     ❸ Extract the output results from HDFS
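
A minimal sketch of these three steps driven from Python, assuming a packaged word-count job named wordcount.jar and illustrative HDFS paths (all file and class names here are hypothetical):

```python
import subprocess

# ❶ Load source data files into HDFS (paths are illustrative)
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/data/input"], check=True)
subprocess.run(["hdfs", "dfs", "-put", "local_data", "/data/input"], check=True)

# ❷ Perform the Map/Reduce operations (wordcount.jar is a hypothetical job)
subprocess.run(
    ["hadoop", "jar", "wordcount.jar", "WordCount", "/data/input", "/data/output"],
    check=True,
)

# ❸ Extract the output results from HDFS for downstream validation
subprocess.run(["hdfs", "dfs", "-get", "/data/output", "local_results"], check=True)
```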
  4. Hadoop Map/Reduce processing – Overview
     Map/Reduce is a distributed computing and parallel processing framework with the advantage of pushing the computation to the data.
     • Distributed computing
     • Parallel computing
     • Based on Map and Reduce tasks
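
A minimal word-count sketch of the Map and Reduce tasks, written for Hadoop Streaming (the file names mapper.py and reducer.py are assumptions; any streaming-compatible scripts would do):

```python
# mapper.py -- the Map task: emit a (word, 1) pair per token.
# Hadoop runs copies of this on the nodes holding each input split,
# which is what "pushing the computation to the data" means in practice.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))
```

```python
# reducer.py -- the Reduce task: sum the counts for each word.
# Hadoop sorts the mapper output by key before it reaches this script.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print("%s\t%d" % (current_word, count))
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print("%s\t%d" % (current_word, count))
```

Such scripts would typically be submitted with the Hadoop Streaming jar, e.g. `hadoop jar hadoop-streaming.jar -input ... -output ... -mapper mapper.py -reducer reducer.py` (the jar path varies by distribution).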
  5. Hadoop Eco-System
     • HDFS – Hadoop Distributed File System
     • HBase – NoSQL data store (non-relational distributed database)
     • Map/Reduce – Distributed computing framework
     • Sqoop – SQL-to-Hadoop database import and export tool
     • Hive – Hadoop data warehouse
     • Pig – Platform for creating Map/Reduce programs for analyzing large data sets
  6. Testing Opportunities for 'Independent Testing'
     • Early Validation of the Requirements
     • Early Validation of the Design
     • Preparation of Big Test Data
     • Configuration Testing
     • Incremental Load Testing
     • Functional Testing
  7. Early Validation of the Requirements
     • Enterprise Data Warehouses integrated with Big Data
     • Business Intelligence systems integrated with Big Data
     • Are the requirements mapped to the right data sources?
     • Are there any data sources that were not considered? Why?
  8. Early Validation of the Design
     • Is the unstructured data stored in the right place for analytics?
     • Is the structured data stored in the right place for analytics?
     • Is the data duplicated in multiple storage systems? Why?
     • Are the data synchronization needs adequately identified and addressed?
  9. Preparation of Big Test Data
     • Replicate data, intelligently, with tools
     • How big should the data files be, to ensure near-real volumes of data?
     • Create data with incorrect schema
     • Create erroneous data (a generation sketch follows below)
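
A minimal sketch of generating such test data, assuming a simple three-column target schema (the schema, error rates, row count, and file name are all illustrative):

```python
import csv
import random
import string

FIELDS = ["id", "name", "amount"]  # assumed target schema

def random_name(n=8):
    return "".join(random.choice(string.ascii_lowercase) for _ in range(n))

with open("big_test_data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(FIELDS)
    for i in range(1000000):  # scale the row count toward near-real volumes
        roll = random.random()
        if roll < 0.01:
            writer.writerow([i, random_name(), "not-a-number"])  # erroneous data
        elif roll < 0.02:
            writer.writerow([i, random_name()])  # incorrect schema: missing column
        else:
            writer.writerow([i, random_name(), round(random.uniform(1, 9999), 2)])
```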
  10. Cluster Setup Testing
      • Is the system behaving as expected when a cluster is removed?
      • Is the system behaving as expected when a cluster is added?
  11. Volume Testing: Challenges
      • Terabytes and petabytes of data
      • Data is stored in HDFS in file formats
      • Data files are split and stored across multiple data nodes
      • 100% coverage cannot be achieved
      • Data consolidation issues
  12. Volume Testing: Approach
      • Use a data sampling strategy; sampling to be done based on the data requirements
      • Convert raw data into the expected result format to compare with the actual output data
      • Prepare 'compare scripts' to compare against the data present in HDFS file storage (a sketch follows below)
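
A minimal sketch of such a sampling-based compare script, assuming the HDFS output has already been fetched locally (the file names, sample rate, and to_expected transformation are all assumptions):

```python
import random

SAMPLE_RATE = 0.001  # sample, since 100% coverage is not feasible at this volume

def to_expected(raw_line):
    # Assumed transformation mirroring the Map/Reduce logic under test.
    key, value = raw_line.rstrip("\n").split(",")[:2]
    return "%s\t%s" % (key, value)

# Actual output, e.g. fetched earlier via 'hdfs dfs -get'.
with open("hdfs_output.txt") as f:
    actual = set(line.rstrip("\n") for line in f)

mismatches = 0
with open("raw_source.txt") as f:
    for line in f:
        if random.random() < SAMPLE_RATE:
            if to_expected(line) not in actual:
                mismatches += 1
print("Sampled records missing from output:", mismatches)
```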
  13. Variety Testing: Challenges
      • Manually validating semi-structured and unstructured data
      • Unstructured data is hard to validate because it has no defined format
      • A lot of scripting is required to process semi-structured and unstructured data
      • Sampling unstructured data is a challenge
  14. Variety Testing: Approach
      • Structured data: compare data using compare tools and identify the discrepancies
      • Semi-structured data: convert the semi-structured data into a structured format, format the converted raw data into expected results, and compare the expected result data with the actual results (a conversion sketch follows below)
      • Unstructured data: parse the unstructured text into data blocks, aggregate the computed data blocks, and validate the aggregated data against the data output
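
A minimal sketch of the semi-structured conversion step, assuming line-delimited JSON as the input and a flat CSV as the structured target (field names and file names are illustrative):

```python
import csv
import json

COLUMNS = ["id", "name", "city"]  # assumed structured target layout

with open("semi_structured.json") as src, \
        open("structured.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    writer.writerow(COLUMNS)
    for line in src:  # one JSON document per line
        record = json.loads(line)
        writer.writerow([
            record.get("id"),
            record.get("name"),
            record.get("address", {}).get("city"),  # nested field flattened
        ])
```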
  15. Velocity Testing: Challenges
      • Setting up a production-like environment for performance testing
      • Simulating production job runs
      • Setting up test data for high-velocity streaming volumes
      • Simulating node failures
  16. Velocity Testing: Approach
      Validation points:
      • Performance of Pig/Hive jobs: capture the job completion time and validate it against the benchmark
      • Throughput of the jobs
      • Impact of background processes on the performance of the system
      • Memory and CPU details of the task tracker
      • Availability of the name node and data nodes
      Metrics captured (a capture sketch follows below):
      • Job completion time
      • Throughput
      • Memory utilization
      • Number of spills and spilled records
      • Job failure rate
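
A minimal sketch of capturing two of these metrics, job completion time and throughput, around a Hive job (the query, benchmark, and record count are illustrative; in practice the record count would come from the job counters):

```python
import subprocess
import time

BENCHMARK_SECONDS = 600       # assumed agreed benchmark
RECORDS_PROCESSED = 50000000  # illustrative; read from job counters in practice

start = time.time()
subprocess.run(
    ["hive", "-e",
     "INSERT OVERWRITE TABLE agg SELECT k, COUNT(*) FROM src GROUP BY k"],
    check=True,
)
elapsed = time.time() - start

print("Job completion time: %.1f s" % elapsed)
print("Throughput: %.0f records/s" % (RECORDS_PROCESSED / elapsed))
print("Within benchmark:", elapsed <= BENCHMARK_SECONDS)
```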
  17. THANK YOU
      www.infosys.com
      The contents of this document are proprietary and confidential to Infosys Limited and may not be disclosed in whole or in part at any time, to any third party, without the prior written consent of Infosys Limited. © 2012 Infosys Limited. All rights reserved. Copyright in the whole and any part of this document belongs to Infosys Limited. This work may not be used, sold, transferred, adapted, abridged, copied or reproduced in whole or in part, in any manner or form, or in any media, without the prior written consent of Infosys Limited.