Big Data Testing VS ETL Testing
Whether it is a Data Warehouse (DWH) or a BIG Data storage system, the basic component of interest to us, the testers, is the 'Data'. At the fundamental level, data validation in both these storage systems involves validating the data against the source systems for the defined business rules. It's easy to think that if we know how to test a DWH, we know how to test a BIG Data storage system.
But, unfortunately, that is not the case! In this blog, I focus on some of the differences between these storage systems and suggest an approach to BIG Data testing.
Let us look at these differences from the following 3 perspectives:
-Data
Four fundamental characteristics by which the data in DWH and BIG Data storage systems differ are the Data Volume, Data Variety, Data Velocity and Data Value.
DWH (Data Warehouse) | Big Data |
---|---|
Typical data volumes that current DWH systems are capable of storing are in the order of gigabytes. | BIG Data storage systems can store and process data sizes of petabytes and more. |
DWHs can store and process only 'structured' data. | When it comes to data variety, there are no constraints on the type of data: whether it is 'structured' or 'unstructured', it can be stored and efficiently processed within a tolerable elapsed time in a BIG Data storage system. |
Data is stored in a DWH through 'batch processing'. | BIG Data implementations support 'streaming' data too. |
DWH systems are based on RDBMS. | BIG Data storage systems are based on a file system. |
DWH systems have limitations on linear data growth. | BIG Data implementations such as the ones based on Apache Hadoop have no such limitations, as they are capable of storing the data in multiple clusters. |
Validation tools for DWH systems testing are based on SQL (Structured Query Language). | For BIG Data, the tools available in the Hadoop ecosystem range from pure programming tools like MapReduce (which supports coding in Java, Perl, Ruby, Python etc.) to wrappers built on top of MapReduce like HiveQL or Pig Latin. |
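To make the "validation against the source systems" idea concrete, here is a minimal sketch in plain Python, written in a map/reduce style so that the same logic could in principle be ported to a Hadoop job. It is an illustration only, not a description of any particular tool: the function names (`map_record`, `reduce_fingerprints`, `reconcile`) and the assumption that every record is a dictionary with an `id` key are hypothetical.

```python
import hashlib
from collections import defaultdict


def map_record(record):
    """Map step: emit (key, fingerprint) for one record.

    Assumes (hypothetically) that each record is a dict with an "id" key.
    """
    key = record["id"]
    # Fingerprint the non-key fields in sorted order so field order does not matter.
    payload = "|".join(f"{k}={record[k]}" for k in sorted(record) if k != "id")
    return key, hashlib.sha256(payload.encode("utf-8")).hexdigest()


def reduce_fingerprints(pairs):
    """Reduce step: group the fingerprints emitted by the map step by key."""
    grouped = defaultdict(list)
    for key, fingerprint in pairs:
        grouped[key].append(fingerprint)
    return grouped


def reconcile(source_records, target_records):
    """Compare source and target and report keys that do not line up."""
    src = reduce_fingerprints(map_record(r) for r in source_records)
    tgt = reduce_fingerprints(map_record(r) for r in target_records)
    return {
        "missing": sorted(k for k in src if k not in tgt),
        "unexpected": sorted(k for k in tgt if k not in src),
        "mismatched": sorted(
            k for k in src if k in tgt and sorted(src[k]) != sorted(tgt[k])
        ),
    }


# Made-up sample data: record 2 differs, record 3 is missing, record 4 is extra.
source = [
    {"id": 1, "name": "a", "amount": 10},
    {"id": 2, "name": "b", "amount": 20},
    {"id": 3, "name": "c", "amount": 30},
]
target = [
    {"id": 1, "amount": 10, "name": "a"},
    {"id": 2, "name": "b", "amount": 25},
    {"id": 4, "name": "d", "amount": 40},
]
print(reconcile(source, target))
```

In a DWH the same check is usually a SQL join between source and target tables; on a BIG Data platform the tester typically has to express it as a job like the one sketched here, or in a wrapper language such as HiveQL.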
What does this mean to the tester?
DWH - Tester | Big Data - Tester |
---|---|
A DWH tester has the advantage of working with 'structured' data (data with a static schema). | A BIG Data tester may have to work with 'unstructured' or 'semi-structured' data (data with a dynamic schema) most of the time. The tester needs to seek additional inputs from the business/development teams on how to derive the structure dynamically from the given data sources. |
When it comes to the actual validation of the data in a DWH, the testing approach is well-defined and time-tested. The tester has the option of using a 'Sampling' strategy manually or an 'Exhaustive verification' strategy from within automation tools like Infosys Perfaware (proprietary DWH testing solution). | Considering the huge data sets for validation, even a 'Sampling' strategy is a challenge in the context of BIG Data validation. |
RDBMS-based databases (Oracle, SQL Server etc.) are installed in the ordinary file system. So, testing of DWH systems does not require any special test environment, as it can be done from within the file system in which the DWH is installed. | When it comes to testing BIG Data in HDFS, the tester requires a test environment that is based on HDFS itself. Testers need to learn how to work with HDFS, as it is different from working with an ordinary file system. |
DWH testers use either Excel-based macros or full-fledged UI-based automation tools. Validation tools for DWH systems testing are based on SQL (Structured Query Language). | For BIG Data, there are no defined tools. Tools presently available in the Hadoop ecosystem range from pure programming tools like MapReduce (which supports coding in Java, Perl, Ruby, Python etc.) to wrappers built on top of MapReduce like HiveQL or Pig Latin. |
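One possible way to make a 'Sampling' strategy workable on very large data sets is to pick the sample by hashing the record key, so that the source side and the target side select exactly the same records without coordinating with each other. The sketch below is an assumption on my part rather than an established practice described in this post; the names `in_sample` and `sample_records` and the `id` key field are hypothetical.

```python
import hashlib


def in_sample(key, rate_percent):
    """Decide whether a key belongs to the sample.

    The decision depends only on the key itself, so any system that
    applies this function selects the same records.
    """
    digest = hashlib.sha256(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < rate_percent


def sample_records(records, key_field, rate_percent):
    """Keep only the records whose key falls into the sample bucket."""
    return [r for r in records if in_sample(r[key_field], rate_percent)]


# Made-up data: the same keys exist on both sides, in a different order.
source = [{"id": i, "amount": i * 10} for i in range(1000)]
target = list(reversed(source))

source_sample = sample_records(source, "id", 10)
target_sample = sample_records(target, "id", 10)

# Both sides end up with the same set of keys to compare.
print(sorted(r["id"] for r in source_sample) == sorted(r["id"] for r in target_sample))
```

Because the selection is deterministic, the sampled records can then be compared field by field, and the sample can be reproduced later if a defect needs to be re-verified.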
-Conclusion
Experience in DWH can, at best, shorten the learning curve of the BIG Data tester in understanding the extraction, transformation and loading of the data from source systems to HDFS at the conceptual level. It does not provide any other advantage.
BIG Data testers have to learn the components of the BIG Data ecosystem from scratch. Until the market evolves and fully automated testing tools are available for BIG Data validation, the tester has no option but to acquire the same skill set as the BIG Data developer in leveraging BIG Data technologies like Hadoop. This requires a tremendous mindset shift for both the testers and the testing units within the organization.
--Thanks.
Source: http://www.infosysblogs.com