Wednesday, February 11, 2009

Test Data: The Hard of the Matter

Here’s a tough truth: If you can’t get control of your test data, you can forget test automation. The core value proposition of automation rests on reusability, and if you can’t reuse the data you can’t reuse the tests. Even manual testing requires test data, of course, but a manual tester can try to find or create the data they need on the fly.

My experience shows that testers spend 80% of their time or more just trying to locate data they can use or entering the conditions they need. For more on this, see "Automated or Not, It’s All About the Data" (http://www.stickyminds.com/s.asp?F=S9033_COL_2). That means having reusable data even for manual testing would improve productivity by orders of magnitude.

So what’s the big deal? Can’t you just copy production data? Aren’t there tools out there that let you extract snapshots from databases, scramble existing data or generate fake values?

The bad news is, these traditional techniques usually don’t work. The good news is the solution may be easier than you think.

The Problem with Production

Most companies use production data in some form for their testing. For one thing, it’s realistic – after all, it’s from production. For another, it’s easy – it’s data you already have. But there are problems.

The first problem is that there is probably so much data that it’s costly to copy and store. Second, all that volume makes it hard to find the exact conditions you need for a particular test scenario, and they may not even exist in the form you need. Third, it’s dynamic – a constantly moving picture that is too unpredictable to yield any reuse.

And lately there is a fourth reason: privacy. It’s highly likely that some of the data is confidential and cannot be legally disclosed to others, including testers. There are the obvious fields like Social Security numbers, but others may include personal data like addresses, account numbers, and any transactions related to financial or health matters. See "Keeping Secrets: How Data Privacy Affects Testing" (http://www.stickyminds.com/s.asp?F=S8327_COL_2) for more detail on this issue.

And yes, there are tools that can help with these problems. But it’s not that easy.

The Trick with Tools

Database tools have been around for a long time and are widely available. Most of them were developed to test databases by generating high volumes of data or extracting selected subsets, but some are specifically targeted at testing and can locate, obfuscate or create required data values.

The trick is that it’s tricky to selectively extract or create a meaningful, coherent set of data. By meaningful I mean that it contains the test conditions you are interested in, and by coherent I mean that all of the related data is included. It is not as easy as taking every Nth record or some flat percentage of the data: complex interrelationships must be maintained between data elements.

For example, testing customer orders might require the customer master record, any related contracts on file, all transactions for that customer, plus all of the warehouse locations, product inventories, shipping codes, commission records, and myriad other data elements that touch the customer or the transactions - or anything they touch.
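To make the difference between a coherent extract and a flat sample concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the toy records, the table names, and the helper functions are hypothetical, and a real schema would have far more tables and links. It starts from one customer, pulls in the records that point at that customer, then follows every reference outward, and compares the result with taking every second record.

```python
from collections import deque

# Hypothetical toy data: each record is keyed by (table, id) and lists the
# records it references. Real schemas have many more tables and links.
RECORDS = {
    ("customer", 1): [("contract", 10)],
    ("customer", 2): [],
    ("contract", 10): [],
    ("order", 100): [("customer", 1), ("product", 500), ("warehouse", 7)],
    ("order", 101): [("customer", 2), ("product", 501), ("warehouse", 7)],
    ("product", 500): [("warehouse", 7)],
    ("product", 501): [("warehouse", 7)],
    ("warehouse", 7): [],
}

def coherent_subset(seed, records):
    """Return the seed, the records that point at it, and all they reference."""
    # Step 1: the seed plus every record that refers directly to it.
    pending = deque([seed])
    pending.extend(key for key, refs in records.items() if seed in refs)
    # Step 2: follow references outward until nothing new turns up.
    selected = set()
    while pending:
        key = pending.popleft()
        if key in selected:
            continue
        selected.add(key)
        pending.extend(records[key])
    return selected

def dangling(selected, records):
    """Return referenced records that are missing from the selection."""
    return {ref for key in selected for ref in records[key] if ref not in selected}

subset = coherent_subset(("customer", 1), RECORDS)
naive = set(sorted(RECORDS)[::2])  # "every Nth record" with N = 2

print(sorted(subset))
print(dangling(subset, RECORDS))   # empty: nothing is left pointing into thin air
print(dangling(naive, RECORDS))    # the flat sample lost a record it depends on
```

Even in this tiny example the flat sample keeps an order and a product but drops the warehouse they both depend on. The sketch also ignores external files and date dependencies, which is where real extracts tend to get harder.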

Furthermore, few applications operate on a single source of data. Many have interfaces to other applications that take the form of even more files in a wide array of formats. Some of these interfaces are real-time, some are batch, but all must be coordinated. While database tools may help you trace the relationships between tables and fields, they may break down when external files and formats are in the mix.

And finally there is the question of dates. Dates abound in most applications and are often central to calculations and event triggers. The date of an order may affect its pricing based on contractual terms. Posting a payment in one period versus another may create a late fee or interest charge. The dates of shipments or receipt of goods may trigger automated inventory orders, and so on and on. Database tools may know about table relationships but they don’t know about date dependencies.
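A small sketch shows why a date dependency defeats a static copy of data. The late-fee rule, the 30-day grace period, and the fee amount below are all hypothetical, chosen only to illustrate the point: the same stored invoice produces a different result depending on when the payment is posted.

```python
from datetime import date

GRACE_DAYS = 30    # hypothetical contractual term
LATE_FEE = 25.00   # hypothetical flat fee

def late_fee(invoice_date, payment_date):
    """Charge the flat fee when payment posts after the grace period."""
    days_outstanding = (payment_date - invoice_date).days
    return LATE_FEE if days_outstanding > GRACE_DAYS else 0.0

# A fixed value, as it would sit in a copied snapshot.
invoice_date = date(2009, 1, 5)

# The same stored invoice, posted against on two different dates.
fee_when_captured = late_fee(invoice_date, date(2009, 1, 20))  # 15 days out
fee_when_rerun = late_fee(invoice_date, date(2009, 3, 20))     # 74 days out

print(fee_when_captured, fee_when_rerun)
```

Nothing in the table structure reveals that this invoice stops being a "no late fee" test case once the calendar moves on, which is the kind of dependency a relationship-tracing tool cannot see.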

My experience shows that despite their availability and relatively sophisticated capabilities, few of these tools are actually successfully deployed for testing because the effort and skills required to make them work are too high.

The Advantage of Automation

So what does work? Ironically, test automation presents both the challenge and the solution. It is the challenge because, as pointed out, automation won’t work without reusable test data. It is also the solution, though, because automation can create the data it needs.

Think about it. If you need a customer account with particular conditions for testing, you can either try to find it or you can create it. Finding it may take a lot of time and it may not even exist in the form you need. The advantage of creating it is that you know it exists and that it has the right conditions because you put it there. Another plus is that the very act of creating the customer is, in itself, a test.


Manually creating all that data is not practical, but with automation it makes sense. Test automation tools can type in data quickly and accurately. By simply planning out your test cycle carefully, you can be sure that all the data you need is there how and when you need it…and expand your test coverage along the way. See "The Test Automation Timetable: Altered States" (http://itmanagement.earthweb.com/entdev/article.php/622301) for more on automating data states.
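Here is a rough sketch of what a planned cycle can look like, with each step creating the data the next step needs. The FakeApp class is a hypothetical in-memory stand-in for the application under test, and its methods and credit-limit rule are assumptions made up for this example; in practice the steps would drive the real application through an automation tool.

```python
class FakeApp:
    """Hypothetical stand-in for the application under test."""

    def __init__(self):
        self.customers = {}
        self.orders = {}

    def create_customer(self, name, credit_limit):
        customer_id = len(self.customers) + 1
        self.customers[customer_id] = {"name": name, "credit_limit": credit_limit}
        return customer_id

    def place_order(self, customer_id, amount):
        customer = self.customers[customer_id]
        # Assumed business rule: orders over the credit limit go on hold.
        status = "accepted" if amount <= customer["credit_limit"] else "on hold"
        order_id = len(self.orders) + 1
        self.orders[order_id] = {
            "customer_id": customer_id,
            "amount": amount,
            "status": status,
        }
        return order_id

def run_cycle(app):
    # Step 1 creates the data and, in doing so, tests customer creation.
    customer_id = app.create_customer("Test Customer A", credit_limit=1000)
    assert app.customers[customer_id]["credit_limit"] == 1000

    # Step 2 reuses that customer, so the condition it needs is known to exist.
    within = app.place_order(customer_id, 400)
    over = app.place_order(customer_id, 5000)
    return app.orders[within]["status"], app.orders[over]["status"]

results = run_cycle(FakeApp())
print(results)
```

The order of the steps is the plan: the customer exists with a known credit limit before any order test runs, so no one has to go hunting for a suitable account.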

Now this doesn’t mean you won’t need a starting point that includes basic master data. Believe it or not, trying to start with a blank database is almost impossible: there are far too many pointers, stored procedures, and other arcane contents that you could not understand, let alone create, in your lifetime. So you will still find yourself starting with some production or sandbox data just as a backdrop, but you won’t use most of it.

Instead, you will use the data you design. And that’s another benefit of automated data: You can design exactly the conditions you need, and because you are in control you can be sure that dates have the right relationship to each other and to the system date.
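One way to keep dates in the right relationship is to design them as offsets from the system date rather than as fixed values. The sketch below assumes a hypothetical InvoiceSpec structure and a simple "overdue" rule; neither comes from any particular tool. The same design is materialized against two different system dates and keeps the same conditions both times.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class InvoiceSpec:
    """A test condition expressed as offsets from the system date (hypothetical)."""
    name: str
    issued_days_ago: int
    terms_days: int

    def materialize(self, system_date):
        # Turn the relative design into concrete dates for this run.
        issued = system_date - timedelta(days=self.issued_days_ago)
        due = issued + timedelta(days=self.terms_days)
        return {
            "name": self.name,
            "issued": issued,
            "due": due,
            "overdue": due < system_date,
        }

specs = [
    InvoiceSpec("current", issued_days_ago=10, terms_days=30),
    InvoiceSpec("overdue", issued_days_ago=45, terms_days=30),
]

def build(system_date):
    """Create the designed invoices relative to the given system date."""
    return {spec.name: spec.materialize(system_date) for spec in specs}

run_2009 = build(date(2009, 2, 11))
run_later = build(date(2010, 6, 1))

for run in (run_2009, run_later):
    print(run["current"]["overdue"], run["overdue"]["overdue"])
```

Because the conditions are stated relative to the system date, the "overdue" invoice is overdue on every run, whenever the cycle is executed.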

You will still need a strategy for interfaces that will probably include maintaining copies of files and massaging the values, but since you are in control of the core database contents it will be easier to know what data you need.

If this sounds too simplistic, I can tell you that some of the largest companies in the world, with massive, complex IT landscapes that span the globe, have successfully automated both their testing and their data using this approach.