Test Data Management Virtualization [PODCAST]

Welcome to Episode 89 of TestTalks. In this episode, we'll discuss a cool new way to manage your test data using virtualization with the folks from Delphix, Kyle Hailey and Brett Stevens. Discover all the benefits of using virtualized data for your development and testing efforts.

In my experience, one of the biggest reasons for flaky automated tests is test data issues. Not having a good test data management plan in place makes it hard to create reliable test automation across multiple environments. One solution I’ve been hearing more and more about is virtual data, so this episode is all about test data management using data virtualization.

In this episode, you'll discover:

  • The significant difference between VMs and data virtualization
  • What data virtualization actually is
  • How data virtualization can help you with stale test data
  • Tips to improve your test data management efforts
  • What tool you can use to get you started quickly with data virtualization

Join the Conversation

My favorite part of doing these podcasts is participating in the conversations they provoke. Each week, I pull out one question that I'd like to get your thoughts on.

This week, it is this:

Question: What do you use to help with your automated test data management efforts? Share your answer in the comments below.

Want to Test Talk?

If you have a question, comment, thought or concern, you can share it by clicking here. I'd love to hear from you.

How to Get Promoted on the Show and Increase your Karma

Subscribe to the show in iTunes and give us a rating and review. Make sure you put your real name and website in the text of the review itself. We will definitely mention you on this show.

We are also on Stitcher.com so if you prefer Stitcher, please subscribe there.

Read the Full Transcript

Joe:         Hey Brett and Kyle, welcome to Test Talks.

Brett:      Hey Joe, how are you doing today?

Joe:         Awesome. Before we get into it, could you just tell us a little more about yourselves?

Brett:      Yeah, sure. My name is Brett Stevens. I currently work at a software startup in Silicon Valley called Delphix. I've been here for about a year and a half. Previously I filled the role of business technology consultant, so I worked across a bunch of customers at Delphix and I've seen a lot of different testing strategies, and we've particularly run into a lot of test data management practices.

Kyle:       Hey, this is Kyle Hailey. I work alongside Brett at Delphix. I've been here about five years. Previous to that I worked on building tools to manage databases at Oracle and Embarcadero. Oracle used sort of a waterfall design method, and Embarcadero used agile methods.

Joe:         I guess at a high level, before we even get into it, what is data virtualization and how can it help someone who is creating tests or test cases?

Brett:      Sure. Data virtualization is basically a way to … It's an integrated data masking and delivery solution. It virtualizes data across the application life cycle. So today you might have multiple copies of production data sets for development, testing, training, reporting, backup, and disaster recovery, and the issue is that making all those copies consumes storage and also time and people. There are a lot of inefficient processes that go into provisioning a database. Data virtualization is basically a way to reduce the consumption of those infrastructure resources and also streamline the delivery of that data to the application teams.

Kyle:       That's a good way to put it, Brett. I sort of think of it, in a simplified way, like this: [inaudible 00:01:39] took a single set of hardware and made many machines out of it, or virtual machines. With virtual data we take one set of data, one immutable set of data, and provide many re-writable copies on top of that shared set of immutable data.

Joe:         Awesome, so I'm just trying to visualize this. I work with a very large team, we have ten sprint teams, and the ten sprint teams each have their own environments, so there are ten different databases that they have to set up. I also have a database within my CI environment. So are you saying that with something like test data virtualization, test data management, you can almost make an instance of, say, the database that's in the CI environment and just provision it out to these sprint teams, rather than have them set up their own environments and have to worry about maintaining the separate databases on their own?

Kyle:       Totally. One of the big advantages of virtual data is that there's really only one copy of the data. So as opposed to taking a production database and making … I think you said eleven teams? So instead of making eleven full copies of that, we basically have an automated stack that collects the data from the source and keeps a time window of the changes on that source. Then when you make eleven copies, they're virtual copies, not actual physical copies. We just expose pointers to data that already exists. It's like snapshot technology. Once those copies modify the data, they don't modify the original data; they write their new changes elsewhere so that the other clones or copies don't see them.
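The shared-data idea Kyle describes here is essentially copy-on-write. The sketch below is a toy illustration only, not how Delphix implements it: the class names `SharedSource` and `VirtualCopy`, the block contents, and the in-memory dictionaries are all assumptions made for the example.

```python
class SharedSource:
    """An immutable set of data blocks, stored once (block id -> contents)."""

    def __init__(self, blocks):
        self._blocks = dict(blocks)

    def read(self, block_id):
        return self._blocks[block_id]


class VirtualCopy:
    """A writable clone that stores only the blocks it has changed."""

    def __init__(self, source):
        self._source = source
        self._changed = {}  # private overlay: block id -> new contents

    def read(self, block_id):
        # Prefer this clone's own version; otherwise fall through to the shared block.
        if block_id in self._changed:
            return self._changed[block_id]
        return self._source.read(block_id)

    def write(self, block_id, contents):
        # Writes never touch the shared source, so other clones cannot see them.
        self._changed[block_id] = contents

    def private_block_count(self):
        return len(self._changed)


source = SharedSource({0: "patients", 1: "visits", 2: "billing"})
teams = [VirtualCopy(source) for _ in range(11)]  # eleven clones, one stored data set

teams[0].write(1, "visits (edited by team 0)")
```

After the write, only team 0 sees the edited block. The other ten clones and the source still read the original, and the only extra storage used is the one changed block.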

Brett:      And then Joe, the other interesting thing as well is that … The way that data virtualization works is that it would ingest a source database, whether it's a production database or maybe synthetic data that's been created. From there on out, all changes to that source are collected. So not only would you have a static snapshot in those eleven different environments of the data that the developer would need, but you would also have a test data factory of all those different data sets. Maybe data that's masked, or generated a week ago versus today. You'd be able to move those data sets into an environment as needed.

Joe:         This might actually solve an issue we've been having for a while, which is that we have an environment, we have a database that we [inaudible 00:03:57] every day. So we wipe it out and start off fresh every day, but the problem with that is that what it's being refreshed from is a snapshot in time. We do patients, and we have certain patients that are in a certain status, like they're checked in. But over time, say after six months, that data could become stale. So I'm just curious to know how you would handle a situation like that, where there is potential for a lot of stale data in your test data.

Brett:      That's a good point. One of the reasons why enterprises are looking towards test data management strategies is that you might be testing using stale data, or even corrupted data that's been modified by testing. With data virtualization you'd be able to refresh in a matter of minutes from the latest source. The other innovative feature of data virtualization is the power of a reset: going back to a previous baseline that you can bookmark as of any point in time. That way you could create a bunch of bookmarks following a test and just quickly revert to that point in time.

Kyle:       This has two great advantages in [inaudible 00:04:59] test. For one, it typically takes a while to build up those databases. Maybe you want a copy of production and maybe you want to either mask it or add some edge case, corner case data. Then you want to rerun multiple tests, maybe the same tests over and over again on that data as the code changes. Then if the test was destructive, it might take a while to rebuild that set of data. With virtual data, though, we can spin up a copy of a virtual database from the source database. We can mask it, add corner case data, and shut it down. Then we can make as many copies of that as we want, or we can actually keep it running, run the destructive tests, and then roll back or refresh to the point in time that you bookmarked: just after you set it up, masked the data, set up the edge case data, and were ready to run your QA test.

A lot of companies eliminate an enormous amount of the infrastructure, resources, and time that they use to build up and refresh these QA test databases after each run. That can almost be eliminated. One particular case that comes to mind is … It took eight hours to refresh the database and only twenty minutes to rerun the QA test suite. Basically this runs in cycles, so when they went to virtual data that eight-hour refresh went down to a couple of minutes, and now they're actually spending the majority of the time QA-ing code as opposed to building up a data set.

Joe:         This is another interesting point, in that when we run our tests in our continuous integration environment, sometimes … They don't run in order, so we never know what order they're going to run in. I never know if a team is modifying data that another test needs without telling me about it, and it's somehow corrupting the data. In theory it sounds like I could almost create a snapshot before each test and just roll back to the way it was before that test ran, for every test. It sounds like it's quick enough to handle that, is that correct?

Kyle:       Yes.

Joe:         Awesome.
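The per-test rollback Joe describes could look something like the sketch below. It is a toy, in-memory stand-in: `VirtualDatabase`, `bookmark`, `rewind`, and `isolated` are hypothetical names for this example and do not correspond to any real Delphix API. It also copies the data on every bookmark, whereas a virtual data tool would record a pointer to a point in time.

```python
import copy
from contextlib import contextmanager


class VirtualDatabase:
    """Toy in-memory stand-in for a virtual database with bookmarks."""

    def __init__(self, rows):
        self.rows = rows
        self._bookmarks = {}

    def bookmark(self, name):
        # Record the current state under a name (a deep copy in this toy version).
        self._bookmarks[name] = copy.deepcopy(self.rows)

    def rewind(self, name):
        # Restore the state that was saved under this bookmark.
        self.rows = copy.deepcopy(self._bookmarks[name])


@contextmanager
def isolated(db, name="before-test"):
    """Bookmark before a test and rewind afterwards, even if the test fails."""
    db.bookmark(name)
    try:
        yield db
    finally:
        db.rewind(name)


db = VirtualDatabase({"patient-1": "checked in", "patient-2": "scheduled"})

with isolated(db) as test_db:
    test_db.rows["patient-1"] = "discharged"  # a destructive test step
    del test_db.rows["patient-2"]
    state_during_test = dict(test_db.rows)
```

Because the rewind sits in a `finally` clause, the next test starts from the bookmarked state regardless of the order tests run in or whether the previous one failed.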

Brett:      Another thing you can do as well, in addition to creating these bookmarks, is create a copy of a data set. We call it branching off of the main copy. It occupies basically no space, so you could run a unit test in parallel without interfering with the work another team might be doing.

Joe:         Cool, so let's take a few minutes to talk about the architecture of your solution and all the different layers, the different pieces of it.

Kyle:       This might be different depending on the company, but I'll talk about our architecture. In our virtual data architecture we give customers a virtual machine that they spin up on their hardware, any Intel hardware, and they give any storage to this virtual machine. Our software will map our proprietary file system onto it, and it's that file system that will track the blocks and the changes, and do the branching and the snapshotting. This virtual machine also has a web user interface where you can register machines: you register a source machine and then register the source database. The software will automatically first collect a full copy of that data, and it doesn't ever do that again. From then on it incrementally collects the data block changes and keeps them in a time window so that we can spin up a copy anywhere within that time window. The time window is two weeks, but it could be two months, two years, whatever you wanted to architect for.
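The "one full copy, then incremental changes inside a time window" flow can be sketched as follows. This is a simplified model under stated assumptions: `TimeFlow`, `record_change`, and `provision` are hypothetical names, timestamps are plain day numbers, and the sketch only enforces the window rather than pruning or consolidating old changes the way a real system would.

```python
class TimeFlow:
    """One full baseline plus a time-ordered log of block changes."""

    def __init__(self, baseline, retention):
        self._baseline = dict(baseline)  # collected once, in full
        self._changes = []               # (timestamp, block_id, contents), in time order
        self._retention = retention      # window length, in the same units as timestamps
        self._latest = 0

    def record_change(self, timestamp, block_id, contents):
        """Append one incremental change; timestamps must not go backwards."""
        if timestamp < self._latest:
            raise ValueError("changes must arrive in time order")
        self._changes.append((timestamp, block_id, contents))
        self._latest = timestamp

    def provision(self, as_of):
        """Return the block contents as they were at time `as_of`."""
        oldest = max(0, self._latest - self._retention)
        if not oldest <= as_of <= self._latest:
            raise ValueError(f"{as_of} is outside the window [{oldest}, {self._latest}]")
        blocks = dict(self._baseline)
        for timestamp, block_id, contents in self._changes:
            if timestamp > as_of:
                break
            blocks[block_id] = contents
        return blocks


flow = TimeFlow({0: "a0", 1: "b0"}, retention=14)  # a two-week window, in days
flow.record_change(3, 0, "a1")
flow.record_change(10, 1, "b1")
flow.record_change(20, 0, "a2")

day_12 = flow.provision(as_of=12)  # blocks as they looked on day 12
```

With the latest change on day 20 and a 14-day window, any day from 6 to 20 can be provisioned, while a request for day 2 is rejected.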

Brett:      Going back to your previous point as well, Kyle actually wrote about this on one of his recent blogs. There's technology today that lets you share exact machine copies, or apps and builds, using something like Docker, or make build configurations repeatable and automated … There isn't really a way today to make data consistent across teams, so data virtualization is basically a way to create an integrated data set across the application stack and the database stack.

Kyle:       That's one of our more powerful features. It's not just about snapshotting the data, but we can also have the binary … So if they're using an Oracle database, we could do virtual data for the Oracle distribution, we could do virtual data for the application using the database, and as a QA person or a developer, I could refresh or roll back all three of those tiers together with virtual data.

Joe:         Cool. Once again I'm thinking back to the technology I work on. We have really old back ends. We have old Sybase databases that have some sort of … Just the way they have it set up, it's not your standard database. How does this solution work? Does it only work with certain databases like Oracle? Or can it consume any other database technology?

Kyle:       In theory virtual data applies to any data. For our company in particular, Delphix, we have full automation, like point and click, with complete awareness of the database, for certain databases, and those databases are Oracle, Sybase, SQL Server, MySQL, and Postgres. [00:09:45] In those cases it's basically point and click in a web UI.

Virtualizing data, the idea of connecting to a data source, getting a copy of it, collecting the changes, being able to bookmark that and then spin off cloned copies of that, can be done for any database. We basically allow people to export v-files, or virtual files, so they can put their database, whether it's Cassandra, Mongo, or DB2, onto those virtual files, and then we can manage the snapshotting and cloning of those data files. But that takes a little bit of scripting. It's not point and click in a web UI. It takes some understanding of the databases and how to apply [weulogs 00:10:26] and such.

Joe:         You must have a demo version of this, correct? That someone could download it and mess around with it?

Kyle:       There's a version called Delphix Express. I haven't tried this, but I imagine if you Googled Delphix Express you'd probably find my blog and how to install and download it.

Joe:         In your experience, when someone is looking for a data virtualization solution, what are the main things you think they should look for in a company or a solution that it definitely should have? Main features that everyone is going to need?

Kyle:       I think the first thing is to look at the needs of the company that's looking into this technology. What are their needs? What kind of data do they have, and what kind of development processes do they have? Then look for a technology that covers their whole stack. If it's an Oracle-only shop, Oracle has some technologies they might want to look into, but if it's multiple databases like Oracle, SQL Server, or MySQL, then they're going to have to branch out of just that proprietary database.

Brett:      Another thing I would add as well is that, especially in the world of test data management, you might be talking about subsetting data, you might be talking about masking data, or you might be talking about generating synthetic data, and none of these are new. There are plenty of vendors out there that can do one or a few of these things. I think the unique differentiator, though, is delivering that data, or having that data delivery mechanism. The thing about data virtualization technology is that it could work with a synthetic data generator, and it could work with a subsetting tool. It could provide the masking so you wouldn't actually need a separate masking solution, and then it would provide that data delivery mechanism, which is often one of the biggest bottlenecks in [inaudible 00:12:03].

Kyle:       Things that would concern me as a prospect looking into this technology are: How easy is it to install? Do I need new hardware or can I use my existing hardware? How automated is it? How secure is it? Especially, does it do masking? Can it replicate only masked data outside of a protected data center? I'd want some special features, like branching, so that … For example, if I'm a QA person and I find a bug, I log the bug, and the developer can't reproduce it because they're on a different data set, what I'd like is a technology that has branching so that the QA person can basically bookmark their data and continue working on it. They bookmark the data where that bug can be reproduced. Then the developer can spin up a copy of that exact same data in a couple of minutes. That would be branching of data.

I like full automation, I want to connect to the source data, I want to collect the whole data set including changes into the future. I want it to be automatically provisioned, up and running database, without me having to be a DBA on [takhart 00:13:06] machine. I'd like to do things like being able to branch the data, like I mentioned, so that I can, say, mask it and make multiple copies, branches. Or be a QA person that could bookmark some data. That would be a branch, to fix a bug.

Brett:      Also to add to that, I would stress having built-in data masking capabilities, because if you look at the year 2015 there were something like a hundred seventy million records stolen, and I think there were around seven hundred plus data breaches. The average cost of a data breach isn't cheap. It's something on the order of 3.5 million. I think the need for data masking in that market is growing. As part of that need, data virtualization technology should have built-in masking.

Kyle:       That's a great point, because apparently 80% of the data in companies is outside of the production zone, for example, maybe I have a production database and then maybe seven or eight copies of that in development. That's where I probably need to spend more of my time securing it. Production tends to be secure, and often we don't think as hard about the development environments.

Joe:         That's actually a good point, especially … I work in health care, so we actually have a real production database, and a lot of that is sensitive patient information. Having the masking feature is definitely something that would be a big win for us, because we'd be breaking a bunch of rules and regulations if we were using real patient data and didn't have the ability to mask it like you talked about.

Brett:      That's also … One of our biggest healthcare companies is actually using data virtualization for HIPAA compliance. Something on the order of 5,000-6,000 virtual databases that are … Many of which are masked.
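As a rough illustration of what masking means for records like the patient data Joe mentions, here is a minimal sketch using keyed hashing from Python's standard library. It is not the masking Delphix ships: the field names, the `mask_value` and `mask_patient` helpers, and the key are all made up for the example, and whether a given scheme satisfies a regulation such as HIPAA is a separate question.

```python
import hashlib
import hmac

MASKING_KEY = b"example-key-not-for-real-use"  # hypothetical; keep real keys out of source code


def mask_value(value, prefix):
    """Replace a sensitive value with a stable, non-reversible token."""
    digest = hmac.new(MASKING_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}-{digest[:10]}"


def mask_patient(record):
    """Return a copy of the record with identifying fields masked."""
    masked = dict(record)
    masked["name"] = mask_value(record["name"], "name")
    masked["ssn"] = mask_value(record["ssn"], "ssn")
    return masked


patients = [
    {"id": 1, "name": "Jane Roe", "ssn": "000-00-0001", "status": "checked in"},
    {"id": 2, "name": "John Doe", "ssn": "000-00-0002", "status": "scheduled"},
]
masked_patients = [mask_patient(p) for p in patients]
```

The same input always produces the same token, so relationships between rows and tables survive masking, while non-sensitive fields such as `status` pass through unchanged for tests to use.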

Joe:         Cool. So besides these features, is there anything that Delphix has that you think most people aren't aware of? Any features that you think would be a game changer if people knew more about them? Any hidden features, where some people may have your product but not necessarily know your tool can help them with that?

Kyle:       I would say having self service is probably one of the biggest differentiators. Being able … The developers' ability to orchestrate and align data with their application code and tests. [inaudible 00:15:23] talking about things like bookmarking, branching data sets, and refreshing data sets with just the press of a button. Another thing in test data management as well is that often the integrity of the data might be impacted if data's coming from a bunch of different sources. Maybe it's out of sync. One unique feature is that you can integrate multiple databases as of any point in time. Another thing is, we talk about bookmarking, but there's also sharing that data. If an error is found during a QA cycle, one of the challenges many companies might face today is getting that environment state back to the development team to run tests again and correct the error. With data virtualization you'd be able to quickly bookmark and then share that data back. These are all things that are carried out through a self-service portal, which can also connect to things like Docker, or Puppet, or Chef.

Joe:         Are these the main categories that you break your solution up into?

Brett:      I was thinking about the product and basically who it applies to. There are three different kinds of customers out there. There are customers with legacy software that they're customizing, there are customers building their own software, and there are customers that are modernizing, maybe moving to the cloud or trying to implement DevOps processes. These are sort of the categories that data virtualization gets into.

Kyle:       Yeah, and building on that as well, and also relating back to data masking, even if you are securing data on production, having role-based access controls or separating the privileges between IT teams and app teams is another important area to cover. Giving IT teams control over provisioning new environments or retaining data on disk to be rolled back to. Separating those privileges out from things like reset or bookmark or branching.

Brett:      I agree with Kyle. The benefits could be viewed based on which teams they apply to.

Joe:         Okay. I think as more companies move towards DevOps, like you just mentioned, probably data virtualization is a need that more and more companies are probably going to have to start investing in I would think. So is this something you've been seeing as a trend?

Kyle:       This is something I'd be interested in your viewpoint on, because from my point of view I don't see how people can do really solid continuous integration or continuous delivery without virtual data. If I have to kick off multiple QA runs a day, and I need a copy of the production data set, then generally, if that data set's large, it will take me too long to build up real, physical copies. With data virtualization, or virtual data, I can spin those copies up in maybe two to ten minutes. Then I can actually run multiple QA suites a day and do CI.

Joe:         It's a great point because a lot of times we hear people complaining, well this isn't a real representation of what our customers would have, because we have such a small data set, just to speed it up, so, because we don't have that in place we're not able to have a realistic back end that we're testing against that a customer would have. I think this would help in that regard.

Brett:      I addressed this, I think it was Jez Humble on continuous integration, I think it's a whole chapter in interfaces, he … This was before, the book's been out a few years, so before virtual data was really taking off. His approach is to have rollback scripts and such. I think if you look around or talk to people, first it's hard to do, and second, because it's hard to do, people don't usually do it. It's not a very successful strategy. Sometimes it's not even feasible. If I'm doing some massive data modifications, the rollback might take a long time.

Joe:         Once again, another great point, something we've experienced, when we first started our test development efforts, I told them, hey we need to start doing better roll backs or something we can clean up, and they couldn't do it, it was too difficult. The back end, they didn't really understand, it took too long, so that's another really great point. It sounds like data virtualization would definitely help with those type of issues.

Brett:      Oh yeah, absolutely.

Joe:         All right guys, you mentioned Kyle's blog. Are there any other resources that you think people should go to in order to learn more about Delphix or even learn more about data virtualization in general?

Brett:      I think Kyle's blog is definitely the greatest resource, and it covers a pretty wide spread of different challenges and issues that Delphix can solve. In addition, there are a lot of other people at Delphix who might also be writing blog posts. I think those can be accessed on the website.

Kyle:       Delphix, the Delphix website, Delphix.com, also has a blog section where various engineers and people at Delphix blog about related topics.

Joe:         This is going to sound like an odd question. I think a lot of people, when they hear about a solution like this, are afraid to approach the company because all of a sudden they get bombarded with sales calls. What's the culture like at Delphix? Say someone wanted to learn more about your solution and they wanted to try it out. What should they do in order to learn more about it?

Kyle:       I definitely understand that reaction, this is Kyle, and as someone that's been in the field as a DBA, I can be very cynical towards sales people and such. Two things: one that Delphix Express, if you go to my blog, I can tell you how to download it so that you won't get a call. There's two ways to download it, one you get a call, one you won't, but you have to check out my blog. The other is if you, there's a community. Community.Delphix.com. It's mainly used by people that use the Delphix Express or the free version, but you can ask any questions you want there, it's non-sales, it's monitored by technical people, so you should get pretty non-salesy, technical, substantial answer in the community forums, community.Delphix.com.

Joe:         Also, how long has Delphix been around? Were they acquired by another company?

Kyle:       Delphix was incorporated in 2008 and they first started selling product in 2010. I actually joined in 2010, it was about thirty people, and now I think we're north of 400.

Joe:         Great. So I guess for someone that's never heard of virtual data, these types of concept that we talked about, I actually should have asked this question earlier in the show, but how would you summarize, what is data virtualization and how could someone wrap their heads around this concept and how it could really help them over all with their day-to-day testing activities?

Brett:      I can just give a little summary review. I think virtual data is a completely new topic for most people I run into, and … See, I get two reactions. One is, “It's impossible, you can't be doing this this quick,” which is great because then you can show them: you can download Delphix Express and try it yourself. The second reaction I get is, “Oh, we've got an EMC storage filer or NetApp and it takes snapshots, I can do it.” My question would be, “Well, if you can do it, then why don't you do it?”

Also, I think it's … The internet was around for ages before there was a browser, but the only people that used it were academics. I sort of look at storage snapshots like that too. You can do a lot of what we at Delphix do. You can't do it all, because some of the stuff we've got goes down to the operating system. You can do a lot of it, but it's going to be a ton of work. Some of the reason why things take off is ease of use. Once the browser came out, everybody started using the web. Once Docker came out, containers went off like crazy. I could do containers before Docker, but it wasn't as automated or as easy.

Kyle:       Yeah, that's a good way [inaudible 00:23:01] for me on my end, it's like, virtual data is sort of like containers for data.

Joe:         How long does it take a typical user to get up to speed with your solution?

Kyle:       Several different ways, Delphix Express, which is the same thing as Delphix, it just has some limitations on it, I can install in about fifteen minutes on my Mac with both source and target linking and already producing a database. I think … I've seen the average user download Delphix Express and probably do it in a half hour. When we sell Delphix, when a customer actually buys Delphix, it comes with two weeks of consulting. People are usually up and running in a couple days. Generally installing Delphix is easy, it installs in like fifteen minutes to half hour. What usually takes … The only thing that takes a while is if there's a big database that we're linking to, say a ten terabyte database, that initial link which only happens once, it never happens again, that initial link might take a couple of days. From then and there we just collect the changes, provisioning the virtual database only takes a few minutes. How hard is it to get going? I'm [inaudible 00:24:00] because I've been here so long, but I see people pick up Delphix Express and get going really quickly with it.

Brett:      Another interesting thing too is, that while an administrator, it might be a primary end user of Delphix, and they're probably the one that needs to get Delphix up and running, it's really developers and testers that are the key beneficiaries to using the product. I think the self-service portal is pretty intuitive, easy to use.

Kyle:       One thing that comes to mind is administration of Delphix. Customer after customer has told me that they've gone from a couple of DBAs doing physical copies down to a junior DBA, quarter time or less, taking care of Delphix.

Joe:         That's another road block I always see. Once it goes to another group like the DBAs, sometimes it gets lost and you have to go through layers and layers of bureaucracy to get to it.

Kyle:       That's a huge point. It's one thing saying that these technical actions, or these scripts, might run in a few minutes. It's another thing getting my manager to sign off, getting a DBA to get the resources, getting a system admin to get the resources, getting storage, getting the snapshots, I mean … Those guys are passing it around, and they've got their own queues of work to do. Often it feels like it just goes off into a black hole. Now I've got self service. I just hit a button and a couple of minutes later I have the data I want.

Joe:         Awesome. This sounds like an awesome solution, and I think it's going to help a lot of companies. I talk to a lot of different people and one of the hardest things is test data management, so I'm definitely going to download it, try it out, and see what I can do with it. Like you said, you do offer two weeks of free consulting if someone purchases … I guess you can't tell me the exact pricing, but do you have a ballpark you could tell me, or is that something that someone would have to contact you to find out more about?

Kyle:       I think the party line is that someone would have to contact us, but I can say this: Delphix does do an ROI study with each customer and makes sure that our price is well below the benefits that the customer is getting. We do that for each customer, so it's customized.

Brett:      Another thing I would add as well, and I don't know how relevant this would be for your audience, is that Delphix sort of started by targeting a lot of the Fortune 100 and Fortune 500 companies. Going back to your point before about … A lot of companies might have legacy Sybase databases, and these databases might be one, two, three terabytes, and then you're making ten copies of that, and all of a sudden you're at ten, twenty, thirty terabytes. If you look at some of the largest financial services institutions out there, they might have something like two or three petabytes of data, more than half of which is just copies of the same data. The biggest reason why some previous Delphix customers have invested in Delphix is that they've been able to consolidate that footprint. You can sort of do the math on what the savings could be from that perspective, but they've also seen enormous gains in terms of their software delivery processes, like automating the data delivery process in their SDLC.
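Brett's "do the math" can be written out as a small calculation. The source sizes and the ten copies come from his example. The 5% of blocks changed per virtual copy is purely an assumption for illustration, the function names are hypothetical, and compression and retained history are ignored.

```python
def physical_footprint_tb(source_tb, copies):
    """Storage for full physical copies (the source itself is not counted)."""
    return source_tb * copies


def virtual_footprint_tb(source_tb, copies, changed_fraction):
    """One shared copy plus only the blocks each virtual copy has changed."""
    return source_tb + copies * source_tb * changed_fraction


for source_tb in (1, 2, 3):
    physical = physical_footprint_tb(source_tb, copies=10)
    virtual = virtual_footprint_tb(source_tb, copies=10, changed_fraction=0.05)  # 5% is assumed
    print(f"{source_tb} TB source: {physical} TB physical vs {virtual:.1f} TB virtual")
```

Under that assumption, a 1 TB source needs 10 TB for ten physical copies versus 1.5 TB for ten virtual ones, and the gap widens proportionally for the 2 TB and 3 TB cases.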

Kyle:       It's interesting. Back when I first joined Delphix, a number of the customers back then were buying Delphix for storage savings, and then it quickly turned out that that first wave of customers came back and the normal response was, “We don't care about the storage savings now, we're so much more agile, we can go so much faster. That's all we care about. Storage is just icing on the cake.”

Joe:         Okay, Kyle and Brett. Before we go, is there one piece of actionable advice you can give someone to help them with virtualized data? And let us know the best way to find or contact you to learn more about Delphix and your virtual data solution.

Kyle:       A very biased response would be, yeah. Download Delphix Express, try that, it'll change your life with test data management.

3 responses to “Test Data Management Virtualization [PODCAST]”

  1. Wow, this is awesome. I am definitely going to take a look. At my job we are constantly having issues with data because we work with Big Data. Therefore, it’s really hard to simulate the production data on our test systems. And it seems like a hard task that nobody wants to take on. People just close their eyes and hope for the best, which is crazy to me! That leads to a ton of production bugs. Hopefully, this can be an answer to our problems.

    • Thanks Nikolay. Please let me know how it works out for you. We are also doing an eval of Delphix Express to see if it can help us with our BDD test management issue. I will also post more info based on what I find. Cheers~Joe