Seven reasons why it’s time to BART your backup

Let us go through seven real-world reasons that I have encountered while working in the backup & restore field for the past 20 years and see where and how BART could have prevented problems. All of the following use cases are based on real-world needs, and more are being added through the continuous agile development of BART.

BART is an advanced automated recovery testing tool designed to, amongst other things, run scheduled validations of virtual machines as found in your VMware infrastructure.
BART is the abbreviation for “BPSOLUTIONS Automated Recovery Tester”.
BART was created and is sold by BPSOLUTIONS.

 

Scenario 1 - Errors during the restore of a VM
A customer needs to restore a virtual machine. Easy enough, right?
Start the restore, wait for the restore to complete, job done.
Well, no, not in this case: some backup data needed for the restore was corrupted.
The corrupted data was eventually recovered from the replicated backup environment, but this took a support ticket for a consultant and therefore extra time and money.
Testing other VM restores revealed that more VMs were hit by the corruption issue.
We’ll just chalk that one up to immature data deduplication without proper automatic error correction.

BART's solution to scenario 1
BART has multiple operation modes. One mode selects random VMs from the VMware inventory, restores each VM under an alternative name, disconnects the network interfaces, adjusts compute resources and boots the VM to validate that VMware Tools starts, and thus that the operating system has started.
This greatly reduces the chance of unforeseen errors during actual real-world restores, because issues will show up during BART runs first.


 

The example of BART run output shown above shows one VM that was successfully restored (green lines) and one that failed to restore during the BART run (red lines).

Want to check the recovery of your most important VMs each week or month? No problem: you can create a separate BART configuration and BART schedule for this exact requirement. No more manual testing; BART runs and reports automatically whenever you see fit. The e-mail's subject line gives a preview of its content, so you can spot possible issues from your inbox without opening the e-mail.

Scenario 2 - Caught off guard by extremely slow restores that do not meet SLAs
You eventually need to restore that big VM and find out that the once blazing-fast restores are now running at 10% of their former performance. The business unit is waiting for the restore to complete and you realize it's going to take much longer than you estimated. Maybe you use the backup solution's integrated validation method, but that doesn’t spot these issues because it relies on instant recovery: it boots the VM from the backup environment and does not actually move data.

BART's solution to scenario 2
You can attach recovery SLAs to VMs in the form of RTOs (in hh:mm:ss format) within vCenter using VMware tagging, and/or rely on the BART default SLA. BART then validates the set recovery time objective against the recovery time it measures.
BART does not use the duration as reported by the supported backup solutions but utilizes its own stopwatch function to time the actual restore duration from start to finish.
BART will warn the team as soon as a BART run detects that a restore was not compliant with the BART SLA attached to the VM.
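
The SLA check described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not BART's actual implementation: the function names `parse_rto` and `timed_restore` and the `restore` callable are hypothetical. Only the hh:mm:ss RTO format and the idea of an independent stopwatch come from the text.

```python
import time
from datetime import timedelta


def parse_rto(rto: str) -> timedelta:
    """Parse an RTO written as hh:mm:ss into a timedelta."""
    hours, minutes, seconds = (int(part) for part in rto.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def timed_restore(restore, rto: str, clock=time.monotonic):
    """Run `restore` (a hypothetical callable that performs the restore),
    time it with our own stopwatch instead of trusting the backup tool,
    and report whether the measured duration stays within the RTO."""
    start = clock()
    restore()
    elapsed = timedelta(seconds=clock() - start)
    return elapsed, elapsed <= parse_rto(rto)
```

A monotonic clock is used here so that system clock adjustments during a long restore cannot distort the measured duration.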

The BART report above shows two validations. The first one was successful and completed within its attached SLA. The second validation shows a successful restore that took too long to complete; therefore the SLA of the second VM was breached and reported.

Complementary to the BART run reports (which are sent after each BART run), BART features a “monthly BART report” that is sent at the beginning of each month and covers all BART validations of the previous month. This report also shows all VMs that have breached their attached SLA.

Furthermore, BART has a "reference VM" feature that allows the BART administrator to select a single VM that is validated separately on a predetermined schedule (day of the week or month, and time). BART graphs that VM's performance over time. A sample graph is shown above.


During the creation of each monthly report a new graph is generated for the reference VM.
The visualization of performance data makes it very apparent if the restore performance is changing over time and allows action to be taken, sometimes even before SLAs are breached.
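
The same trend that the graph makes visible can also be checked numerically. The sketch below is only an illustration and does not describe how BART evaluates the reference VM: the function name `restore_slowdown` and the `tolerance` threshold are assumptions. It compares the latest restore duration of the reference VM with the median of the earlier ones.

```python
from statistics import median


def restore_slowdown(durations_s, tolerance=1.5):
    """durations_s: restore durations of the reference VM in seconds, oldest first.

    Returns (ratio, flagged), where ratio is the latest duration divided by
    the median of all earlier durations, and flagged is True when the ratio
    exceeds `tolerance` (a hypothetical threshold).
    """
    if len(durations_s) < 2:
        return None, False  # not enough history to compare against
    baseline = median(durations_s[:-1])
    ratio = durations_s[-1] / baseline
    return ratio, ratio > tolerance
```

Using the median rather than the mean keeps a single unusually slow or fast run in the history from skewing the baseline.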

Scenario 3 - VM missing from backup
You get assigned a support ticket that states "Please restore VM named XYZ".

You go into the management interface of your backup solution, only to be surprised by the fact that the VM has never been backed up. Nobody knows how this could have happened; the server team is sure their automation added the VM to the backup during deployment 5 weeks ago. But it's simply not there, so the VM will have to be rebuilt from scratch, losing days of valuable time.

BART's solution to scenario 3
BART is vCenter-inventory-orientated, not backup-inventory-orientated.
This means that BART selects the VMs to be validated based on your actual live environment, not on what is already in the backup.
This seems logical, but most tools operate from a backup-inventory perspective.
With BART we have a third-party tool that uses your live environment inventory to validate whether items are actually in the backup, restorable (within their SLA) and bootable after the restore.
BART keeps track of the VMs that have been validated and will not rerun these validations until all VMs (except the ones you explicitly opted out of BART) have been validated.
This way the validation selection is random, but the same VM is not validated again until all VMs have been completed. This further improves the chances of quickly finding VMs that might be a cause for concern within the backup environment.
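
The selection behaviour described above (random, but without repeats until every VM has had its turn) can be sketched as follows. This is a hedged illustration, not BART's code: the function name `pick_next_vms` and its parameters are hypothetical, and the inventory is assumed to be a plain list of VM names taken from the live environment.

```python
import random


def pick_next_vms(inventory, validated, opted_out, count, rng=random):
    """Pick up to `count` random VMs not yet validated in the current cycle.

    inventory: VM names from the live environment
    validated: set of VM names already validated this cycle (updated in place)
    opted_out: set of VM names explicitly excluded from validation
    """
    eligible = sorted(set(inventory) - set(opted_out))
    remaining = [vm for vm in eligible if vm not in validated]
    if not remaining:
        # Every eligible VM has been validated: start a new cycle.
        validated.clear()
        remaining = eligible
    picked = rng.sample(remaining, min(count, len(remaining)))
    validated.update(picked)
    return picked
```

Because the candidate list is built from the live inventory on every call, a newly deployed VM becomes eligible for validation without anyone having to register it.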


Scenario 4 - External audit doesn't accept backup report for restore validation
When under audit it is common to present the auditor with backup reports that show high success rates as proof of successful and thus usable backups.
This will, however, not impress many auditors.
What they want to see are results from actual restore tests. The more data you can provide proving that periodic restore tests are executed and that the results are stored, shared, and acted upon, the better. The more accurate and independent the recovery data is, the larger the smile on the auditor's face will be.

BART's solution to scenario 4
Just like the auditor is an outside source looking into the company's way of operating, BART is an outside source checking the backups on the company's behalf.

Auditors like to see manual or third-party tools periodically running and reporting on restore tests. For most companies this means manually testing restores of randomly selected VMs.
Although this approach works fine, it is labor-intensive and often skipped or postponed when more urgent matters demand the attention of the administrator team.

BART automates the VM recovery testing, reports the results via e-mail to the administrators, archives all results and can share a monthly summary of all results with anybody who is interested in, and privileged to see, such information.
This is how BART provides all the data needed to get on top of VM backup validations and audits.
BART not only stores the results in HTML/e-mail format and in its own database but also stores the actual restore output as received from the backup application as proof of the validation.


Scenario 5 - Live mount vs real-world restore times
You might be using your backup solution's integrated live-mount feature (starting the VM from the backup storage) to provide proof that the backup solution has created a backup that can be used to start the VM.


But when do you actually use these live mounts for real-world restores? Our data shows that most restores are actually moving data back from the backup solution to the live environment.

And what if you need to restore a whole set of VMs that are all part of an application that broke during an upgrade? How long will it take to restore that single VM or that group of VMs at the maximum performance of your backup solution?

BART's solution to scenario 5
Measuring real-world restore durations of virtual machines is important, as it is the only way to accurately predict the restore duration once the data is actually needed.
BART can give insight into the expected duration of a real-world restore of a single VM, but it also has a group function: the administrator can use VMware tags to create virtual BART groups that are restored during special group validation runs, in which BART measures the duration of the restore for the entire group of VMs.
These groups might, for instance, contain the VMs of a certain mission-critical application where there is an SLA on the set of VMs and not just on the individual VMs.
Multiple groups can be created and validated separately against their own group SLA, which is treated separately from the individual VM SLA, and a single VM can be assigned to multiple groups.
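
To make the group logic concrete, here is a small sketch of how tag-based groups and a group SLA check could fit together. It is an illustration only: the function names are hypothetical, and it assumes the group duration is measured from the start of the first restore to the end of the last one, which the text does not specify.

```python
from datetime import datetime, timedelta


def groups_from_tags(vm_tags):
    """Invert a {vm: [group tags]} mapping into {group: [vms]}.
    A VM carrying several tags ends up in several groups."""
    groups = {}
    for vm, tags in vm_tags.items():
        for tag in tags:
            groups.setdefault(tag, []).append(vm)
    return groups


def group_within_sla(restores, group_sla):
    """restores: {vm: (start, end)} datetimes measured during one group run.
    Assumption: group duration runs from the first start to the last finish."""
    start = min(s for s, _ in restores.values())
    end = max(e for _, e in restores.values())
    duration = end - start
    return duration, duration <= group_sla
```

Note that the group result is independent of the per-VM SLAs: every VM in a group can meet its own RTO while the group as a whole still breaches the group SLA.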

The screenshot above shows a group validation for a BART group called “PRIO2”.
The recovery of the group was not completed within the attached group SLA. One VM in the group (Linux 1) failed to restore, and another VM (Linux 6) was restored but its backup was more than 2 days old, which triggered a warning. More about that in scenario 6.

Scenario 6 – The VM is in the backup but its last usable backup is weeks old
I’m probably not alone in having had that sinking feeling when you need to restore a VM only to find out the last backup is from 3 months ago.
People, reporting, processes and software are not perfect; sometimes backups stop and problems go unmitigated for whatever reason. For some VMs this might not be such a big problem, for others a backup that is older than a few days is as good as useless.

So how do we add an extra layer of protection against aged backups that somehow went unnoticed?

BART's solution to Scenario 6
When BART is combined with the next-gen data protection solution Rubrik, it is able to determine the age of the most current backup and test this against the maximum age set in the BART configuration.
This way the BART administrators and recipients of the BART monthly report will be informed of any BART validations that found the most current backup of a validated VM to be older than the set maximum backup age.
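
The age test itself is simple to express. The following sketch assumes the timestamp of the most current backup is already available as a timezone-aware `datetime`; how BART obtains it from Rubrik is not covered here, and the function name `backup_too_old` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def backup_too_old(latest_backup, max_age, now=None):
    """Return (age, too_old) for the most current backup of a VM.

    latest_backup: timezone-aware datetime of the most current backup
    max_age: maximum allowed backup age as a timedelta
    now: optional reference time (defaults to the current UTC time)
    """
    now = now or datetime.now(timezone.utc)
    age = now - latest_backup
    return age, age > max_age
```

For example, with a maximum age of 2 days, a backup taken 3 days before the validation run would be flagged even though the restore itself succeeded.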

 

 

In the image above we can see a BART run that ran into a VM that restored just fine but the backup was created more than 2 days ago.

BART is the tool that is constantly looking for trouble with your restores before you actually need them. It will be warning you when it finds trouble and it will be archiving results for later comparison or auditing use.
This prevents surprises and adds an extra third-party layer to the data security modern backup solutions provide.

Scenario 7 – The “Okay, so how long will this restore actually take?”-question
Anybody who works with backup & restore will have had this question directed at them.
When trouble arises and options are weighed, a logical question to ask the people who administer the backup environment is “how long would a restore actually take if we choose to go the restore route?”.
It is a valid question, because when a rebuild takes 6 hours and a restore takes 15, nobody is going to opt for the restore.
Go ahead and try to answer that question in a timely and accurate manner based on what you currently know about a random VM in your environment.

BART's solution to scenario 7
Of course BART mails and archives all recovery reports and makes them searchable using the BART archive-query tool. BART also stores information about the most recent (group) validations in vCenter, on the virtual machine page, as Custom Attributes.
One of the fields is called “BART_RestoredIn” and states the most recent duration it took BART to restore the VM on the given date. In the example below it took BART 1 minute and 49 seconds on the 19th of December 2021 to restore a VM called “Windows 1”.

All of this information is right at your fingertips within the vCenter GUI.

 

Of course the information can also be accessed through scripting (using, for example, PowerCLI) and with tooling such as the popular RVTools.

 

 

As you can see, more validation information is stored under the BART_LastValidation custom attribute as well, giving administrators more information right from their vCenter interface.

Interested in learning more about BART or running a POC in your environment? Please contact stefan.folkerts@bpsolutions.com.


BART supports the following backup solutions:
• Rubrik
• Spectrum Protect (for Virtual Environments)
• Veeam (all license types)
