Ransomware has been in the news for many years, but the intensity and wide spread use of ransomware by groups of criminals has been accelerating heavily during the last year.
Backups are one of the most – if not the most – important defense against ransomware. But if subject to corruption, attackers will use your corrupted backups against you. Advanced ransomware is targeting backups as well – modifying or completely wiping them out – eliminating your last line of defense and driving large ransom payouts. Rubrik’s uniquely immutable filesystem natively prevents unauthorized access or deletion of backups, allowing IT teams to quickly restore to the most recent clean state with minimal business disruption. This blog walks you through our one-of-a-kind immutable architecture and robust security controls that harden your data from cyber-attacks.
The Effects of Ransomware
Ransomware is designed to encrypt your data so that it is no longer usable. Often, this means encryption of data held on primary storage to overwhelm IT and requires massive recovery efforts from tape or other archives. Additionally, lower level encryption of the Master Boot Record (MBR) or other operating system level encryption is used to prevent booting and other common operations. For virtualized environments, the shared data storage used to host virtual machines is a primary target, such as with NFS-backed datastores. This can effectively bring down critical services in an organization. The attackers then demand a ransom to unlock the data so that services can be resumed.
How Rubrik Has Helped Customers
Several customers have successfully survived a ransomware attack through the use of our immutable solution and instant recoveries as part of their defense in depth strategy.
For example, the City of Durham detected a ransomware attack on Friday, March 6, and their leaders credited their quick response to Rubrik’s backup solution. Durham Mayor Steve Schewell said, “The city can be assured that our backups are very good because they’re immutable. [This means that] they could not be consumed by ransomware.” As a result, they were able to quickly restore critical city services, including access to 911. In addition, Kerry Goode, Durham CIO, emphasized that core business systems, including ones that manage payroll, were back online by the start of the business week.
Kern Medical Center discovered a large ransomware attack in June 2019 when users reported they couldn’t access their systems. They were able to recover 100% of the impacted systems protected by Rubrik within minutes, including recovering their business-critical electronic medical record system. CTO Craig Witmer said, “After the incident, we were so impressed that we moved more of our legacy systems to Rubrik and are fully confident that Rubrik’s immutable backups will protect us from future incidents.”
How Do You Recover from a Ransomware Attack?
Data backups can be an effective way to restore data that has been locked/encrypted by the attack. However, what if your backup data is also encrypted or deleted by a ransomware attack? How do you ensure that your backup data is not vulnerable to these attacks?
The Key Is Immutable Backups
While primary storage systems need to be open and available for client systems, your backup data should be immutable. This means that once data has been written it cannot be read, modified, or deleted by clients on your network. This is the only way to ensure recovery when production systems are compromised.
This goes well beyond simple file permissions, folder ACLs, or storage protocols. The concept of immutability needs to be baked into the backup architecture so that no security exposure can tamper with the backups.
Rubrik Is Designed for Immutability
Rubrik uses an immutable architecture by combining an immutable filesystem with a zero trust cluster design in which operations can only be performed through authenticated APIs.
Rubrik’s approach is in contrast with other data management systems using general purpose storage that use standard protocols such as NFS or SMB to advertise their availability to a wide assortment of clients. We often find that data management solutions using general purpose storage have limited or ineffective means for securely transacting data and, in some cases, leave files in their native format while allowing clients to read the backup data directly. This is a breach of confidentiality and puts extra burden on the customer to secure the storage independent of their data management solution.
An Immutable Distributed Filesystem
One of our first design decisions was to construct Atlas, an immutable Filesystem in Userspace (FUSE) that was largely POSIX-compliant. This provides tight controls over which applications can exchange information, how each data exchange is transacted, and how data is arranged across physical and logical devices. Atlas is custom designed to be a distributed and immutable file system for writing and reading data for other Rubrik services.
Immutability is provided across two layers: the logical layer (Patch Files, Patch Blocks) and the physical layer (Stripes, Chunks). The dynamics between these two layers will be explained further in the next few sections.
The Logical Layer
All customer data brought into the system is written into a proprietary sparse file called a Patch File. These are append-only files (AOFs), meaning that your data can only be added to the Patch File while it is marked as being open. All of the customer snapshot and journal data is held within Atlas, which enforces the use of Patch Files in the underlying directory structure. This powerful filesystem will refuse writes at the API level that are not append-only, such as situations in which the write offset value does not equal the file size. Atlas has total control over how and where customer data is written.
If your backup data has been modified, then it’s essentially worthless. We solved for this by ensuring that checksums are generated for each Patch Block within a Patch File. These checksums are computed and written to a Fingerprint File stored alongside the Patch File. Rubrik always does a fingerprint check before committing any data transformations. This ensures that the original file remains intact with forced validation during read operations.
In order to counter a ransomware attack, the original, validated data must be restored from backup. Rubrik routinely verifies the Patch Blocks against their checksums to ensure data integrity at the logical Patch Block level. Patch Files are not exposed to any external systems or customer administrator accounts. This ensures that meticulous care is taken to restore exactly what you originally stored in a backup.
In a traditional approach, administrative access is granted to the filesystem – especially when using general purpose storage – which presents further confidentiality and integrity challenges and gives “Leakware” another attack vector. In addition, many other solutions simply restore whatever data is located in the backup folder or volume without performing validation and other due diligence on the data.
The Physical Layer
While the logical layer focused on data integrity at the file level, the physical layer is focused on writing customer data across the immutable cluster to achieve data integrity and data resiliency. To do this, Patch Files are logically divided into fixed length segments called Stripes. As Stripes are written, the AOF computes a Stripe level checksum, which it stores within each Stripe Metadata.
Stripes are further divided into physical Chunks stored on physical disks held within the Rubrik cluster. Activities such as replication and erasure coding occur at the Chunk level. Just as with Patch Files, as each Chunk is written, a Chunk checksum is computed and stored in the Stripe Metadata alongside the list of chunks. These checksums are periodically recomputed as part of Atlas’ background scan by reading the physical Chunks and comparing against the checksums in the Stripe Metadata. Additionally, if a data rebuild is needed, the resiliency provided by erasure coding is automatically leveraged in the background.
Zero Trust Cluster Design
Traditional approaches to cluster security often rely on a “full trust” model in which all members of the cluster are able to communicate with one another. In some cases, this includes root level authority, no mutual authentication checks, and the ability to read or modify your customer data that is held within the filesystem. This creates a weak surface area when designing a defense in depth architecture; if backup data can be compromised, there is no path to restoration when disruption occurs.
Secured Cluster Communications
Each cluster has some number of nodes that need to communicate with one another. This means we need to validate each node that wants to exchange data. For many solutions, there is little to nothing protecting node-to-node communication. At Rubrik, all of our intra-node and inter-cluster communication, as well as communication with external applications, use the TLS protocol with certificate-based mutual authentication for secure communication.
Rubrik does not use insecure protocols, such as NFS or SMB, to relay information within the cluster; all communication is performed through secure and trusted channels. In fact, all our internal communications use TLS 1.2 with strong cipher suites and Perfect Forward Secrecy (PFS).
Each Rubrik cluster shipped to a customer uses strong, randomized passwords on a per-node basis. There is no concept of a “admin/admin” style of default local authentication that is easily searchable on the web to add an attack vector.
Systems Hardening Standards
There are numerous other elements in position to protect the integrity of the system through internal hardening standards. Here are a few that help combat ransomware:
Rubrik adopted an API-first design as part of the architecture. We require authentication to all endpoints that are used to operate the solution. Authentication can be handled via credentials or secure token. This includes environments using our Role-Based Access Control (RBAC) or Multi-tenancy features to logically divide the roles, features, and resources that are under management. Rubrik’s CLI, SDKs, and other tools consume the API and are held to the same security requirements.
API endpoints that control the underlying behavior of the system require an additional level of authorization that can only be supplied from a certified technical support engineer. This prevents a malicious actor from being able to alter the behavior of a Rubrik cluster.
Numerous resources on the Internet advocate for a Defense in Depth strategy. This combines efforts across employee education and enablement, rapid deployment of patches, and a solid backup and recovery plan. In this post, I described how Rubrik uses a combination of data immutability and a zero trust cluster design to build a great product for protecting and recovering data. We help organizations further strengthen their ransomware response strategy with our application Radar to increase visibility into the scope of attack. This allows organizations to quickly pinpoint which applications and files were impacted and where they reside to further minimize business impact. Learn more about Radar here.
Many of our customers turn to Rubrik on their worst day. They need to be able to reliably recover from ransomware attacks to ensure minimal downtime of their critical services. A product with a truly immutable architecture provides our customers the peace of mind that when they need to, they can always access the data to recover from such debilitating attacks.
Let’s face it - preventing a ransomware attack is hard. Some may say it is near impossible even with the latest technology and a sound defense-in-depth approach. So, if there’s no surefire way to prevent an attack, recovery is the next best option. Within a ransomware recovery plan, though, there lies many decisions and nuances. For example, what should the priority be - quick recovery and return to operations, forensics to determine the cause of the attack, or minimizing data loss during recovery?
If a quick recovery is prioritized, then organizations are generally sacrificing the ability to do forensics to determine how the attack occurred and propagated which opens the door to a repeat attack. They’re also deciding to forgo determining what data was affected and what was not. This means that they’ll be recovering data that wasn’t touched during the attack and overwriting good data with older data during the recovery. By prioritizing forensics, organizations are investing in making sure the same attack cannot happen again. To do this, however, takes time, expertise, and tools which can cause the business to be down far longer than they’d like. Similarly, determining exactly what data is impacted as part of the attack can also take time for administrators to pour through logs to assess the situation.
Each one of those priorities seems to bring with it a substantial downside that makes it very difficult to choose. So, newer technologies, like Machine Learning (ML), have been sought after to address those choices. Conventional technologies use algorithms that require humans to explicitly program actions. While these algorithms can be quite complex and powerful, they can only do what the programmer has enabled them to do. In other words, if the algorithm encounters an unforeseen situation, it likely will cause an error or otherwise not reach the desired result. For example, a programmer might need to create an algorithm to analyze a batch of pictures and determine if a dog was in the picture. That programmer would have to decide what characteristics a dog has upfront. And then it is a binary choice as to whether they decide if the picture has a dog or not. Imagine all of the possibilities and the difficulties!
On the other hand, ML is a system that can learn and adapt without being given explicit instructions. ML can use algorithms and statistical models to analyze patterns in data. Going back to the example, a ML system doesn’t require that upfront definition of what is a dog and what is not. For the system to know what a dog is, the programmer can “feed” the system with pictures that contain dogs and those images are tagged by the programmer to let the system know that those are indeed pictures of dogs. This allows the ML system to “learn” what a dog is. Conversely, pictures without dogs can be fed into the system and tagged as not having dogs. The ML system then can set off on its analysis quest to find pictures that contain dogs while also continuing to refine and tune (or learn) its understanding. This continuous learning and adaptation is key.
Now, let’s take a look at how Machine Learning can help when we’re dealing with ransomware.
Applying Machine Learning Models to Ransomware Recovery
An organization’s backup data is rich with information. This includes the content itself along with metadata such as path, size, ACL details, UIDs, GIDs, and other attributes. The Rubrik Zero Trust Data Management™ platform can then feed that information into a machine learning pipeline that forms intelligent insights that streamline the ransomware recovery decision-making process. Let’s break down how Rubrik applies machine learning models to your data.
At Rubrik, we refer to backups as snapshots. These snapshots are being taken via the on-premises Rubrik Cloud Data Management (CDM) platform. Once a snapshot is completed in CDM, a filesystem metadata diff (FMD) file is created. This FMD file contains a list of entries corresponding to files that have been created, deleted, or modified, and is essentially a log of the file changes that have taken place on the backup. Instead of running the computationally intensive machine learning pipeline locally in CDM, we upload the FMD files to Rubrik Polaris to be processed by the machine learning pipeline residing in the cloud.
Rubrik Polaris is a SaaS Data Management platform that offers services such as cloud-native protection for AWS, Azure, and Google Cloud, Microsoft 365, Rubrik Polaris Radar for ransomware detection and recovery, and Rubrik Polaris Sonar for compliance.
To be clear, only FMD files and their associated metadata are transmitted to Rubrik Polaris which means that customers need not be concerned about their sensitive data being transmitted outside of their data center. So, not only do we have zero impact on the production infrastructure and applications including backups, but we get to leverage the scale-out compute performance of the public cloud with Rubrik Polaris all while ensuring a security-first approach to data.
Training the Model
Once the FMDs land in Rubrik Polaris, we leverage a deep neural network (DNN) to build out a full perspective of what is going on with the workload. The DNN is trained using supervised learning, which consists of presenting labeled data to a machine learning model to give it a training signal from which to learn. The DNN can then identify trends that exist across all samples and classify new data by their similarities without requiring human input. Remember the previous dog example? This is analogous to the system seeing more and more dog pictures and using those additional data points to become more accurate over time.
So, let’s look at how the DNN ultimately decides whether there has been a ransomware attack. The DNN analyses data via a machine learning pipeline for Rubrik Polaris Radar that consists of two models: an anomaly detection model and an encryption detection model.
These models and flow can be summarized as follows:
- File System Behavior Analysis: Performs behavioral analysis on the file system metadata information by looking at items like number of files added, number of files deleted, and so forth.
- File Content Analysis: If an anomaly is detected during the previous step, Radar performs an analysis to determine if there is a characteristic sharp increase in file entropy that signals a ransomware attack.
Overall, this pipeline excels at creating a historical baseline that gets refined over time. If an anomaly alert is generated, Radar can go deeper into the content of the files to look for signs of encryption and compute an encryption probability using a statistical model. This allows the analysis pipeline to compute entropy features to measure the level of encryption in the file system efficiently.
Testing Known Live Ransomware Samples
While all of the conceptual design behind Radar may sound great, how do we know it works? After all, it’s not like you are going to let loose various ransomware variants into your production environment just to see if Radar sends over an alert, right? :-)
Radar’s detection model was trained, validated, and tested against a large amount of real-world labeled data containing a diverse mix of snapshots from real-world usage, simulated usage, and snapshot changes caused by various ransomware and malicious activities.
For the machine learning pipeline, we followed a standard practice of segmenting the labeled data into 3 categories: training, validation and testing. This enabled us to ensure that the model was not overfit to the testing data; training and validation sets are used to tune the model, while testing data is used to evaluate the model on unseen data.
Even before today’s accelerated proliferation of ransomware, Rubrik had set out on a mission to help customers recover from ransomware. Our lead security engineer was quoted saying “With an effective backup solution, ransomware can ideally be reduced to a minor inconvenience.” Today, we can see that many of our customers are finally able to gather a clear picture into the anomalies that impact their environment regularly.
As ransomware becomes increasingly sophisticated and continues to adapt, successful attacks are more prevalent. Powered by machine learning, Rubrik Polaris Radar enables enterprises to respond quickly to the latest threats automatically and thus accelerates recovery by minimizing business disruption and data loss.
To learn more about how Rubrik can help you recover quickly from ransomware, contact us for a short introduction meeting and a demonstration.
Source: This blog is based on a article by Chris Wahl, Chief Technologist at Rubrik