Akurative Landing: 5 Backup Mistakes That Could Cost You Your Infrastructure
Imagine: you jump from a plane, pull the ripcord, and instead of a parachute, a note flies out saying, "404: Not Found." In IT infrastructure, a backup is exactly that parachute. The problem is that 90% of engineers believe that as long as the pack is on their back, gravity won't affect them. Few check how it was packed or train for the landing itself.
So here is a brief guide to packing the parachute: the classic mistakes and how to avoid them, without the fluff.
1. Manual mode: backup "from memory"
The essence is simple: the admin runs a script or copies data whenever there is a free minute. The key metric here is RPO (Recovery Point Objective): the maximum amount of data, measured as a span of time, that you are willing to lose in a failure. If you make backups manually once a day, your RPO is 24 hours, and that assumes nobody ever forgets. For most production systems today, that is far too much.
How to do it right: Full automation. The backup should be a background process that does not depend on whether the responsible engineer got enough sleep last night. Modern orchestration systems allow incremental backups every 15-30 minutes, which shrinks the potential loss from a day of work to minutes.
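To make the RPO arithmetic above concrete, here is a minimal sketch of the kind of check a monitoring job could run. The function names and the 30-minute target are illustrative assumptions, not part of any particular backup product.

```python
from datetime import datetime, timedelta


def data_at_risk(last_backup_at: datetime, now: datetime) -> timedelta:
    """Return how much recent work would be lost if a failure happened at `now`."""
    return now - last_backup_at


def rpo_breached(last_backup_at: datetime, now: datetime, rpo_target: timedelta) -> bool:
    """True when the newest backup is older than the RPO target allows."""
    return data_at_risk(last_backup_at, now) > rpo_target


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0)
    target = timedelta(minutes=30)               # assumed RPO target

    nightly = datetime(2024, 1, 1, 0, 0)         # last once-a-day copy
    incremental = datetime(2024, 1, 1, 11, 50)   # last automated increment

    print(rpo_breached(nightly, now, target))      # True: 12 hours of data at risk
    print(rpo_breached(incremental, now, target))  # False: 10 minutes at risk
```

A check like this only tells you when the schedule has slipped. It says nothing about whether the backup itself is usable.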
2. Schrödinger's backup
The presence of a file in storage does not mean that you have a backup. Until you try to restore the system from it, the copy exists in superposition: it is both working and not working at the same time. Validation is often skipped because it consumes resources and time. As a result, at the critical moment it turns out that the backup is corrupted, and your effective RPO doubles or triples, because you have to fall back to the previous copy (which is not guaranteed to be good either).
How to do it right: Set up automatic checksum verification and, more importantly, periodic test restores. If your DR system can automatically bring up virtual machines from copies in an isolated network for testing, you are on the right track.
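The checksum half of that advice can be sketched in a few lines using only the Python standard library. This assumes the SHA-256 digest was recorded when the backup was written; `sha256_of` and `verify_backup` are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large backup archives do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(path: Path, expected_sha256: str) -> bool:
    """True only if the file exists and still matches the digest recorded at backup time."""
    return path.is_file() and sha256_of(path) == expected_sha256
```

A matching checksum only proves the file has not changed since it was written. It does not prove the system will boot from it, which is why the test restore matters more.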
3. The finishing shot in production
Suppose there was a failure and the data is corrupted. The admin starts recovery from the last backup directly over what remains on the disks. Partway through, it turns out that the backup is also corrupted or contains an old version of the configuration. By overwriting the original (albeit damaged) data, you cut off your own retreat: the fragments that had survived on the source are permanently destroyed by the failed recovery.
How to do it right: Always take a snapshot before recovery, or recover to a new environment or instance. This lets you compare the data and, if necessary, assemble a working version from pieces of different copies.
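As a rough file-level illustration of "never restore on top of the original", the sketch below copies the damaged live data aside and unpacks the backup into a separate directory. A plain directory copy stands in for a real storage or VM snapshot here, and the function and directory names are assumptions made for the example.

```python
import shutil
from pathlib import Path


def restore_side_by_side(backup_dir: Path, live_dir: Path, work_root: Path) -> tuple[Path, Path]:
    """Restore next to the live data instead of on top of it.

    Returns (snapshot_of_live, restored_copy). `live_dir` itself is never modified.
    """
    snapshot = work_root / "pre_restore_snapshot"
    restored = work_root / "restored"
    # copytree raises FileExistsError if the target already exists, so an
    # earlier snapshot or restore attempt is never silently overwritten.
    shutil.copytree(live_dir, snapshot)
    shutil.copytree(backup_dir, restored)
    return snapshot, restored
```

Only after the restored copy has been checked would you switch production over to it. Until then both the damaged original and its snapshot remain available for comparison.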
4. Playing Tetris
Backups tend to grow. Sooner or later, there comes a moment when the next copy simply does not fit on the disk. This often leads to mistakes: old backups are deleted manually in haste to free up space for new ones. In this chaos, it is easy to wipe out the only working full copy, leaving only useless increments.
How to do it right: Use retention policies and automatic rotation. The system itself should know what to deduplicate, what to move to "cold" storage, and what to delete after the retention period expires.
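The trap described above, deleting the full copy that the remaining increments depend on, is easy to guard against in code. Here is a minimal sketch of a retention rule that respects backup chains. The `Backup` record and its `parent` field are a simplified, assumed data model, not the format of any real tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class Backup:
    name: str
    created: datetime
    parent: Optional[str] = None  # None means a full backup; otherwise the copy this increment builds on


def deletable(backups: list[Backup], now: datetime, retention: timedelta) -> list[str]:
    """Names of backups that are past retention AND not needed by any backup we keep."""
    by_name = {b.name: b for b in backups}
    # Start with everything still inside the retention window.
    keep = {b.name for b in backups if now - b.created <= retention}
    # Then protect the whole chain each kept backup depends on, back to its full copy.
    for name in list(keep):
        parent = by_name[name].parent
        while parent is not None:
            keep.add(parent)
            parent = by_name[parent].parent
    return sorted(b.name for b in backups if b.name not in keep)
```

With a 7-day retention, a 10-day-old full backup survives as long as a 2-day-old increment still points at it, while an older chain with no recent increments is released as a whole.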
5. Recovery through destruction
Some admins, in an effort to save space, first delete the old copy and only then start creating a new one. This creates a "window of vulnerability." If a failure occurs while the new backup is being created, you are left without any usable copy at all. The RPO at this moment tends to infinity.
How to do it right: Follow the "3-2-1 rule": at least three copies of the data, on two different types of media, one of which is offsite (in the cloud or another data center). And never delete the old recovery point until the new one is verified.
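Both rules are simple enough to encode as checks. The sketch below is one possible way to do it: `Copy`, `satisfies_3_2_1`, and `rotate` are hypothetical names, the copy list is assumed to include the production data itself, and `verify` is whatever validation you trust (a checksum, or better, a test restore).

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass(frozen=True)
class Copy:
    location: str  # illustrative label, e.g. "prod", "nas", "cloud"
    media: str     # e.g. "disk", "tape", "object-storage"
    offsite: bool


def satisfies_3_2_1(copies: list[Copy]) -> bool:
    """At least 3 copies, on at least 2 media types, with at least 1 offsite."""
    return (
        len(copies) >= 3
        and len({c.media for c in copies}) >= 2
        and any(c.offsite for c in copies)
    )


def rotate(old: Path, new: Path, verify: Callable[[Path], bool]) -> bool:
    """Delete the old recovery point only after the new one passes verification."""
    if not verify(new):
        return False  # keep the old copy; the new one cannot be trusted yet
    old.unlink(missing_ok=True)
    return True
```

The ordering inside `rotate` is the whole point: the delete happens last, so there is never a moment when no verified copy exists.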
The human factor remains the main reason recoveries fail. This is why the industry is moving towards DR (disaster recovery), which automates not only backups but also the orchestration of recovery across the entire infrastructure. I have already written about the difference between regular backups and DR, and why the former is often useless in the enterprise, here.
If you're interested in seeing how these issues (validation, scalability, and automation) are being addressed in practice, join us for a webinar. We will be discussing release 4.4.1 and the updated roadmap of Akura.