Do you know that uneasy feeling, of not being in control of something?
Have you noticed that this feeling grows the bigger the thing is that's slipping more and more out of your control?
Do you belong to the group of people who prefer the attitude "What I don't know, I don't care to know..." ?
Either way. You might want to read this article.
A music data collection can be considered quite a treasure. Protecting such a value should be of highest priority.
To do that properly - protecting that treasure - requires that you get on top of things.
And this requires quite some enthusiasm, knowledge and actions (work!).
Why am I bringing all this up? One day I realized that I had lost valuable audio files and that other files had become corrupted.
And I also realized that, without noticing, I had been backing up these undetected problems all along.
It's time to get my act together.
Many of us have spent hundreds of hours shaping our audio collections.
We rip it. We rip it a 2nd time (I did). We get the tags right. We find the right cover art.
We might re-encode or transcode data. We might add replay gain tags.
We copy it from a to b to c. We do this or that. We continuously shape it.
An audio collection is a living organism.
Once you're finished with a job - any of them - you run a backup...
So far so good.
Usually most people (I know) just simply copy/paste the data from one disk to another.
Over time that'll be done many times. You keep copying your data back and forth.
During the process you trust your OS (operating system) and integrated tools that these take care of the integrity of your data. You also assume that nobody else messes with your data.
Here it is. The reason for feeling uneasy about all this!
You trust, You assume...
...it'll be all right.
I'm now telling you -- it's not gonna be all right!
Over the last weekend I ran a data analysis.
Surprise, surprise. I found that 30 audio files had been corrupted. 30 files.
And guess what. No two of them belonged to the same album. That means about 30 albums were incomplete.
That really surprised me. I thought I'd be doing things right. Obviously the way I handled my collection must have had some weak spots. I simply underestimated the situation.
I re-ripped those CDs. Lucky me, I still own the CDs. And it was just 3 hours of work.
What if you downloaded your tracks??
This experience kept me thinking: I need to improve my data management process.
The new goal now is to make 100% (or close to it) sure that no data loss can occur.
The whole process must become an automated process - as much as possible.
I have to. Laziness and ignorance will be creeping in sooner or later. It can't be avoided.
How did it/does it happen ?
Perhaps those of you still reading this article are asking yourselves
"How did he realize that data got corrupted???"
The vast majority of my collection is flac. flac offers an integrity check as part of the package.
It compares an MD5 checksum, which gets stored inside the flac file at encoding time, with a checksum computed over the audio data as the file is decoded.
If the stored checksum and the freshly generated one match, the file is OK.
If they don't match, we know that anything from a single bit up to the entire audio track has changed or is gone. It's corrupted!
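The principle can be tried out with the generic md5sum tool. This is only a minimal sketch of "store a checksum, recompute it later, compare"; it hashes a whole file, whereas flac hashes the decoded audio only. The function name verify_file and the demo file are made up for illustration.

```shell
#!/bin/sh
# Illustration only: compare a stored checksum with a freshly computed one.
# flac does the equivalent internally over the decoded audio stream.

# Print "OK" if the file still matches the stored checksum, else "CORRUPT".
# verify_file and its two arguments are hypothetical names for this sketch.
verify_file() {
    file=$1
    stored=$2
    current=$(md5sum < "$file" | cut -d' ' -f1)
    if [ "$current" = "$stored" ]; then
        echo "OK"
    else
        echo "CORRUPT"
    fi
}

# Example run on a throwaway file:
#   printf 'some audio data' > /tmp/demo.bin
#   sum=$(md5sum < /tmp/demo.bin | cut -d' ' -f1)
#   verify_file /tmp/demo.bin "$sum"      # prints OK
#   printf 'X' | dd of=/tmp/demo.bin bs=1 seek=3 conv=notrunc 2>/dev/null
#   verify_file /tmp/demo.bin "$sum"      # prints CORRUPT
```

Overwriting a single byte is enough to change the checksum, which is exactly why one flipped bit in a track shows up in the test.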
That's how I found out. I ran a bulk integrity check over my entire collection. And that usually takes some hours to conclude.
I did realize though that this is actually not the full story.
Because so far I had identified the corrupted files only! What I don't know at this stage is whether any files are missing!
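One way to catch missing files is to keep an inventory and compare against it later. The following is a minimal sketch, not a finished tool: the function names and the manifest file are assumptions for this example, and file names containing line breaks are not handled.

```shell
#!/bin/sh
# Sketch: detect files that have disappeared since the last inventory.
# make_manifest, list_missing and the manifest file are hypothetical names.

# Print a sorted list of all flac files below a directory (relative paths).
make_manifest() {
    ( cd "$1" && find . -type f -iname '*.flac' | sort )
}

# Print every path listed in the old manifest that is no longer on disk.
list_missing() {
    music_dir=$1
    old_manifest=$2
    new_manifest=$(mktemp)
    make_manifest "$music_dir" > "$new_manifest"
    # comm -23 keeps the lines that appear only in the first (old) list
    comm -23 "$old_manifest" "$new_manifest"
    rm -f "$new_manifest"
}

# Usage (paths are examples):
#   make_manifest /media/music > "$HOME/music-manifest.txt"
#   ... weeks later ...
#   list_missing /media/music "$HOME/music-manifest.txt"
```

The manifest only makes sense if it is created while the collection is known to be complete, and it needs to be refreshed whenever you add or rename albums on purpose.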
Let's have a little brainstorming session about the points at which data corruption or loss can occur. There are many occasions.
- The typical copy/paste and backup routines "try" to cope with it. They are supposed to check overall integrity and usually report errors or deviations if a problem is detected. Copy/paste is not a safe mechanism though. No professional administrator would use this method.
- Harddisk and SSD failures. Data corruption can occur very silently on aging or worn-out harddisks. Look at e.g. Amazon reviews or elsewhere. People report HDDs or SSDs dying after 6 months. I personally wouldn't trust an HDD that's older than 2-3 years.
- Unfortunate operating conditions, such as a power outage at the wrong moment, and so forth.
- Software flaws (no SW is flawless) OS and apps (e.g. bulk conversion tools)
- System overload conditions - during overload conditions weird things can happen
- Cyberattacks/Malware - causing any kind of weird stuff
- Simple user faults - pushing a wrong button
With all this in mind, it doesn't take a genius to realize that your data can get corrupted and that you might overwrite clean backups with corrupted ones.
A closer look
The main challenge with all this is that many of these corruptions and losses occur under the hood. You simply don't see it happen.
Copy/paste doesn't generate logs. How would you know that anything happened during - sometimes - hours of copying?
With tools that provide logging options you could see a lot of what's happening. Though I'd guess the vast majority of people out there simply don't make use of logs.
And this is why it happens that you potentially inject more and more corruption into your database over time.
What to do about it?
It's pretty simple - once you are aware of the problem. And now you are aware of it. You can do something about it.
We just need to establish a backup process that's doable.
Before we can establish such a thing we need to get the current data collection under control.
- Review your storage media (run file system checks, check the drives' hours of operation, and so forth)
If needed, replace your HDDs. 2TB drives run at around 70$/€ nowadays.
Don't use aged and/or phased-out drives as backup media.
After 3 years max I buy myself a new master drive.
- Make sure that the existing backups are OK
Many backup tools offer TestRun backup options. The backup gets simulated.
That's a pretty handy function. You can run test runs in both directions and then analyze the logs.
If you own a lot of flac files, run a bulk integrity check. Several audio tools, as well as the flac binary itself, offer this option.
Note: A file with an equal timestamp and an equal size doesn't have to be equal!
A byte can flip and your backup tool (usually) won't notice it. You can address this by running checksum tests, and many backup tools offer that option. The drawback: comparing checksums for every file slows a backup down substantially.
- If you start with a cleanup exercise - make sure you have one extra backup.
- If you own just one backup disk, you had better add a 2nd (long-term) backup medium
- Review your data formats
Choose a format with builtin integrity check option, e.g. flac.
- You need to introduce logs. Each backup needs a log file. Store these logs on the backup media. And then you need to analyze the logs!
Look for errors or changes that don't make sense.
- Introduce incremental backups! You know the expression "restore points" from
e.g. Windows. In a separate backup directory only the delta backups get saved.
And the original backup remains untouched.
Incremental backups are a major safety net to prevent overwriting your clean backups.
And IBs let you roll back in time for quite a period.
It requires a bit more space on the backup media though. It works well with minor data collection updates here and there.
If you make major data changes, e.g. bulk conversions, you'll run out of disk space.
From that point on you usually need to build a new base backup!
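The note above about equal timestamps and equal sizes can be reproduced in a few lines. This is a minimal sketch, assuming a Linux shell with the usual core utilities; same_meta and same_content are helper names invented for this example.

```shell
#!/bin/sh
# Sketch: two files with identical size and timestamp can still differ.
# same_meta and same_content are illustrative helpers, not standard tools.

# True if both files have the same size and neither is newer than the other.
same_meta() {
    size_a=$(wc -c < "$1")
    size_b=$(wc -c < "$2")
    [ "$size_a" -eq "$size_b" ] && [ ! "$1" -nt "$2" ] && [ ! "$2" -nt "$1" ]
}

# True only if the files are byte-for-byte identical.
same_content() {
    cmp -s "$1" "$2"
}

# Demo on throwaway files:
#   printf 'AAAA' > /tmp/a.bin
#   cp -p /tmp/a.bin /tmp/b.bin
#   printf 'B' | dd of=/tmp/b.bin bs=1 seek=1 conv=notrunc 2>/dev/null
#   touch -r /tmp/a.bin /tmp/b.bin        # copy the timestamp back
#   same_meta    /tmp/a.bin /tmp/b.bin && echo "size and time match"
#   same_content /tmp/a.bin /tmp/b.bin || echo "content differs"
```

A tool that looks only at size and timestamp behaves like same_meta and would treat the two files as identical. Only a content comparison, by checksum or byte by byte, sees the difference.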
To build the audio-file safety net described above, we need an audio data format which comes with an embedded integrity mechanism - a checksum.
flac would be the preferred option.
.wav files, for example, are a no-go.
Let's go on with flacs.
Flacs come with an embedded md5 checksum. That checksum is generated over the audio data chunk only. Which is good. Having a little problem or a change in the tag area is not critical and won't have an impact on the checksum mechanism. That checksum is written into the flac file.
That checksum will be renewed, as soon as you do some re-encoding or transcoding of the flac data.
If the flac codec detects a mismatch - while decoding - between the stored MD5 checksum and the decoded audio data, it will issue a corruption message and stop decoding!
That corruption message is KEY. You need to look for it and you have to fix the affected track on either of your storage devices.
How to introduce this to our "backup" AND not to forget "restore" procedure?
(We don't want to restore a corrupted collection!)
The flac(.exe) executable (available for Windows and Linux) can identify that checksum mismatch and issues error messages.
What you do is run a test-decoding ( flac -t <file> ) on all files on your master HDD and on your backup media before and after each of your backups. (That check can easily take an hour on e.g. 4.5k-5k tracks.)
While doing the test-decoding, flac(.exe) will issue an error message if there's a problem with the checksum. You can write a batch file under Windows or a script (see Appendix II) under Linux.
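Such a script could look roughly like this. It is a minimal sketch that relies on flac returning a non-zero exit status when the test-decoding fails, instead of searching the messages for keywords. The function name check_flacs, the list file and the FLAC override variable are assumptions for this example.

```shell
#!/bin/sh
# Sketch: test-decode every flac file below a directory and list the failures.
# FLAC may be set to a specific flac binary; check_flacs and the list file
# argument are names made up for this example.
FLAC=${FLAC:-flac}

check_flacs() {
    music_dir=$1
    bad_list=$2
    : > "$bad_list"
    # flac -t exits non-zero when test-decoding fails; -s keeps it quiet.
    find "$music_dir" -type f -iname '*.flac' -exec sh -c '
        flac_bin=$1; bad_list=$2; shift 2
        for f in "$@"; do
            "$flac_bin" -s -t "$f" 2>/dev/null || printf "%s\n" "$f" >> "$bad_list"
        done
    ' sh "$FLAC" "$bad_list" {} +
    # Succeed only if no file failed the test
    [ ! -s "$bad_list" ]
}

# Usage (paths are examples):
#   check_flacs /media/music /tmp/bad-flacs.txt || echo "fix these files first"
```

Run it once on the master disk before a backup and once on the backup media afterwards. The list file then names exactly the tracks that need to be re-ripped or restored from a clean copy.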
There's also another tool you might use:
The dbPoweramp Reference converter tool offers a function called "Test Conversion". It basically does what the flac binary does when run with the -t option. You can also use that one. The dbPoweramp Batch converter tool lets you check your entire disk at once. That's most convenient. dbPoweramp also offers a feature called Move Destination File on Error that automatically moves the corrupted data to a predefined target directory.
That's basically all we need to do.
Yep. That's about it.
A pretty simple measure - though very powerful. I'm sure my collection is kept much safer now.
I guess 99.9% of all collections out there got a problem. You might consider doing something about it.
If you follow my advice, I'd be really interested to see you report back your corruption rate (and associated backup strategy). ;)
1. Don't run simple copy/paste backups. Use backup tools, and look up the recommended settings for these tools!
Usually you'll find numerous parameters which won't mean much to you in the beginning. Later on you
usually realize why these parameters were introduced. Spend a little time on the subject.
2. Introduce incremental backups, which keep the original data as long as possible and store the "delta" data
at a different place on the same disk.
This way you save quite some space. You can have several backup cycles on one disk.
3. Use at least 2 backup disks - stored at different locations.
4. Buy disks which are used in the professional area. Cheap consumer stuff is not recommended.
5. You don't need the fastest disks (they wear out faster). You need the most reliable ones.
6. Don't use old disks (e.g. the ones just replaced by your brand-new SSD) as backup media.
7. Check the data integrity separately - see above.
8. Automate the process as much as possible and keep the logfiles.
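Points 2 and 8 can be combined in a small script. The following is a minimal sketch using rsync's --compare-dest option, which writes only the files that differ from an existing base backup into a separate directory and leaves the base backup untouched. The function name delta_backup, the directory layout and the RSYNC override variable are assumptions for this example.

```shell
#!/bin/sh
# Sketch: save only the files that differ from the base backup into a dated
# "delta" directory, and keep a log file next to it.
# RSYNC may be overridden; all paths and the function name are assumptions.
RSYNC=${RSYNC:-rsync}

delta_backup() {
    source_dir=$1      # master copy, e.g. /media/music
    base_backup=$2     # full backup made earlier (use an absolute path)
    delta_root=$3      # where the dated delta directories are collected
    stamp=$(date +%Y-%m-%d_%H%M%S)
    target="$delta_root/$stamp"
    mkdir -p "$target" || return 1
    # -a              keep timestamps, permissions and directory structure
    # --compare-dest  skip files that are identical in the base backup
    # --log-file      keep a record of what was transferred
    "$RSYNC" -a --compare-dest="$base_backup" \
        --log-file="$delta_root/$stamp.log" \
        "$source_dir/" "$target/"
}

# Usage (paths are examples):
#   delta_backup /media/music /media/backup/base /media/backup/deltas
```

Note that this sketch does not record files that were deleted on the master, so the logs and a separate integrity check remain necessary.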
Full Backup ( all partitions and bootsector)
Windows 7 - Backup and Restore
Acronis - The Free WD Version
dd (Linux - commandline)
Windows 7 - Backup and Restore
rsync (Linux and Windows - commandline - also remote via network! (ssh))
Note: The W7 tools are meanwhile able to compete with other commercial software.
IMO there's no need to go for Norton or Acronis.
Secure Copy ( with CRC check and error logging on failed transfers)
rsync (Linux and Windows - commandline)
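For rsync, a checked copy with logging might look like the sketch below. It first simulates the run and then performs it, comparing file contents by checksum instead of by size and timestamp. The function names, the paths and the RSYNC override variable are assumptions for this example.

```shell
#!/bin/sh
# Sketch: preview a copy first, then run it with content checks and a log.
# RSYNC may be overridden; the function names and paths are assumptions.
RSYNC=${RSYNC:-rsync}

# Dry run: list what would change, comparing file contents by checksum.
preview_copy() {
    # -n dry run, -c compare by checksum, -i itemize each difference
    "$RSYNC" -a -n -c -i "$1/" "$2/"
}

# Real run: same comparison, with every transfer written to a log file.
secure_copy() {
    "$RSYNC" -a -c -i --log-file="$3" "$1/" "$2/"
}

# Usage (paths are examples):
#   preview_copy /media/music /media/backup/music
#   secure_copy  /media/music /media/backup/music /media/backup/music.log
```

The -c option makes rsync read every file on both sides, so expect such a run to take considerably longer than a plain one.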
Advice: It's always recommended to use a defragmented disk.
Copying data back and forth all the time, or any other job you run on your data, such as adding RG tags or similar, causes a lot of fragmentation on the disk. That fragmentation is gone once you back up your disk to another disk. If you can easily swap your master disk with the backup disk, use your backup disk as the new master.
This way you always run a rather defragmented disk, without running an annoying defragmentation process, which puts a lot of stress on your HDD.
Since I'm at home in the Linux world, I've written a simple one-liner that accomplishes the flac integrity check over your entire harddisk(s):
Open a terminal first.
Copy/paste the command below as one line:
find / -type f -iname "*.flac" -print0 | while IFS= read -r -d '' j ; do flac -s -t "$j" 2>>/tmp/flac-integrity.log ; done
That'll take some time (hours) - 1 to 2 s per file.
You can replace "find /" with e.g. "find /media/music" to specify a specific music directory.
You'll find the scan result in /tmp/flac-integrity.log
To test above you might copy a CD to /tmp first and then you replace "find /" with e.g. "find /tmp"
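You can then run:
grep -i "error" /tmp/flac-integrity.log | wc -l
That'll tell you whether any problems were found, and how many. The -i option makes the match case-insensitive, so messages written as ERROR are counted as well.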