The silent death...


Do you know that uneasy feeling of not being in control of something?
And have you noticed that this feeling grows the bigger the thing is that's getting more and more out of control?

Or... 

Do you belong to the group of people who prefer the attitude  "What I don't know, I don't care to know..." ? 


Either way. You might want to read this article. 

A music data collection can be considered quite a treasure. Protecting such a valuable asset should be of the highest priority.

To do that properly - protecting that treasure - requires that you get on top of things. 
And this requires quite some enthusiasm, knowledge and actions (work!).

Why am I bringing all this up? One day I realized that I had lost valuable audio files and that other files had become corrupted.
And I also realized that, because these issues had gone undetected, I had been backing up the damaged files all along.

It's time to get my act together.



What's happening?

Many of us have spent hundreds of hours shaping our audio collection.
We rip it. We rip it a 2nd time (I did). We get the tags right. We find the right cover art.  

We might re-encode or transcode data. We might add replay gain tags.
We copy it from a to b to c. We do this or that. We continuously shape it.  
An audio collection is a living organism.

Once you're finished with a job - any of them - you run a backup...

...in theory.

So far so good.

Usually most people (I know) simply copy/paste the data from one disk to another.
Over time that'll be done many times. You keep copying your data back and forth.

During the process you trust your OS (operating system) and its integrated tools to take care of the integrity of your data. You also assume that nobody else messes with your data.

Here it is. The reason for feeling uneasy about all this!

You trust, You assume...

...it'll be all right.


I'm now telling you -- it's not gonna be all right!


Over the last weekend I ran a data analysis.  
Surprise, surprise. I found that 30 audio files had become corrupted. 30 files. 
And guess what. No two of them belonged to the same album. That means about 
30 albums were incomplete.

That really surprised me. I thought I'd been doing things right. Obviously the way I handled my collection must have had some weak spots. I simply underestimated the situation.

I re-ripped those CDs. Lucky me. I still own the CDs. And it was just 3 hours of work.
What if you had downloaded your tracks??

This experience got me thinking: I need to improve my data management process.

The new goal is to make 100% (or close to it) sure that no data loss can occur.
The whole process must be automated - as much as possible.

I have to. Laziness and ignorance will creep in sooner or later. It can't be avoided. 


How did it/does it happen?



Perhaps those of you still reading this article are asking yourselves

"How did he realize that data got corrupted???"

The vast majority of my collection is flac. flac offers an integrity check as part of the package.
What it does is compare an md5 checksum, which gets stored inside the flac file during encoding, with a checksum computed over the audio data while the file is decoded in realtime.  
If the stored and the freshly computed md5 checksums match, the file is OK.
If they don't match, we know that anything from a single bit up to the entire audio track could 
have been changed or be gone. It's corrupted!

That's how I found out. I ran a bulk integrity check over my entire collection. And that usually takes some hours to conclude. 
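
If you want to try this on a single file first, the flac binary itself does it with the -t (test) switch. A minimal example - the filename is of course just a placeholder:

flac -t "some-album/01 - some track.flac"

flac reports an "ok" for a healthy file and an error message for a damaged one. The bulk version of this check over a whole collection is shown in Appendix 2.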

I did realize though that this is actually not the full story.

Because I had actually identified the corrupted files only! What I don't know at this stage is whether any files are missing! 
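
One simple way to at least spot missing files is to compare the file listings of master and backup. A minimal sketch, assuming the master lives under /media/music and the backup under /media/backup/music (adjust the paths to your setup):

( cd /media/music && find . -type f | sort ) > /tmp/master.lst
( cd /media/backup/music && find . -type f | sort ) > /tmp/backup.lst
diff /tmp/master.lst /tmp/backup.lst

Lines that diff marks with "<" exist only on the master, lines marked with ">" only on the backup.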

Let's have a little generic brainstorming session on where data corruption or loss can occur. Data can get lost or corrupted on many occasions. 

  • The typical copy/paste and backup routines "try" to cope with it. They are supposed to ensure overall integrity and usually issue error or deviation messages if a problem is detected. copy/paste is not a safe mechanism though. No professional administrator would use this method.  
  • Harddisk and SSD failures. Corruption of data can occur very silently with aging/worn-down harddisks. Look at e.g. Amazon reviews or elsewhere. People report HDDs or SSDs dying after 6 months. I personally wouldn't trust a HDD that's older than 2-3 years.
  • Weird operating conditions, such as a power outage at the wrong moment, and so forth.
  • Software flaws (no SW is flawless) in the OS and in apps (e.g. bulk conversion tools)
  • System overload conditions - during overload conditions weird things can happen
  • Cyberattacks/Malware - causing any kind of weird stuff
  • Simple user faults - pushing the wrong button

With all this in mind, it doesn't take a genius to realize that your data can get corrupted and that you might overwrite clean backups with corrupted ones.


A closer look


The main challenge with all this is that many of these corruptions and losses occur under the hood. You simply don't see it happen.

Copy/paste doesn't generate logs. How would you know that anything happened during  - sometimes - hours of copying?

With tools that provide log options you could see a lot of what's happening. Though I'd guess the vast majority of people out there simply don't make use of logs.   



And this is why, over time, you potentially inject more and more corruption into your data base.



What to do about it?


It's pretty simple - once you are aware of the problem. And now you are aware of it. You can do something about it.

We just need to establish a backup process that's doable.

Before we can establish such a thing we need to get the current data base under control.
  1. Review your storage media (run file system checks, check the HDD operating hours, and so forth).
    If needed, replace your HDDs. 2TB drives run at around 70 $/€ nowadays.
    Don't use aged and/or phased-out drives as backup media.

    After 3 years max I buy myself a new master drive.
  2. Make sure that the existing backups are OK
    Many backup tools offer test-run options: the backup gets simulated rather than executed.
    That's a pretty handy function. You can run test runs in both directions for test
    purposes. And then you can analyze the logs.

    If you own a lot of flac files run a bulk integrity check. Several audio tools
    or the flac binary itself offer this option.

    Note: A file with an equal timestamp and an equal size doesn't have to be equal!
    A byte can flip and your backup tool (usually) won't recognize it. You can address this by running checksum tests. Many backup tools even offer that option. The catch: comparing checksums of all files during a backup slows the backup down substantially.   
  3. If you start with a cleanup exercise - make sure you have one extra backup.

  4. If you own just one backup disk, you'd better introduce a 2nd (long-term) backup medium
  5. Review your data formats
    Choose a format with builtin integrity check option, e.g. flac.
  6. You need to introduce logs. Each backup needs a log file. Store these logs
    on the backup media. And then you need to analyze the logs!!!
    Look for errors or changes that don't make sense.
  7. Introduce incremental backups!  You know the expression "restore points" from
    e.g. Windows. In a separate backup directory only the delta backups get saved.
    And the original backup remains untouched.
    Incremental backups are a major safety net to prevent overwriting your clean backups.
    And they let you roll back in time for quite a period.
    They require a bit more space on the backup media though. This works well with
    minor data collection updates here and there.
    If you run major data changes, e.g. bulk conversions, you'll run out of diskspace.
    From that point on you usually need to build a new base backup!
    (A sketch of how such an incremental, checksum-verified and logged backup run can look
    with rsync follows right after this list.)
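
To make items 2, 6 and 7 a bit more concrete, here is a minimal sketch of such a backup run with rsync (one of the tools listed in Appendix 1). The directory and log file names are just placeholders for your own master and backup locations. The run compares files by checksum instead of by timestamp/size, writes a log file onto the backup disk, and moves files that would be overwritten or deleted into a dated delta directory instead of destroying them:

rsync -av --checksum \
      --log-file=/media/backup/logs/backup-$(date +%F).log \
      --backup --backup-dir=/media/backup/delta-$(date +%F) \
      --delete \
      /media/music/  /media/backup/music/

Afterwards you go through the log and look for errors and for changes that don't make sense (e.g. thousands of "updated" files although you only touched a handful of albums).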



To accomplish the above audio-file-related safety net, we need to go for an audio data format which comes with an embedded integrity mechanism - a checksum.

 flac would be the preferred option.

.wav files, for example, are no-gos.


Let's go on with flacs.


Flacs come with an embedded md5 checksum. That checksum is generated over the audio data chunk only. Which is good: a little problem or a change in the tag area is not critical and won't have an impact on the checksum mechanism. The checksum is written into the flac file itself.

That checksum gets renewed as soon as you re-encode or transcode the flac data.
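
By the way, you can look at that stored checksum directly with the metaflac tool (it ships with the flac package); the filename is just a placeholder:

metaflac --show-md5sum "some-album/01 - some track.flac"

It prints the md5 sum that was stored at encoding time - the very value flac compares against the freshly decoded audio during a test run.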


If the flac codec detects a mismatch - while decoding - between the MD5 checksum and the decoded audio data chunk, it will issue a corruption message and stop decoding!!!

That corruption message is KEY. You need to look for it and you have to fix the affected track on either of your storage devices.



How to introduce this to our "backup"  AND not to forget "restore" procedure?
(We don't want to restore a corrupted collection!) 


The flac(.exe) executable (you'll find it under Windows and Linux) allows for identifying that checksum mismatch and issues error messages.
 
What you do is run a test-decoding ( flac -t <file> ) on all files on your master HDD and your backup media prior to and after any of your backups. (That check can easily take an hour on e.g. 4.5k-5k tracks.)
While doing the test-decoding, flac(.exe) will issue an error message if there's a problem with the checksum. You can write a batch file under Windows or a script (see Appendix 2) under Linux.

There's also another tool you might use:

The dbPoweramp Reference converter tool offers a function called "Test Conversion". It basically does what the flac binary is doing when run with the -t option. You can also use that one. The dbPoweramp Batch converter tool allows you to check your entire disk at once. That's most convenient. dbPoweramp also offers a feature called "Move Destination File on Error" that moves the corrupted data automatically to a predefined target directory.





That's basically all we need to do.

Yep. That's about it.

A pretty simple measure - though very powerful. I'm sure my collection is kept much safer now.


I guess 99.9% of all collections out there have a problem. You might consider doing something about it.


If you follow my advice, I'd be really interested to see you report back your corruption rate (and associated backup strategy). ;)

Good luck.

Cheers
\Klaus

##########################################

Appendix 1:

Backup advice:

1. Don't run simple Copy/Paste backups, use backup tools - look for recommended settings of these tools!!!
    Usually you'll find numerous parameters, which don't mean much to you in the beginning. Later on you
    usually realize why these parameters were introduced. Spend a little time on the subject.
2. Introduce incremental backups, which keep the original data as long as possible and store the "delta" data
    at a different place on the same disk.
    This way you save quite some space. You can  have several backup cycles on one disk.
3. Use at least 2 backup disks - stored at different locations.
4. Buy disks which are used in the professional area. Cheap consumer stuff is not recommended.
5. You don't need the fastest disks (higher wear-down effects). You need the most reliable ones.
6. Don't use old disks (e.g. the ones just replaced by your brand-new SSD) as backup media.
7. Check the data integrity separately - see above.
8. Automate the process as much as possible and keep the logfiles.

Tools:

Full Backup ( all partitions and bootsector)
Windows 7 -  Backup and Restore
Acronis - The Free WD Version
dd (Linux - commandline)

and more

Incremental Backup
Windows 7 -  Backup and Restore
rsync (Linux and Windows - commandline - also remote via network! (ssh))

and more

Note: The W7 tools can by now compete with other commercial software.
IMO there's no need to go for Norton or Acronis.

Secure Copy ( with CRC check and error logging on failed transfers)
Teracopy
rsync (Linux and Windows - commandline)


Advice: It's always recommended to use a defragmented disk.
Copying data back and forth all the time, or any other jobs you run on your data, such as adding RG tags or similar, gets you a lot of fragmentation on the disk. That fragmentation is gone once you back up your disk to another disk. If you can swap your master disk with the backup disk easily, use your backup disk as the master.
This way you always run a rather defragmented disk without running an annoying defragmentation process -
which puts a lot of pressure on your HDD.


###############################################


Appendix 2:

Since I'm at home in the Linux world, I've written a simple one-liner that accomplishes the flac integrity check over your entire harddisk(s):

Open a terminal first.

**copy/paste below command into one line******

find / -iname "*.flac" -print0 | while IFS= read -r -d '' j ; do   flac -s -t "$j"  2>>/tmp/flac-integrity.log ; done

************************************

That'll take some time (hours) - 1 to 2 s per file.

You can replace "find /" with e.g. "find /media/music" to specify a specific music directory.

You'll find the scan result in /tmp/flac-integrity.log

To test the above, you might copy a CD to /tmp first and then replace "find /" with e.g. "find /tmp".

You can run a:

grep "error"  /tmp/flac-integrity.log | wc -l

That'll tell you if any and how many problems where found.

6 comments:

  1. Hey Klaus,

    thanks for the headsup! Just checked my collection (203Gb, 14k files, 70% flac, 30% mp3) with dBpoweramp test. Only 3 files corrupted 1 of which I no longer need and 2 that I can rerip.

    Now I have to buy an external hdd so that I can do the backup^^. At the moment I just put a 2nd disc in my NAS, let it do its mirror job and take the 2nd back out...

    I stumbled upon your blog via diyaudio. Funny thing is I run a Squeezebox Touch as well and just 5 days ago ordered the DDX320v2 :D

    So you will be hearing from me in the near future about your mods, which I am looking forward to doing, and specifically about how you have hooked up your SBT to the DDX.

    Kind regards
    Ragnar

  2. Great bit of info there — I wasn't aware that flacs had the md5 sum embedded and were so easy to test. Monthly cron job duly set up. Cheers!

    James

  3. Hiya, cheers for the above, somewhat a beginner at pc stuff so will read through again and give it a bash :-) as i am slowly realizing the source file is rather important!

    I have always mirrored my master every day music hdd to back it up, so all the bits stay the same (or so one thought)

    Would, say, a RAID5 backup play with the bits of a FLAC file, or would it stay the same and just spread out more?

    cheers for any thoughts
    thanks again
    Mark

  4. Hi,
    I got curious about this and tried it for myself with dbamp. Out of 39'000 flac files, found 4 corrupted.
    Re-ripped one of them as flac and tested it ok with dbamp.
    I then compared the wav version of the corrupted file with the wav of re-ripped file (both obtained with flac frontend) and EAC (compare wav) found them to be identical. Does this mean that corruption originated from the flac tags of the data?


  5. The wav should have the same size as a decoded flac.

    To really compare two files though, the md5sums of both files have to be compared.

    The files can have the same size but can still differ in content.

    If a flac is broken, there's a mismatch between the md5sum embedded and stored in the file
    and the just generated md5sum of the audio content.

  6. Hi Klaus,
    after reading this blog, I now am AWARE of the corruption problem as well, but what can I do if all my musical treasure is ripped (via dbPoweramp) to aif, 'cause I am living in the Apple world?

    By the way, thanx for all that absorbing blog stuff and DETAILED explanations, always fun to read.

    Holger
