Discussion:
[BackupPC-devel] some thoughts of integrity tests for v 4 ("RsyncCsumCacheVerifyProb")
Fresel Michal - hi competence e.U.
2011-03-23 17:59:12 UTC
hi

I was thinking about the current state of the integrity tests ("RsyncCsumCacheVerifyProb").

On compressed files this currently means:
# decompress
# create checksum of the decompressed data
# compare against the cached checksums

What about this instead:
On the first run
# create a checksum of the compressed file = checksum_compressed_1
# decompress
# create a checksum of the decompressed data
# compare against the cached checksums
# if the cached checksums match - recreate the checksum of the compressed file (just in case we had some disk issue meanwhile) = checksum_compressed_2
# compare checksum_compressed_1 to checksum_compressed_2
# if they are the same - save the checksum of the compressed file for future use

On the next run we only need to checksum the compressed file, since the internal cached checksums were already verified above.

This results in faster rechecks afterwards (a rough sketch follows below):
-) no need to decompress
-) checksum calculations on compressed files are faster (fewer MB to process)
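
Just to make the flow concrete, here is a rough Perl sketch of the idea
(the helpers decompress_pool_file(), cached_digests_match() and
save_compressed_csum() are made up for illustration, they are not
existing BackupPC routines):

use Digest::MD5;

# proposed first-run verification (sketch only)
sub first_run_verify {
    my ($poolFile) = @_;

    my $csum_compressed_1 = md5_of_file($poolFile);          # checksum of compressed file
    my $plain = decompress_pool_file($poolFile);              # decompress
    return 0 unless cached_digests_match($poolFile, $plain);  # compare cached checksums

    # re-read the compressed file in case we had some disk issue meanwhile
    my $csum_compressed_2 = md5_of_file($poolFile);
    return 0 unless $csum_compressed_1 eq $csum_compressed_2;

    save_compressed_csum($poolFile, $csum_compressed_1);      # for fast rechecks later
    return 1;
}

sub md5_of_file {
    my ($file) = @_;
    open(my $fh, '<', $file) or die "can't open $file: $!";
    binmode($fh);
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close($fh);
    return $digest;
}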

Greetings
Mike
Jeffrey J. Kosowsky
2011-03-23 18:35:33 UTC
Post by Fresel Michal - hi competence e.U.
hi
I was thinking about the current state of the integrity tests ("RsyncCsumCacheVerifyProb").
On compressed files this currently means:
# decompress
# create checksum of the decompressed data
# compare against the cached checksums
If checksum caching is on, then the checksum is stored at the end of
the compressed file, so the file does *not* need to be decompressed.
Technically, the checksum is only added to the file the 2nd time the
file is encountered, which I imagine is because the native rsync
algorithm only transmits the block and full-file checksums when the
file already exists (otherwise perhaps only the full-file checksum is
transmitted).
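In pseudo-Perl, the point is roughly this (read_cached_digests(),
decompress_pool_file(), rsync_digests_of() and append_cached_digests()
are hypothetical helpers; the real logic lives in
BackupPC::Xfer::RsyncDigest):

# what checksum caching buys during a normal rsync full (sketch only)
sub digests_for_rsync {
    my ($poolFile) = @_;

    # cached block + full-file digests sit after the compressed data,
    # so they can be read back without decompressing the file
    my $cached = read_cached_digests($poolFile);
    return $cached if defined $cached;

    # not cached yet: compute the digests from the decompressed data and
    # append them (in practice this happens on the 2nd encounter) so that
    # later fulls can skip this step
    my $plain   = decompress_pool_file($poolFile);
    my $digests = rsync_digests_of($plain);
    append_cached_digests($poolFile, $digests);
    return $digests;
}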
Post by Fresel Michal - hi competence e.U.
On the first run
# create a checksum of the compressed file = checksum_compressed_1
# decompress
# create a checksum of the decompressed data
# compare against the cached checksums
# if the cached checksums match - recreate the checksum of the compressed file (just in case we had some disk issue meanwhile) = checksum_compressed_2
# compare checksum_compressed_1 to checksum_compressed_2
# if they are the same - save the checksum of the compressed file for future use
On the next run we only need to checksum the compressed file, since the internal cached checksums were already verified above.
-) no need to decompress
-) checksum calculations on compressed files are faster (fewer MB to process)
Seems like a lot of extra work since Craig's clever implementation
leverages existing rsync checksum calculations and gives the block and
full file md4 checksums for free.

Also, while it might be nice to have checksums of both the compressed
and decompressed files to speed up an independent file integrity test,
that is a whole other question.
Fresel Michal - hi competence e.U.
2011-03-23 19:29:03 UTC
hi Jeffrey,
Post by Jeffrey J. Kosowsky
Post by Fresel Michal - hi competence e.U.
hi
I was thinking about the current state of the integrity tests ("RsyncCsumCacheVerifyProb").
On compressed files this currently means:
# decompress
# create checksum of the decompressed data
# compare against the cached checksums
If checksum caching is on, then the checksum is stored at the end of
the compressed file, so the file does *not* need to be decompressed.
Technically, the checksum is only added to the file the 2nd time the
file is encountered, which I imagine is because the native rsync
algorithm only transmits the block and full-file checksums when the
file already exists (otherwise perhaps only the full-file checksum is
transmitted).
I was writing about "RsyncCsumCacheVerifyProb", not about the cache itself:
When rsync checksum caching is enabled (by adding the --checksum-seed=32761 option to $Conf{RsyncArgs}), the cached checksums can be occasionally verified to make sure the file contents matches the cached checksums. This is to avoid the risk that disk problems might cause the pool file contents to get corrupted, but the cached checksums would make BackupPC think that the file still matches the client.
On compressed files this verification starts with "decompress ..." as described above, right?
Post by Jeffrey J. Kosowsky
Post by Fresel Michal - hi competence e.U.
On the first run
# create a checksum of the compressed file = checksum_compressed_1
# decompress
# create a checksum of the decompressed data
# compare against the cached checksums
# if the cached checksums match - recreate the checksum of the compressed file (just in case we had some disk issue meanwhile) = checksum_compressed_2
# compare checksum_compressed_1 to checksum_compressed_2
# if they are the same - save the checksum of the compressed file for future use
On the next run we only need to checksum the compressed file, since the internal cached checksums were already verified above.
-) no need to decompress
-) checksum calculations on compressed files are faster (fewer MB to process)
Seems like a lot of extra work since Craig's clever implementation
leverages existing rsync checksum calculations and gives the block and
full file md4 checksums for free.
It's just meant to speed up the following:
With checksum caching enabled, there is a risk that should a file's contents in the pool be corrupted due to a disk problem, but the cached checksums are still correct, the corruption will not be detected by a full backup, since the file contents are no longer read and compared. To reduce the chance that this remains undetected, BackupPC can recheck cached checksums for a fraction of the files. This fraction is set with the $Conf{RsyncCsumCacheVerifyProb} setting. The default value of 0.01 means that 1% of the time a file's checksums are read, the checksums are verified. This reduces performance slightly, but, over time, ensures that files contents are in sync with the cached checksums.

We would then only need to checksum the compressed file, since the block and full-file checksums of the uncompressed data have already been verified.
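
For reference, the two settings from the documentation quoted above,
plus an illustration of the 1% behaviour (the rand() part is only an
illustration, not the actual BackupPC code):

# config.pl -- values taken from the documentation quoted above
$Conf{RsyncArgs} = [
    # ... the usual rsync arguments ...
    '--checksum-seed=32761',              # enables checksum caching
];
$Conf{RsyncCsumCacheVerifyProb} = 0.01;   # verify cached checksums ~1% of the time

# roughly, each time a file's cached checksums are read:
if (rand() < $Conf{RsyncCsumCacheVerifyProb}) {
    # decompress, recompute the digests and compare them to the cached ones
}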

Greetings
Mike
Fresel Michal - hi competence e.U.
2011-03-23 19:54:01 UTC
hi Jeffrey,
Post by Jeffrey J. Kosowsky
If checksum caching is on, then the checksum is stored at the end of
the compressed file, so the file does *not* need to be decompressed.
Technically, the checksum is only added to the file the 2nd time the
file is encountered, which I imagine is because the native rsync
algorithm only transmits the block and full-file checksums when the
file already exists (otherwise perhaps only the full-file checksum is
transmitted).
Since I like your BackupPC_digestVerify with its "-a Add rsync digests if missing" option ...

what about giving the user the ability to auto-create these cached checksums on the first run (meaning that part of your script would be included in the main code)?
Maybe some users would accept the penalty of a longer initial backup?

Some kind of checkbox "autocreate checksums on new files" (defaulting to no; see the sketch below),
+ FAQ entry: this option would create checksums when a file is first added to the pool, instead of on the 2nd sync as usual.
???
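
Something along these lines (the option name is made up, it does not
exist today):

# hypothetical option, just to illustrate the suggestion:
$Conf{RsyncCsumCacheAddOnFirstBackup} = 0;   # default: current behaviour
# if set to 1, the rsync digests would be computed and appended to new
# pool files during the first backup instead of on the 2nd sync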

Greetings

Mike

# sub-thread of "some thoughts of integrity tests for v 4 ("RsyncCsumCacheVerifyProb")"
Jeffrey J. Kosowsky
2011-03-23 20:21:08 UTC
Post by Fresel Michal - hi competence e.U.
hi Jeffrey,
Post by Jeffrey J. Kosowsky
If checksum caching is on, then the checksum is stored at the end of
the compressed file, so the file does *not* need to be decompressed.
Technically, the checksum is only added to the file the 2nd time the
file is encountered, which I imagine is because the native rsync
algorithm only transmits the block and full-file checksums when the
file already exists (otherwise perhaps only the full-file checksum is
transmitted).
Since I like your BackupPC_digestVerify with its "-a Add rsync digests if missing" option ...
what about giving the user the ability to auto-create these cached checksums on the first run (meaning that part of your script would be included in the main code)?
Maybe some users would accept the penalty of a longer initial backup?
Some kind of checkbox "autocreate checksums on new files" (defaulting to no),
+ FAQ entry: this option would create checksums when a file is first added to the pool, instead of on the 2nd sync as usual.
???
I'm not sure of the purpose of this discussion. Craig is rightly
devoting his energies to 4.0. The discussion here seems to be about
adding new capabilities to 3.x. Other than bug fixes, I'm not sure you
will get much traction around adding new functionality to 3.x no
matter how simple and/or beneficial it may be.

In part, that is why I have focused on writing my own independent
routines to add functionality that I need. But I have stayed away from
changing the core 3.x code and functionality.

Some of your other emails address 4.x functionality which seems to be
a better place to have these discussions.

Regarding your proposal of adding a .info file for each pool entry, I
have thought of similar approaches but at the end of the day I worry
that it adds additional complexity and room for error without much
benefit over a more clever encoding and naming of regular pool
entries. In particular, I am not crazy about the idea of effectively
doubling the number of pool files by adding .info files.
Fresel Michal - hi competence e.U.
2011-03-23 20:46:38 UTC
hi Jeffrey,

I really appreciate Craig's work and I love the results.

As somebody already mentioned elsewhere - maybe it's better to rethink the current implementation before it gets hardcoded :)

So my question for the dev list:
Do you know of users who would accept the penalty of a longer initial backup?
This is meant for 4.0, not for 3.x!

I'm trying to find out whether some dev is already playing around with your script...
It's always easy to ask users which functions they might find useful - but asking on [dev] means (maybe) getting a sense of whether it is even possible :)

I'm just thinking of merging some parts of your scripts into [MAIN].
Having a "fully featured product" instead of so many useful - but still custom - scripts should be easier for Craig as well as for end users.

And yes, I like your scripts too, Jeffrey :)

Greetings
Mike
Post by Jeffrey J. Kosowsky
Post by Fresel Michal - hi competence e.U.
hi Jeffrey,
Post by Jeffrey J. Kosowsky
If checksum caching is on, then the checksum is stored at the end of
the compressed file, so the file does *not* need to be decompressed.
Technically, the checksum is only added to the file the 2nd time the
file is encountered, which I imagine is because the native rsync
algorithm only transmits the block and full-file checksums when the
file already exists (otherwise perhaps only the full-file checksum is
transmitted).
Since I like your BackupPC_digestVerify with its "-a Add rsync digests if missing" option ...
what about giving the user the ability to auto-create these cached checksums on the first run (meaning that part of your script would be included in the main code)?
Maybe some users would accept the penalty of a longer initial backup?
Some kind of checkbox "autocreate checksums on new files" (defaulting to no),
+ FAQ entry: this option would create checksums when a file is first added to the pool, instead of on the 2nd sync as usual.
???
I'm not sure of the purpose of this discussion. Craig is rightly
devoting his energies to 4.0. The discussion here seems to be about
adding new capabilities to 3.x. Other than bug fixes, I'm not sure you
will get much traction around adding new functionality to 3.x no
matter how simple and/or beneficial it may be.
In part, that is why I have focused on writing my own independent
routines to add functionality that I need. But I have stayed away from
changing the core 3.x code and functionality.
Fresel Michal - hi competence e.U.
2011-03-23 20:59:46 UTC
hi Jeffrey,

As this will create nice sub-threads, here's the answer :)

The ".info" file was just an idea - its implementation might even end up being a full SQL DB (kidding :)

Maybe we should call it "metadata information"?
These are just ideas for the upcoming 4.0 release.
It's easier to discuss them now ... than to redesign the whole app afterwards :)

Maybe they won't get implemented - but at least some people will have discussed them ...

Hopefully - even if it isn't implemented in [Main] - we'll get a nice BackupPC_digestVerify for future releases ;)

Greetings
Mike
Post by Jeffrey J. Kosowsky
Regarding your proposal of adding a .info file for each pool entry, I
have thought of similar approaches but at the end of the day I worry
that it adds additional complexity and room for error without much
benefit over a more clever encoding and naming of regular pool
entries. In particular, I am not crazy about the idea of effectively
doubling the number of pool files by adding .info files.