Discussion:
[BackupPC-devel] BackupPC 4.0 features - pool structure
Craig Barratt
2011-03-02 08:32:57 UTC
The next topic is the pool structure in 4.x.

Here are the differences in pool file storage between 3.x and 4.x:

- Digest changes from partial MD4 to full-file MD5. This will
significantly reduce pool collisions - in almost all
installations there will be no pool collisions. The most
common exception will be if someone uses the now well-known
constructed cases of different files with MD5 collisions.

In 3.x a partial MD4 digest is used, so collisions are more
common. The file system's hardlink limit can also
cause additional entries in a pool file chain. In 4.x reference
counting is done using a simple database, so the file system
hardlink limit isn't relevant.
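
As a concrete illustration (just a sketch, not the actual BackupPC
code), the full-file MD5 digest that 4.x keys the pool on can be
computed in a streaming fashion with the standard Digest::MD5 module:

    # Sketch only: compute the full-file MD5 digest used as the 4.x
    # pool key.  The file is streamed, so its size doesn't matter.
    use Digest::MD5;

    sub full_file_md5
    {
        my($path) = @_;
        open(my $fh, "<", $path) or die "can't open $path: $!";
        binmode($fh);
        my $md5 = Digest::MD5->new;
        $md5->addfile($fh);
        close($fh);
        return $md5->hexdigest;    # 32 hex digits
    }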

- If pool files do collide, a chain is created by appending one or
more bytes to the MD5 digest as a counter. The first instance of a
pool file will have a regular 16-byte digest. The next file that is
different but has the same MD5 digest will be stored as a 17-byte
digest with an extra byte of 0x01. The file at chain index 256
(unlikely of course) will have two bytes appended: 0x0100.
The extension is basically the file index with the leading 0x00 bytes
removed.
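
As a rough sketch of that naming scheme (not the actual BackupPC
code; the helper below is purely illustrative), the extension can be
built by packing the chain index big-endian and stripping the leading
0x00 bytes:

    # Sketch only: pool file name for a given MD5 digest and chain
    # index.  Index 0 yields the bare 16-byte digest; index 1 appends
    # 0x01; index 256 appends the two bytes 0x01 0x00.
    use Digest::MD5 qw(md5);

    sub pool_file_name
    {
        my($data, $chainIdx) = @_;
        my $digest = md5($data);            # 16-byte binary digest
        my $ext    = pack("N", $chainIdx);  # 32-bit big-endian index
        $ext =~ s/^\x00+//;                 # drop leading 0x00 bytes
        return unpack("H*", $digest . $ext);
    }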

- 4.x doesn't use hardlinks (except as inherited from existing 3.x
pools).

- In 4.x pool files are never renamed. In 3.x, pool files in a chain
of repeated digests will be renamed if one of the middle files is
deleted. In the unlikely event that there is a chain of repeated files
in 4.x and one of the files is deleted (i.e., no longer referenced),
it is replaced by a zero-length file. That zero-length file is a tag
indicating that a search through the chain should continue past that
point, and also that the slot can be filled by a real pool file when
the next file is added.
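
To illustrate the lookup rule (again just a sketch with hypothetical
helpers, not the real 4.x code; only single-byte extensions are
handled here for brevity):

    # Sketch only: walk a chain, skipping zero-length placeholder
    # files, and remember the first empty slot so a new colliding
    # file could be stored there.  $matchSub is a hypothetical
    # callback that compares a candidate pool file to the new file.
    sub find_in_chain
    {
        my($poolDir, $digestHex, $matchSub) = @_;
        my $firstEmptyIdx;
        for ( my $idx = 0 ; ; $idx++ ) {
            my $name = $idx == 0 ? $digestHex
                                 : $digestHex . sprintf("%02x", $idx);
            my $path = "$poolDir/$name";
            last if ( !-e $path );              # end of chain
            if ( -z $path ) {                   # zero-length tag
                $firstEmptyIdx = $idx if ( !defined($firstEmptyIdx) );
                next;                           # keep searching
            }
            return ($path, undef) if ( $matchSub->($path) );
        }
        return (undef, $firstEmptyIdx);         # caller may reuse slot
    }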

- In 4.x the pool files are stored two levels deep, with 128
directories at each level. The directories are numbered in hex
from 00 to fe in steps of 2. The directory names are based on
the first two bytes of the MD5 digest, each ANDed with 0xfe.
For example, a file with digest 0458d9d0e9ddd2b6b21a1e60b6cdf323
will be stored in:

CPOOL_DIR/04/58/0458d9d0e9ddd2b6b21a1e60b6cdf323

while a file with digest 09682c6df94c87b1e9ee6e1d0d89e8f2 will be
stored in:

CPOOL_DIR/08/68/09682c6df94c87b1e9ee6e1d0d89e8f2

(notice that 0x09 & 0xfe == 0x08).

In 3.x the directories are three levels deep, with 16 directories
at each level based on the first 3 hex digits of the partial
MD4 digest. So in 3.x there are 16^3 = 4096 leaf directories,
while in 4.x there are 128 * 128 = 16384 leaf directories.
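
For illustration, here is a small sketch (not the actual BackupPC
code) of the 4.x digest-to-path mapping described above:

    # Sketch only: map a 4.x hex digest to its two-level pool path.
    # The first two bytes of the digest are each ANDed with 0xfe,
    # giving 128 * 128 = 16384 leaf directories.
    sub pool_path_v4
    {
        my($cpoolDir, $digestHex) = @_;
        my $d0 = sprintf("%02x", hex(substr($digestHex, 0, 2)) & 0xfe);
        my $d1 = sprintf("%02x", hex(substr($digestHex, 2, 2)) & 0xfe);
        return "$cpoolDir/$d0/$d1/$digestHex";
    }

    # pool_path_v4("CPOOL_DIR", "0458d9d0e9ddd2b6b21a1e60b6cdf323")
    #   -> CPOOL_DIR/04/58/0458d9d0e9ddd2b6b21a1e60b6cdf323
    # pool_path_v4("CPOOL_DIR", "09682c6df94c87b1e9ee6e1d0d89e8f2")
    #   -> CPOOL_DIR/08/68/09682c6df94c87b1e9ee6e1d0d89e8f2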

- The 3.x and 4.x CPOOL_DIR is the same. The trees below it remain
separate because of the different directory naming conventions.

- In 4.x when pool file matching occurs the full-file MD5 digest
is needed to match files. There is also a flag, $bpc->{PoolV3},
that determines whether old 3.x pool files should be checked
too. Currently that flag is hardcoded and I need to make it
autodetect whether there are any old pool files (I guess based
on BackupPC_nightly?). If PoolV3 is set and there are no
candidate 4.x files, then the old digest is computed too and
3.x candidate pool files are also checked for matches.

If an old pool 3.x file is matched, then that file is renamed to
the corresponding 4.x pool file path (based on the MD5 digest).
This file might still have multiple hardlinks due to the existing
3.x backups. As those backups are expired, eventually the link
count on the pool file will decrease to 1.
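
To make the matching flow concrete, here is a rough sketch of the
decision logic described above. This is not BackupPC code;
poolPathV4(), poolPathV3(), digestV3Partial() and sameContents() are
hypothetical helpers (full_file_md5() is the earlier sketch), and
chains are ignored for brevity:

    # Sketch only: try the 4.x pool first; if PoolV3 is set and there
    # is no 4.x candidate, compute the old partial digest and try the
    # 3.x pool, renaming a 3.x match to its 4.x location.
    sub match_pool_file
    {
        my($bpc, $newFile) = @_;

        my $md5  = full_file_md5($newFile);
        my $path = poolPathV4($md5);
        return $path if ( -f $path && sameContents($path, $newFile) );

        if ( $bpc->{PoolV3} && !-f $path ) {
            my $oldDigest = digestV3Partial($newFile);
            my $v3path    = poolPathV3($oldDigest);
            if ( -f $v3path && sameContents($v3path, $newFile) ) {
                # Existing 3.x backups keep their hardlinks to this
                # file until they expire; only its name/location in
                # the pool changes.
                rename($v3path, $path) or die "rename failed: $!";
                return $path;
            }
        }
        return undef;    # no match: store $newFile as a new pool file
    }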

- For backing up the BackupPC store in a mixed V3/V4 environment it
should be possible to just copy the new V4 pool and new V4 backups
(without worrying about hardlinks that might remain on pool files
from V3 backups). However, I need to devise a way of determining
the paths of the V4 backups. Perhaps I should add a utility that
lists all the directories that should be backed up?

Craig
Jeffrey J. Kosowsky
2011-03-03 02:52:57 UTC
Post by Craig Barratt
The next topic is the pool structure in 4.x.
- Digest changes from partial MD4 to full-file MD5. This will
significantly reduce pool collisions - in almost all
installations there will be no pool collisions. The most
common exception will be if someone uses the now well-known
constructed cases of different files with MD5 collisions.
AWESOME. The partial MD5 (not partial MD4 by the way -- you may want
to correct this in your documentation) checksums always seemed like
such a kluge and pita to me :P
Post by Craig Barratt
In 3.x a partial MD4 digest is used, so collisions are more
common. Also, the file system's hardlink limit can also
cause more entries in a pool file chain. In 4.x reference
counting is done using a simple database, so the file system
hardlink limit isn't relevant.
- If pool files do collide, a chain is created by appending one or
more bytes to the MD5 digest as a counter. The first instance of a
pool file will have a regular 16 byte digest. The next file that is
different but has the same MD5 digest will be stored as a 17 byte
digest with an extra byte of 0x01. The 256th file in the chain
(unlikely of course) will have two more bytes appended: 0x0100.
The extension is basically the file index with leading 0x00 bytes
removed.
As per my earlier post, I worry about the concept of still having
potential collisions (even if rare) and chains. My concern is both
that it is inelegant and that it presumably requires pool files to be
decompressed and compared byte-by-byte with any new file to check
whether there is a collision. For large files, decompressing the
entire file and then comparing byte-by-byte is *slow* and could
account for a significant portion of the backup time.

Wouldn't it be better to pick a more secure hash such as SHA-256
or even SHA-512, where the chances of a collision are so
astronomically small as to be less likely than having a bit error in
your RAM? No collisions have ever been found for either of them, and
they allow maximum input sizes of up to 2^64-1 and 2^128-1 bits
respectively. Presumably, the chance that two given files collide is
2^-256 and 2^-512 respectively, which are numbers so small that
physical hardware errors are more likely.

Eliminating any statistical likelihood of collision would then allow
you to simplify the code by eliminating the need for chains, while
also speeding up adding files to the pool, since you wouldn't need to
check for new collisions but could just trust the SHA digest.
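
To illustrate how small the code change would be (just a sketch using
the standard Digest::SHA module, not a statement about how BackupPC
would actually do it):

    # Sketch only: full-file SHA-256 or SHA-512 digest, analogous to
    # the full-file MD5 computation.
    use Digest::SHA;

    sub full_file_sha
    {
        my($path, $bits) = @_;        # $bits is 256 or 512
        my $sha = Digest::SHA->new($bits);
        $sha->addfile($path, "b");    # "b" = binary mode
        return $sha->hexdigest;       # 64 or 128 hex digits
    }

With a digest that wide, a digest match could simply be treated as a
content match, which is what would remove the decompress-and-compare
step.
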
Post by Craig Barratt
- 4.x doesn't use hardlinks (except as inherited from existing 3.x
pools).
- In 4.x pool files are never renamed. In 3.x pool files in a chain
of repeated digests will be renamed if one of the middle files is
deleted. In the unlikely event there is a chain of repeated files in
4.x, and one of the files is deleted (ie: no longer referenced),
then it is replaced by a zero-length file. That acts as a tag that
searching through the chain should continue past that point, and
also acts as a tag that that file can be replaced by a real pool
file when the next file is added.
As above, I think it would be preferable and more elegant to use a
checksum that can be relied upon as being unique.
Post by Craig Barratt
- In 4.x the pool files are stored two-levels deep, with 128
directories at each level. The directories are numbered in hex
from 00 to fe in steps of 2. The directory names are based on
the first two bytes of the MD5 digest, each anded with 0xfe.
For example, a file with digest 0458d9d0e9ddd2b6b21a1e60b6cdf323
CPOOL_DIR/04/58/0458d9d0e9ddd2b6b21a1e60b6cdf323
while a file with digest 09682c6df94c87b1e9ee6e1d0d89e8f2 will be
CPOOL_DIR/08/68/09682c6df94c87b1e9ee6e1d0d89e8f2
(notice that 0x09 & 0xfe == 0x08).
When I was writing my code in BackupPC_copyPcPool.pl to create an
inode-based pool, I thought about the tradeoffs between fewer vs. more
levels of pool directories. It seemed to me that as the number of pool
files increased you would be better off with more levels -- there is a
tradeoff between the number of levels of directories accessed
vs. the size of each directory. In my code, this led me to
allow for a user-specified pool depth.

I wonder whether here too, it may be best to allow for a configurable
depth to the pool. This would not change the actual pool file names
but just how many levels they are distributed across.

Then for small installations, one might choose just 2 levels but for
truly humongous installations, it might be better to have 3 or even
more levels. Presumably, different filesystems and hardware would also
influence this tradeoff.

It would seem that it wouldn't be too hard to modify the code to make
this a user-configurable parameter (see the sketch below).
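
For what it's worth, a depth-parameterized version of the 4.x mapping
might look like the sketch below (purely illustrative, not a proposal
for the actual code). Each level consumes one byte of the digest,
ANDed with 0xfe, so every extra level multiplies the number of leaf
directories by 128:

    # Sketch only: same naming scheme as 4.x, but the number of
    # directory levels is a parameter.
    sub pool_path
    {
        my($cpoolDir, $digestHex, $levels) = @_;
        my @dirs;
        for my $i ( 0 .. $levels - 1 ) {
            push(@dirs,
                 sprintf("%02x",
                         hex(substr($digestHex, 2 * $i, 2)) & 0xfe));
        }
        return join("/", $cpoolDir, @dirs, $digestHex);
    }

    # $levels == 2 reproduces the 4.x layout (16384 leaves);
    # $levels == 3 would give 128^3 = 2097152 leaf directories.
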
Post by Craig Barratt
In 3.x the directories are three levels deep, with 16 directories
at each level based on the first 3 hex digits of the partial
MD4 digest. So in 3.x there are 16^3 = 4096 leaf directories,
while in 4.x there are 128 * 128 = 16384 leaf directories.
- The 3.x and 4.x CPOOL_DIR is the same. The trees below are
separate because of the directory naming conventions.
- In 4.x when pool file matching occurs the full-file MD5 digest
is needed to match files. There is also a flag, $bpc->{PoolV3},
that determines whether old 3.x pool files should be checked
too. Currently that flag is hardcoded and I need to make it
autodetect whether there are any old pool files (I guess based
on BackupPC_nightly?). If PoolV3 is set and there are no
candidate 4.x files, then the old digest is computed too and
3.x candidate pool files are also checked for matches.
If an old pool 3.x file is matched, then that file is renamed to
the corresponding 4.x pool file path (based on the MD5 digest).
This file might still have multiple hardlinks due to the existing
3.x backups. As those backups are expired, eventually the link
count on the pool file will decrease to 1.
While I appreciate the benefit of preserving backward compatibility
with 3.x and of allowing the two to coexist, I wonder whether you
might be better off requiring either separate instances or a one-time
conversion. This would presumably simplify the code and its ongoing
maintenance considerably, since you wouldn't have to maintain parallel
implementations within the same code. Also, I imagine that most
people will either want to convert everything to 4.x to gain the
benefit of eliminating hard links, or be OK with maintaining a
separate archival 3.x instance.

In the mixed case, I imagine that most people retain a relatively long
tail on their backups, so the link count may never decrease to 1,
which again to me would argue for either a one-time conversion or a
separate instance.
Post by Craig Barratt
- For backing up the BackupPC store in a mixed V3/V4 environment it
should be possible to just copy the new V4 pool and new V4 backups
(without worrying about hardlinks that might remain on pool files
from V3 backups). However, I need to devise a way of determining
the paths of the V4 backups. Perhaps I should add a utility that
lists all the directories that should be backed up?
Personally, I am most interested in a utility that would convert V3 to
V4. This could presumably be run in the background so long as all new
backups were created in V4 format. The fact that the deltas are
time-backward should make this all easier, since one could work
backwards from the first V4 instance, or alternatively just start by
converting the last V3 instance to a new filled version and work
backwards from there. Since this could run in the background, it
wouldn't matter (at least to me) how slowly it runs. But after a
while I could be guaranteed a nice, consistent, uniform V4 system with
all of its advantages (save the fact that old V3 versions would still
obviously lack extended attributes).

I really don't like the mixed idea where V3 pool files gradually get
renamed, which would then add files with potentially thousands of hard
links to a new, clean V4 pool. Also, this would make it very difficult
to back up the old V3 portion, since the pool for such backups would
now be split between the old V3 pool and the pool files transferred
and renamed into the new V4 pool.
Jeffrey J. Kosowsky
2011-03-03 03:20:37 UTC
Post by Jeffrey J. Kosowsky
As per my earlier post, I worry about the concept of still having
potential collisions (even if rare) and chains. The reason being both
that it is inelegant and that presumably it requires pool files to be
decompressed and compared byte-by-byte with any new file to check if
there is a collision. For large files, decompressing the entire file
and then comparing byte-by-byte is *slow* and could account for a
significant portion of the backup time.
Wouldn't it be better to pick a more secure checksum such as sha256sum
or even sha512sum where the chances of a collision are so
astronomically small as to be less likely than having a bit error in
your RAM. No collisions have ever been found for either of them and
they allow maximum file sizes up to 2^64-1 and 2^128-1
respectively. Presumably, the chance of a random collision is 2^-256
and 2^-512 respectively which are numbers so small that physical
hardware errors are more likely.
Eliminating any statistical likelihood of collision would then allow
you to simplify the code by eliminating the need for chains while also
speeding up adding files to the pool since you wouldn't need to check
for new collisions but just use the sha-sum.
Just as a follow-up, in case any of you are worried about the 1 in
2^512 (about 10^154) chance of a collision, I offer you these
consolations:
1. The total number of particles in the universe is estimated to be
between 10^72 and 10^87.
2. The fame and fortune that you will gain from being the first to
find a SHA-256 or SHA-512 collision will surely outweigh the cost
of any data lost in BackupPC due to such a collision.
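
For scale, a rough birthday-bound estimate: with n pool files and a
d-bit digest, the chance of any collision at all is about
n^2 / 2^(d+1), so even n = 10^12 files under SHA-256 gives roughly
10^24 / 2^257, i.e. about 4 x 10^-54.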

Plus, should collisions ever be found, I'm sure the NSA will be there
ready and waiting with SHA-1024 and beyond, which could easily be
swapped in, assuming that Craig writes the code in his usual modular
fashion...
