Craig Barratt
2011-03-02 08:32:57 UTC
The next topic is the pool structure in 4.x.
Here are the differences in pool file storage between 3.x and 4.x:
- Digest changes from partial MD4 to full-file MD5. This will
significantly reduce pool collisions - in almost all
installations there will be no pool collisions. The most
common exception will be if someone uses the now well-known
constructed cases of different files with MD5 collisions.
In 3.x a partial MD4 digest is used, so collisions are more
common. In addition, the file system's hardlink limit can
cause more entries in a pool file chain. In 4.x reference
counting is done using a simple database, so the file system
hardlink limit isn't relevant.
- If pool files do collide, a chain is created by appending one or
more bytes to the MD5 digest as a counter. The first instance of a
pool file will have a regular 16 byte digest. The next file that is
different but has the same MD5 digest will be stored as a 17 byte
digest with an extra byte of 0x01. The file at index 256 in the chain
(unlikely of course) will have two bytes appended instead: 0x0100.
The extension is basically the file index with leading 0x00 bytes
removed.
- 4.x doesn't use hardlinks (except as inherited from existing 3.x
pools).
- In 4.x pool files are never renamed. In 3.x pool files in a chain
of repeated digests will be renamed if one of the middle files is
deleted. In the unlikely event there is a chain of repeated files in
4.x, and one of the files is deleted (i.e., no longer referenced),
then it is replaced by a zero-length file. That acts as a tag that
searching through the chain should continue past that point, and
also acts as a tag that that file can be replaced by a real pool
file when the next file is added.
- In 4.x the pool files are stored two-levels deep, with 128
directories at each level. The directories are numbered in hex
from 00 to fe in steps of 2. The directory names are based on
the first two bytes of the MD5 digest, each ANDed with 0xfe.
For example, a file with digest 0458d9d0e9ddd2b6b21a1e60b6cdf323
will be stored in:
CPOOL_DIR/04/58/0458d9d0e9ddd2b6b21a1e60b6cdf323
while a file with digest 09682c6df94c87b1e9ee6e1d0d89e8f2 will be
stored in:
CPOOL_DIR/08/68/09682c6df94c87b1e9ee6e1d0d89e8f2
(notice that 0x09 & 0xfe == 0x08).
In 3.x the directories are three levels deep, with 16 directories
at each level based on the first 3 hex digits of the partial
MD4 digest. So in 3.x there are 16^3 = 4096 leaf directories,
while in 4.x there are 128 * 128 = 16384 leaf directories.
- The 3.x and 4.x CPOOL_DIR is the same. The trees below it are
separate because of the directory naming conventions (single hex
digit names in 3.x, two hex digit names in 4.x).
- In 4.x when pool file matching occurs the full-file MD5 digest
is needed to match files. There is also a flag, $bpc->{PoolV3},
that determines whether old 3.x pool files should be checked
too. Currently that flag is hardcoded and I need to make it
autodetect whether there are any old pool files (I guess based
on BackupPC_nightly?). If PoolV3 is set and there are no
candidate 4.x files, then the old digest is computed too and
3.x candidate pool files are also checked for matches.
If an old 3.x pool file is matched, then that file is renamed to
the corresponding 4.x pool file path (based on the MD5 digest).
This file might still have multiple hardlinks due to the existing
3.x backups. As those backups are expired, eventually the link
count on the pool file will decrease to 1.
- For backing up the BackupPC store in a mixed V3/V4 environment it
should be possible to just copy the new V4 pool and new V4 backups
(without worrying about hardlinks that might remain on pool files
from V3 backups). However, I need to devise a way of determining
the paths of the V4 backups. Perhaps I should add a utility that
lists all the directories that should be backed up?
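
To make the 4.x naming rules above concrete, here is a minimal Python
sketch of the path and chain-suffix calculation. It is an illustration
only, not the BackupPC code: the function names chain_suffix and
pool_path are made up, "CPOOL_DIR" is just a placeholder string, and it
assumes that a chained file's name is the hex encoding of the digest
plus the appended counter bytes.

```python
def chain_suffix(index: int) -> bytes:
    """Counter bytes appended to the MD5 digest for a file in a collision chain.

    The suffix is the chain index as big-endian bytes with leading 0x00
    bytes removed, so index 0 (the first file) gets no suffix at all.
    """
    if index < 0:
        raise ValueError("chain index must be non-negative")
    n_bytes = (index.bit_length() + 7) // 8  # 0 bytes for index 0
    return index.to_bytes(n_bytes, "big")


def pool_path(digest: bytes, index: int = 0, top: str = "CPOOL_DIR") -> str:
    """Build the 4.x pool path for a 16-byte MD5 digest (hypothetical helper).

    Each directory level is one of the first two digest bytes ANDed with
    0xfe, which gives 128 possible names per level (00, 02, ..., fe).
    Assumption: the file name is the hex form of digest + chain suffix.
    """
    if len(digest) != 16:
        raise ValueError("expected a 16-byte MD5 digest")
    level1 = digest[0] & 0xFE
    level2 = digest[1] & 0xFE
    name = (digest + chain_suffix(index)).hex()
    return f"{top}/{level1:02x}/{level2:02x}/{name}"


if __name__ == "__main__":
    d = bytes.fromhex("09682c6df94c87b1e9ee6e1d0d89e8f2")
    print(pool_path(d))           # first file with this digest
    print(pool_path(d, index=1))  # second, different file with the same digest
    print(128 * 128, 16 ** 3)     # leaf directories in 4.x vs 3.x
```

With the two example digests from the list, this gives the same
CPOOL_DIR/04/58/... and CPOOL_DIR/08/68/... paths shown above.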
Craig