Discussion:
[BackupPC-devel] Update on BackupPC Development / Is backuppc dead?
Wessel Dankers
2012-08-06 14:46:12 UTC
Hi,

(This thread was originally in -users.)
My novice understanding is that the next release is a "whopper" which
fundamentally changes many things about how BackupPC works...so it's
taking a long time.
I'm sorry to have to cast doubt on this. I have heard the above many times
on this list, but BackupPC development is opaque and centralised in the
main developer. A lot of us write patches, and the Debian packagers seem to
be helping keep BackupPC up to date. But getting those patches upstream
seems impossible, and I'm beginning to wonder if v4 is coming at all.
I'm currently experimenting with some ideas of my own.

My employer uses BackupPC to back up ~150 unix servers nightly. To work
around some performance problems I had to alter the source code a little.
While doing this I realized there might be some nice gains to be made by
doing certain things differently.

The ideas overlap to a limited extent with the ideas[0] that Craig posted
to this list. For instance, no more hardlinks, and garbage collection is
done using flat-file databases. Some things are quite different. I'll try
to explain my ideas here.

Observations
============

1) I/O: writes are fast, reads are slow. Random reads are very slow.

Writing can be done asynchronously. Even random writes can be elevatored to
the platters fairly efficiently. Reading, on the other hand, blocks the
process doing it. Running many processes in parallel can alleviate the
random-read bottleneck a little, but it's still best to do sequential reads
whenever possible.

2) CPU and RAM are cheap.

BackupPC parallelizes well and for larger installations 12-core or even
24-core systems can be had for reasonable prices. Memory is dirt cheap.

File data
=========

All files are to be split into chunks of (currently) 2 megabytes. These
chunks are addressed by their sha512 hashes (that is, the filename for each
chunk is simply the base64 encoded hash). Any compression is applied after
splitting. This provides dedup even for large files that change or
otherwise differ only slightly.

The use of sha512 eliminates collisions for all practical purposes, and
provides safe dedup without expensive whole-file comparisons. This obviates
the need for a lot of disk reading.
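
As a rough sketch (not the actual pool code; the filesystem-safe base64
variant is just an assumption), chunking and naming would look something
like this:

    use strict;
    use warnings;
    use Digest::SHA qw(sha512_base64);

    my $CHUNK_SIZE = 2 * 1024 * 1024;   # 2 megabytes, as proposed

    my $path = shift @ARGV or die "usage: $0 <file>\n";
    open my $fh, '<:raw', $path or die "open $path: $!\n";
    while (read($fh, my $chunk, $CHUNK_SIZE)) {
        my $name = sha512_base64($chunk);
        $name =~ tr{+/}{-_};            # make the base64 name filesystem-safe
        print "$name\n";                # doubles as the chunk's pool filename
    }
    close $fh;

Since compression is applied after splitting, two files that share an
unchanged 2 MB region dedup to the same chunk no matter how the rest of the
file differs.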

Pool metadata
=============

Each backup consists of a number of databases, one for each share. Each
database contains a complete list of files and for each file the file
attributes and a list of sha512 hashes that together describe the contents.

These databases are of the ‘write once, read many times’ variety, similar
to djb's cdb-databases or iso9660 filesystems. This part is already
implemented:

https://git.fruit.je/hardhat
https://git.fruit.je/hardhat-perl

Some informal estimates based on real-world data indicate that files take
up about 100 bytes of metadata on average in this format.
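
(For readers who haven't used this kind of format: the idea is the same as
djb's cdb. The snippet below uses CDB_File from CPAN purely to illustrate
the write-once, indexed-lookup model; the actual format is hardhat, and the
record layout here is invented for illustration only:)

    use CDB_File;

    # Build the per-share database: one record per file, keyed by path,
    # holding the attributes plus the list of chunk hashes.
    my $cdb = CDB_File->new('share.cdb', 'share.cdb.tmp')
        or die "CDB_File->new: $!\n";
    $cdb->insert('/etc/passwd',
        join("\0", 'mode=0644', 'uid=0', 'gid=0', 'size=1834',
                   'mtime=1344230400', 'chunks=...'));
    $cdb->finish or die "finish: $!\n";   # renames the tmp file into place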

Garbage collection
==================

The list of used hashes is collected from each backup by scanning the
databases for each share. These are collected into a list for the host they
belong to and the lists for each host are collected again into a global
list.

The actual garbage collection is then simply a matter of enumerating all
files in the chunk pool and checking against the global list whether each
chunk is still needed.

This part is already implemented:

https://git.fruit.je/hashlookup-perl

These hash lists need to be sorted and merged so that lookups will be
efficient. That part is already implemented as well, but doesn't yet have a
git repository of its own.
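
In pseudo-Perl, and assuming for simplicity that the merged list fits in
memory (the sorted flat files exist precisely so that it doesn't have to),
the collection step is essentially:

    use File::Find;

    my ($used_list, $pool_dir) = @ARGV;

    # Load the merged list of hashes referenced by any backup.
    my %used;
    open my $fh, '<', $used_list or die "open $used_list: $!\n";
    while (my $hash = <$fh>) {
        chomp $hash;
        $used{$hash} = 1;
    }
    close $fh;

    # Walk the pool; the filename of each chunk is its hash, so anything
    # not referenced can simply be unlinked.
    find(sub {
        return unless -f $_;
        return if $used{$_};
        unlink $_ or warn "unlink $File::Find::name: $!\n";
    }, $pool_dir);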

Incremental backups
===================

Incremental backups use a previous backup as a reference. Any files that
have the same timestamp+size are not transferred from the client but
instead the file metadata is copied from the previous backup. That means
that incremental backups are complete representations of the state of the
client. The reference backup is not needed to browse or restore files.
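
The per-file decision might look roughly like this (a sketch; the attribute
names and the transfer_and_chunk() helper are made up for illustration):

    sub backup_file {
        my ($path, $client_attr, $reference) = @_;

        my $ref_attr = $reference->{$path};
        if ($ref_attr
            && $ref_attr->{mtime} == $client_attr->{mtime}
            && $ref_attr->{size}  == $client_attr->{size}) {
            # Unchanged: copy the previous entry (attributes and chunk hash
            # list), so this backup still fully describes the client.
            return { %$ref_attr };
        }

        # Changed or new: transfer the file, chunk it, record the hashes.
        my @chunks = transfer_and_chunk($path);
        return { %$client_attr, chunks => \@chunks };
    }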

Integration
===========

The above ideas and code need to be integrated into BackupPC. I've created
a git repository for that:

https://git.fruit.je/backuppc

but the code in that repository is pretty much guaranteed to not work at
all, for now.

My next target is to create BackupPC::Backup::Reader and
BackupPC::Backup::Writer classes similar to the poolreader and poolwriter
classes. The reader class might even get a variant that can read v3
backups; useful for migration scenarios.

===

So, where to go from here? I'd love to hear from Craig what he thinks of
all this. As far as I'm aware he has not started work on 4.0 yet, so I'll
just take the liberty to continue tinkering with the above ideas in the
meanwhile. :)

And of course, comments/feedback are more than welcome.

Kind regards,
--
Wessel Dankers <***@fruit.je>

[0]
http://sourceforge.net/mailarchive/message.php?msg_id=27140174
http://sourceforge.net/mailarchive/message.php?msg_id=27140175
http://sourceforge.net/mailarchive/message.php?msg_id=27140176
Les Mikesell
2012-08-06 18:05:53 UTC
On Mon, Aug 6, 2012 at 9:46 AM, Wessel Dankers
Post by Wessel Dankers
The ideas overlap to a limited extent with the ideas[0] that Craig posted
to this list. For instance, no more hardlinks, and garbage collection is
done using flat-file databases. Some things are quite different. I'll try
to explain my ideas here.
Personally I think the hardlink scheme works pretty well up to about
the scale that I'd want on a single machine and you get a badly needed
atomic operation with links more or less for free. If you are going
to do things differently, wouldn't it make sense to use one of the
naturally distributed scalable databases (bigcouch, riak, etc.) for
storage from the start since anything you do is going to involve
re-inventing the atomic operation of updating a link or replacing it
and the big win would be making this permit concurrent writes from
multiple servers?
--
Les Mikesell
***@gmail.com
Wessel Dankers
2012-08-07 08:36:10 UTC
Hi Les,
Post by Les Mikesell
On Mon, Aug 6, 2012 at 9:46 AM, Wessel Dankers
Post by Wessel Dankers
The ideas overlap to a limited extent with the ideas[0] that Craig posted
to this list. For instance, no more hardlinks, and garbage collection is
done using flat-file databases. Some things are quite different. I'll try
to explain my ideas here.
Personally I think the hardlink scheme works pretty well up to about
the scale that I'd want on a single machine and you get a badly needed
atomic operation with links more or less for free.
Adding a chunk can be done atomically using a simple rename(). Removing a
chunk can be done atomically using unlink(). The only danger lies in
removing a chunk that is still being used (because there's a backup still
in progress whose chunks aren't being counted yet by the gc procedure). The
simplest way to prevent that is to grant exclusive access to the gc
process. Note that hardlinks do not prevent this race either. It's a
problem we need to solve anyway.
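
For adding a chunk that would look roughly like this (sketch only; the
temp-file naming is just an assumption):

    use File::Temp qw(tempfile);

    sub add_chunk {
        my ($pool_dir, $hash, $data) = @_;
        my $final = "$pool_dir/$hash";
        return if -e $final;    # already pooled: dedup for free

        # Write to a temp file in the same directory, then rename() it.
        # Readers see either the complete chunk under its hash, or nothing.
        my ($fh, $tmp) = tempfile('chunk-XXXXXX', DIR => $pool_dir);
        binmode $fh;
        print {$fh} $data or die "write $tmp: $!\n";
        close $fh          or die "close $tmp: $!\n";
        rename $tmp, $final or die "rename $tmp: $!\n";
    }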

Craig lists a couple of reasons to abandon the hardlink scheme in
http://sourceforge.net/mailarchive/message.php?msg_id=27140176
Post by Les Mikesell
If you are going to do things differently, wouldn't it make sense to use
one of the naturally distributed scalable databases (bigcouch, riak,
etc.) for storage from the start since anything you do is going to
involve re-inventing the atomic operation of updating a link or replacing
it and the big win would be making this permit concurrent writes from
multiple servers?
Using something like ceph/rados for storing the chunks could be interesting
at some point but I'd like to keep things simple for now.

cheers,
--
Wessel Dankers <***@fruit.je>
Tobias Stroh
2012-08-07 14:07:15 UTC
Hi,

a few observations of my own, not to be taken too seriously:

1) Block level deduplication

There are already a lot of filesystems/filesystem layers in fuse (such as
ZFS, lessfs, ...) which do this. This is often more efficient than
rolling your own solution, and is well abstracted.
In my opinion it does not make sense to do block level deduplication in
the application layer, except if you do it on the client side to save
bandwidth.

2) Database

I would suggest not abusing the file system as a database, and using
something like SQLite. This gives you features like transactions, atomic
operations, etc. and also improves speed.

3) v4

Is v4 published somewhere? What you are doing seems to be more like a
fork, if there are huge changes in v4 and you are working on v3.

Regards
Post by Wessel Dankers
Hi Les,
Post by Les Mikesell
On Mon, Aug 6, 2012 at 9:46 AM, Wessel Dankers
Post by Wessel Dankers
The ideas overlap to a limited extent with the ideas[0] that Craig posted
to this list. For instance, no more hardlinks, and garbage collection is
done using flat-file databases. Some things are quite different. I'll try
to explain my ideas here.
Personally I think the hardlink scheme works pretty well up to about
the scale that I'd want on a single machine and you get a badly needed
atomic operation with links more or less for free.
Adding a chunk can be done atomically using a simple rename(). Removing a
chunk can be done atomically using unlink(). The only danger lies in
removing a chunk that is still being used (because there's a backup still
in progress whose chunks aren't being counted yet by the gc procedure). The
simplest way to prevent that is to grant exclusive access to the gc
process. Note that hardlinks do not prevent this race either. It's a
problem we need to solve anyway.
Craig lists a couple of reasons to abandon the hardlink scheme in
http://sourceforge.net/mailarchive/message.php?msg_id=27140176
Post by Les Mikesell
If you are going to do things differently, wouldn't it make sense to use
one of the naturally distributed scalable databases (bigcouch, riak,
etc.) for storage from the start since anything you do is going to
involve re-inventing the atomic operation of updating a link or replacing
it and the big win would be making this permit concurrent writes from
multiple servers?
Using something like ceph/rados for storing the chunks could be interesting
at some point but I'd like to keep things simple for now.
cheers,
Tyler J. Wagner
2012-08-07 14:47:13 UTC
Post by Tobias Stroh
2) Database
I would suggest not abusing the file system as database and using something
like SQLite. This gives you features like transactions, atomic operations,
etc. and also improves speed.
I do not understand why people object to MySQL for this. If you're going to
use a database, use a good one. Sqlite is fine for small, simple apps doing
a small number of transactions. BackupPC is not that. MySQL is fast
(enough), reliable (enough), and well understood with tons of useful HOWTOs
and software tools (phpmyadmin, mytop, mysqldump, webmin).

Regards,
Tyler
--
"To have a child is to give fate a hostage."
-- John F. Kennedy
Les Mikesell
2012-08-07 15:02:29 UTC
Post by Tyler J. Wagner
Post by Tobias Stroh
2) Database
I would suggest not abusing the file system as database and using something
like SQLite. This gives you features like transactions, atomic operations,
etc. and also improves speed.
I do not understand why people object to MySQL for this. If you're going to
use a database, use a good one. Sqlite is fine for small, simple apps doing
a small number of transactions. BackupPC is not that. MySQL is fast
(enough), reliable (enough), and well understood with tons of useful HOWTOs
and software tools (phpmyadmin, mytop, mysqldump, webmin).
My experience with trying to scale mysql up has not been that great,
but it is from several years ago and maybe things have gotten better.
The suggestion for sqlite was probably just to handle metadata
operations (directory lists, mapping to chunks, reference counts of
chunks, etc.) rather than the file content itself. While it could
probably do that, you still need to tie the db transactions to the
filesystem operations in some sort of transaction-managed steps that
will at best slow everything down.
--
Les Mikesell
***@gmail.com
Tyler J. Wagner
2012-08-07 15:05:52 UTC
Post by Les Mikesell
My experience with trying to scale mysql up has not been that great,
but it is from several years ago and maybe things have gotten better.
The suggestion for sqllite was probably just to handle metadata
operations (directory lists, mapping to chunks, reference counts of
chunks, etc.) rather than the file content itself. While it could
probably do that, you still need to tie the db transactions to the
filesystem operations in some sort of transaction-managed steps that
will at best slow everything down.
I am definitely NOT suggesting using MySQL for file storage, only for
metadata operations. :)

Regards,
Tyler
--
"Copyright is a bargain, not property. We agreed not to copy because
they agreed it would only be for a short period of time. They have broken
their end of the bargain; we are now breaking ours."
-- Russell Nelson
Tobias Stroh
2012-08-07 15:10:48 UTC
Hi,

I think it is a matter of weighing the ease of setting it up vs. the
power needed.

And if you are certain that you will never have more than one backuppc
server using one database, you might as well get rid of the network
overhead.

For simple queries it might even be faster than mysql, and for more
complicated ones PostgreSQL would be better, I think. Note that there
are recent developments: the new WAL mode makes writes more efficient,
and you can use BerkeleyDB as an alternative backend.
And administration should not be necessary anyway :).
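
(Just to make that concrete, switching an SQLite database to WAL mode is a
one-line pragma via DBI/DBD::SQLite; the database filename below is only an
example:)

    use DBI;

    # Enable SQLite's write-ahead log; readers no longer block the writer.
    my $dbh = DBI->connect('dbi:SQLite:dbname=backuppc-meta.db', '', '',
                           { RaiseError => 1, AutoCommit => 1 });
    $dbh->do('PRAGMA journal_mode=WAL');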

And yes: saving large binary blobs in databases is generally a bad idea
(it is written twice, for one thing).

Regards
Post by Tyler J. Wagner
Post by Tobias Stroh
2) Database
I would suggest not abusing the file system as database and using something
like SQLite. This gives you features like transactions, atomic operations,
etc. and also improves speed.
I do not understand why people object to MySQL for this. If you're going to
use a database, use a good one. Sqlite is fine for small, simple apps doing
a small number of transactions. BackupPC is not that. MySQL is fast
(enough), reliable (enough), and well understood with tons of useful HOWTOs
and software tools (phpmyadmin, mytop, mysqldump, webmin).
Regards,
Tyler
Tyler J. Wagner
2012-08-07 15:20:55 UTC
Post by Tobias Stroh
I think it is a matter of weighting the ease of setting it up vs. the
power needed.
I think that's only an argument for non-packaged installation methods.
"apt-get install backuppc" would produce a working install, database
included, on a fresh system. I don't see why the rpm package couldn't do the same.
Post by Tobias Stroh
And if you are certain, that you will never have more then one backuppc
server using one database, you might as well get rid of the network
overhead.
Network overhead via a local socket is tiny. Compared to loading and
unloading sqlite databases, MySQL is a huge gain.
Post by Tobias Stroh
For simple queries it might even be faster than mysql and for more
complicated ones PostgreSQL would be better, I think.
Yes, but in modern installations MySQL, like BackupPC, is very easy to set
up and use. Package managers script actions to create DBs and users, for
instance. Most of us know that PostgreSQL is theoretically better, but for
most jobs we still just install MySQL because it's good enough.

That's an argument to improve PostgreSQL package management, as well.

Regards,
Tyler
--
"Copyright is a bargain, not property. We agreed not to copy because
they agreed it would only be for a short period of time. They have broken
their end of the bargain; we are now breaking ours."
-- Russell Nelson
Les Mikesell
2012-08-07 15:35:33 UTC
Post by Tyler J. Wagner
Network overhead via local socket is tiny. Compared to loading and
unloading sqlite databases, MySQL is a a huge gain.
Loading? Why would you stop/start backuppc any more often than your
mysql daemon?
Post by Tyler J. Wagner
Yes, but in modern installations MySQL, like BackupPC, is very easy to set
up and use. Package managers script actions to create DBs and users, for
instance. Most of us know that PostgreSQL is theoretically better, but for
most jobs we still just install MySQL because it's good enough.
Another thing to consider is the difficulty of moving your
installation from one machine to another (probably with a different or
upgraded OS, possibly with a different CPU type) without losing
history, or making duplicate copies for offsite storage. Having to
deal with a separate daemon and a separately-changing data format adds
complications for not much benefit unless it also lets you scale up or
gives you redundancy.
--
Les Mikesell
***@gmail.com
Tyler J. Wagner
2012-08-07 15:42:24 UTC
Post by Les Mikesell
Post by Tyler J. Wagner
Network overhead via local socket is tiny. Compared to loading and
unloading sqlite databases, MySQL is a a huge gain.
Loading? Why would you stop/start backuppc any more often than your
mysql daemon?
Sorry, I misunderstood. I thought the design called for multiple small
sqlite databases, not just one that's kept open. If that's the case, it
makes more sense.

Except then you have the problem of when that DB is written out, and how to
dump it for backup or moving. As you later said, you get easy redundancy
with MySQL.
Post by Les Mikesell
Another thing to consider is the difficulty of moving your
installation from one machine to another (probably with a different or
upgraded OS, possibly with a different CPU type) without losing
history, Or making duplicate copies for offsite storage. Having to
deal with a separate daemon and a separately-changing data format adds
complications for not-much benefit unless it also lets you scale up or
gives you redundancy.
That's not a problem for anyone who has ever moved a website. You move
files independently from the database. For a simple move or backup, it's
easy to script mysqldump+rsync+mysql.

Apropos (or perhaps ironic) to this discussion, all of my servers running
MySQL run webmin, and so have MySQL backups made daily. This gets grabbed
by BackupPC daily, so I have a reasonably stable DB state in BackupPC.

Regards,
Tyler
--
"Copyright is a bargain, not property. We agreed not to copy because
they agreed it would only be for a short period of time. They have broken
their end of the bargain; we are now breaking ours."
-- Russell Nelson
Les Mikesell
2012-08-07 16:19:39 UTC
Post by Tyler J. Wagner
Except then you have the problem of when that DB is written out, and how to
dump it for backup or moving. As you later said, you get easy redundancy
with MySQL.
Easy is probably the wrong word there. Possible, maybe. And, unless
you put the content in the db, you still have to deal with consistency
with the filesystem.
Post by Tyler J. Wagner
Post by Les Mikesell
Another thing to consider is the difficulty of moving your
installation from one machine to another (probably with a different or
upgraded OS, possibly with a different CPU type) without losing
history, Or making duplicate copies for offsite storage. Having to
deal with a separate daemon and a separately-changing data format adds
complications for not-much benefit unless it also lets you scale up or
gives you redundancy.
That's not a problem for anyone who has ever moved a website. You move
files independently from the database. For a simple move or backup, it's
easy to script mysqldump+rsync+mysql.
Umm, no. That means your web site is down for the duration or the
backup is inconsistent. Web sites that scale would already be running
a distributed DB.
Post by Tyler J. Wagner
Apropos (or perhaps ironic) to this discussion, all of my servers running
MySQL run webmin, and so have MySQL backups made daily. This gets grabbed
by BackupPC daily, so I have a reasonably stable DB state in BackupPC.
You can do that without webmin (just ssh it in the pre-user command),
and I have some like that too. It works, but it's not a great fit for
backuppc since even small changes in the file defeat pooling.
--
Les Mikesell
***@gmail.com
Les Mikesell
2012-08-07 15:25:54 UTC
Post by Tobias Stroh
I think it is a matter of weighting the ease of setting it up vs. the
power needed.
And if you are certain, that you will never have more then one backuppc
server using one database, you might as well get rid of the network
overhead.
But what is wrong with the hardlink model in that case? Throw a disk
with fast seeks at it or one of those hybrid SSD models if you just
need a tiny bit more speed. And a lot of RAM, of course. Then
again, who has problems with network overhead these days?
Post by Tobias Stroh
For simple queries it might even be faster than mysql and for more
complicated ones PostgreSQL would be better, I think. Note that there
are recent developements: The new WAL mode makes writes more efficient,
and you can use BerkeleyDB as an alternative backend.
And administration should not be necessary anyways :).
Ummm, yeah - what could possibly go wrong?
Post by Tobias Stroh
And yes: Saving large binary blobs in databases is generally a bad idea
(It is written twice for one thing).
But, like most things, there are tradeoffs to consider. For example,
if you did put everything in a database, you could use the replication
it provides for a 2nd copy - which is about the worst issue with the
current model. If you use a naturally replicated/distributed model
you get fault tolerance and the ability to run multiple servers for
free.
--
Les Mikesell
***@gmail.com
Bill Broadley
2012-08-07 22:55:02 UTC
Post by Tyler J. Wagner
I do not understand why people object to MySQL for this.
I think one of the big attractions to backuppc is the ease of getting it
going.
Post by Tyler J. Wagner
If you're going to
use a database, use a good one.
I'd expect backuppc's usage to be very simple. Basically metadata ->
sha512 mappings.
Post by Tyler J. Wagner
Sqlite is fine for small, simple apps doing
a small number of transactions.
Having written and benchmarked code for doing backuppc stuff recently I
wouldn't agree. Do you think that sqlite operations on metadata are
likely to be a bottleneck for normal backuppc type operations?

Seems like 99.9% of the use would be:

  select * from checksums where checksum = <checksum>;

Or:

  insert into files (metadata) values (<metadata>);
  insert into checksums (file, checksum) values (<file>, <checksum>);

I'd be pretty surprised if sqlite were the limiting factor for anything
like that.
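
For the sake of argument, the whole schema could be as small as something
like this (created here via DBI/DBD::SQLite; the table and column names are
just made up for illustration, not from any actual backuppc code):

    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=metadata.db', '', '',
                           { RaiseError => 1 });

    # One row per file: its attributes, plus the ordered list of chunk
    # checksums that make up its contents in the checksums table.
    $dbh->do(q{
        CREATE TABLE IF NOT EXISTS files (
            id       INTEGER PRIMARY KEY,
            path     TEXT NOT NULL,
            metadata BLOB NOT NULL       -- mode, uid, gid, size, mtime, ...
        )
    });
    $dbh->do(q{
        CREATE TABLE IF NOT EXISTS checksums (
            file     INTEGER NOT NULL REFERENCES files(id),
            seq      INTEGER NOT NULL,   -- chunk order within the file
            checksum TEXT    NOT NULL,   -- base64 sha512 of the chunk
            PRIMARY KEY (file, seq)
        )
    });
    $dbh->do('CREATE INDEX IF NOT EXISTS checksums_by_hash
              ON checksums (checksum)');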
Post by Tyler J. Wagner
BackupPC is not that. MySQL is fast
(enough), reliable (enough), and well understood with tons of useful HOWTOs
and software tools (phpmyadmin, mytop, mysqldump, webmin).
Right, sure. It is however very complex. So in a disaster, your
important server is down... someone is yelling at you to get your files
back....

Do you want to copy over a perl script that looks at sqlite and has an
uber-simple restore.pl command? Or do you want to first have to restore
mysql contents, permissions, lock files, my.cnf, recover the root
password, get it running, then try to restore?

Sure, I'm all for a SQL agnostic layer, but I don't see why the default
should be anything but something simple/trivial/invisible to the user
like sqlite.
Bill Broadley
2012-08-07 22:43:14 UTC
Post by Tobias Stroh
Hi,
1) Block level deduplication
There are already a lot of filesystem/filesystem layers in fuse (such as
ZFS, lessfs, ...) which do this.
True.
Post by Tobias Stroh
This is often more efficient then
rolling an own solution and is well abstracted.
Dunno, I've not seen particularly impressive numbers from existing
solutions. Sure ZFS does it... but you need a fair bit of RAM per unit of
storage. It also results in much more random IO. In general I'm not
sold on the somewhat better storage efficiency at the cost of more
random IO. After all, another TB of disk is cheap. Adding
significantly more random IO is not.

Additionally, making backuppc depend on fuse + zfs/lessfs/whatever seems
worrisome. Sure, having a "do not do dedupe" flag makes sense. Pushing
dedupe off to the filesystem for all users sounds like a nightmare of
support/complaining users/strange issues. After all, dedupe is (IMO) a
leading backuppc feature; I likely would have used something else if I
had to do dedupe myself.
Post by Tobias Stroh
In my opinion it does not make sense to do block level deduplication in
the application layer, except if you do it on the client side to safe
bandwidth.
I like file level dedupe: fast, easy, huge storage win, simple.
Imagine a disaster recovery situation. With file level dedupe you would
just need the checksum of the blob you're interested in, which could
potentially be a sqlite query from an ISO or similar. Something like
cp /pool/<sha512> ~/myImportantData

Or you could have a much more complex time figuring out what the checksum
of each block is, reading them, and concatenating them together.

Seems like a fair bit of unnecessary complexity that in most cases is
not going to be a huge storage efficiency win and is going to be slower.
Les Mikesell
2012-08-07 14:53:55 UTC
On Tue, Aug 7, 2012 at 3:36 AM, Wessel Dankers
Post by Wessel Dankers
Post by Les Mikesell
Personally I think the hardlink scheme works pretty well up to about
the scale that I'd want on a single machine and you get a badly needed
atomic operation with links more or less for free.
Adding a chunk can be done atomically using a simple rename(). Removing a
chunk can be done atomically using unlink().
That gets you a chunk of data with some random name on disk. You
also have to maintain a list of such chunks that are relevant to a
particular file and a reference count, and you have to do that
atomically without any common tool that provides atomic operations.
You get the reference and count for free with a filesystem hardlink.
Post by Wessel Dankers
The only danger lies in
removing a chunk that is still being used (because there's a backup still
in progress whose chunks aren't being counted yet by the gc procedure). The
simplest way to prevent that is to grant exclusive access to the gc
process. Note that hardlinks do not prevent this race either. It's a
problem we need to solve anyway.
Hardlinks maintain the reference count atomically. The equivalent
problem remaining there is that there is no single atomic operation to
check that the link count is one and remove it.
Post by Wessel Dankers
Craig lists a couple of reasons to abandon the hardlink scheme in
http://sourceforge.net/mailarchive/message.php?msg_id=27140176
And yet, he hasn't released something better. It will be at least
non-trivial... And you'll need a tool equivalent to fsck to check
for corruption of your references and counts.
Post by Wessel Dankers
Post by Les Mikesell
If you are going to do things differently, wouldn't it make sense to use
one of the naturally distributed scalable databases (bigcouch, riak,
etc.) for storage from the start since anything you do is going to
involve re-inventing the atomic operation of updating a link or replacing
it and the big win would be making this permit concurrent writes from
multiple servers?
Using something like ceph/rados for storing the chunks could be interesting
at some point but I'd like to keep things simple for now.
For me, 'simple' means building something on top of the best work
someone else has already done. For a single machine/filesystem you
might as well just use a compressing, de-dup'ing filesystem and ignore
the whole chunking/linking mess. And in fact, I wonder how it would
work to just remove those operations from backuppc and leave the rest
as is. But, I still think the next level in scaling would be
easiest on top of a naturally distributed and replicated database.
--
Les Mikesell
***@gmail.com
Bill Broadley
2012-08-07 22:32:47 UTC
Wessel, great to hear someone is tinkering with improving backuppc.

I've spent a fair bit of time thinking about backuppc and have pondered
writing something from scratch vs. trying to improve it. I'm quite
fond of backuppc because of:
* VERY nice web interface
* Multiple restore methods, nice browsing, etc.
* Fairly robust
* Dedupe
* compatible with standard client side tools like rsync/tar
* Fairly painless to backup 10-100 windows/linux/osx machines.

My main complaints:
* the pool is hard to scale in size or performance
* It's hard to do offsite backups (all backups in 2 or more places)
* management issues with how big a pool can be, how hard it is to
replicate, and how hard it is to fsck (depending on size and
filesystem of course).

Especially these days when 36x3TB isn't hard to fit in a server. Nor
are multiple gbit/10G/40G network connections.

2nd on my wish list is encryption; ideally, people backing up files from
their client wouldn't have to trust the admin not to violate their privacy.

To this end I've been writing some code to:
* Keep a client-side sqlite database for metadata and file -> blob
  checksum mappings.
* Use all available client-side CPU cores (even my $200 tablet has 4)
  for compression, encryption, and checksums. Keeping up with a GigE
  network connection is non-trivial.
* Implement a network protocol (using protobufs) for a dedupe-enabled
  upload to the server.
* The protocol allows for "I have these new encrypted blobs, server: which
  ones don't you have?".

The above code is written, benchmarked, tested, but certainly not
production ready.
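
(To illustrate the shape of that last exchange, and nothing more: this is
not the actual protobuf protocol, but the server side of the negotiation
boils down to a set difference against the pool.)

    # Illustration only: given the blob checksums a client announces,
    # return the ones the server doesn't have yet, so the client uploads
    # only those.
    sub missing_blobs {
        my ($pool_dir, @announced) = @_;   # base64 sha512 names
        return grep { !-e "$pool_dir/$_" } @announced;
    }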

What isn't written is the ability to let servers trade encrypted blobs
to ensure replication goals are met. Ideally it would allow the use
case of 3 sysadmins in an organization trading 10TB so that they can
all have 3 copies of their backups. Assuming GigE connections and
reasonably low churn, it seems feasible.
Post by Wessel Dankers
The ideas overlap to a limited extent with the ideas[0] that Craig posted
I strongly agree with Craig's identification of the weaknesses and his
proposed solution of content addressable storage as a replacement of
hard links.
Post by Wessel Dankers
to this list. For instance, no more hardlinks, and garbage collection is
done using flat-file databases. Some things are quite different. I'll try
to explain my ideas here.
I cringe at the thought of a flat-file database. Seems little reason not
to use sqlite or similar, and then let the folks that prefer mysql,
postgres, or the sql flavor of the day implement whatever is needed.

Most importantly it should be reliable, and not freak out if the sum of
the metadata is larger than RAM. IMO sqlite or similar is the easiest way
to do that.
Post by Wessel Dankers
Observations
============
1) I/O: writes are fast, reads are slow. Random reads are very slow.
Writing can be done asynchronously. Even random writes can be elevatored to
the platters fairly efficiently. Reading, on the other hand, is blocking to
the process doing it. Running many processes in parallel can alleviate the
random-read bottleneck a little but it's still best to do sequential reads
whenever possible.
Not sure in what context you are discussing this. It's not an issue
client side (IMO), and server side you mostly want fast metadata reads,
and fast blob writes. After all I'd expect 100s of backups (writes to
the pool) for every read (a restore).

Ideally I could have N disks and put 1/N of the sha512 blobs on each.
Or maybe 2/N if I want to survive a single disk failure. Much like how
the Hadoop file system works.

Of course even better would be a network service like say the hadoop
distributed file system or similar blob/object store. Sadly I don't
know of a nice robust and easy to use content addressable storage system
that would be ideal for backuppc.
Post by Wessel Dankers
2) CPU and RAM are cheap.
BackupPC parallelizes well and for larger installations 12-core or even
24-core systems can be had for reasonable prices. Memory is dirt cheap.
Agreed on cores and ram being cheap. Ideally a hardlinked pool
replacement would allow:
* file dedupe
* scaling performance and storage with additional disks
* Easy ability to check integrity

So basically it sounds like the ideal system would be CAS:
http://en.wikipedia.org/wiki/Content-addressable_storage
Post by Wessel Dankers
File data
=========
All files are to be split into chunks of (currently) 2 megabytes. These
chunks are addressed by their sha512 hashes (that is, the filename for each
chunk is simply the base64 encoded hash). Any compression is applied after
splitting. This provides dedup even for large files that change or
otherwise differ only slightly.
Seems low; why not 10-100 times that? More sequential I/O is better,
right? In most cases I've seen where block level dedupe would be a big
win, they were doing something unsafe in the first place, like say
backing up a large live database (MySQL, ZODB, or Exchange).

The main reason for conserving seeks is that in most use cases churn is
low, so most of the work is identifying what pieces need to be uploaded
from a client and if the server already has any of those pieces.
Post by Wessel Dankers
The use of sha512 will eliminate collisions completely, and provide safe
I suspect about half the people will argue that's too much, and about
half will argue that's too little. Ideally this would be settable. In
particular there are some promising candidates that are likely to be much
faster than sha512, yet still very collision resistant... thus the
sha-3 project.
Post by Wessel Dankers
dedup without expensive whole-file comparisons. This obviates the need for
a lot of disk reading.
I don't follow how sha512/CAS avoids the need for a lot of disk reading.

So say a 1GB file on the client changes. With file level dedupe the
client basically has to say:
"Hey server, do you have checksum foo?"

The server then does a single flat file or sql lookup and says yes or no.

With 2MB blocks you have to do the above 512 times, right?

Additionally any store or retrieve will have to do 512 seeks. Keep in
mind that an average disk these days does 120MB/sec or so, and even a
small RAID can do many times that.

In either dedupe case, on the client side you either have to keep a cache
of (timestamp, filename) -> hash and trust the timestamps, or you have
to read the entire file and compute checksums.

Seems like the biggest win would be to have the client send you the
complete file checksum. I think Craig mentioned some magic to be
checksum compatible with rsync using some particular flag or another.
Wessel Dankers
2012-08-08 14:36:51 UTC
Hi Bill,
Post by Bill Broadley
* Keep client side sqlite database for metadata and file -> blob
checksum mappings.
* Use all available client side CPU cores (even my $200 tablet has 4)
for compression, encryption, and checksum. Keeping up with a GigE
network connection is non-trivial.
* implement a network protocol (using protobufs) for a dedupe enabled
upload to server
* Protocol allows for "I have these new encrypted blobs, server which
ones don't you have?".
The above code is written, benchmarked, tested, but certainly not
production ready.
This could be very interesting as an extra protocol next to rsync/smb/etc.
Post by Bill Broadley
I cringe at the thought of a flat-file database. Seems little reason to
no use sqllite or similar, then let the folks that prefer mysql,
postgres, sql flavor of the day implement whatever is needed.
I'm starting to regret ever using the word ‘database’. What I've written is
simply something that outputs a list of files in a backup, with an index
for quick lookups. It is written once and doesn't allow write access
afterwards. It's not a database in the sense that sqlite or mysql are.
Post by Bill Broadley
Post by Wessel Dankers
1) I/O: writes are fast, reads are slow. Random reads are very slow.
Not sure in what context you are discussing this.
The context of a server that's trying to backup a large number of clients.
Post by Bill Broadley
It's not an issue
client side (IMO), and server side you mostly want fast metadata reads,
and fast blob writes. After all I'd expect 100s of backups (writes to
the pool) for every read (a restore).
Server side you just want to spool data to disk, ideally without reading
anything (as that's blocking).
Post by Bill Broadley
Ideally I could have N disks and put 1/N of the sha512 blobs on each.
By ditching hardlinks that's exactly what you'll be able to do.
Post by Bill Broadley
Post by Wessel Dankers
File data
=========
All files are to be split into chunks of (currently) 2 megabytes. These
chunks are addressed by their sha512 hashes (that is, the filename for each
chunk is simply the base64 encoded hash). Any compression is applied after
splitting. This provides dedup even for large files that change or
otherwise differ only slightly.
Seems low, why not 10-100 times that? More sequential I/O is better
right? In most cases I've seen where block level dedupe would be a big
win they were doing something unsafe in the first place. Like say for
instance backing up a large live database. (like Mysql, Zodb, or exchange).
I wanted the blocks to fit in RAM easily; that simplifies the code a lot.
But you may be right, perhaps the chunks could be a lot bigger.
Post by Bill Broadley
I don't follow how sha512/CAS avoids the need for a lot of disk reading.
If you use a small hash (like md4) you need to compare the files before
concluding it's not a collision. This is a lot of disk reading. With sha512
you'd be able to forgo that.
Post by Bill Broadley
So say a 1GB file on the client changes. With file level dedupe the
"Hey server, do you have checksum foo?"
The server then does a single flat file or sql lookup and says yes or no.
With 2MB blocks you have to do the above 512 times, right?
There's no reason why we can't store a per-file hash in the metadata as
well. But with the smaller chunks we can pinpoint the changed bits without
the server having to read anything but a few kB of metadata.
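
Roughly, that comparison would be (sketch only, not code from the
repository):

    # Compare the chunk hash list stored for the reference backup with a
    # freshly computed one; only the chunks at the returned indices need
    # to be transferred and stored.
    sub changed_chunks {
        my ($old_hashes, $new_hashes) = @_;   # array refs of base64 sha512
        my @changed;
        for my $i (0 .. $#$new_hashes) {
            my $old = $i <= $#$old_hashes ? $old_hashes->[$i] : '';
            push @changed, $i if $old ne $new_hashes->[$i];
        }
        return @changed;
    }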

cheers,
--
Wessel Dankers <***@fruit.je>
Les Mikesell
2012-08-08 14:46:36 UTC
On Wed, Aug 8, 2012 at 9:36 AM, Wessel Dankers
Post by Wessel Dankers
Post by Bill Broadley
With 2MB blocks you have to do the above 512 times, right?
There's no reason why we can't store a per-file hash in the metadata as
well. But with the smaller chunks we can pin-point the changed bit without
the server having to read anything but a few k's of metadata.
Can you make that mesh with rsync cached checksums? And would they be
reliable enough to pool the blocks?

--
Les Mikesell
***@gmail.com
