rsync performance on large files strongly depends on file's (dis)similarity
Thomas Knauth <thomas.knauth <at> gmx.de>
2014-04-11 11:35:44 GMT
I've found this post on rsync's expected performance for large files:
I have a related but different observation to share: with files in the
multi-gigabyte range, I've noticed that rsync's runtime also depends
on how much the source and destination diverge, i.e., synchronization
is faster when the files are similar. However, this is not only
because less data must be transferred.
For example, on an 8 GiB file with 10% updates, rsync takes 390
seconds. With 50% updates, it takes about 1400 seconds, and at 90%
updates about 2400 seconds.
My current explanation, and it would be awesome if someone more
knowledgeable than me could confirm it, is this: with very large
files, we expect a certain number of false alarms, i.e., cases where
the weak checksum matches but the strong checksum does not. With large
files that are very similar, a weak match is very likely to be
confirmed by a matching strong checksum. Conversely, with large files
that are very dissimilar, a weak match is much less likely to be
confirmed by the strong checksum, precisely because the files differ
so much. rsync then ends up computing lots of strong checksums that
never result in a match.
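To make the suspected effect concrete, here is a toy sketch of the weak/strong two-level matching. This is not rsync's actual code: the weak checksum below is a plain byte sum rather than rsync's Adler-32-style rolling sum, the block size and inputs are made up, and the window always advances byte by byte (real rsync skips ahead after a confirmed block). The point is only to show that every weak hit forces a strong-checksum computation, whether or not it is confirmed:

```python
import hashlib

BLOCK = 4  # toy block size; rsync uses much larger blocks

def weak(data):
    # Toy stand-in for rsync's rolling weak checksum.
    return sum(data) % 65536

def strong(data):
    # rsync uses MD5 (formerly MD4) as the strong checksum.
    return hashlib.md5(data).digest()

def sync_stats(old, new):
    """Return (weak_hits, strong_hits, strong_checksums_computed)."""
    # Index the receiver's blocks by weak checksum.
    table = {}
    for off in range(0, len(old) - BLOCK + 1, BLOCK):
        blk = old[off:off + BLOCK]
        table.setdefault(weak(blk), []).append(strong(blk))

    weak_hits = strong_hits = strong_computed = 0
    # Slide a window over the sender's file.
    for off in range(len(new) - BLOCK + 1):
        blk = new[off:off + BLOCK]
        candidates = table.get(weak(blk))
        if candidates is None:
            continue                     # weak miss: cheap, move on
        weak_hits += 1
        strong_computed += 1             # every weak hit costs a strong checksum
        if strong(blk) in candidates:
            strong_hits += 1             # confirmed match
    return weak_hits, strong_hits, strong_computed
```

With identical files every weak hit is confirmed, e.g. `sync_stats(b"abcdefgh", b"abcdefgh")` gives `(2, 2, 2)`. With a dissimilar file whose blocks happen to collide on the weak checksum (here `b"dcbahgfe"`, byte permutations of the original blocks), the same two strong checksums are computed but none is confirmed: `(2, 0, 2)`. The strong-checksum work is the same or worse, yet buys no matched data, which is the asymmetry described above.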
Is this a valid/reasonable explanation? Can someone else confirm this
relationship between rsync's computational overhead and the files'
(dis)similarity?