Donovan Baarda | 1 Jul 2002 01:19

Re: Block size optimization - let rsync find the optimal blocksize by itself.

On Sun, Jun 30, 2002 at 06:23:10PM +0200, Olivier Lachambre wrote:
[...]
>   Well, the first comment: during my work, I wanted to verify that the
> theoretical optimal block size sqrt(24*n/Q) given by Andrew Tridgell in his
> PhD thesis was actually the right one. When testing on randomly generated
> and modified files, I found that sqrt(78*n/Q) is the actual optimal block
> size. I tried to understand this by reading the whole thesis, then quite a
> lot of documentation about rsync, but I just can't figure out why the
> theoretical and experimental optimal block sizes differ so much. I _really_
> don't think it's coming from my tests; there must be something else. Maybe
> the rsync developers have changed some part of the algorithm. Also, even
> without data compression during the sync, rsync is always more efficient
> than it should be theoretically, between 1.5 and 2 times more efficient.
> Nobody will complain about that, but I'd be happy if someone would be kind
> enough to explain this to me.

I believe that the compression option turns on compression of transferred
data, including file lists, instruction streams etc. Even with compression
turned off, the miss data in the delta is still compressed.

Another thing to be aware of is when rsync compresses miss data, it also
compresses all the hit data to prime the compressor with context, and throws
away the compressed output for the hit data.
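
The priming idea described above can be sketched with Python's zlib module.
This is only an illustration: it uses a zlib preset dictionary (zdict) as a
stand-in for the matched ("hit") data, which is not necessarily how rsync
implements the trick, and the sample data is made up.

```python
import zlib


def compress_miss(miss: bytes, hit: bytes = b"") -> bytes:
    """Compress literal ("miss") data, optionally primed with matched ("hit") data.

    Assumption: a zlib preset dictionary stands in for rsync's priming step.
    """
    if hit:
        comp = zlib.compressobj(zdict=hit)
    else:
        comp = zlib.compressobj()
    return comp.compress(miss) + comp.flush()


def decompress_miss(blob: bytes, hit: bytes = b"") -> bytes:
    """Inverse of compress_miss; the receiver must hold the same hit data."""
    if hit:
        decomp = zlib.decompressobj(zdict=hit)
    else:
        decomp = zlib.decompressobj()
    return decomp.decompress(blob) + decomp.flush()


# Made-up sample: the receiver already has `hit`; only `miss` must be sent.
hit = b"the quick brown fox jumps over the lazy dog. " * 20
miss = b"the quick brown fox jumps over the lazy cat."

plain = compress_miss(miss)
primed = compress_miss(miss, hit)
print(len(miss), len(plain), len(primed))
```

The primed output is smaller because the compressor can refer back to data
the receiver already holds. The same effect would make literal data cheaper
than a model without compression predicts.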

Perhaps this "context compression" is affecting the optimal block size? 

>   Now the auto-optimization algorithm when updating many files at a time.
> Let's consider a set of files to be updated. We will consider only the files
> which have been changed since the last update (e.g. we can find the other
> ones by sending a MD5 sum for each file and trying to match it). We sync the
[...]

tridge | 1 Jul 2002 02:09

Re: Block size optimization - let rsync find the optimal blocksize by itself.

Olivier,

>   Well, the first comment: during my work, I wanted to verify that the
> theoretical optimal block size sqrt(24*n/Q) given by Andrew Tridgell in his
> PhD thesis was actually the right one. When testing on randomly generated
> and modified files, I found that sqrt(78*n/Q) is the actual optimal block
> size. I tried to understand this by reading the whole thesis, then quite a
> lot of documentation about rsync, but I just can't figure out why the
> theoretical and experimental optimal block sizes differ so much. I _really_
> don't think it's coming from my tests; there must be something else.

First off, you need to make sure you are taking into account the
conditions I mentioned for that optimal size to be correct. In
particular I assumed:

  If, for example, we assume that the two files are the same except for
  Q sequences of bytes, with each sequence smaller than the block size
  and separated by more than the block size from the next sequence

In practice there is no 'correct' model for real files, so I chose a
simple model that I thought would give a reasonable approximation
while being easy to analyse.

Also, you didn't take into account that the function I gave was for
the simpler version of rsync that I introduced in chapter 3. Later in
the thesis I discuss how s_s can be reduced without compromising the
algorithm (see 'Smaller Signatures' in chapter 4). That changes the
calculation of optimal block size quite a bit.
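
To see where a formula of the form sqrt(c*n/Q) can come from, here is a
small numerical sketch. It assumes a simplified cost model that is not
quoted from the thesis: sending signatures costs about c bytes for each of
the n/L blocks, and each of the Q differences costs one block of L bytes of
literal data. The constant c = 24 is chosen only because it reproduces the
formula discussed in this thread; the function names are hypothetical.

```python
import math


def transfer_cost(n: int, q: int, block: float, c: float = 24.0) -> float:
    """Approximate bytes on the wire under the assumed model.

    c * n / block : per-block overhead (signatures etc.), c bytes per block
    q * block     : one literal block re-sent for each of the q differences
    """
    return c * n / block + q * block


def optimal_block(n: int, q: int, c: float = 24.0) -> float:
    """Minimise transfer_cost: d/dL (c*n/L + q*L) = 0  =>  L = sqrt(c*n/q)."""
    return math.sqrt(c * n / q)


n, q = 10_000_000, 100  # made-up file size (bytes) and number of changes
best = optimal_block(n, q)
# Brute-force check over integer block sizes around the analytic optimum.
brute = min(range(64, 8192), key=lambda L: transfer_cost(n, q, L))
print(round(best), brute)
```

Under this model, a smaller per-block signature cost lowers c and so
shrinks the optimal block size, while anything that makes literal data
cheaper pushes it up. Calling optimal_block(n, q, c=78) gives the form
reported from the experiments.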

[...]

tridge | 1 Jul 2002 02:09

Re: Block size optimization - let rsync find the optimal blocksize by itself.

> I believe that the compression option turns on compression of transferred
> data, including file lists, instruction streams etc. Even with compression
> turned off, the miss data in the delta is still compressed.

nope!

Compression only compresses file data, not file lists etc. The file
lists are 'packed' with a simple fixed algorithm (finding common
filename prefixes for example). It would have been better to just use
deflate for the lot, but unfortunately I didn't do that.
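
As an illustration of packing a file list by common filename prefixes, here
is a minimal sketch. It is not rsync's actual wire format; the
(shared, suffix) representation, the function names, and the sample file
list are assumptions made for this example.

```python
import os


def pack_names(names):
    """Encode each name as (characters shared with the previous name, suffix)."""
    packed, prev = [], ""
    for name in names:
        shared = len(os.path.commonprefix([prev, name]))
        packed.append((shared, name[shared:]))
        prev = name
    return packed


def unpack_names(packed):
    """Rebuild the original list from (shared, suffix) pairs."""
    names, prev = [], ""
    for shared, suffix in packed:
        name = prev[:shared] + suffix
        names.append(name)
        prev = name
    return names


# Made-up sample list; sorted input keeps shared prefixes long.
files = ["src/io.c", "src/io.h", "src/lib/snprintf.c", "src/main.c"]
packed = pack_names(files)
print(packed)
```

A fixed scheme like this only exploits repetition between neighbouring
names, which is why a general compressor over the whole stream would
usually do better.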

If you don't use -z then deflate is not used at all, and neither is
the 'compression of unsent data' trick.

Cheers, Tridge

-- 
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html

jw schultz | 1 Jul 2002 02:45

Re: Block size optimization - let rsync find the optimal blocksize by itself.

I dislike flaming and I don't intend my comments as a flame,
but you have made some statements that at face value are
problematic.  Perhaps you could correct my misunderstanding.

On Sun, Jun 30, 2002 at 06:23:10PM +0200, Olivier Lachambre wrote:
> Hello,
>   Another French student on the rsync mailing list. I have been working on
> rsync this year for a school documentation project, and I would like first
> to comment on rsync block size optimization, and then to propose a way to
> make rsync choose the optimal block size by itself when updating a large
> number of files.
> 
>   Well, the first comment: during my work, I wanted to verify that the
> theoretical optimal block size sqrt(24*n/Q) given by Andrew Tridgell in his
> PhD thesis was actually the right one. When testing on randomly generated
> and modified files, I found that sqrt(78*n/Q) is the actual optimal block
> size. I tried to understand this by reading the whole thesis, then quite a
> lot of documentation about rsync, but I just can't figure out why the
> theoretical and experimental optimal block sizes differ so much. I _really_
> don't think it's coming from my tests; there must be something else. Maybe
> the rsync developers have changed some part of the algorithm. Also, even
> without data compression during the sync, rsync is always more efficient
> than it should be theoretically, between 1.5 and 2 times more efficient.
> Nobody will complain about that, but I'd be happy if someone would be kind
> enough to explain this to me.

Firstly, I'll give Andrew the benefit of the doubt.  His
track record has been very good. 

In the real world file generation and modification is not
[...]

Dave Dykstra | 1 Jul 2002 17:36

Re: rsync 2.5.5 and Mac OS X

I compile rsync on Mac OS X (I'm not sure of the OS X version, but uname -a
says it's Darwin 5.5) but haven't tried running it as a daemon.  I suggest
that you try to debug it further.  setgroups() is only called in one place,
in clientserver.c, as setgroups(0, NULL), and only if HAVE_SETGROUPS is
defined.  This call was added to rsync relatively recently as a security
fix.  The OS X setgroups man page says that the EINVAL error is only
supposed to occur if the first parameter is greater than NGROUPS_MAX.
Apparently it is also reported if the first parameter is zero.
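
One way to narrow this down is to issue the same kind of call in isolation
and look at the error code. The sketch below does that from Python:
os.setgroups([]) passes an empty group list, and treating that as close
enough to rsync's setgroups(0, NULL) to be a useful probe is an assumption.
It needs a POSIX system, and when run as root it really does clear the
process's supplementary groups.

```python
import errno
import os


def probe_empty_setgroups() -> int:
    """Try to clear the supplementary group list; return 0 or the errno seen."""
    try:
        os.setgroups([])  # empty list, so the underlying call gets a count of 0
    except OSError as exc:
        return exc.errno
    return 0


code = probe_empty_setgroups()
if code == 0:
    print("setgroups with an empty list succeeded")
else:
    # EPERM is expected when not running as root; EINVAL would match the report.
    print("setgroups failed:", errno.errorcode.get(code, code), os.strerror(code))
```

Running it as root on the affected machine would show whether the EINVAL
comes from the system call itself rather than from anything rsync does
around it.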

- Dave Dykstra

On Thu, Jun 27, 2002 at 02:51:40PM -0600, grbear <at> shaw.ca wrote:
> I've got mine set up in inetd.conf to be executed on a per-call request
> on port 873, as described in the man pages. The advantage of this method
> is that when the client is done and disconnects, the daemon quits,
> thereby freeing up the resources it was using. A word of warning if you
> want to try this route: you need to add an rsync tcp entry for port 873
> to Services with the NetInfo manager, in addition to editing the
> /etc/inetd.conf file.  You'll have to restart to get any changes to
> inetd.conf to take effect, as 'killall -HUP inetd' has no effect.
> 
> When run in the manner above, it is started up as the root user (as
> specified in inetd.conf).
> 
> ----- Original Message -----
> From: Catalino Cuadrado <ccuadrado <at> mail.wesleyan.edu>
> Date: Thursday, June 27, 2002 2:19 pm
> Subject: Re: rsync digest, Vol 1 #778 - 11 msgs
> 
[...]

Olivier Lachambre | 1 Jul 2002 18:29

Re: Block size optimization - let rsync find the optimal blocksize by itself.

At 17:09 30/06/2002 -0700, you wrote:
>Olivier,
>
>>   Well, the first comment: during my work, I wanted to verify that the
>> theoretical optimal block size sqrt(24*n/Q) given by Andrew Tridgell in his
>> PhD thesis was actually the right one. When testing on randomly generated
>> and modified files, I found that sqrt(78*n/Q) is the actual optimal block
>> size. I tried to understand this by reading the whole thesis, then quite a
>> lot of documentation about rsync, but I just can't figure out why the
>> theoretical and experimental optimal block sizes differ so much. I _really_
>> don't think it's coming from my tests; there must be something else.
>
>First off, you need to make sure you are taking into account the
>conditions I mentioned for that optimal size to be correct. In
>particular I assumed:
>
>  If, for example, we assume that the two files are the same except for
>  Q sequences of bytes, with each sequence smaller than the block size
>  and separated by more than the block size from the next sequence
>
>In practice there is no 'correct' model for real files, so I chose a
>simple model that I thought would give a reasonable approximation
>while being easy to analyse.

I did not explain my tests at all: I did not use real files, but a randomly
generated file into which I put 1-byte differences, each separated from
the next by much more than the block size.
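
A test input of this kind can be generated along the following lines. This
is a sketch, not the script used for the tests; the function name, the
seed, and the even spacing of the differences are assumptions.

```python
import random


def make_test_pair(n: int, q: int, block: int, seed: int = 0):
    """Return (old, new, positions): n random bytes, a copy with q one-byte
    changes, and the offsets of those changes.

    Changes are spaced n // q bytes apart, which must exceed the block size.
    """
    stride = n // q
    if stride <= block:
        raise ValueError("differences would be closer than one block size")
    rng = random.Random(seed)
    old = bytes(rng.getrandbits(8) for _ in range(n))
    new = bytearray(old)
    positions = [i * stride + stride // 2 for i in range(q)]
    for pos in positions:
        new[pos] ^= 0xFF  # flipping every bit guarantees the byte differs
    return old, bytes(new), positions


old, new, positions = make_test_pair(n=100_000, q=10, block=700)
diffs = [i for i in range(len(old)) if old[i] != new[i]]
print(len(diffs), min(b - a for a, b in zip(diffs, diffs[1:])))
```

Writing old and new to two files and syncing one onto the other with
different block sizes would then give the measured optimum for a known n
and Q.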

[...]

Stuart Inglis | 1 Jul 2002 22:33

Rsync woes with large numbers of files

Hi everyone,

I recently read a thread about the problems people are having with file
systems with a large number of files on them. We have an 80GB file system
with ~10 million files on it. Rsync runs out of memory on a machine with
512MB of RAM while (I assume) reading in the list of files to send.

To avoid this problem we process each of the 40 top level directories
one at a time (in a "for f in *" type loop) and this almost solves our
problems. Some of the top level directories themselves are too large, so
we need to put another for loop inside that directory to try and stop
rsync running out of memory.
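
The per-directory workaround can be sketched as below. The script only
builds one rsync command line per top-level directory and prints it; the
paths and the destination are placeholders, and -a is used here only as a
common choice of options, not necessarily the ones used on this server.

```python
import os
import shlex


def per_directory_commands(src_root: str, dest: str, options=("-a",)):
    """Build one rsync argv per top-level directory under src_root.

    Each run then only has to hold the file list of a single subtree.
    """
    commands = []
    for name in sorted(os.listdir(src_root)):
        path = os.path.join(src_root, name)
        if os.path.isdir(path) and not os.path.islink(path):
            # No trailing slash on the source: rsync recreates `name` under dest.
            commands.append(["rsync", *options, path, dest])
    return commands


if __name__ == "__main__":
    # Placeholder paths; replace with the real source tree and destination.
    for argv in per_directory_commands(".", "backuphost:/backup/"):
        print(" ".join(shlex.quote(arg) for arg in argv))
```

Files that sit directly in the top-level directory are not covered by this
loop and would need a separate run. For a subtree that is still too large,
the same function can be applied to that subtree, with the destination
adjusted to the matching path.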

My question is this: does rsync need to stat every single file in the
filesystem before it tries to sync a collection of files? Can it simply
keep a reasonable number of files on both sides of the connection (say,
1-10,000 files?) and transfer them in batches? 

On our server it takes >24 hours to sync the machine. This isn't so bad,
but it would be good to make rsync a little more robust, so that if we
create a whole bunch of files we know the rsync will still work. Currently
we have to rely on logs and keep updating the script to make sure
everything gets backed up.

Cheers
Stuart

PS: I'm using Linux. Does anyone know which filesystems support file
logging, so that I can simply rsync based on a file that contains a daily
list of changed files?
[...]

Olivier Lachambre | 2 Jul 2002 11:06

Re: Block size optimization - let rsync find the optimal blocksize by itself.

At 17:45 30/06/2002 -0700, you wrote:
>
>I dislike flaming and I don't intend my comments as a flame,
>but you have made some statements that at face value are
>problematic.  Perhaps you could correct my misunderstanding.

[...]
>>   Well, the first comment: during my work, I wanted to verify that the
>> theoretical optimal block size sqrt(24*n/Q) given by Andrew Tridgell in his
>> PhD thesis was actually the right one. When testing on randomly generated
>> and modified files, I found that sqrt(78*n/Q) is the actual optimal block
>> size. I tried to understand this by reading the whole thesis, then quite a
>> lot of documentation about rsync, but I just can't figure out why the
>> theoretical and experimental optimal block sizes differ so much. I _really_
>> don't think it's coming from my tests; there must be something else. Maybe
>> the rsync developers have changed some part of the algorithm. Also, even
>> without data compression during the sync, rsync is always more efficient
>> than it should be theoretically, between 1.5 and 2 times more efficient.
>> Nobody will complain about that, but I'd be happy if someone would be kind
>> enough to explain this to me.
>
>Firstly, I'll give Andrew the benefit of the doubt.  His
>track record has been very good. 
>

I think that the theoretical optimal block size sqrt(24*n/Q) given by Andrew
in his PhD thesis was indeed the right one for the first version of rsync.
Rsync has been deeply modified since then, as he explained yesterday, which
is why the former optimal block size is no longer accurate. That is not a
problem, but it is still interesting to know.
[...]

Olivier Lachambre | 2 Jul 2002 11:06

Re: Block size optimization - let rsync find the optimal blocksize by itself.

At 09:19 01/07/2002 +1000, you wrote:

>[...]
>This relies on optimal block size being related for a set of files. I'm not
>sure that this heuristic actually applies, and I don't know how much benefit
>this would buy for all the complexity it would add.
>

I think that many clients do not care about multiplying complexity by 2 or
5, even if the speedup rate is only multiplied by 1.1.

Olivier
_______

Olivier Lachambre
2, rue Roger Courtois
25 200 MONTBELIARD
FRANCE

e-mail : lachambre <at> club-internet.fr


Miroslaw Luc | 2 Jul 2002 15:23

Rsync: Segmentation fault

Rsync 2.5.5; transfer via ssh; sparc-sun-solaris2.5 (Ultra 1); gcc 2.8.1.
Every run of rsync on this box ends in a boundary violation (segmentation
fault). I have attached two strange examples. I have a few rsync core files
(100MB and above) that I can examine. I would be grateful for any help.
-Mirek

<------------------------------------------------------------------------->
received 966761 names
done
recv_file_list done
[...]
rsync: connection unexpectedly closed (21807465 bytes read so far)
rsync error: error in rsync protocol data stream (code 12) at io.c(150)
_exit_cleanup(code=12, file=io.c, line=150): about to call exit(12)

Core was generated by `/rsync/bin/rsync --server --sender -vvvlHogDtprRS
--partial --numeric-ids . /'.
Program terminated with signal 11, Segmentation fault.
[...]
(gdb) bt
#0  0x2d898 in dopr (buffer=0xef7ffcb8 <Address 0xef7ffcb8 out of bounds>,
    maxlen=1024, format=0x3c6d8 "rsync error: %s (code %d) at %s(%d)\n",
    args=0xef800114) at lib/snprintf.c:155
Cannot access memory at address 0xef7ffad4.
(gdb) p currlen
Cannot access memory at address 0xef7ffb04.
<------------------------------------------------------------------------->

<------------------------------------------------------------------------->
io timeout after 120 seconds - exiting
[...]

