Philip Hazel | 1 Sep 2005 15:55

Preliminary testing of a new Exim test suite

I have at last started re-implementing the Exim test suite in a way that 
should, with luck, make it portable and runnable on hosts other than my 
personal workstation. There is still a very long way to go, but I now 
have something that is worth trying, to see if the general principles 
are sound. Therefore, I am seeking testers. If you would like to help, 
please download

ftp://ftp.csx.cam.ac.uk/pub/software/email/exim/Testing/exim-testsuite-0.00.tar.bz2

and start with the README file. In order to run the test suite you need 
to have root access via sudo on your host. Oh, and Perl. Full details 
are in the README.

The tests assume the latest version of Exim - indeed they assume the 
latest 4.53 snapshot in some cases, so if you run the suite against an 
earlier version there are likely to be failures. There are some "hooks" 
in Exim to assist with testing, and these may well have to be changed to 
fit with the new scheme.

I'll be continuing work on this, and extending it to cover as much as I 
possibly can, but as it is so preliminary, the version will probably 
stay at 0.00 for a while. There are nearly 620 tests in the old suite, 
and I've only converted 51 so far.

Feedback is invited...

Thanks.
Philip


Michael Haardt | 1 Sep 2005 18:51

Randomising retry times?

Hello,

if a few hosts with the same retry rules send large amounts of mail to a host
that was previously down, it may happen that they all experience problems with
the restricted number of connections offered by the previously down host.
Retrying later gets the same behaviour.  It's not as extreme as it sounds,
but to a certain extent, I do see a wave shape.

How about randomising a part of geometric retry times?

Michael

Philip Hazel | 2 Sep 2005 10:26

Re: Randomising retry times?

On Thu, 1 Sep 2005, Michael Haardt wrote:

> Hello,
> 
> if a few hosts with the same retry rules send large amounts of mail to a host
> that was previously down, it may happen that they all experience problems with
> the restricted number of connections offered by the previously down host.
> Retrying later gets the same behaviour.  It's not as extreme as it sounds,
> but to a certain extent, I do see a wave shape.
> 
> How about randomising a part of geometric retry times?

That sounds extreme and complicated, difficult to explain, and liable to
errors, for what is an unusual situation.

Incidentally, I have often argued that the use of backup MX (currently 
going out of favour) is one way to avoid this problem. The pending mail 
collects on the backup and can be transferred in an orderly fashion when 
the primary comes up. Of course, you then have the problem of keeping 
the acceptance rules identical on the backup and the primary, so I can 
see why people don't like backup MX any more. There are problems both 
ways.

-- 
Philip Hazel            University of Cambridge Computing Service,
ph10 <at> cus.cam.ac.uk      Cambridge, England. Phone: +44 1223 334714.
Get the Exim 4 book:    http://www.uit.co.uk/exim-book


Michael Haardt | 2 Sep 2005 10:49

Re: Randomising retry times?

> That sounds extreme and complicated, difficult to explain, and liable to
> errors, for what is an unusual situation.
>
> Incidentally, I have often argued that the use of backup MX (currently 
> going out of favour) is one way to avoid this problem. The pending mail 
> collects on the backup and can be transferred in an orderly fashion when 
> the primary comes up. Of course, you then have the problem of keeping 
> the acceptance rules identical on the backup and the primary, so I can 
> see why people don't like backup MX any more. There are problems both 
> ways.

My problem is not with mail from outside, but inside the cluster.

Think of it as ethernet: It uses exponential backoff, but takes a random
number within the backoff interval.  That way it gets rid of the situation
that two stations always collide.

And that's pretty much my problem, so I thought about solving it just
the same.  I can see a use for fixed retry intervals, but does anybody
really care about a geometric interval being met exactly? I think people
use it for the exponential backoff character, not caring about the exact
interval as long as it keeps growing.  Using it as the interval for a random
value keeps that character, and avoids retry collisions.
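
To make the proposal concrete, here is a minimal sketch in Python. It is not
Exim code: the function names and the parameters (first interval, growth
factor, cap) are invented for illustration, and it only models the
Ethernet-style idea of drawing a random delay from within the computed
geometric interval.

```python
import random


def geometric_interval(attempt, first=900.0, factor=1.5, cap=57600.0):
    """Deterministic geometric interval in seconds for a 0-based retry attempt.

    All three parameters are illustrative values, not Exim defaults.
    """
    return min(first * factor ** attempt, cap)


def randomised_interval(attempt, rng=random, **kwargs):
    """Pick a random delay inside the geometric interval for this attempt.

    The upper bound still grows geometrically, so the backoff character is
    kept, but two hosts with identical rules rarely pick the same delay.
    """
    upper = geometric_interval(attempt, **kwargs)
    return rng.uniform(0.0, upper)


if __name__ == "__main__":
    rng = random.Random()
    for attempt in range(5):
        print(attempt, geometric_interval(attempt), randomised_interval(attempt, rng=rng))
```

A lower bound above zero (say, half the interval) would avoid near-immediate
retries; that is a design choice this sketch leaves open.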

Michael

Daniel Tiefnig | 2 Sep 2005 11:23

Re: Preliminary testing of a new Exim test suite

Philip Hazel wrote:
> Feedback is invited...

So, here we go...

First, I'm not sure whether running via sudo is the best option. (One
has to configure sudo, add its user to the exim-group, and relogin to be
able to run the tests...) Maybe ppl. should be able to run tests as
root? You could set up some suid binaries and drop to exim-user-id after
that. Just a thought, I didn't check the code to see whether this is
possible...

First problem came with first test:
[..]
| Exim user is Debian-exim
| Exim group is Debian-exim
| Program caller is tiefnig ()

You see the empty space enclosed in brackets? That will cause problems
later on, see below.

[...]
| Basic/0001 Basic configuration setting
| ** Subtest 1 (starting at line 8)
| ** Return code 1 (expected 0)
[...]
| Exim configuration error in line 207 of \
|   /home/tiefnig/exim/exim-testsuite-0.00/confs/0001:
|  user exim was not found


Daniel Tiefnig | 2 Sep 2005 11:41

Re: Randomising retry times?

Michael Haardt wrote:
> Philip Hazel wrote:
>> That sounds extreme and complicated, difficult to explain, and
>> liable to errors, for what is an unusual situation.

I'd agree with that. Obfuscating retry times any further wouldn't help
much, I guess, ...

> My problem is not with mail from outside, but inside the cluster.

... but that's a good point. I see this problem on our cluster very
clearly. We're using a 2-stage system, with "smart" relay hosts and
mailbox servers. If one mailbox server gets overloaded (or goes offline)
and refuses SMTP connections, mail gets queued on the relay hosts. When the
mailbox server is available again, the (8) relay hosts start delivering
mail more or less at the same time, hitting the mailbox server quite hard.

I'm not sure whether randomized retry times would solve the problem,
though. Some form of rate limiting (as already implemented in Exim) may
be the better (more deterministic) solution, but may also influence
regular mail traffic. In any case this needs some hands-on experience
with the system concerned.
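
Short of trying it on a live system, a toy simulation can give a first
impression of the effect. The Python sketch below is purely illustrative and
makes strong assumptions: 8 relay hosts, a single shared one-hour retry
interval, and one-minute buckets for counting load. None of it reflects how
Exim actually schedules deliveries.

```python
import random
from collections import Counter


def retry_times(hosts, interval, rng=None):
    """Return one retry time (in seconds) per host.

    Without rng, every host retries at exactly the same instant; with rng,
    each host draws a random time within the interval.
    """
    if rng is None:
        return [interval] * hosts
    return [rng.uniform(0.0, interval) for _ in range(hosts)]


def peak_load(times, window=60.0):
    """Largest number of retries that fall into any one window-sized bucket."""
    buckets = Counter(int(t // window) for t in times)
    return max(buckets.values())


fixed = retry_times(8, 3600.0)
jittered = retry_times(8, 3600.0, rng=random.Random(2005))
print("peak without jitter:", peak_load(fixed))
print("peak with jitter:   ", peak_load(jittered))
```

With identical retry times all 8 hosts land in the same bucket; with jitter
the peak is usually much lower, though the sketch says nothing about how long
each host then keeps the mailbox server busy.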

JM2C,
daniel

Philip Hazel | 2 Sep 2005 12:08

Re: Preliminary testing of a new Exim test suite

On Fri, 2 Sep 2005, Daniel Tiefnig wrote:

> Philip Hazel wrote:
> > Feedback is invited...
> 
> So, here we go...

Thank you! Those are just the kind of problems I was expecting.

> First, I'm not sure whether running via sudo is the best option. (One
> has to configure sudo, add its user to the exim-group, and relogin to be
> able to run the tests...) Maybe ppl. should be able to run tests as
> root? You could set up some suid binaries and drop to exim-user-id after
> that. Just a thought, I didn't check the code to see whether this is
> possible...

The thing is, the test script wants to run as "you" most of the time, so 
it can call Exim "as an ordinary user". Not even as the exim user. It
just needs root now and again to set things up and to do things like 
remove the hints files. For that reason, sudo seemed the best solution 
to me, but maybe that's because I am used to using it all the time.

The other problems look, at first glance, like things I should be able 
to solve. 

> Basic/0005 fails:
> | ** Subtest 4 (starting at line 36)
> | ** Return code 1 (expected 0)
> 
> Stderr prints two loglines, with message:

Tony Finch | 2 Sep 2005 12:25

Re: Randomising retry times?

On Fri, 2 Sep 2005, Michael Haardt wrote:
>
> And that's pretty much my problem, so I thought about solving it just
> the same.  I can see a use for fixed retry intervals, but does anybody
> really care about a geometric interval being met exactly?

Exim doesn't promise to do so; the retry time is just a "do not try
before" time. The actual delivery attempt will not occur until a queue
runner turns up. However the hints database can compensate for this
randomization.

If your problem is within a cluster you control, then it sounds to me like
a configuration error if your front end can hammer your back end to death.
Unfortunately Exim doesn't make it easy to control its rate of outgoing
delivery attempts, so if your front end is also delivering elsewhere (to
the Internet) then restricting the delivery rate to your back end will
hurt other email. Perhaps you should look at tuning the back end so that
it is hurt less by floods of email (queue_only_load, smtp_accept_queue,
smtp_accept_queue_per_connection). Alternatively, you might try separating
the back-end delivery queue from your main queue so that it can be managed
by an Exim with less aggressive parallelism settings.

Tony.
-- 
<fanf <at> exim.org>   <dot <at> dotat.at>   http://dotat.at/   ${sg{\N${sg{\
N\}{([^N]*)(.)(.)(.*)}{\$1\$3\$2\$1\$3\n\$2\$3\$4\$3\n\$3\$2\$4}}\
\N}{([^N]*)(.)(.)(.*)}{\$1\$3\$2\$1\$3\n\$2\$3\$4\$3\n\$3\$2\$4}}


Michael Haardt | 2 Sep 2005 12:38

Re: Randomising retry times?

> Exim doesn't promise to do so; the retry time is just a "do not try
> before" time. The actual delivery attempt will not occur until a queue
> runner turns up. However the hints database can compensate for this
> randomization.

Right.  All I want is to randomise the earliest possible retry time.

> If your problem is within a cluster you control, then it sounds to me like
> a configuration error if your front end can hammer your back end to death.

I am talking about failure recovery, not regular operation.

> Unfortunately Exim doesn't make it easy to control its rate of outgoing
> delivery attempts, so if your front end is also delivering elsewhere (to
> the Internet) then restricting the delivery rate to your back end will
> hurt other email. Perhaps you should look at tuning the back end so that
> it is hurt less by floods of email (queue_only_load, smtp_accept_queue,
> smtp_accept_queue_per_connection).

Queueing under load is bad advice, as it makes things worse.  Mail servers
are usually limited by I/O.  Immediate delivery delivers from page cache.
Queued delivery has to read the mail back into core, thus causing
additional I/O.  It is fine for short peaks, but if in trouble, then try
to avoid queuing and better limit the number of connections and let others
retry.

It's just too bad if they all wait, and then retry all at once.  That's
what I am trying to avoid.

> Alternatively, you might try separating

Tony Finch | 2 Sep 2005 12:50

Re: Randomising retry times?

On Fri, 2 Sep 2005, Michael Haardt wrote:
>
> Queueing under load is bad advice, as it makes things worse.  Mail servers
> are usually limited by I/O.  Immediate delivery delivers from page cache.

However delivery requires many more fsyncs than just queueing.

> It is fine for short peaks, but if in trouble, then try to avoid queuing
> and better limit the number of connections and let others retry.

I agree that managing the offered load is more effective.

> Put it differently: What would break if geometric retry intervals would
> use the computed earliest retry time as interval for a random number?

Probably nothing :-)

Tony.
-- 
<fanf <at> exim.org>   <dot <at> dotat.at>   http://dotat.at/   ${sg{\N${sg{\
N\}{([^N]*)(.)(.)(.*)}{\$1\$3\$2\$1\$3\n\$2\$3\$4\$3\n\$3\$2\$4}}\
\N}{([^N]*)(.)(.)(.*)}{\$1\$3\$2\$1\$3\n\$2\$3\$4\$3\n\$3\$2\$4}}
