Charlie Brady | 16 Aug 2010 22:32

runsv failing when starting up logger - missing pipe - failure of logpipe init?


I'm running runit 1.7.2, and have a system showing this anomalous behaviour 
(one service is failing to start).

Here is the bad service:

[root@mbgvm1 ~]# sv status /service/tug-metrics/
down: /service/tug-metrics/: 1316s, normally up; down: log: 0s, normally 
up, want up
[root@mbgvm1 ~]#

strace of the runsv process shows that the child is dying immediately 
because fd 5 is not valid:

2005  fork()                            = 8121
2005  gettimeofday({1281988874, 32865}, NULL) = 0
2005  open("supervise/pid.new", O_WRONLY|O_CREAT|O_TRUNC|O_NONBLOCK, 0644) 
= 5
2005  write(5, "8121\n", 5)             = 5
2005  close(5)                          = 0
2005  rename("supervise/pid.new", "log/supervise/pid") = 0
2005  open("supervise/stat.new", O_WRONLY|O_CREAT|O_TRUNC|O_NONBLOCK, 
0644) = 5
2005  write(5, "run\n", 4)              = 4
2005  close(5)                          = 0
2005  rename("supervise/stat.new", "log/supervise/stat") = 0
2005  open("supervise/status.new", O_WRONLY|O_CREAT|O_TRUNC|O_NONBLOCK, 
0644) = 5
2005  write(5, "@\0\0\0Li\231\24\1\365|\334\271\37\0\0\0u\0\1", 20) = 20
2005  close(5)                          = 0

Charlie Brady | 17 Aug 2010 01:30

Re: runsv failing when starting up logger - missing pipe - failure of logpipe init?


On Mon, 16 Aug 2010, Charlie Brady wrote:

> Here's the code in runsv where the child is dying:
> 
> ...
>   if (p == 0) {
>     /* child */
>     if (haslog) {
>       if (s->islog) {
>         if (fd_copy(0, logpipe[0]) == -1)
>           fatal("unable to setup filedescriptor for ./log/run");
> ...
> 
> and here is where logpipe[0] should have been initialised:
> 
> ...
>       if (pipe(logpipe) == -1)
>         fatal("unable to create log pipe");
> ...
> 
> Has anyone else seen this error condition or can posit a situation where 
> it might be seen?

The next question to ponder is where the bug lies. The runsv process here 
has no fd 5 and fd 6 - IOW, logpipe[0] is 5, but isn't a valid fd. Are 
there circumstances where a pipe can just cease to be? Should runsv have 
detected this issue (where pipe() did not return -1, but the fds returned 
were not valid)?
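
One way to check from inside a process whether a descriptor such as
logpipe[0] is still open is to probe it with fcntl(fd, F_GETFD), which
fails with EBADF on a closed descriptor. The following is only a minimal
sketch: fd_is_open is a hypothetical helper name, not something that
exists in runit.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper (not part of runit): returns 1 if fd is an open
   descriptor in this process, 0 if it is closed or otherwise invalid. */
static int fd_is_open(int fd)
{
    /* F_GETFD only reads the descriptor flags; the call fails with
       EBADF when fd does not refer to an open descriptor. */
    if (fcntl(fd, F_GETFD) != -1)
        return 1;
    return errno != EBADF;
}
```

Probing the read end of a fresh pipe should report it as open, and as
closed once close() has been called on it. A probe like this can only say
that the descriptor is gone, not which code path closed it.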


Laurent Bercot | 17 Aug 2010 09:57

Re: runsv failing when starting up logger - missing pipe - failure of logpipe init?

>> Has anyone else seen this error condition or can posit a situation where 
>> it might be seen?
> 
> The next question to ponder is where the bug lies. The runsv process here 
> has no fd 5 and fd 6 - IOW, logpipe[0] is 5, but isn't a valid fd. Are 
> there circumstances where a pipe can just cease to be? Should runsv have 
> detected this issue (where pipe() did not return -1, but the fds returned 
> were not valid)?
> 
> Is this a linux kernel bug?

 Before accusing the Linux kernel, let's check the runsv code and see
whether there's a possible execution path that leads to the situation
you're describing...

 The pipe creation part looks correct.
 The part where the error occurs looks correct.
 Okay, so is there a place where the pipe might be closed? Sure enough,
there is: right at the end, if svd[0].want == W_EXIT, svd[0].state == DOWN,
svd[1].pid != 0 and svd[1].want != W_EXIT, then logpipe[1] and logpipe[0]
both get closed. And this is the only place where it can happen.
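
 The symptom itself is easy to reproduce outside runsv. The sketch below
creates a pipe, closes both ends, and then tries to duplicate the stale
read end onto another descriptor. It uses plain dup2() as a stand-in for
fd_copy() (an assumption made for illustration; runit's fd_copy() is not
shown here), and dup_closed_pipe_errno is a hypothetical name.

```c
#include <errno.h>
#include <unistd.h>

/* Illustration only: returns the errno produced by duplicating the read
   end of a pipe onto target_fd after both ends have been closed.
   Returns 0 if dup2() unexpectedly succeeds, -1 if pipe() fails. */
static int dup_closed_pipe_errno(int target_fd)
{
    int logpipe[2];

    if (pipe(logpipe) == -1)
        return -1;

    /* Both ends are closed, but the stale numbers stay in the array. */
    close(logpipe[1]);
    close(logpipe[0]);

    /* The stale number is no longer a valid descriptor, so dup2() fails
       and leaves target_fd untouched. */
    if (dup2(logpipe[0], target_fd) == -1)
        return errno;
    return 0;
}
```

 On a POSIX system the call is expected to fail with EBADF, which matches
a child that dies as soon as it tries to set up its stdin from the pipe.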

 My bet is that at some point, your runsv ran through that code, but
somehow managed to live and the services didn't die, i.e. another control
message was sent and processed before the exit condition was reached, and
runsv is still trying to supervise things - but runs into trouble with the
closed logpipe. I have no time to investigate further right now, but earlier
in your strace, you should see stuff such as the control messages arriving,
the logpipe getting closed, etc.
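 If my bet is correct, then the bug is that there's a case where runsv can
close the logpipe and still keep going, whereas it should exit as soon as
the logger dies no matter what (or just exit on the spot and let the logger
die on its own).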

Charlie Brady | 17 Aug 2010 13:56

Re: runsv failing when starting up logger - missing pipe - failure of logpipe init?


On Tue, 17 Aug 2010, Laurent Bercot wrote:

>  The pipe creation part looks correct.
>  The part where the error occurs looks correct.
>  Okay, so is there a place where the pipe might be closed? Sure enough,
> there is: right at the end, if svd[0].want == W_EXIT, svd[0].state == DOWN,
> svd[1].pid != 0 and svd[1].want != W_EXIT, then logpipe[1] and logpipe[0]
> both get closed. And this is the only place where it can happen.
> 
>  My bet is that at some point, your runsv ran through that code, but
> somehow managed to live and the services didn't die, i.e. another control
> message was sent and processed before the exit condition was reached, and
> runsv is still trying to supervise things - but runs into trouble with the
> closed logpipe. I have no time to investigate further right now, but earlier
> in your strace, you should see stuff such as the control messages arriving,
> the logpipe getting closed, etc.
>  If my bet is correct, then the bug is that there's a case where runsv can
> close the logpipe and still keep going, whereas it should exit as soon as
> the logger dies no matter what (or just exit on the spot and let the logger
> die on its own).

Thanks Laurent, for your pointer. Unfortunately the strace won't help, 
since it wasn't started until long after the runsv process was already 
malfunctioning.

Presumably Gerrit will have a good think about possible execution paths. 
I'll look further too.

---

Jean-Michel Bruenn | 17 Aug 2010 19:08

hello - hanging services

Hey, 

I'm curious what happens with hung (hanging?) services (or zombies): is it
possible with runit to detect those and restart the service?

Cheers

Jean-Michel Bruenn | 17 Aug 2010 19:24

Re: hello - hanging services

Hello,

Thanks for your answer. Wouldn't it be a good improvement for runit
if it also took care of hanging tasks? There's "run", "finish"
and the "log" stuff - wouldn't it be possible to add a "check" script,
which runs a command every X seconds, and if it gets a response it
knows "ah okay, the service is still running" and if it gets no
response "oh, the service seems to have died, let's restart it"?
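
The "run a command and see whether it answers in time" part could be
sketched roughly as below. This is only an illustration of the idea, not
an existing runit feature: run_check, the check_result values and the
100 ms polling interval are all made up for the example, and a check like
this can only notice what the command itself is able to observe.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical result codes for one check run. */
enum check_result { CHECK_OK, CHECK_FAILED, CHECK_TIMEOUT, CHECK_ERROR };

/* Runs argv[0] with arguments argv and waits at most timeout_sec
   seconds for it. Exit status 0 counts as "service answered". */
static enum check_result run_check(char *const argv[], int timeout_sec)
{
    const struct timespec tick = { 0, 100L * 1000 * 1000 }; /* 100 ms */
    int status;
    pid_t pid = fork();

    if (pid == -1)
        return CHECK_ERROR;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);             /* exec failed */
    }

    /* Poll for the child's exit until the deadline has passed. */
    for (int waited_ms = 0; ; waited_ms += 100) {
        pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid)
            return (WIFEXITED(status) && WEXITSTATUS(status) == 0)
                   ? CHECK_OK : CHECK_FAILED;
        if (r == -1)
            return CHECK_ERROR;
        if (waited_ms >= timeout_sec * 1000)
            break;
        nanosleep(&tick, NULL);
    }

    /* No answer in time: kill the check command and reap it. */
    kill(pid, SIGKILL);
    waitpid(pid, &status, 0);
    return CHECK_TIMEOUT;
}
```

A caller would invoke this every X seconds and restart the service on
CHECK_FAILED or CHECK_TIMEOUT; what the check command does to decide is
entirely up to the user.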

Of course, totally optional, up to the user whether to use that or not.

Difficult to implement?

Cheers

On Tue, 17 Aug 2010 13:13:55 -0400 (EDT) Charlie Brady
<charlieb-supervise <at> budge.apana.org.au> wrote:

> 
> On Tue, 17 Aug 2010, Jean-Michel Bruenn wrote:
> 
> > Hey, 
> > 
> > I'm curious what happens with hung (hanging?) services (or zombies): is it
> > possible with runit to detect those and restart the service?
> 
> hung/hanging services and zombies are different things. A zombie is a 
> process which doesn't exist - it's just a process remnant - a status 
> report which the kernel is hanging onto waiting for someone to ask for it. 
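
To make the zombie definition concrete, here is a small sketch (the
function name reap_child is invented for the example). The child exits at
once; from that moment until the parent calls waitpid(), all that is left
of it is the exit status held by the kernel, which is the zombie state.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks a child that exits immediately with exit_code, then collects
   its status. Returns the collected exit code, or -1 on error. */
static int reap_child(int exit_code)
{
    int status;
    pid_t pid = fork();

    if (pid == -1)
        return -1;
    if (pid == 0)
        _exit(exit_code);       /* child: terminate right away */

    /* Between the child's _exit() and this waitpid(), the child is a
       zombie: no running code, only a status waiting to be read. */
    if (waitpid(pid, &status, 0) != pid)
        return -1;

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Once waitpid() has returned, the zombie entry is gone. A hung service is
different: its process is still alive and simply not making progress.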

Charlie Brady | 17 Aug 2010 19:38

Re: hello - hanging services


On Tue, 17 Aug 2010, Jean-Michel Bruenn wrote:

> Hello,
> 
> thanks for your answer. Wouldn't it be a good improvement for runit,
> if it would take care of hanging tasks, also?

You gotta detect them first, which is a non-trivial problem 
(algorithmically impossible in general - 
http://en.wikipedia.org/wiki/Halting_problem).

> There's "run", "finish"
> and the "log" stuff - wouldn't it be possible to add "check" as script,

A check script already exists, but it does not do what you are suggesting:

http://manpages.ubuntu.com/manpages/jaunty/man8/sv.8.html

> which runs a command every X seconds, and if it gets a response it
> knows "ah okay, the service is still running" and if it gets no
> response "oh, the service seems to have died, let's restart it"?
> 
> Of course, totally optional, up to the user whether to use that or not.
> 
> Difficult to implement?

Yes.

Please check the archives - this has been discussed previously.

Charlie Brady | 17 Aug 2010 22:51

Re: runsv failing when starting up logger - missing pipe - failure of logpipe init?


On Tue, 17 Aug 2010, Laurent Bercot wrote:

>  Before accusing the Linux kernel, let's check the runsv code and see
> whether there's a possible execution path that leads to the situation
> you're describing...
> 
>  The pipe creation part looks correct.
>  The part where the error occurs looks correct.
>  Okay, so is there a place where the pipe might be closed? Sure enough,
> there is: right at the end, if svd[0].want == W_EXIT, svd[0].state == DOWN,
> svd[1].pid != 0 and svd[1].want != W_EXIT, then logpipe[1] and logpipe[0]
> both get closed. And this is the only place where it can happen.
> 
>  My bet is that at some point, your runsv ran through that code, but
> somehow managed to live and the services didn't die, i.e. another control
> message was sent and processed before the exit condition was reached, and
> runsv is still trying to supervise things - but runs into trouble with the
> closed logpipe. I have no time to investigate further right now, but earlier
> in your strace, you should see stuff such as the control messages arriving,
> the logpipe getting closed, etc.
>  If my bet is correct, then the bug is that there's a case where runsv can
> close the logpipe and still keep going, whereas it should exit as soon as
> the logger dies no matter what (or just exit on the spot and let the logger
> die on its own).

Some additional information. There have also been some runsv fatal startup 
errors, due to failure to flock supervise/lock. The most recent of those 
post-dates the most recent ./log/run failure.
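
For reference, this is the general shape of an exclusive, non-blocking
flock() and how it fails when the lock is already held through another
open of the same file. It is a sketch only: lock_exclusive_nb is a
hypothetical name, and the code is not taken from runsv.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Tries to take an exclusive lock on path without blocking.
   Returns the locked descriptor, or -1 with errno set; EWOULDBLOCK
   means some other open file description already holds the lock. */
static int lock_exclusive_nb(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_NONBLOCK, 0600);

    if (fd == -1)
        return -1;
    if (flock(fd, LOCK_EX | LOCK_NB) == -1) {
        int saved = errno;      /* keep flock's errno across close() */
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;                  /* lock is held until fd is closed */
}
```

A second attempt on the same file fails with EWOULDBLOCK for as long as
the first descriptor stays open, so one ordinary way to get a flock
failure at startup is that another process still holds the lock.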


Laurent Bercot | 18 Aug 2010 12:57

Re: hello - hanging services

>> which runs a command every X seconds, and if it gets a response it
>> knows "ah okay, the service is still running" and if it gets no
>> response "oh, the service seems to have died, let's restart it"?
>> 
>> Difficult to implement?
> 
> Yes.

 More precisely, it's not so much "difficult to implement" (I've done
it for a paying customer's project) as "impossible to do without specific
support in the service you're trying to manage".
 In other words, what Jean-Michel wants is a software watchdog; it can
be done, but it's pretty intrusive. It requires having a library, a daemon,
and making library calls in the managed process' source, sending messages
to the daemon by doing so. The daemon is configured with a certain policy
that decides "the service is running fine" or "the service has hung"
depending on the frequency of the messages it receives.
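
 The policy side of such a watchdog can be very small. The sketch below
shows only the "how long since the last message" decision; the struct and
function names are invented for the example, and the library and the
message transport between service and daemon are left out entirely.

```c
#include <time.h>

/* Hypothetical per-service watchdog state kept by the daemon. */
struct watchdog {
    time_t last_heartbeat;  /* when the service last reported in */
    time_t max_silence;     /* tolerated silence, in seconds */
};

/* Called by the daemon whenever a message from the service arrives. */
static void watchdog_heartbeat(struct watchdog *w, time_t now)
{
    w->last_heartbeat = now;
}

/* Policy: the service counts as hung once it has been silent for
   longer than max_silence seconds. */
static int watchdog_is_hung(const struct watchdog *w, time_t now)
{
    return now - w->last_heartbeat > w->max_silence;
}
```

 The managed service has to call into the library often enough to stay
under max_silence, which is exactly the intrusive part.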

 It's doable, and a watchdog library/daemon may even have its place in
a supervision suite (I'll think about it), but it certainly has nothing
to do with purely external process management tools such as runsvdir/runsv
or svscan/supervise. It's a whole piece of software on its own.

 I'm certain that a lot of open source software watchdogs already exist
out there. I'm also certain that none of them is as lightweight and easy
to use as I'd like, but that's another story.

-- 
 Laurent


Jean-Michel Bruenn | 18 Aug 2010 17:06

Re: hello - hanging services

> >> 
> >> Difficult to implement?
> > 
> > Yes.
> 
>  More precisely, it's not so much "difficult to implement" (I've done
> it for a paying customer's project) as "impossible to do without specific
> support in the service you're trying to manage".
>  In other words, what Jean-Michel wants is a software watchdog; it can
> be done, but it's pretty intrusive. It requires having a library, a daemon,
> and making library calls in the managed process' source, sending messages
> to the daemon by doing so. The daemon is configured with a certain policy
> that decides "the service is running fine" or "the service has hung"
> depending on the frequency of the messages it receives.
> 
>  It's doable, and a watchdog library/daemon may even have its place in
> a supervision suite (I'll think about it), but it certainly has nothing
> to do with purely external process management tools such as runsvdir/runsv
> or svscan/supervise. It's a whole piece of software on its own.
> 
>  I'm certain that a lot of open source software watchdogs already exist
> out there. I'm also certain that none of them is as lightweight and easy
> to use as I'd like, but that's another story.

In fact I was thinking about something simpler. I guess you guys
know Nagios? Similar to Nagios: just run a command and check for output
or a timeout. For example, for Apache you write a script called
"hangcheck" which gets run every X seconds by runit. This script
contains something like:


