Robert Ransom | 1 Jan 2012 22:41
Picon

Re: Changeset 0306c5a64775

On 2011-12-31, Roderic Morris <roderyc <at> gmail.com> wrote:
> Right, I noted that and had a similar example when I submitted the patch.
>
> The external-events package doesn't work as advertised, but I couldn't
> figure out how to fix it. It's really the only mechanism that can be
> used to do both library calls correctly though.

See attached for a patch that fixes one clear bug in the
external-events package.  That doesn't fix the problem with
wait-for-child-process.

In order to fix the general problem of incorrect deadlock detection, I
think we need:

1. a way to mark thread queues as ‘blocking on an external event’ for
deadlock-detection purposes
2. a way to mark condvars and placeholders as ‘blocking on an external
event’ (getaddrinfo and wait-for-child-process should be using
placeholders; the posix-signals package should be using condvars)
3. support for registering long-term handlers to handle every
occurrence of an external event UID
4. an as-general-as-possible ‘external asynchronous result’ system to
handle placeholders which should be filled in from C

For piece 1 (and 2), we could try iterating through a list of weak
references to thread queues to determine whether any threads are
blocked on one of the thread queues, but we might need to optimize
that data structure if many of those thread queues turn out to not
have any threads on them.

(Continue reading)

Michael Sperber | 5 Jan 2012 18:15
Picon

Re: Changeset 0306c5a64775


Robert Ransom <rransom.8774 <at> gmail.com> writes:

> On 2011-12-31, Roderic Morris <roderyc <at> gmail.com> wrote:
>> If I'm looking at the right revision, it's to allow the use of wait()
>> without triggering deadlock detection
>
> It might do that, but it breaks get-address-info from the
> net-addresses package (which (ab)uses external-event UIDs in the same
> way).

Not quite: In `net-addresses', the external event is indeed triggered
from the external code, which is not the case with the waitpid stuff,
which just trampolines into the external events from Scheme.  So I
believe the solution is to do with the wait code as with the getaddrinfo
case, namely to wait in a separate thread (or revert to the old
implementation if threads are not available.)  I'll look into doing this
over the weekend.

Meanwhile, thanks for the analysis and the patches (which look good) -
I've pushed those.

--

-- 
Cheers =8-} Mike
Friede, Völkerverständigung und überhaupt blabla

Robert Ransom | 6 Jan 2012 07:20
Picon

Re: Changeset 0306c5a64775

On 2012-01-05, Michael Sperber <sperber <at> deinprogramm.de> wrote:
>
> Robert Ransom <rransom.8774 <at> gmail.com> writes:
>
>> On 2011-12-31, Roderic Morris <roderyc <at> gmail.com> wrote:
>>> If I'm looking at the right revision, it's to allow the use of wait()
>>> without triggering deadlock detection
>>
>> It might do that, but it breaks get-address-info from the
>> net-addresses package (which (ab)uses external-event UIDs in the same
>> way).
>
> Not quite: In `net-addresses', the external event is indeed triggered
> from the external code, which is not the case with the waitpid stuff,
> which just trampolines into the external events from Scheme.  So I
> believe the solution is to do with the wait code as with the getaddrinfo
> case, namely to wait in a separate thread (or revert to the old
> implementation if threads are not available.)  I'll look into doing this
> over the weekend.

When I wrote that, I suspected that part of the problem with
wait-for-child-process was that it dynamically allocated a new
external-event UID for each event it wants to wait for, rather than
using one external-event UID for a whole class of events to be waited
for.  The PreScheme code to handle external events seems to me to be
designed for the latter case.  getaddrinfo also uses a new UID for
each event, rather than one UID for the whole class of
getaddrinfo-completion events; I still suspect that that is why
multiple simultaneous calls to wait-for-child-process broke a later
call to getaddrinfo.
(Continue reading)

Robert Ransom | 6 Jan 2012 07:36
Picon

Re: Changeset 0306c5a64775

On 2012-01-06, Robert Ransom <rransom.8774 <at> gmail.com> wrote:

> See attached for a bundle which fixes wait-for-child-process.

Oops.  You want changeset fdd24fd71a02 (and its ancestors).  See
attached for a bundle containing only the good commits.

Robert Ransom
erana Owl | 6 Jan 2012 15:15
Picon

For MIT, 27-trees

Goodday schemers,

I've made an extension to octrees, called a 27-tree.
Paper is here :
http://soft.vub.ac.be/~jceuppen/papers/paper27tree.ps

Python code with ultima 8 game is here:
http://soft.vub.ac.be/~jceuppen/python/koboldsquest-2.1.tar.gz

Finally a scheme implementation which can be used with my former framebuffer code, to make a 3D engine, is here:
http://soft.vub.ac.be/~jceuppen/scheme/
(version 0.3 has been tested in guile and s48, fb extension is also in that directory)

Enjoy,
Johan

Michael Sperber | 6 Jan 2012 15:53
Picon

Re: Changeset 0306c5a64775


Thanks for looking into this!

Robert Ransom <rransom.8774 <at> gmail.com> writes:

> See attached for a bundle which fixes wait-for-child-process.  I'm no
> longer convinced that changeset 0306c5a64775 was wrong, but I think my
> bundle uses a cleaner approach, and it works with the current
> external-event and interrupt systems.

I'm somewhat suspicious of the changes in the bundle, as it's not clear
to me why deadlock should get erroneously signalled before the changes:
After all, there's is special provision to *not* signal deadlock in the
root scheduler - it calls `waiting-for-external-events?', and if that
returns #t, no deadlock is assumed.  (And I still think the right fix is
to handle wait the same as getaddrinfo.)

You've looked at the code in depth - did you consider this?

> When I wrote that, I suspected that part of the problem with
> wait-for-child-process was that it dynamically allocated a new
> external-event UID for each event it wants to wait for, rather than
> using one external-event UID for a whole class of events to be waited
> for.  The PreScheme code to handle external events seems to me to be
> designed for the latter case. [...]

> getaddrinfo also uses a new UID for each event, rather than one UID
> for the whole class of getaddrinfo-completion events; I still suspect
> that that is why multiple simultaneous calls to wait-for-child-process
> broke a later call to getaddrinfo.

It was designed for both cases, but in fact the getaddrinfo code was the
original motivation for implementing it.  It's needed there because
there may be multiple simultaneous active calls to getaddrinfo from
different threads, and they all need to be notified separately of
completion.

So, if that doesn't work correctly, there's a bug.

--

-- 
Cheers =8-} Mike
Friede, Völkerverständigung und überhaupt blabla

erana Owl | 6 Jan 2012 16:58
Picon

Re: For MIT, 27-trees

New scheme48-fb is available, version 0.1.9, See Changelog for details.
http://soft.vub.ac.be/~jceuppen/scheme/

3D engine code coming up later on.

Enjoy,
Johan

2012/1/6 erana Owl <0wl256 <at> gmail.com>
Goodday schemers,

I've made an extension to octrees, called a 27-tree.
Paper is here :
http://soft.vub.ac.be/~jceuppen/papers/paper27tree.ps

Python code with ultima 8 game is here:
http://soft.vub.ac.be/~jceuppen/python/koboldsquest-2.1.tar.gz

Finally a scheme implementation which can be used with my former framebuffer code, to make a 3D engine, is here:
http://soft.vub.ac.be/~jceuppen/scheme/
(version 0.3 has been tested in guile and s48, fb extension is also in that directory)

Enjoy,
Johan


Robert Ransom | 7 Jan 2012 20:05
Picon

Re: Changeset 0306c5a64775

On 2012-01-06, Michael Sperber <sperber <at> deinprogramm.de> wrote:
>
> Thanks for looking into this!
>
> Robert Ransom <rransom.8774 <at> gmail.com> writes:
>
>> See attached for a bundle which fixes wait-for-child-process.  I'm no
>> longer convinced that changeset 0306c5a64775 was wrong, but I think my
>> bundle uses a cleaner approach, and it works with the current
>> external-event and interrupt systems.
>
> I'm somewhat suspicious of the changes in the bundle, as it's not clear
> to me why deadlock should get erroneously signalled before the changes:
> After all, there's is special provision to *not* signal deadlock in the
> root scheduler - it calls `waiting-for-external-events?', and if that
> returns #t, no deadlock is assumed.  (And I still think the right fix is
> to handle wait the same as getaddrinfo.)
>
> You've looked at the code in depth - did you consider this?

Before Roderic's patch, waiting-for-external-events? didn't know that
threads which were waiting on the process-id's placeholder were
waiting for an external event, just as it still doesn't know that
threads which are waiting on a signal queue are waiting for an
external event.  My branch provides a way to inform
waiting-for-external-events? that threads blocked on a given
synchronization object will be alerted when an external event of some
sort occurs; this should be used for signal queues, too.

Regarding getaddrinfo, I think the Right Thing is to make the Scheme
interface to getaddrinfo fill in a placeholder, rather than using the
external event system directly as it does now.  Ideally, the external
asynchronous result code would be written in such a way that both
getaddrinfo and wait-for-child-process could use it.

>> When I wrote that, I suspected that part of the problem with
>> wait-for-child-process was that it dynamically allocated a new
>> external-event UID for each event it wants to wait for, rather than
>> using one external-event UID for a whole class of events to be waited
>> for.  The PreScheme code to handle external events seems to me to be
>> designed for the latter case. [...]
>
>> getaddrinfo also uses a new UID for each event, rather than one UID
>> for the whole class of getaddrinfo-completion events; I still suspect
>> that that is why multiple simultaneous calls to wait-for-child-process
>> broke a later call to getaddrinfo.
>
> It was designed for both cases, but in fact the getaddrinfo code was the
> original motivation for implementing it.  It's needed there because
> there may be multiple simultaneous active calls to getaddrinfo from
> different threads, and they all need to be notified separately of
> completion.
>
> So, if that doesn't work correctly, there's a bug.

There probably is a bug.  It looks like the same bug in how interrupts
are handled (when multiple interrupts of the same type arrive at about
the same time, some seem to get 'lost' or delayed) that causes my test
case for the POSIX signals package to fail.  (That test case has been
disabled since I wrote it because I couldn't figure out why it wasn't
working or how to fix it.)

Someone may have to put 'bread crumbs' into the VM and RTS in order to
figure out exactly what is going wrong.

Robert Ransom

Michael Sperber | 7 Jan 2012 20:15
Picon

Re: Changeset 0306c5a64775


Robert Ransom <rransom.8774 <at> gmail.com> writes:

> Before Roderic's patch, waiting-for-external-events? didn't know that
> threads which were waiting on the process-id's placeholder were
> waiting for an external event, just as it still doesn't know that
> threads which are waiting on a signal queue are waiting for an
> external event.

Why would it have to know that?  And, even it were necessary to know
that, it would be easier to look at the condvar's waiters instead of
marking the queues, no?

I still feel I'm missing something important in your argument ...

> Regarding getaddrinfo, I think the Right Thing is to make the Scheme
> interface to getaddrinfo fill in a placeholder, rather than using the
> external event system directly as it does now.  

OK - it still seems to me the hard part is communicating the information
from C to Scheme somehow.

But at this point I really just want to fix the deadlock issue and the
potential bug before the release (which I really, really want to happen
soon).  After that, we can make bigger changes.

--

-- 
Cheers =8-} Mike
Friede, Völkerverständigung und überhaupt blabla

Robert Ransom | 7 Jan 2012 20:41
Picon

Re: Changeset 0306c5a64775

On 2012-01-07, Michael Sperber <sperber <at> deinprogramm.de> wrote:
>
> Robert Ransom <rransom.8774 <at> gmail.com> writes:
>
>> Before Roderic's patch, waiting-for-external-events? didn't know that
>> threads which were waiting on the process-id's placeholder were
>> waiting for an external event, just as it still doesn't know that
>> threads which are waiting on a signal queue are waiting for an
>> external event.
>
> Why would it have to know that?

In the process-id case, threads waiting on a process ID's placeholder
are awakened by process-terminated-children, which is called by the
os-signal interrupt handler.  Nothing waits on an external-event UID.

In the signal-queue case, threads waiting on a signal queue's
ready-for-read condvar are awakened by the os-signal interrupt handler
pushing a signal object into the signal queue.  Nothing waits on an
external-event UID.

If the deadlock-detection code isn't told that those synchronization
objects will be alerted by (an interrupt handler triggered by) an
external event of some sort, then it will falsely report that a
deadlock has occurred.

> And, even it were necessary to know
> that, it would be easier to look at the condvar's waiters instead of
> marking the queues, no?

No, because we also need to look at placeholders' waiters, and the
simplest way to handle both condvars and placeholders (as well as
other synchronization ‘primitives’ which may be added in the future)
is to look at their underlying thread queues.

> I still feel I'm missing something important in your argument ...
>
>> Regarding getaddrinfo, I think the Right Thing is to make the Scheme
>> interface to getaddrinfo fill in a placeholder, rather than using the
>> external event system directly as it does now.
>
> OK - it still seems to me the hard part is communicating the information
> from C to Scheme somehow.
>
> But at this point I really just want to fix the deadlock issue and the
> potential bug before the release (which I really, really want to happen
> soon).  After that, we can make bigger changes.

The interrupt system has been slightly broken for as long as I nave
used Scheme 48, and I haven't figured out why.  I think not using
interrupts in a way that will definitely trigger that brokenness (i.e.
not flooding it with multiple interrupts of the same type at once) is
sufficient for the release.

Robert Ransom


Gmane