Patrick Stein | 1 Jun 21:43 2008

Cannot start two SBCLs at the same time on a PPC Mac

I started playing with writing UFFI bindings for OpenMPI.  I think
I have a good start on this, but I can't truly test it at home
because I only have SBCL installed on one machine, and I am having
a terrible time getting even the simplest SBCL program to run twice
at the same time.

Here's my #p"run.sh" shell script:
=======================================
   #!/bin/sh

   exec sbcl \
       --noinform \
       --end-runtime-options \
       --no-sysinit \
       --no-userinit \
       --noprint \
       --disable-debugger \
       --eval '(sb-ext:quit)' \
       --end-toplevel-options
=======================================

If I do "./run.sh" then everything works just as I would expect.
And, I can do "./run.sh ; ./run.sh" and it runs twice.

But, if I try to do "./run.sh & ./run.sh" so that they both run at
the same time, I get one SEGV and one success.  If I put three in
the background and one in the foreground, I get three SEGVs and one
success.
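
The same experiment can be driven from C, which makes it easy to count exactly how many copies died from a signal. This is only a minimal sketch, assuming a POSIX system; `count_signaled` is a hypothetical helper name and is not part of SBCL.

```c
#define _DEFAULT_SOURCE /* expose POSIX declarations on glibc in strict -std= modes */
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: start `count` copies of argv[0] at the same time,
 * wait for all of them, and return how many were killed by a signal
 * (SIGSEGV, for example). Returns -1 if not every copy could be started. */
int count_signaled(char *const argv[], int count)
{
    assert(argv != NULL && argv[0] != NULL && count > 0);

    int launched = 0;
    for (int i = 0; i < count; i++) {
        pid_t pid = fork();
        if (pid < 0)
            break;                 /* fork failed; still reap what was started */
        if (pid == 0) {
            execvp(argv[0], argv); /* only returns on failure */
            _exit(127);
        }
        launched++;
    }

    int signaled = 0;
    for (int i = 0; i < launched; i++) {
        int status = 0;
        pid_t pid = wait(&status);
        if (pid < 0)
            break;
        if (WIFSIGNALED(status)) {
            fprintf(stderr, "pid %ld killed by signal %d\n",
                    (long)pid, WTERMSIG(status));
            signaled++;
        }
    }
    return launched == count ? signaled : -1;
}
```

Calling it with an argv of { "./run.sh", NULL } and a count of 4 corresponds to the four-way test described above.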

Thinking that maybe it tried to lock standard-in or standard-out
[...]

Kevin Reid | 1 Jun 21:58 2008

Re: Cannot start two SBCLs at the same time on a PPC Mac

On Jun 1, 2008, at 15:43, Patrick Stein wrote:
> If I do "./run.sh" then everything works just as I would expect.
> And, I can do "./run.sh ; ./run.sh" and it runs twice.
>
> But, if I try to do "./run.sh & ./run.sh" so that they both run at
> the same time, I get one SEGV and one success.  If I put three in
> the background and one in the foreground, I get three SEGVs and one
> success.

I noticed the same behavior a long time ago, and can reproduce it  
even now, as shown below. I am also on a PPC Mac (PowerBook G4), Mac  
OS X 10.4.11.

If anyone would like me to try building with experimental fixes or  
debugging traces, let me know (on sbcl-devel).

[Magpie:~ kpreid]$ for x in 1 2 3 4; do sbcl --no-sysinit --no-userinit --eval "(progn (print $x) (quit))" & done
[5] 4991
[6] 4992
[7] 4993
[8] 4994
[Magpie:~ kpreid]$ This is SBCL 1.0.16.11, an implementation of ANSI Common Lisp.
More information about SBCL is available at <http://www.sbcl.org/>.

SBCL is free software, provided as is, with absolutely no warranty.
It is mostly in the public domain; some portions are provided under
BSD-style licenses.  See the CREDITS and COPYING files in the
distribution for more information.

Nikodemus Siivola | 2 Jun 01:30 2008

Re: Cannot start two SBCLs at the same time on a PPC Mac

> [5]   Segmentation fault      sbcl --no-sysinit --no-userinit --eval
> "(progn (print $x) (quit))"
> [6]   Segmentation fault      sbcl --no-sysinit --no-userinit --eval
> "(progn (print $x) (quit))"
> [7]   Segmentation fault      sbcl --no-sysinit --no-userinit --eval
> "(progn (print $x) (quit))"
> [8]-  Done                    sbcl --no-sysinit --no-userinit --eval
> "(progn (print $x) (quit))"

This is very weird. Even with segfaults at unknown addresses you
should be getting debugger (or at least LDB) prompts and reports from
the shell about stopped processes. If SBCL dies from a segfault like
this, it seems to me that the handlers have not been set up yet, which
means nothing very much is happening yet.

It might be instructive to stick

 sleep(60);
 exit(42);

into main(), and see how far things get. I suspect the trouble happens
just before the call

  enable_lossage_handler()

...which in turn makes me think that maybe mmap'ing the core didn't go
quite right.

Cheers,


Patrick Stein | 2 Jun 04:36 2008

Re: Cannot start two SBCLs at the same time on a PPC Mac

On Sun, Jun 1, 2008 at 6:30 PM, Nikodemus Siivola
<nikodemus <at> random-state.net> wrote:
> This is very weird. Even with segfaults in unknown addresses you
> should be getting debugger (or at least LDB) prompts and reports from
> the shell about stopped processes. If SBCL dies from a segfault like
> this it seems to me that the handlers have not been set up yet, which
> means nothing very much is happening yet.

Yes, I was hoping to figure out what was going on in gdb(1), but I definitely
wasn't getting the core read in right there.  Or, at least the initial function
it chose was on a page that wasn't executable.

When I get a chance, I'll try sticking some stuff into main to figure out
exactly where.  I suppose if it set up the interrupt handlers to jump into
the sbcl.core somewhere and then managed to catch a signal before
the mmap was ready... I dunno....

I, too, suspect mmap(2), but there's a segment of code where it tries to
determine if you're masked from getting an interrupt while handling an
interrupt.  That seems like a potential place to run afoul of the kernel, too.

more later,
Patrick

ps. Sorry about the vacuous reply earlier, Nikodemus.  I have not yet
mastered iPhone+Gmail+MailingLists.


Nikodemus Siivola | 2 Jun 10:18 2008

Re: Cannot start two SBCLs at the same time on a PPC Mac

On Mon, Jun 2, 2008 at 5:36 AM, Patrick Stein <sbcl-help <at> nklein.com> wrote:

> Yes, I was hoping to figure out what was going on in gdb(1), but I definitely
> wasn't getting the core read in right there.  Or, at least the initial function
> it chose was on a page that wasn't executable.

Oh, I missed that bit of your email. GDB can't understand SBCL core
files. To run SBCL under GDB you normally just do "gdb sbcl".

In this case, something like this might be instructive:

cat > cmds
handle SIGUSR1 nostop noprint pass
trace main
trace enable_lossage_handler
trace arch_install_interrupt_handlers
trace create_initial_thread
run --eval '(progn (sleep 1) (quit))'
q
^D

gdb -x cmds sbcl & gdb -x cmds sbcl

> When I get a chance, I'll try sticking some stuff into main to figure out
> exactly where.  I suppose if it set up the interrupt handlers to jump into
> the sbcl.core somewhere and then managed to catch a signal before
> the mmap was ready... I dunno....
>
> I, too, suspect mmap(2), but there's a segment of code where it tries to
> determine if you're masked from getting an interrupt while handling an
[...]

Patrick Stein | 2 Jun 16:51 2008

Re: Cannot start two SBCLs at the same time on a PPC Mac

On Mon, Jun 2, 2008 at 3:18 AM, Nikodemus Siivola
<nikodemus <at> random-state.net> wrote:
> Oh, I missed that bit of your email. GDB can't understand SBCL core
> files. To run SBCL under GDB you normally just do "gdb sbcl".

Yes, there was some confusing language in my previous message
because I was at times trying to refer to the /usr/local/lib/sbcl/sbcl.core
and at other times trying to refer to the core file created by Darwin for
me after the SEGV.  I was hoping that GDB would be able to still grok
the Darwin generated core file.  And, it looks like it does.  It doesn't
understand the stack very well though.  But, I can still query global
C variables and such.

So, I did this.  I added these lines to the top of runtime.c:
const char* cur_file = __FILE__;
unsigned int cur_line = __LINE__;

and these at the top of some other files:
extern const char* cur_file;
extern unsigned int cur_line;

Then, at various points around runtime.c and thread.c, I did this:
{ cur_file = __FILE__; cur_line = __LINE__; }

When I had only done it for runtime.c, I made it all the way down to
the end of main(), to the create_initial_thread() call.  That doesn't mean
that everything went well before then... it still could be the mmap(2)
having problems.  But, it didn't die in the enable_lossage_handler() portion.
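
The breadcrumb technique described here can be wrapped in a macro so each checkpoint is a single statement. The following is only a sketch of the idea: the `MARK()` macro and the `*_stub` functions are hypothetical names for illustration, not code from the SBCL runtime.

```c
#include <assert.h>

/* Most recent checkpoint. After a crash, load the system core dump in gdb
 * and run "print cur_file" and "print cur_line". volatile keeps the
 * compiler from optimizing the stores away. */
const char *volatile cur_file = __FILE__;
volatile unsigned int cur_line = __LINE__;

/* Record the current source position as the latest checkpoint. */
#define MARK() do { cur_file = __FILE__; cur_line = __LINE__; } while (0)

/* Hypothetical stand-ins for startup phases; not real SBCL functions. */
static void load_core_stub(void)        { MARK(); }
static void install_handlers_stub(void) { MARK(); }

/* Runs the stub phases and returns the line of the last checkpoint hit. */
unsigned int run_startup_stub(void)
{
    MARK();
    load_core_stub();
    install_handlers_stub();
    MARK();
    assert(cur_file != 0);
    return cur_line;
}
```

Because the two variables are ordinary globals, they stay readable from a post-mortem core file even when the stack cannot be unwound.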

Now, I've run it again with the flagging in thread.c.  I've made it to the
[...]

Patrick Stein | 2 Jun 16:56 2008

Re: Cannot start two SBCLs at the same time on a PPC Mac

On Mon, Jun 2, 2008 at 9:51 AM, Patrick Stein <sbcl-help <at> nklein.com> wrote:
> On Mon, Jun 2, 2008 at 3:18 AM, Nikodemus Siivola
> <nikodemus <at> random-state.net> wrote:
>> My mmap suspicions are centered on getting the memory at the wrong
>> address (but having mmap lie about it), or getting MAP_SHARED even
>> when we ask for MAP_PRIVATE.

So, I like the MAP_SHARED theory... but I don't like the wrong address theory.
I can't form a useful hypothesis about why it would lie to all but one of the
procs in this scenario.  I would expect it to either lie to all of them or
just some percentage of them.  But, it would have to be lying to n-1 of them.
And, I can't think of how it could manage that.
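
One way to probe the MAP_SHARED theory outside of SBCL is a small standalone check that a write through one MAP_PRIVATE mapping stays invisible to a second mapping of the same file and to the file itself. This is a sketch under stated assumptions: it uses a scratch file under /tmp rather than sbcl.core, two mappings in one process rather than two processes, and a hypothetical function name.

```c
#define _DEFAULT_SOURCE /* expose POSIX declarations on glibc in strict -std= modes */
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if a write through one MAP_PRIVATE mapping is not visible
 * through a second mapping or in the file, 0 if it leaks, -1 on setup
 * failure. */
int private_mapping_is_private(void)
{
    char path[] = "/tmp/mapcheck-XXXXXX";   /* hypothetical scratch file */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                           /* file vanishes once fd closes */

    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || ftruncate(fd, page) != 0 || pwrite(fd, "A", 1, 0) != 1) {
        close(fd);
        return -1;
    }
    size_t len = (size_t)page;

    char *first  = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    char *second = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    int result = -1;
    if (first != MAP_FAILED && second != MAP_FAILED) {
        assert(first != second);            /* two distinct mappings */
        first[0] = 'B';                     /* should stay private to `first` */

        char on_disk = 0;
        if (pread(fd, &on_disk, 1, 0) == 1)
            result = (first[0] == 'B' && second[0] == 'A' && on_disk == 'A');
    }
    if (first != MAP_FAILED)
        munmap(first, len);
    if (second != MAP_FAILED)
        munmap(second, len);
    close(fd);
    return result;
}
```

A result of 0 on the affected machine would support the MAP_SHARED theory; a result of 1 would not rule it out, since this check does not race two processes the way the failing case does.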

ttyl,
Patrick

Patrick Stein | 2 Jun 18:53 2008

Re: Cannot start two SBCLs at the same time on a PPC Mac

So, I just recompiled with QSHOW defined.  Then, I did this:
===============================================================
for job in 1 2; do
    /tmp/sbcl/bin/sbcl --noinform --no-userinit --no-sysinit \
        --eval "(progn (print ${job}) (quit))" \
        1> /tmp/out.${job} 2>&1 &
done
===============================================================

I was really surprised by the output.  Job 1 crashed.  Job 2 generated
all kinds of complaints about possible heap WP violations and
succeeded.

Job 1 crashed with the program counter at 0x10aea7d8
          whilst the initial_function was 0x1000f9bd.

So, it's looking like both processes memory fault, one catches the
faults, the other doesn't.  Of course, none of the memory faults
in job 2 seem to have happened while the PC was at 10aea7d8.  Though,
I suppose that maybe job 1 is trying to handle the fault by jumping
into someplace bad.  I dunno.

more later,
Patrick

====diff -c out.1 out.2========================================
*** out.1	2008-06-02 11:35:51.000000000 -0500
--- out.2	2008-06-02 11:35:51.000000000 -0500
***************
*** 77,79 ****
[...]

Nikodemus Siivola | 2 Jun 20:09 2008

Re: Cannot start two SBCLs at the same time on a PPC Mac

On Mon, Jun 2, 2008 at 7:53 PM, Patrick Stein <sbcl-help <at> nklein.com> wrote:

> + Memory fault at: 0x100e749c, PC: 0x10061290
> + heap WP violation? fault_addr=100e749c, page_index=231

These seem business as usual: the heap is WP protected, so that we can
mark pages dirty when they are first written to.
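
For readers unfamiliar with the trick, here is a minimal sketch of write-protection used as a dirty-page tracker. It is not SBCL's actual code: every name in it is hypothetical, it assumes a POSIX system, and it calls mprotect() from a signal handler, which POSIX does not list as async-signal-safe even though the technique relies on it.

```c
#define _DEFAULT_SOURCE /* expose MAP_ANONYMOUS on glibc in strict -std= modes */
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define NPAGES 4

static char *heap;                           /* start of the tracked region */
static size_t page_size;
static volatile sig_atomic_t dirty[NPAGES];  /* one flag per page */

/* On a write fault inside the region: mark the page dirty and make it
 * writable, so the faulting store succeeds when it is retried. */
static void on_fault(int sig, siginfo_t *info, void *context)
{
    (void)context;
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t base = (uintptr_t)heap;

    if (heap == NULL || addr < base || addr >= base + NPAGES * page_size) {
        signal(sig, SIG_DFL);   /* not ours: the retried access will crash */
        return;
    }
    size_t index = (addr - base) / page_size;
    dirty[index] = 1;
    mprotect(heap + index * page_size, page_size, PROT_READ | PROT_WRITE);
}

/* Maps NPAGES read-only pages and installs the handler. 0 on success. */
int setup_tracked_heap(void)
{
    long size = sysconf(_SC_PAGESIZE);
    if (size <= 0)
        return -1;
    page_size = (size_t)size;

    void *mem = mmap(NULL, NPAGES * page_size, PROT_READ,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return -1;
    heap = mem;

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    /* Some systems report protection faults as SIGBUS rather than SIGSEGV. */
    if (sigaction(SIGSEGV, &sa, NULL) != 0 || sigaction(SIGBUS, &sa, NULL) != 0)
        return -1;
    return 0;
}

int page_is_dirty(size_t page)
{
    assert(page < NPAGES);
    return dirty[page] != 0;
}

/* Stores `value` into the first byte of `page`; faults on first write. */
void write_byte(size_t page, char value)
{
    assert(heap != NULL && page < NPAGES);
    *(volatile char *)(heap + page * page_size) = value;
}
```

In this sketch a fault that is caught and handled is invisible to the program, which matches job 2 logging faults and still succeeding; a process that dies on such a fault never reached (or never installed) the handler.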

Cheers,

 -- Nikodemus

sbcl-help | 3 Jun 01:09 2008

Re: Cannot start two SBCLs at the same time on a PPC Mac

On 6/2/08, Nikodemus Siivola <nikodemus <at> random-state.net> wrote:
> These seem business as usual: the heap is WP protected, so that we can
> mark pages dirty when they are first written to.

Ah, that makes sense. I have recompiled with a more advanced tracking
of which lines in src/runtime/ have been visited. It doesn't seem like
the dying proc ever makes it into the memory fault handler.  I have
now placed markers in the other interrupt handlers, but the PC seems
to indicate those aren't involved either.

Somehow, the dying proc segfaults without getting caught.

more later,
Patrick
