Paul Goyette | 30 Nov 05:03 2015

Enhanced usability of syscall_{,dis}establish

Currently, the syscall_autoload capability (sys_nomodule() routine in
kern/kern_syscall.c) works only with the default emulation, even though
syscall_{,dis}establish() can handle addition (or removal) of syscall
packages to other emulations.

Because of this restriction in sys_nomodule(), non-default emulations
can only be "enhanced" by manually loaded modules.
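
To make the restriction concrete, here is a minimal sketch of what a per-emulation autoload lookup could look like. Everything in it is hypothetical: the structure names, the function autoload_lookup(), and the syscall numbers are invented for illustration and are not the kernel's actual data structures or the proposed patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: maps one syscall number to the module that provides it. */
struct autoload_entry {
	int         sysno;   /* syscall number within this emulation */
	const char *module;  /* module to load on first use */
};

/* Hypothetical: each emulation carries its own table (NULL if it has none). */
struct emul_sketch {
	const char                  *name;
	const struct autoload_entry *autoload;  /* ends with module == NULL */
};

/* Return the module for (emulation, sysno), or NULL if none is registered. */
const char *
autoload_lookup(const struct emul_sketch *e, int sysno)
{
	const struct autoload_entry *p;

	assert(e != NULL);
	if (e->autoload == NULL)
		return NULL;
	for (p = e->autoload; p->module != NULL; p++) {
		if (p->sysno == sysno)
			return p->module;
	}
	return NULL;
}

/* Example tables; the syscall numbers are made up for illustration. */
const struct autoload_entry netbsd32_autoload[] = {
	{ 257, "mqueue" },
	{ 155, "nfsserver" },
	{ 0, NULL }
};
const struct emul_sketch emul_netbsd32_sketch = { "netbsd32", netbsd32_autoload };
const struct emul_sketch emul_plain_sketch = { "other", NULL };
```

With a table per emulation, a lookup for a non-default emulation no longer has to fall back to "module must be loaded manually"; an emulation without a table simply yields NULL.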

This leads to the current practice of having "everything including the
kitchen sink" included in non-default emulations, e.g. compat_netbsd32.

When the module is built into the kernel, either all of the potential
dependencies must also be built-in (and the module header must identify
the list of dependencies), or the module must be built with some of the
functionality omitted.  (Dependencies for built-in modules cannot be
resolved through the auto-load mechanism.)  Neither approach is optimal.

Including everything in the kernel results in "bloat" (and sort of defeats
the purpose of modularity) for things that are not intended to be used.

Building the built-in compat module with reduced content precludes the
ability to use the omitted features, _even_if_ the underlying features
are later loaded into the kernel.

As an example, compat_netbsd32 currently includes the possibility of
translating the posix mqueue syscalls from 32-bit to 64-bit mode.  If
you include the compat_netbsd32 built-in module in your kernel, you
must also include mqueue.  Perhaps not a big deal, but the same is
true for nfsserver (due to translation of the nfssrv syscall) as
reported in PRs kern/50410 and kern/50468

Michael McConville | 30 Nov 03:40 2015

Kernel free() and NULL-safety

Is it decided that free/kern_free shouldn't be NULL-safe? That seems odd
and potentially risky to me, especially because POSIX specifies
userland's free() as NULL-safe. This change would also simplify the
kernel code by removing innumerable NULL checks.

I know that Linux and OpenBSD have NULL-safe kernel free(), so it may
also pose a risk when adapting or importing code.
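
A minimal userland sketch of the trade-off, assuming a free routine that rejects NULL. strict_free(), nullsafe_free() and struct conn_sketch are hypothetical names; strict_free() only models the contract under discussion and is not the NetBSD kernel's free().

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for an allocator whose free routine rejects NULL.
 * It only models the contract being discussed.
 */
void
strict_free(void *p)
{
	assert(p != NULL);	/* caller must never pass NULL */
	free(p);
}

/* NULL-safe wrapper: the check lives in one place instead of every caller. */
void
nullsafe_free(void *p)
{
	if (p == NULL)
		return;
	strict_free(p);
}

/* Typical cleanup path: with the wrapper, no per-pointer NULL tests. */
struct conn_sketch {
	char *name;	/* may be NULL if never set */
	char *buf;	/* may be NULL if never allocated */
};

void
conn_destroy(struct conn_sketch *c)
{
	nullsafe_free(c->name);
	nullsafe_free(c->buf);
	c->name = NULL;
	c->buf = NULL;
}
```

Code imported from a system with a NULL-safe free tends to look like conn_destroy(); calling strict_free() directly from such code would trip the assertion whenever a field was never allocated.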

Maxime Villard | 28 Nov 19:54 2015

x86: map kernel DATA+BSS with NX/XD bit

Here is a (draft) patch to map the kernel DATA and BSS segments with
the NX/XD bit in the PTEs on i386+amd64.

A nice PoC: patch your (amd64) kernel with the shellcode below, and
launch this:

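	#include <sched.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>
	int main() {
		/* bogus user pointer, as in the original PoC */
		sched_getparam(0, (struct sched_param *)0x01);
		return 0;
	}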

	gcc -m32 -o test test.c

You get a message from the kernel. Code got executed from the static
buffer (which just returns 5). Then, patch your kernel with the pmap
diff, reboot and relaunch the program: the kernel panics.

Finding information on this part of the kernel is not quite easy; I did
test this patch on amd64, but not i386 - my i386 CPU does not support

Do you have any suggestions? Is there something obviously wrong?



David Holland | 28 Nov 17:02 2015

per-process namespaces

As a few people have heard, I thought up a way to implement
per-process namespaces reasonably cheaply without requiring massive
rewrites of everything. It is kind of a hack, but not super awful.

Preliminary patch is here:

It at least mostly passes anita test, but hasn't itself really been
tested, as the userspace component hasn't been written yet. There
are also no man pages yet.

Comments on the approach? (And if you spot any obvious bugs in the
code, would be happy to hear about that too.)

The basic principle is that there are two namespaces (ordinary and
privileged) and setugid programs automatically switch to the
privileged namespace (which only root can muck with) by default. Also,
namespaces cannot be modified once installed and inherit from the
previous state. With these mechanisms, unprivileged processes can
manipulate their namespace arbitrarily without compromising
anything. Or so I believe. This is desirable for e.g. sandboxing GNOME
applications or enforcing bl3 in pkgsrc builds.

The scheme works by installing rules of the form (dir-vnode, name) ->
target-vnode. It is the responsibility of whoever crafts the namespace
to install coherent rules; e.g. a simple chroot requires two rules of
this form:
	(/chroots/foo, ..) -> /chroots/foo
	/ -> /chroots/foo
where / is a special case of the rule table key.
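
The rule table described above can be sketched roughly as follows. This is only an illustration under stated assumptions: vnodes are modelled as path strings, a NULL dir stands for the special "/" key, and the names ns_rule, ns_table, ns_root() and ns_lookup() are hypothetical and do not come from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Vnodes are modelled as path strings purely for illustration. */
struct ns_rule {
	const char *dir;     /* directory searched; NULL marks the "/" key */
	const char *name;    /* component looked up; unused for the "/" key */
	const char *target;  /* what the lookup resolves to instead */
};

struct ns_table {
	const struct ns_rule *rules;
	size_t                nrules;
};

/* Resolve the root: use the special "/" rule if present, else the real root. */
const char *
ns_root(const struct ns_table *t)
{
	assert(t != NULL);
	for (size_t i = 0; i < t->nrules; i++)
		if (t->rules[i].dir == NULL)
			return t->rules[i].target;
	return "/";
}

/* Return the override for (dir, name), or NULL for an ordinary lookup. */
const char *
ns_lookup(const struct ns_table *t, const char *dir, const char *name)
{
	assert(t != NULL && dir != NULL && name != NULL);
	for (size_t i = 0; i < t->nrules; i++) {
		const struct ns_rule *r = &t->rules[i];
		if (r->dir != NULL && strcmp(r->dir, dir) == 0 &&
		    strcmp(r->name, name) == 0)
			return r->target;
	}
	return NULL;
}

/* The two chroot rules from the text. */
const struct ns_rule chroot_rules[] = {
	{ "/chroots/foo", "..", "/chroots/foo" },
	{ NULL,           NULL, "/chroots/foo" },
};
const struct ns_table chroot_ns = { chroot_rules, 2 };
```

Looking up ".." in /chroots/foo stays inside the chroot, while any name without a rule returns NULL and would fall through to the normal lookup path.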

Robert Elz | 26 Nov 17:38 2015

In-kernel units for block numbers, etc ...

    Date:        Mon, 17 Aug 2015 23:20:01 +0000 (UTC)
    From:        mlelstv <at> (Michael van Elst)
    Message-ID:  <20150817232001.13113A6558 <at>>

  | The following reply was made to PR bin/50108; it has been noted by GNATS.

The quotes are from a message Michael van Elst made in reply to PR bin/50108
back on 17th August (2015 in case it isn't implied).   The full message
should be available in the PR.

As people who have been following the (meandering) thread related to
"beating a dead horse" on the netbsd-users list will know, I have been
looking at supporting drives with 4K sector sizes properly in NetBSD.

Currently what we have is a mess.   Really, despite this ...

  |  At some point, about when the SCSI subsystem was integrated into
  |  the kernel, the model was changed, the kernel now uses the fixed
  |  DEV_BSIZE=512 coordinates and the driver translates that into
  |  physical blocks.

being kind of close, it is just not true.
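
For reference, the model in the quoted text (fixed 512-byte coordinates that the driver translates into physical blocks) amounts to the arithmetic below. This is a sketch of that model only, not of what the kernel actually does everywhere; DEV_BSIZE_SKETCH, struct phys_pos and both functions are hypothetical names, and the sector size is assumed to be a multiple of 512.

```c
#include <assert.h>
#include <stdint.h>

#define DEV_BSIZE_SKETCH 512u	/* the fixed unit named in the quote */

/* Position on the medium, expressed in physical sectors. */
struct phys_pos {
	uint64_t sector;	/* physical sector index */
	uint32_t offset;	/* byte offset within that sector */
};

/*
 * Translate a block number counted in 512-byte units into a physical
 * sector plus byte offset. secsize must be a multiple of 512.
 */
struct phys_pos
devblk_to_sector(uint64_t blkno, uint32_t secsize)
{
	struct phys_pos pos;
	uint32_t ratio;

	assert(secsize >= DEV_BSIZE_SKETCH && secsize % DEV_BSIZE_SKETCH == 0);
	ratio = secsize / DEV_BSIZE_SKETCH;	/* 512-byte blocks per sector */
	pos.sector = blkno / ratio;
	pos.offset = (uint32_t)(blkno % ratio) * DEV_BSIZE_SKETCH;
	return pos;
}

/* Inverse direction: first 512-byte block covered by a physical sector. */
uint64_t
sector_to_devblk(uint64_t sector, uint32_t secsize)
{
	assert(secsize >= DEV_BSIZE_SKETCH && secsize % DEV_BSIZE_SKETCH == 0);
	return sector * (secsize / DEV_BSIZE_SKETCH);
}
```

On a 4K-sector drive, block 17 in 512-byte units lands in physical sector 2 at byte offset 512, which is exactly the kind of unaligned case that causes trouble when code mixes the two units.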

Now that PR, and Michael's message, were mostly on the topic of
kernel/user interactions, and noted that userland is expected to use
sector size units, not DEV_BSIZE, and all is fine most of the time,
except for code that's shared (shared WAPBL code was the issue of the PR).

So, not getting all the in-kernel details precisely correct may be

Martin Husemann | 26 Nov 14:43 2015

Hack for pullup of e_vm_default_addr() change

Hey folks,

I just committed this slightly intrusive cleanup to our "use topdown VA
depending on the properties of the exec'd binary" support:

Log Message:
We never exec(2) with a kernel vmspace, so do not test for that, but instead
KASSERT() that we don't.
When calculating the load address for the interpreter (e.g. ld.elf_so),
we need to take into account whether the exec'd process will run with
topdown memory or bottom up. We cannot use the current vmspace's flags
to test for that, as this happens too early. Luckily the execpack already
knows what the new state will be later, so instead of testing the current
vmspace, pass the info as additional argument to struct emul
Fix all such functions and adapt all callers.


and I would like to pull this change up, but unfortunately such a signature
change and the following kernel rev bump cannot be done on a release branch.

I came up with a working but quite ugly solution that I'd like to present
below. But first some background:

On most architectures the decision "we use topdown memory layout" is made
once and for all binaries. The only exception that I am aware of is sparc64:
we cannot use topdown for some binaries, because they have been compiled

Randy White | 24 Nov 09:04 2015

I'm interested in implementing lockless fifo/lifo in the kernel

I love NetBSD, and I would like to contribute. I see the open job for lockless queues and stacks. I want to learn, and I want to help. I have literature on UNIX kernel development. I have many systems and I think I could fund myself for the most part. I am familiar with lockless programming.

I am looking forward to working on NetBSD and helping to maintain its awesomeness.

Frank Wille | 23 Nov 19:45 2015

thousands of callout_cpu0


While debugging a different problem I checked the callout list with ddb(4)
on my Amiga and saw this:

db> callout
hardclock_ticks now: 825
    ticks  wheel                  arg  func
      -26 -1/-256                   0  pffasttimo
  4529516 -1/-256              4520a0  callout_cpu0
  4529521 -1/-256              4520a8  callout_cpu0
  4529526 -1/-256              4520b0  callout_cpu0
  4529531 -1/-256              4520b8  callout_cpu0
  4529536 -1/-256              4520c0  callout_cpu0
  4529541 -1/-256              4520c8  callout_cpu0
  4529546 -1/-256              4520d0  callout_cpu0
  4529551 -1/-256              4520d8  callout_cpu0
  4529556 -1/-256              4520e0  callout_cpu0
  4529561 -1/-256              4520e8  callout_cpu0
  4529566 -1/-256              4520f0  callout_cpu0
  4529571 -1/-256              4520f8  callout_cpu0
  4529577 -1/-256              452100  callout_cpu0
  4529582 -1/-256              452108  callout_cpu0
  4529587 -1/-256              452110  callout_cpu0
  4534507 -1/-256              454088  callout_cpu0
791670873 -1/-256              454090  callout_cpu0
2002868682  -1/-256           2f300000  ?

There must be more than a thousand callout_cpu0 entries with a ticks
value that looks like a pointer (4529516 == 0x451d6c), after only a few
seconds of booting into single user.

The problem has existed since at least 7.99.4 (7.0 is ok). I saw it on several
architectures, like i386, ppc and m68k.

Any idea what happened here?


Frank Wille

Masao Uebayashi | 21 Nov 12:13 2015


I'm too young to understand how signal works in kernel.  But I guess
I'm not alone.

I think that renaming things a bit would help people to understand the code.

- sendsig() -> netbsd_sendsig()
- trapsignal() -> netbsd_trapsignal()

These are native emul functions of e_sendsig and e_trapsignal respectively.


- postsig() -> sendsig()

This is so badly named and incredibly confusing, as there is a
function called sigpost() which is completely different.

sigpost() posts a signal to a signal queue.  sigpost() can be called
from anywhere including interrupt context, because all it does is to

sendsig(), the function which is called as postsig() now, and referred
to by the signal(9) manual page and comments all over the tree, is the
sequence of code that is called when a signal is delivered (some
actual action is taking place).

sendsig() ("postsig()") is only called from userret() -> mi_userret()
-> lwp_userret().  If a pending signal is found and it is decided that it
should be delivered, the trap frame is overwritten by the signal handler
trampoline.  When the lwp returns to user mode, it knows it is running signal


Hopefully these verbs in the signal(9) man page and in comments will be
thoroughly reviewed, sorted out, and updated:

- send
- post
- deliver
- invoke
- schedule
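
The post/deliver distinction described above can be sketched in userland as follows. This is a simplified model, not kernel code: struct lwp_sketch, sketch_post() and sketch_deliver_on_userret() are hypothetical names, and the handler is called directly here, whereas the real kernel rewrites the trap frame to point at the signal trampoline.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NSIG_SKETCH 32

typedef void (*sig_handler_t)(int);

struct lwp_sketch {
	uint32_t      pending;               /* bit n set => signal n queued */
	sig_handler_t handler[NSIG_SKETCH];  /* NULL => no handler installed */
};

/* "Post": only records the signal, so it is cheap in any context. */
void
sketch_post(struct lwp_sketch *l, int signo)
{
	assert(signo > 0 && signo < NSIG_SKETCH);
	l->pending |= (uint32_t)1 << signo;
}

/*
 * "Deliver": runs on the way back to user mode. Takes the lowest pending
 * signal, clears it, and invokes its handler. Returns the signal number
 * delivered, or 0 if nothing was pending.
 */
int
sketch_deliver_on_userret(struct lwp_sketch *l)
{
	for (int signo = 1; signo < NSIG_SKETCH; signo++) {
		uint32_t bit = (uint32_t)1 << signo;
		if (l->pending & bit) {
			l->pending &= ~bit;
			if (l->handler[signo] != NULL)
				l->handler[signo](signo);
			return signo;
		}
	}
	return 0;
}

/* Tiny handler used to observe delivery. */
int last_delivered;

void
record_signal(int signo)
{
	last_delivered = signo;
}
```

Posting changes only the pending set; nothing observable happens to the process until the deliver step runs, which is why the two verbs deserve clearly different names.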

Ryota Ozaki | 20 Nov 06:13 2015

Use lltable/llentry for NDP


As I promised somewhere, I'm trying to use
lltable/llentry, which were introduced for ARP
a while ago, for NDP as well.

Here is a patch:

Unlike the ARP case, the old data structure (llinfo_nd6)
is similar to the new one (llentry), so the changes are
not as radical as they were for ARP.

One noticeable change is to the neighbor cache GC
mechanism that was introduced to prevent IPv6 DoS
attacks. net.inet6.ip6.neighborgcthresh was the maximum
number of caches stored in the whole system. With
lltable/llentry, the value becomes a per-interface
limit because lltable/llentry stores neighbor caches
in each interface separately. The change also brings
one regression: the old GC mechanism dropped excess
packets based on LRU, while the new implementation
drops packets in order from the beginning of the
lltable (a hash table + linked lists). This could be
improved in the future.
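
The difference between the two GC policies can be illustrated with a small sketch. It is an assumption-laden model, not the patch: the table is a flat array standing in for one interface's lltable, eviction is modelled as clearing an entry, and nc_entry, nc_victim_lru(), nc_victim_first() and nc_gc() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

struct nc_entry {
	int           in_use;     /* slot holds a live neighbor cache entry */
	unsigned long last_used;  /* larger value => used more recently */
};

/* Count live entries in one interface's table. */
size_t
nc_count(const struct nc_entry *tbl, size_t n)
{
	size_t live = 0;

	for (size_t i = 0; i < n; i++)
		live += tbl[i].in_use ? 1 : 0;
	return live;
}

/* LRU policy: pick the live entry with the oldest last_used (n if none). */
size_t
nc_victim_lru(const struct nc_entry *tbl, size_t n)
{
	size_t victim = n;

	for (size_t i = 0; i < n; i++) {
		if (!tbl[i].in_use)
			continue;
		if (victim == n || tbl[i].last_used < tbl[victim].last_used)
			victim = i;
	}
	return victim;
}

/* Table-order policy: pick the first live entry encountered (n if none). */
size_t
nc_victim_first(const struct nc_entry *tbl, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].in_use)
			return i;
	return n;
}

/* Evict with the chosen policy until at most thresh entries remain. */
void
nc_gc(struct nc_entry *tbl, size_t n, size_t thresh,
    size_t (*pick)(const struct nc_entry *, size_t))
{
	while (nc_count(tbl, n) > thresh) {
		size_t v = pick(tbl, n);

		assert(v < n);	/* count > thresh implies a live entry exists */
		tbl[v].in_use = 0;
	}
}
```

With table-order eviction, a recently used entry that happens to sit at the front of the table is dropped before a stale one further back, which is the regression mentioned above.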

Any comments and suggestions are appreciated.


Emmanuel Dreyfus | 19 Nov 15:21 2015

WAPBL panic


On NetBSD 6.1.5/amd64, I get:

panic: wapbl_flush: current transaction too big to flush

cpu1: Begin traceback...
printf_nolog() at netbsd:printf_nolog
wapbl_begin() at netbsd:wapbl_begin
wapbl_begin() at netbsd:wapbl_begin+0x45
ffs_valloc() at netbsd:ffs_valloc+0x5e
ufs_makeinode() at netbsd:ufs_makeinode+0x62
ufs_create() at netbsd:ufs_create+0x5c
VOP_CREATE() at netbsd:VOP_CREATE+0x38
vn_open() at netbsd:vn_open+0x2ed
do_open() at netbsd:do_open+0xcd
sys_open() at netbsd:sys_open+0x59
syscall() at netbsd:syscall+0xc4
cpu1: End traceback...

A fsck was run two days ago.  Any idea of what can be wrong? Is it a
known problem? Would updating to netbsd-7 help?


Emmanuel Dreyfus
manu <at>