Maxime Villard | 28 Jul 21:44 2014

Set of 26 potential bugs

I've improved my code scanner, and it has found 26 potential bugs under sys/.
Instead of flooding tech-kern@, tech-net@, etc. with questions and fixes, I have put
a comprehensive list here:

Some of them will probably need to be pulled up, like the leak in netinet6
(#03-0x02) which seems to be triggerable from root.


Sanchit Gupta | 28 Jul 07:10 2014

Interested in working on 'Rewriting Kernfs and Procfs'

I am Sanchit Gupta, an undergraduate at the University of Illinois at Urbana-Champaign. I am currently an intern
at VMware (Kernel Team). My mentor is Matthew Green, one of the members of the NetBSD core group.

I am really interested in operating systems and wanted to work on some of the projects under NetBSD to gain
some experience, and also to get to know other developers out there who are also into systems programming.
I know that I am a beginner at this, but it would be great if I could learn under the guidance of some of your

I wanted to get involved with development on NetBSD by working on Kernfs and Procfs. I would be grateful if
you could get me hooked up with the team that is currently working on this issue. 

Also, I wanted to know whether the ZFS port has been completed, as I am also interested in working on that project.

Ryota Ozaki | 27 Jul 07:38 2014

Making global variables of if.c MPSAFE


This is another piece of work toward MPSAFE networking.
sys/net/if.c contains several global variables
(e.g., ifnet_list) that should be protected
from parallel accesses somehow once we get rid
of the big kernel lock.

Currently there is a mutex for serializing
accesses to index_gen; however, that's not
enough: we have to protect the other variables.

The global variables are read-mostly, so I
replace the mutex with a rwlock and use it
for all of them. Unfortunately, ifnet_list may
be accessed from interrupt context (read-only,
though), so I add a spin mutex for it; when we
modify ifnet_list we hold both the rwlock and
the spin mutex.

Here is the patch. It doesn't remove any existing
locks and splnets yet.

Any comments and suggestions are welcome.


David Holland | 27 Jul 07:19 2014

fdiscard error cases

Currently filesystems that don't support discard fail with EOPNOTSUPP,
devices that don't support discard fail with ENODEV, and once you get
to the hardware, devices that don't feel like doing TRIM fail silently
(as TRIM is an advisory operation...).

It was pointed out that it would be well to distinguish devices that
don't currently support discard, but theoretically should (because
they're disks) from devices where it makes no sense (e.g. ttys). This
is probably a good idea.

The question is: what errnos to use for which cases? I'd originally
suggested ENOSYS for the ones where it's just not implemented, but
this is traditionally not appropriate (ENOSYS being meant for
unimplemented whole syscalls)... another choice is to use EOPNOTSUPP
for the devices that should theoretically support discard, but it was
also at the same time proposed to use ENODEV for those and EOPNOTSUPP
for devices where it doesn't make sense.
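
To make the second proposal concrete, here is a minimal sketch of the mapping it implies. The enum discard_class and the function discard_errno are hypothetical names invented for illustration; no such helper exists in the tree, and this shows only one of the options under discussion.

```c
#include <errno.h>

/* Hypothetical classification of a device with respect to discard. */
enum discard_class {
	DISCARD_SUPPORTED,	/* driver implements discard */
	DISCARD_UNIMPLEMENTED,	/* a disk that could support it, but doesn't yet */
	DISCARD_NONSENSE	/* discard is meaningless here, e.g. a tty */
};

/*
 * Second proposal from the text: ENODEV for disks that should
 * theoretically support discard, EOPNOTSUPP where it makes no sense.
 */
int
discard_errno(enum discard_class c)
{
	switch (c) {
	case DISCARD_SUPPORTED:
		return 0;
	case DISCARD_UNIMPLEMENTED:
		return ENODEV;
	case DISCARD_NONSENSE:
		return EOPNOTSUPP;
	}
	return EINVAL;		/* unknown class */
}
```

Under the first option mentioned above, the DISCARD_UNIMPLEMENTED case would return EOPNOTSUPP instead.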

Currently ENODEV prints "Operation not supported by device" and
EOPNOTSUPP prints "Operation not supported". Traditionally EOPNOTSUPP
is for sockets, of course... although that stopped being strictly true
some time ago. (Then there's also ENOTSUP, which POSIX invented when
they needed EOPNOTSUPP but didn't have it.)

The devices in question are ccd, cgd (maybe), ld, raidframe, and vnd,
and it might make sense to apply the same errno in sd until it gains
support for the SAS equivalent of TRIM... and maybe also in other
random disk devices, like the vax ones. (It's also possible that vax
disks should accept and ignore discard...)


Kengo NAKAHARA | 25 Jul 11:06 2014

RFC: IRQ affinity (aka interrupt routing)


I have implemented IRQ affinity (aka interrupt routing) for amd64. Here is
the implementation.
But the UI is rough, so could you comment on the UI?

The implementation consists of the following three patches:
    (1) IRQ affinity implementation itself
        The usage is "sysctl -w kern.cpu_affinity.irq=18:1" ("18" is the
        IRQ number, "1" is the cpuid). This means that IRQ 18 interrupts
        are routed to the CPU core with cpuid 1.

    (2) new kernfs file to show interrupt counts per CPU
        "vmstat -i" cannot show interrupt counts per CPU, so I
        implemented a new kernfs file that shows them per CPU. The usage
        is "cat /kern/interrupts". For example, we see the output below,
        which is like Linux's /proc/interrupts:
         IRQ      CPU#00          CPU#01
           1           0*              0
           3           0*              0
           4           0*              0
           6           0*              0
           7           0*              0
           9           0*              0
          12           0*              0
          14           0*              0
          15           0*              0
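
As an illustration of the "IRQ:cpuid" value format used by the sysctl in (1), the following is a small parsing sketch. The function parse_irq_affinity is a hypothetical name and is not taken from the patch; how the actual handler validates its input is not shown in this message.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Parse a string of the form "IRQ:CPUID" (e.g. "18:1").
 * Returns 0 and fills *irq and *cpu on success; returns -1 and
 * leaves both untouched on malformed input.
 */
int
parse_irq_affinity(const char *s, int *irq, int *cpu)
{
	char *end;
	long i, c;

	errno = 0;
	i = strtol(s, &end, 10);
	if (end == s || *end != ':' || errno != 0 || i < 0 || i > INT_MAX)
		return -1;

	s = end + 1;		/* skip the ':' separator */
	errno = 0;
	c = strtol(s, &end, 10);
	if (end == s || *end != '\0' || errno != 0 || c < 0 || c > INT_MAX)
		return -1;

	*irq = (int)i;
	*cpu = (int)c;
	return 0;
}
```

A real handler would additionally have to check that the IRQ exists and that the cpuid refers to an online CPU.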

Thor Lancelot Simon | 23 Jul 14:46 2014

Marvell 88SE9230 AHCI?

I just ordered a Supermicro low-power rackmount x86 machine which -- I
discovered after ordering it -- uses a Marvell 88SE9230 as a SATA controller.

From FreeBSD's ahci.c, I see this is approximately an AHCI compatible
controller but that it has some quirks.  Is anyone using NetBSD with this
controller or the similar 88SE9128, 9128, or 9130?


Alan Barrett | 23 Jul 11:31 2014

More detailed build information in kernels

I have some private patches that append arbitrary additional 
information to the kernel version string.  Essentially, I pass 
BUILDINFO="<multi-line message here>" in the environment (through and the make wrapper), and then a modified version of 
src/sys/conf/ appends it to the value of the "version" 
variable in the vers.c file that's compiled into the kernel.  I 
also add the information to /etc/release.

The additional information is exposed in sysctl kern.version, and 
in {struct utsname}.version as returned by uname(3), and in the 
output from uname(1) -v.
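
For reference, a minimal sketch of reading the version string through the POSIX uname() interface mentioned above. The wrapper name kernel_version is hypothetical; only uname() and struct utsname are standard.

```c
#include <stdio.h>
#include <sys/utsname.h>

/*
 * Copy the kernel version string reported by uname() into buf.
 * Returns 0 on success, -1 on failure or if len is 0.
 * The string is truncated if buf is too small.
 */
int
kernel_version(char *buf, size_t len)
{
	struct utsname u;

	if (len == 0 || uname(&u) == -1)
		return -1;
	snprintf(buf, len, "%s", u.version);
	return 0;
}
```

The version member of struct utsname is a fixed-size array, so a long build string cannot be returned in full through this interface.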

I use this feature to add information about the source tree
and build date, so I see information like this:

$ sysctl kern.version
kern.version = NetBSD 6.99.47 (APB)
fossil repository: apb-local-src.fossil
fossil tags: local
fossil commit: 449e51b700 (2014-07-19 15:41:14 UTC)
fossil comment: merge src from trunk as of 2014-07-19 00:00 UTC

I imagine that it would be useful for official builds to
include some sort of official statement here.

The multi-line BUILDINFO strings are truncated and folded to a 
single line by uname(3), which is unhelpful, so I am inclined to 
store them in a new kernel variable, exposed via a new sysctl 
node, instead of appending to the existing kernel version 
variable.  Then the new information would not be exposed by 

Taylor R Campbell | 23 Jul 02:42 2014

task queues

We don't have a nice easy lightweight way for a driver to ask that a
task be executed asynchronously in thread context.  The closest we
have is workqueue(9), but each user has to do bookkeeping for a
different workqueue and incurs a long-term kthread, or ncpu long-term
kthreads, that are mostly idle -- most users of workqueue(9) create a
single struct work for the workqueue and use it for a single purpose.

For USB drivers, we have usb_task, but the API is limited and, for
most users, broken: there's no way to wait for a task to complete,
e.g. before detaching a device and unloading its driver module.
There's also no reason this is peculiar to USB drivers.
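
The requirement stated here (run a task asynchronously in thread context, and be able to wait for it before tearing things down) can be illustrated with a userland sketch built on POSIX threads. Everything named sketch_* below is hypothetical and unrelated to the actual proposal or to usb_task; one worker thread serves one global queue, which is a simplification.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_task {
	void (*fn)(struct sketch_task *);
	struct sketch_task *next;
	bool pending;		/* queued or currently running */
};

static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t q_cv = PTHREAD_COND_INITIALIZER;
static struct sketch_task *q_head, *q_tail;
static bool q_stop;
int sketch_demo_runs;		/* how often the demo action has run */

/* Demo action: just count the invocation. */
void
sketch_demo_action(struct sketch_task *t)
{
	(void)t;
	sketch_demo_runs++;
}

void
sketch_task_init(struct sketch_task *t, void (*fn)(struct sketch_task *))
{
	t->fn = fn;
	t->next = NULL;
	t->pending = false;
}

/* Queue the task; a task that is already pending is left alone. */
void
sketch_task_schedule(struct sketch_task *t)
{
	pthread_mutex_lock(&q_lock);
	if (!t->pending) {
		t->pending = true;
		t->next = NULL;
		if (q_tail != NULL)
			q_tail->next = t;
		else
			q_head = t;
		q_tail = t;
		pthread_cond_broadcast(&q_cv);
	}
	pthread_mutex_unlock(&q_lock);
}

/* Block until the task is neither queued nor running. */
void
sketch_task_wait(struct sketch_task *t)
{
	pthread_mutex_lock(&q_lock);
	while (t->pending)
		pthread_cond_wait(&q_cv, &q_lock);
	pthread_mutex_unlock(&q_lock);
}

/* Worker thread body: run queued tasks until stopped and drained. */
void *
sketch_worker(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&q_lock);
	for (;;) {
		while (q_head == NULL && !q_stop)
			pthread_cond_wait(&q_cv, &q_lock);
		if (q_head == NULL)
			break;		/* stop requested, queue empty */
		struct sketch_task *t = q_head;
		q_head = t->next;
		if (q_head == NULL)
			q_tail = NULL;
		pthread_mutex_unlock(&q_lock);
		t->fn(t);		/* run without holding the lock */
		pthread_mutex_lock(&q_lock);
		t->pending = false;
		pthread_cond_broadcast(&q_cv);	/* wake any waiters */
	}
	pthread_mutex_unlock(&q_lock);
	return NULL;
}

/* Ask the worker to exit once the queue is empty. */
void
sketch_worker_stop(void)
{
	pthread_mutex_lock(&q_lock);
	q_stop = true;
	pthread_cond_broadcast(&q_cv);
	pthread_mutex_unlock(&q_lock);
}
```

A driver-like caller would start the worker with pthread_create(), call sketch_task_schedule() when work arises, and call sketch_task_wait() before freeing the structure that embeds the task.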

The other month I threw together a facility for scheduling
lightweight tasks to be executed in thread context on the current CPU,
supporting cancellation, and without requiring a long-term kthread per

All you need to do to call my_task_action asynchronously is allocate a
struct task object, e.g. in your device softc, and do

	task_init(task, &my_task_action);

static void
my_task_action(struct task *task)
{

	printf("My task ran!\n");
}


Taylor R Campbell | 23 Jul 02:25 2014

pserialized reader/writer locks

While reading the fstrans(9) code a few months ago and trying to
understand what it does, I threw together a couple of simple
pserialize(9)-based reader/writer locks -- one recursive for nestable
transactions like fstrans(9), the other non-recursive.

Unlike rwlock(9), frequent readers do not require interprocessor
synchronization: they use pool_cache(9) to manage per-CPU caches of
in-use records which writers wait for.  Writers, in contrast, are very

The attached code is untested and just represents ideas that were
kicking around in my head.  Thoughts?  Worthwhile to experiment with?
/*	$NetBSD$	*/

 * This code does not run!  I have not even compile-tested it.

 * Copyright (c) 2014 The NetBSD Foundation, Inc.
 * All rights reserved.
 * This code is derived from software contributed to The NetBSD Foundation
 * by Taylor R. Campbell.
 * Redistribution and use in source and binary forms, with or without

Taylor R Campbell | 23 Jul 02:12 2014

removing mappings of uvm objects or device pages

How can I remove all virtual mappings of a range of a uvm object, or
all virtual mappings of a physical page that is a device page rather
than a physical RAM page?

- I can't uvm_unmap(vm_map, start, end) because I don't know every
  vm_map into which this uvm object has been mapped and where,
  although if I had a way to list weak references to vm_maps perhaps I
  could keep books about this.

- I can't pmap_page_protect(vm_page, VM_PROT_NONE) because there is no
  struct vm_page for the pages in question -- they're device pages,
  not system RAM pages.

Background (with apologies for the length; this is a little tricky):

Graphics drivers involve buffers, represented by UVM objects, that can
be mapped into the GPU virtual address space or into one or more of
the various CPU address spaces.  Sometimes the driver needs to free up
some GPU virtual address space, so it will choose an inactive buffer
to evict, say at V_gpu, and unmap the GPU page table entry for V_gpu.

If the buffer is not mapped into any CPU virtual address space, all is
well and good.  But if it is mapped into a CPU virtual address space,
say at the virtual address V_cpu, it's a little tricky.

While shared between GPU and CPU virtual address spaces, graphics
buffers are backed by physical pages of system RAM, say at the
physical address P_ram.  The GPU page table's entry for V_gpu points
at P_ram.  But the buffer's fault handler for V_cpu doesn't map it to
P_ram -- it maps V_cpu to a special physical address P_aper in a

Taylor R Campbell | 23 Jul 01:01 2014

uvm objects and bus_dma

Several graphics drivers use swap-backed buffers that sometimes need
to be wired into physical memory and made available for DMA by the
device.  For buffers that are always wired and always mapped into KVA,
we can use bus_dmamem.  But these buffers are numerous and large, so
they should not be wired or mapped into KVA all the time.

I propose to add the following MI APIs to bus_dma(9):

int	bus_dmamem_pgfl(bus_dma_tag_t dmat);

   Return a page freelist number fit for use with uao_set_pgfl or
   uvm_pagealloc_strat(UVM_PGA_STRAT_ONLY).  Pages allocated from this
   freelist are guaranteed to have physical addresses fit for DMA with

int	bus_dmamap_load_pglist(bus_dma_tag_t dmat, bus_dmamap_t map,
	    struct pglist *pglist, bus_size_t size, int flags);

   Load map for DMA to the pages in pglist.  The pages must have been
   allocated from the freelist returned by bus_dmamem_pgfl(dmat), and
   are typically obtained by uvm_obj_wirepages.

Typical usage:

	/* Limit to 36-bit paddrs for DMA.  */
	if (bus_dmatag_subregion(pa->pa_dmat64, 0, __BITS(0,35), &dmat,
		goto fail0;

	/* Create an object.  */