Mark Williamson | 24 Apr 17:01 2015

Regression: Requiring CAP_SYS_ADMIN for /proc/<pid>/pagemap causes application-level breakage

Hi all,

<resending without unwanted HTML-ifying - apologies for the noise if
this appears twice for you!>

Recent changes have restricted a userspace interface used by our
product; specifically, a security patch to require CAP_SYS_ADMIN when
opening /proc/PID/pagemap
(https://github.com/torvalds/linux/commit/ab676b7d6fbf4b294bf198fb27ade5b0e865c7ce,
original LKML discussion here: https://lkml.org/lkml/2015/3/9/864).

Although I've marked this as a "Regression", we do realise there are
legitimate security concerns over the original implementation of this
interface.  Still, given the kernel's strong stance on preserving
userspace interfaces, we thought we ought to flag this quickly as
something that has changed application-relevant behaviour.

We believe this change came into released kernels with Linux 4.0.  We
first observed problems when testing on Ubuntu 15.04 this week; I see
the patch is now backported to the various -stable kernel lines, so
I'd expect it to show up in other distros in due course.  The obvious
solution (to simply run with CAP_SYS_ADMIN) is quite undesirable for
our product, which is a debugger; we're expecting our users to run
without special privileges.

In our use of /proc/PID/pagemap, we currently make use of the physical
pageframe addresses.  We should be able to work with a scrambled
representation of these (Andy Lutomirski suggested this in the
original discussion - https://lkml.org/lkml/2015/3/16/1273) so long as
the scrambling remained consistent during the lifetime of the open
(Continue reading)
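
For readers unfamiliar with the interface, here is a minimal sketch of how a userspace tool might look up the page frame number behind one virtual address. It assumes the entry layout given in the kernel's pagemap documentation (one 64-bit entry per virtual page, PFN in bits 0-54, bit 63 set when the page is present in RAM). The helper names are hypothetical and are not taken from the product discussed in the message above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define PM_PFN_MASK ((UINT64_C(1) << 55) - 1)	/* bits 0-54: page frame number */
#define PM_PRESENT  (UINT64_C(1) << 63)		/* bit 63: page present in RAM */

/* Byte offset of the entry describing vaddr (one 8-byte entry per page). */
static uint64_t pagemap_offset(uint64_t vaddr, uint64_t page_size)
{
	assert(page_size != 0);
	return (vaddr / page_size) * sizeof(uint64_t);
}

/* Returns 1 and stores the PFN if the entry describes a present page. */
static int pagemap_entry_pfn(uint64_t entry, uint64_t *pfn)
{
	if (!(entry & PM_PRESENT))
		return 0;
	*pfn = entry & PM_PFN_MASK;
	return 1;
}

/* Read the raw entry for vaddr in the calling process; -1 on failure. */
static int pagemap_read_self(const void *vaddr, uint64_t *entry)
{
	long page_size = sysconf(_SC_PAGESIZE);
	ssize_t n;
	int fd;

	if (page_size <= 0)
		return -1;
	/* On kernels that restrict the open as described above, this fails. */
	fd = open("/proc/self/pagemap", O_RDONLY);
	if (fd < 0)
		return -1;
	n = pread(fd, entry, sizeof(*entry),
		  (off_t)pagemap_offset((uintptr_t)vaddr, (uint64_t)page_size));
	close(fd);
	return n == (ssize_t)sizeof(*entry) ? 0 : -1;
}
```

A caller would pass the address of a touched page to pagemap_read_self() and then hand the raw entry to pagemap_entry_pfn(). A tool that only compares frame numbers for equality, as the message suggests, would not depend on the PFN being the true physical value.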

Andreas Gruenbacher | 24 Apr 13:03 2015

[RFC v3 00/45] Richacls

Hello,

here's another update of the richacl patch queue.  The changes since the last
posting (https://lwn.net/Articles/638242/) include:

 * The nfs client now allocates pages for received acls on demand like the
   server does.  It no longer caches the acl size between calls.

 * All possible acls consisting of only owner@, group@, and everyone@ entries
   which are equivalent to the file mode permission bits are now recognized.
   This is needed because by the NFSv4 specification, the nfs server must
   translate the file mode permission bits into an acl if it supports acls at
   all.

 * Support for the dacl attribute over NFSv4.1 for Automatic Inheritance, and
   also for the write_retention and write_retention_hold permissions.

 * The richacl_compute_max_masks() documentation has been improved.

 * Various minor bug fixes.

The git version is available here:

  git://git.kernel.org/pub/scm/linux/kernel/git/agruen/linux-richacl.git \
	richacl-2015-04-24

The richacl command-line has been split into getrichacl and setrichacl, in line
with getfacl and setfacl.  Watch out for that when updating the user-space.

Things still to be done, or which I'm not entirely happy with:
(Continue reading)

Pantelis Antoniou | 24 Apr 11:45 2015

[PATCH v3 0/4] of: overlay: kobject & sysfs'ation

The first patch puts the overlays as objects in the sysfs in
/sys/firmware/devicetree/overlays.

The next adds a master overlay enable switch (which, once set to
disabled, cannot be re-enabled), while the one after that
introduces a number of default per-overlay attributes.

The patchset is against Linus's tree as of today.

The last patch updates the ABI docs for the sysfs entries.

Changes since v2:
* Removed the unittest patch.
* Split the sysfs attribute patch to a global and a per-overlay
  patch.
* Dropped binary attributes using textual kobj_attributes instead.

Changes since v1:
* Maintainer requested changes.
* Documented the sysfs entries
* Per overlay sysfs attributes.

Pantelis Antoniou (4):
  of: overlay: kobjectify overlay objects
  of: overlay: global sysfs enable attribute
  of: overlay: add per overlay sysfs attributes
  Documentation: ABI: /sys/firmware/devicetree/overlays

 .../ABI/testing/sysfs-firmware-devicetree-overlays |  23 ++++
 drivers/of/base.c                                  |   5 +
(Continue reading)

Pantelis Antoniou | 24 Apr 11:42 2015

[PATCH] of: unittest: overlay: Keep track of created overlays

During the course of the overlay selftests some of them remain
applied. While this does not pose a real problem, make sure you track
them and destroy them at the end of the test.

Signed-off-by: Pantelis Antoniou <pantelis.antoniou@...>
---
 drivers/of/unittest.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 62 insertions(+)

diff --git a/drivers/of/unittest.c b/drivers/of/unittest.c
index e844907..1801634 100644
--- a/drivers/of/unittest.c
+++ b/drivers/of/unittest.c
@@ -23,6 +23,8 @@
 #include <linux/i2c.h>
 #include <linux/i2c-mux.h>

+#include <linux/bitops.h>
+
 #include "of_private.h"

 static struct unittest_results {
@@ -1109,6 +1111,59 @@ static const char *overlay_path(int nr)

 static const char *bus_path = "/testcase-data/overlay-node/test-bus";

+/* it is guaranteed that overlay ids are assigned in sequence */
+#define MAX_UNITTEST_OVERLAYS	256
+static unsigned long overlay_id_bits[BITS_TO_LONGS(MAX_UNITTEST_OVERLAYS)];
+static int overlay_first_id = -1;
(Continue reading)
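
The bookkeeping this patch describes can be pictured with a small userspace sketch. It is not the code from the patch, whose diff is truncated above. It only illustrates recording created overlay ids as offsets from the first id in a fixed-size bitmap so they can be walked and destroyed at the end of a run. All names are hypothetical, plain C bit operations stand in for the kernel's bitops helpers, and destroying the highest id first is an assumption of this sketch.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define MAX_TRACKED_IDS 256
#define BITS_PER_WORD   (sizeof(unsigned long) * CHAR_BIT)
#define WORDS_FOR(n)    (((n) + BITS_PER_WORD - 1) / BITS_PER_WORD)

static unsigned long id_bits[WORDS_FOR(MAX_TRACKED_IDS)];
static int first_id = -1;	/* base id; -1 until the first id is recorded */

/* Map an id to its bit offset, or return -1 if it is out of range. */
static int id_offset(int id)
{
	if (first_id < 0 || id < first_id || id - first_id >= MAX_TRACKED_IDS)
		return -1;
	return id - first_id;
}

/* Record an id; the first id recorded becomes the base. Returns 0 or -1. */
static int track_id(int id)
{
	int off;

	if (id < 0)
		return -1;
	if (first_id < 0)
		first_id = id;
	off = id_offset(id);
	if (off < 0)
		return -1;
	id_bits[(size_t)off / BITS_PER_WORD] |= 1UL << ((size_t)off % BITS_PER_WORD);
	return 0;
}

/* Returns 1 if the id is currently recorded, 0 otherwise. */
static int is_tracked(int id)
{
	int off = id_offset(id);

	if (off < 0)
		return 0;
	return (id_bits[(size_t)off / BITS_PER_WORD] &
		(1UL << ((size_t)off % BITS_PER_WORD))) != 0;
}

/* Call destroy() (if non-NULL) for each recorded id, highest first,
 * clear its bit, and return how many ids were handled. */
static int destroy_tracked(void (*destroy)(int id))
{
	int off, count = 0;

	for (off = MAX_TRACKED_IDS - 1; off >= 0; off--) {
		size_t word = (size_t)off / BITS_PER_WORD;
		unsigned long mask = 1UL << ((size_t)off % BITS_PER_WORD);

		if (!(id_bits[word] & mask))
			continue;
		assert(first_id >= 0);	/* a set bit implies a recorded base */
		if (destroy)
			destroy(first_id + off);
		id_bits[word] &= ~mask;
		count++;
	}
	return count;
}
```

Storing offsets from the first id keeps the bitmap small, but it relies on ids being handed out in sequence, which is the guarantee stated in the comment in the diff.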

Sri Jayaramappa | 23 Apr 20:21 2015

[PATCH V2] Test compaction of mlocked memory

Commit 5bbe3547aa3b ("mm: allow compaction of unevictable pages")
introduced a sysctl that allows userspace to enable scanning of locked
pages for compaction.  This patch introduces a new test which fragments
main memory and attempts to allocate a number of huge pages to exercise
this compaction logic.

Tested on machines with up to 32 GB RAM. With the patch a much larger
number of huge pages can be allocated than on the kernel without the
patch.

Example output:
On a machine with 16 GB RAM:
sudo make run_tests vm
...
-----------------------
running compaction_test
-----------------------
No of huge pages allocated = 3834
[PASS]
...

Signed-off-by: Sri Jayaramappa <sjayaram@...>
Cc: linux-kernel@...
Cc: linux-api@...
Cc: Andrew Morton <akpm@...>
Cc: Eric B Munson <emunson@...>
---
Changes in V2:
BINARIES in the Makefile is now one value per line.

(Continue reading)

Sri Jayaramappa | 22 Apr 23:01 2015

[PATCH] Test compaction of mlocked memory

Commit 5bbe3547aa3b ("mm: allow compaction of unevictable pages")
introduced a sysctl that allows userspace to enable scanning of locked
pages for compaction.  This patch introduces a new test which fragments
main memory and attempts to allocate a number of huge pages to exercise
this compaction logic.

Tested on machines with up to 32 GB RAM. With the patch a much larger
number of huge pages can be allocated than on the kernel without the patch.

Signed-off-by: Sri Jayaramappa <sjayaram@...>
Cc: linux-kernel@...
Cc: linux-api@...
Cc: Andrew Morton <akpm@...>
Cc: Eric B Munson <emunson@...>
---
 tools/testing/selftests/vm/Makefile          |    2 +-
 tools/testing/selftests/vm/compaction_test.c |  219 ++++++++++++++++++++++++++
 tools/testing/selftests/vm/run_vmtests       |   12 ++
 3 files changed, 232 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/vm/compaction_test.c

diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index a5ce953..e528836 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -2,7 +2,7 @@

 CFLAGS = -Wall
 BINARIES = hugepage-mmap hugepage-shm map_hugetlb thuge-gen hugetlbfstest
-BINARIES += transhuge-stress
(Continue reading)

Li Xi | 22 Apr 20:56 2015

[v14 0/4] ext4: add project quota support

The following patches propose an implementation of project quota
support for ext4. A project is an aggregate of unrelated inodes
that may be scattered across different directories. Inodes that belong
to the same project share the same identifier, the 'project ID',
just as every inode has its user/group identification. The
following patches add project quota as a supplement to the existing
user/group quota types.

The semantics of ext4 project quota are consistent with XFS. Each
directory can have the EXT4_INODE_PROJINHERIT flag set. When the
EXT4_INODE_PROJINHERIT flag of a parent directory is not set, a
newly created inode under that directory will have the default project
ID (i.e. 0), and its EXT4_INODE_PROJINHERIT flag is not set either.
When this flag is set on a directory, the following rules apply:

1) The newly created inode under that directory will inherit both
the EXT4_INODE_PROJINHERIT flag and the project ID from its parent
directory.

2) Hard-linking an inode with a different project ID into that directory
will fail with errno EXDEV.

3) Renaming an inode with a different project ID into that directory
will fail with errno EXDEV. However, the 'mv' command will detect this
failure and copy the renamed inode to a new inode in the directory.
Thus, this new inode will inherit both the project ID and the
EXT4_INODE_PROJINHERIT flag.

4) If the project quota of that ID is being enforced, statfs() on
that directory will take the quotas as another upper limits along
(Continue reading)
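
The inheritance rules listed above can be summarised as a toy model. This is only a sketch of the semantics as described in the cover letter, not the ext4 implementation. The struct and function names are hypothetical, and rule 4 (statfs() limits) is left out because its description is truncated above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for the two per-inode properties the rules talk about. */
struct toy_inode {
	uint32_t projid;	/* project ID; 0 is the default */
	int projinherit;	/* models the EXT4_INODE_PROJINHERIT flag */
};

/* Rule 1 and the default case: set up an inode newly created under dir. */
static void toy_init_child(const struct toy_inode *dir, struct toy_inode *child)
{
	assert(dir != NULL && child != NULL);
	if (dir->projinherit) {
		/* Inherit both the flag and the project ID from the parent. */
		child->projid = dir->projid;
		child->projinherit = 1;
	} else {
		/* Parent flag not set: default project ID, flag not set. */
		child->projid = 0;
		child->projinherit = 0;
	}
}

/* Rules 2 and 3: may inode be hard-linked or renamed into dir? */
static int toy_may_link(const struct toy_inode *dir, const struct toy_inode *inode)
{
	assert(dir != NULL && inode != NULL);
	if (dir->projinherit && inode->projid != dir->projid)
		return -EXDEV;
	return 0;
}
```

In this model the copy fallback performed by 'mv' corresponds to calling toy_init_child() for a fresh inode in the target directory, which is why the copy ends up with the directory's project ID and flag.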

Josh Triplett | 21 Apr 19:46 2015

[PATCH 0/2] clone: Support passing tls argument via C rather than pt_regs magic

clone has some of the quirkiest syscall handling in the kernel, with a pile of
special cases, historical curiosities, and architecture-specific calling
conventions.  In particular, clone with CLONE_SETTLS accepts a parameter "tls"
that the C entry point completely ignores and some assembly entry points
overwrite; instead, the low-level arch-specific code pulls the tls parameter
out of the arch-specific register captured as part of pt_regs on entry to the
kernel.  That's a massive hack, and it makes the arch-specific code only work
when called via the specific existing syscall entry points; because of this
hack, any new clone-like system call would have to accept an identical tls
argument in exactly the same arch-specific position, rather than providing a
unified system call entry point across architectures.

The first patch allows architectures to handle the tls argument via normal C
parameter passing, if they opt in by selecting HAVE_COPY_THREAD_TLS.  The
second patch makes 32-bit and 64-bit x86 opt into this.

These two patches came out of the clone4 series, which isn't ready for this
merge window, but these first two cleanup patches were entirely uncontroversial
and have acks.  I'd like to go ahead and submit these two so that other
architectures can begin building on top of this and opting into
HAVE_COPY_THREAD_TLS.  However, I'm also happy to wait and send these through
the next merge window (along with v3 of clone4) if anyone would prefer that.

Josh Triplett (2):
  clone: Support passing tls argument via C rather than pt_regs magic
  x86: Opt into HAVE_COPY_THREAD_TLS, for both 32-bit and 64-bit

 arch/Kconfig                 |  7 ++++++
 arch/x86/Kconfig             |  1 +
 arch/x86/ia32/ia32entry.S    |  2 +-
(Continue reading)

kwan.huen | 21 Apr 03:47 2015

Write with Stream ID Support


The attached patch set enables basic write with stream ID support.
The first patch reads the stream ID embedded in the bio and passes it to
the device along with the write command.
The second patch adds two new nvme commands for use with ioctl, so
that an application can open/close streams and trigger host-initiated
garbage collection.

Li Xi | 20 Apr 03:39 2015

[v13 0/5] ext4: add project quota support

The following patches propose an implementation of project quota
support for ext4. A project is an aggregate of unrelated inodes
that may be scattered across different directories. Inodes that belong
to the same project share the same identifier, the 'project ID',
just as every inode has its user/group identification. The
following patches add project quota as a supplement to the existing
user/group quota types.

The semantics of ext4 project quota are consistent with XFS. Each
directory can have the EXT4_INODE_PROJINHERIT flag set. When the
EXT4_INODE_PROJINHERIT flag of a parent directory is not set, a
newly created inode under that directory will have the default project
ID (i.e. 0), and its EXT4_INODE_PROJINHERIT flag is not set either.
When this flag is set on a directory, the following rules apply:

1) The newly created inode under that directory will inherit both
the EXT4_INODE_PROJINHERIT flag and the project ID from its parent
directory.

2) Hard-linking an inode with a different project ID into that directory
will fail with errno EXDEV.

3) Renaming an inode with a different project ID into that directory
will fail with errno EXDEV. However, the 'mv' command will detect this
failure and copy the renamed inode to a new inode in the directory.
Thus, this new inode will inherit both the project ID and the
EXT4_INODE_PROJINHERIT flag.

4) If the project quota of that ID is being enforced, statfs() on
that directory will take the quotas as another upper limits along
(Continue reading)

Kirill A. Shutemov | 17 Apr 14:20 2015

[PATCH] mm: fix mprotect() behaviour on VM_LOCKED VMAs

On mlock(2) we trigger COW on a private writable VMA to avoid faults in
the future.

mm/gup.c:
long populate_vma_page_range(struct vm_area_struct *vma,
                unsigned long start, unsigned long end, int *nonblocking)
{
 ...
         * We want to touch writable mappings with a write fault in order
         * to break COW, except for shared mappings because these don't COW
         * and we would not want to dirty them for nothing.
         */
        if ((vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE)
                gup_flags |= FOLL_WRITE;

But we miss this case when we make a VM_LOCKED VMA writable via
mprotect(2). The test case:

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>
	#include <sys/mman.h>
	#include <sys/resource.h>
	#include <sys/stat.h>
	#include <sys/time.h>
	#include <sys/types.h>

	#define PAGE_SIZE 4096
(Continue reading)

