Paul Menage | 1 Aug 2011 21:31
Picon
Favicon

Re: [PATCH][BUGFIX] cgroups: more safe tasklist locking in cgroup_attach_proc

On Fri, Jul 29, 2011 at 7:28 AM, Ben Blum <bblum@...> wrote:
> Fix unstable tasklist locking in cgroup_attach_proc.
>
> From: Ben Blum <bblum@...>
>
> According to this thread - https://lkml.org/lkml/2011/7/27/243 - RCU is
> not sufficient to guarantee the tasklist is stable w.r.t. de_thread and
> exit. Taking tasklist_lock for reading, instead of rcu_read_lock,
> ensures proper exclusion.
>
> Signed-off-by: Ben Blum <bblum@...>

Acked-by: Paul Menage <menage@...>

Thanks,
Paul

> ---
>  kernel/cgroup.c |    6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff a/kernel/cgroup.c b/kernel/cgroup.c
> --- a/kernel/cgroup.c   2011-07-21 19:17:23.000000000 -0700
> +++ b/kernel/cgroup.c   2011-07-29 06:17:47.000000000 -0700
>  <at>  <at>  -2024,7 +2024,7  <at>  <at> 
>                goto out_free_group_list;
>
>        /* prevent changes to the threadgroup list while we take a snapshot. */
> -       rcu_read_lock();
> +       read_lock(&tasklist_lock);
(Continue reading)

Eric W. Biederman | 2 Aug 2011 00:25
Favicon

Re: [PATCH 02/14] allow root in container to copy namespaces (v3)

"Serge E. Hallyn" <serge@...> writes:

> Quoting Eric W. Biederman (ebiederm@...):
>> Serge Hallyn <serge@...> writes:
>> 
>> > From: Serge E. Hallyn <serge.hallyn@...>
>> >
>> > Othewise nested containers with user namespaces won't be possible.
>> >
>> > It's true that user namespaces are not yet fully isolated, but for
>> > that same reason there are far worse things that root in a child
>> > user ns can do.  Spawning a child user ns is not in itself bad.
>> >
>> > This patch also allows setns for root in a container:
>> >  <at> Eric Biederman: are there gotchas in allowing setns from child
>> > userns?
>> 
>> Yes.  We need to ensure that the target namespaces are namespaces
>> that have been created in from user_namespace or from a child of this
>> user_namespace.
>> 
>> Aka we need to ensure that we have CAP_SYS_ADMIN for the new namespace.
>
> [New patch below]
>
> Othewise nested containers with user namespaces won't be possible.
>
> It's true that user namespaces are not yet fully isolated, but for
> that same reason there are far worse things that root in a child
> user ns can do.  Spawning a child user ns is not in itself bad.
(Continue reading)

Glauber Costa | 2 Aug 2011 14:46
Favicon

Re: [PATCH 2/4] limit nr_dentries per superblock

Hi David,

Thank you for your comments. I actually agree with most of the things 
you pointed out, so I'll focus my reply here only.

> Also, this doesn't address the problem I originally commented on -
> it only shrinks the dcache, not the inode or filesystem caches that
> hold all the memory.

I understand that the icache is a greater memory user than the dcache.
I also understand that the icache can be dirty due to pending IO.
I am also okay to shrink some of the icache every time we shrink the 
dcache - and see below, this might happen if I change they way I am 
calling the shrinker.

However, since the dcache precedes the icache in the cache hierarchy,
and is the actual entity pinning the icache into memory, I don't see a 
reason to *limit* the icache size, since once the dcache is limited,
anything holding the icache into memory will go away eventually, and the 
shrinkers that run during memory pressure will be able to get rid of it.

> I mentioned the new per-sb shrinkers for a
> reason - you should be calling them, not __shrink_dcache_sb(). i.e.
> building a scan_control structure and calling
> sb->shrinker.shrink(&sb->shrinker,&sc);
>

Would you be okay with having prune_super() being called whenever the
nr_dentries_max limit is reached? It will effectively shrink both
the icache and the dcache, using the dentry limit as a trigger only.
(Continue reading)

Glauber Costa | 2 Aug 2011 15:04
Favicon

Re: [PATCH 4/4] parse options in the vfs level

On 07/30/2011 10:34 PM, Dave Chinner wrote:
> On Fri, Jul 29, 2011 at 05:44:19PM +0400, Glauber Costa wrote:
>> This patch introduces a simple generic vfs option parser.
>> Right now, the only option we have is to limit the size of the dcache.
>>
>> So any user that wants to have a dcache entries limit, can specify:
>>
>>    mount -o whatever_options,vfs_dcache_size=XXX<dev>  <mntpoint>
>>
>> It is supposed to work well with remounts, allowing it to change
>> multiple over the course of the filesystem's lifecycle.
>>
>> I find mount a natural interface for handling filesystem options,
>> so that's what I've choosen. Feel free to yell at it at will if
>> you disagree.
>
> IMO, the whole point of having a configurable cache size maximum is
> that is can be changed at runtime. Tying it to mount options is a
> painful way to acheive that because the only way to change it would
> be via a remount command.
And what's wrong with a remount command? It's a quite natural operation.
Furthermore, changing it at runtime is important - and as you noted, 
quite doable, but it is not "the whole point of it".
The whole point of it is to allow a piece of the fs hierarchy to have
a limit on the cache sizes. So I expect the most common usage to be
at mount itself. Specifically for the use case I have in mind, when
a new container is created.

> I'm not sure what the best API is, but I'd prefer something that is
> specific to a superblock, not a vfs mount. Perhaps something in
(Continue reading)

Serge E. Hallyn | 2 Aug 2011 16:08
Favicon

Re: [PATCH 02/14] allow root in container to copy namespaces (v3)

Quoting Eric W. Biederman (ebiederm@...):
> The dangers of changing the namespace of a process remain the same,
> confused suid programs.  I don't believe there are any unique new
> dangers. 
> 
> Not allowing joining namespaces you already have a copy of is just
> a matter of making it hard to get things wrong.
> 
> I would feel more a bit more comfortable if the way we did this was
> to move all of the capable calls into the per namespace methods
> and then changed them one namespace at a time.  I don't think

The patch belows moves them into the per namespace methods, for
what it's worth.  If you like I can change them, for now, to
'capable(CAP_SYS_ADMIN)' targeted at init_user_ns, but if we're
targetting at the userns owning the destination namespace, it
seems this must be sufficient...

> there are any fundmanetal dangers of allowing unshare without
> the global CAP_SYS_ADMIN, but it would be good to be able to audit

If you have suspicions that there may in fact be dangers, then
perhaps this whole patch should be delayed, and copy_namespaces()
and unshare_nsproxy_namespaces() should continue to check global
CAP_SYS_ADMIN?  The only part which would remain would be the
moving of the setns capable check into the per-ns ->install
method, but it would check the global CAP_SYS_ADMIN?

> and make or revoke the decision one namespace at a time.
> 
(Continue reading)

Al Viro | 2 Aug 2011 16:18
Picon

Re: [PATCH 4/4] parse options in the vfs level

On Tue, Aug 02, 2011 at 10:04:06AM -0300, Glauber Costa wrote:

> I am not sure either, but I still believe my proposal is superior to
> write-to-a-file specifically. Writing to a file, be it in proc, sys,
> or wherever, leaves a window of opportunity open between mounting a
> filesystem and limiting its caches. Doing it on mount is atomic.
> 
> Effectively, I see this limit as a property of a particular instance
> of a mounted filesystem. Since all properties of a filesystem are
> specified during mount, this becomes a natural extension.

The trouble is, dentry tree is fundamentally a property of superblock.
It's shared between *all* instances of that fs in all mount trees...
Glauber Costa | 2 Aug 2011 16:43
Favicon

Re: [PATCH 4/4] parse options in the vfs level

On 08/02/2011 11:18 AM, Al Viro wrote:
> On Tue, Aug 02, 2011 at 10:04:06AM -0300, Glauber Costa wrote:
>
>> I am not sure either, but I still believe my proposal is superior to
>> write-to-a-file specifically. Writing to a file, be it in proc, sys,
>> or wherever, leaves a window of opportunity open between mounting a
>> filesystem and limiting its caches. Doing it on mount is atomic.
>>
>> Effectively, I see this limit as a property of a particular instance
>> of a mounted filesystem. Since all properties of a filesystem are
>> specified during mount, this becomes a natural extension.
>
> The trouble is, dentry tree is fundamentally a property of superblock.
> It's shared between *all* instances of that fs in all mount trees...

And how is it different from any fs-specific options, like the ones extX 
have, for instance ? Many of them seem to operate on a superblock.

If you mount a superblock somewhere, you can tweak specifics about its 
operation. If you mount it somewhere else, the assumption is you know 
what you're doing.
Eric W. Biederman | 3 Aug 2011 00:03
Favicon

Re: [PATCH 02/14] allow root in container to copy namespaces (v3)

"Serge E. Hallyn" <serge.hallyn@...> writes:

> Quoting Eric W. Biederman (ebiederm@...):
>> The dangers of changing the namespace of a process remain the same,
>> confused suid programs.  I don't believe there are any unique new
>> dangers. 
>> 
>> Not allowing joining namespaces you already have a copy of is just
>> a matter of making it hard to get things wrong.
>> 
>> I would feel more a bit more comfortable if the way we did this was
>> to move all of the capable calls into the per namespace methods
>> and then changed them one namespace at a time.  I don't think
>
> The patch belows moves them into the per namespace methods, for
> what it's worth.  If you like I can change them, for now, to
> 'capable(CAP_SYS_ADMIN)' targeted at init_user_ns, but if we're
> targetting at the userns owning the destination namespace, it
> seems this must be sufficient...

I like the was this was done.  I was mostly thinking of the non
setns case when I was talking about moving the calls.

>> there are any fundmanetal dangers of allowing unshare without
>> the global CAP_SYS_ADMIN, but it would be good to be able to audit
>
> If you have suspicions that there may in fact be dangers, then
> perhaps this whole patch should be delayed, and copy_namespaces()
> and unshare_nsproxy_namespaces() should continue to check global
> CAP_SYS_ADMIN?  The only part which would remain would be the
(Continue reading)

Prashanth Mohan | 3 Aug 2011 21:25
Picon
Gravatar

Escaping Containers

Hello,

I recently came across this blog posting
(http://blog.bofh.it/debian/id_413) which details a mechanism to
escape from Linux Containers. The posting seems to indicate that this
is because sysfs does not support namespaces. Is this a problem that
is being actively looked into? Is there a workaround for this and/or a
permanent solution in the works?

Thank you,
--

-- 
Prashanth Mohan
http://www.cs.berkeley.edu/~prmohan
Serge E. Hallyn | 3 Aug 2011 23:49
Favicon

Re: Escaping Containers

Quoting Prashanth Mohan (prashmohan@...):
> Hello,
> 
> I recently came across this blog posting
> (http://blog.bofh.it/debian/id_413) which details a mechanism to
> escape from Linux Containers. The posting seems to indicate that this
> is because sysfs does not support namespaces. Is this a problem that
> is being actively looked into? Is there a workaround for this and/or a
> permanent solution in the works?

For the latter, please read the whole blog post you cited.

For a workaround using Smack, see
   http://sourceforge.net/mailarchive/message.php?msg_id=27895058
from the lxc-users mailing list.

-serge

Gmane