Andrew Morton | 1 Mar 01:28 2011

Re: [PATCH 01/10] Add a user_namespace as creator/owner of uts_namespace

On Thu, 24 Feb 2011 15:01:51 +0000
"Serge E. Hallyn" <serge@...> wrote:

> Cc: oleg@..., dlezcano@...

I don't think those addresses do what you think they do.

> copy_process() handles CLONE_NEWUSER before the rest of the
> namespaces.  So in the case of clone(CLONE_NEWUSER|CLONE_NEWUTS)
> the new uts namespace will have the new user namespace as its
> owner.  That is what we want, since we want root in that new
> userns to be able to have privilege over it.

Well this sucks.  Anyone who is reading this patch series really won't
have a clue what any of it is for.  There's no context provided.

A useful way of thinking about this is to ask yourself "what will Linus
think when this stuff hits his inbox".  If the answer is "he'll say
wtf" then we're doing it wrong.


I shall (again) paste in the below text, which I snarfed from the wiki.
Please check that it is complete, accurate and adequate.  If not,
please send along replacement text.

: The expected course of development for user namespaces targeted
: capabilities is laid out at

Nathan Lynch | 1 Mar 02:08 2011

Re: [RFC 00/10] container-based checkpoint/restart prototype

On Mon, 2011-02-28 at 17:40 -0600, ntl@... wrote:
> This is the tradeoff we ask users
> to make - the ability to C/R and migrate is provided in exchange for
> accepting some isolation and slightly reduced ease of use.  A tool
> such as lxc ( can be used to isolate jobs.
> A patch against lxc is available which adds C/R capability.

Below is that patch (against the lxc-0.7.3 tag) and a usage example.

# lxc-execute -n foo -- /bin/cat </dev/zero &>/dev/null &
# ps
  PID TTY          TIME CMD
 8736 pts/1    00:00:00 bash
 8842 pts/1    00:00:00 lxc-execute
 8843 pts/1    00:00:00 lxc-init
 8844 pts/1    00:00:01 cat
 8845 pts/1    00:00:00 ps
# lxc-checkpoint -S /tmp/ckpt.img -n foo -k
[1]+  Exit 137                lxc-execute -n foo -- /bin/cat < /dev/zero &>/dev/null
# ps
  PID TTY          TIME CMD
 8736 pts/1    00:00:00 bash
 8849 pts/1    00:00:00 ps
# lxc-restart -n foo -S /tmp/ckpt.img

[whee, watch resurrected /bin/cat eat cpu]

 doc/rootfs/ |    2 +-            |    2 +-

Matt Helsley | 1 Mar 05:05 2011

[PATCH 00/10] Checkpoint/restart of open, unlinked files

This patch set implements the relink file operation and uses it to support
checkpoint and restart of open, unlinked files. During checkpoint,
sys_checkpoint relinks the files and returns. Userspace then checkpoints the
filesystem contents using any backup-like method prior to thawing. That
backup is then made available for use during an optional migration followed
by restore and sys_restart. In the case of network and cluster/distributed
filesystems, copying the filesystem contents explicitly for migration may not
be necessary at all -- it would be part of normal file writes. For
non-migration uses of checkpoint/restart on filesystems like btrfs, a snapshot
could simply be taken during checkpoint and mounted during restart -- again
without requiring IO proportional to the aggregate size of filesystem
contents being checkpointed.

These IO savings are critical to the use of checkpoint/restart as a
fault mitigation solution in HPC environments where the probability of
component failure is very high simply due to the number of system
components. Incurring substantial IO for checkpoint/restart interferes
with the IO requirements of HPC jobs and thus reduces the frequency of
checkpoint/restart. That in turn means more processing time is lost
as a consequence of a fault -- the longer period between checkpoints
plus the IO required to re-establish hardlinks are simply not acceptable
for these environments.

Without relinking we would need to walk the entire filesystem to find out
that "b" is a path to the same inode (another variation on this case: "b"
would also have been unlinked). We'd need to do this for every
unlinked file that remains open in every task to checkpoint. Even then
there is no guarantee such a "b" exists for every unlinked file -- the
inodes could be "orphans" -- and we'd need to preserve their contents
some other way.

Matt Helsley | 1 Mar 05:05 2011

[PATCH 01/10] Create the .relink file_operation

Not all filesystems will necessarily be able to support relinking an
orphan inode back into the filesystem. For this reason, some off-list
feedback suggested that relinking should be a separate file operation
instead of overloading .link.

Since .relink is a superset of .link, make the VFS call .relink where
possible and .link otherwise.
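
The dispatch rule described above can be sketched as a small userspace
model. This is not the kernel code from the patch: the struct and function
names (model_inode, model_ops, model_vfs_link) are hypothetical stand-ins,
and the -EPERM fallback for a filesystem with neither operation is an
assumption.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; these are not the kernel's structures. */
struct model_inode {
	unsigned int nlink;	/* 0 means the inode is currently an orphan */
};

struct model_ops {
	int (*link)(struct model_inode *inode);		/* may be NULL */
	int (*relink)(struct model_inode *inode);	/* may be NULL */
};

/* A plain link that refuses orphan inodes. */
static int model_link(struct model_inode *inode)
{
	if (inode->nlink == 0)
		return -ENOENT;
	inode->nlink++;
	return 0;
}

/* Relink: does what link does, and also accepts an orphan inode. */
static int model_relink(struct model_inode *inode)
{
	inode->nlink++;
	return 0;
}

/* Prefer relink where the filesystem provides it, fall back to link. */
static int model_vfs_link(const struct model_ops *ops,
			  struct model_inode *inode)
{
	if (ops->relink)
		return ops->relink(inode);
	if (ops->link)
		return ops->link(inode);
	return -EPERM;	/* assumption: neither operation is supported */
}
```

In this model a filesystem that only fills in link keeps rejecting orphans,
while one that also fills in relink can bring them back, without the caller
needing to know which case it is in.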

The next commit will change ext3/4 to enable this operation.

Signed-off-by: Matt Helsley <matthltc@...>
Cc: Theodore Ts'o <tytso@...>
Cc: Andreas Dilger <adilger.kernel@...>
Cc: Jan Kara <jack@...>
Cc: linux-fsdevel@...
Cc: linux-ext4@...
Cc: Al Viro <viro@...>
 fs/namei.c         |    5 ++++-
 include/linux/fs.h |    1 +
 2 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 4ff7ca5..b6b7359 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2430,7 +2430,10 @@ int vfs_link(struct dentry *old_dentry, struct inode *dir, struct dentry *new_de
 		return error;


Matt Helsley | 1 Mar 05:05 2011

[PATCH 03/10] Split do_linkat() out of sys_linkat

Separate the __user pathname handling from the bulk of the syscall.
Since we're doing this to enable relinking of unlinked files by
sys_checkpoint and not sys_linkat we're not using a sys-wrapper.

Signed-off-by: Matt Helsley <matthltc@...>
Cc: containers@...
Cc: Oren Laadan <orenl@...>
Cc: Amir Goldstein <amir73il@...>
Cc: linux-fsdevel@...
Cc: Al Viro <viro@...>
Cc: Christoph Hellwig <hch@...>
Cc: Jamie Lokier <jamie@...>
 fs/namei.c |   77 +++++++++++++++++++++++++++++++++++++++---------------------
 1 files changed, 50 insertions(+), 27 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index b6b7359..52aa274 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2440,6 +2440,51 @@ int vfs_link(struct dentry *old_dentry, struct inode *dir, struct dentry *new_de
 	return error;

+/* If the file has been unlinked then old_dentry doesn't match old_path */
+static int do_linkat(struct path *old_path, struct dentry *old_dentry,
+		     struct nameidata *nd, int flags)
+	struct dentry *new_dentry;
+	int error = -EXDEV;

Matt Helsley | 1 Mar 05:05 2011

[PATCH 02/10] ext3/4: Allow relinking to unlinked files

Orphan inodes (nlink == 0) have been forbidden from being linked back
into the ext3/4 filesystems as a means of dealing with a link/unlink
race.

This patch effectively reverts 2988a7740dc0dd9a0cb56576e8fe1d777dff0db3

	"return ENOENT from ext3_link when racing with unlink"

which was discussed in the lkml thread:

The reverted commit was selected because disallowing relinking was just a
simpler solution -- not because removing tasks from the orphan list
was deemed difficult or incorrect.

Instead this patch utilizes the original patch proposed by Eric Sandeen.
Testing this patch with the orphan-repro.tar.bz2 code linked in that
thread seems to confirm that this patch does not reintroduce the OOPs.
Nonetheless, Amir Goldstein pointed out that if ext3_add_entry() fails
we'll also be left with a corrupted orphan list. So I've moved the orphan
removal code down to the spot where a successful return has been assured.

Eric's original description (indented):

	Remove inode from the orphan list in ext3_link() if we might have
	raced with ext3_unlink(), which potentially put it on the list.
	If we're on the list with nlink > 0, we'll never get cleaned up
	properly and eventually may corrupt the list.


Matt Helsley | 1 Mar 05:05 2011

[PATCH 04/10] Checkpoint/restart unlinked files

Implement checkpoint of unlinked files by relinking them into their
filesystem at:

	<fs root>/lost+found/checkpoint/<file>

Relinking offers many advantages over other means of checkpointing unlinked
files. It offers substantial performance improvements by leveraging the
snapshotting capabilities of various Linux block devices, filesystems, or
differential copying tools like rsync.

In addition to the original path of the file we save the newly-linked
path. This newly-linked path is opened during restart instead of the
original path.

To understand why relinking is extremely useful for checkpoint/restart
consider this simple pseudocode program and a specific example checkpoint
of it:

	a_fd = open("a"); /* example: size of the file at "a" is 1GB */
	link("a", "b");
	unlink("a");
	creat("a");
	             <---- example: checkpoint happens here
	write(a_fd, "bar");

The file "a" is unlinked and a different file has been placed at that
path. a_fd still refers to the inode shared with "b". When we restart
we must re-open the files such that writes to files opened via different
paths are visible. Using links makes this easy.
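
The scenario above can be reproduced from userspace with plain POSIX calls.
The sketch below is illustrative only and is not part of the patch set. It
assumes the program unlinks "a" and creates a new file at that path before
the checkpoint point, as the paragraph above describes. The function name
unlinked_write_demo, the scratch directory under /tmp, and the use of an
empty file instead of a 1GB one are assumptions made for the example.

```c
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Returns 0 if a write through a descriptor opened at "a" shows up at "b"
 * (and not at the new "a") after "a" was unlinked and recreated.
 * Returns -1 on any failure.
 */
static int unlinked_write_demo(void)
{
	char dir[] = "/tmp/relink-demo-XXXXXX";
	char a[64], b[64], buf[8] = "";
	struct stat st;
	int a_fd = -1, new_fd = -1, b_fd = -1, ret = -1;

	if (!mkdtemp(dir))
		return -1;
	snprintf(a, sizeof(a), "%s/a", dir);
	snprintf(b, sizeof(b), "%s/b", dir);

	a_fd = open(a, O_CREAT | O_EXCL | O_WRONLY, 0600);
	if (a_fd < 0)
		goto out;
	if (link(a, b) < 0)		/* "b" now names the same inode */
		goto out;
	if (unlink(a) < 0)		/* a_fd no longer matches path "a" */
		goto out;
	new_fd = open(a, O_CREAT | O_EXCL | O_WRONLY, 0600);
	if (new_fd < 0)			/* a different file now lives at "a" */
		goto out;

	/* A checkpoint would be taken here. */
	if (write(a_fd, "bar", 3) != 3)
		goto out;

	b_fd = open(b, O_RDONLY);
	if (b_fd < 0 || read(b_fd, buf, sizeof(buf) - 1) != 3)
		goto out;
	if (stat(a, &st) < 0)
		goto out;

	/* The write must be visible via "b"; the new "a" must stay empty. */
	if (strcmp(buf, "bar") == 0 && st.st_size == 0)
		ret = 0;
out:
	if (b_fd >= 0)
		close(b_fd);
	if (new_fd >= 0)
		close(new_fd);
	if (a_fd >= 0)
		close(a_fd);
	unlink(a);
	unlink(b);
	rmdir(dir);
	return ret;
}
```

A restart that simply reopened "a" by its original path would attach a_fd
to the wrong inode, which is why the checkpoint needs to record a path that
still leads to the inode shared with "b".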


Matt Helsley | 1 Mar 05:05 2011

[PATCH 07/10] Add relink_dir superblock field

This patch adds the pointer for a relink mount(8) option which specifies a
path to a directory within the given filesystem where checkpoint/restart
may relink files. If the option is unset then checkpoint/restart tries
to use "lost+found".

A subsequent patch will enable userspace to set a relink=/foo/bar
option when mounting a filesystem.
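
As a rough userspace illustration of how the relink directory feeds into
the relinked file name, the sketch below reuses the "%s/checkpoint-%d/relinked-%u"
format string from the diff in this patch and falls back to "lost+found"
when no directory is set. The helper name fill_relink_name, its parameters,
and the meaning given to the %d and %u arguments are assumptions for the
example and are not taken from the patch.

```c
#include <stdio.h>
#include <string.h>

#define CKPT_RELINKAT_FMT "%s/checkpoint-%d/relinked-%u"
#define RELINK_DIR_DEFAULT "lost+found"	/* used when relink= is unset */

/*
 * Build a relink path relative to the filesystem root.
 * ckpt_id and unique are hypothetical stand-ins for whatever the kernel
 * passes for %d and %u. Returns 0 on success, -1 if buf is too small.
 */
static int fill_relink_name(char *buf, size_t len, const char *relink_dir,
			    int ckpt_id, unsigned int unique)
{
	int n;

	if (!relink_dir || strlen(relink_dir) == 0)
		relink_dir = RELINK_DIR_DEFAULT;

	n = snprintf(buf, len, CKPT_RELINKAT_FMT, relink_dir, ckpt_id, unique);
	if (n < 0 || (size_t)n >= len)
		return -1;
	return 0;
}
```

With no directory configured, an id of 7 and a counter of 3, this yields
"lost+found/checkpoint-7/relinked-3".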

Signed-off-by: Matt Helsley <matthltc@...>
 fs/namei.c         |    6 ++++--
 fs/namespace.c     |    6 ++++++
 include/linux/fs.h |    2 ++
 3 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 6dea3b1..fcf35b3 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2535,7 +2535,7 @@ SYSCALL_DEFINE2(link, const char __user *, oldname, const char __user *, newname

 /* Path relative to the mounted filesystem's root -- not a "global" root or even a namespace root. The
unique_name_count is unique for the entire checkpoint. */
-#define CKPT_RELINKAT_FMT "lost+found/checkpoint-%d/relinked-%u"
+#define CKPT_RELINKAT_FMT "%s/checkpoint-%d/relinked-%u"

 static int checkpoint_fill_relink_fname(struct ckpt_ctx *ctx,
 					struct file *for_file,
@@ -2569,7 +2569,9 @@ static int checkpoint_fill_relink_fname(struct ckpt_ctx *ctx,

Matt Helsley | 1 Mar 05:05 2011

[PATCH 05/10] Enable c/r of unlinked fifos

Unlinked fifos are special files which share some of their
checkpoint/restart code with pipes. Re-use the code for normal
unlinked files for unlinked fifos too.

Signed-off-by: Matt Helsley <matthltc@...>
 fs/pipe.c                  |    5 ++++-
 include/linux/checkpoint.h |    1 +
 2 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index e66ba97..e9f3e64 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -917,6 +917,9 @@ static int pipe_file_checkpoint(struct ckpt_ctx *ctx, struct file *file)
 		ret = checkpoint_fname(ctx, &file->f_path, &ctx->root_fs_path);
 		if (ret < 0)
 			goto out;
+		ret = checkpoint_file_links(ctx, file);
+		if (ret < 0)
+			goto out;

 	if (first)
@@ -1053,7 +1056,7 @@ struct file *fifo_file_restore(struct ckpt_ctx *ctx, struct ckpt_hdr_file *ptr)
 	 * To avoid blocking, always open the fifo with O_RDWR;
 	 * then fix flags below.
-	file = restore_open_fname(ctx, 0, (ptr->f_flags & ~O_ACCMODE) | O_RDWR);
+	file = restore_open_fname(ctx, !!(ptr->f_restart_flags & CKPT_RESTART_FILE_F_UNLINK),

Matt Helsley | 1 Mar 05:05 2011

[PATCH 08/10] Parse the relink=%s mount option

Add a generic string mount option for relinking during checkpoint/restart.
It can be passed via mount commands. The specified path is relative to the
filesystem root and must remain within the filesystem being mounted.

Use of this mount option looks like (... for the uninteresting bits):

	mount ... -o ...,relink="qux/quux/" ...

Signed-off-by: Matt Helsley <matthltc@...>
 fs/super.c         |   77 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/fs.h |    2 +
 2 files changed, 77 insertions(+), 2 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index ca69615..2bda808 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -127,6 +127,9 @@ static inline void destroy_super(struct super_block *s)
+	kfree(s->s_relink_dir);
@@ -952,12 +955,58 @@ int get_sb_single(struct file_system_type *fs_type,
