Anton Altaparmakov | 8 Feb 01:07

Direct i/o changes break all non-GPL file systems

Hi Linus, Andrew, Christoph,

With kernel 3.1, Christoph removed i_alloc_sem and replaced it with calls (namely inode_dio_wait() and
inode_dio_done()) which are EXPORT_SYMBOL_GPL(), so they cannot be used by non-GPL file systems.
Further, inode_dio_wait() was pushed from notify_change() into the file system ->setattr() method, but
no non-GPL file system can make this call.

As far as I can tell, that means non-GPL file systems cannot exist any more unless they avoid all VFS
functionality related to reading/writing, or at least not as long as they want to implement direct i/o.

What are commercial file systems meant to do now?

For example, Tuxera exFAT uses the generic write code, which means that read/write use the generic direct_IO
functions. However, Tuxera exFAT's setattr() method cannot call inode_dio_wait(). There are also places
where exfat_truncate() is called directly with i_alloc_sem held; these now need to be replaced with
calls to inode_dio_wait(), but we cannot do that as the function is GPL only.

Previously, when APIs that were accessible to non-GPL modules were changed, care was taken that those APIs
remained accessible, but this does not appear to be the case here.

Do all non-GPL file systems now really have to re-implement the entire read-write code paths for
themselves for the sake of two GPL only exports in the direct i/o code?  That seems a bit harsh!

Have I missed something?  If not, would you accept a patch to change the two needed GPL-only symbols into
EXPORT_SYMBOL() ones?

Thanks a lot in advance!

Best regards,


Artem Bityutskiy | 7 Feb 15:44

[PATCH] UBIFS: do not use inc_link when i_nlink is zero

From: Artem Bityutskiy <artem.bityutskiy <at> linux.intel.com>

This patch changes the 'i_nlink' counter handling in 'ubifs_unlink()',
'ubifs_rmdir()' and 'ubifs_rename()'. In these functions, 'i_nlink' may become 0,
and if 'ubifs_jnl_update()' failed, we would use 'inc_nlink()' to restore
the previous 'i_nlink' value, which is incorrect from the VFS point of view and
would cause a 'WARN_ON()' (see the 'inc_nlink()' implementation).

This patch saves the previous 'i_nlink' value in a local variable and uses it
on the error path instead of calling 'inc_nlink()'. We do this only for the
inodes where 'i_nlink' may potentially become zero.
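
To illustrate the save-and-restore idea described above, here is a minimal
userspace sketch. It is not the UBIFS code: 'struct model_inode', the
'model_*' helpers and the 'model_warnings' counter are hypothetical stand-ins
for the kernel's inode, 'drop_nlink()', 'inc_nlink()' and 'WARN_ON()', and the
journal failure is simulated by an error argument.

```c
#include <assert.h>

/* Hypothetical userspace stand-in for the kernel's struct inode. */
struct model_inode {
	unsigned int i_nlink;
};

/* Counts how often the modelled WARN_ON() would have fired. */
int model_warnings;

/* Modelled link-count decrement. */
void model_drop_nlink(struct model_inode *inode)
{
	assert(inode->i_nlink > 0);
	inode->i_nlink--;
}

/* Modelled link-count increment: warns when the count is zero. */
void model_inc_nlink(struct model_inode *inode)
{
	if (inode->i_nlink == 0)
		model_warnings++;
	inode->i_nlink++;
}

/* Old error path, for comparison: restores the count by incrementing. */
int model_unlink_old(struct model_inode *inode, int journal_err)
{
	model_drop_nlink(inode);
	if (journal_err) {
		model_inc_nlink(inode);	/* warns if the count reached zero */
		return journal_err;
	}
	return 0;
}

/*
 * Sketch of the patched flow: remember the old count, drop the link,
 * and on a simulated journal failure put the saved value back directly.
 */
int model_unlink(struct model_inode *inode, int journal_err)
{
	unsigned int saved_nlink = inode->i_nlink;

	model_drop_nlink(inode);
	if (journal_err) {
		inode->i_nlink = saved_nlink;	/* plain restore, no warning */
		return journal_err;
	}
	return 0;
}
```

In this model, both variants leave 'i_nlink' at its old value after a failure,
but only the old variant trips the modelled warning when the count had
dropped to zero.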

This change has been requested by Al Viro <viro <at> ZenIV.linux.org.uk>.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy <at> linux.intel.com>
---

Al, I assume this goes in via the UBIFS tree.

 fs/ubifs/dir.c |   18 +++++++++---------
 1 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/fs/ubifs/dir.c b/fs/ubifs/dir.c
index d6fe1c7..ec9f187 100644
--- a/fs/ubifs/dir.c
+++ b/fs/ubifs/dir.c
@@ -566,6 +566,7 @@ static int ubifs_unlink(struct inode *dir, struct dentry *dentry)
 	int sz_change = CALC_DENT_SIZE(dentry->d_name.len);
 	int err, budgeted = 1;
 	struct ubifs_budget_req req = { .mod_dent = 1, .dirtied_ino = 2 };

linux-3.2.2: BUG dentry: Poison overwritten after resume-from-disk

Hi,

My PC was suspended-to-disk over the weekend and crashed
about one minute after resume.  Luckily the BUG made
it into the log file before the box hung up hard.

I had a kvm running over suspend-resume in case it matters.

Feb  6 10:37:33 zzz kernel: [37877.172559] =============================================================================
Feb  6 10:37:33 zzz kernel: [37877.172564] BUG dentry: Poison overwritten
Feb  6 10:37:33 zzz kernel: [37877.172566] -----------------------------------------------------------------------------
Feb  6 10:37:33 zzz kernel: [37877.172567] 
Feb  6 10:37:33 zzz kernel: [37877.172569] INFO: 0xffff8801252e4a00-0xffff8801252e4a1f. First byte 0x0 instead of 0x6b
Feb  6 10:37:33 zzz kernel: [37877.172576] INFO: Allocated in __d_alloc+0x27/0x16a age=979868 cpu=1 pid=14697
Feb  6 10:37:33 zzz kernel: [37877.172580] 	__slab_alloc.isra.52.constprop.59+0x2e0/0x349
Feb  6 10:37:33 zzz kernel: [37877.172583] 	kmem_cache_alloc+0x6e/0x133
Feb  6 10:37:33 zzz kernel: [37877.172586] 	__d_alloc+0x27/0x16a
Feb  6 10:37:33 zzz kernel: [37877.172588] 	d_alloc+0x1e/0x69
Feb  6 10:37:33 zzz kernel: [37877.172591] 	d_alloc_and_lookup+0x2c/0x6c
Feb  6 10:37:33 zzz kernel: [37877.172593] 	__lookup_hash.part.36+0xa2/0xb4
Feb  6 10:37:33 zzz kernel: [37877.172596] 	lookup_hash+0x39/0x3e
Feb  6 10:37:33 zzz kernel: [37877.172598] 	do_last.isra.42+0x2a2/0x680
Feb  6 10:37:33 zzz kernel: [37877.172600] 	path_openat+0xd3/0x35d
Feb  6 10:37:33 zzz kernel: [37877.172602] 	do_filp_open+0x38/0x86
Feb  6 10:37:33 zzz kernel: [37877.172604] 	do_sys_open+0x104/0x19d
Feb  6 10:37:33 zzz kernel: [37877.172608] 	compat_sys_open+0x1a/0x1c
Feb  6 10:37:33 zzz kernel: [37877.172610] 	sysenter_dispatch+0x7/0x37
Feb  6 10:37:33 zzz kernel: [37877.172614] INFO: Freed in __d_free+0x56/0x5b age=28547 cpu=2 pid=12133
Feb  6 10:37:33 zzz kernel: [37877.172618] 	__slab_free+0x31/0x348

Anand Avati | 6 Feb 08:12

[PATCH v3] fuse: O_DIRECT support for files

Implement ->direct_IO() method in aops. The ->direct_IO() method combines
the existing fuse_direct_read/fuse_direct_write methods to implement
O_DIRECT functionality.

Reaching ->direct_IO() in the read path via generic_file_aio_read ensures
proper synchronization with page cache with its existing framework.

Reaching ->direct_IO() in the write path via fuse_file_aio_write is made
to come via generic_file_direct_write() which makes it play nice with
the page cache w.r.t other mmap pages etc.

On files marked 'direct_io' by the filesystem server, IO always follows
the fuse_direct_read/write path. There is no effect of fcntl(O_DIRECT)
and it always succeeds.

On files not marked with 'direct_io' by the filesystem server, the IO
path depends on O_DIRECT flag by the application. This can be passed
at the time of open() as well as via fcntl().
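
The two rules above can be summarised in a small decision function. This is
only a userspace sketch of the described behaviour, not code from the patch:
'struct model_file', 'enum model_io_path' and 'model_pick_io_path()' are
hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-file state relevant to choosing the I/O path. */
struct model_file {
	bool server_direct_io;	/* file marked 'direct_io' by the server */
	bool o_direct;		/* O_DIRECT set via open() or fcntl() */
};

enum model_io_path {
	MODEL_PATH_FUSE_DIRECT,		/* fuse_direct_read/write path */
	MODEL_PATH_GENERIC_DIRECT,	/* generic code reaching ->direct_IO() */
	MODEL_PATH_BUFFERED,		/* ordinary page-cache I/O */
};

/* Pick the I/O path according to the rules in the description. */
enum model_io_path model_pick_io_path(const struct model_file *file)
{
	assert(file != NULL);

	/* 'direct_io' files ignore O_DIRECT; it has no effect on them. */
	if (file->server_direct_io)
		return MODEL_PATH_FUSE_DIRECT;

	/* Otherwise the application's O_DIRECT flag decides. */
	return file->o_direct ? MODEL_PATH_GENERIC_DIRECT
			      : MODEL_PATH_BUFFERED;
}
```

Because the flag is read at the time of each I/O in this model, toggling
'o_direct' (as fcntl() would) changes the path for later requests on the
same file.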

Note that asynchronous O_DIRECT iocb jobs are always completed
synchronously (this has been the case with FUSE even before this patch).

Signed-off-by: Anand Avati <avati <at> redhat.com>
---

Tested with concurrent read and write dd runs, with oflag=direct set
on a few writes and iflag=direct set on a few reads.

 fs/fuse/dir.c  |    3 -
 fs/fuse/file.c |  124 ++++++++++++++++++++++++++++++++++++++++++++++++--------

Li Wang | 5 Feb 16:33

[PATCH v2] eCryptfs: Write optimization for non-uptodate page

ecryptfs_write_begin() grabs a page from the page cache for writing.
If the page does not contain valid data, or the data are older than
their counterpart on disk, eCryptfs reads the corresponding data
from disk into the eCryptfs page cache, decrypts them, and
then performs the write. However, if the length of the data to be
written to the current page equals the page size, the whole page
will be overwritten. In that case it does not matter what the data
were before, and it is beneficial to perform the write directly.
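
The check described above can be sketched as a small predicate. This is a
simplified userspace model, not the eCryptfs code: 'MODEL_PAGE_SIZE' and
'model_needs_read_before_write()' are hypothetical names, the page size is
assumed to be 4096 bytes, and other cases the real function may handle are
left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u	/* assumed page size for this sketch */

/*
 * Return true when the lower data must be read and decrypted before
 * the write, false when that step can be skipped.
 */
bool model_needs_read_before_write(bool page_uptodate, size_t len)
{
	assert(len <= MODEL_PAGE_SIZE);

	if (page_uptodate)
		return false;	/* page already holds valid data */
	if (len == MODEL_PAGE_SIZE)
		return false;	/* whole page is overwritten anyway */
	return true;		/* partial write into a stale page */
}
```

Only a partial write into a page without valid data needs the read and
decrypt step in this model; a full-page write skips it.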

This is especially useful when eCryptfs is used for backups: the user
copies a file out of the eCryptfs folder, modifies it, and copies the
revised file back to replace the original one.

With this optimization, according to our test on a machine with an
Intel(R) Core(TM) 2 Duo processor, the iozone 'write' operation on an
existing file, with the write size being a multiple of the page size,
enjoys a steady 3x speedup.

Signed-off-by: Li Wang <liwang <at> nudt.edu.cn>
Signed-off-by: Yunchuan Wen <wenyunchuan <at> kylinos.com.cn>
Reviewed-by: Tyler Hicks <tyhicks <at> canonical.com>

--- 

Changes from v1:

The new version adds code to handle the short copy issue which may occur

Wu Fengguang | 5 Feb 16:00

[LSF/MM][ATTEND] readahead and writeback

I would like to attend to participate in the readahead and writeback
discussions. My questions are

- readahead size and alignment, hope that we can reach some general
  agreements on the policy stuff

- async write I/O bandwidth controller, will it be a good complement
  feature to the current blk-cgroup I/O controller? Each seems to have
  its own strong areas and weak points.

- per-memcg dirty pages control, to be frank, it's non-trivial to
  implement and I'm not sure it will perform well in some cases.
  Before following that direction, I'm curious whether the much
  simpler scheme of moving dirty pages to the global LRU can magically
  satisfy the main user demands.

Thanks,
Fengguang
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo <at> vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Richard Laager | 5 Feb 06:33

Re: [RFC] The reflink(2) system call v5.

On Mon, 2009-09-14 at 22:24 -0500, Joel Becker wrote:
> Here's v5 of reflink().

So, whatever happened with reflink(2)? This [0] is the last message I
can find on the topic and it doesn't seem to have been merged.

[0] http://marc.info/?l=linux-fsdevel&m=125296717319013&w=2

-- 
Richard

Mark Moseley | 4 Feb 00:47

CIFS Unmount Issue

I've got a slew of Netapp Filers talking CIFS to some Debian Squeeze
64-bit boxes. I've noticed that in the kernel switch from 3.1 to 3.2,
the clients are no longer able to unmount a CIFS volume from an older
Filer. The Netapp versions in question are 7.2.7 and 7.0.6. I can
unmount on a 3.2.x kernel from a 7.2.7 Filer just fine. With a 7.0.6
Filer, I get the following error printed to /proc/kmsg:

<3>[  277.363460] CIFS VFS: RFC1001 size 35 smaller than SMB for mid=12
<7>[  277.363466] Bad SMB: : dump of 39 bytes of data at 0xffff880213e7e000
<7>[  277.363472]  23000000 424d53ff 00000074 00018800 . . . # � S M B
t . . . . . . .
<7>[  277.363478]  00000000 00000000 00000000 0e1000................<>
27338] 0c0000f0

but the umount call never returns, which makes reboots fun. I've
replicated this on 3.2.1 and 3.2.2. I've seen it print the same "Bad
SMB..." message as pasted above with 3.1.10 but the umount call
returns successfully. And unmounting from the 7.2.7 Filers does not
cause a "Bad SMB" message to get logged to /proc/kmsg.

The client is still responsive, and I can run whatever would be helpful
to debug this. If I'm doing the unmount on the CLI, it hangs on the
'umount' syscall. If I kill the umount command, the mount is gone. As
far as I can see, the unmount is succeeding, but for whatever reason,
the umount system call isn't ever returning. Looking at a network
dump, the last client call is for a logoff, which seems to succeed.

There are no oopses or tracebacks logged.

I can post my whole .config if it's helpful, though for brevity's sake,

Greg Thelen | 3 Feb 07:29

[LSF/MM TOPIC][ATTEND] memcg writeback

I would like to attend to participate in general memcg discussions
(slab accounting, etc.) and specifically to discuss memcg dirty
throttling/writeback.
My big questions are mostly for writeback people, about how to
integrate this work with IO-less dirty throttling and flusher threads.

For reference, here's a recent discussion on this topic:
http://marc.info/?l=linux-mm&m=132825032709644