frankcmoeller | 18 May 2013 12:50

Re: Re: Ext4: Slow performance on first write after mount

Hi Andrei,

thanks for your quick answer!
Perhaps you misunderstood me. The general write performance is quite good: we can record more than 4 HD
channels at the same time without problems. The exception is the first write after mount, and
some users also have problems once or twice during a recording.
I think the ext4 group initialization is the main problem, because it takes so long (as I wrote before:
around 1300 groups per second). Why don't you store the gathered information on disk when a umount takes place?

With fallocate, part of the group initialization is done before the first write. This helps, but it is not a
solution, because the final file size is unknown, so I cannot preallocate space for the complete file.
Once the preallocated space is consumed, the same initialization problem arises again until all
groups are initialized.
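
For illustration, preallocating ahead of the writer in fixed-size chunks, because the final file size is unknown, might look roughly like the sketch below. The chunk and margin sizes and the helper names are assumptions made for this example and do not come from the thread. Each fallocate() call may still run into the group initialization described above; the sketch only shows where that cost would land.

```c
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <linux/falloc.h>
#include <sys/types.h>

/* Hypothetical tuning values, not taken from the thread. */
#define PREALLOC_CHUNK  ((off_t)64 * 1024 * 1024) /* extend by 64 MiB at a time */
#define PREALLOC_MARGIN ((off_t)16 * 1024 * 1024) /* extend when under 16 MiB remain */

/*
 * Pure helper: given the current write offset and the end of the region
 * preallocated so far, return the end the region should have.
 * The value is unchanged while at least PREALLOC_MARGIN bytes remain.
 */
static off_t next_prealloc_end(off_t write_pos, off_t prealloc_end)
{
	off_t base;

	if (prealloc_end - write_pos >= PREALLOC_MARGIN)
		return prealloc_end;
	/* If the writer already ran past the region, start from its position. */
	base = prealloc_end > write_pos ? prealloc_end : write_pos;
	return base + PREALLOC_CHUNK;
}

/*
 * Extend the preallocation of fd when the writer gets close to its end.
 * FALLOC_FL_KEEP_SIZE reserves blocks without changing the file size.
 * Returns the (possibly unchanged) end of the preallocated region,
 * or -1 with errno set if fallocate() fails.
 */
static off_t extend_prealloc(int fd, off_t write_pos, off_t prealloc_end)
{
	off_t new_end = next_prealloc_end(write_pos, prealloc_end);

	if (new_end == prealloc_end)
		return prealloc_end;
	if (fallocate(fd, FALLOC_FL_KEEP_SIZE, prealloc_end,
		      new_end - prealloc_end) != 0)
		return -1;
	return new_end;
}
```

A recorder would call extend_prealloc() after each write and remember the returned value. Calling it from a thread other than the one draining the 4 MB buffer is one possible design, but whether that hides the delay was not tested in this thread.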

I also ran some tests with O_DIRECT (my first tests ever). Perhaps I did something wrong, but it isn't very
fast. You also have to take care of alignment, and there are several threads on the internet which explain
why you shouldn't use it (or only in very special situations, and I don't think my situation is one of
them). The ext4 group initialization also takes place when using O_DIRECT (as I said, perhaps I did
something wrong).
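
To make the alignment point concrete, here is a minimal sketch of a write helper for a file descriptor that was opened with O_DIRECT. The 4096-byte alignment is an assumption; the real requirement depends on the device and the filesystem. The helper names are made up for this example, and the code is not what was used in the tests mentioned above.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed alignment for buffer address, length and file offset. */
#define DIO_ALIGN 4096

/* Round len up to the next multiple of DIO_ALIGN. */
static size_t dio_round_up(size_t len)
{
	return (len + DIO_ALIGN - 1) / DIO_ALIGN * DIO_ALIGN;
}

/*
 * Write len bytes from data to fd at offset. fd is expected to have been
 * opened with O_DIRECT and offset must be a multiple of DIO_ALIGN.
 * The data is copied into an aligned bounce buffer and zero-padded to an
 * aligned length. Returns the pwrite() result (which counts the padded
 * length) or -1 with errno set.
 */
static ssize_t dio_write(int fd, const void *data, size_t len, off_t offset)
{
	size_t padded = dio_round_up(len);
	void *buf;
	ssize_t ret;

	if (len == 0)
		return 0;
	if (offset % DIO_ALIGN != 0) {
		errno = EINVAL;
		return -1;
	}
	if (posix_memalign(&buf, DIO_ALIGN, padded) != 0) {
		errno = ENOMEM;
		return -1;
	}
	memset(buf, 0, padded);
	memcpy(buf, data, len);
	ret = pwrite(fd, buf, padded, offset);
	free(buf);
	return ret;
}
```

The extra copy into the bounce buffer and the padding of the last block are part of the bookkeeping that makes O_DIRECT awkward for a stream whose chunk sizes are not multiples of the alignment.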

Regards,
Frank

----- Original Message -----
From:    "Sidorov, Andrei" <Andrei.Sidorov <at> arrisi.com>
To:      "frankcmoeller <at> arcor.de" <frankcmoeller <at> arcor.de>, ext4 development <linux-ext4 <at> vger.kernel.org>
Date:    17.05.2013 23:18
Subject: Re: Ext4: Slow performance on first write after mount

> Hi Frank,

frankcmoeller | 17 May 2013 18:51

Ext4: Slow performance on first write after mount

Hi,

we're using ext4 on satellite boxes (e.g. XTrend et9200) with mounted hard disks. The receiver runs Linux
with kernel version 3.8.7. Some users (like me) have problems with the first recording right after boot.
Most of them have big partitions (1-2TB) and high disk usage (over 50%). In this case the application
signals a buffer overflow (the buffer is 4 MB in size). We found out that one of the first writes after boot or
remount is very slow. I have debugged it. The test case was a umount, then a mount, and then a write of 64MB of
data to the disk.
The problem is the initialization of the buffer cache, which takes very long (ext4_mb_init_group in the
ext4_mb_good_group procedure). In my case it loads around 1300 groups per second (with the patch which
avoids loading full groups). The beginning of my disk is quite full, so it needs to read around 8200 groups
to find a "good" one. This takes over 6 seconds. Here is the output:
May 10 02:06:15 et9x00 user.info kernel: EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator start                         time: 4284161251798
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before for group loop cr: 0
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 0  time: 4284161318465
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after allocate buffer_heads;         time: 4284161355983
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after ext4_read_block_bitmap_nowait  time: 4284161440835
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after ext4_wait_block_bitmap         time: 4284167134687
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after allocate buffer_heads;         time: 4284167180243
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after ext4_read_block_bitmap_nowait  time: 4284167198909
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after
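
The figures quoted in this message can be cross-checked with a short calculation. The sketch below assumes a 4 KiB block size and no bigalloc, neither of which is stated in the thread. Under that assumption each block group covers 8 * 4096 blocks, i.e. 128 MiB, so 8200 groups correspond to roughly the first 1 TiB of the disk, and scanning them at 1300 groups per second takes about 6.3 seconds, which matches the "over 6 seconds" observed. The function names are made up for this example.

```c
#include <stdint.h>

/*
 * Number of block groups in a filesystem of fs_bytes bytes.
 * One bitmap block describes one group, so a group holds
 * 8 * block_size blocks (assumption: no bigalloc).
 */
static uint64_t ext4_group_count(uint64_t fs_bytes, uint32_t block_size)
{
	uint64_t group_bytes = (uint64_t)block_size * 8 * block_size;

	return (fs_bytes + group_bytes - 1) / group_bytes;
}

/* Time in milliseconds needed to initialize `groups` groups at a given rate. */
static uint64_t scan_time_ms(uint64_t groups, uint64_t groups_per_sec)
{
	return groups * 1000 / groups_per_sec;
}
```

With these helpers, ext4_group_count(1ULL << 40, 4096) gives 8192 and scan_time_ms(8200, 1300) gives 6307.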

Dmitry Monakhov | 16 May 2013 14:28

[PATCH 1/4] ext4: Fix fsync error handling after filesystem abort.

If the filesystem was aborted after the inode's writeback completed
but before its metadata was updated, we may return success
due to (sb->s_flags & MS_RDONLY), which is incorrect and
results in data loss.
In order to handle an fs abort correctly, we have to check the
fs state once we discover that it is in the MS_RDONLY state.

Test case: http://patchwork.ozlabs.org/patch/244297/

Signed-off-by: Dmitry Monakhov <dmonakhov <at> openvz.org>
---
 fs/ext4/fsync.c |    8 ++++++--
 fs/ext4/super.c |   13 ++++++++++++-
 2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index e0ba8a4..d7df2f1 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -129,9 +129,13 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 		return ret;
 	mutex_lock(&inode->i_mutex);

-	if (inode->i_sb->s_flags & MS_RDONLY)
+	if (inode->i_sb->s_flags & MS_RDONLY) {
+		/* Make sure that we read updated s_mount_flags value */
+		smp_rmb();
+		if (EXT4_SB(inode->i_sb)->s_mount_flags & EXT4_MF_FS_ABORTED)
+			ret = -EROFS;
 		goto out;

Dmitry Monakhov | 16 May 2013 14:07

[PATCH] xfstests: test data integrity under disk failure

The Parallels team has a good old tool called hwflush-check, which is a server/client
application for testing data integrity under system/disk failure conditions.
Usually we run hwflush-check on two different hosts and use a PMU to trigger a real
power failure of the client as a whole unit. These tests may be used for
checking SSDs (some of them are known to have problems with hwflush).
I hope it will be good to share it with the community.

This test simulates the failure of just one disk, which the client system should
survive. It extends the idea of shared/305.
1) Run the hwflush-check server and client on the same host as usual
2) Simulate a disk failure via the blkdev fault injection API, aka 'make-it-fail'
3) Umount the failed device
4) Make the disk operable again
5) Mount the filesystem
6) Check data integrity

Surprisingly, this 'single disk failure test' uncovers data loss on a recent kernel
(3.10-rc1) for EXT4 and XFS.

Signed-off-by: Dmitry Monakhov <dmonakhov <at> openvz.org>
---
 aclocal.m4           |    1 +
 configure.ac         |    2 +
 include/builddefs.in |    1 +
 src/Makefile         |    4 +-
 src/hwflush-check.c  |  775 ++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/shared/313     |  154 ++++++++++
 tests/shared/313.out |   64 ++++
 tests/shared/group   |    1 +
 8 files changed, 1000 insertions(+), 2 deletions(-)

EUNBONG SONG | 15 May 2013 09:15

Question about ext4 excessive stall time


Hello.
I saw some commit messages about excessive stall times in ext4.
The same problem was reproduced on my board. Actually, I'm not sure that it is the same problem,
but the call trace is very similar to the one in commit id f783f091.

My board configuration is as follows.
Linux: 2.6.32.60
CPU: 8 CPUs
Disk: SSD
File system: ext4

I know my kernel version is very old. I just want to know why this problem happens.
Is it because my kernel version is old, or because of the disk?
If anyone knows about this problem, could you help me?

As I said, the call trace is very similar to the one in commit id f783f091, as shown below.
I enabled hung task detection and set the hung task timeout to 120 seconds;
the call trace messages were printed by that.

[  262.455615] INFO: task swm:1692 blocked for more than 120 seconds.
[  262.461715] Stack : 0000000000000000 ffffffff8011f684 0000000000000000 a8000001f8b33960
[  262.469438]         a8000001f7f2ce38 00000000ffffb6a9 0000000000000005 ffffffff80efa270
[  262.477424]         a8000001f7f2d0e8 0000000000000020 0000000000000000 ffffffff80ec69a0
[  262.485732]         00000001f90a3a50 a800000107c93020 a8000001f9062eb8 a8000001f9302950
[  262.493708]         a8000001f5121280 a8000001f90a3810 a8000001f7e20000 0000000000000000
[  262.501675]         0000000000000004 a800000107c93020 a8000001f8b339c0 ffffffff8011cd50
[  262.509661]         a8000001f9062eb8 ffffffff8011f684 0000000000000001 ffffffff8011cea8
[  262.517647]         a8000001f9302950 ffffffff804295a0 a8000001f9302950 0000001800000000
[  262.525652]         0000000000000000 a8000001f7f2ce38 ffffffff802e3f00 a8000001f4f73bb8

Theodore Ts'o | 14 May 2013 18:09

[GIT PULL] ext4 updates for 3.10-rc2


The following changes since commit 0d606e2c9fccdd4e67febf1e2da500e1bfe9e045:

  ext4: fix type-widening bug in inode table readahead code (2013-04-23 08:59:35 -0400)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git tags/ext4_for_linus_stable

for you to fetch changes up to e2555fde4159467fb579e6ae3c0a8fc33015d0f5:

  jbd,jbd2: fix oops in jbd2_journal_put_journal_head() (2013-05-13 09:45:01 -0400)

----------------------------------------------------------------
Fixed regressions (two stability regressions and a performance
regression) introduced during the 3.10-rc1 merge window.  Also
included is a bug fix relating to allocating blocks after resizing an
ext3 file system when using the ext4 file system driver.

----------------------------------------------------------------
Jan Kara (1):
      jbd,jbd2: fix oops in jbd2_journal_put_journal_head()

Lachlan McIlroy (1):
      ext4: limit group search loop for non-extent files

Theodore Ts'o (1):
      ext4: revert "ext4: use io_end for multiple bios"

Yan, Zheng (1):

folkert | 14 May 2013 11:14

checksums

Hi,

Is it possible to "scrub" (check/verify) the new checksums in ext4?

Also: are there plans to add an option to checksum the data as
well?

Folkert van Heusden

-- 
www.vanheusden.com/multitail - multitail is tail on steroids. multiple
               windows, filtering, coloring, anything you can think of
----------------------------------------------------------------------
Phone: +31-6-41278122, PGP-key: 1F28D8AE, www.vanheusden.com

Li Wang | 14 May 2013 09:44

[PATCH] Avoid unnecessarily writing back dirty pages before hole punching

For hole punching, ext4 currently writes back synchronously the
dirty pages that fall within the hole. Since the on-disk data corresponding
to those pages is about to be deleted, it is beneficial to release
those pages directly, whether they are dirty or not, except in the ordered case.

Signed-off-by: Li Wang <liwang <at> ubuntukylin.com>
Signed-off-by: Yunchuan Wen <yunchuanwen <at> ubuntukylin.com>
Reviewed-by: Theodore Ts'o <tytso <at> mit.edu>
Reviewed-by: Dmitry Monakhov <dmonakhov <at> openvz.org>
---
 fs/ext4/inode.c       |   21 ++++++++++-------
 fs/jbd2/transaction.c |   62 +++++++++++++++++++++++++++++++------------------
 include/linux/jbd2.h  |    2 ++
 3 files changed, 55 insertions(+), 30 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0723774..3e00360 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3578,6 +3578,16 @@ int ext4_can_truncate(struct inode *inode)
 	return 0;
 }

+static inline int ext4_begin_ordered_fallocate(struct inode *inode,
+					      loff_t start, loff_t length)
+{
+	if (!EXT4_I(inode)->jinode)
+		return 0;
+	return jbd2_journal_begin_ordered_fallocate(EXT4_JOURNAL(inode),
+						   EXT4_I(inode)->jinode,
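
For readers who have not used hole punching from user space, the operation this patch optimizes is reached through fallocate() with FALLOC_FL_PUNCH_HOLE. The sketch below is only a user-space illustration and is not part of the patch; the helper names are made up for this example. Only whole filesystem blocks inside the range are deallocated, while partial blocks at the edges are zeroed, so the second helper computes the whole-block part of a range.

```c
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <linux/falloc.h>
#include <sys/types.h>

/*
 * Punch a hole of `length` bytes at `offset` in fd.
 * FALLOC_FL_PUNCH_HOLE must be combined with FALLOC_FL_KEEP_SIZE,
 * so the file size stays the same. Returns 0 or -1 with errno set.
 */
static int punch_hole(int fd, off_t offset, off_t length)
{
	return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 offset, length);
}

/*
 * Compute the part of [offset, offset + length) that consists of whole
 * blocks of size blksz. Returns 1 and fills *start and *len if there is
 * such a part, 0 otherwise.
 */
static int whole_block_range(off_t offset, off_t length, off_t blksz,
			     off_t *start, off_t *len)
{
	off_t first = (offset + blksz - 1) / blksz * blksz;
	off_t last = (offset + length) / blksz * blksz;

	if (last <= first)
		return 0;
	*start = first;
	*len = last - first;
	return 1;
}
```

For example, punching 10000 bytes at offset 100 with 4 KiB blocks fully covers only the block at offset 4096.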

Radek Pazdera | 13 May 2013 17:43

[PATCH v2 0/2] e2fsprogs: Adding itree feature/inode flags

Hi,

I created this very short series that adds the itree flags to e2fsprogs
mke2fs and tune2fs, in case someone would like to try the itree patches.

RFC patches implementing this feature were posted to linux-ext4;
however, it is still *highly* experimental. I am still working on it.

Changes from v1
===============

    * The feature + inode flag patches merged together
    * Added more detailed commit messages

-Radek

Radek Pazdera (2):
  libext2fs: Adding itree feature and inode flags
  mke2fs, tune2fs: Adding support for the itree flag

 lib/e2p/feature.c    |    2 ++
 lib/e2p/pf.c         |    1 +
 lib/ext2fs/ext2_fs.h |    2 ++
 misc/mke2fs.c        |    3 ++-
 misc/tune2fs.c       |    6 ++++--
 5 files changed, 11 insertions(+), 3 deletions(-)

-- 
1.7.7.6


Radek Pazdera | 13 May 2013 17:42

[RFC v2 0/9] ext4: An Auxiliary Tree for the Directory Index

Hello everyone,

I am a university student from Brno /CZE/. I'm trying to optimise the
readdir/stat scenario in ext4 as my final school project. I posted some
test results I got a few months ago [1].

Following that, I tried to implement an additional tree for ext4's
directory index that would be sorted by inode numbers. The tree is used by
ext4_readdir(), which should lead to a substantial increase in the performance of
operations that manipulate a whole directory at once.

The performance increase should be visible especially with large directories
or in cases of low memory or cache pressure.
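
As a user-space analogy of what an inode-ordered readdir provides, an application can already read all entries, sort them by inode number, and only then call stat() on them. The sketch below shows that workaround; it is not the itree implementation, and the type and function names are made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* One directory entry: inode number plus a copy of the name. */
struct dir_ent {
	ino_t ino;
	char name[256];
};

/* qsort() comparator: ascending inode number. */
static int cmp_ino(const void *a, const void *b)
{
	ino_t x = ((const struct dir_ent *)a)->ino;
	ino_t y = ((const struct dir_ent *)b)->ino;

	return (x > y) - (x < y);
}

/*
 * Read all entries of `path` and return them sorted by inode number.
 * On success stores a malloc()ed array in *out (caller frees it) and
 * returns the number of entries; returns -1 on error.
 */
static long read_dir_by_inode(const char *path, struct dir_ent **out)
{
	DIR *dir = opendir(path);
	struct dirent *de;
	struct dir_ent *v = NULL;
	size_t n = 0, cap = 0;

	if (!dir)
		return -1;
	while ((de = readdir(dir)) != NULL) {
		if (n == cap) {
			size_t ncap = cap ? cap * 2 : 64;
			struct dir_ent *nv = realloc(v, ncap * sizeof(*v));

			if (!nv) {
				free(v);
				closedir(dir);
				return -1;
			}
			v = nv;
			cap = ncap;
		}
		v[n].ino = de->d_ino;
		snprintf(v[n].name, sizeof(v[n].name), "%s", de->d_name);
		n++;
	}
	closedir(dir);
	if (n > 0)
		qsort(v, n, sizeof(*v), cmp_ino);
	*out = v;
	return (long)n;
}
```

A caller would then stat() the names in the returned order, so that inodes are visited in ascending order rather than in the order the directory index yields them.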

This patch series is what I've got so far. I must say, I originally thought
it would be *much* simpler :).

TLDR
====

The series contains the implementation of the new tree and several rather
small changes to the original directory index. I also added a new
implementation of ext4_readdir() that uses this new tree instead of the
original one.

It applies on 3.10-rc1 and should basically work; however, it is
highly experimental at the moment. It doesn't survive XFS tests yet (I
still need to work on that).

Changes from v1

Jan Kara | 13 May 2013 15:31

[PATCH] jbd2: Fix oops in jbd2_journal_put_journal_head()

Commit ae4647fb (jbd2: reduce journal_head size.) introduced a
regression where we occasionally hit a panic in
jbd2_journal_put_journal_head() because of a wrong b_jcount. The bug is
caused by gcc making a 64-bit access to a 32-bit bitfield and thus
clobbering b_jcount.

At least for now, those 8 bytes saved in struct journal_head are not
worth the trouble with gcc bitfield handling so revert that part of the
patch.

Reported-by: EUNBONG SONG <eunb.song <at> samsung.com>
Signed-off-by: Jan Kara <jack <at> suse.cz>
---
 include/linux/journal-head.h | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

  Ted, this should fix the regression reported by EUNBONG SONG. Please apply.

diff --git a/include/linux/journal-head.h b/include/linux/journal-head.h
index 13a3da2..3594c74 100644
--- a/include/linux/journal-head.h
+++ b/include/linux/journal-head.h
@@ -30,15 +30,19 @@ struct journal_head {

 	/*
 	 * Journalling list for this buffer [jbd_lock_bh_state()]
+	 * NOTE: We *cannot* combine this with b_modified into a bitfield
+	 * as gcc would then (correctly according to C++11) make 64-bit
+	 * accesses to the bitfield and clobber b_jcount if its update
+	 * races with bitfield modification.
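
The trade-off described in this patch can be illustrated with two simplified structures. They are hypothetical stand-ins written for this example, not the kernel's struct journal_head. Storing to a bitfield is a read-modify-write of the storage that holds it, and according to the patch comment gcc may widen that access to 64 bits, so that it also covers the neighbouring b_jcount. With plain fields the store touches only the field itself, at the cost of a larger structure.

```c
/* Simplified layout with bitfields: smaller, but stores are read-modify-write. */
struct jh_bitfield {
	int b_jcount;
	unsigned b_jlist:4;
	unsigned b_modified:1;
};

/* Simplified layout with plain fields: larger, each store is independent. */
struct jh_plain {
	int b_jcount;
	unsigned b_jlist;
	unsigned b_modified;
};

/*
 * The compiler has to load the unit containing the bitfield, set one bit
 * and write the unit back. If that unit is widened to include b_jcount,
 * a concurrent update of b_jcount can be lost.
 */
static void mark_modified_bitfield(struct jh_bitfield *jh)
{
	jh->b_modified = 1;
}

/* A plain store to b_modified does not need to read or write b_jcount. */
static void mark_modified_plain(struct jh_plain *jh)
{
	jh->b_modified = 1;
}
```

In a single thread both variants behave identically; the difference only matters when b_jcount is updated concurrently under a different lock, which is the situation the patch describes.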

