Simon Guilhot | 2 Dec 15:59 2007

Meta-data in Ext3

Hi to everyone,
 
I have a student project that is quite interesting and quite hard, and the information I found on the internet about ext3 isn't relevant. That's why I need your help.
My aim is to add some metadata to my files (like public, private, draft ... the list must be extensible) and to be able to display it with the command-line tools (ls, rm, ...). 
There are lots of ways to do that. The most appropriate, for me, is to implement it directly in the i-node of ext3 (would ext2 be easier?).
Concretely, I see it like this (I'm probably wrong):
 
The standard i-node contains information like creation/modification dates, rights, number of links ... and I want to add a string field (or some bytes, it's the same thing) where I could put my metadata.
Of course the system won't be bootable (and won't be stable).
Here is a representation:
  class|host|device|start_time
  ils|shirley||1151770485
  st_ino|st_alloc|st_uid|st_gid|st_mtime|st_atime|st_ctime|st_mode|st_nlink|st_size|st_block0|st_block1|MY_FIELD
  1|a|0|0|1151770448|1151770448|1151770448|0|0|0|0|0|MY_FIELD
  2|a|0|0|1151770448|1151770448|1151770448|40755|3|1024|201|0|MY_FIELD
  3|a|0|0|0|0|0|0|0|0|0|0|MY_FIELD
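As a starting point, it may help to dump what the on-disk i-node already
contains; debugfs can do that (the device and inode number below are only
examples):

  ~# debugfs -R "stat <12>" /dev/sda1

Keep in mind that adding a field beyond the existing reserved/unused bytes
changes the on-disk format, so unpatched kernels and e2fsck will likely
reject the filesystem.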
Of course, I may be dreaming; it's probably very hard, but it's interesting to ask someone more experienced.
 
Forgive me for my English.
 
Guilhot Simon
Ross Boylan | 4 Dec 19:52 2007

Re: Ancient Very slow directory traversal thread

On Mon, 2007-12-03 at 14:37 -0500, Rashkae wrote:
> I just came across your message to a mailing list here:
[message concerned it taking hours to go through directories in a mail
spool on ext3]
> 
> https://www.redhat.com/archives/ext3-users/2007-October/msg00019.html
> 
> This might be a problem you resolved for yourself a long time ago, but I 
> thought you might be interested to know that Theodore's spd_readdir 
> library works great with star (even though I also cannot get it to work 
> with tar or even du).
That's interesting.  Since I couldn't get it to work with tar, got no
response, and wasn't sure how or whether to get it to work with the
daemon that really needed it, I haven't made any progress.

I wonder what determines whether the library helps or hurts.
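For reference, the library works by being preloaded ahead of libc so
that its readdir() wrappers get picked up; usage is along these lines
(the install path is illustrative):

  $ LD_PRELOAD=/usr/local/lib/spd_readdir.so star -c f=backup.star /var/mail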
> 
> star with spd_readdir is what I use to back up maildir spools, and 
> something I consider absolutely necessary for any such storage on ext3 
> or reiserfs filesystems.
I hadn't noticed this was an issue with reiser.

I also just noticed that e2fsck has an option, -D, to optimize
directories.  The man page says this will reindex directories.  Does
anyone know if that could help?

Ross
Andreas Dilger | 6 Dec 00:47 2007

Re: Ancient Very slow directory traversal thread

On Dec 04, 2007  10:52 -0800, Ross Boylan wrote:
> On Mon, 2007-12-03 at 14:37 -0500, Rashkae wrote:
> > I just came across your message to a mailing list here:
> [message concerned it taking hours to go through directories in a mail
> spool on ext3]
> > 
> > https://www.redhat.com/archives/ext3-users/2007-October/msg00019.html
> > 
> > This might be a problem you resolved for yourself a long time ago, but I 
> > thought you might be interested to know that Theodore's spd_readdir 
> > library works great with star (even though I also cannot get it to work 
> > with tar or even du).
>
> That's interesting.  Since I couldn't get it to work with tar, got no
> response, and wasn't sure how or whether to get it to work with the
> daemon that really needed it, I haven't made any progress.
>
> I wonder what determines whether the library helps or hurts.

Maybe it depends on whether the app is using normal readdir() calls, or
is implementing the directory traversal itself?
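One way to check: an LD_PRELOAD library can only interpose on the libc
readdir()/readdir64() entry points, so statically linked tools, or tools
that call getdents() directly, bypass it.  ltrace can show whether the
libc calls are actually being made, e.g. (a sketch):

  $ ltrace -e readdir,readdir64 tar cf /dev/null /some/dir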

> I also just noticed that e2fsck has an option, -D, to optimize
> directories.  The man page says this will reindex directories.  Does
> anyone know if that could help?

That will compact out empty space in directories and rebuild the hash
table for directories that are not indexed (e.g. older directories created
before the DIR_INDEX feature was enabled in the fs).

It will keep them in hash order so it won't help this issue.
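For reference, that is run against an unmounted filesystem; the device
name below is illustrative:

  ~# umount /dev/sdb1
  ~# e2fsck -f -D /dev/sdb1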

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Sven Rudolph | 11 Dec 13:29 2007

Ext3 Performance Tuning - the journal

Hello,

I have some performance problems on a file server system. It is used
as a Samba and NFS file server. I have some ideas about what might cause
the problems, and I want to test them step by step. First I have to learn
more about these areas.

First I have some questions about tuning/sizing the ext3 journal.

The most extensive list I found on ext3 performance tuning is
<http://marc.info/?l=ext3-users&m=117943306605949&w=2> .

I learned that the ext3 journal is flushed when either the journal is
full or the commit interval is over (set by the mount option
"commit=<number of seconds>"). So started trying these settings.

I didn't manage to determine the size of the journal of an already
existing filesystem. tune2fs tells me only the journal's inode number:

  ~# tune2fs -l  /dev/vg0/lvol0 | grep -i journal
  Filesystem features:      has_journal resize_inode dir_index filetype needs_recovery sparse_super large_file
  Journal inode:            8
  Journal backup:           inode blocks

Is there a way to get the size of the journal?

And how do I find out how much of the journal is used? Or how often a
journal flush actually happens? Or whether the journal flushes happen
because the commit interval has finished or because the journal was
full? This would give me hints for the sizing of the journal.

And I tried to increase the journal flush interval.

  ~# umount /data/
  ~# mount -o commit=30 /dev/vg0/lvol0 /data/
  ~# grep /data /proc/mounts 
  /dev/vg0/lvol0 /data ext3 rw,data=ordered 0 0
  ~#

Watching the disk activity LEDs makes me believe that this works, but
I expected the mount option "commit=30" to be listed in
/proc/mounts. Did I do something wrong, or is there another way to
explain it?

As you see above in /proc/mounts I use data=ordered. The fileserver
offers both NFS and Samba. "data=journal" might be better for NFS, but
I believe that NFS is the smaller part of the fileserver load. Is
there a way to measure or estimate how large the impact of NFS on the
journal size and transfer rate is?

If I used "data=journal" I would need a larger journal and the journal
data transfer rate would increase. I fear this might induce a new
bottleneck, but I have no idea how to measure this or how to estimate
it in advance.

Currently I have an internal journal, the filesystem resides on
RAID6. I guess this is another potential performance problem.  When
discussions on external journals appeared some years ago it was
mentioned that the external journal code was quite new (see
<http://marc.info/?l=ext3-users&m=101466148203469&w=2>).

I think nowadays I have the option to use an external journal and
place it on a dedicated RAID1. Did anyone experience performance
advantages by doing this? Even while using "data=journal"?

That's all. Thanks for reading this far ;-)

	Sven
Bruce Guenter | 11 Dec 23:15 2007

PROBLEM: Duplicated entries in large NFS shared directory

Hi.

I have a large directory (almost 40,000 entries) on an ext3 filesystem
that is shared over NFS.  I discovered recently that, when listing the
directory on the client, one of the files appears twice.  The same file
does not appear twice on the server.

I did a capture using WireShark, and discovered that the offending file
name is being sent twice -- once as the last entry in a readdir reply
packet and then again as the first entry in the next readdir reply.

If I'm reading the trace right, the readdir call sends the cookie for
the last entry in the previous readdir reply and the server responds
with the next set of entries.  In this case, the server responds with
the entry containing the same cookie again.

The server is running vanilla 2.6.23.8.  I would be happy to provide any
further information that would help resolve this bug.

I posted this to the NFS maintainers, and Neil Brown suggested:

> My guess is that you have lucked-out and got two directory entries
> that hash to the same value, and they appear either side of a readdir
> block boundary.
>
> It is an awkward design limitation of ext3 that is rarely a problem
> and could possibly be worked around to some extent...
-- 
Bruce Guenter <bruce <at> untroubled.org>                http://untroubled.org/
Andreas Dilger | 13 Dec 10:20 2007

Re: Ext3 Performance Tuning - the journal

On Dec 11, 2007  13:29 +0100, Sven Rudolph wrote:
> I didn't manage to determine the size of the journal of an already
> existing filesystem. tune2fs tells me only the journal's inode number:
> 
>   ~# tune2fs -l  /dev/vg0/lvol0 | grep -i journal
>   Filesystem features:      has_journal resize_inode dir_index filetype needs_recovery sparse_super large_file
>   Journal inode:            8
>   Journal backup:           inode blocks
> 
> Is there a way to get the size of the journal?

debugfs -c -R "stat <8>" /dev/vg0/lvol0

The Size: field in that output is the journal size.

> And how do I find out how much of the journal is used? Or how often a
> journal flush actually happens? Or whether the journal flushes happen
> because the commit interval has finished or because the journal was
> full? This would give me hints for the sizing of the journal.

There is a patch for jbd2 (part of the ext4 patch queue, based on a
patch for jbd from Lustre) that records transactions and journal stats.

> And I tried to increase the journal flush interval.
> 
>   ~# umount /data/
>   ~# mount -o commit=30 /dev/vg0/lvol0 /data/
>   ~# grep /data /proc/mounts 
>   /dev/vg0/lvol0 /data ext3 rw,data=ordered 0 0
>   ~#
> 
> Watching the disk activity LEDs makes me believe that this works, but
> I expected the mount option "commit=30" to be listed in
> /proc/mounts. Did I do something wrong, or is there another way to
> explain it?

No, /proc/mounts doesn't report all of the mount options correctly.
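The options you pass do show up in /etc/mtab, which mount maintains
itself, so that is the better place to check; output illustrative:

  ~# grep /data /etc/mtab
  /dev/vg0/lvol0 /data ext3 rw,commit=30 0 0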

> As you see above in /proc/mounts I use data=ordered. The fileserver
> offers both NFS and Samba. "data=journal" might be better for NFS, but
> I believe that NFS is the smaller part of the fileserver load. Is
> there a way to measure or estimate how large the impact of NFS on the
> journal size and transfer rate is?
> 
> If I used "data=journal" I would need a larger journal and the journal
> data transfer rate would increase. I fear this might induce a new
> bottleneck, but I have no idea how to measure this or how to estimate
> it in advance.

Increasing the journal size is a good idea for any metadata-heavy load.
We use a journal size of 400MB for Lustre metadata servers.
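If you want to try a larger journal on the existing filesystem, one way
is to drop and recreate it with tune2fs. A sketch, reusing the device
name from your mail; the filesystem must be unmounted (and clean) first:

  ~# umount /data
  ~# tune2fs -O ^has_journal /dev/vg0/lvol0
  ~# tune2fs -J size=400 /dev/vg0/lvol0
  ~# mount /data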

> Currently I have an internal journal, the filesystem resides on
> RAID6. I guess this is another potential performance problem.

For the journal this doesn't make much difference since the IO is
sequential writes.  The RAID6 is bad for metadata performance because
it has to do read-modify-write on the RAID stripes.

> When discussions on external journals appeared some years ago it was
> mentioned that the external journal code was quite new (see
> <http://marc.info/?l=ext3-users&m=101466148203469&w=2>).
> 
> I think nowadays I have the option to use an external journal and
> place it on a dedicated RAID1. Did anyone experience performance
> advantages by doing this? Even while using "data=journal"?
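For reference, the mechanics would look roughly like this; the journal
device name is illustrative, and an external journal must be created with
the same block size as the filesystem that uses it:

  ~# mke2fs -O journal_dev -b 4096 /dev/md2
  ~# tune2fs -O ^has_journal /dev/vg0/lvol0
  ~# tune2fs -J device=/dev/md2 /dev/vg0/lvol0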

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Brice Figureau | 13 Dec 17:22 2007

Re: Ext3 Performance Tuning - the journal

Hi,

On Tue, 2007-12-11 at 13:29 +0100, Sven Rudolph wrote:
> I have some performance problems on a file server system. It is used
> as a Samba and NFS file server. I have some ideas about what might cause
> the problems, and I want to test them step by step. First I have to learn
> more about these areas.
> 
> First I have some questions about tuning/sizing the ext3 journal.
> 
> The most extensive list I found on ext3 performance tuning is
> <http://marc.info/?l=ext3-users&m=117943306605949&w=2> .
> 
> 
> I learned that the ext3 journal is flushed when either the journal is
> full or the commit interval is over (set by the mount option
> "commit=<number of seconds>"). So started trying these settings.

Are your filesystems mounted noatime?

It makes a huge difference, especially if your workload is mostly reads
rather than writes.
Without noatime, each access to a file generates a write to update the
metadata, which will fill your journal.

If you are not using noatime, it is worth trying it.
See this thread for a thorough discussion of the topic:
http://thread.gmane.org/gmane.linux.kernel/565148
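Trying it is quick, e.g. on the /data filesystem from your earlier
message:

  ~# mount -o remount,noatime /data
  ~# grep /data /proc/mounts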

Hope that helps,
-- 
Brice Figureau
Days of Wonder http://www.daysofwonder.com/
Bart | 18 Dec 22:11 2007

how ext3 works

In the past few days, I've been reading about ext3/journalling/... In order to fully understand how it works, I have a few questions.
 
*Imagine the following situation: you open a file in vi(m) and you are editing, but haven't yet saved your work. The system crashes: what will be the result? Will the metadata be modified (consider both atime and noatime)? Will the data itself be corrupted? Or will there be no modification whatsoever because you hadn't saved yet (your work is simply lost)?
*What happens when the system crashes during a write to the journal? Can the journal be corrupted?
 
*About ext3's ordered mode
[quote]from Wikipedia:
Ordered
    (medium speed, medium risk) Ordered is as with writeback, but forces file contents to be written before its associated metadata is marked as committed in the journal.[/quote]

What's the sequence of events here?
1. user issues command to write his work to disk
2. metadata is recorded in the journal, but is marked as "not yet executed" (or something similar)
3. data (file contents) and metadata are written to disk
4. metadata flag is set as "executed"
 
If a crash happens between steps 1 and 2, we are in the situation described above (the first situation): nothing has been written yet.
If a crash happens between steps 2 and 3, isn't this the same as writeback? Or is this impossible (I read something about a single transaction, but I forgot where)?
A crash between steps 3 and 4 can be corrected by replaying the journal.
 
Is this a correct view of things?
wienerschnitzel | 19 Dec 06:44 2007

ext3 journaling on flash disk

Hello folks,

I'm using a rather old kernel (2.4.27) that has been working quite well in
an embedded system.

Currently, I am conducting some unclean shutdown tests with different flash
disks and I'm running into fs corruption. I'm using the data=journal mode
for the root and data partitions. I'm mounting with the 'noatime' option.

Would it make sense to go to the latest 2.4 kernel or should I move on to
2.6?

I have three different flash disks and none of them seems to have write
caching; however, one of them has 'Mandatory FLUSH_CACHE' support. What
exactly does that mean?
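For what it's worth: FLUSH CACHE is the ATA command that tells the drive
to write out its volatile write cache; a drive advertising it as
'Mandatory' implements that command, which the kernel can then use as a
write barrier.  You can check what a disk reports with hdparm, e.g.:

  ~# hdparm -I /dev/hda | grep -i -e 'write cache' -e 'flush'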
Christian Kujau | 19 Dec 11:57 2007

Re: ext3 journaling on flash disk

On Wed, December 19, 2007 06:44, wienerschnitzel wrote:
> Currently, I am conducting some unclean shutdown tests with different
> flash disks and I'm running into fs corruption.
> I'm using the data=journal mode for the root and data partitions.

Well, what kind of corruption do you get?  2.4 is still somewhat
supported, I think.  And if it turns out to be a bug, maybe someone
will fix it.

> Would it make sense to go to the latest 2.4 kernel or should I move on to
> 2.6?

If upgrading to 2.6 is feasible for you, it's worth a try.

C.
-- 
BOFH excuse #442:

Trojan horse ran out of hay
