David Abrahams | 3 May 2007 19:34

seeking JFS recovery wisdom

OK, so my ubuntu laptop locked up after a resume recently with no recent 
backup <insert requisite self-flagellation> :(

After the first reboot, I was getting all kinds of funky problems with 
gnome, so I did a "sudo touch /forcefsck" and tried to reboot without 
success.  The fsck failed to completely recover everything on my JFS 
root partition and I lost my /etc directory.  lost+found has many 
recognizable files from /etc, but it seems as though putting them all 
back where they belong is probably a huge amount of work.

Seems like my /home filesystem is still in decent shape.

After I back up what I have and choose a more reliable backup strategy, 
I can see a few options for getting back to a "known good configuration"

1. try to put the lost+found files back and hope for the best
2. reinstall ubuntu and use my old /home partition.  Attempt to restore 
installed software and configurations to the previous state
3. anything else?

Also, did I do the wrong thing by trying to force a fsck?  If so, what 
should I have done instead?

Finally, is there any reason to believe that some other filesystem 
would have been less likely to fail in this way?

Thanks in advance,
Dave

Scott Lockwood | 3 May 2007 19:49

Re: seeking JFS recovery wisdom

On Thu, 2007-05-03 at 13:34 -0400, David Abrahams wrote:
> 2. reinstall ubuntu and use my old /home partition.

This one. From what you describe, there's no way to know for sure that
the system is 100% intact otherwise.

I don't think jfs is any better or worse than ext3 for this. Use
whatever you like.

Also, if your system is un-swiss-cheesed enough, try to get the list of
currently installed packages thusly:

dpkg -l | grep '^ii'

Or, for a version of the list that is JUST the package names, saved to a
file called pkglist:

dpkg -l | grep '^ii' | awk '{print $2}' >pkglist

That way, after you reinstall, you have the list of packages you need.
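
The same list can also be produced by a small filter that checks the
status column explicitly. This is only a sketch: the function name
pkg_names is made up for illustration, and it assumes the usual
"dpkg -l" layout, with the status in the first column and the package
name in the second.

```shell
#!/bin/sh
# pkg_names (hypothetical helper): read `dpkg -l` output on stdin and
# print the names of packages whose status column is exactly "ii"
# (installed). Header lines and removed packages ("rc", "un") are skipped.
pkg_names() {
    awk '$1 == "ii" { print $2 }'
}

# Usage on a Debian/Ubuntu system:
#   dpkg -l | pkg_names > pkglist
```

dpkg can also dump its own selection list with "dpkg --get-selections",
which can be fed back after a reinstall via "dpkg --set-selections"
followed by "apt-get dselect-upgrade". Whether that is preferable to a
plain list of names depends on how closely you want to reproduce the old
system.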

-- 
Regards,
Scott Lockwood

Dave Kleikamp | 4 May 2007 17:37

[PATCH] JFS: Fix race waking up jfsIO kernel thread

I've been looking at a hang that was reported off-list, and I believe I
found the cause.  I'm still waiting for confirmation on whether the
patch does fix the problem, but I wanted to distribute the patch in case
anyone else is seeing a similar hang.  It looks like the problem is in
kernels from 2.6.17 to the present.

Thanks,
Shaggy

JFS: Fix race waking up jfsIO kernel thread

It's possible for a journal I/O request to be added to the log_redrive
queue and the jfsIO thread to be awakened after the thread releases
log_redrive_lock but before it sets its state to TASK_INTERRUPTIBLE.

The jfsIO thread should set its state before giving up the spinlock, so
the waking thread will really wake it.

Signed-off-by: Dave Kleikamp <shaggy <at> linux.vnet.ibm.com>

diff --git a/fs/jfs/jfs_logmgr.c b/fs/jfs/jfs_logmgr.c
index ff7f1be..16c6268 100644
--- a/fs/jfs/jfs_logmgr.c
+++ b/fs/jfs/jfs_logmgr.c
@@ -2354,12 +2354,13 @@ int jfsIOWait(void *arg)
 			lbmStartIO(bp);
 			spin_lock_irq(&log_redrive_lock);
 		}
-		spin_unlock_irq(&log_redrive_lock);

Marcel Lanz | 9 May 2007 23:22

File system object DF12704 has invalid descriptor (C)

Hi

This afternoon my box lost power and a volume group with two USB
drives chained together went down with it. On the VG one big JFS
Partition (around 500GB) was installed.

After a fsck.jfs I was not able to mount the partition again. I get
the following messages using the verbose switch.

Can anyone help me interpret the following lines? I don't
see any device (sd<x>) related errors in the kernel log, so I hope
those errors can be fixed somehow.

Thanks and regards, Marcel

defiant:~# jfs_fsck /dev/external_vg/data_01  -v
jfs_fsck version 1.1.11, 05-Jun-2006
processing started: 5/9/2007 23.9.26
Using default parameter: -p
The current device is:  /dev/external_vg/data_01
Open(...READ/WRITE EXCLUSIVE...) returned rc = 0
Primary superblock is valid.
The type of file system for the device is JFS.
Block size in bytes:  4096
Filesystem size in blocks:  127927296
**Phase 0 - Replay Journal Log
LOGREDO:  Log already redone!
logredo returned rc = 0
**Phase 1 - Check Blocks, Files/Directories, and  Directory Entries
File system object DF12704 has invalid descriptor (C).

Daniel Drake | 10 May 2007 15:57

Detecting when device has been remounted read-only from userspace

Hi,

Occasionally, JFS detects filesystem corruption during use and
remounts the filesystem read-only. (The corruption probably has a
legitimate cause, e.g. an unexpected power loss some time previously.)

Is there a clean way to detect when this happens from userspace, other
than checking every single I/O operation? In our case, we'd like to pop
up a warning and recommend a reboot.

Thanks.
-- 
Daniel Drake
Brontes Technologies, A 3M Company

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/

Dave Kleikamp | 10 May 2007 18:13

Re: File system object DF12704 has invalid descriptor (C)

On Wed, 2007-05-09 at 23:22 +0200, Marcel Lanz wrote:
> Hi
> 
> This afternoon my box lost power and a volume group with two USB
> drives chained together went down with it. On the VG one big JFS
> Partition (around 500GB) was installed.
> 
> After a fsck.jfs I was not able to mount the partition again. I get
> the following messages using the verbose switch.
> 
> Can anyone help me interpret the following lines? I don't
> see any device (sd<x>) related errors in the kernel log, so I hope
> those errors can be fixed somehow.

Most of those errors should be recoverable.  Some of them are more
severe and should result in a directory being lost and its contents put
in lost+found.  The worst problem is the termination message:

> processing terminated:  5/9/2007 23:09:31  with return code: 16711680
>  exit code: 8.

return code 16711680 isn't valid.  In hex, it's 0xff0000.  I suspect
that a return code is being overwritten in memory.  Probably some data
structure containing an array on the stack is going outside its bounds.
I'm looking at the code now, but I haven't found anything yet.
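
For reference, the conversion can be checked from a shell. This is just
a sketch that reformats the number quoted above; it says nothing about
where the stray value comes from.

```shell
#!/bin/sh
# The return code reported in the jfs_fsck termination message.
rc=16711680

# Print it in hexadecimal.
rc_hex=$(printf '0x%x' "$rc")
echo "$rc_hex"

# Split it into the part above bit 16 and the low 16 bits.
rc_high=$(( rc >> 16 ))
rc_low=$(( rc & 0xffff ))
echo "high: $rc_high  low: $rc_low"
```

The low 16 bits are all zero and everything above them is a single 0xff
byte.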

Until I figure out how to fix jfs_fsck, you may be able to access the
volume by mounting it read-only (if that is of any use to you).  I'm not
sure why you're seeing so much file system damage.  Could the journal
replay have failed on an earlier invocation of jfs_fsck?

Dave Kleikamp | 10 May 2007 18:36

Re: Detecting when device has been remounted read-only from userspace

On Thu, 2007-05-10 at 09:57 -0400, Daniel Drake wrote:
> Hi,
> 
> Occasionally, JFS detects filesystem corruption during use and
> remounts the filesystem read-only. (The corruption probably has a
> legitimate cause, e.g. an unexpected power loss some time previously.)

jfs really should recover cleanly after a power loss, but caching by the
disk drives can undermine the journal's integrity.  There may be some
undiscovered bugs as well.  If there are no i/o errors on the device,
file-system corruption should be a pretty rare occurrence.

> Is there a clean way to detect when this happens from userspace, other
> than checking every single I/O operation? In our case, we'd like to pop
> up a warning and recommend a reboot.

jfs prints a message to the system log when this happens, so maybe you
could somehow monitor that?  The message contains "remounting filesystem
as read-only".

I guess another method might be to occasionally check the mount point
in /proc/mounts to see if it is mounted read only ("ro" rather than
"rw").
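
A rough sketch of the /proc/mounts check, in shell. The function name
is_readonly and its optional second argument are made up for
illustration; it assumes the usual /proc/mounts layout (device, mount
point, filesystem type, comma-separated options).

```shell
#!/bin/sh
# is_readonly MOUNTPOINT [MOUNTS_FILE]   (hypothetical helper)
# Exit status: 0 if the last entry for MOUNTPOINT lists "ro" among its
# options, 1 if it does not, 2 if MOUNTPOINT is not listed at all.
# MOUNTS_FILE defaults to /proc/mounts.
is_readonly() {
    MP="$1" awk '
        $2 == ENVIRON["MP"] { opts = $4; found = 1 }  # last entry wins
        END {
            if (!found) exit 2
            n = split(opts, o, ",")
            for (i = 1; i <= n; i++)
                if (o[i] == "ro") exit 0   # exact match, not "errors=remount-ro"
            exit 1
        }' "${2:-/proc/mounts}"
}

# Usage, e.g. from a cron job or a polling loop:
#   if is_readonly /home; then echo "warning: /home is read-only"; fi
```

Polling like this only notices the remount at the next check, so the
interval has to be chosen to suit how quickly the warning needs to
appear.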

Shaggy
-- 
David Kleikamp
IBM Linux Technology Center

William Marshall | 10 May 2007 19:21

Re: Detecting when device has been remounted read-only from userspace


Shaggy wrote on 05/10/2007 11:36:14 AM:

> On Thu, 2007-05-10 at 09:57 -0400, Daniel Drake wrote:
> > Hi,
> >
> > Occasionally, JFS detects filesystem corruption during use and
> > remounts the filesystem read-only. (The corruption probably has a
> > legitimate cause, e.g. an unexpected power loss some time previously.)
>
> jfs really should recover cleanly after a power loss, but caching by the
> disk drives can undermine the journal's integrity.  There may be some
> undiscovered bugs as well.  If there are no i/o errors on the device,
> file-system corruption should be a pretty rare occurrence.
>
> > Is there a clean way to detect when this happens from userspace, other
> > than checking every single I/O operation? In our case, we'd like to pop
> > up a warning and recommend a reboot.
>
> jfs prints a message to the system log when this happens, so maybe you
> could somehow monitor that?  The message contains "remounting filesystem
> as read-only".

There are a couple of tools like swatch that will watch /var/log/messages and then take actions.
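
Without a dedicated tool, the same idea can be sketched in a few lines
of shell. The function name watch_remount is made up for illustration,
and the only thing it relies on is the message text quoted above.

```shell
#!/bin/sh
# watch_remount (hypothetical helper): read log lines on stdin and print
# a warning for every line containing the JFS remount message.
watch_remount() {
    while IFS= read -r line; do
        case $line in
            *"remounting filesystem as read-only"*)
                echo "WARNING: filesystem went read-only: $line" ;;
        esac
    done
}

# Usage (the log file location differs between distributions):
#   tail -F /var/log/messages | watch_remount
```

The echo is a placeholder; whatever pops up the warning for the user
would go in its place.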

And you probably don't need to reboot. You can:
fsck /home
mount -o remount,rw /home

Bill
_______________________________________________
Jfs-discussion mailing list
Jfs-discussion <at> lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/jfs-discussion

Marcel Lanz | 12 May 2007 20:36

Re: File system object DF12704 has invalid descriptor (C)

>Most of those errors should be recoverable. Some of them are more
>severe and should result in a directory being lost and its contents put
>in lost+found. The worst problem is the termination message:

>> processing terminated: 5/9/2007 23:09:31 with return code: 16711680
>> exit code: 8.

>return code 16711680 isn't valid. In hex, it's 0xff0000. I suspect
>that a return code is being overwritten in memory. Probably some data
>structure containing an array on the stack is going outside its bounds.
>I'm looking at the code now, but I haven't found anything yet.

>Until I figure out how to fix jfs_fsck, you may be able to access the
>volume by mounting it read-only (if that is of any use to you). I'm not
>sure why you're seeing so much file system damage. Could the journal
>replay have failed on an earlier invocation of jfs_fsck?

Yes, I can mount it read-only. This helps a lot, since I thought I
would lose all the data. A read-write mount didn't work, and until your
suggestion I had never tried to mount it RO. For a RW mount I get the
following:

defiant:/# mount  /dev/external_vg/data_01 /mnt/
mount: wrong fs type, bad option, bad superblock on /dev/external_vg/data_01,
       missing codepage or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so

marcel

npiggin | 14 May 2007 08:07

[patch 41/41] jfs convert to new aops.

Cc: shaggy <at> austin.ibm.com
Cc: jfs-discussion <at> lists.sourceforge.net
Cc: Linux Filesystems <linux-fsdevel <at> vger.kernel.org>
Signed-off-by: Nick Piggin <npiggin <at> suse.de>

 fs/jfs/inode.c |   19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

Index: linux-2.6/fs/jfs/inode.c
===================================================================
--- linux-2.6.orig/fs/jfs/inode.c
+++ linux-2.6/fs/jfs/inode.c
@@ -255,7 +255,7 @@ int jfs_get_block(struct inode *ip, sect

 static int jfs_writepage(struct page *page, struct writeback_control *wbc)
 {
-	return nobh_writepage(page, jfs_get_block, wbc);
+	return block_write_full_page(page, jfs_get_block, wbc);
 }

 static int jfs_writepages(struct address_space *mapping,
@@ -275,10 +275,13 @@ static int jfs_readpages(struct file *fi
 	return mpage_readpages(mapping, pages, nr_pages, jfs_get_block);
 }

-static int jfs_prepare_write(struct file *file,
-			     struct page *page, unsigned from, unsigned to)
-{
-	return nobh_prepare_write(page, from, to, jfs_get_block);
+static int jfs_write_begin(struct file *file, struct address_space *mapping,
+				loff_t pos, unsigned len, unsigned flags,
+				struct page **pagep, void **fsdata)
+{
+	*pagep = NULL;
+	return block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
+				jfs_get_block);
 }

 static sector_t jfs_bmap(struct address_space *mapping, sector_t block)
@@ -302,8 +305,8 @@ const struct address_space_operations jf
 	.writepage	= jfs_writepage,
 	.writepages	= jfs_writepages,
 	.sync_page	= block_sync_page,
-	.prepare_write	= jfs_prepare_write,
-	.commit_write	= nobh_commit_write,
+	.write_begin	= jfs_write_begin,
+	.write_end	= generic_write_end,
 	.bmap		= jfs_bmap,
 	.direct_IO	= jfs_direct_IO,
 };
@@ -356,7 +359,7 @@ void jfs_truncate(struct inode *ip)
 {
 	jfs_info("jfs_truncate: size = 0x%lx", (ulong) ip->i_size);

-	nobh_truncate_page(ip->i_mapping, ip->i_size);
+	block_truncate_page(ip->i_mapping, ip->i_size, jfs_get_block);

 	IWRITE_LOCK(ip, RDWRLOCK_NORMAL);
 	jfs_truncate_nolock(ip, ip->i_size);

-- 
