Valerie Aurora | 1 Sep 2010 03:56

Re: [PATCH 0/5] hybrid union filesystem prototype

On Tue, Aug 31, 2010 at 04:19:47PM -0400, Trond Myklebust wrote:
> On Tue, 2010-08-31 at 15:18 -0400, Valerie Aurora wrote:
> > On Mon, Aug 30, 2010 at 02:20:47PM +0200, Miklos Szeredi wrote:
> > > On Mon, 30 Aug 2010, Neil Brown wrote:
> > > 
> > > > Val has been following that approach and asking if it is possible to make an
> > > > NFS filesystem really-truly read-only. i.e. no changes.
> > > > I don't believe it is.
> > > 
> > > Perhaps it doesn't matter.  The nasty cases can be prevented by just
> > > disallowing local modification.  For the rest NFS will return ESTALE:
> > > "tough luck, why didn't you follow the rules?"
> > 
> > I agree: Ask the server to keep it read-only, but also detect if it
> > lied to prevent kernel bugs on the client.
> > 
> > Is detecting ESTALE and failing the mount sufficient to detect all
> > cases of a cached directory being altered?
> 
> No. Files can be altered without being unlinked.

Argh.  Do generation numbers and/or mtime help us here?

> >   I keep trying to trap an
> > NFS developer and beat the answer out of him but they usually get hung
> > up on the impossibility of 100% enforcement of the read-only server
> > option. (Agreed, impossible, just give the sysadmin a mount option so
> > that it doesn't happen accidentally.)
> 
> Remind me again why mounting the filesystem '-oro' on the server (and

Neil Brown | 1 Sep 2010 06:33

Re: [PATCH 0/5] hybrid union filesystem prototype

On Mon, 30 Aug 2010 21:40:27 +1000
Neil Brown <neilb <at> suse.de> wrote:

> > > > I think it's best to leave that stuff until someone actually cares.
> > > > The "people might find it useful" argument is not strong enough to put
> > > > nontrivial effort into thinking about all the weird corner cases.
> > > 
> > > My thought on this is that in order to describe the behaviour of the
> > > filesystem accurately you need to think about all the weird corner cases.
> > > 
> > > Then it becomes a question of: is it easier to document the weird behaviour,
> > > or change the code so the documentation will be easier to understand?
> > > Some cases will go one way, some the other.
> > > 
> > > But as you suggest this isn't urgent.
> > 
> > You didn't mention one possibility: add limitations that prevent the
> > weird corner cases arising.  I believe this is the simplest option.
> 
> Val has been following that approach and asking if it is possible to make an
> NFS filesystem really-truly read-only. i.e. no changes.
> I don't believe it is.
> 
> But I won't pursue this further without patches.
> 

And here is a patch.  It isn't really complete, but I have done enough for
today.  It at least shows what I am trying to do.

To see a clear indication of the effect, try the following script before and

Stefan Richter | 1 Sep 2010 08:37

2.6.36-rc3: inconsistent lock state (iprune_sem, shrink_icache_memory)

I switched from 2.6.35 to 2.6.36-rc3 a few days ago and while doing so
enabled lockdep.  Just got the following report during normal desktop
usage:

=================================
[ INFO: inconsistent lock state ]
2.6.36-rc3 #3
---------------------------------
inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-R} usage.
kswapd0/436 [HC0[0]:SC0[0]:HE1:SE1] takes:
 (iprune_sem){+++++-}, at: [<ffffffff810bde0b>] shrink_icache_memory+0x48/0x217
{RECLAIM_FS-ON-W} state was registered at:
  [<ffffffff810598b3>] mark_held_locks+0x4d/0x6b
  [<ffffffff8105995f>] lockdep_trace_alloc+0x8e/0xa7
  [<ffffffff810a5a7f>] kmem_cache_alloc+0x2a/0xc8
  [<ffffffff810d4f8a>] fsnotify_create_event+0x39/0x160
  [<ffffffff810d49e8>] send_to_group+0x108/0x13f
  [<ffffffff810d4b86>] fsnotify+0x167/0x266
  [<ffffffff810d57d0>] fsnotify_unmount_inodes+0xb8/0xf9
  [<ffffffff810bda9b>] invalidate_inodes+0x4b/0x13a
  [<ffffffff810ad3e5>] generic_shutdown_super+0x42/0xd2
  [<ffffffff810ad497>] kill_block_super+0x22/0x3a
  [<ffffffff810aca61>] deactivate_locked_super+0x21/0x41
  [<ffffffff810acc39>] deactivate_super+0x40/0x44
  [<ffffffff810c1c29>] mntput_no_expire+0xdc/0x10a
  [<ffffffff810c2e5d>] sys_umount+0x2a4/0x2d1
  [<ffffffff81001feb>] system_call_fastpath+0x16/0x1b
irq event stamp: 281663
hardirqs last  enabled at (281663): [<ffffffff81072d32>] __call_rcu+0x119/0x125
hardirqs last disabled at (281662): [<ffffffff81072c41>] __call_rcu+0x28/0x125

Hannes Reinecke | 1 Sep 2010 09:32

Re: safety of retrying SYNCHRONIZE CACHE [was: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush]

Mike Snitzer wrote:
> Hi Kiyoshi,
> 
> On Mon, Aug 30 2010 at  2:13am -0400,
> Kiyoshi Ueda <k-ueda <at> ct.jp.nec.com> wrote:
> 
>>> That does seem like a valid concern.  But I'm not seeing why it's unique
>>> to SYNCHRONIZE CACHE.  Any IO that fails on the target side should be
>>> passed up once the error gets to DM.
>> See Tejun's explanation again:
>>     http://marc.info/?l=linux-kernel&m=128267361813859&w=2
>> What I'm concerned about is whether the same thing Tejun explained
>> for ATA can happen on other types of devices.
>>
>>
>> Normal write command has data and no data loss happens on error.
>> So it can be retried cleanly, and if the result of the retry is
>> success, it's really success, no implicit data loss.
>>
>> Normal read command has a sector to read.  If the sector is broken,
>> all retries will fail and the error will be reported upwards.
>> So it can be retried cleanly as well.
> 
> I reached out to Fred Knight on this, to get more insight from a pure
> SCSI SBC perspective, and he shared the following:
> 
> ----- Forwarded message from "Knight, Frederick" <Frederick.Knight <at> netapp.com> -----
> 
>> Date: Tue, 31 Aug 2010 13:24:15 -0400
>> From: "Knight, Frederick" <Frederick.Knight <at> netapp.com>

Hannes Reinecke | 1 Sep 2010 09:38

Re: safety of retrying SYNCHRONIZE CACHE [was: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush]

Hannes Reinecke wrote:
> Mike Snitzer wrote:
>> Hi Kiyoshi,
>>
>> On Mon, Aug 30 2010 at  2:13am -0400,
>> Kiyoshi Ueda <k-ueda <at> ct.jp.nec.com> wrote:
>>
>>>> That does seem like a valid concern.  But I'm not seeing why it's unique
>>>> to SYNCHRONIZE CACHE.  Any IO that fails on the target side should be
>>>> passed up once the error gets to DM.
>>> See Tejun's explanation again:
>>>     http://marc.info/?l=linux-kernel&m=128267361813859&w=2
>>> What I'm concerned about is whether the same thing Tejun explained
>>> for ATA can happen on other types of devices.
>>>
>>>
>>> Normal write command has data and no data loss happens on error.
>>> So it can be retried cleanly, and if the result of the retry is
>>> success, it's really success, no implicit data loss.
>>>
>>> Normal read command has a sector to read.  If the sector is broken,
>>> all retries will fail and the error will be reported upwards.
>>> So it can be retried cleanly as well.
>> I reached out to Fred Knight on this, to get more insight from a pure
>> SCSI SBC perspective, and he shared the following:
>>
>> ----- Forwarded message from "Knight, Frederick" <Frederick.Knight <at> netapp.com> -----
>>
>>> Date: Tue, 31 Aug 2010 13:24:15 -0400
>>> From: "Knight, Frederick" <Frederick.Knight <at> netapp.com>

Hubert Kario | 1 Sep 2010 13:56

Re: [PATCH 2/2] Btrfs-progs: Add hot data support in mkfs

On Friday 13 August 2010 16:10:24 Ben Chociej wrote:
> It's a good point, of course. Ideally we would be able to prioritize
> data and place them on 15k versus 7.2krpm disks, etc. However you get
> to a point where there's only incremental benefit. For that reason,
> the scope of this project was simply to take advantage of SSD and HDD
> in hybrid. Of course, you could register the same complaint about the
> ZFS SSD caching: why not take advantage of faster vs. slower spinning
> disks? Unfortunately it just wasn't in the scope of our 12-week
> project here.
> 
> That's not to say it *shouldn't* be done in the future, of course!
> And, incidentally, you could hack it together at this point by setting
> the /sys/block/<blockdev>/queue/rotational flag to 0 and using it like
> an SSD. :)

Then why not make the devices with rotational at 0 be the default "SSDs" but
allow the admin to mark others manually, without hacking?

You can have SSDs in a hardware RAID array, this way the kernel may not know 
if the block device is on flash media or on rotational media...
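
A minimal userspace sketch of that idea, assuming the sysfs file holds a
single 0 or 1: the kernel's rotational flag is the default, and an explicit
admin setting overrides it. The function names, the enums, and the override
mechanism are hypothetical and for illustration only; they are not part of
the posted Btrfs patches.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical admin setting: no override, or force a device into a tier. */
enum tier_override { OVERRIDE_NONE, OVERRIDE_FAST, OVERRIDE_SLOW };
enum tier { TIER_FAST, TIER_SLOW };

/*
 * Read a rotational flag file such as
 * /sys/block/<blockdev>/queue/rotational.
 * Returns 0 or 1 as stored in the file, or -1 if the file cannot be
 * opened or does not contain 0 or 1.
 */
static int read_rotational_flag(const char *path)
{
	FILE *f;
	int value = -1;

	assert(path != NULL);

	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%d", &value) != 1)
		value = -1;
	fclose(f);

	return (value == 0 || value == 1) ? value : -1;
}

/*
 * Pick a tier: an explicit admin override wins; otherwise trust the
 * kernel's flag. Unknown (-1) is treated as slow, since a device behind
 * a hardware RAID controller may not report its media type correctly.
 */
static enum tier classify_device(int rotational, enum tier_override ovr)
{
	if (ovr == OVERRIDE_FAST)
		return TIER_FAST;
	if (ovr == OVERRIDE_SLOW)
		return TIER_SLOW;
	return rotational == 0 ? TIER_FAST : TIER_SLOW;
}
```

With this, an SSD array that reports rotational=1 could still be placed in
the fast tier by passing OVERRIDE_FAST, without touching sysfs.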

-- 
Hubert Kario
QBS - Quality Business Software
ul. Ksawerów 30/85
02-656 Warszawa
POLAND
tel. +48 (22) 646-61-51, 646-74-24
fax +48 (22) 646-61-50

Mike Snitzer | 1 Sep 2010 15:43

Re: [PATCH 2/5] dm: implement REQ_FLUSH/FUA support for bio-based dm

On Mon, Aug 30 2010 at  5:58am -0400,
Tejun Heo <tj <at> kernel.org> wrote:

> This patch converts bio-based dm to support REQ_FLUSH/FUA instead of
> now deprecated REQ_HARDBARRIER.
> 
> * -EOPNOTSUPP handling logic dropped.

Can you expand on _why_ -EOPNOTSUPP handling is no longer needed?  And
please add it to the final patch header.

This removal isn't unique to DM's conversion to FLUSH+FUA but I couldn't
easily find the justification for its removal in the larger 30+ patch
patchset either -- other patches are terse on the removal too.

Other than that.

Acked-by: Mike Snitzer <snitzer <at> redhat.com>
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo <at> vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Miklos Szeredi | 1 Sep 2010 22:11

Re: [PATCH 0/5] hybrid union filesystem prototype

On Wed, 1 Sep 2010, Neil Brown wrote:
> On Mon, 30 Aug 2010 21:40:27 +1000
> And here is a patch.  It isn't really complete, but I have done enough for
> today.  It at least shows what I am trying to do.

Thanks.

Okay, I see what it's trying to do.  And I can accept the part which
validates that the underlying dentry trees still match the union tree
(wouldn't it be simpler to just d_lookup() and check if the child
matches?)

However redoing the lookup and changing the upperpath and lowerpath
for directories is no good.  Upperpath and lowerpath are constant,
changing them would be like changing dentry->d_inode, the dentry would
suddenly become something else.  That's crazy.

Making sure that everything is in a sane state and erroring out if not
sounds like a workable alternative to enforcing no change blindly.
Although no change should probably be the default (e.g. unless a 
"-o dont_enforce_no_change" mount option is given).

Thanks,
Miklos

Dave Chinner | 2 Sep 2010 08:11

Re: 2.6.36-rc3: inconsistent lock state (iprune_sem, shrink_icache_memory)

On Wed, Sep 01, 2010 at 08:37:43AM +0200, Stefan Richter wrote:
> I switched from 2.6.35 to 2.6.36-rc3 a few days ago and while doing so
> enabled lockdep.  Just got the following report during normal desktop
> usage:
> 
> =================================
> [ INFO: inconsistent lock state ]
> 2.6.36-rc3 #3
> ---------------------------------
> inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-R} usage.
> kswapd0/436 [HC0[0]:SC0[0]:HE1:SE1] takes:
>  (iprune_sem){+++++-}, at: [<ffffffff810bde0b>] shrink_icache_memory+0x48/0x217
> {RECLAIM_FS-ON-W} state was registered at:
>   [<ffffffff810598b3>] mark_held_locks+0x4d/0x6b
>   [<ffffffff8105995f>] lockdep_trace_alloc+0x8e/0xa7
>   [<ffffffff810a5a7f>] kmem_cache_alloc+0x2a/0xc8
>   [<ffffffff810d4f8a>] fsnotify_create_event+0x39/0x160
>   [<ffffffff810d49e8>] send_to_group+0x108/0x13f
>   [<ffffffff810d4b86>] fsnotify+0x167/0x266
>   [<ffffffff810d57d0>] fsnotify_unmount_inodes+0xb8/0xf9
>   [<ffffffff810bda9b>] invalidate_inodes+0x4b/0x13a
>   [<ffffffff810ad3e5>] generic_shutdown_super+0x42/0xd2
>   [<ffffffff810ad497>] kill_block_super+0x22/0x3a
>   [<ffffffff810aca61>] deactivate_locked_super+0x21/0x41
>   [<ffffffff810acc39>] deactivate_super+0x40/0x44
>   [<ffffffff810c1c29>] mntput_no_expire+0xdc/0x10a
>   [<ffffffff810c2e5d>] sys_umount+0x2a4/0x2d1
>   [<ffffffff81001feb>] system_call_fastpath+0x16/0x1b
> irq event stamp: 281663
> hardirqs last  enabled at (281663): [<ffffffff81072d32>] __call_rcu+0x119/0x125

Miklos Szeredi | 2 Sep 2010 11:19

Re: [PATCH 5/5] union: hybrid union filesystem prototype

On Wed, 1 Sep 2010, Valerie Aurora wrote:
> > +
> > +		err = vfs_create(upperdir, newdentry, attr->ia_mode, NULL);
> 
> Passing a NULL nameidata pointer to vfs_create() is a convenient
> temporary hack, but unfortunately NFS, ceph, etc. still use the
> nameidata passed to vfs_create() and other ops.
> 
> The way union mounts gets a valid nameidata is by doing the create in
> the VFS before calling file system ops that may trigger a copyup,
> while we still have the original nameidata.  This is one of the major
> reasons union mounts lives in the VFS.

Not a big deal, just set up nd as if this was a single component
lookup.  The previous version did it like this:

+       struct nameidata nd = {
+               .last_type = LAST_NORM,
+               .last = *name,
+       };
+
+       nd.path = pue->upperpath;
+       path_get(&nd.path);
+
+       newdentry = lookup_create(&nd, S_ISDIR(attr->ia_mode));

But that's not a solution to the NFS suckage, it's just a workaround.

"Fortunately" NFS isn't good for a writable layer of a union for other
reasons, so this isn't a big concern at the moment.