Yan, Zheng | 1 Jun 2010 04:32

Re: Disk space accounting and subvolume delete

On Tue, Jun 1, 2010 at 3:01 AM, Bruce Guenter <bruce <at> untroubled.org> wrote:
> On Wed, May 12, 2010 at 01:02:07PM +0800, Yan, Zheng  wrote:
>> Dropping a tree can be lengthy. It's not good to let sync wait for hours.
>> For most Linux FS, 'sync' just forces a transaction/journal commit. I don't
>> think they wait for large operations that can span multiple transactions to
>> complete.
>
> What happens to the consistency of the filesystem if a crash happens
> during this process?
>

This does not break the consistency of the filesystem. The next mount will find
the partially dropped tree and restart the dropping process.

Yan, Zheng
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo <at> vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Chris Mason | 1 Jun 2010 15:19

Re: [PATCH 6/6] Btrfs: do aio_write instead of write

On Sun, May 30, 2010 at 07:33:33PM +1000, Dmitri Nikulin wrote:
> On Sat, May 22, 2010 at 3:03 AM, Josef Bacik <josef <at> redhat.com> wrote:
> > +       while (1) {
> > +               lock_extent(tree, start, end, GFP_NOFS);
> > +               ordered = btrfs_lookup_ordered_extent(inode, start);
> > +               if (!ordered)
> > +                       break;
> > +               unlock_extent(tree, start, end, GFP_NOFS);
> 
> Is it ok not to unlock_extent if !ordered?
> I don't know if you fixed this in a later version but it stuck out to me :)

The construct is confusing.  Ordered extents track things that we have
allocated on disk and need to write.  New ones can't be created while we
have the extent range locked.  But we can't force old ones to disk with
the lock held.

So, we lock then lookup and if we find nothing we can safely do our
operation because no io is in progress.  We unlock a little later on, or
at endio time.

If we find an ordered extent we drop the lock and wait for that IO to
finish, then loop again.
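
The loop described above can be sketched in plain C.  This is a
single-threaded model, not the kernel code: struct range_state,
lock_range(), lookup_ordered() and wait_ordered() are hypothetical
stand-ins for the extent lock, btrfs_lookup_ordered_extent() and the wait
for ordered IO, and the counters exist only so the behaviour can be checked.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one extent range; not a btrfs structure. */
struct range_state {
    bool locked;         /* is the extent range currently locked? */
    int pending_ordered; /* ordered extents still waiting for IO */
    int waits;           /* how many times the caller had to wait */
};

static void lock_range(struct range_state *r)
{
    assert(!r->locked);
    r->locked = true;
}

static void unlock_range(struct range_state *r)
{
    assert(r->locked);
    r->locked = false;
}

/* True if an ordered extent still overlaps the range. */
static bool lookup_ordered(const struct range_state *r)
{
    return r->pending_ordered > 0;
}

/* Old ordered extents cannot be forced to disk with the lock held. */
static void wait_ordered(struct range_state *r)
{
    assert(!r->locked);
    r->pending_ordered--;
    r->waits++;
}

/* Returns with the range locked and no ordered extent inside it. */
static void lock_and_wait_ordered(struct range_state *r)
{
    for (;;) {
        lock_range(r);
        if (!lookup_ordered(r))
            break;       /* lock stays held; caller unlocks later */
        unlock_range(r); /* drop the lock before waiting for IO */
        wait_ordered(r);
    }
}
```

The break without an unlock is deliberate in this model: the caller
receives a locked, quiescent range and releases it once its own operation
is done.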

-chris


Martin K. Petersen | 1 Jun 2010 15:39

Re: A couple of questions

>>>>> "Paul" == Paul Millar <paul.millar <at> desy.de> writes:

Paul> My concern is that, if the server-software doesn't push the
Paul> client-provided checksum then the FS checksum (plus T-10 DIF/DIX)
Paul> would not provide a rigorous assurance that the bytes are the
Paul> same.  Without this assurance, corruption could still occur; for
Paul> example, within the server's memory.

For DIX we allow integrity metadata conversion.  Once the data is
received, the server generates appropriate IMD for the next layer.  Then
the server verifies that the original IMD matches the data buffer.  That
way there's no window of error.  But obviously the ideal case is where
the same IMD can be passed throughout the stack without conversion.
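
A minimal sketch of that ordering, in C.  The choice of checksums is an
assumption made for illustration (an IP-style 16-bit checksum as the
incoming IMD, a CRC16 with the T10 DIF polynomial 0x8BB7 as the outgoing
one), and convert_imd() is a hypothetical helper, not an existing kernel
or DIX interface.  The point is only the order: generate the new IMD
first, then verify the original one against the same buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Incoming IMD (assumed): 16-bit ones' complement checksum. */
static uint16_t ip_csum(const uint8_t *p, size_t len)
{
    uint64_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)((p[i] << 8) | p[i + 1]);
    if (len & 1)
        sum += (uint32_t)(p[len - 1] << 8);
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Outgoing IMD (assumed): bitwise CRC16, polynomial 0x8BB7, init 0. */
static uint16_t crc16_t10dif(const uint8_t *p, size_t len)
{
    uint16_t crc = 0;

    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)(p[i] << 8);
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x8BB7)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}

/*
 * Hypothetical conversion step.  The new IMD is computed before the old
 * one is checked, so a buffer that changes at any point up to the check
 * fails verification instead of being passed on with fresh metadata.
 */
static bool convert_imd(const uint8_t *buf, size_t len,
                        uint16_t orig_imd, uint16_t *new_imd)
{
    uint16_t next = crc16_t10dif(buf, len); /* 1. generate new IMD */

    if (ip_csum(buf, len) != orig_imd)      /* 2. verify original IMD */
        return false;
    *new_imd = next;
    return true;
}
```

Doing the two steps in the opposite order would leave a gap between the
verification and the generation in which corruption could be covered by a
valid new checksum.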

Not sure what you use for file service?  I believe NFSv4 allows for
checksums to be passed along. I have not looked at them closely yet,
though.

-- 
Martin K. Petersen	Oracle Linux Engineering

Dmitri Nikulin | 2 Jun 2010 01:20

Re: [PATCH 6/6] Btrfs: do aio_write instead of write

On Tue, Jun 1, 2010 at 11:19 PM, Chris Mason <chris.mason <at> oracle.com> wrote:
>> Is it ok not to unlock_extent if !ordered?
>> I don't know if you fixed this in a later version but it stuck out to me :)
>
> The construct is confusing.  Ordered extents track things that we have
> allocated on disk and need to write.  New ones can't be created while we
> have the extent range locked.  But we can't force old ones to disk with
> the lock held.
>
> So, we lock then lookup and if we find nothing we can safely do our
> operation because no io is in progress.  We unlock a little later on, or
> at endio time.
>
> If we find an ordered extent we drop the lock and wait for that IO to
> finish, then loop again.

Ok, that's fair enough. Maybe it's worth a comment in the code; I'm sure
I'm not the only one surprised.

Thanks,

-- 
Dmitri Nikulin

Centre for Synchrotron Science
Monash University
Victoria 3800, Australia

Miao Xie | 2 Jun 2010 04:59

Re: [GIT PULL] Btrfs updates

Hi, Chris

on 2010-5-27 23:15, Chris Mason wrote: 
> My commits here are just integrating the two and fixing a few bugs in
> the result.
> 
> Zheng Yan (13) commits (+4076/-2679):
>     Btrfs: Integrate metadata reservation with start_transaction (+678/-528)
>     Btrfs: Update metadata reservation for delayed allocation (+232/-415)
>     Btrfs: Shrink delay allocated space in a synchronized (+88/-121)
>     Btrfs: Introduce contexts for metadata reservation (+853/-385)
>     Btrfs: Link block groups of different raid types (+121/-55)
>     Btrfs: Metadata ENOSPC handling for balance (+1163/-747)
>     Btrfs: Metadata ENOSPC handling for tree log (+156/-128)
>     Btrfs: Metadata reservation for orphan inodes (+365/-66)
>     Btrfs: Pre-allocate space for data relocation (+92/-45)
>     Btrfs: Introduce global metadata reservation (+241/-76)
>     Btrfs: Fix block generation verification race (+1/-1)
>     Btrfs: Kill allocate_wait in space_info (+58/-76)
>     Btrfs: Kill init_btrfs_i() (+28/-36)
> 
> Chris Mason (9) commits (+314/-80):
>     Btrfs: fix preallocation and nodatacow checks in O_DIRECT (+140/-16)
>     Btrfs: move O_DIRECT space reservation to btrfs_direct_IO (+8/-6)
>     Btrfs: don't walk around with task->state != TASK_RUNNING (+4/-3)
>     Btrfs: use async helpers for DIO write checksumming (+52/-17)
>     Btrfs: add more error checking to btrfs_dirty_inode (+13/-2)
>     Btrfs: avoid ENOSPC errors in btrfs_dirty_inode (+11/-3)
>     Btrfs: rework O_DIRECT enospc handling (+49/-30)
>     Btrfs: drop verbose enospc printk (+2/-0)

Miao Xie | 2 Jun 2010 05:07

Re: [PATCH 01/10] btrfs: add a return value for readahead_tree_block()

Hi, Chris

Could you review this patchset for us? 

Regards
Miao

on 2010-5-20 15:13, Miao Xie wrote:
> From: Liu Bo <liubo2009 <at> cn.fujitsu.com>
> 
> Propagate the return value of read_extent_buffer_pages().
> 
> Signed-off-by: Liu Bo <liubo2009 <at> cn.fujitsu.com>
> Signed-off-by: Miao Xie <miaox <at> cn.fujitsu.com>
> ---
>  fs/btrfs/disk-io.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 6632e5c..35916d5 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -794,7 +794,7 @@ int readahead_tree_block(struct btrfs_root *root, u64 bytenr, u32 blocksize,
>  	buf = btrfs_find_create_tree_block(root, bytenr, blocksize);
>  	if (!buf)
>  		return 0;
> -	read_extent_buffer_pages(&BTRFS_I(btree_inode)->io_tree,
> +	ret = read_extent_buffer_pages(&BTRFS_I(btree_inode)->io_tree,
>  				 buf, 0, 0, btree_get_extent, 0);
>  	free_extent_buffer(buf);

ulmo | 2 Jun 2010 04:29

Is there a more aggressive fixer than btrfsck?

Is there a more aggressive filesystem restorer than btrfsck?  It simply
gives up immediately with the following error:

btrfsck: disk-io.c:739: open_ctree_fd: Assertion `!(!tree_root->node)'
failed.

Yet the filesystem has plenty of data on it, the discs are good, and I
didn't do anything to the data except run regular btrfs commands and mount
it normally.  That's a wildly unreliable filesystem.

BTW, is there a way to improve delete and copy performance of btrfs?  I'm
getting about 50KB/s-500KB/s (per size of file being deleted) in deleting
and/or copying files on a disc that usually can go about 80MB/s.  I think
it's because they were fragmented.  That implies btrfs is too accepting of
writing data in fragmented style when it doesn't have to.  Almost all the
files on my btrfs partitions are around a gig, or 20 gigs, or a third of a
gig, or stuff like that.  The filesystem is 1.1TB.

Brad


Stephen | 2 Jun 2010 10:57

Quota Support

Sorry for emailing the list about this, but after some googling I
can't seem to find the answer.

I'm just wondering whether subvolumes or snapshots can have quotas imposed
on them.

The wiki says that:

"Subvolumes can be given a quota of blocks, and once this quota is
reached no new writes are allowed."

However, many posts to this mailing list and to other mailing lists seem
to indicate that quotas are not implemented, so I'm wondering if
someone can clear this up.

Thanks Steve

Paul Millar | 2 Jun 2010 13:56

Re: A couple of questions

Hi Mike,

On Monday 31 May 2010 22:33:23 Mike Fedyk wrote:
> On Mon, May 31, 2010 at 11:06 AM, Paul Millar <paul.millar <at> desy.de> wrote:
> > [...] My concern is that, if the server-software doesn't push the
> > client-provided checksum then the FS checksum (plus T-10 DIF/DIX) would
> > not provide a rigorous assurance that the bytes are the same [...]
> 
> Have you taken into account the boundaries of the data checksums?
> Your app may checksum per file or some logical partition in the file
> format. 

I'm thinking specifically of the case when the user creates a file, writes the 
file's contents and closes it;  for us, this is the only use-case when writing 
data.  In this scenario, the checksum would be of the file's complete data 
rather than any particular logical partition.

> Btrfs does the checksum per-extent so unless you keep track
> of where the extent boundaries are, that checksum will be useless to
> the userspace app. 

Sure, this is true with how things are currently.

However, I was hoping that it would be possible to add code within btrfs to 
obtain the checksum over all of the file's data.  Since btrfs knows the 
extent sizes and per-extent checksum values, I believe this is tractable and 
relatively easy.
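
As a rough illustration of why this looks tractable, here is a sketch in C
of combining two per-extent CRCs into the CRC of the concatenated data
without re-reading the data.  It rests on a simplifying assumption: a
plain MSB-first CRC with a zero initial value and no final inversion.  The
crc32c that btrfs actually stores is seeded and inverted differently, so
real code would need extra correction terms.  crc_update(),
crc_shift_zeros() and crc_combine() are hypothetical names, not btrfs
functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POLY 0x04C11DB7u /* assumed generator polynomial */

/* Plain bitwise CRC: MSB first, caller supplies the running state. */
static uint32_t crc_update(uint32_t crc, const uint8_t *p, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint32_t)p[i] << 24;
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x80000000u) ? (crc << 1) ^ POLY : crc << 1;
    }
    return crc;
}

/* Advance a CRC state as if nbytes zero bytes had been appended. */
static uint32_t crc_shift_zeros(uint32_t crc, size_t nbytes)
{
    static const uint8_t zero = 0;

    while (nbytes--)
        crc = crc_update(crc, &zero, 1);
    return crc;
}

/*
 * CRC of A followed by B, given only crc(A), crc(B) and the length of B.
 * Valid because a zero-initialised CRC without final inversion is linear.
 */
static uint32_t crc_combine(uint32_t crc_a, uint32_t crc_b, size_t len_b)
{
    return crc_shift_zeros(crc_a, len_b) ^ crc_b;
}
```

The shift here costs time linear in the extent length; faster
logarithmic-time approaches exist (zlib's crc32_combine() is one example),
but the idea is the same.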

> Also the app would be tied specifically to a storage technology.  No
> matter how great foo might be, not everyone's going to use it.

Paul Millar | 2 Jun 2010 15:40

Re: A couple of questions

On Tuesday 01 June 2010 15:39:52 Martin K. Petersen wrote:
> >>>>> "Paul" == Paul Millar <paul.millar <at> desy.de> writes:
> Paul> My concern is that, if the server-software doesn't push the
> Paul> client-provided checksum then the FS checksum (plus T-10 DIF/DIX)
> Paul> would not provide a rigorous assurance that the bytes are the
> Paul> same.  Without this assurance, corruption could still occur; for
> Paul> example, within the server's memory.
> 
> For DIX we allow integrity metadata conversion.  Once the data is
> received, the server generates appropriate IMD for the next layer.  Then
> the server verifies that the original IMD matches the data buffer.  That
> way there's no window of error.  But obviously the ideal case is where
> the same IMD can be passed throughout the stack without conversion.

I think we may be talking slightly at cross-purposes here: in my case, one of 
the end-points (for "end-to-end data integrity") is a remote computer, that is 
uploading a file with a corresponding checksum.

Please correct me if I'm wrong here, but T10 DIF/DIX refers only to data 
integrity protection from the OS's FS-level down to the block device: a 
userland application doesn't know that it is writing into a FS that is 
utilising DIX with a DIF-enabled storage system.

When a file is uploaded from a remote client to an application with the 
checksum, the app can verify this checksum internally.  However, there's then 
a (logical) gap between userland and FS where data integrity is no longer 
assured.  For example, corruption that occurs after the app has verified the 
checksum value would not be picked up, even with T10 DIX/DIF, since the FS 
would receive and store the already-corrupted data "in good faith".
