Wang Xiaoguang | 25 Jul 09:51 2016

[PATCH v2 0/4] update bytes_may_use timely to avoid false ENOSPC issue

Currently in btrfs, data space reservation does not update
bytes_may_use in btrfs_update_reserved_bytes(); the decrease operation
is delayed until extent_clear_unlock_delalloc(), and for fallocate(2)
it is even delayed until the end of btrfs_fallocate(), which is too
late. Such a delay obviously puts unnecessary pressure on the ENOSPC
system.

So in this patch set, we will remove RESERVE_FREE, RESERVE_ALLOC and
RESERVE_ALLOC_NO_ACCOUNT, and always update bytes_may_use timely.
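
To make the double accounting concrete, here is a toy userspace model
of the problem (field names are borrowed from btrfs_space_info, but the
logic is purely illustrative, not kernel code): while the release of
bytes_may_use is deferred, an allocated extent is counted in both
bytes_used and bytes_may_use, so a concurrent reservation can hit a
false ENOSPC even though enough space is available.

#include <stdio.h>

/* Toy model only; not the real btrfs_space_info. */
struct space_info {
	unsigned long long total_bytes;
	unsigned long long bytes_used;
	unsigned long long bytes_may_use; /* reserved, not yet allocated */
};

static int reserve_data(struct space_info *s, unsigned long long len)
{
	if (s->bytes_used + s->bytes_may_use + len > s->total_bytes)
		return -1;		/* caller sees ENOSPC */
	s->bytes_may_use += len;
	return 0;
}

/* Current behaviour: the extent is allocated, but the reservation is
 * only released much later, so `len` is counted twice meanwhile. */
static void allocate_delayed(struct space_info *s, unsigned long long len)
{
	s->bytes_used += len;
}

/* Behaviour after this series: release the reservation right away. */
static void allocate_timely(struct space_info *s, unsigned long long len)
{
	s->bytes_used += len;
	s->bytes_may_use -= len;
}

int main(void)
{
	struct space_info delayed = { .total_bytes = 100 };
	struct space_info timely  = { .total_bytes = 100 };

	reserve_data(&delayed, 60);
	allocate_delayed(&delayed, 60);
	printf("delayed release: second reserve %s\n",
	       reserve_data(&delayed, 30) ? "hits false ENOSPC" : "succeeds");

	reserve_data(&timely, 60);
	allocate_timely(&timely, 60);
	printf("timely release:  second reserve %s\n",
	       reserve_data(&timely, 30) ? "hits false ENOSPC" : "succeeds");
	return 0;
}

Swapping allocate_delayed() for allocate_timely() is the effect this
series aims for: the second reservation succeeds.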

I have already sent an fstests test case for this issue. [PATCH 4/4]
could be sent as an independent patch, but the bug it fixes is revealed
by the same reproducer script, so I include it here.

Changelog:
v2:
  Fix a trace point issue.

Wang Xiaoguang (4):
  btrfs: use correct offset for reloc_inode in
    prealloc_file_extent_cluster()
  btrfs: divide btrfs_update_reserved_bytes() into two functions
  btrfs: update btrfs_space_info's bytes_may_use timely
  btrfs: should block unused block groups deletion work when allocating
    data space

 fs/btrfs/ctree.h       |   3 +-
 fs/btrfs/disk-io.c     |   1 +
 fs/btrfs/extent-tree.c | 171 ++++++++++++++++++++++++++++---------------------
 fs/btrfs/extent_io.h   |   1 +
(Continue reading)

Wang Xiaoguang | 25 Jul 09:43 2016

[PATCH v2] generic/371: run write(2) and fallocate(2) in parallel

Currently in btrfs there is a problem with fallocate(2)'s data space
reservation: it temporarily occupies more data space than it really
needs, which in turn impacts other operations' data space requests.

This test case runs write(2) and fallocate(2) in parallel, with the
total data space needed by the two operations kept below the
filesystem's free data space, to see whether we get any unexpected
ENOSPC error.
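
As a hedged illustration of the pattern the script drives (not part of
the patch; the actual test is the bash script below, and the file
names and sizes here are made up), the same race can be sketched in C:

#define _GNU_SOURCE		/* for fallocate(2) */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define CHUNK (1L << 20)	/* 1MiB per step */
#define ITERS 512		/* ~512MiB per worker; keep the sum below
				 * the filesystem's free data space */

int main(void)
{
	/* error handling trimmed for brevity */
	if (fork() == 0) {	/* child: fallocate(2) worker */
		int fd = open("falloc_file", O_RDWR | O_CREAT, 0644);

		for (int i = 1; i <= ITERS; i++)
			if (fallocate(fd, 0, 0, CHUNK * i) < 0) {
				perror("fallocate");	/* unexpected ENOSPC */
				_exit(1);
			}
		_exit(0);
	}

	/* parent: write(2) worker */
	static char buf[CHUNK];
	int fd = open("write_file", O_WRONLY | O_CREAT, 0644);
	int status = 0;

	memset(buf, 'x', sizeof(buf));
	for (int i = 0; i < ITERS; i++)
		if (write(fd, buf, sizeof(buf)) < 0) {
			perror("write");		/* unexpected ENOSPC */
			return 1;
		}

	wait(&status);
	return WEXITSTATUS(status);
}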

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
---
v2: adopt Eryu Guan's suggestions to make this reproducer cleaner, thanks
---
 tests/generic/371     | 75 +++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/generic/371.out |  2 ++
 tests/generic/group   |  1 +
 3 files changed, 78 insertions(+)
 create mode 100755 tests/generic/371
 create mode 100644 tests/generic/371.out

diff --git a/tests/generic/371 b/tests/generic/371
new file mode 100755
index 0000000..7955856
--- /dev/null
+++ b/tests/generic/371
@@ -0,0 +1,75 @@
+#! /bin/bash
+# FS QA Test 371
+#
+# Run write(2) and fallocate(2) in parallel. The total data space needed by
+# these operations does not exceed the whole fs free data space, to see whether
(Continue reading)

Kurt Seo | 25 Jul 09:25 2016

Any suggestions for thousands of disk image snapshots?

Hi all,

I am currently running a project building servers with btrfs.
The servers export disk images through iSCSI targets, and the disk
images are generated from btrfs subvolume snapshots.
The maximum number of clients is 500, and each client uses snapshots
of two disk images: the first disk image is about 50GB and the second
is about 1.5TB.
Importantly, the original 1.5TB disk image is mounted through a loop
device and modified in real time - e.g. torrents are continuously
downloaded into it.
Snapshots are made when clients boot up and deleted when the clients
are turned off.
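
For context, this boot/shutdown lifecycle maps directly onto the btrfs
snapshot ioctls. Below is a minimal, hypothetical C sketch (every path
and name in it is made up, and error handling is trimmed):

#include <fcntl.h>
#include <linux/btrfs.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Create <pool>/<snap> as a snapshot of the subvolume at <image>. */
static int snapshot_create(const char *pool, const char *image,
			   const char *snap)
{
	int dirfd = open(pool, O_RDONLY);  /* directory receiving the snap */
	int srcfd = open(image, O_RDONLY); /* subvolume being snapshotted */
	struct btrfs_ioctl_vol_args_v2 args = { .fd = srcfd };
	int ret;

	strncpy(args.name, snap, BTRFS_SUBVOL_NAME_MAX);
	ret = ioctl(dirfd, BTRFS_IOC_SNAP_CREATE_V2, &args);
	close(srcfd);
	close(dirfd);
	return ret;
}

/* Delete <pool>/<snap> when the client shuts down. */
static int snapshot_delete(const char *pool, const char *snap)
{
	int dirfd = open(pool, O_RDONLY);
	struct btrfs_ioctl_vol_args args = { 0 };
	int ret;

	strncpy(args.name, snap, BTRFS_PATH_NAME_MAX);
	ret = ioctl(dirfd, BTRFS_IOC_SNAP_DESTROY, &args);
	close(dirfd);
	return ret;
}

int main(void)
{
	if (snapshot_create("/srv/pool", "/srv/pool/image-1.5T", "client042"))
		perror("snapshot create");
	if (snapshot_delete("/srv/pool", "client042"))
		perror("snapshot delete");
	return 0;
}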

So the server has two original disk images and about a thousand
snapshots in total.
I made a list of the factors that affect the server's performance and
stability:

1. RAID configuration - mdadm RAID vs. btrfs RAID, and their
configuration options.
2. How to format btrfs - nodesize, features.
3. Mount options - nodatacow, compression, and the like.
4. Kernel parameter tuning.
5. Hardware specification.

My current setup is:

1. mdadm RAID10 with a 1024k chunk size and 12 512GB SSDs.
2. nodesize 32k and nothing else.
3. nodatacow, noatime, nodiratime, nospace_cache, ssd, compress=lzo.
4. Ubuntu with a 4.1.27 kernel, without additional configuration.
(Continue reading)

Qu Wenruo | 25 Jul 09:19 2016

[PATCH RFC] btrfs: send: Disable clone detection

This patch disables clone detection in send.

The main problems with clone detection in send are its memory usage and
long execution time.

Clone detection is done by iterating over all backrefs and recording
each backref whose root is the send source.

However, iterating over all backrefs is already quite a bad idea, and
we should never do it in a loop; unfortunately, in-band/out-of-band
dedupe and reflink can easily create a file whose file extents all
point to the same extent.

In that case, btrfs will do the backref walk for the same extent again
and again, until either OOM or a soft lockup is triggered.
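
As an illustration (not part of this patch), such a file can be built
from userspace with the clone-range ioctl; a minimal sketch, assuming a
btrfs mount as the current directory and a block-aligned clone length:

#include <fcntl.h>
#include <linux/btrfs.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define EXTENT_SIZE (128 * 1024)	/* one block-aligned extent */

int main(void)
{
	int fd = open("heavily_reflinked", O_RDWR | O_CREAT | O_TRUNC, 0644);
	static char buf[EXTENT_SIZE];

	memset(buf, 'a', sizeof(buf));
	pwrite(fd, buf, sizeof(buf), 0);  /* one real extent at offset 0 */
	fsync(fd);			  /* make sure it is on disk */

	/* Clone that single extent behind itself thousands of times,
	 * giving a ~1GB file whose extents all share one extent. */
	for (long long i = 1; i < 8192; i++) {
		struct btrfs_ioctl_clone_range_args args = {
			.src_fd = fd,
			.src_offset = 0,
			.src_length = EXTENT_SIZE,
			.dest_offset = i * EXTENT_SIZE,
		};

		if (ioctl(fd, BTRFS_IOC_CLONE_RANGE, &args) < 0) {
			perror("clone range");
			return 1;
		}
	}
	close(fd);
	return 0;
}

Running send over a subvolume holding such a file then walks the same
backref list over and over, as described above.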

So disable clone detection until we find a method that iterates over
the backrefs of an extent only once, just as balance/qgroup does.

Cc: Filipe Manana <fdmanana@gmail.com>
Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/send.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 2db8dc8..eed3f1c 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
(Continue reading)

Goffredo Baroncelli | 24 Jul 13:03 2016

[BTRFS-PROGS][PATCH] Add two new commands: 'btrfs insp physical-find' and 'btrfs insp physical-dump'

Hi all,

the following patches add two new commands: 
1) btrfs inspect-internal physical-find
2) btrfs inspect-internal physical-dump

The aim of these two new commands is to locate (1) and dump (2) the
stripe elements stored on the disks. I developed these two new commands
to simplify the debugging of some RAID5 bugs (but that is another
discussion).

An example of 'btrfs inspect-internal physical-find' is the following:

# btrfs inspect physical-find mnt/out.txt
mnt/out.txt: 0
        devid: 3 dev_name: /dev/loop2 offset: 61931520 type: DATA
        devid: 2 dev_name: /dev/loop1 offset: 61931520 type: OTHER
        devid: 1 dev_name: /dev/loop0 offset: 81854464 type: PARITY
        devid: 4 dev_name: /dev/loop3 offset: 61931520 type: PARITY

In the output above, DATA is the stripe element containing the file's
data. OTHER is a sibling stripe element: it contains data belonging
either to other files or to the same file at a different position. The
two PARITY stripe elements contain the RAID6 parity (P and Q).

It is also possible to pass an offset within the file to inspect.
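
For intuition about why PARITY shows up on different devids at
different offsets, here is a purely illustrative sketch of one
rotating-parity RAID6 layout (this is not the exact placement logic of
btrfs or of these commands):

#include <stdio.h>

int main(void)
{
	const int num_devs = 4;			/* e.g. the four loop devices above */
	const int data_stripes = num_devs - 2;	/* RAID6 reserves two parity slots */

	for (long long stripe_nr = 0; stripe_nr < 4; stripe_nr++) {
		int p = (int)(stripe_nr % num_devs);	/* P rotates per full stripe */
		int q = (p + 1) % num_devs;		/* Q sits next to P */

		printf("full stripe %lld: %d data slots, P on slot %d, Q on slot %d\n",
		       stripe_nr, data_stripes, p, q);
	}
	return 0;
}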

An example of 'btrfs inspect-internal physical-dump' is the following:

# btrfs insp physical-dump mnt/out.txt
mnt/out.txt: 0
(Continue reading)

Tomasz Melcer | 24 Jul 04:03 2016

Force recalculation of a data block checksum

Hi,

I've got a USB-connected HDD with a btrfs partition. The partition 
contains a 1TB file, a disk image. The first `btrfs scrub` after writing 
that file found 3 logical bad blocks that developed somewhere in the 
middle of that file (logs below).

The full area of the btrfs partition can be read without I/O errors, so
I think there are two possible cases: either the data block was written
incorrectly, or an incorrect checksum is stored. The first case is
obviously unrecoverable, but in the second case, fixing the problem
should be as simple as recomputing the checksum of what is already
stored.
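
For what it's worth, the checksum can at least be recomputed by hand to
see which of the two cases applies; a minimal sketch, assuming btrfs's
documented crc32c (Castagnoli) checksum over the 4KiB block:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Bitwise CRC-32C (reversed polynomial 0x82F63B78), kept simple for
 * clarity rather than speed. */
static uint32_t crc32c(uint32_t crc, const uint8_t *buf, size_t len)
{
	while (len--) {
		crc ^= *buf++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78 & -(crc & 1));
	}
	return crc;
}

int main(int argc, char **argv)
{
	uint8_t block[4096];
	FILE *f = argc > 1 ? fopen(argv[1], "rb") : NULL;

	if (!f || fread(block, 1, sizeof(block), f) != sizeof(block)) {
		fprintf(stderr, "usage: %s <file-with-one-4KiB-block>\n", argv[0]);
		return 1;
	}
	/* Standard CRC-32C: seed with all ones, invert the final value
	 * (assumption about the exact seed/inversion convention). */
	printf("crc32c: 0x%08X\n", ~crc32c(~0u, block, sizeof(block)));
	fclose(f);
	return 0;
}

The block itself can be extracted with something like
dd if=/dev/sdd1 bs=512 skip=222940168 count=8 of=block.bin, using the
sector number from the scrub log below.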

How can I ask btrfs to recompute the checksum of a data block as it is
stored on the drive? I don't see any command that performs such an
operation, and I couldn't find anything on the topic on the internet.

Thanks,

Logs:

#v+
[ 7702.964265] BTRFS warning (device sdd1): checksum error at logical
5473719291904 on dev /dev/sdd1, sector 222940168, root 5, inode 1245769,
offset 97110921216, length 4096, links 1 (path: dysk/dysk.bin)
[ 7702.964274] BTRFS error (device sdd1): bdev /dev/sdd1 errs: wr 0,
rd 0, flush 0, corrupt 17, gen 0
[ 7702.964278] BTRFS error (device sdd1): unable to fixup (regular) 
error at logical 5473719291904 on dev /dev/sdd1
[…]
(Continue reading)

Janos Toth F. | 23 Jul 15:20 2016

Re: Btrfs/RAID5 became unmountable after SATA cable fault

It seems I accidentally managed to break my Btrfs/RAID5 filesystem yet
again, in a similar fashion.
This time around I ran into some random libata driver issue (?) instead
of a faulty hardware part, but the end result is quite similar.

I issued the following command (replacing X with the appropriate letter
for every hard drive in the system):
# echo 1 > /sys/block/sdX/device/queue_depth
and ended up with read-only filesystems.
I checked dmesg and saw write errors on every disk (not just those in
the RAID-5).

I tried to reboot immediately, without success. My root filesystem, a
single-disk Btrfs (on an SSD, so it has the "single" profile for both
data and metadata), was unmountable, so the kernel was stuck in a
panic-reboot cycle.
I managed to fix this one by booting from a USB stick and trying
various recovery methods (like mounting with "-o
clear_cache,nospace_cache,recovery" and running "btrfs rescue
chunk-recover") until everything seemed to be fine (it can now be
mounted read-write without error messages in the kernel log, can be
fully scrubbed without errors reported, passes "btrfs check", files can
actually be written and read, etc.).

Once my system was up and running (well, sort of), I realized that my
/data filesystem was also unmountable. I tried the same recovery
methods on this RAID-5 filesystem, but nothing seemed to help (with one
exception among the recovery attempts: the system drive was a small,
fast SSD, so "chunk-recover" was a viable option to try, but this
filesystem consists of huge, slow HDDs - so I tried to run it overnight
as a last resort, but in the morning I found an unresponsive machine
with the process stuck
(Continue reading)

Hendrik Friedel | 23 Jul 13:15 2016

Chances to recover with bad partition table?

Hello,

this morning I was faced with an unusual prompt on my machine.

I found that the partition table of /dev/sda had vanished.

I restored it with testdisk. It found one partition, but I am quite sure
there was a /boot partition in front of it that was not found.

Now, running btrfsck fails:

root@homeserver:~# fdisk -l /dev/sda

WARNING: GPT (GUID Partition Table) detected on '/dev/sda'! The util 
fdisk doesn't support GPT. Use GNU Parted.

Disk /dev/sda: 120.0 GB, 120034123776 bytes
255 heads, 63 sectors/track, 14593 cylinders, total 234441648 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

    Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *     1026048   234440703   116707328   83  Linux
root@homeserver:~# btrfsck /dev/sda1
checksum verify failed on 20987904 found E4E3BDB6 wanted 00000000
checksum verify failed on 20987904 found E4E3BDB6 wanted 00000000
checksum verify failed on 20987904 found E4E3BDB6 wanted 00000000
checksum verify failed on 20987904 found E4E3BDB6 wanted 00000000
(Continue reading)

Chris Murphy | 22 Jul 23:16 2016

qcow2 becomes 37P in size while qemu crashes

Here is the bug write-up so far, which contains most of the relevant details:
https://bugzilla.redhat.com/show_bug.cgi?id=1359325

Here are three teasers to get you to look at the bug:

1.
[root@f24m ~]# ls -lsh /var/lib/libvirt/images
total 57G
1.5G -rw-r-----. 1 qemu qemu 1.5G Jul 21 10:54
Fedora-Workstation-Live-x86_64-24-1.2.iso
1.4G -rw-r--r--. 1 qemu qemu 1.4G Jul 20 13:28
Fedora-Workstation-Live-x86_64-Rawhide-20160718.n.0.iso
4.4G -rw-r-----. 1 qemu qemu 4.4G Jul 22 10:43
openSUSE-Leap-42.2-DVD-x86_64-Build0109-Media.iso
 50G -rw-r--r--. 1 root root  37P Jul 22 13:23 uefi_opensuseleap42.2a3-1.qcow2
196K -rw-r--r--. 1 root root 193K Jul 22 08:46 uefi_opensuseleap42.2a3-2.qcow2
[root@f24m ~]#

Yes, it's using 50G worth of sectors on the drive, but then it's 37
petabytes?! That's really weird (a stat(2) sketch separating the two
numbers follows the df output below).

[root@f24m ~]# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda5       104G   67G   36G  65% /
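
The two numbers ls reports above are the file's apparent size (st_size,
the 37P) and its allocated space (st_blocks, the 50G); a minimal
stat(2) sketch that separates the two for any given file:

#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
	struct stat st;

	if (argc < 2 || stat(argv[1], &st) != 0) {
		perror("stat");
		return 1;
	}
	printf("apparent size: %lld bytes\n", (long long)st.st_size);
	printf("allocated:     %lld bytes\n", (long long)st.st_blocks * 512LL);
	return 0;
}

Run against the qcow2 above, it should report the same huge apparent
size alongside the ~50G actually allocated.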

2.
Btrfs mounts, reads, and writes just fine, with no messages in dmesg
other than the usual mount messages - before, during, and after the
qemu crash and reboot. I rebooted to do an offline btrfs check, which
has no complaints. Scrub has no complaints. Yes, the qcow2 has the +C xattr
(Continue reading)

Sanidhya Solanki | 22 Jul 15:42 2016

[PATCH] btrfs: Change RAID stripesize to a user-configurable option

Add the kernel component for making the RAID stripesize user configurable.
Update the kernel ioctl interface to account for the new options.
Update the existing uses of RAID stripesize in metadata.
Make the stripesize a user-configurable option.
Convert the existing metadata stripesize option into the basis for
this option.
Update the kernel component of RAID stripesize management.
Update the RAID stripe block management.

Signed-off-by: Sanidhya Solanki <lkml.page@gmail.com>
---
 fs/btrfs/ctree.h                | 21 ++++++++++++++++++--
 fs/btrfs/disk-io.c              | 12 ++++++-----
 fs/btrfs/extent-tree.c          |  2 ++
 fs/btrfs/ioctl.c                |  2 ++
 fs/btrfs/raid56.c               | 19 ++++++++++++++++++
 fs/btrfs/scrub.c                |  6 ++++--
 fs/btrfs/super.c                | 12 ++++++-----
 fs/btrfs/volumes.c              | 44 ++++++++++++++++++++++++++++++++++-------
 fs/btrfs/volumes.h              |  3 +--
 include/trace/events/btrfs.h    |  2 ++
 include/uapi/linux/btrfs.h      | 13 ++++++++++--
 include/uapi/linux/btrfs_tree.h | 10 ++++++++--
 12 files changed, 119 insertions(+), 27 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 4274a7b..3fa4723 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2139,6 +2139,25 @@ static inline void btrfs_set_balance_data(struct extent_buffer *eb,
(Continue reading)

