Stan Hoeppner | 1 Feb 2012 01:20

Re: File system remain unresponsive until the system is rebooted.

On 1/31/2012 2:50 PM, Dave Chinner wrote:
> On Tue, Jan 31, 2012 at 03:04:18AM -0600, Stan Hoeppner wrote:
>> On 1/30/2012 11:04 PM, Supratik Goswami wrote:
>>> We are using Amazon EC2 instances.
>>
>>                ^^^^^^^^^^
>> I'd have never thought I would see those words on this list, except
>> maybe as a joke, or as an example of one of the worst possible
>> platforms for XFS.
> 
> I don't agree with you there. If the workload works best on XFS, it
> doesn't matter what the underlying storage device is. e.g. if it's a
> fsync heavy workload, it will still perform better on XFS on EC2
> than btrfs on EC2...
> 
>> I wish EC2 had been asked about during the QA session after Dave's
>> presentation.  I'm guessing some laughter would have been involved. ;)
> 
> You'd be wrong about that. There are as many good uses of cloud
> services as there are bad ones, yet the same decisions about storage
> need to be made even when services are remotely hosted....

Maybe I should have elaborated a bit.  My thinking is that workloads
that would require XFS, or benefit most from it, are probably going to
need more guarantees WRT bandwidth and IOPS being available
consistently, vs sharing said resources with other systems in the cloud
infrastructure.  Additionally, you have driven the point home many times
WRT tuning XFS to the underlying hardware, specifically stripe
alignment.  I'd bet alignment would be a bit tricky to achieve in a
cloud environment such as EC2.
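
For reference, the alignment arithmetic itself is simple once the geometry is known; the tricky part on EC2 is that the geometry is not exposed. A minimal sketch, assuming a hypothetical 64 KiB stripe unit across 4 data disks (illustrative numbers only, nothing EC2 documents), converting to the 512-byte-sector units used by the sunit/swidth mount options:

```python
SECTOR_BYTES = 512  # sunit/swidth mount options are expressed in 512-byte sectors

def stripe_geometry(stripe_unit_bytes, data_disks):
    """Return (sunit, swidth) in 512-byte sectors for a striped volume."""
    if stripe_unit_bytes % SECTOR_BYTES != 0:
        raise ValueError("stripe unit must be a multiple of 512 bytes")
    sunit = stripe_unit_bytes // SECTOR_BYTES
    swidth = sunit * data_disks  # full stripe = stripe unit x number of data disks
    return sunit, swidth

# Hypothetical geometry: 64 KiB stripe unit, 4 data disks.
sunit, swidth = stripe_geometry(64 * 1024, 4)
print(f"sunit={sunit} swidth={swidth}")
```

Without knowing the real values for the two arguments, there is nothing sensible to feed into the calculation, which is the point being made about cloud block devices.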

Raghavendra D Prabhu | 1 Feb 2012 01:50

Re: Performance problem - reads slower than writes

Hi,

* On Tue, Jan 31, 2012 at 09:52:10PM +0000, Brian Candler <brian <at> soundmouse.com> wrote:
>On Tue, Jan 31, 2012 at 09:52:05AM -0500, Christoph Hellwig wrote:
>> You don't just read a single file at a time but multiple ones, don't
>> you?
>
>It's sequential at the moment, although I'll do further tests with the -c
>(concurrency) option to bonnie++
>
>> Try playing with the following tweaks to get larger I/O to the disk:
>>
>>  a) make sure you use the noop or deadline elevators
>>  b) increase /sys/block/sdX/queue/max_sectors_kb from its low default
>>  c) dramatically increase /sys/devices/virtual/bdi/<major>:<minor>/read_ahead_kb
>
>Thank you very much: I will do further tests with these.
>
>Is the read_ahead_kb knob aware of file boundaries? That is, is there any
>risk that if I set it too large it would read useless blocks past the end of
>the file?

     The read_ahead_kb knob is used by the memory subsystem 
     readahead code to set the initial readahead to scale from (it 
     uses a dynamic scaling window). It is set by default based on 
     device readahead value (probably obtained in a way similar to 
     hdparm -I). 

     Setting it higher will be beneficial for sequential workloads 
     and the risk you mentioned is not there since it file 

Eric Sandeen | 1 Feb 2012 03:27

Re: [PATCH] xfstests: several 274 fixups

On 1/31/12 4:41 PM, Dave Chinner wrote:
> On Mon, Jan 30, 2012 at 04:27:51PM -0600, Eric Sandeen wrote:
>> This changes quite a few things about 274 to make it more robust
>> and useful.
>>
>> * More comments
>> * Use xfs_io for falloc (not all systems have /usr/bin/fallocate)
>> * use _require_xfs_io_falloc to be sure system & fs support preallocation
>> * Do not remove all of the files in $SCRATCH_MNT/ post-mkfs
>> * Do not remove all of the files in $SCRATCH_MNT/ on completion
>>   (this breaks e2fsck when lost+found/ goes missing)
> 
> FWIW, can't e2fsck be fixed to handle this case?

perhaps some day.

but it'd return "modified" if it adds lost+found ...

I suppose it could just be left missing unless otherwise needed.

> ......
>>  _cleanup()
>>  {
>>  	cd /
>> -	rm -f $SCRATCH_MNT/* $tmp.*
>> +	rm -f $tmp.*
>>  	_scratch_unmount
>>  }
>>  
>> @@ -46,6 +48,7 @@ _cleanup()

Dave Chinner | 1 Feb 2012 04:59

Re: Performance problem - reads slower than writes

On Tue, Jan 31, 2012 at 09:52:10PM +0000, Brian Candler wrote:
> On Tue, Jan 31, 2012 at 09:52:05AM -0500, Christoph Hellwig wrote:
> > You don't just read a single file at a time but multiple ones, don't
> > you?
> 
> It's sequential at the moment, although I'll do further tests with the -c
> (concurrency) option to bonnie++
> 
> > Try playing with the following tweaks to get larger I/O to the disk:
> > 
> >  a) make sure you use the noop or deadline elevators
> >  b) increase /sys/block/sdX/queue/max_sectors_kb from its low default
> >  c) dramatically increase /sys/devices/virtual/bdi/<major>:<minor>/read_ahead_kb
> 
> Thank you very much: I will do further tests with these.
> 
> Is the read_ahead_kb knob aware of file boundaries? That is, is there any
> risk that if I set it too large it would read useless blocks past the end of
> the file?

Yes, readahead only occurs within the file, and won't readahead past
EOF.
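
As a toy model of that behaviour (an illustration only, not the kernel's readahead code; the function name and the example sizes are assumptions), the readahead window can be thought of as being clamped to the end of the file:

```python
def readahead_range(offset, window_bytes, file_size):
    """Return the (start, end) byte range a readahead would cover.

    The end is clamped to file_size, so the range never extends past EOF.
    Returns None when offset is already at or beyond EOF.
    """
    if offset >= file_size:
        return None
    end = min(offset + window_bytes, file_size)
    return offset, end

# A 1 MiB window on a 300 KiB file stops at the file's last byte.
print(readahead_range(0, 1024 * 1024, 300 * 1024))
```

So a large read_ahead_kb costs nothing extra on small files in this model; the window only grows to its full size when the file is big enough to fill it.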

Cheers,

Dave.
-- 
Dave Chinner
david <at> fromorbit.com


Dave Chinner | 1 Feb 2012 05:07

Re: [PATCH] xfstests: several 274 fixups

On Tue, Jan 31, 2012 at 08:27:56PM -0600, Eric Sandeen wrote:
> On 1/31/12 4:41 PM, Dave Chinner wrote:
> > On Mon, Jan 30, 2012 at 04:27:51PM -0600, Eric Sandeen wrote:
> >> -dd if=/dev/zero of=tmp1 bs=1M >/dev/null 2>&1
> >> -dd if=/dev/zero of=tmp2 bs=4K >/dev/null 2>&1
> >> +# Fill the rest of the fs completely
> >> +dd if=/dev/zero of=$SCRATCH_MNT/tmp1 bs=1M >>$seq.full 2>&1
> >> +dd if=/dev/zero of=$SCRATCH_MNT/tmp2 bs=4K >>$seq.full 2>&1
> >>  sync
> >> +# Last effort, use O_SYNC
> >> +dd if=/dev/zero of=$SCRATCH_MNT/tmp3 bs=4K oflag=sync >>$seq.full 2>&1
> >> +# Save space usage info
> >> +echo "Post-fill space:" >> $seq.full
> >> +df $SCRATCH_MNT >>$seq.full 2>&1
> >>  
> >> -dd if=/dev/zero of=test seek=1 bs=4K count=2 conv=notrunc >/dev/null 2>&1
> >> +# Now attempt a write into all of the preallocated space 
> >> +dd if=/dev/zero of=$SCRATCH_MNT/test seek=1 bs=4K count=1024 conv=notrunc >>$seq.full 2>&1
> >>  if [ $? -ne 0 ]
> >>  then
> >> -	echo "fill prealloc range err"
> >> +	echo "fill prealloc range error"
> >>  	status=1
> >>  	exit
> >>  fi
> > 
> > I'd still like to see this write attempt to trigger nasty behaviours
> > like needing to allocate a metadata block for the extent list. I
> > suggested randholes, but perhaps this would be easier:
> > 

yangguoquan | 1 Feb 2012 06:46

RE: xfs: validate inode numbers in file handles correctly--NFS Stale File Handle Again

The handles that appear in wireshark are the same as those in /proc/net/rpc/nfsd.fh/content,
but there is no corresponding directory path, like this:
# * 6 0x06fd0000000000000000000000000000
root <at> Dahua_Storage:~# cat /proc/net/rpc/nfsd.fh/content
#domain fsidtype fsid [path]
* 6 0x00fd0000000000000000000000000000 /mnt/storage_pool/testnfs00
* 6 0x03fd0000000000000000000000000000 /mnt/storage_pool/testnfs03
* 6 0x02fd0000000000000000000000000000 /mnt/storage_pool/testnfs02
* 6 0x01fd0000000000000000000000000000 /mnt/storage_pool/testnfs01
# * 6 0x06fd0000000000000000000000000000
* 6 0x05fd0000000000000000000000000000 /mnt/storage_pool/testnfs05
* 6 0x04fd0000000000000000000000000000 /mnt/storage_pool/testnfs04
# * 6 0x09fd0000000000000000000000000000
# * 6 0x08fd0000000000000000000000000000
# * 6 0x07fd0000000000000000000000000000
 
More info: this problem happens when the storage server is rebooted while the NFS clients are still accessing the mount points.
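
To pick out the handles that have no path, the listing can be filtered mechanically. A minimal sketch, assuming the content file has exactly the layout pasted above (domain, fsidtype, fsid, optional path, with a leading "#" on path-less entries); the function name and the shortened sample are illustrative:

```python
SAMPLE = """\
#domain fsidtype fsid [path]
* 6 0x00fd0000000000000000000000000000 /mnt/storage_pool/testnfs00
# * 6 0x06fd0000000000000000000000000000
* 6 0x05fd0000000000000000000000000000 /mnt/storage_pool/testnfs05
"""

def parse_fh_cache(text):
    """Split cache lines into (resolved, unresolved) lists.

    resolved:   (fsid, path) tuples for entries that list a path
    unresolved: fsid strings for entries shown without a path
    """
    resolved, unresolved = [], []
    for line in text.splitlines():
        fields = line.split()
        # Drop the standalone '#' marker seen on path-less entries.
        if fields and fields[0] == "#":
            fields = fields[1:]
        # Expect: domain, fsidtype, fsid, and optionally a path.
        if len(fields) < 3 or not fields[2].startswith("0x"):
            continue  # header line or anything unrecognised
        if len(fields) >= 4:
            resolved.append((fields[2], fields[3]))
        else:
            unresolved.append(fields[2])
    return resolved, unresolved

resolved, unresolved = parse_fh_cache(SAMPLE)
print("no path for:", unresolved)
```

On the real system the same function could be fed the text read from /proc/net/rpc/nfsd.fh/content.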
> Date: Tue, 24 Jan 2012 12:58:26 -0500
> From: hch <at> infradead.org
> To: ygq51 <at> hotmail.com
> Subject: Re: xfs: validate inode numbers in file handles correctly--NFS Stale File Handle Again
> CC: hch <at> infradead.org; linux-xfs <at> oss.sgi.com; pengxihan <at> gmail.com
>
> On Wed, Jan 04, 2012 at 10:20:31AM +0800, yangguoquan wrote:
> >
> > Yes, They are mount points.
> >
> > /dev/mapper/storage_pool-testnfs00 on /mnt/storage_pool/testnfs00 type xfs (rw)
>
> In that case I really have no idea. Did you make sure you have no
> outstanding "old" filehandles before the fix on clients? How do the
> handles look in wireshark traces?
>
> _______________________________________________
> xfs mailing list
> xfs <at> oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs
_______________________________________________
xfs mailing list
xfs <at> oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

Stan Hoeppner | 1 Feb 2012 08:29

Re: Performance problem - reads slower than writes

On 1/31/2012 2:25 PM, Dave Chinner wrote:
> On Tue, Jan 31, 2012 at 02:16:04PM +0000, Brian Candler wrote:

>> Here we appear to be limited by real seeks. 225 seeks/sec is still very good
> 
> That number indicates 225 IOs/s, not 225 seeks/s.

Yeah, the voice coil actuator and spindle rotation limit the peak
random seek rate of good 7.2k drive/controller combos to about 150/s.
15k drives do about 250-300 seeks/s max.  Simple tool to test max random
seeks/sec for a device:

32bit binary:  http://www.hardwarefreak.com/seekerb
source:        http://www.hardwarefreak.com/seeker_baryluk.c

I'm not the author.  The original seeker program is single threaded.
Baryluk did the thread hacking.  Background info:
http://www.linuxinsight.com/how_fast_is_your_disk.html

Usage:   ./seekerb device [threads]

Results for a single WD 7.2K drive, no NCQ, deadline elevator:

  1 threads Results: 64 seeks/second, 15.416 ms random access time
 16 threads Results: 97 seeks/second, 10.285 ms random access time
128 threads Results: 121 seeks/second, 8.208 ms random access time
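
The two figures on each line are reciprocals of one another: seeks/second is roughly 1000 divided by the access time in milliseconds. A quick check of the numbers above (that the tool truncates rather than rounds is an inference from these three results, not something taken from its source):

```python
def seeks_per_second(access_time_ms):
    """Convert an average random access time in ms to whole seeks per second."""
    return int(1000.0 / access_time_ms)  # truncate, matching the reported figures

# (threads, reported access time in ms) from the results above
for threads, ms in [(1, 15.416), (16, 10.285), (128, 8.208)]:
    print(f"{threads:3d} threads: {seeks_per_second(ms)} seeks/second")
```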

Actual output:
$ seekerb /dev/sda 128
Seeker v3.0, 2009-06-17,
http://www.linuxinsight.com/how_fast_is_your_disk.html
Benchmarking /dev/sda [976773168 blocks, 500107862016 bytes, 465 GB,
476940 MB, 500 GiB, 500107 MiB]
[512 logical sector size, 512 physical sector size]
[128 threads]
Wait 30 seconds.............................
Results: 121 seeks/second, 8.208 ms random access time (52614775 <
offsets < 499769984475)

Targeting array devices (mdraid or hardware, or FC SAN LUN) with lots of
spindles, and/or SSDs should yield some interesting results.
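
For anyone without the binary handy, the core of such a test is small. A minimal single-threaded sketch (not the seeker source; the function name, block size, and duration are assumptions) that issues random reads against a file or device path and reports reads per second. Unlike a real benchmark it does not bypass the page cache, so against a small or recently read target it mostly measures cached reads:

```python
import os
import random
import time

def random_read_rate(path, block_size=512, duration_s=1.0):
    """Issue random block-aligned reads for duration_s seconds; return reads/sec."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end yields the size for regular files and block devices.
        size = os.lseek(fd, 0, os.SEEK_END)
        blocks = size // block_size
        if blocks == 0:
            raise ValueError("target is smaller than one block")
        count = 0
        start = time.monotonic()
        while time.monotonic() - start < duration_s:
            offset = random.randrange(blocks) * block_size
            os.pread(fd, block_size, offset)  # positional read, no separate seek call
            count += 1
        elapsed = time.monotonic() - start
    finally:
        os.close(fd)
    return count / elapsed
```

Calling random_read_rate("/dev/sda", duration_s=30) as root would approximate the single-thread case above, subject to the caching caveat.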

-- 
Stan


Christoph Hellwig | 1 Feb 2012 11:24

Re: [PATCH] xfs_quota: remove calls to XFS_QSYNC

On Wed, Feb 01, 2012 at 08:01:10AM +1100, Dave Chinner wrote:
> So effectively what that says to me is that quota only exports the
> real block usage, even though it internally tracks delalloc
> reservations. Perhaps an additional change to make in this case is
> to fold the reserved blocks into what is reported to the quota
> utilities?
> 
> Indeed, what is exported to userspace via xfs_qm_export_dquot() is
> the information in the dquot core - the on-disk information - so
> perhaps all we need to do is export dqp->q_res_bcount (the count of
> real + reserved blocks) instead of the on-disk info?

That seems like a good idea, given that enforcement takes the
reservation into account.  To retain compatibility for the case of new
userspace and an old kernel I'd have to disable Q_XQUOTASYNC in the
kernel instead of in the tool, though.


Christoph Hellwig | 1 Feb 2012 11:45

Re: [patch] xfs: remove an unneeded NULL check

On Wed, Feb 01, 2012 at 09:56:01AM +1100, Dave Chinner wrote:
> >  	/* lock out background commit */
> >  	down_read(&log->l_cilp->xc_ctx_lock);
> > -	if (commit_lsn)
> > -		*commit_lsn = log->l_cilp->xc_ctx->sequence;
> > +	*commit_lsn = log->l_cilp->xc_ctx->sequence;
> >  
> >  	xlog_cil_insert_items(log, log_vector, tp->t_ticket);
> 
> There's a set of reviewed patches (for 3.3) that change all this
> code. The null check might still be there, but that needs to be
> checked.

Which series is that?  I have to admit I've lost track by now.


Peter Grandi | 1 Feb 2012 12:40

Re: File system remain unresponsive until the system is rebooted.

[ ... ]

>>> We are using Amazon EC2 instances.

>>> [ ... ]  one of the worst possible platforms for XFS.

>> I don't agree with you there. If the workload works best on
>> XFS, it doesn't matter what the underlying storage device is.
>> e.g. if it's a fsync heavy workload, it will still perform
>> better on XFS on EC2 than btrfs on EC2...

There are special cases, but «fsync heavy» is a bit of a bad
example.

In general file system designs are not at all independent of the
expected storage platform, and some designs are far better than
others for specific storage platforms, and vice versa. This goes
all the way back to the 4BSD filesystem being specifically
optimized for rotational latency.

[ ... ]

>> You'd be wrong about that. There are as many good uses of
>> cloud services as there are bad ones,

VMs are not "cloud" services, those are more like remotely
hosted services, used via SOAP/REST. VMs are more like
colocation on the cheap.

>> yet the same decisions about storage need to be made even
>> when services are remotely hosted....

The basic problem with VM platforms is that they have completely
different latency (and somewhat different bandwidth) and
scheduling characteristics from "real" hardware, in particular
the relative costs of several operations are very different than
on "real" hardware, and the design tradeoffs that are good for
"real" hardware may not be relevant or may even be bad for VMs.

In addition VM "disks" can be implemented in crazy ways, like
with sparse files, and those impact severely achievable
performance levels.

> [ ... ] workloads that would require XFS, or benefit most from
> it, are probably going to need more guarantees WRT bandwidth
> and IOPS being available consistently, vs sharing said
> resources with other systems in the cloud infrastructure.

This is almost there, but «consistently» is a bit of an
understatement. It is not just that in VMs resources are
shared and subject to externally induced loads.

What matters is that the storage layer's performance envelope has
roughly the same tradeoffs as those at which a certain design
has been aimed. Even differently shaped hardware, like flash
SSD, can have very different performance envelopes than rotating
disks, or sets of rotating disks. A VM running on its own on
a certain platform still has different latencies and tradeoffs
than the underlying platform.

> Additionally, you have driven the point home many times WRT
> tuning XFS to the underlying hardware, specifically stripe
> alignment.

That as usual only matters for RMW-oriented storage layers, and
we don't really know what storage layer EC2 uses (hopefully not
one with RMW problems as parity RAID is known to be quite ill
suited to VM disks).

[ ... ]

> [ ... ] EC2 is probably bad for the typical workloads where
> XFS best flexes its muscles.

That's probably a good point but not quite the apposite one
here.

In the case raised by the OP, he had a large delay and "forgot"
to say he was running the system under layers (of unknown
structure) of virtualization.

In that case the latency (and bandwidth) profiles of both the
computing and the storage platforms can be very different from
those XFS has been aimed at, and I would not be surprised by
starvation or locking problems. Eventually DaveC pointed out a
known locking one during 'growfs', so not dependent on the
latency profile of the platform.


