Kent Overstreet | 3 Dec 07:27 2011

Re: New version up with fix for md and other block devices

Just reproduced it - raid10 did the trick.

I'll try and debug it this weekend.

On Tue, Nov 29, 2011 at 12:45 AM, Kent Overstreet
<koverstreet@...> wrote:
> On Tue, Nov 29, 2011 at 04:30:22PM +0800, Brad Campbell wrote:
>> On 29/11/11 15:54, Kent Overstreet wrote:
>> >Ok, that's weird. It shouldn't be able to register the second time
>> >because the first register opens it exclusively, and the second open
>> >will fail with -EBUSY.
>>
>> Can reproduce it at will here.
>
> I don't doubt you, just surprised. I'll try it out first thing tomorrow.
> No reason I shouldn't be able to reproduce it.
>
>> I doubt it's the driver, as the RAID is on the same card as the single
>> drive I tested with last time.
>
> Ok, raid10 is the one I didn't test - I tried raid0 and 6 and those
> worked, but each raid layer has its own code to process bios, so it
> sounds like there's a corner case we're tripping over.
>
>> I can probably try some other RAID levels this evening if it would
>> help. I trashed the RAID recently anyway so I need to re-build it
>> from scratch.
>
> Let me see if I can reproduce it first, I'll try it first thing in the
> morning. Hopefully it'll be something easy.
[...]

Kent Overstreet | 6 Dec 04:45 2011

Re: New version up with fix for md and other block devices

Fixed the raid10 issue - it's working for me, and the code's up. Haven't
fixed the duplicate registration issue yet, though - I'll take a look at
that next...
Kent Overstreet | 6 Dec 05:02 2011

Re: New version up with fix for md and other block devices

Argh. I spoke too soon, it just exploded after the second bonnie run.
The block layer is badly in need of some cleanup... *grumble*

On Mon, Dec 5, 2011 at 7:45 PM, Kent Overstreet
<kent.overstreet@...> wrote:
> Fixed the raid10 issue - it's working for me, and the code's up. Haven't
> fixed the duplicate registration issue yet, though - I'll take a look at
> that next...
>
Kent Overstreet | 6 Dec 05:41 2011

Re: New version up with fix for md and other block devices

So, it should work in writethrough mode. I discovered a really
annoying issue with background writeback that's going to take me a bit
to decide how to solve...

On Mon, Dec 5, 2011 at 8:02 PM, Kent Overstreet
<kent.overstreet@...> wrote:
> Argh. I spoke too soon, it just exploded after the second bonnie run.
> The block layer is badly in need of some cleanup... *grumble*
>
> On Mon, Dec 5, 2011 at 7:45 PM, Kent Overstreet
> <kent.overstreet@...> wrote:
>> Fixed the raid10 issue - it's working for me, and the code's up. Haven't
>> fixed the duplicate registration issue yet, though - I'll take a look at
>> that next...
>>
>
Kent Overstreet | 6 Dec 06:11 2011

Possible changes to bio cloning and some related stuff

So, I finally got around to debugging various bcache on md issues, and I
ran into a rather sticky problem:

bio_alloc() can fail if nr_iovecs > BIO_MAX_PAGES. That itself is not a
problem, but then when a bio is cloned it's always done by cloning the
_entire_ original bio vec, from 0 to max_vecs - not the range from
bi_idx to bi_vcnt.

Basically, whenever bcache generates some io internally it uses a single
bio to describe the entire io - regardless of whether or not the bio
would be too big for the underlying device; it then splits the bio as
many times as need be when it's actually submitted.
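
To illustrate, the split-at-submit idea looks roughly like this (a
simplified sketch, not the actual bcache code - the helper name and the
assumption that the split point lands on a bio_vec boundary are mine):

/*
 * Sketch only: carve the first @sectors off @bio, returning a new bio
 * for the front and leaving @bio pointing at the remainder. Field
 * names match the 2011-era struct bio; assumes the split point falls
 * on a bio_vec boundary (real code must also split inside a vec).
 */
static struct bio *bio_split_front(struct bio *bio, unsigned sectors)
{
	struct bio *front;
	unsigned idx = bio->bi_idx, bytes = sectors << 9;

	/* walk vecs until we've covered @sectors worth of data */
	while (bytes && idx < bio->bi_vcnt)
		bytes -= bio->bi_io_vec[idx++].bv_len;

	/* note bio_clone() copies the _whole_ original vec - see below */
	front = bio_clone(bio, GFP_NOIO);
	front->bi_vcnt = idx;
	front->bi_size = sectors << 9;

	/* advance the original past the front we just carved off */
	bio->bi_sector += sectors;
	bio->bi_size   -= sectors << 9;
	bio->bi_idx     = idx;

	return front;
}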

This works beautifully for dumb drivers - I'm actually planning on
making my code generic and integrating it with the block layer, so that
the same approach could easily be used by other code that generates
bios; it would allow a _lot_ of code to be deleted from the kernel.

But for stacking drivers, the mere existence of a bio with max_vecs >
BIO_MAX_PAGES breaks things, regardless of how many pages are actually
being used in this bio.

So, IMO __bio_clone(), bio_clone_mddev(), and whatever other code does
this ought to be changed to copy only the range from bi_idx to bi_vcnt
from the original bio - it'd make cloning consistent with how bios are
used elsewhere. Thoughts? The actual patches should be trivial; it'll
mostly just be a matter of grepping around and finding everything, I
think.
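
Concretely, the change might look something like this (an untested
sketch against the 2011-era struct bio fields, not an actual patch):

void __bio_clone(struct bio *bio, struct bio *bio_src)
{
	/* copy only the active range [bi_idx, bi_vcnt), not 0..max_vecs */
	memcpy(bio->bi_io_vec,
	       bio_src->bi_io_vec + bio_src->bi_idx,
	       (bio_src->bi_vcnt - bio_src->bi_idx) *
			sizeof(struct bio_vec));

	bio->bi_sector	= bio_src->bi_sector;
	bio->bi_bdev	= bio_src->bi_bdev;
	bio->bi_flags	|= 1 << BIO_CLONED;
	bio->bi_rw	= bio_src->bi_rw;
	bio->bi_vcnt	= bio_src->bi_vcnt - bio_src->bi_idx;
	bio->bi_idx	= 0;	/* the clone's vec starts at the front */
	bio->bi_size	= bio_src->bi_size;
}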
NeilBrown | 6 Dec 06:36 2011

Re: Possible changes to bio cloning and some related stuff

On Mon, 5 Dec 2011 21:11:01 -0800 Kent Overstreet <koverstreet@...>
wrote:

> So, I finally got around to debugging various bcache on md issues, and I
> ran into a rather sticky problem:
> 
> bio_alloc() can fail if nr_iovecs > BIO_MAX_PAGES. That itself is not a
> problem, but then when a bio is cloned it's always done by cloning the
> _entire_ original bio vec, from 0 to max_vecs - not the range from
> bi_idx to bi_vcnt.
> 
> Basically, whenever bcache generates some io internally it uses a single
> bio to describe the entire io - regardless of whether or not the bio
> would be too big for the underlying device; it then splits the bio as
> many times as need be when it's actually submitted.
> 
> This works beautifully for dumb drivers - I'm actually planning on
> making my code generic and integrating it with the block layer, so that
> the same approach could easily be used by other code that generates
> bios; it would allow a _lot_ of code to be deleted from the kernel.

Sounds promising.

> 
> But for stacking drivers, the mere existence of a bio with max_vecs >
> BIO_MAX_PAGES breaks things, regardless of how many pages are actually
> being used in this bio.
> 
> So, IMO __bio_clone(), bio_clone_mddev(), and whatever other code does
> this ought to be changed to copy only the range from bi_idx to bi_vcnt
> from the original bio -
[...]

Kent Overstreet | 6 Dec 07:01 2011

Re: New version up with fix for md and other block devices

Ok, it looks like as long as your cache's bucket size is no greater
than 1 MB, everything should work, including writeback. Looking forward
to hearing if it works for you :)

On Mon, Dec 5, 2011 at 8:41 PM, Kent Overstreet
<kent.overstreet@...> wrote:
> So, it should work in writethrough mode. I discovered a really
> annoying issue with background writeback that's going to take me a bit
> to decide how to solve...
>
> On Mon, Dec 5, 2011 at 8:02 PM, Kent Overstreet
> <kent.overstreet@...> wrote:
>> Argh. I spoke too soon, it just exploded after the second bonnie run.
>> The block layer is badly in need of some cleanup... *grumble*
>>
>> On Mon, Dec 5, 2011 at 7:45 PM, Kent Overstreet
>> <kent.overstreet@...> wrote:
>>> Fixed the raid10 issue - it's working for me, and the code's up. Haven't
>>> fixed the duplicate registration issue yet, though - I'll take a look at
>>> that next...
>>>
>>
Kent Overstreet | 6 Dec 09:22 2011

Quick bcache benchmark

I've been very remiss in posting benchmarks; this isn't much, but if
anyone has suggestions for what they'd like to see, I'll see if I can
run it.

This is on an old Corsair Nova - bcache can go something like 10x
faster, but this is what I have at home. The profile is still
interesting, though.

The benchmark is 4k random O_DIRECT reads on a 16 GB file, all in cache
- the idea is to stress the b+tree.

Also, the backing device is an md raid10 - so that's working, provided
you format your cache with buckets no greater than 1 MB.
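
(The job file ~/rw4k isn't shown in the thread; judging from the fio
output below it would look something like this - a reconstruction, and
the test file path is a guess:)

; reconstruction of ~/rw4k based on the output below;
; the filename is hypothetical - the actual path isn't shown
[randwrite]
rw=randread
bs=4k
ioengine=libaio
iodepth=64
direct=1
size=16g
filename=/mnt/testfile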

root@utumno:/mnt# perf record -afg fio ~/rw4k
randwrite: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio 1.59
Starting 1 process
Jobs: 1 (f=1): [r] [100.0% done] [69914K/0K /s] [17.7K/0  iops] [eta 00m:00s]
randwrite: (groupid=0, jobs=1): err= 0: pid=1247
  read : io=16384MB, bw=68713KB/s, iops=17178 , runt=244169msec
  cpu          : usr=5.66%, sys=22.93%, ctx=4198688, majf=0, minf=85
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued r/w/d: total=4194367/0/0, short=0/0/0

Run status group 0 (all jobs):
   READ: io=16384MB, aggrb=68712KB/s, minb=70361KB/s, maxb=70361KB/s, mint=244169msec, maxt=244169msec

Disk stats (read/write):
[...]

Kent Overstreet | 6 Dec 12:56 2011

Re: Quick bcache benchmark

On Tue, Dec 06, 2011 at 11:39:57AM +0100, Bostjan Skufca wrote:
> Random write test?

Sure.

That Corsair was giving me /terrible/ write performance, so I pulled
the Intel SSD out of my other machine (unregistered the cache from the
backing device and attached the new SSD, all without unmounting the
filesystem :)

Write performance with the Intel is not /awesome/, but much more
reasonable:

root@utumno:/mnt# perf record -afg fio ~/rw4k
randwrite: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio 1.59
Starting 1 process
Jobs: 1 (f=1): [w] [100.0% done] [0K/98365K /s] [0 /24.2K iops] [eta 00m:00s]
randwrite: (groupid=0, jobs=1): err= 0: pid=1560
  write: io=16384MB, bw=89547KB/s, iops=22386 , runt=187359msec
  cpu          : usr=3.94%, sys=14.82%, ctx=300435, majf=0, minf=19
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued r/w/d: total=0/4194367/0, short=0/0/0

Run status group 0 (all jobs):
  WRITE: io=16384MB, aggrb=89547KB/s, minb=91696KB/s, maxb=91696KB/s, mint=187359msec, maxt=187359msec

Disk stats (read/write):
[...]

Bostjan Skufca | 6 Dec 15:10 2011

Re: Quick bcache benchmark

Nice, 22k iops for random write is not bad at all (especially compared
to spinning disks :)

I have a couple of questions - can you please confirm that I'm
understanding bcache correctly:

1. When you issue that many random write requests, they get written to
   SSD first. Then they are slowly propagated from SSD to spinning
   disk, right? In original order, or is the order optimized?
2. What about when I unregister bcache from a device? Does it flush
   changes from SSD to platter?
3. Same question as (2), but for unmounting a drive?
4. If the machine crashes, will bcache replay changes from SSD to
   platter at mount time?
5. Does it export the number of writes that are pending on SSD, via
   some /proc or /sys interface?
6. Is the read cache hot or cold at boot time?

I know that's overkill for the wording "a couple of questions", sorry :)

b.

On 6 December 2011 12:56, Kent Overstreet
<kent.overstreet@...> wrote:
>
> On Tue, Dec 06, 2011 at 11:39:57AM +0100, Bostjan Skufca wrote:
> > Random write test?
>
> Sure.
>
> That Corsair was giving me /terrible/ write performance, so I pulled
> the Intel SSD out of my other machine (unregistered the cache from the
> backing device and attached the new SSD, all without unmounting the
> filesystem :)
>
[...]

