Bill Studenmund | 1 Feb 2003 01:11

Re: DEV_B_SIZE

On Fri, 31 Jan 2003, Steve Byan wrote:

> I keep getting a response that reads like "we'll detect the larger
> block size and run with it".  I'm concerned that I'm not being clear
> that IDEMA is thinking of proposing a backward-compatibility mode with
> the presumption that it will work fine (albeit slowly) with existing
> binaries, i.e. code that hasn't been modified to be aware of the larger
> block size.
>
> If you think there are no functional problems with this
> backwards-compatibility scenario, including during recovery (fsck or
> journal roll-forward), I'd be happy to hear a clear "no problem".

I think Stephan Uphoff hit on the main issues. I think there are functional
problems with this, but that it may be useful in some situations. It just
needs a BIG warning.

Note I am assuming that if there's an error writing a 512-byte sector the
full 4k sector will have issues. If that is avoided (say only the 512-byte
area actually has an issue) then things are fine.

I think the main place that problems will arise is that methods to reduce
error exposure won't necessarily work. Methods that try to resist
single-sector errors, say by making multiple copies of data, will need to
know that the single-sector error size (how much data goes away) is 4k, not
512 bytes. Exactly how many programs use these methods is not something I
know, so I can't tell you exactly what the exposure is.
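
To make that exposure concrete, here is a minimal sketch (Python, purely
illustrative) of the check a multiple-copy scheme would need: do two copies,
addressed in 512-byte logical blocks, land in the same 4k physical sector?
The function names and the first_aligned_lba parameter are assumptions made
for the example, not anything from the IDEMA proposal.

```python
LOGICAL_BLOCK = 512       # bytes per logical block seen by the host
PHYSICAL_SECTOR = 4096    # assumed bytes per physical sector on the media
BLOCKS_PER_SECTOR = PHYSICAL_SECTOR // LOGICAL_BLOCK  # 8


def physical_sector(lba, first_aligned_lba=0):
    """Index of the physical sector that holds logical block `lba`.

    `first_aligned_lba` is a hypothetical knob: a logical block number at
    which a physical sector boundary falls (0 for an aligned drive).
    """
    return (lba - first_aligned_lba) // BLOCKS_PER_SECTOR


def copies_share_sector(lba_a, lba_b, first_aligned_lba=0):
    """True if a single physical-sector failure would destroy both copies."""
    return (physical_sector(lba_a, first_aligned_lba)
            == physical_sector(lba_b, first_aligned_lba))
```

A scheme that keeps its second copy in the adjacent 512-byte block (say
blocks 16 and 17) survives a 512-byte failure but not a 4k one; moving the
copy to block 24 or later puts it in a different physical sector.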

The fact that the errors from a 4k re-write failing are not unheard of
isn't the issue. phk is right that that just looks like multiple sectors

Julian Elischer | 1 Feb 2003 01:13

Re: DEV_B_SIZE


On Fri, 31 Jan 2003, David Schultz wrote:

> Thus spake Julian Elischer <julian <at> elischer.org>:
> > > contents of the sector are there.  I think we would need to
> > > implement journalling to ensure integrity if hard drives were
> > > likely to corrupt sectors on power failure.  (How often do they do
> > > this right now, and how often would they with 4K sectors?)
> > 
> > 
> > In this case the journal would have to include not only the block being
> > written, but also the data on each side of it that may be in the same 4k.
> > That implies a read..
> 
> If you had to do that, then nearly every write would be a
> read-modify-write cycle.  It would be far less painful
> to use 4K blocks or larger and align filesystem blocks
> to disk sectors.

exactly..

But this is a case where "a filesystem using 512 byte blocks
would behave significantly differently with one of these drives"

which is what he was asking.
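
As a rough illustration of the point above (a sketch only, assuming 512-byte
logical blocks, 4k physical sectors, and physical sectors starting at logical
block 0), this is the range a journal entry would have to cover for a given
write, and the test for whether the write turns into a read-modify-write.
The function names are made up for the example.

```python
LOGICAL_BLOCK = 512
PHYSICAL_SECTOR = 4096   # assumed physical sector size
BLOCKS_PER_SECTOR = PHYSICAL_SECTOR // LOGICAL_BLOCK  # 8


def covering_extent(lba, count):
    """Expand a write of `count` logical blocks at `lba` to whole physical
    sectors. Returns (start_lba, block_count) of the expanded range."""
    start = (lba // BLOCKS_PER_SECTOR) * BLOCKS_PER_SECTOR
    # Round the exclusive end up to the next physical sector boundary.
    end = ((lba + count + BLOCKS_PER_SECTOR - 1)
           // BLOCKS_PER_SECTOR) * BLOCKS_PER_SECTOR
    return start, end - start


def needs_read_modify_write(lba, count):
    """True if the write covers only part of some physical sector, so the
    neighbouring logical blocks have to be read before it can be logged."""
    return covering_extent(lba, count) != (lba, count)
```

A one-block write at logical block 10 expands to blocks 8 through 15, so
seven blocks have to be read first. A write of 8 blocks starting at block 8
needs no read at all, which is the aligned-4K case David describes.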

> 

Terry Lambert | 1 Feb 2003 01:32

Re: DEV_B_SIZE

Steve Byan wrote:
> There's a notion afoot in IDEMA to enlarge the underlying physical
> block size of disks to 4096 bytes while keeping a 512-byte logical
> block size for the interface. Unaligned accesses would involve either a
> read-modify-write or some proprietary mechanism that provides
> persistence without the latency cost of a read-modify-write.
> 
> Performance issues aside, it occurs to me that hiding the underlying
> physical block size may break many careful-write and
> transaction-logging mechanisms, which may depend on no more than one
> block being corrupted during a failure. In IDEMA's proposal, a power
> failure during a write of a single 512-byte logical block could result
> in the corruption of the full 4K block, i.e. reads of any of the
> 512-byte logical blocks in that 4K physical block  would return an
> uncorrectable ECC error.
> 
> I'd appreciate hearing examples where hiding the underlying physical
> block size would break a file system, database, transaction processing
> monitor, or whatever.  Please let me know if I may forward your reply
> to the committee. Thanks.

UFS directory operations are on the basis of physical disk blocks,
which are assumed to be DEVBSIZE in size (512b).  Minimally, the I/O
path would be broken by this change by changing the atomic unit size
to 4096.

The reason this would break is that the atomic write guarantee is
used to ensure that single-sector changes are recorded atomically.
This is important in rename operations from a short name to a longer
name, where the new name is allocated as a hard link in the new block;

Terry Lambert | 1 Feb 2003 01:44

Re: DEV_B_SIZE

Julian Elischer wrote:
> I presume that if such a drive were made, there would be some way to
> identify it?
> 
> It would be very easy to configure a filesystem to have a minimum
> writable unit size of 4k, and I assume that doing so would be
> slightly advantageous (no read/modify/write). It would however
> be good if we could easily identify when doing so was a good idea.

Substantial modifications would be required to the UFS directory
management code to support both old and new disks in the same
machine with the same FS code.

Assuming that was addressed by making the DEVBSIZE define into a
variable based on the underlying device, there's the problem of
device concatenation.  Your devices would have to be made up of
homogeneous components, too, so once you got them to coexist with
old disks, you would still not be able to get them to aggregate
with them, in, e.g., a RAID 0, and maybe not in any RAID set.
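
A small sketch of what a per-device block size would force a concatenation
or RAID layer to check. This is Python for illustration only; the function
names are invented, and treating a mixed set at the least common multiple of
its sector sizes is an assumption for the example, not a claim that such a
set would actually work.

```python
from functools import reduce
from math import gcd


def is_homogeneous(sector_sizes):
    """True if every component reports the same sector size."""
    return len(set(sector_sizes)) == 1


def common_atomic_unit(sector_sizes):
    """Smallest size in bytes that is a whole number of sectors on every
    component (the least common multiple of the sizes).

    `sector_sizes` must be a non-empty list of positive integers.
    """
    return reduce(lambda a, b: a * b // gcd(a, b), sector_sizes)
```

For components of 512, 512 and 4096 bytes the set is not homogeneous and the
common unit is 4096, so the 512-byte members would have to be driven as if
they were 4k devices. Whether existing RAID code can do that is exactly the
open question.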

> I'd say that this means that the drive should hold the active 4k block
> in nvram or something..

This would be very useful, but unlikely in the extreme, I think,
because of the associated costs.  8-(.  But it would be very, very
useful.

-- Terry

To Unsubscribe: send mail to majordomo <at> FreeBSD.org

Terry Lambert | 1 Feb 2003 02:10

Re: DEV_B_SIZE

Nathan Hawkins wrote:
> You might want to talk with Veritas. I'm pretty sure their Volume
> Manager's log subdisks assume 512-byte sectors.

Yes, this is true.  It would cause a problem for VXFS, at least
the VXFS whose source code I disked around with for USL's use on
UnixWare; almost all the directory entry management code is
verbatim from the USL UFS sources.

I know that AIX *would not* have a problem on the old HPFS, but the
OS/2 HPFS might have a problem.  I think Solaris, and anyone else
using a UFS derived FS would probably have a problem with directory
entry management, and for those areas I've already noted.  I don't
know if the NXFS I wrote for Novell's NetWare for UNIX product is
still in use anywhere, or not, these days, but if it is, then it
would have a problem, too, both in directory ops, and in secondary
inode management for EA's and resource forks.

The SGI XFS people, Novell, and the GFS people would also be good
ones to ask for input.  Microsoft and Apple, too, if it weren't
obvious.  8-).

> More generally, what impact would this have on existing RAID
> implementations, hardware or software? This is a potentially more
> damaging impact than filesystem semantics.

The real question is sector sparing, when it comes to that, and
whether it's on 4K boundaries or not, etc..  For the most part,
RAID that does parity should not care, but RAID 0 and 1 may be
a problem during a power failure, unless PHK's issue about the

Terry Lambert | 1 Feb 2003 02:19

Re: DEV_B_SIZE

phk <at> freebsd.org wrote:
> In message <Pine.BSF.4.21.0301311144370.45015-100000 <at> InterJet.elischer.org>,
> Julian Elischer writes:
> >One thing I thought of is that it is not uncommon to 'dd' an entire
> >filesystem from one partition to another.
> >If we create a filesystem that is 'aligned' and we copy it to be
> >'unaligned', we'd have a sudden performance drop for no immediately
> >obvious reason. What was one write, would become a 2-sector read,
> >modify and 2-sector write. Especially when copying from one failing
> >drive to another with slightly different characteristics.
> 
> If you run dd without bs=ALOT you deserve bad throughput.

I think he means that the performance of the resulting FS, if
it had expectations of running on a 4K block size, and got a
512b one instead, would be unexpected (e.g. the only difference
between the disks is a "Q" or "R" at the end of the disk model
number, etc.).

The real answer, if that's what you mean, Julian, is that the
FS is not likely to be transportable between the devices, or,
minimally, from a 512b to a 4K, because of the existing data
not having taken the 4K alignment issues into account (e.g.
directories would be an even multiple of 512b in length, rather
than an even multiple of 4K in length).  From a 4K to a 512b,
there might also be an offset issue, if they were not treated
internally as if they were 512b on 4K systems, for data storage,
and only treated as 4K for atomicity.
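
To put numbers on Julian's "one write becomes a 2-sector read, modify and
2-sector write" case, here is a minimal Python sketch that counts how many
4k physical sectors a single filesystem block overlaps, given the byte
offset at which the partition starts. The function and parameter names are
hypothetical, and the 63-sector offset below is only an illustrative value.

```python
PHYSICAL_SECTOR = 4096   # assumed physical sector size in bytes


def physical_sectors_touched(partition_offset, fs_block_index,
                             fs_block_size=4096):
    """Number of physical sectors overlapped by one filesystem block.

    partition_offset: byte offset of the partition start on the disk.
    fs_block_index:   index of the filesystem block within the partition.
    """
    start = partition_offset + fs_block_index * fs_block_size
    end = start + fs_block_size          # exclusive end of the block
    first = start // PHYSICAL_SECTOR
    last = (end - 1) // PHYSICAL_SECTOR
    return last - first + 1
```

With the partition starting at byte 0 every 4K filesystem block touches one
physical sector. Copy the same filesystem to a partition that starts 63
512-byte blocks in (byte 32256) and every block straddles two, which is the
performance drop with no immediately obvious reason.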

My recommendation would be to indicate doing this is no longer

Greg 'groggy' Lehey | 1 Feb 2003 03:27

Track buffering (was: DEV_B_SIZE)

On Friday, 31 January 2003 at 12:03:44 -0500, Steve Byan wrote:
>
> On Friday, January 31, 2003, at 11:50  AM, phk <at> freebsd.org wrote:
>> It was my impression that already many drives write entire tracks
>> as atomic units, at least we have had plenty of anecdotal evidence
>> to this effect ?
>
> I'm not aware of any SCSI or ATA disks which do this; certainly no
> Maxtor disk does. Count-key-data mainframe disks can be formatted to do
> so, but such disks probably don't run Unix. Caching in ATA disks might
> lead one to believe that the disk could corrupt an entire track, in the
> sense that a panic (aka bluescreen) or a power-failure would cause all
> pending writes in its buffer to be lost, but even in ATA-land I don't
> believe a power failure would result in more than one disk block
> returning an uncorrectable read error.

A couple of years back I did some power fail testing on IBM IDE
drives.  On one occasion I managed to blow out a whole range of
sectors (about 80), which I attributed to trashing a track buffer.

Greg
--
See complete headers for address and phone numbers

To Unsubscribe: send mail to majordomo <at> FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message

Stephan Uphoff | 1 Feb 2003 05:02

i386 pmap bug


If a (user space) Page Table Page (PTP) is no longer used, because the last
valid Page Table Entry (PTE) has been removed, the associated page structure 
is put on the free page list before the associated TLB for the PTP 
(TLB that caches the PDE) is flushed.

Any modification of the freed page (interrupt, multiprocessor) might cause 
loading of invalid TLB entries that could:
	- cause immediate problems for multiprocessor systems.
	- stay in the TLB even after the call to pmap_tlb_shootnow()
	  (and as such cause problems later).

This is not only a problem for multiprocessor systems as Intel states
in one of their "Pentium II Processor Application Notes":
	Memory Ordering On Dynamic Execution (Pentium Pro Family) Processors
	2.7. Page Table Walking Accesses:
	     [...]
	  *  Page table walks can occur at any time, randomly
	     [...]
	     Page table walking to satisfy TLB (Translation Lookaside
	     Buffer) misses can be performed speculatively and out-of-order;
	     page table walks are subject to speculative cacheability.
	     [...]

The easiest way to fix this problem would be to add extra calls
to pmap_tlb_shootnow() before each call of uvm_pagefree(ptp).
However, this is clearly not desirable for performance.

I am willing to work on a better solution if there is any interest.
(and if I am not stepping on someone's toes).

David Laight | 1 Feb 2003 09:44

Re: DEV_B_SIZE

The only reason I can see for supporting 512-byte reads is to allow
them to be used as system disks without requiring a BIOS update.

I suspect that the only reason that the BSD systems don't support
sector sizes other than 512 is a lack of test media.
Indeed someone has recently gone through the NetBSD code getting
it to work with (IIRC) 1k blocks for a specific disk.

With a test sample the ffs support would be fixed in a few days,
and probably backported to recent releases within a few weeks.

No one using Windows will care :-) You could lock the ATA bus
a few times a day and they'd just reset and continue. :-)

	David

-- 
David Laight: david <at> l8s.co.uk


Wojciech Puchar | 1 Feb 2003 10:58

Re: DEV_B_SIZE

> them to be used as system disks without requiring a BIOS update.
>
> I suspect that the only reason that the BSD systems don't support
> sector sizes other than 512 is a lack of test media.

This is not true. Older SCSI drives allow formatting with 1K sectors,
CD-ROMs are 2K and usually EMULATED as 512b by NetBSD, and magneto-opticals
are up to 4KB (and don't work in NetBSD because of that).

> Indeed someone has recently gone through the NetBSD code getting
> it to work with (IIRC) 1k blocks for a specific disk.
>
> With a test sample the ffs support would be fixed in a few days,
> and probably backported to recent releases within a few weeks.
>
> No one using Windows will care :-) You could lock the ATA bus
exactly
