Barry K. Nathan | 1 Feb 05:06 2006

Re: [RFC] VM: I have a dream...

On 1/31/06, Al Boldi <a1426z <at>> wrote:
> Faulty, because we are currently running a legacy solution to workaround an
> 8,16,(32) arch bits address space limitation, which does not exist in
> 64bits+ archs for most purposes.

In the early 1990's (and maybe even the mid 90's), the typical hard
disk's storage could theoretically be byte-addressed using 32-bit
addresses -- just as (if I understand you correctly) you are arguing
that today's hard disks can be byte-addressed using 64-bit addresses.

If this was ever going to be practical (on commodity hardware, anyway),
I would have expected someone to try it on a 32-bit PC or Mac when
hard drives were in the 100MB-3GB range... That suggests to me that
there's a more fundamental reason (i.e. something other than lack of
address space) that people stuck with the current scheme.

> There is a lot to gain, for one there is no more swapping w/ all its related
> side-effects.  You're dealing with memory only.  You can also run your fs
> inside memory, like tmpfs, which is definitely faster.  And there may be
> lots of other advantages, due to the simplified architecture applied.

tmpfs isn't "definitely faster". Remember those benchmarks where Linux
ext2 beat Solaris tmpfs?

Also, the only way I see where "there is no more swapping" and
"[y]ou're dealing with memory only" is if the disk *becomes* main
memory, and main memory becomes an L3 (or L4) cache for the CPU [and
as a consequence, main memory also becomes the main form of long-term
(Continue reading)

Al Boldi | 1 Feb 14:58 2006

Re: [RFC] VM: I have a dream...

Thanks for your detailed responses!

Kyle Moffett wrote:
> BTW, unless you have a patch or something to propose, let's take this
> off-list, it's getting kind of OT now.

No patches yet, but even if there were, would they get accepted?

> On Jan 31, 2006, at 10:56, Al Boldi wrote:
> > Kyle Moffett wrote:
> >> Is it necessarily faulty?  It seems to me that the current way
> >> works pretty well so far, and unless you can prove a really strong
> >> point the other way, there's no point in changing.  You have to
> >> remember that change introduces bugs which then have to be located
> >> and removed again, so change is not necessarily cheap.
> >
> > Faulty, because we are currently running a legacy solution to
> > workaround an 8,16,(32) arch bits address space limitation, which
> > does not exist in 64bits+ archs for most purposes.
> There are a lot of reasons for paging, only _one_ of them is/was to
> deal with too-small address spaces.  Other reasons are that sometimes
> you really _do_ want a nonlinear mapping of data/files/libs/etc.  It
> also allows easy remapping of IO space or video RAM into application
> address spaces, etc.  If you have a direct linear mapping from
> storage into RAM, common non-linear mappings become _extremely_
> complex and CPU-intensive.
> Besides, you never did address the issue of large changes causing
> large bugs.  Any large change needs to have advantages proportional
(Continue reading)

Jamie Lokier | 1 Feb 15:38 2006

Re: [RFC] VM: I have a dream...

Al Boldi wrote:
> > Presumably you will want access to more data than you have RAM,
> > because RAM is still limited to a few GB these days, whereas a typical
> > personal data store is a few 100s of GB.
> >
> > 64-bit architecture doesn't change this mismatch.  So how do you
> > propose to avoid swapping to/from a disk, with all the time delays and
> > I/O scheduling algorithms that needs?
> This is exactly what a linear-mapped memory model avoids.
> Everything is already mapped into memory/disk.

Having everything mapped to memory/disk *does not* avoid time delays
and I/O scheduling.  At some level, whether it's software or hardware,
something has to schedule the I/O to disk because there isn't enough RAM.

How do you propose to avoid those delays?

In my terminology, I/O of pages between disk and memory is called
swapping.  (Or paging, or loading, or virtual memory I/O...)

Perhaps you have a different terminology?

> Would you call reading and writing to memory/disk swapping?

Yes, if it involves the disk and heuristic paging decisions.  Whether
that's handled by software or hardware.

> > Applications don't currently care if they are swapped to disk or in
> > physical memory.  That is handled by the OS and is transparent to the
(Continue reading)

Evgeniy Dushistov | 1 Feb 16:40 2006


On Wed, Feb 01, 2006 at 02:46:34AM +0300, Alexey Dobriyan wrote:
> OpenBSD doesn't see "." correctly in directories created by Linux.
The problem is in dir.c:ufs_make_empty, which creates the "." and ".."
entries; in this function i_size isn't updated,
so the resulting directory has zero size.
This patch should solve the problem; can you try it?

Signed-off-by: Evgeniy Dushistov <dushistov <at>>


--- linux-2.6.16-rc1-mm4/fs/ufs/dir.c.orig	2006-02-01 18:29:28.943878250 +0300
+++ linux-2.6.16-rc1-mm4/fs/ufs/dir.c	2006-02-01 18:12:24.043826000 +0300
@@ -539,6 +539,7 @@ int ufs_make_empty(struct inode * inode,
 		return err;

 	inode->i_blocks = sb->s_blocksize / UFS_SECTOR_SIZE;
+	inode->i_size = sb->s_blocksize;
 	de = (struct ufs_dir_entry *) dir_block->b_data;
 	de->d_ino = cpu_to_fs32(sb, inode->i_ino);
 	ufs_set_de_type(sb, de, inode->i_mode);



Evgeniy Dushistov | 1 Feb 21:04 2006


On Wed, Feb 01, 2006 at 02:46:34AM +0300, Alexey Dobriyan wrote:
> Copying files over several KB will buy you infinite loop in
> __getblk_slow(). Copying files smaller than 1 KB seems to be OK.
> Sometimes files will be filled with zeros. Sometimes incorrectly copied
> file will reappear after next file with truncated size.
The problem, as far as I can see, is the very strange code in
balloc.c:ufs_new_fragments: b_blocknr is changed without restraint.

This patch is just a workaround, not a clean solution, but it helps me
copy files larger than 4K. Can you try it and say whether it really helps?

Signed-off-by: Evgeniy Dushistov <dushistov <at>>


--- linux-2.6.16-rc1-mm4/fs/ufs/balloc.c.orig	2006-02-01 22:55:28.245272250 +0300
+++ linux-2.6.16-rc1-mm4/fs/ufs/balloc.c	2006-02-01 22:47:33.455599750 +0300
@@ -241,7 +241,7 @@ unsigned ufs_new_fragments (struct inode
 	struct super_block * sb;
 	struct ufs_sb_private_info * uspi;
 	struct ufs_super_block_first * usb1;
-	struct buffer_head * bh;
+	struct buffer_head * bh, *bh1;
 	unsigned cgno, oldcount, newcount, tmp, request, i, result;
 	UFSD(("ENTER, ino %lu, fragment %u, goal %u, count %u\n", inode->i_ino, fragment, goal, count))
@@ -359,17 +359,23 @@ unsigned ufs_new_fragments (struct inode
 	if (result) {
 		for (i = 0; i < oldcount; i++) {
 			bh = sb_bread(sb, tmp + i);
(Continue reading)

linux | 2 Feb 10:03 2006


Solaris 10 has added a moderately useful new feature...  lseek now
supports whence = 3 (SEEK_DATA) and 4 (SEEK_HOLE).  What these do is
advance the file pointer to the start of the next run of the appropriate
kind past the given (absolute) offset.

This is, of course, to make backing up and copying sparse files more
efficient.

I'm still figuring out the fine details of semantics.  EOF is considered
the start of a hole.  If the seek position is past EOF, they return ENXIO.

I'm still trying to figure out if they search > the given offset or >=.
Reading the code, it actually looks like lseek(fd, 13, SEEK_DATA) will
return 0 on a non-sparse file, because they round down to blocks and
then search by blocks.

Not that this affects the usual case where you start at offset 0 and
alternate SEEK_DATA/SEEK_HOLE to find ranges to copy.

I was just wondering if it's an extension worth adopting.
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo <at>
More majordomo info at

Al Boldi | 2 Feb 13:26 2006

Re: [RFC] VM: I have a dream...

Jamie Lokier wrote:
> If I understand your scheme, you're suggesting the kernel accesses
> disks, filesystems, etc. by simply reading and writing somewhere in
> the 64-bit address space.
> At some level, that will involve page faults to move data between RAM and
> disk.
> Those page faults are relatively slow - governed by the CPU's page
> fault mechanism.  Probably slower than what the kernel does now:
> testing flags and indirecting through "struct page *".

Is there a way to benchmark this difference?




Alan Cox | 2 Feb 16:11 2006

Re: [RFC] VM: I have a dream...

On Maw, 2006-01-31 at 18:56 +0300, Al Boldi wrote:
> So with 64bits widely available now, and to let Linux spread its wings and 
> really fly, how could tmpfs merged w/ swap be tweaked to provide direct 
> mapped access into this linear address space?

Why bother? You can already create a private large file and mmap it if
you want to do this, and you will get better performance than being
smeared around swap with everyone else.

Currently swap means your data is mixed in with other stuff. Swap could
preallocate each vma when running in limited overcommit modes, and it
would run a lot faster if it did, but you would pay a lot in
flexibility and efficiency, as well as needing a lot more swap.

Far better to let applications wanting to work this way do it
themselves. Just mmap and the cache balancing and pager will do the rest
for you.


Badari Pulavarty | 2 Feb 17:12 2006

[PATCH 0/3] VFS changes to collapse all the vectored and AIO support


This work was originally suggested & started by Christoph Hellwig,
when Zack Brown tried to add vectored support for AIO. This series
of changes collapses all the vectored IO support into a single
file-operation method, aio_read/aio_write.

Christoph & Zack, comments/suggestions? If you are happy with the
work, can you add your Signed-off-by or Ack?

Here is the summary:

[PATCH 1/3] Vectorize aio_read/aio_write methods

[PATCH 2/3] Remove readv/writev methods and use aio_read/aio_write

[PATCH 3/3] Zack's core aio changes to support vectored AIO.

To Do/Issues:

1) Since aio_read/aio_write are vectorized now, nfs AIO+DIO and
usb/gadget need to be modified to handle vectors. Is that needed?
For now, each handles only a single vector. Christoph, should I
loop over all the vectors?

2) AIO changes need careful review & could be cleaned up further.
Zack, can you take a look at those ?

3) Ben's suggestion of kernel iovec to hold precomputed information
(Continue reading)

Badari Pulavarty | 2 Feb 17:14 2006

[PATCH 1/3] Vectorize aio_read/aio_write methods

This patch vectorizes the aio_read() and aio_write() methods to prepare
for collapsing all the vectored operations into one interface,
aio_read()/aio_write().

Attachment (aiovector.patch): text/x-patch, 30 KiB