Dimitris Andreou | 1 Apr 2010 01:28
Picon

Re: Re: Vector and Seq



2010/4/1 Jesper Nordenberg <megagurka <at> yahoo.com>
Daniel Sobral skrev 2010-03-31 23:59:

O(log32 n) is different than O(log n). You only ignore multiplicative
constant factors and constant factors. Perhaps you meant to say that
O(log 32 * n) == O(log n)?

You really need to read up on big O notation. log32(n) = log(n) / log(32), so O(log32(n)) = O(log(n) / log(32)) = O(log(n)).


As for O() notation being irrelevant for length limited data types,
given that all computers can only deal with limited size collections in
their memories, does that make O() notation irrelevant for in-memory
algorithms? That's non-sensical, and untrue anyway.

Big O notation is a mathematical concept to describe for example running time and memory usage of algorithms when the number elements grows towards infinity. If you have a limited maximum number of elements it can only be used as a rough estimate to compare data structures and algorithms.


Right now, 2^30 elements on an in-memory collection of any kind is
downright impossible on 32 bits JVM, and very tough on current 64 bits
system. When the access time ratio between 100 and 2^30 elements is
about 1/2, it doesn't make any sense nit-picking the difference. The
slowest rate would have been good enough even for the smaller
collection, given the other nice properties of this collection, and it
is by this token that we can say it is O(1). If you want _the_ fastest
access speed possible, then just go with arrays.

I'm not arguing that Vector is a useful and fast implementation, but if we agree that it's branching factor is at most 32, element access time is O(log n) not O(1) as the number of elements grows towards infinity. If you talk about a length limited implementation, the big O notation is not applicable.

Now, enter the world of computers, where everything is limited, yet we somehow manage to make use of big oh notation!

While theory is important, one should also keep an eye to practical considerations. Here is a well known example of that. Disjoint sets - the optimal solution is O(ma(m, n)), where a(m, n) is the inverse Ackerman's function. Theoretically, this is by no means O(1). Yet, "for all practical purposes, a(m, n) is a constant not larger than four" (quoting Tarjan). Four! Even if we had a problem the size of the universe, this would remain four! This important insight is not apparent through the glasses of big oh notation, which could not run any deeper than a (completely meaningless) comparison of O(1) and O(a(m, n)). These are mere tools, one has to know when it makes sense to use them.

Dimitris

PS:  Random linguistic fact: "Big Omicron" hides an oxymoron in Greek. It literally means "big oh, small". "Big Omega" is also fun: "Big oh, big" (micron = small, mega = big). Very big indeed.


/Jesper Nordenberg


Daniel Sobral | 1 Apr 2010 01:30
Picon
Gravatar

Re: Re: Vector and Seq



On Wed, Mar 31, 2010 at 7:43 PM, Jesper Nordenberg <megagurka <at> yahoo.com> wrote:
Daniel Sobral skrev 2010-03-31 23:59:

O(log32 n) is different than O(log n). You only ignore multiplicative
constant factors and constant factors. Perhaps you meant to say that
O(log 32 * n) == O(log n)?

You really need to read up on big O notation. log32(n) = log(n) / log(32), so O(log32(n)) = O(log(n) / log(32)) = O(log(n)).

Nah, just on logs.
 


As for O() notation being irrelevant for length limited data types,
given that all computers can only deal with limited size collections in
their memories, does that make O() notation irrelevant for in-memory
algorithms? That's non-sensical, and untrue anyway.

Big O notation is a mathematical concept to describe for example running time and memory usage of algorithms when the number elements grows towards infinity. If you have a limited maximum number of elements it can only be used as a rough estimate to compare data structures and algorithms.

Everyone has a limited number of elements, unless you happen to own a real turing machine, but whatever.
 


Right now, 2^30 elements on an in-memory collection of any kind is
downright impossible on 32 bits JVM, and very tough on current 64 bits
system. When the access time ratio between 100 and 2^30 elements is
about 1/2, it doesn't make any sense nit-picking the difference. The
slowest rate would have been good enough even for the smaller
collection, given the other nice properties of this collection, and it
is by this token that we can say it is O(1). If you want _the_ fastest
access speed possible, then just go with arrays.

I'm not arguing that Vector is a useful and fast implementation, but if we agree that it's branching factor is at most 32, element access time is O(log n) not O(1) as the number of elements grows towards infinity. If you talk about a length limited implementation, the big O notation is not applicable.

/Jesper Nordenberg




--
Daniel C. Sobral

I travel to the future all the time.
Blair Zajac | 1 Apr 2010 04:15
Gravatar

Splitting scala.collections.immutable into .persistent and .immutable

Please excuse any misused terms here and even suggesting this change before 2.8 
is released :), but here goes:

After reading this thread today on Google Guava:

http://groups.google.com/group/guava-discuss/msg/94af57687167c443

and after writing an (unreleased) wrapper to some of the Google 
Collections/Guava classes it appears that it could be useful to distinguish 
between immutable and persistent data structures:

http://en.wikipedia.org/wiki/Persistent_data_structure

The immutable Google Collection/Guava classes are not persistent so when I 
wrapped them I overrode the + and - operators to throw a 
UnsupportedOperationException because they would be very inefficient to build an 
entire new collection to add or subtract a single item and I wanted to ensure 
people wouldn't use such an inefficient operation.

For my server based code there are many places where I build an immutable data 
structure and don't care about the previous versions of it that were built to 
generate the final version and plus with the Google collections there is no 
synchronization at all so I don't need to pay for that either.

So it seems like further separating the Scala Collections from mutable and 
immutable into immutable, persistent and mutable may be useful.  The persistent 
traits would extend the immutable traits and would supply the relatively 
efficient + and - operators.

Blair

adriaN kinG | 1 Apr 2010 04:40
Picon
Favicon

RE: Splitting scala.collections.immutable into .persistent and .immutable

> it could be useful to distinguish
> between immutable and persistent data structures:
>
> http://en.wikipedia.org/wiki/Persistent_data_structure

Sigh. If only "persistent" didn't also mean "persistent"...

A


Hotmail is redefining busy with tools for the New Busy. Get more from your inbox. Sign up now.
Jesper Nordenberg | 1 Apr 2010 08:32
Picon
Favicon

Re: Splitting scala.collections.immutable into .persistent and .immutable

Blair Zajac skrev 2010-04-01 04:15:
> So it seems like further separating the Scala Collections from mutable
> and immutable into immutable, persistent and mutable may be useful. The
> persistent traits would extend the immutable traits and would supply the
> relatively efficient + and - operators.

The problem is that just because a collection is persistent doesn't mean 
that all operations are efficient. Take Scala List for example, it's 
typically considered persistent, but most operations on it is horribly 
inefficient. So I really don't see how a persistent interface would be 
useful in practise.

/Jesper Nordenberg

Tiark Rompf | 1 Apr 2010 13:52
Picon
Picon
Favicon

Re: Re: Vector and Seq

O(1), O(log32 n), O(log n) are all valid characterizations, as are O(n), O(n^2), O(n^n). And they all don't
tell you much because big-O is just an upper bound and it hides a constant that can be 0.00001 as well as 100000.

The important piece of information is that iterating over a large vector by indexed access can be about 10
times slower than using an iterator. Ismael Juma posted some nice performance diagrams here:
http://blog.juma.me.uk/2009/10/26/new-jvm-options-and-scala-iteration-performance/ and you
can also compare numbers directly here: http://lamp.epfl.ch/~rompf/vector2/vector-summary.txt
(look for BenchBVectorForeachFast and BenchBVectorIndexedFast). Of course it also depends on whether
-optimize is able to inline things well.

The reason why Vector is not an IndexedSeq is that IndexedSeqLike implements most basic collection
operations using indexed access. This is the fastest method for arrays and strings, but not for Vectors.
So if Vector was an IndexedSeq, we would have had to override most of the methods from IndexedSeqLike with
verbatim copies of the corresponding IterableLike code, which would hardly be a good design.

- Tiark
James Iry | 1 Apr 2010 16:33
Picon
Gravatar

Re: Splitting scala.collections.immutable into .persistent and .immutable

Probably not all that useful. Take List for example.


List is persistent for tail and :: (prepend)
List is not persistent for init and :+(append)
List is half persistent for ++ (concatenate)

Other structures will make different tradeoffs.  It's really a slight misnomer to say a structure is persistent - we can only say it is or isn't persistent under a particular operation.


On Wed, Mar 31, 2010 at 7:15 PM, Blair Zajac <blair <at> orcaware.com> wrote:
Please excuse any misused terms here and even suggesting this change before 2.8 is released :), but here goes:

After reading this thread today on Google Guava:

http://groups.google.com/group/guava-discuss/msg/94af57687167c443

and after writing an (unreleased) wrapper to some of the Google Collections/Guava classes it appears that it could be useful to distinguish between immutable and persistent data structures:

http://en.wikipedia.org/wiki/Persistent_data_structure

The immutable Google Collection/Guava classes are not persistent so when I wrapped them I overrode the + and - operators to throw a UnsupportedOperationException because they would be very inefficient to build an entire new collection to add or subtract a single item and I wanted to ensure people wouldn't use such an inefficient operation.

For my server based code there are many places where I build an immutable data structure and don't care about the previous versions of it that were built to generate the final version and plus with the Google collections there is no synchronization at all so I don't need to pay for that either.

So it seems like further separating the Scala Collections from mutable and immutable into immutable, persistent and mutable may be useful.  The persistent traits would extend the immutable traits and would supply the relatively efficient + and - operators.

Blair


Dimitris Andreou | 1 Apr 2010 16:46
Picon

Re: Splitting scala.collections.immutable into .persistent and .immutable

I can't understand what you mean by "persistent", "not persistent", "half persistent" here. Who defines these terms and where?


(Partially/fully/confluently persistent, as defined by Robert Tarjan and others).

2010/4/1 James Iry <jamesiry <at> gmail.com>
Probably not all that useful. Take List for example.

List is persistent for tail and :: (prepend)
List is not persistent for init and :+(append)
List is half persistent for ++ (concatenate)

Other structures will make different tradeoffs.  It's really a slight misnomer to say a structure is persistent - we can only say it is or isn't persistent under a particular operation.


On Wed, Mar 31, 2010 at 7:15 PM, Blair Zajac <blair <at> orcaware.com> wrote:
Please excuse any misused terms here and even suggesting this change before 2.8 is released :), but here goes:

After reading this thread today on Google Guava:

http://groups.google.com/group/guava-discuss/msg/94af57687167c443

and after writing an (unreleased) wrapper to some of the Google Collections/Guava classes it appears that it could be useful to distinguish between immutable and persistent data structures:

http://en.wikipedia.org/wiki/Persistent_data_structure

The immutable Google Collection/Guava classes are not persistent so when I wrapped them I overrode the + and - operators to throw a UnsupportedOperationException because they would be very inefficient to build an entire new collection to add or subtract a single item and I wanted to ensure people wouldn't use such an inefficient operation.

For my server based code there are many places where I build an immutable data structure and don't care about the previous versions of it that were built to generate the final version and plus with the Google collections there is no synchronization at all so I don't need to pay for that either.

So it seems like further separating the Scala Collections from mutable and immutable into immutable, persistent and mutable may be useful.  The persistent traits would extend the immutable traits and would supply the relatively efficient + and - operators.

Blair



Rex Kerr | 1 Apr 2010 17:03
Picon
Gravatar

Re: Splitting scala.collections.immutable into .persistent and .immutable

I think it depends on whether you require that persistence cost less than copying everything to really count.

In that context, James' comments make sense:

val a = Nil
val b = 1 :: a
val c = 2 :: b
c.tail == b  // Of course they're the same
c.tail eq b  // Returns true => persistent, we didn't copy everything

val a = Nil
val b = 1 :: a
val c = b :+ 2
c.init eq b  // Of course they're the same
c.init eq b  // Returns false; we actually copied everything twice here :(
// This isn't what I'd call persistent, at least not if I'm distinguishing it from immutable!

val a = Nil
val b = 1 :: a
val c = 2 :: a
val d = b ++ c
d.drop(1) == c  // True, of course
d.drop(1) eq c  // True, c wasn't copied, yay!
d.take(1) == b  // True, of course
d.take(1) eq b  // False, we actually copied twice :(
// This is parsimonious in c but not have in b => "half-persistent"

Whether we call the failures "not persistent" or "persistent but maximally inefficient", the point is that saying that a _persistent_ data structure is usefully distinct from an _immutable_ data structure relies upon deeper knowledge of which operations you're going to support.  You can't really map "performs better but is safe" onto "persistent", but the former is what library users are more likely to care about.

  --Rex


On Thu, Apr 1, 2010 at 10:46 AM, Dimitris Andreou <jim.andreou <at> gmail.com> wrote:
I can't understand what you mean by "persistent", "not persistent", "half persistent" here. Who defines these terms and where?

(Partially/fully/confluently persistent, as defined by Robert Tarjan and others).

2010/4/1 James Iry <jamesiry <at> gmail.com>

Probably not all that useful. Take List for example.

List is persistent for tail and :: (prepend)
List is not persistent for init and :+(append)
List is half persistent for ++ (concatenate)

Other structures will make different tradeoffs.  It's really a slight misnomer to say a structure is persistent - we can only say it is or isn't persistent under a particular operation.


On Wed, Mar 31, 2010 at 7:15 PM, Blair Zajac <blair <at> orcaware.com> wrote:
Please excuse any misused terms here and even suggesting this change before 2.8 is released :), but here goes:

After reading this thread today on Google Guava:

http://groups.google.com/group/guava-discuss/msg/94af57687167c443

and after writing an (unreleased) wrapper to some of the Google Collections/Guava classes it appears that it could be useful to distinguish between immutable and persistent data structures:

http://en.wikipedia.org/wiki/Persistent_data_structure

The immutable Google Collection/Guava classes are not persistent so when I wrapped them I overrode the + and - operators to throw a UnsupportedOperationException because they would be very inefficient to build an entire new collection to add or subtract a single item and I wanted to ensure people wouldn't use such an inefficient operation.

For my server based code there are many places where I build an immutable data structure and don't care about the previous versions of it that were built to generate the final version and plus with the Google collections there is no synchronization at all so I don't need to pay for that either.

So it seems like further separating the Scala Collections from mutable and immutable into immutable, persistent and mutable may be useful.  The persistent traits would extend the immutable traits and would supply the relatively efficient + and - operators.

Blair




Dimitris Andreou | 1 Apr 2010 17:31
Picon

Re: Splitting scala.collections.immutable into .persistent and .immutable

2010/4/1 Rex Kerr <ichoran <at> gmail.com>
I think it depends on whether you require that persistence cost less than copying everything to really count.

The thing is, "persistent" (as in persistent data structures) says/requires nothing about performance. Even a persistent structure with slow operations is persistent. Overloading the meaning of technical terms with ad-hoc interpretations makes the terms less useful. 

scala.List is a <fully persistent data structure>. Simple. One can even look up the term and immediately and precisely understand what I'm talking about. "On a separate account, some operations of it are fast / O(1) and some operations are slow / O(N)". Also simple. 
 

In that context, James' comments make sense:

val a = Nil
val b = 1 :: a
val c = 2 :: b
c.tail == b  // Of course they're the same
c.tail eq b  // Returns true => persistent, we didn't copy everything

val a = Nil
val b = 1 :: a
val c = b :+ 2
c.init eq b  // Of course they're the same
c.init eq b  // Returns false; we actually copied everything twice here :(
// This isn't what I'd call persistent, at least not if I'm distinguishing it from immutable!

val a = Nil
val b = 1 :: a
val c = 2 :: a
val d = b ++ c
d.drop(1) == c  // True, of course
d.drop(1) eq c  // True, c wasn't copied, yay!
d.take(1) == b  // True, of course
d.take(1) eq b  // False, we actually copied twice :(
// This is parsimonious in c but not have in b => "half-persistent"

Whether we call the failures "not persistent" or "persistent but maximally inefficient", the point is that saying that a _persistent_ data structure is usefully distinct from an _immutable_ data structure relies upon deeper knowledge of which operations you're going to support.  You can't really map "performs better but is safe" onto "persistent", but the former is what library users are more likely to care about.

  --Rex



On Thu, Apr 1, 2010 at 10:46 AM, Dimitris Andreou <jim.andreou <at> gmail.com> wrote:
I can't understand what you mean by "persistent", "not persistent", "half persistent" here. Who defines these terms and where?

(Partially/fully/confluently persistent, as defined by Robert Tarjan and others).

2010/4/1 James Iry <jamesiry <at> gmail.com>

Probably not all that useful. Take List for example.

List is persistent for tail and :: (prepend)
List is not persistent for init and :+(append)
List is half persistent for ++ (concatenate)

Other structures will make different tradeoffs.  It's really a slight misnomer to say a structure is persistent - we can only say it is or isn't persistent under a particular operation.


On Wed, Mar 31, 2010 at 7:15 PM, Blair Zajac <blair <at> orcaware.com> wrote:
Please excuse any misused terms here and even suggesting this change before 2.8 is released :), but here goes:

After reading this thread today on Google Guava:

http://groups.google.com/group/guava-discuss/msg/94af57687167c443

and after writing an (unreleased) wrapper to some of the Google Collections/Guava classes it appears that it could be useful to distinguish between immutable and persistent data structures:

http://en.wikipedia.org/wiki/Persistent_data_structure

The immutable Google Collection/Guava classes are not persistent so when I wrapped them I overrode the + and - operators to throw a UnsupportedOperationException because they would be very inefficient to build an entire new collection to add or subtract a single item and I wanted to ensure people wouldn't use such an inefficient operation.

For my server based code there are many places where I build an immutable data structure and don't care about the previous versions of it that were built to generate the final version and plus with the Google collections there is no synchronization at all so I don't need to pay for that either.

So it seems like further separating the Scala Collections from mutable and immutable into immutable, persistent and mutable may be useful.  The persistent traits would extend the immutable traits and would supply the relatively efficient + and - operators.

Blair






Gmane