Jon Harrop | 1 Apr 21:19
Favicon

Unusual techniques in HLVM and


I just wrote a blog post detailing some of the techniques I adopted in HLVM
that appear to be unusual:

http://flyingfrogblog.blogspot.co.uk/2012/03/gc-safe-points-mutator-suspensi
on-and.html

In particular, HLVM uses many techniques that are much simpler than existing
alternatives but appear to be just as efficient.

In the process of writing that I thought of an interesting way to avoid
unnecessary pushes to the shadow stack. A previous benchmark indicated that
shadow stack manipulations can account for up to 70% of the total running
time of a program:

http://flyingfrogblog.blogspot.co.uk/2009/03/current-shadow-stack-overheads-
in-hlvm.html

So this theoretically has the potential to dramatically improve HLVM's
performance under certain circumstances (although I don't want to exaggerate
the potential advantages because HLVM's current GC is poorly suited for
allocation-intensive programs like that benchmark).

I wanted to avoid both caller and callee pushing the same value onto the
shadow stack. Obviously, there is no harm in just having the caller push. I
thought of an interesting solution: add the ability to mark a reference type
as "prepushed" and instantiate functions for any combination of prepushed
reference arguments as required by callers. This is a conservative
approximation so references default to non-prepushed. Between the push and
pop of a reference inside the body of a function it becomes prepushed and,
(Continue reading)

Richard Jones | 18 Mar 17:55
Picon
Favicon

Mmmm... 2012 MMnet workshop on Memory Management for Many- and Multicore


I'd be grateful if you could draw this to the attention of any colleagues or students who might be interested

Richard

                              2012 MMnet Workshop

                MMMM: Memory Management for Many- and Multicore

		               www.mm-net.org.uk

                Monday 23rd April 2012, Imperial College London

Call for participation
----------------------

Memory management remains an important research area as managed run-time
environments are increasingly widely adopted. Server requirements for
multi-gigabyte heaps, the ubiquitous use of multicore platforms and the
advent of many-core processors throw up new research challenges. This
one-day workshop aims to bring together UK academia and industry with the
aim of establishing new links and enhancing existing collaboration within
the framework of the UK Memory Management Network.

Participation is welcomed from both academic and industrial sectors.
Topics of interest include but are not limited to:

      * virtual machine implementation
      * garbage collection algorithms
      * performance tuning techniques
(Continue reading)

Jon Harrop | 4 Mar 17:52
Favicon

Simple worked examples

For anyone who is interested, I recently published an article about my
implementation of the Very Concurrent Garbage Collector (VCGC) in F#:

http://fsharpnews.blogspot.com/2012/02/very-concurrent-garbage-collector.htm
l

The idea of prototyping GCs using high-level languages this way seems to be
quite novel. I adopted an allocationless style of programming where the
(preallocated) heap is represented as a value type array so it is completely
unboxed and, consequently, the results will be much more representative of a
production GC implementation. Note that the heap is specialized for the
n-queens solver but the GC algorithm itself is generic.

Regarding the algorithm, the complete VCGC implementation is under 300 lines
of F# code and it achieves submillisecond pause times and throughput
comparable to .NET itself. I instrumented the code with time stamping in
order to visualize the phases the mutator and GC thread were going through,
which proved invaluable when debugging performance issues.

The exchange of information between the mutator and GC threads proved
interesting. My first solutions used message passing but this proved to be
too inefficient. So I replaced this with pairs of collections and the
exchange at barriers would swap over collections in each pair. For example,
there are two free lists with the mutator popping heap blocks off one while
the GC pushes heap blocks onto the other. There are two newly-allocated
collections that the mutator pushes freshly allocated heap blocks onto one
while the GC copies heap blocks from the other newly-allocated collection
(obtained from the mutator at the last synchronization) to its local
collection of allocated heap blocks. Now, the only collection that gets
copied at synchronization is the stack and performance is great.
(Continue reading)

Emery Berger | 30 Nov 23:46
Picon
Favicon
Gravatar

Call For Papers: Wodet-3 - Third Workshop on Determinism and Correctness in Parallel Programming

CALL FOR PAPERS

Wodet-3 - Third Workshop on Determinism and Correctness in Parallel Programming

http://goo.gl/K78CQ
March 3, 2012Co-located with ASPLOS 12, London, England, UK
Unintentional non-determinism is the bane of multithreaded
softwaredevelopment. Defective software might execute correctly
hundreds oftimes before a subtle synchronization bug appears, and when
it does,developers often cannot readily reproduce it while
debugging.Nondeterminism also complicates testing as good coverage
requires botha wide range of program inputs and a large number of
possibleinterleavings for each input. These problems have taken on
renewedurgency as multicore systems have driven parallel programming
tobecome mainstream.
Determinism is emerging as an important research area, ranging
fromtechniques for existing code (including deterministic
executionmodels, parallelizing compilers, and deterministic replay
fordebugging) to new programming models (including deterministic
generalpurpose languages and run-time systems). Deterministic
multiprocessingyields deep open questions in programming languages,
compilers,operating systems, runtime systems and architecture.
While there is a growing consensus that determinism would greatly
helpwith the programmability challenges of multicore systems, there
isstill little consensus on many important questions. What are
theperformance and programmability trade-offs for enforcing
deterministicsemantics with different approaches? Should deterministic
semantics bestrictly enforced or guaranteed only for programs that
are"well-behaved" in certain ways? How can we support
trulynon-deterministic algorithms, where non-determinism is
(Continue reading)

Alex Garthwaite | 30 Nov 21:24
Favicon

RESoLVE 2012 workshop CFP

Hello All!

Below is the CFP for the RESoLVE 2012 workshop (collocated with 
ASPLOS 2012). The theme of the workshop is about coordinating
across layers (application, runtime, OS, hypervisor) and
memory-management is definitely one challenge in this space.
The deadline is December 10th and the workshop is looking
for 6-8-page works-in-progress with an intent to spark discussion
and get early feedback. Please consider submitting. Thanks!

                             -- Alex

                 Virtualized Environments
                      (RESoLVE'12)
     http://www.dcs.gla.ac.uk/conferences/resolve12/
  	
ASPLOS 2012 Workshop, London, UK
March 3rd, 2012
UPDATE: Submission deadline extended to 10 Dec

Introduction:
Today's applications typically target high-level runtime systems and
frameworks. At the same time, the operating systems on which they run
are themselves increasingly being deployed on top of (hardware)
virtual machines. These trends are enabling applications to be
written, tested, and deployed more quickly, while simplifying tasks
such as checkpointing, providing fault-tolerance, enabling data and
computation migration, and making better, more power-efficient use of
hardware infrastructure.

(Continue reading)

Jon Harrop | 16 Nov 21:26
Favicon

Visualizing clumps

I just posted this blog article:

http://flyingfrogblog.blogspot.com/2011/11/what-is-your-garbage-collector.ht
ml

I used Mathematica to visualize the structure of the clumps (weakly
connected components) being collected by my mark region collector whilst
running a list-based n-queens solver. In this case, 84% of unreachable
connected components are isolated 2-element lists and only a handful have
interesting topologies.

I'm curious if anyone can think of a common benchmark that is likely to
produce more interesting garbage. :-)

--

-- 
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com

Jon Harrop | 15 Nov 21:09
Favicon

The LMAX disruptor

People here might find this interesting:

http://flyingfrogblog.blogspot.com/2011/11/lmax-disruptor-and-bakers-treadmi
ll.html

I reread a paper about a new data structure designed for low latency servers
and noticed that it is quite similar to Baker's Treadmill GC (1992). I'd be
very interested to hear how more recent improvements to Baker's Treadmill
might be used to improve upon the LMAX disruptor! :-)

--

-- 
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com

Jon Harrop | 12 Nov 19:57
Favicon

Using bitmasks to accelerate marking

I'm just playing around with some toy GCs I've written in C++ that use an
instrumented list-based n-queens solver as an example mutator. The fastest
I've got so far is a kind of mark-region collector that allocates into a
region and uses two external bitvectors to track the mark bits and allocated
bits, respectively. Sweeping is then just a case of ANDing the two
bitvectors (as I described before).

I haven't profiled it at all but it occurs to me that I might be able to
accelerate the mark phase (which I assume is the bottleneck) significantly
by storing third bitvector where each bit is used to indicate whether the
first pointer in the corresponding heap block points to the heap block
immediately before it. A 2kB look up table could be used to find the length
of the chain of consecutive heap blocks reachable from the first one and
they could all be marked simultaneously using an OR. This is obviously
expected to be effective for my n-queens solver because it uses lots of
little immutable singly-linked lists but I expect it to be effective for
many other purely functional data structures as well.

Has any work been done on this kind of thing before?

--

-- 
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com

Jon Harrop | 12 Nov 13:31
Favicon

Divide and conquer GC algorithms

Reading Richard's excellent book, I was surprised to observe that none of
the major GC algorithms seem to be hierarchical. Some algorithms like the
Train collector, Beltway and Immix amortize costs by grouping heap-allocated
blocks together but they don't seem to use trees to gradually refine the
granularity.

I've also been looking at "dynamic" graph algorithms for the first time.
These algorithms allow a small change in the graph (add/remove one
edge/vertex) to be converted into a small change in their output. The
concept of a topology tree, a data structure used in some dynamic graph
algorithms, caught my attention because this conveys information about the
topology of a graph. For example, you might be able to determine quickly
that a 1/4 partition of the heap contained pointers to another 1/4 partition
but not the other way around so there were no cycles spanning the two
partitions. Could a GC use a topology tree to do some of its work in O(log
n) time? Furthermore, what if a dynamic graph algorithm were used to
incrementally update the GC's topology tree as the write barrier mutates the
topology of the heap?

Do any GC algorithms use divide-and-conquer, trees or yield complexities
logarithmic in the number of pointers in the heap?  Can we estimate how big
a heap would have to be before hierarchical techniques would become
competitively performant?

This reminds me of the debate surrounding the use of trees in file systems
as opposed to flat collections...

On the exact opposite end of the spectrum, has anyone tried using GPGPU to
accelerate garbage collection, e.g. the mark phase?

(Continue reading)

Jon Harrop | 12 Nov 13:09
Favicon

Re: Doubling up pointers

Right, that's exactly what I was thinking. I'm guessing the transition could
occur as a gradual phase transition (aka ragged barrier) in order to avoid
gross synchronization. Perhaps the first phase would write to the left-hand
pointer, the second phase would have mutators write to both left- and
right-hand pointers while the GC threads copied every left-hand pointer over
its right-hand pointer and the third phase would have mutators write just to
right-hand pointers. Once the third phase is reached the left-hand pointers
in the heap contain a snapshot of its topology. Note that this write barrier
would only be expensive during the second phase: in the first and third
phases it is just a single pointer write. Moreover, my measurements indicate
that the .NET write barriers is ~5x slower than a single pointer write so
I'd expect this double-write barrier to be a relatively fast write barrier
even in the worst case. OCaml's write barrier is even slower than .NET's (it
is a functional language). The main disadvantage is doubling the space
consumed by pointers in the heap, of course.

As an aside, a VM with a JIT compiler might avoid the cost of testing the GC
phase in the write barrier by jumping between three different versions of
the compiled code, one specialized for each phase. Failing that, you could
use masks to avoid jumps.

Cheers,
Jon.

From: Andrew Chen [mailto:a_s_c <at> me.com] 
Sent: 14 September 2011 20:24
To: Jon Harrop
Cc: gclist <at> iecc.com
Subject: Re: [gclist] Doubling up pointers

(Continue reading)

Jon Harrop | 12 Nov 12:58
Favicon

Review: The Garbage Collection Handbook

I finally found the time to read Richard Jones' new book and have posted my
review on Amazon:

http://www.amazon.co.uk/review/R2UYL7W7WG10K0/

"This book gives a thorough and detailed review of modern garbage collection
algorithms including the tricky subjects of parallel, concurrent and
real-time collectors and yet remains comprehensible throughout. Moreover,
the information contained within is presented as a fascinating narrative
littered with interesting observations and insightful remarks, making this
book a real gem.

Anyone developing software built upon a garbage collector would do well to
brush up on the theory that underlies this core subject and there is no
better way than to read this excellent book!"

--

-- 
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com


Gmane