Kostya Serebryany | 17 Apr 14:21 2014

multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)


The current design of -fprofile-instr-generate has the same fundamental flaw 
as GCC's old gcov instrumentation: contention on counters. 
A trivial synthetic test case was described here: 

For the problem to appear we need to have a hot function that is simultaneously executed 
by multiple threads -- then we will have high contention on the racy profile counters.

Such a situation is not necessarily frequent, but when it happens, 
-fprofile-instr-generate becomes barely usable due to a huge slowdown (5x-10x).

An example is the multi-threaded vp9 video encoder.

cd libvpx/
F="-no-integrated-as -fprofile-instr-generate"; CC="clang $F" CXX="clang++ $F" LD="clang++ $F" ./configure
make -j32 
time ./vpxenc -o /dev/null -j 8 akiyo_cif.y4m  

When running single-threaded, -fprofile-instr-generate adds a reasonable ~15% overhead
(8.5 vs 10 seconds).
When running with 8 threads, it has 7x overhead (3.5 seconds vs 26 seconds).

I am not saying that this flaw is a showstopper, but with the continued move
towards multithreading it will hurt more and more users of coverage and PGO.
AFAICT, most of our PGO users simply cannot run their software in single-threaded mode,
and some of them surely have hot functions running in all threads at once. 

At the very least we should document this problem, but it would be better to try to fix it. 

Some ideas:

- per-thread counters. Solves the problem, but at a huge cost in RAM per thread.
- 8-bit per-thread counters, dumped into the central counters on overflow. 
- per-CPU counters (not portable; requires a very modern kernel with lots of patches).
- sharded counters: each counter is represented as N counters sitting in different cache lines. Every thread accesses the counter with index TID%N. Solves the problem partially; larger values of N work better, but then again it costs RAM.
- reduce contention on hot counters by not incrementing them once they are big enough:
   if (counter < 65536) counter++;  This reduces the accuracy though. Is that bad for PGO?
- self-cooling logarithmic counters: if ((fast_random() % (1 << counter)) == 0) counter++;

Other thoughts?


LLVM Developers mailing list
LLVMdev <at> cs.uiuc.edu         http://llvm.cs.uiuc.edu
reed kotler | 16 Apr 23:06 2014

adding comment

Is there a simple way to add a comment in the machine instructions of a 
basic block?

Ideally something that can be used with the machine instruction builder.


Weiming Zhao | 16 Apr 21:02 2014

[ARM64] use_iterator in ARM64AddressTypePromotion.cpp



In ARM64AddressTypePromotion::propagateSignExtension(Instructions &SExtInsts):

  while (!Inst->use_empty()) {
    Value::use_iterator UseIt = Inst->use_begin();  // <-- should we use user_iterator / user_begin() here?
    Instruction *UseInst = dyn_cast<Instruction>(*UseIt);
    assert(UseInst && "Use of sext is not an Instruction!");
    UseInst->setOperand(UseIt->getOperandNo(), SExt);
    ...





I don’t have a test case. I just saw it when I was browsing the code.





Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation


Lee Hammerton | 16 Apr 15:37 2014

X86 mmx movq disassembler fail

0x0f 0x6f 0xc8


0x0f 0x7f 0xc1

Both should disassemble to movq %mm0, %mm1 (AT&T syntax).

However, llvm 3.4 at least does not recognise the second variant as being a valid instruction.

We are currently compiling the latest source in case it has been fixed. If not, could someone take a look or recommend how to fix it?


Sri | 16 Apr 13:55 2014

Importance of VMKit JIT function cache

      VMKit's JIT has a function cache that stores compiled IR code: as soon as a method is compiled, it is stored in the function cache. My question is, who uses this compiled information from the function cache? Since we are JITting, LLVM translates the source code into native code and executes it. So how does this function cache relate to the JIT?


Konrad Anheim | 16 Apr 12:50 2014

LowerFormalArguments: info on aggregates


All arguments processed in LowerFormalArguments seem to already be 
lowered to machine types. In addition, the isByVal flag seems to be 
set for each argument passed on the stack, and so cannot be used to 
indicate a struct/union.

Is there a way to tell whether an argument is of a simple type or of an 
aggregate (within LowerFormalArguments)? Is there a better way of 
processing the stack alignment of incoming arguments? This information would 
be needed for correct stack alignment of structs vs. simple types in 
AArch64 big-endian mode.

David Blaikie | 16 Apr 06:05 2014

PassRegistry locking scheme

So, long story short I came across some problematic locking:

PassRegistry uses a ManagedStatic<sys::SmartRWMutex<true>> to handle
safe registration, and, importantly, deregistration.

The problem is that ManagedStatic's dtor isn't exactly safe: if the
object is destroyed, then you try to access it again, ManagedStatic
simply constructs a new object and returns it.

This is relied upon to ensure safe/non-racy deregistration of
passes... but what happens during destruction? The mutex is destroyed
and replaced with another, completely different mutex... that doesn't
seem like a safe locking scheme.

Does any of this make sense? Am I missing something about how this is safe?

Long story:

This is how I came across this bug:

* While removing manual memory management in PassRegistry I noticed a
rather arcane piece of Pimpl usage
* The pimpl used a void*, so I set about moving the declaration of the
pimpl type into the outer type as is the usual case
* Then I wondered why it was a pimpl at all - so I tried to remove it
* (aside: to handle safe locking around destruction I added an
Optional<SmartReadLock> as the first member, initialized that in the
dtor, and let it hold the lock through the destruction of the other
members... )
* But then I realized that PassRegistry::removeRegistrationListener
relied on testing that the pimpl was null to safely no-op after the
PassRegistry had been destroyed (undefined order of destruction & all
that jazz)
* so I added a boolean to track destruction state - but this did not
fix the issue
* then, while talking it out, I realized what was really happening:
the PassRegistry had been destroyed and the ManagedStatic around it,
instead of returning the destroyed object, had simply built a new one
in its place with a still null pimpl (the pimpl was lazy, clearly part
of this cunning scheme) - and thus the removeRegistrationListener call
was a no-op, except to construct and leave around an extra, empty,
PassRegistry in the post-destroyed ManagedStatic...
* which got me thinking: what about the lock, which was also a
ManagedStatic... which is how I arrived at the above observation
Huaibo Wang | 16 Apr 00:45 2014

llvm help

hi, comrades,

I have two questions:
1) Dependence loop in SelectionDAG.

There should be no loop in a DAG, but I did encounter a situation where there was one.

An example is a BR node that targets a node after it in the chain.

To the BR node, the target node is an operand.
To the target node, the BR node is in the "Chain" pointed to by operand 0.

Is this wrong?

2) Chain
in some nodes, operand 0 is a Chain.
Martell Malone | 15 Apr 20:43 2014

[PATCH] SEH exceptions on Win64


I'd like to submit a patch to match the clang patch on the front end.

The front end doesn't need this patch to work but it's still important.
This is mostly based on work done by Kai from redstar.de.

Could I get some feedback on this?
I'm not sure whether emitting the register names will affect MSVC.

Many Thanks

Martell Malone
Attachment (llvm-win64-exceptions.patch): application/octet-stream, 22 KiB
Attachment (llvm-emit-reg-names.patch): application/octet-stream, 2370 bytes
Carback, Richard T., III | 15 Apr 21:16 2014

[RFC] TableGen DAGISel Backend Documentation

This is part of the ongoing effort led by Renato and Sean Silva, here: http://llvm.org/docs/TableGen/BackEnds.html


I’ve been working on the DAGISel backend documentation, and I have posted notes here:




This is a rough cut, but we decided it would be best to seek feedback early so that any misunderstandings on my/our part are resolved before I get too far. My notional outline for a final DAGISel section is as follows:


- Start with a background section showing how the backend is used in the instruction selection process.

- Introduce the MatcherTable, and show before and after DAGs to illustrate what it actually does.

- Show how to generate an "unoptimized" matcher table and walk through the actions it takes to match/emit.

- Walk through CodeGenDAGPatterns and friends.

- Walk through the pattern emitter/matcher table printing stage.


For those without a lot of time on their hands, you may want to CTRL+F for where "seems" and "likely" appear in the text.


Goals at this point are to:

1.  Identify errors in understanding.

2.  Get final answers to the pieces that are advertised as not well understood in the text.

3.  Collect suggestions on section layout/content/graphics.

4.  Identify any missing content.


Feel free to edit the wiki page directly or send e-mail to me or the list.







Diego Novillo | 15 Apr 19:38 2014

Announcement - A tool to convert Perf profiles to use with LLVM's sample profiler

I'm glad to announce the availability of the AutoFDO converter for LLVM. This tool reads a profile generated with Linux Perf (https://perf.wiki.kernel.org/) and converts it into a format readable with LLVM's sample-based profiler. The converter shares a significant amount of code with GCC's converter tool (authored by Dehao Chen), so we can support both compilers with it.

At this time, I'm looking for volunteers to try out the tool and use it with their own code. There will be missing documentation, sharp edges and other issues typical of a tool that so far has had limited use.

To download and build the converter:

$ git clone https://github.com/google/autofdo.git autofdo
$ cd autofdo && ./configure && make

Note that you will need a compiler with C++11 support (either gcc 4.7+ or a recent Clang/LLVM).

To use the converter, you need to use Perf with a kernel that supports LBR (any kernel post 3.4 should suffice) and LLVM from trunk (there is some support in the 3.4 release, but there were changes post 3.4 that are only present in trunk).  The workflow is:

$ clang++ -O2 -gline-tables-only code.cc -o code
$ perf record -b ./code
$ create_llvm_prof --binary=./code --out=code.prof
$ clang++ -O2 -fprofile-sample-use=code.prof code.cc -o code

The second version of 'code' should run faster than the first one. In theory.

For issues with the tool itself, please use the issue tracker in github (https://github.com/google/autofdo/issues). There is also a mailing list where you can ask questions about it (https://groups.google.com/forum/#!forum/autofdo).

Thanks. Diego.