deadal nix | 21 Sep 08:42 2014

Aggregate store/load optimization

Hi all,

One area where LLVM sucks pretty badly is aggregate stores and loads. Clang does not use them, so they are not seen as important and aren't handled nicely. Other frontends work around the issue because it is not handled properly, and we end up with a chicken-and-egg problem.

I recently proposed a diff to optimize loads from aggregate stores in GVN, without much success. The benefit was unclear to some, and altering GVN was a concern to many.

I'd like to improve the aggregate support. SROA already deaggregates accesses to allocas. Would the same approach be preferable for other loads/stores? That would give a way to use the existing infrastructure while providing good support for aggregates.

If so, what would be the right place and way to do this?
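
To make the question concrete, here is a rough C analogy rather than LLVM IR: a whole-struct copy corresponds loosely to an aggregate load/store pair, and the field-by-field version is what a "deaggregated" form would look like. The `Pair` type and both function names are made up for illustration.

```c
#include <stdint.h>

/* Hypothetical two-field aggregate, standing in for an IR struct type
   such as { i32, i64 }. */
typedef struct {
    int32_t first;
    int64_t second;
} Pair;

/* Whole-aggregate copy: loosely analogous to one aggregate load
   followed by one aggregate store. */
void copy_whole(Pair *dst, const Pair *src)
{
    *dst = *src;
}

/* Field-wise copy: the "deaggregated" form, one scalar load/store per
   member, which scalar-oriented passes can reason about directly. */
void copy_fields(Pair *dst, const Pair *src)
{
    dst->first = src->first;
    dst->second = src->second;
}
```

Both functions leave the destination in the same state; the difference is only in how much structure the optimizer has to see through.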
LLVM Developers mailing list
LLVMdev <at>
Sinha, Prokash | 21 Sep 05:03 2014

How to get this profile library built from 3.6 source of clang/llvm under freebsd10+

I have the latest 3.6 version of clang/llvm and I can build it using gmake, but
libclang_rt.profile-x86_64.a does not get built.

So I guess I'm not passing the right set of options to configure.  Here is the configure command I used -
../llvm_9_19/configure --prefix=/fs/home/psinha/bin --enable-targets=x86_64 --enable-profiling

When I try to build a simple program with the profiling options below, the link fails (without the profile option I can build the
simple app).

Here is the result -

bash$ clang -fprofile-arcs main.c -v
clang version 3.6.0 (trunk 218123)
Target: x86_64-unknown-freebsd10.0
Thread model: posix
 "/.automount/" -cc1 -triple
x86_64-unknown-freebsd10.0 -emit-obj -mrelax-all -disable-free -main-file-name main.c
-mrelocation-model static -mdisable-fp-elim -masm-verbose -mconstructor-aliases
-munwind-tables -target-cpu x86-64 -target-linker-version 2.17.50 -v -dwarf-column-info
-femit-coverage-data -resource-dir
-fdebug-compilation-dir /fs/home/psinha/pgms/clang -ferror-limit 19 -fmessage-length 131
-mstackrealign -fobjc-runtime=gnustep -fdiagnostics-show-option -fcolor-diagnostics -o
/tmp/main-d6a592.o -x c main.c
clang -cc1 version 3.6.0 based upon LLVM 3.6.0svn default target x86_64-unknown-freebsd10.0
#include "..." search starts here:
#include <...> search starts here:
End of search list.
_mcleanup: tos overflow
 "/usr/bin/ld" --eh-frame-hdr -dynamic-linker /libexec/ --hash-style=both
--enable-new-dtags -o a.out /usr/lib/crt1.o /usr/lib/crti.o /usr/lib/crtbegin.o -L/usr/lib
/tmp/main-d6a592.o -lgcc --as-needed -lgcc_s --no-as-needed -lc -lgcc --as-needed -lgcc_s
--no-as-needed /usr/lib/crtend.o /usr/lib/crtn.o /.automount/
No such file: No such file or directory
clang: error: linker command failed with exit code 1 (use -v to see invocation)
bash$ clang -fprofile-arcs -ftest-coverage main.c -v
clang version 3.6.0 (trunk 218123)
Target: x86_64-unknown-freebsd10.0
Thread model: posix
 "/.automount/" -cc1 -triple
x86_64-unknown-freebsd10.0 -emit-obj -mrelax-all -disable-free -main-file-name main.c
-mrelocation-model static -mdisable-fp-elim -masm-verbose -mconstructor-aliases
-munwind-tables -target-cpu x86-64 -target-linker-version 2.17.50 -v -dwarf-column-info
-femit-coverage-notes -femit-coverage-data -resource-dir
-fdebug-compilation-dir /fs/home/psinha/pgms/clang -ferror-limit 19 -fmessage-length 131
-mstackrealign -fobjc-runtime=gnustep -fdiagnostics-show-option -fcolor-diagnostics -o
/tmp/main-a127fa.o -x c main.c
clang -cc1 version 3.6.0 based upon LLVM 3.6.0svn default target x86_64-unknown-freebsd10.0
#include "..." search starts here:
#include <...> search starts here:
End of search list.
_mcleanup: tos overflow
 "/usr/bin/ld" --eh-frame-hdr -dynamic-linker /libexec/ --hash-style=both
--enable-new-dtags -o a.out /usr/lib/crt1.o /usr/lib/crti.o /usr/lib/crtbegin.o -L/usr/lib
/tmp/main-a127fa.o -lgcc --as-needed -lgcc_s --no-as-needed -lc -lgcc --as-needed -lgcc_s
--no-as-needed /usr/lib/crtend.o /usr/lib/crtn.o /.automount/
No such file: No such file or directory <============= Note that the archive file is missing!
clang: error: linker command failed with exit code 1 (use -v to see invocation)

I would really appreciate some advice on how to get that archive built.

Mikulas Patocka | 21 Sep 00:19 2014

ARM assembler bug on LLVM 3.5


I have the following ARM Linux program. The program detects whether the
processor has a division instruction; if it does, the program uses it,
otherwise it falls back to a slower library call.

The program works with gcc, but it doesn't work with clang. clang reports
an error on the sdiv instruction in the inline assembly.

The problem is this: either you compile the program with
-mcpu=cortex-a9, and clang reports an error on the sdiv instruction because
the Cortex-A9 doesn't have sdiv, or you compile it with
-mcpu=cortex-a15, and clang accepts it but uses the full Cortex-A15
instruction set, so the program crashes on the Cortex-A9 and earlier cores.

Even if I use -no-integrated-as (as suggested in bug 18864), clang still
examines the string in the "asm" statement and reports an error. GCC doesn't
examine the string in "asm", so it works.

I'd like to ask how to write this program correctly so that it works with
clang. Or, if that's not possible, I'd like to ask whether you could drop that
pointless restriction on the instruction set in the assembler and allow it to
generate all ARM instructions regardless of the cpu switch. This
restriction doesn't exist on x86: there, you can compile a program
with -march=pentium2 and still use SSE instructions in the assembler, even
though the pentium2 doesn't have SSE. The ARM backend seems overly
protective in rejecting such instructions.


#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

int have_hardware_division = 0;

int divide(int a, int b)
{
	int result;
	if (have_hardware_division)
		/* use the hardware divider when the CPU reports one */
		asm (".cpu cortex-a15 \n sdiv %0, %1, %2" : "=r"(result) : "r"(a), "r"(b));
	else
		/* otherwise let the compiler emit the library call */
		result = a / b;
	return result;
}

int main(void)
{
	int h, i;
	unsigned a;
	h = open("/proc/self/auxv", O_RDONLY);
	if (h != -1) {
		uint32_t cap[2];
		/* each auxv entry is a (type, value) pair of 32-bit words */
		while (read(h, &cap, 8) == 8) {
			if (cap[0] == 16) {	/* 16 is AT_HWCAP */
#if defined(__thumb2__)
				if (cap[1] & (1 << 18))
					have_hardware_division = 1;
#else
				if (cap[1] & (1 << 17))
					have_hardware_division = 1;
#endif
			}
		}
	}
	a = 0;
	for (i = 1; i < 100000000; i++) {
		a += divide(100000000, i);
	}
	printf("%u\n", a);
	return 0;
}
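
The detection step can be looked at separately from the inline-assembly problem. The sketch below scans a buffer of auxiliary-vector pairs for the entry of type 16 (the one the program above checks) and tests the same two bits. The function and macro names and the in-memory buffer are assumptions made for illustration, and the sketch does not address the assembler restriction itself.

```c
#include <stddef.h>
#include <stdint.h>

#define AUXV_TYPE_HWCAP      16u        /* entry type checked by the program above */
#define HWCAP_BIT_IDIV_ARM   (1u << 17) /* bit tested when not building for Thumb-2 */
#define HWCAP_BIT_IDIV_THUMB (1u << 18) /* bit tested under __thumb2__ */

/* Scan `count` (type, value) pairs, stopping at a type-0 terminator.
   Returns the value of the first HWCAP entry, or 0 if none is found. */
uint32_t find_hwcap(const uint32_t (*pairs)[2], size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (pairs[i][0] == 0)               /* end-of-vector marker */
            break;
        if (pairs[i][0] == AUXV_TYPE_HWCAP)
            return pairs[i][1];
    }
    return 0;
}

/* Returns 1 if the capability word reports a hardware divider for the
   selected instruction set (thumb2 != 0 selects the Thumb-2 bit). */
int has_hardware_division(uint32_t hwcap, int thumb2)
{
    uint32_t bit = thumb2 ? HWCAP_BIT_IDIV_THUMB : HWCAP_BIT_IDIV_ARM;
    return (hwcap & bit) != 0;
}
```

Keeping the bit test in a plain function makes it possible to exercise the logic on any host, independent of the target the division code is built for.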
Vitaliy Filippov | 20 Sep 23:54 2014

PHINode containing itself causes segfault when compiling Blender OpenCL kernel with R600 backend


I'm trying to run Blender using Mesa OpenCL implementation on a radeonsi  
card. First the kernel didn't want to compile, but that was caused by a  
bug in it (they were using . instead of -> in 1 place), and after fixing  
this bug I've got the kernel to compile...

...But after that, LLVM started to crash during translation of IR into  
shader code with R600 backend.

I've done some investigation and figured out that the crash is caused by a
PHINode containing itself. SIAnnotateControlFlow::handleLoopCondition()
can't handle such a situation: it recurses into itself, calls
Phi->eraseFromParent() inside the inner invocation, returns to the outer one,
gets a zeroed-out object and crashes when trying to do something with its
members, for example when trying to erase it again.

I have no real background in LLVM or GCC, so the concept of PHINode itself  
was a real discovery for me :) and PHINode containing itself does look  
even more strange... I've tried to understand the semantics of such  
PHINodes from reading the code and got a suspicion that the rest of LLVM  
code just ignores PHINodes equal to their parent... So I've tried to fix  
the bug by making handleLoopCondition() skip IncomingValues equal to the  
Phi itself, but the bug didn't go away! Surprisingly, PHINode may not just  
contain itself directly, but it also may contain itself inside another  
PHINode, i.e. Phi->getIncomingValue(0)->getIncomingValue(0) == Phi, which  
results in the same problem with SIAnnotateControlFlow...

Besides "how to make a correct fix" :), my question also is: what are the  
real semantics of a PHINode containing itself directly or indirectly? I've  
done some tracing and saw such PHINodes added by the optimizer, in  
llvm::InlineFunction()... but what do they mean and how to deal with them?
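
On the semantics: a phi that lists itself as an incoming value typically sits in a loop header and says that the value is unchanged along the back edge, so self-references, direct or through other phis, are legal SSA. A traversal over incoming values therefore needs a visited mark rather than plain recursion. The following is a sketch in C with a made-up `Node` type; it is not the LLVM API and not a patch for SIAnnotateControlFlow.

```c
#include <stddef.h>

#define MAX_INCOMING 4

/* Hypothetical stand-in for an SSA value: either a phi with incoming
   values or a plain (non-phi) value. Not an LLVM type. */
typedef struct Node {
    int is_phi;
    size_t num_incoming;
    struct Node *incoming[MAX_INCOMING];
    int visited;   /* marks phis already seen during a walk */
} Node;

/* Count the non-phi operands reachable through a web of phis (a value
   used by two phis is counted twice). Each phi is marked before its
   operands are examined, so a phi that reaches itself directly or
   through another phi is processed only once. */
size_t count_leaf_values(Node *n)
{
    size_t total = 0;

    if (!n->is_phi)
        return 1;
    if (n->visited)
        return 0;          /* cycle: this phi is already being handled */
    n->visited = 1;

    for (size_t i = 0; i < n->num_incoming; i++)
        total += count_leaf_values(n->incoming[i]);
    return total;
}
```

The same idea applies when nodes are erased during the walk: deferring the erase until the whole traversal has finished avoids returning into a frame that still holds a pointer to a deleted node.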


With best regards,
   Vitaliy Filippov
Iritscen | 19 Sep 17:49 2014

Optimization of string.h calls

Hello all.  Is there a way to get llvm/clang at build-time to optimize a string.h call so that the final form of
the string is saved in the binary?  For instance, for a statement like...

strrchr(__FILE__, '/') + 1

…and where clang is called on this code’s source file with “clang /some/long/path/file.c”, can I
end up with only “file.c” stored in the binary on disk?  I think you can see that I am attempting to avoid
full paths from my machine ending up in the program.  My IDE, Xcode, always passes full paths to clang when
building.  Thanks.
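
For reference, here is the expression from the question wrapped in a small helper so that a path without any '/' also works. `path_basename` and `FILE_BASENAME` are names made up for this sketch. Whether the compiler folds the call at build time depends on the optimization level, and even when it does, the result is usually a pointer into the original literal, so the full path may still be present in the binary; treat that as something to verify on your own build rather than a guarantee.

```c
#include <string.h>

/* Return the part of `path` after the last '/', or `path` itself if
   it contains no '/'. */
const char *path_basename(const char *path)
{
    const char *slash = strrchr(path, '/');
    return slash ? slash + 1 : path;
}

/* Hypothetical convenience macro for the use case in the question. */
#define FILE_BASENAME (path_basename(__FILE__))
```

Running `strings` on the resulting binary is a simple way to check whether the long path is still there.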
Quentin Colombet | 19 Sep 22:36 2014

Re: Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!

Hi Andrea,

I think most if not all the regressions are covered by the previous test cases I’ve provided.
Please double check if you want to avoid reducing them :).

On Sep 19, 2014, at 1:22 PM, Andrea Di Biagio <andrea.dibiagio <at>> wrote:

> Hi Chandler,
> I have tested the new shuffle lowering on an AMD Jaguar cpu (which is
> AVX but not AVX2).
> On this particular target, there is a delay when output data from an
> execution unit is used as input to another execution unit of a
> different cluster. For example, there are 6 execution units which are
> divided into 3 execution clusters of Float (FPM,FPA), Vector Integer
> (MMXA,MMXB,IMM), and Store (STC). Moving data between clusters costs
> an additional 1 cycle latency penalty.
> Your new shuffle lowering algorithm is very good at keeping the
> computation inside clusters. This is an improvement with respect to
> the "old" shuffle lowering algorithm.
> I haven't observed any significant regression in our internal codebase.
> In one particular case I observed a slowdown (around 1%); here is what
> I found when investigating this slowdown.
> 1.  With the new shuffle lowering, there is one case where we end up
> producing the following sequence:
>   vmovss .LCPxx(%rip), %xmm1
>   vxorps %xmm0, %xmm0, %xmm0
>   vblendps $1, %xmm1, %xmm0, %xmm0
> Before, we used to generate a simpler:
>   vmovss .LCPxx(%rip), %xmm1
> In this particular case, the 'vblendps' is redundant since the vmovss
> would zero the upper bits in %xmm1. I am not sure why we get this
> poor codegen with your new shuffle lowering. I will investigate this
> bug further (maybe we no longer trigger some ISel patterns?) and I
> will try to give you a small reproducible for this particular case.

I think it should already be covered by one of the test cases I provided: none_useless_shuffle.ll

> 2.  There are cases where we no longer fold a vector load in one of
> the operands of a shuffle.
> This is an example:
>     vmovaps  320(%rsp), %xmm0
>     vshufps $-27, %xmm0, %xmm0, %xmm0    # %xmm0 = %xmm0[1,1,2,3]
> Before, we used to emit the following sequence:
>     # 16-byte Folded reload.
>     vpshufd $1, 320(%rsp), %xmm0      # %xmm0 = mem[1,0,0,0]
> Note: the shuffle masks are different but both valid because the
> upper bits in %xmm0 are unused. (Later on, the code uses
> register %xmm0 in a 'vcvtss2sd' instruction; only the lower 32 bits of
> %xmm0 have a meaning in this context.)
> As for 1, I'll try to create a small reproducible.

Same here, I think this is already covered by: missing_folding.ll

> 3.  When zero extending 2 packed 32-bit integers, we should try to
> emit a vpmovzxdq
> Example:
>  vmovq  20(%rbx), %xmm0
>  vpshufd $80, %xmm0, %xmm0 # %xmm0 = %xmm0[0,0,1,1]
> Before:
>   vpmovzxdq  20(%rbx), %xmm0

Probably same logic as: sse4.1_pmovzxwd.ll
But you can double check it. 

> 4.  We no longer emit a simpler 'vmovq' in the following case:
>   vxorpd %xmm4, %xmm4, %xmm4
>   vblendpd $2, %xmm4, %xmm2, %xmm4 # %xmm4 = %xmm2[0],%xmm4[1]
> Before, we used to generate:
>   vmovq %xmm2, %xmm4
> Before, the vmovq implicitly zero-extended to 128 bits the quadword in
> %xmm2. Now we always do this with a vxorpd+vblendpd.

Probably same as: none_useless_shuffle.ll
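
The shuffle immediates quoted in items 2 and 3 can be checked by hand: for the 4-lane forms, each pair of bits in the 8-bit immediate selects a source lane, lowest lane first. The helper below is a hypothetical decoder written for this thread, not part of LLVM; it ignores which source operand each lane reads from, which only matters when the two sources differ.

```c
#include <stdint.h>

/* Decode the 8-bit immediate of a 4-lane shuffle (e.g. pshufd, shufps)
   into four 2-bit lane indices; out[0] describes the lowest lane. */
void decode_shuffle_imm(int imm, unsigned out[4])
{
    uint8_t bits = (uint8_t)imm;   /* e.g. -27 becomes 0xE5 */
    for (int lane = 0; lane < 4; lane++)
        out[lane] = (bits >> (2 * lane)) & 3u;
}
```

With this, $-27 decodes to [1,1,2,3], $1 to [1,0,0,0], and $80 to [0,0,1,1], matching the comments in the quoted assembly.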

> As I said, I will try to create smaller reproducible for each of the
> problems I found.
> I hope this helps. I will keep testing.
> Thanks,
> Andrea
Quentin Colombet | 19 Sep 20:53 2014

Re: Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!

Thanks Chandler!

Here are two more test cases :).

I kept the format of my previous email.

1. avx_unpck.ll avx
Instead of issuing a single unpck, we use a sequence of extracts, inserts, shuffles, and blends.

2. none_useless_shuffle.ll none
Instead of using a single move to materialize a zero-extended constant into a vector register, we explicitly zero a vector register and use a shuffle.


Attachment (none_useless_shuffle.ll): application/octet-stream, 210 bytes

Attachment (avx_unpck.ll): application/octet-stream, 415 bytes

On Sep 18, 2014, at 2:13 AM, Chandler Carruth <chandlerc <at>> wrote:

As of r218038 we should get palign for all integer shuffles. That fixes the test case you reduced for me. If you have any other regressions that point to palignr, I'd be especially interested to have an actual test case. As I noted in my commit log, there seems to be a clear place where using this could be faster but it introduces domain crossing. I don't really have a good model for the cost there and so am hesitant to go down that route without good evidence of the need.

On Wed, Sep 17, 2014 at 1:30 PM, Chandler Carruth <chandlerc <at>> wrote:

On Wed, Sep 17, 2014 at 12:51 PM, Quentin Colombet <qcolombet <at>> wrote:
We use two shuffles instead of 1 palign.

Doh! I just forgot to teach it about palign... This one should at least be easy.

Samuel F Antao | 19 Sep 17:42 2014

noalias and alias.scope metadata producers

Hi all,

In the LLVM language reference I read that one can use noalias and alias.scope metadata to provide more detailed information about pointer aliasing. However, I was unable to obtain any LLVM IR annotated with this metadata from any LLVM optimization pass or from the Clang frontend (am I missing something?).

If I understand it correctly, this information would complement the type-based alias information and whatever mechanisms the alias analysis passes in LLVM compute from the input program. I was wondering if the coverage provided by these two components is already acceptable or if there is work that can be done in LLVM IR clients like clang to provide more information with proper noalias and alias.scope annotations.

Any comments?
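
As a point of reference on the frontend side, C's `restrict` is the usual source-level way to state that pointers do not alias, and Clang lowers it on parameters to the `noalias` parameter attribute. It may be that the inliner turns that attribute into noalias/alias.scope metadata when such a function is inlined, but that is an assumption to check against your LLVM revision. The function names below are illustrative only.

```c
#include <stddef.h>

/* `restrict` promises that dst and src do not overlap. */
void scale_into(float *restrict dst, const float *restrict src,
                float factor, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * factor;
}

/* A caller into which scale_into may be inlined; after inlining, the
   no-overlap promise has to be carried by something other than the
   parameter attribute. */
void double_all(float *out, const float *in, size_t n)
{
    scale_into(out, in, 2.0f, n);
}
```

Compiling this with optimization and inspecting the IR of `double_all` is one way to see whether any such metadata is produced.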


Sergey Dmitrouk | 19 Sep 17:12 2014

More careful treatment of floating point exceptions


I'd like to make code emitted by LLVM that includes floating point
operations which raise FP exceptions behave closer to what is defined by
the IEEE754 standard.  I'm not going to "fix everything", just the incorrect
behaviour I have run into so far.

Most of the trouble with FP exceptions is caused by optimizations, so
there should be a flag to disable/block them if one wants code that
conforms better to IEEE754.  I couldn't find an existing
flag for this and would like to introduce one.  I guess it should be
added to llvm::TargetOptions (e.g. "IEEE754FPE"), then -std=c99 or a
separate option could enable it from the front-end (Clang in this case).

I'm doing this for ARM platform and the flag should be reachable from
all these places in LLVM:

 - lib/Analysis/ValueTracking.cpp
 - lib/CodeGen/SelectionDAG/SelectionDAG.cpp
 - lib/IR/ConstantFold.cpp
 - lib/Target/ARM/ARMFastISel.cpp
 - lib/Target/ARM/ARMISelLowering.cpp
 - lib/Target/ARM/ (through predicates)
 - lib/Target/ARM/ (through predicates)

and in Clang:

 - lib/AST/ExprConstant.cpp

Did I get it right that there is no such flag so far?  Does what I'm
suggesting sound reasonable?

Melanie Kambadur | 19 Sep 15:57 2014

Publications: Harmony and ParaShares

Hi everyone,

I recently presented a paper at Euro-Par 2014 that features a tool called Harmony, which my co-authors and I built on top of LLVM and clang. Could someone please add it to along with our original Harmony paper from 2012 which never made the publication list?

For those interested, Harmony is an open source tool (built as an LLVM pass) that creates a new kind of application profile called Parallel Block Vectors, or PBVs. PBVs track dynamic program parallelism at basic block granularity to expose opportunities for improving hardware design and software performance. Please visit to learn more and download the tool. 

Thank you!

Melanie Kambadur  
Steven Wu | 19 Sep 04:01 2014

[RFC] Exhaustive bitcode compatibility tests for IR features

From the discussion of bitcode backward compatibility on the list, it seems we lack a systematic way to test
every existing IR feature. It would be useful to keep a test that exercises all the IR features of the current
trunk, which we can freeze in the form of bitcode for backward compatibility testing in the future. I am
proposing to implement such a test, which should try to meet the following goals:
1. Try to keep it in one file so it is easy to freeze and move to the next version.
2. Try to exercise and verify as many features as possible, which should include all the globals,
instructions, metadata and intrinsics (and more).
3. The test should be easily maintainable. It should be easy to fix when broken and to update when the assembly
changes.
I am going to implement this test as one lengthy LLVM assembly file, in the form of the attachment (which so far only
tests global variables). It is going to be long, but someone must do it first. Future updates should be
much simpler. In the test, I start with a default global variable and enumerate all the possible
attributes by changing them one by one. I try to keep the variable declaration as simple as possible so that
it won’t be affected by simple assembly-level changes (like changing the parsing order of some
attributes, since this is supposed to be a bitcode compatibility test, not an assembly test). I try to make
the tests as thorough as possible while avoiding large duplication. For example, I will test the Linkage
attribute on both GlobalVariable and Function, but probably not enumerate all the types I want to
test. I will keep the tests for Types in a different section, since it is going to be huge and is orthogonal
to the tests of globals.
When making a release or some big change to the IR, we can freeze the test by generating bitcode, changing the RUN
line so it runs llvm-dis directly, and modifying the CHECKs that correspond to the change. Then we can
move on with a new version of the bitcode tests. This will add some work for people who would like to make
changes to the IR (which might be one more reason to discourage them from breaking compatibility). I will
make sure to update the docs on changing the IR after I add this test.

Currently, there are individual bitcode tests in LLVM which are created when IR or intrinsics get
changed. This exhaustive test shouldn’t overlap with the existing ones, since it focuses on
keeping a working, up-to-date version of the IR tests. Both approaches to bitcode tests can co-exist. For
example, for small updates, we can add specific test cases like the current ones to test auto-upgrade,
while updating the exhaustive bitcode test to incorporate the new changes. When making huge upgrades and
major releases, we can freeze the exhaustive test for future checks.

For the actual test cases, I think it should be trivial for globals, instructions and types (correct me if I am
wrong), but intrinsics can be very tricky. I am not sure how much compatibility is guaranteed for
intrinsics, but they cannot be checked through llvm-as followed by llvm-dis. Intrinsics, as far as I know,
are coded like normal functions, globals or metadata. My current plan is to write a separate tool to check
the intrinsics actually supported in the IR or backend. Intrinsic functions might be the easiest, since the
supported ones should all be declared in Intrinsics*.td and can be checked by calling getIntrinsicID()
after reading the bitcode. Intrinsics coded as globals (llvm.used) or metadata (llvm.loop) can be more
tricky. Maybe another .td file with hardcoded intrinsics for these cases should be added just for
testing purposes (we can add a new API for it later so that we don’t need to do string compares to identify
these intrinsics). After we have another tool to test intrinsics (which could be merged with llvm-dis to
save a RUN command and execution time), the attached test will just need to be updated as follows
(checking llvm.global_ctors for example):
; RUN: verify-intrinsics %s.bc | FileCheck -check-prefix=CHECK-INT %s

%0 = type { i32, void ()*, i8* }
@llvm.global_ctors = appending global [1 x %0] [%0 { i32 65535, void ()* @ctor, i8* @data }]
; CHECK: @llvm.global_ctors = appending global [1 x %0] [%0 { i32 65535, void ()* @ctor, i8* @data }]
; CHECK-INT: @llvm.global_ctors int_global_ctors

Let me know if there is a better proposal.


Attachment (llvm-3.6.ll): application/octet-stream, 3796 bytes