Duncan P. N. Exon Smith | 25 Oct 01:16 2014

First-class debug info IR: MDLocation

I've attached a preliminary patch for `MDLocation` as a follow-up to the
RFC [1] last week.  It's not commit-ready -- in particular, it squashes
a bunch of commits together and doesn't pass `make check` -- but I think
it's close enough to indicate the direction and work toward consensus.

[1]: http://lists.cs.uiuc.edu/pipermail/llvmdev/2014-October/077715.html

IMO, the files to focus on are:

    include/llvm/IR/DebugInfo.h
    include/llvm/IR/DebugLoc.h
    include/llvm/IR/Metadata.h
    include/llvm/IR/Value.h
    lib/AsmParser/LLLexer.cpp
    lib/AsmParser/LLParser.cpp
    lib/AsmParser/LLParser.h
    lib/AsmParser/LLToken.h
    lib/Bitcode/Reader/BitcodeReader.cpp
    lib/Bitcode/Writer/BitcodeWriter.cpp
    lib/Bitcode/Writer/ValueEnumerator.cpp
    lib/Bitcode/Writer/ValueEnumerator.h
    lib/IR/AsmWriter.cpp
    lib/IR/AsmWriter.h
    lib/IR/DebugInfo.cpp
    lib/IR/DebugLoc.cpp
    lib/IR/LLVMContextImpl.cpp
    lib/IR/LLVMContextImpl.h
    lib/IR/Metadata.cpp

Using `Value` instead of `MDNode`

Nitin Mukesh Tiwari | 25 Oct 03:55 2014

Query regarding LLVM library for compilation

Dear LLVM Team

I have installed LLVM and have also tried using the examples to learn LLVM.
I am really sorry if this is a silly question, but I tried writing one file for clang++ which uses some header files from LLVM and Clang, like verifier.h.
It would not compile: an error was thrown that verifier.h was not found, even though I have the exact path. But when I put my file in the examples folder, made changes to the makefile, and built the examples by running make, everything just worked fine. My query is: is there any way that my system automatically finds the headers for LLVM and Clang whenever I use them, like it does for stdio.h and iostream.h, or do I need to build the examples again and again?
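[A common approach, sketched here as a suggestion: let llvm-config (shipped with your LLVM installation, assumed to be on PATH) supply the include and library flags, instead of hard-coding paths or editing the examples makefile:]

```shell
# llvm-config prints the -I/-L/-l flags for the installed LLVM, so the
# compiler finds headers such as llvm/IR/Verifier.h without a hand-edited
# makefile. "mytool" and "myfile.cpp" are placeholder names.
clang++ myfile.cpp \
    $(llvm-config --cxxflags) \
    $(llvm-config --ldflags --libs core) \
    -o mytool
```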

Thanks and Regards
_______________________________________________
LLVM Developers mailing list
LLVMdev <at> cs.uiuc.edu         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
betulb | 25 Oct 02:26 2014

Indirect call site profiling


Hi All,

We've been working on enhancing LLVM's instrumentation based profiling by
adding indirect call target profiling support. Our goal is to add
instrumentation around indirect call sites, so that we may track the
frequently taken target addresses and their call frequencies.
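To make the instrumentation idea concrete, a per-call-site table mapping target addresses to call counts could look roughly like the C sketch below. The names and sizes are illustrative only, not the actual implementation under review:

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_TARGETS 4  /* illustrative: track only the top few targets per site */

/* One record per indirect call site: parallel arrays of target address
   and call count. */
struct CallSiteProfile {
    void    *target[MAX_TARGETS];
    uint64_t count[MAX_TARGETS];
};

/* Instrumentation hook invoked just before each indirect call. */
void profile_indirect_call(struct CallSiteProfile *site, void *tgt) {
    for (size_t i = 0; i < MAX_TARGETS; ++i) {
        if (site->target[i] == tgt) {   /* known target: bump its counter */
            site->count[i]++;
            return;
        }
        if (site->target[i] == NULL) {  /* free slot: record a new target */
            site->target[i] = tgt;
            site->count[i] = 1;
            return;
        }
    }
    /* Table full: a real implementation might evict the coldest entry. */
}
```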

The acquired data is useful for optimizing applications that make heavy use of indirect function calls. Our initial findings show that using the profile data in optimizations would notably improve the performance of some of the SPEC benchmarks. We have a proof of concept implementation, which we plan to put up for review. However, I'd first like to ask whether there are any plans or ongoing work in the community to enable indirect call target profiling. Please let me know if cfe-dev is a better venue for PGO-related emails.

Thanks,
-Betul Buyukkurt

Employee of the Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux
Foundation Collaborative Project
Diego Novillo | 24 Oct 23:48 2014

Recent changes in -gmlt break sample profiling

I'm not sure if this was intended, but it's going to be a problem for sample profiles.


When we compile with -gmlt, the profiler expects to find the line number for all the function headers so that it can compute relative line locations for the profile.

The tool that reads the ELF binary is not finding them, so it writes out absolute line numbers, which are impossible to match during the profile-use phase.

The problem seems to be that we are missing DW_TAG_subprogram for all the functions in the file.

Attached are the DWARF dumps of the same program, one compiled with my system's clang 3.4 and the other with today's trunk. In both compiles, I used -gline-tables-only.

The trunk version is missing all the subprogram tags for the functions in the file. This breaks the sample profiler.

Should I file a bug, or is -gmlt going to be like this from now on? The latter would be a problem for us.


Thanks. Diego.

Attachment (fnptr-clang36.bad.dwarfdump): application/octet-stream, 5689 bytes
Attachment (fnptr-clang34.good.dwarfdump): application/octet-stream, 8 KiB
Artem Dinaburg | 24 Oct 22:17 2014

Cross-Block Dead Store Elimination

Hi,

It looks like the DeadStoreElimination optimization doesn't work across BasicBlock boundaries. The
project I'm working on (https://github.com/trailofbits/mcsema) would tremendously benefit from
even simple cross-block DSE.

There was a patch to do non-local DSE a few years ago
(http://lists.cs.uiuc.edu/pipermail/llvmdev/2010-January/028751.html), but it seems the
patch was never merged.
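To make the missed case concrete, here is a minimal C sketch (my own example, unrelated to that patch) of a store that only cross-block DSE can remove: the first store to *p is unconditionally overwritten later, with no intervening read of *p, but the killing store lives in a different basic block:

```c
int g;

void f(int *p, int c) {
    *p = 1;      /* dead store: nothing reads *p before it is overwritten */
    if (c)
        g = c;   /* the branch splits the function into several blocks */
    *p = 2;      /* killing store, in a different basic block than *p = 1 */
}
```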

Is there an existing way to do cross-block DSE?

Was there something wrong with the original non-local DSE patch that it wasn't merged?

Thanks,
Artem
Diego Novillo | 24 Oct 22:16 2014

Adding sample profile support to llvm-profdata?

Duncan, Justin,


I'm about to submit a series of patches that add writing capabilities for sample profiles in both text and binary formats. Soon, I'll add a third format (to make it interoperable with GCC).

I would like to add some profile maintenance utilities as well: merging, dumping and converting.

It seems like the best place would be tools/llvm-profdata. But that means that I need to have a way of distinguishing sample from instrumented profiles.

For the binary formats, it's easy to have the tool check the magic bits at the start, but for the text format it is not easy to tell whether we're dealing with a sample profile vs an instrumented profile.

The options I see are:

1- Add a --profile-type={sample|instr} to llvm-profdata to specify whether we are dealing with a sample or an instrumented profile. This would help prevent mixing and matching the two types of profiles (they are not convertible one to the other, not easily anyway).

2- Write a totally separate tool to deal with sample profiles.

I am slightly in favour of option #1. I could even make --profile-type=instr the default, to avoid a flag day for tools you may have deployed.
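For the binary formats, the magic check could look roughly like the sketch below. The magic values here are made-up placeholders (not the actual llvm-profdata constants), and a text profile falls through as ambiguous, which is exactly the problem:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum ProfileKind { PROFILE_UNKNOWN, PROFILE_INSTR, PROFILE_SAMPLE };

/* Classify a profile buffer by its leading magic bytes. The magic values
   are hypothetical placeholders for illustration only. */
enum ProfileKind classify(const uint8_t *buf, size_t len) {
    static const uint8_t INSTR_MAGIC[8]  = { 'l','p','r','o','f','i', 0x81, 0xff };
    static const uint8_t SAMPLE_MAGIC[8] = { 'l','p','r','o','f','s', 0x81, 0xff };
    if (len >= 8 && memcmp(buf, INSTR_MAGIC, 8) == 0)
        return PROFILE_INSTR;
    if (len >= 8 && memcmp(buf, SAMPLE_MAGIC, 8) == 0)
        return PROFILE_SAMPLE;
    return PROFILE_UNKNOWN;  /* e.g. a text profile: no magic to check */
}
```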


Thanks. Diego.
Hui Zhang | 24 Oct 21:49 2014

which is the right version

Hello,

I ran into a bug where the DebugInfoFinder class does not iterate over all the compile units, and I found this report, which fixed the issue: http://llvm.org/bugs/show_bug.cgi?id=17507

The code was modified and updated in revision 192879, but I don't know which version of DebugInfo.h corresponds to that modified .cpp file. Does anyone know about this?

Thanks!

--
Best regards


Hui Zhang
Jingyue Wu | 24 Oct 20:18 2014

IndVar widening in IndVarSimplify causing performance regression on GPU programs

Hi, 


I noticed a significant performance regression (up to 40%) on some internal CUDA benchmarks (a reduced example is presented below). The root cause of this regression seems to be that IndVarSimplify widens induction variables on the assumption that arithmetic on wider integer types is as cheap as arithmetic on narrower ones. However, this assumption is wrong, at least for the NVPTX64 target.

Although the NVPTX64 target supports 64-bit arithmetic, the actual NVIDIA GPU typically has only 32-bit integer registers, so a single 64-bit operation typically ends up as two machine instructions, one handling the low 32 bits and one the high 32 bits. I haven't looked at other GPU targets such as R600, but I suspect this problem is not restricted to NVPTX64.
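To see why, here is a C model of what a 64-bit add costs on a machine with only 32-bit integer registers: two dependent 32-bit adds plus a carry. This is an illustration of the instruction split, not actual codegen:

```c
#include <stdint.h>

/* Model of a 64-bit add lowered to 32-bit register operations. */
uint64_t add64_via_32(uint64_t a, uint64_t b) {
    uint32_t alo = (uint32_t)a, ahi = (uint32_t)(a >> 32);
    uint32_t blo = (uint32_t)b, bhi = (uint32_t)(b >> 32);
    uint32_t lo    = alo + blo;         /* first 32-bit add (low half)   */
    uint32_t carry = lo < alo;          /* carry out of the low half     */
    uint32_t hi    = ahi + bhi + carry; /* second 32-bit add (high half) */
    return ((uint64_t)hi << 32) | lo;
}
```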

Below is a reduced example:
__attribute__((global)) void foo(int n, int *output) {
  for (int i = 0; i < n; i += 3) {
    output[i] = i * i;
  }
}
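At the source level, the widening is roughly equivalent to promoting i and all of its uses to 64 bits. This is a C sketch of the transformation's effect, not the actual pass output:

```c
#include <stdint.h>

/* After IndVarSimplify's widening, the induction variable and every use
   of it (the multiply, the increment, the compare) operate on i64. */
void foo_widened(int n, int32_t *output) {
    for (int64_t i = 0; i < (int64_t)n; i += 3)
        output[i] = (int32_t)(i * i);
}
```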

Without widening, the loop body in the PTX (a low-level assembly-like language generated by NVPTX64) is:
BB0_2:                                  // =>This Inner Loop Header: Depth=1        
        mul.lo.s32      %r5, %r6, %r6;                                              
        st.u32  [%rd4], %r5;                                                        
        add.s32         %r6, %r6, 3;                                                
        add.s64         %rd4, %rd4, 12;                                              
        setp.lt.s32     %p2, %r6, %r3;
        @%p2 bra        BB0_2;
in which %r6 is the induction variable i. 

With widening, the loop body becomes:
BB0_2:                                  // =>This Inner Loop Header: Depth=1        
        mul.lo.s64      %rd8, %rd10, %rd10;                                         
        st.u32  [%rd9], %rd8;                                                         
        add.s64         %rd10, %rd10, 3;                                            
        add.s64         %rd9, %rd9, 12;                                             
        setp.lt.s64     %p2, %rd10, %rd1;                                           
        @%p2 bra        BB0_2;

Although the number of PTX instructions in both versions is the same, the version with widening uses mul.lo.s64, add.s64, and setp.lt.s64 instructions, which are more expensive than their 32-bit counterparts. Indeed, the SASS code (the disassembly of the actual machine code running on GPUs) for the version with widening looks significantly longer.

Without widening (7 instructions): 
.L_1:                                                                               
        /*0048*/                IMUL R2, R0, R0;                                      
        /*0050*/                IADD R0, R0, 0x1;                                   
        /*0058*/                ST.E [R4], R2;                                      
        /*0060*/                ISETP.NE.AND P0, PT, R0, c[0x0][0x140], PT;
        /*0068*/                IADD R4.CC, R4, 0x4;
        /*0070*/                IADD.X R5, R5, RZ;                                  
        /*0078*/             @P0 BRA `(.L_1);

With widening (12 instructions):
.L_1:                                                                            
        /*0050*/                IMUL.U32.U32 R6.CC, R4, R4;                      
        /*0058*/                IADD R0, R0, -0x1;                                    
        /*0060*/                IMAD.U32.U32.HI.X R8.CC, R4, R4, RZ;             
        /*0068*/                IMAD.U32.U32.X R8, R5, R4, R8;                   
        /*0070*/                IMAD.U32.U32 R7, R4, R5, R8;                     
        /*0078*/                IADD R4.CC, R4, 0x1;                             
        /*0088*/                ST.E [R2], R6;                                   
        /*0090*/                IADD.X R5, R5, RZ;                               
        /*0098*/                ISETP.NE.AND P0, PT, R0, RZ, PT;                 
        /*00a0*/                IADD R2.CC, R2, 0x4;                             
        /*00a8*/                IADD.X R3, R3, RZ;                                  
        /*00b0*/             @P0 BRA `(.L_1);

I hope the issue is clear up to this point. So what's a good solution? I am thinking of having IndVarSimplify consult TargetTransformInfo about the cost of integer arithmetic on different types. If operations on wider integer types are more expensive, IndVarSimplify should disable the widening.

Another thing I am concerned about: are there other optimizations that make similar assumptions about integer widening? Those might cause performance regressions too, just as IndVarSimplify does.

Jingyue
Attachment (indvar.cu): application/octet-stream, 162 bytes
Attachment (indvar.32.ptx): application/octet-stream, 1106 bytes
Attachment (indvar.32.sass): application/octet-stream, 4340 bytes
Attachment (indvar.64.ptx): application/octet-stream, 1150 bytes
Attachment (indvar.64.sass): application/octet-stream, 4802 bytes
Boris Boesler | 24 Oct 16:53 2014

Virtual register def doesn't dominate all uses

Hi!

During my backend development I get the error message for some tests:
*** Bad machine code: Virtual register def doesn't dominate all uses. ***

(C source-code, byte-code disassembly and printed machine code at the end of the email)

The first USE of vreg4 in BB#1 has no previous DEF in BB#0 or BB#1. But why? I can't see how the LLVM byte-code is transformed into the lower machine code.

One possible reason could be that I haven't implemented all operations; e.g., I didn't implement MUL at this stage. Their "state" is LEGAL, not CUSTOM or EXPAND. But it fails with implemented operations as well.

What did I do wrong? Missing implementation for some operations? What did I miss to implement?

Thanks in advance,
Boris

----8<----

C source-code:
int simple_loop(int end_loop_index)
{
  int sum = 0;
  for(int i = 0; i < end_loop_index; i++) {
    sum += i;
  }
  return(sum);
}

LLVM byte-code disassembly:
; Function Attrs: nounwind readnone
define i32 @simple_loop(i32 %end_loop_index) #1 {
entry:
  %cmp4 = icmp sgt i32 %end_loop_index, 0
  br i1 %cmp4, label %for.cond.for.end_crit_edge, label %for.end

for.cond.for.end_crit_edge:                       ; preds = %entry
  %0 = add i32 %end_loop_index, -2
  %1 = add i32 %end_loop_index, -1
  %2 = zext i32 %0 to i33
  %3 = zext i32 %1 to i33
  %4 = mul i33 %3, %2
  %5 = lshr i33 %4, 1
  %6 = trunc i33 %5 to i32
  %7 = add i32 %6, %end_loop_index
  %8 = add i32 %7, -1
  br label %for.end

for.end:                                          ; preds = %for.cond.for.end_crit_edge, %entry
  %sum.0.lcssa = phi i32 [ %8, %for.cond.for.end_crit_edge ], [ 0, %entry ]
  ret i32 %sum.0.lcssa
}

The emitted blocks are:
Function Live Ins: %R0 in %vreg2

BB#0: derived from LLVM BB %entry
    Live Ins: %R0
	%vreg2<def> = COPY %R0; IntRegs:%vreg2
	%vreg3<def> = MV 0; SRegs:%vreg3
	CMP %vreg2, 1, %FLAG<imp-def>; IntRegs:%vreg2
	%vreg6<def> = COPY %vreg3; SRegs:%vreg6,%vreg3
	BR_cc <BB#2>, 20, %FLAG<imp-use,kill>
	BR <BB#1>
    Successors according to CFG: BB#1(20) BB#2(12)

BB#1: derived from LLVM BB %for.cond.for.end_crit_edge
    Predecessors according to CFG: BB#0
	%vreg4<def> = MV %vreg4; IntRegs:%vreg4
	%vreg5<def> = ADD %vreg4<kill>, -1; IntRegs:%vreg5,%vreg4
	%vreg0<def> = COPY %vreg5<kill>; SRegs:%vreg0 IntRegs:%vreg5
	%vreg6<def> = COPY %vreg0; SRegs:%vreg6,%vreg0
    Successors according to CFG: BB#2

BB#2: derived from LLVM BB %for.end
    Predecessors according to CFG: BB#0 BB#1
	%vreg1<def> = COPY %vreg6<kill>; SRegs:%vreg1,%vreg6
	%R0<def> = COPY %vreg1; SRegs:%vreg1
	RETURN %R0<imp-use>

# End machine code for function simple_loop.

*** Bad machine code: Virtual register def doesn't dominate all uses. ***
- function:    simple_loop
- basic block: BB#1 for.cond.for.end_crit_edge (0x7fd7cb025250)
- instruction: %vreg4<def> = MV %vreg4; IntRegs:%vreg4
LLVM ERROR: Found 1 machine code errors.
Caio Souza Oliveira | 24 Oct 14:48 2014

LLVM JIT Execution Engine port for PPC64le

Hi,

 

I wonder if the community would accept patches on 3.5 for the old JIT interface. I'm focused on providing PPC64le environment support for the old JIT, so that certain applications can run properly in a Power environment. What are your thoughts on this?

Best regards

-- Caio

Demikhovsky, Elena | 24 Oct 13:24 2014

Adding masked vector load and store intrinsics

Hi,
 
We would like to add support for masked vector loads and stores by introducing new target-independent intrinsics. The loop vectorizer will then be enhanced to optimize loops containing conditional memory accesses by generating these intrinsics for existing targets such as AVX2 and AVX-512. The vectorizer will first ask the target about availability of masked vector loads and stores. The SLP vectorizer can potentially be enhanced to use these intrinsics as well.
 
The intrinsics would be legal for all targets; targets that do not support masked vector loads or stores will scalarize them.
The addressed memory will not be touched for masked-off lanes. In particular, if all lanes are masked off no address will be accessed.
 
  call void @llvm.masked.store (i32* %addr, <16 x i32> %data, i32 4, <16 x i1> %mask)
 
  %data = call <8 x i32> @llvm.masked.load (i32* %addr, <8 x i32> %passthru, i32 4, <8 x i1> %mask)
 
where %passthru is used to fill the elements of %data that are masked-off (if any; can be zeroinitializer or undef).
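For targets without native support, the scalarized semantics can be sketched in C as follows. The lane count and names are illustrative; the key property is that masked-off lanes never touch memory:

```c
#include <stdint.h>

#define VL 8  /* illustrative vector length */

/* Scalarized masked load: each active lane loads from memory; each
   masked-off lane takes its value from passthru and never dereferences
   the corresponding address (the ternary evaluates only one side). */
void masked_load_i32(const int32_t *addr, const int32_t *passthru,
                     const uint8_t *mask, int32_t *data) {
    for (int lane = 0; lane < VL; ++lane)
        data[lane] = mask[lane] ? addr[lane] : passthru[lane];
}

/* Scalarized masked store: masked-off lanes leave memory untouched. */
void masked_store_i32(int32_t *addr, const int32_t *data,
                      const uint8_t *mask) {
    for (int lane = 0; lane < VL; ++lane)
        if (mask[lane])
            addr[lane] = data[lane];
}
```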
 
Comments so far, before we dive into more details?
 
Thank you.
 
- Elena and Ayal
 
 

---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

