David Kang | 31 Oct 23:01 2014

how to keep a hard register across multiple instructions?


 I'm a newbie in gcc porting.

 The architecture that I'm porting gcc to has a hardware FPU.
But the compiler has to generate code that builds an FPU instruction in an integer register
at run time and writes the value to the FPU command register.

 To make a single FPU instruction, three instructions are needed.
Two instructions build the 32-bit FPU instruction word in (cmd, operands[2], operands[1], operands[0]) format.
Here the operands are FPU register numbers, which can be 0 ~ 32.
As an example, f3 = f1 + f2 can be encoded as (code of 'add', 2, 1, 3).

 The third instruction writes it to the FPU command register.
The architecture can issue up to 3 instructions at a time.

 The difficulty is that we need to know the FPU register numbers
of the operands to generate the FPU instruction.

 The easiest but lowest-performance implementation is to generate those three instructions
from a single "define_insn" as three consecutive instructions.
However, we then lose any chance of bundling those 3 instructions with other instructions.

 So, I'm trying to find a better way.
I used "define_insn_and_split" and split a single FPU instruction into 3 instructions like this:
(Here I assume register r10 is used, but it can be any integer register.)

 operands[0] = plus (operands[1], operands[2])


David Kang | 31 Oct 20:03 2014

Support for architectures without hardware interlocks


 I'm working on porting gcc to an architecture without hardware interlock support for floating point unit.
I read that instruction latency can be expressed in gcc's machine description file. I set the latencies
of the instructions and built gcc.
I expected that gcc would automatically separate two dependent instructions
by at least the latency of the first instruction.
However, my gcc doesn't do that.
I'm using a slightly old version, 4.7.3.
I also expected that gcc might fill the gap with no-ops when it cannot find
other useful instructions to put there.
But I don't see that, either.

 Does gcc support architectures without hardware interlocks automatically?
Could anyone help me understand how I can enforce the latency requirements
of two dependent instructions in gcc?

 I saw in a 2007 post on the gcc mailing list that GCC didn't support architectures
without hardware interlocks. (https://gcc.gnu.org/ml/gcc/2007-07/msg00915.html)
Is it still true?




Dr. Dong-In "David" Kang
Computer Scientist

Ilya Palachev | 31 Oct 14:12 2014

Performance for AArch64 in ILP32 ABI


According to this mail thread,
https://gcc.gnu.org/ml/gcc-patches/2013-12/msg00282.html, GCC has ILP32
GNU/Linux support.

1. The question is: how reasonable would it be to use ILP32 mode for
building the *whole* Linux distribution from the point of view of

IIRC gcc built for i686 can run faster than gcc built for x86_64
on the same hardware, because there are a lot of data
structures with pointer fields, and if 32-bit pointers are used,
less memory is allocated for these structures. As a result, smaller
structures are loaded from memory faster and fewer cache misses happen.
Is this also the case for the AArch64 ILP32 ABI?

A second idea is that if integers are 32 bits wide, then twice as many
integers can be kept in CPU registers as when they are 64 bits wide,
and thus fewer loads/stores to memory are needed.

2. What's the current status of ILP32 support implementation in GCC?

3. Has anybody benchmarked AArch64 ILP32 builds against LP64 builds?
Is it possible to compare the performance of these two modes?

Best regards,
Ilya Palachev


Wei Mi | 31 Oct 04:49 2014

cgraph node profile update in cgraph_rebuild_references causes a performance issue

We have a program like this:

A() {    // hot func
}

B() {
  A();    // very hot
  if (i) {
    A();  // very cold
  }
}

Both call sites of A will be inlined into B. In GCC's
save_inline_function_body, during the inline_transform stage, A's first clone
will be chosen and turned into a real function. In our case, the chosen clone
node corresponds to the cold call site of A.
cgraph_rebuild_references in tree_function_versioning resets the
cgraph node count of the chosen clone to the entry BB count of function A
(A is hot). So the cgraph node count of the chosen clone becomes hot
while its inline edge count is still cold. This breaks the assumption
described here:
for an inline node, bb->count == edge->count == edge->callee->count

In the patch committed in the thread above (listed below),
cg_edge->callee->count is used to update the profile of its inline
instance, which makes a BB in function B hot although it is actually very
cold. The wrong profile information causes a performance regression in
one of our internal benchmarks.

gccadmin | 30 Oct 23:37 2014

gcc-4.8-20141030 is now available

Snapshot gcc-4.8-20141030 is now available on
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 4.8 SVN branch
with the following options: svn://gcc.gnu.org/svn/gcc/branches/gcc-4_8-branch revision 216945

You'll find:

 gcc-4.8-20141030.tar.bz2             Complete GCC


Diffs from 4.8-20141023 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-4.8
link is updated and a message is sent to the gcc list.  Please do not use
a snapshot before it has been announced that way.

Gerald Pfeifer | 30 Oct 22:47 2014

libcc1 still breaks bootstrap (with clang as system compiler)

Now that the <gmp.h> error is gone on my nightly FreeBSD test systems,
I am getting the following:

In file included from /scratch2/tmp/gerald/gcc-HEAD/libcc1/plugin.cc:58:
In file included from /usr/include/c++/v1/string:438:
In file included from /usr/include/c++/v1/cwchar:107:
In file included from /usr/include/c++/v1/cwctype:54:
/usr/include/c++/v1/cctype:51:72: error: use of undeclared identifier 
inline _LIBCPP_INLINE_VISIBILITY int __libcpp_isalnum(int __c) {return isalnum(__c);}
note: expanded from macro 'isalnum'
#define isalnum(c) do_not_use_isalnum_with_safe_ctype
gmake[4]: *** [plugin.lo] Error 1
gmake[4]: Leaving directory `/scratch2/tmp/gerald/OBJ-1030-2127/libcc1'
gmake[3]: *** [all] Error 2
gmake[3]: Leaving directory `/scratch2/tmp/gerald/OBJ-1030-2127/libcc1'
gmake[2]: *** [all-stage1-libcc1] Error 2
gmake[2]: Leaving directory `/scratch2/tmp/gerald/OBJ-1030-2127'
gmake[1]: *** [stage1-bubble] Error 2
gmake[1]: Leaving directory `/scratch2/tmp/gerald/OBJ-1030-2127'
gmake: *** [bootstrap-lean] Error 2

That's on FreeBSD 10.x with libc++ and clang as the system compiler.

Does this trigger something, or should I peek closer?  (I am

Jakub Jelinek | 30 Oct 11:51 2014

GCC 4.9.2 Status Report (2014-10-30)


GCC 4.9.2 has been released, the branch is now open again
for regression and documentation fixes.
Unless some blocker bug is found, GCC 4.9.3 should be released
in March or April.

Quality Data

Priority          #   Change from Last Report
--------        ---   -----------------------
P1                0      0
P2               82      0
P3               46   +  6
--------        ---   -----------------------
Total           128   +  6

Previous Report


The next report will be sent by Richard.

Jakub Jelinek | 30 Oct 11:48 2014

GCC 4.9.2 Released

The GNU Compiler Collection version 4.9.2 has been released.

GCC 4.9.2 is a bug-fix release from the GCC 4.9 branch
containing important fixes for regressions and serious bugs in
GCC 4.9.1 with more than 65 bugs fixed since the previous release.
This release is available from the FTP servers listed at:


Please do not contact me directly regarding questions or comments
about this release.  Instead, use the resources available from

As always, a vast number of people contributed to this GCC release
-- far too many to thank them individually!

Uros Bizjak | 30 Oct 08:40 2014

Recent go changes broke alpha bootstrap


Recent go changes broke alpha bootstrap:

/bin/mkdir -p .; files=`echo
errors.gox io.gox runtime.gox sync/atomic.gox sync.gox syscall.gox
time.gox | sed -e 's/[^ ]*\.gox//g'`; /bin/sh ./libtool --tag GO

gccadmin | 29 Oct 23:36 2014

gcc-4.9-20141029 is now available

Snapshot gcc-4.9-20141029 is now available on
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 4.9 SVN branch
with the following options: svn://gcc.gnu.org/svn/gcc/branches/gcc-4_9-branch revision 216854

You'll find:

 gcc-4.9-20141029.tar.bz2             Complete GCC


Diffs from 4.9-20141022 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-4.9
link is updated and a message is sent to the gcc list.  Please do not use
a snapshot before it has been announced that way.

Marat Zakirov | 29 Oct 16:35 2014

[RFC] Adjusted VRP

Hi folks!

During asan performance tuning, we tried to use VRP and found that it is
path-insensitive and thus suboptimal. In the example below, "i" has the VRP
range 0..1000 across the whole program, but in the loop body it is just 0..999.

int a[1000];
void foo ()
{
    for (int i=0;i<1000;i++)
        a[i] = 0;
}

I think a path-sensitive approach could significantly increase VRP
precision. I suggest extending the existing get_range_info (tree t) user
interface to get_range_info (tree t, basic_block = NULL). If the user does
not specify a basic_block, they get the usual range info. When bb is
specified and hash<pair<tree, basic_block>> has an entry, the
bblock-accurate range info is returned; otherwise the usual one is. The goal
of the adjustment algorithm is to fill hash<pair<tree, basic_block>>, where
tree is an SSA name that has bblock-accurate VRP. Memory usage of
hash<pair<tree, basic_block>> could be reduced by memorizing only
important values such as loop iterators, array indexes, etc.

I propose two approaches to filling hash<pair<tree, basic_block>>. The first
one is built on top of the existing VRP pass. A special iterative RPO
traversal of the CFG truncates the VRP and stores it for every basic block.
It merges ranges at join bblocks and truncates them at conditional
bblocks. A draft implementation of this approach is in the attached
patch. The patch was developed for asan, but the VRP enhancements are generic.