I am implementing unicode normalization in Haskell. I challenged myself to match the performance with the best C/C++ implementation, the best being the ICU library. I am almost there, beating it in one of the benchmarks and within 30% for others. I am out of all application level tricks that I could think of and now need help from the compiler.
I started with a bare minimum loop and adding functionality incrementally watching where the performance trips. At one point I saw that adding just one 'if' condition reduced the performance by half. I looked at what's going on at the assembly level. Here is a github gist of the assembly instructions executed in the fast path of the loop, corresponding cmm snippets and also the full cmm corresponding to the loop:
I have annotated the assembly code with labels matching the corresponding CMM.
With the addition of another "if" condition the loop which was pretty simple till now suddenly got bloated with a lot of register reassignments. Here is a snippet of the register movements added:
# swap r14 <-> r11
=> 0x408d6b: mov %r11,0x98(%rsp)
=> 0x408d73: mov %r14,%r11
=> 0x408d76: mov 0x98(%rsp),%r14
# rbx -> r10 -> r9 -> r8 -> rdi -> rsi -> rdx -> rcx -> rbx
=> 0x408d7e: mov %rbx,0x90(%rsp)
=> 0x408d86: mov %rcx,%rbx
=> 0x408d89: mov %rdx,%rcx
=> 0x408d8c: mov %rsi,%rdx
=> 0x408d8f: mov %rdi,%rsi
=> 0x408d92: mov %r8,%rdi
=> 0x408d95: mov %r9,%r8
=> 0x408d98: mov %r10,%r9
=> 0x408d9b: mov 0x90(%rsp),%r10
loop logic here which uses only %rax, %r10 and %r9 .
# shuffle back to original assignments
=> 0x4090dc: mov %r14,%r11
=> 0x4090df: mov %r9,%r10
=> 0x4090e2: mov %r8,%r9
=> 0x4090e5: mov %rdi,%r8
=> 0x4090e8: mov %rsi,%rdi
=> 0x4090eb: mov %rdx,%rsi
=> 0x4090ee: mov %rcx,%rdx
=> 0x4090f1: mov %rbx,%rcx
=> 0x4090f4: mov %rax,%rbx
=> 0x4090f7: mov 0x88(%rsp),%rax
=> 0x4090ff: jmpq 0x408d2a
The registers seem to be getting reassigned here, data flowing from one to the next. In this particular path a lot of these register movements seem unnecessary and are only undone at the end without being used.
Maybe this is because these are reusable blocks and the movement is necessary when used in some other path?
Can this be avoided? Or at least avoided in a certain fast path somehow by hinting the compiler? Any pointers to the GHC code will be appreciated. I am not yet much familiar with the GHC code but can dig deeper pretty quickly. But before that I hope someone knowledgeable in this area can shed some light on this at a conceptual level or if at all it can be improved. I can provide more details and experiment more if needed.