This topic, or one very similar, appears to have been discussed before in the erlang-patches mailing list thread titled "erlang node crashes in erts_gc_after_bif_call" from October 2012 (http://erlang.org/pipermail/erlang-patches/2012-October/003072.html). No clear resolution was reached in that thread, and I am currently dealing with the problem in production systems, so I have decided to raise it on this mailing list.
Please see the bottom of the email for system specifications, which I believe to be largely unrelated (except possibly for multithreading).
Please feel free to request any pertinent information I may have left out, or to make suggestions to improve future bug reports. I don't often submit bug reports, and am not at all familiar with Erlang/OTP's particular practices in this regard.
## Scenario and Error ##
The error is a segmentation fault arising out of the erts_garbage_collect and check_process_code functions.
The scenario is as follows:
1) You must be hot-loading a module (in my case, this module is dynamically generated) periodically.
2) You must have non-suspended processes active in the module you are hot-loading while it is being loaded (though not necessarily *in* the code of the module; they may be using terms from the module or holding function references to the module).
3) Purging of the old version of the module must be happening at the same time as garbage collection (in my case, the garbage collection is explicit because of the use of large binary terms with relatively few reductions; that does not appear to be the case in the situation laid out in the previously mentioned thread).
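To make the three conditions above concrete, here is a minimal Erlang sketch of the pattern. It is illustrative only: the module and function names (purge_gc_sketch, purge_gc_dyn, data/0), the worker count, and the timings are all hypothetical, it is not the production code, and I make no claim that it triggers the fault on any particular release.

```erlang
-module(purge_gc_sketch).
-export([run/0, run/2]).

%% Name of the dynamically generated module (hypothetical).
-define(MOD, purge_gc_dyn).

run() -> run(4, 200).

%% Start Workers busy processes, then reload the generated module Rounds times.
run(Workers, Rounds) ->
    ok = load(0),
    Pids = [spawn(fun worker/0) || _ <- lists:seq(1, Workers)],
    lists:foreach(fun(N) -> ok = load(N), timer:sleep(10) end,
                  lists:seq(1, Rounds)),
    lists:foreach(fun(P) -> exit(P, kill) end, Pids),
    ok.

%% Condition 1: periodically hot-load a freshly generated module.
%% Purging the previous old version here is what may overlap with worker GC.
load(N) ->
    {ok, ?MOD, Bin} = compile:forms(forms(N)),
    _ = code:purge(?MOD),
    {module, ?MOD} = code:load_binary(?MOD, "purge_gc_dyn.erl", Bin),
    ok.

%% Build the abstract forms of a tiny module whose data/0 returns a literal.
forms(N) ->
    Src = [io_lib:format("-module(~s).", [?MOD]),
           "-export([data/0]).",
           io_lib:format("data() -> {~p, <<0:8192>>}.", [N])],
    [begin
         {ok, Tokens, _} = erl_scan:string(lists:flatten(S)),
         {ok, Form} = erl_parse:parse_form(Tokens),
         Form
     end || S <- Src].

%% Conditions 2 and 3: a running (non-suspended) process holds a term that
%% came from the module and forces a garbage collection while holding it.
worker() ->
    Term = ?MOD:data(),
    erlang:garbage_collect(),
    _ = erlang:phash2(Term),   % use Term afterwards so it stays live across the GC
    worker().
```

Calling purge_gc_sketch:run() from a shell returns ok once all rounds complete, provided the node survives. The workers are spawned without links so that a worker killed by a purge does not take the caller down with it.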
It appears, at least to my untrained eye, that garbage collection sweeps can occur at the same time as code purging, and that this seems to happen without multithreading protection. My reason for this suspicion is that in my production systems I began receiving one of two segmentation faults: one occurring in the function check_process_code (of erts/emulator/beam/beam_bif_load.c) and the other in erts_garbage_collect (of erts/emulator/beam/erl_gc.c). Most of the time *in production*, the segmentation fault occurred in the check_process_code function. Only sometimes did it appear to be coming from erts_garbage_collect.
## Reproducing the Error ##
It took a while, but I did ultimately manage to create an app which reliably produces this error (insofar as I can tell). Please see the app here: https://github.com/fauxsoup/erlang-sigsegv
There are some apparent differences from what I was observing in production, but this could possibly be related to differences between my production environment and my testing environment (which are non-trivial), and potentially differences between my minimal test case and the production service. Please see the bottom of this email for pertinent details about both environments.
For testing, and because my production deployment of Erlang does not include debug symbols, I recompiled Erlang/OTP 17.4 with the flags "-g -O2" to produce debug symbols and prevent aggressive optimizations which may distort the stacktrace.
The primary difference between the *results* of the error in production versus testing is that the segmentation fault in testing always comes from erts_garbage_collect. Using the minimal test case code, I have not been able to produce a single result in which the segmentation fault occurred in check_process_code.
Another difference, which I believe to be caused by the inclusion of debug symbols, is that erts_garbage_collect appears earlier in the backtrace in testing, and that the actual segmentation fault appears to come from the function sweep_one_area (erl_gc.c again). My assumption is that the optimization and lack of debug symbols in the production system merely obfuscated the origin of the segmentation fault there.
## The Backtrace ##
Included here for your convenience (also available in test case README):
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff3b3e700 (LWP 26743)]
sweep_one_area (n_hp=0x7fffe8862028, n_htop=0x7fffe8862c48, src=src@entry=0x7fffe9ec2028 "", src_size=src_size@entry=600224) at beam/erl_gc.c:1816
1816 mb->base = binary_bytes(*origptr);
#0 sweep_one_area (n_hp=0x7fffe8862028, n_htop=0x7fffe8862c48, src=src@entry=0x7fffe9ec2028 "", src_size=src_size@entry=600224) at beam/erl_gc.c:1816
#1 0x0000000000527ea0 in do_minor (nobj=1, objv=0x7ffff3b3dd50, new_sz=121536, p=0x7ffff5c80800) at beam/erl_gc.c:1160
#2 minor_collection (recl=<synthetic pointer>, nobj=1, objv=0x7ffff3b3dd50, need=0, p=0x7ffff5c80800) at beam/erl_gc.c:876
#3 erts_garbage_collect (p=0x7ffff5c80800, need=need@entry=0, objv=objv@entry=0x7ffff3b3dd50, nobj=nobj@entry=1) at beam/erl_gc.c:450
#4 0x000000000052877b in erts_gc_after_bif_call (p=0x7ffff5c80800, result=140736302308346, regs=<optimized out>, arity=<optimized out>) at beam/erl_gc.c:370
#5 0x0000000000571951 in process_main () at beam/beam_emu.c:2787
#6 0x00000000004a9a70 in sched_thread_func (vesdp=0x7ffff51cc8c0) at beam/erl_process.c:7743
#7 0x00000000006056fb in thr_wrapper (vtwd=0x7fffffffd9a0) at pthread/ethread.c:106
#8 0x00007ffff704d374 in start_thread () from /usr/lib/libpthread.so.0
#9 0x00007ffff6b8327d in clone () from /usr/lib/libc.so.6
## The Systems ##
Erlang/OTP 17.4 (also observed on Erlang R15B01)
Amazon EC2 c3.8xlarge (32 Virtual CPUs, ~64 GB Memory)
uname -a: Linux rtb0.ec2.chitika.net 3.2.0-4-amd64 #1 SMP Debian 3.2.63-2 x86_64 GNU/Linux
Intel Core i5 760 @ 2.80GHz (4 Logical CPUs, 2 cores IIRC), ~16GB Memory
Arch Linux (up-to-date)
uname -a: Linux diogenes 4.0.1-1-ARCH #1 SMP PREEMPT Wed Apr 29 12:00:26 CEST 2015 x86_64 GNU/Linux