The SIG11 problem

Signal 11 while compiling the kernel

This FAQ describes what the possible causes are for an effect that bothers lots of people lately. Namely that a linux(*)-kernel (or any other large package for that matter) compile crashes with a "signal 11". The cause can be software or (most likely) hardware. Read on to find out more.
(*) Of course nothing is Linux specific. If your hardware is flaky, Linux, Windows 3.1, FreeBSD, Windows NT and NextStep will all crash.

translations

There are multiple translations of this. I forgot to write down where. Let's build a list here.

Contact me if you find any spelling errors, worthwhile additions or with an "it also happened to me" story. (Note that I reject some suggested additions on my belief that it is technical nonsense). I would appreciate it if you put "sig11" or something like that in the subject.


QUESTION

Signal 11, what does that mean?

ANSWER

Signal 11, officially known as "segmentation fault", means that the program accessed a memory location that was not assigned to it. That's usually a bug in the program. So if you're writing your own program, that's the most likely cause. However, this FAQ will concentrate on the other possibilities.

QUESTION

My (kernel) compile crashes with

gcc: Internal compiler error: program cc1 got fatal signal 11

What is wrong with the compiler? Which version of the compiler do I need? Is there something wrong with the kernel?

ANSWER

Most likely there is nothing wrong with your installation, your compiler or kernel. It very likely has something to do with your hardware. There are a variety of subsystems that can be wrong, and there are a variety of ways to fix it. Read on, and you'll find out more. There are two exceptions to this "rule". You could be running low on virtual memory, or you could be installing Red Hat 5.x, 6.x or 7.x. There is more about this near the end.


QUESTION

OK, it may not be the software. How do I know for sure?

ANSWER

First, let's make sure it is the hardware that is causing your trouble. When the "make" stops, simply type "make" again. If it compiles a few more files before stopping, it must be hardware that is causing your trouble. If it immediately stops again (i.e. scans a few directories with "nothing to be done for xxxx" before bombing at exactly the same place), try

dd if=/dev/HARD_DISK of=/dev/null bs=1024k count=MEGS

Change HARD_DISK to the name of your hard disk (e.g. hda or sda; "df ." will tell you which one holds your sources). Change MEGS to the number of megabytes of main memory that you have. This reads the first MEGS megabytes of your hard disk, pushing the C source files and the gcc binary out of the cache so that they are reread from disk the next time you run make. Now type "make" again. If it still stops in the same place, I'm starting to wonder if you're reading the right FAQ, as it is starting to look like a software problem after all... Take a peek at the "what are the other possibilities" question. If the compiler keeps stopping at the same place without this "dd" command, but moves to another place after you use it, you definitely have a disk->RAM transfer problem.
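A sketch of putting that together automatically, assuming /proc/meminfo is available; the device name "sda" is a placeholder, substitute your own disk:

```shell
# Read total RAM in megabytes from /proc/meminfo, then print the
# dd command that flushes the cached sources and the gcc binary.
megs=$(awk '/MemTotal/ {print int($2 / 1024)}' /proc/meminfo)
echo "dd if=/dev/sda of=/dev/null bs=1024k count=$megs"
```

Run the printed command as root, then type "make" again and see whether the crash moves.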

QUESTION

What does it really mean? Are you sure it's a hardware problem?

ANSWER

Well, the compiler accessed memory outside its memory range. If this happens on working hardware, it's a programming error inside the compiler. That's why it says "internal compiler error". However, when the hardware occasionally flips a bit, gcc uses so many pointers that it is likely to end up accessing something outside of its addressing range. (Random addresses are mostly outside your addressing range: even though your main memory may be a significant portion of 4G nowadays, usually only a small portion is mapped into any one process. :-) It seems that nowadays everybody with "signal 11" problems gets directed to this page. If you're developing your own software, or have software that hasn't been debugged quite enough, "signal 11" (or segmentation fault) is still a very strong hint that there is something wrong with the program. Only when a program like "gcc", which works for almost everybody else, crashes on a dataset (e.g. the Linux kernel sources) that has also been well tested, does it become a hint that there is something wrong with your hardware. If some software component like a hardware driver in your system is broken, it could cause symptoms that are VERY close to those of a hardware failure. However, when a driver is faulty it is more likely to cause serious trouble inside the kernel than to just cause the compiler to crash.


QUESTION

OK, I may have a hardware problem. What is it?

ANSWER

If it happens to be the hardware, it can be almost anything: the RAM, the cache, the CPU (overheating or overclocking), the motherboard, the power supply, or the disk-to-RAM transfers. The questions below cover the most common causes.


QUESTION

RAM timing problems? I fiddled with the bios settings more than a month ago. I've compiled numerous kernels in the mean time and nothing went wrong. It can't be the RAM timing. Right?

ANSWER

Wrong. Do you think that the RAM manufacturers have one machine that makes 60ns RAMs and another that makes 70ns RAMs? Of course not! They make a bunch, and then test them. Some meet the spec for 60ns, others don't. Those might be 61ns if the manufacturer had to put a number on them. In that case it is quite likely that the chip works in your computer when, for example, the temperature is below 40 degrees centigrade (chips become slower as the temperature rises; that's why some supercomputers need so much cooling).
However, "the coming of summer" or a long compile job may push the temperature inside your computer over the "limit". -- Philippe Troin (ptroin@compass-da.com)

QUESTION

I got suckered into not buying ECC memory because it was slightly cheaper. I feel like a fool. I should have bought the more expensive ECC memory. Right?

ANSWER

Buying the more expensive ECC memory and motherboards protects you against a certain type of error: bits flipped at random, for example by a passing alpha particle.
Because most people can reproduce "signal 11" problems within half an hour using "gcc" but cannot reproduce them by memory testing for hours in a row, that proves to me that it is not simply a random alpha particle flipping a bit. That would get noticed by the memory test too. This means that something else is going on.
I have the impression that most sig11 problems are caused by timing errors on the CPU <-> cache <-> memory path. ECC on your main memory doesn't help you in that case.
When should you buy ECC? a) When you feel you need it. b) When you have LOTS of RAM. (Why not a cut-off number? Because the cut-off changes with time, just like "LOTS".) Some people feel very strongly that everybody should use ECC memory. I refer them to reason "a)".

QUESTION

Memory problems? My BIOS tests my memory and tells me it's OK. I have this fancy DOS program that tells me my memory is OK. Can't be the memory, right?

ANSWER

Wrong. The memory test in the BIOS is utterly useless. It may even occasionally OK more memory than really is available, let alone test whether it is good or not.
A friend of mine used to have a 640k PC (yeah, this was a long time ago) which had a single 64kbit chip instead of a 256kbit chip in the second 256k bank. This means that he effectively had 320k working memory. Sometimes the BIOS would test 384k as "OK". Anyway, only certain applications would fail. It was very hard to diagnose the actual problem...
Most memory problems only occur under special circumstances. Those circumstances are hardly ever known. gcc seems to exercise them. Some memory tests, especially BIOS memory tests, don't. I'm no longer working on creating a floppy with a Linux kernel and a good memory tester on it. Forget about bugging me about it...
The reason is that a memory test makes the CPU execute just a few instructions, and the memory access patterns tend to be very regular. Under these circumstances only a very small subset of the failure modes shows up. If you're studying Electrical Engineering and are interested in memory testing, a master's thesis could be to figure out what's going on. There are computer manufacturers that would want to sponsor such a project with hardware that clients claim to be unreliable, but that doesn't fail the production tests...

QUESTION

Does it only happen when I compile a kernel?

ANSWER

Nope. There is no way your hardware can know that you are compiling a kernel. It just so happens that a kernel compile is very tough on your hardware, so the crashes simply happen a lot while you are compiling a kernel. Compiling other large packages like gcc or glibc also often triggers the sig11.
* People have seen "random" crashes for example while installing using the slackware installation script.... -- dhn@pluto.njcc.com
* Others get "general protection errors" from the kernel (with the crashdump). These are usually in /var/adm/messages. -- fox@graphics.cs.nyu.edu
* Some see bzip2 crash with "signal 11" or with an "internal assertion failure (#1007)". bzip2 is pretty well tested, so if it crashes, it's likely not a bug in bzip2. -- Julian Seward (jseward@acm.org)

QUESTION

Is it always signal 11?

ANSWER

Nope. Other signals like four, six and seven also occur occasionally. Signal 11 is most common though.
As long as memory is getting corrupted, anything can happen. I'd expect bad binaries to occur much more often than they really do. Anyway, it seems that the odds are heavily biased towards gcc getting a signal 11. Also seen:
* free_one_pmd: bad directory entry 00000008
* EXT2-fs warning (device 08:14): ext_2_free_blocks bit already cleared for block 127916
* Internal error: bad swap device
* Trying to free nonexistent swap-page
* kfree of non-kmalloced memory ...
* scsi0: REQ before WAIT DISCONNECT IID
* Unable to handle kernel NULL pointer dereference at virtual address c0000004
* put_page: page already exists 00000046
* invalid operand: 0000
* Whee.. inode changed from under us. Tell Linus
* crc error -- System halted (during the uncompress of the Linux kernel)
* Segmentation fault
* "unable to resolve symbol"
* make [1]: *** [sub_dirs] Error 139
  make: *** [linuxsubdirs] Error 1
* The X Window system can terminate with a "caught signal xx"
The first few are cases where the kernel "suspects" a kernel programming error that is actually caused by the bad memory. The last few point to application programs that end up in trouble.
-- S.G.de Marinis (trance@interseg.it)
-- Dirk Nachtmann (nachtman@kogs.informatik.uni-hamburg.de)
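The "Error 139" from make in the list above is signal 11 in disguise: a shell reports a child killed by signal N as exit status 128+N, and 128+11 = 139. A quick way to see this, using kill -SEGV as a stand-in for a genuinely crashing compiler:

```shell
# A child killed by SIGSEGV (signal 11) yields exit status 128+11 = 139.
sh -c 'kill -SEGV $$'
echo "exit status: $?"
```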

QUESTION

What do I do?

ANSWER

Here are some things to try when you want to find out what is wrong. Note: some of these will significantly slow your computer down. They are intended to get your computer to function properly and to let you narrow down what's wrong with it. With that information you can, for example, try to get the faulty component replaced by your vendor.
* Jumper the motherboard for lower CPU and bus speed.
* Go into the BIOS and tell it "Load BIOS defaults". Make sure you write the disk drive settings down beforehand.
* Disable the cache (BIOS) (or pull it out if it's on a "stick").
* Boot the kernel with "linux mem=4M" (disables memory above 4Mb).
* Try taking out half the memory. Try both halves in turn.
* Fiddle with settings of the refresh (BIOS)
* Try borrowing memory from someone else. Preferably this should be memory that runs Linux flawlessly in the other machine... (Silicon Graphics Indy machines are also nice targets to borrow memory from.)
* If you want to verify if a solution really works try the following script:
#!/bin/sh
#set -x
t=1
while [ -f log.$t ]
do
    t=$(expr $t + 1)
done
while true
do
    make clean
    make -k bzImage > log.cur 2>&1
    mv log.cur log.$t
    t=$(expr $t + 1)
done
All the resulting logfiles should be the same (i.e. the same size, and the same contents). Every kernel build takes around 4 minutes on a 1GHz Athlon with 512Mb of memory. (and about 3 months on a 386 with 4Mb :-).
* Another way to test whether your current setup is stable is to run "md5sum" on files of different sizes (create one with dd if=/dev/urandom of=testfile bs=1024k count=MEGS). If you use a file twice the size of your RAM, you'll be exercising your disk. If you use a file 4 to 10Mb smaller than your RAM, you'll exercise your RAM/CPU.
Whether this method catches all possible problems is uncertain. gcc executes lots of different instructions in different orders, and md5sum might simply not hit the right sequence of instructions that gcc does. But if md5sum leads to errors, it might do so quicker than a kernel compile. -- Rob Ludwick (rob@no-spam)
The hardest part is that most people will be able to do all of the above except borrowing memory from someone else, and it doesn't make a difference. This makes it likely that it really is the RAM. RAM used to be one of the priciest parts of a PC, so you would rather not arrive at this conclusion, but, I'm sorry, I get lots of reactions that in the end turn out to be the RAM. However don't despair just yet: your RAM may not be completely wasted: you can always try to trade it in for different or more RAM.
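The md5sum idea above can be sketched as a small loop; the file name, the 8Mb size and the 20 repetitions are placeholder values, pick a size relative to your RAM as described above. On sound hardware every checksum must match; a single mismatch points at flaky RAM, cache or disk transfers.

```shell
# Checksum the same file repeatedly; any mismatch means corrupted data
# somewhere on the disk <-> RAM <-> CPU path.
f=testfile
dd if=/dev/urandom of=$f bs=1024k count=8 2>/dev/null  # size is a placeholder
ref=$(md5sum $f | cut -d' ' -f1)
n=0
while [ $n -lt 20 ]
do
    cur=$(md5sum $f | cut -d' ' -f1)
    if [ "$cur" != "$ref" ]
    then
        echo "MISMATCH after $n runs: $cur != $ref"
        exit 1
    fi
    n=$(expr $n + 1)
done
echo "OK: $n identical checksums"
```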

QUESTION

I had my RAMs tested in a RAM-tester device, and they are OK. Can't be the RAM right?

ANSWER

Wrong. It seems that the errors that currently occur in RAMs are not detectable by RAM-testers. It might be that your motherboard is accessing the RAMs in dubious ways or otherwise messing up the RAM while it is in YOUR computer. The advantage is that you can sell your RAM to someone who still has confidence in his RAM-tester...

QUESTION

What other hardware could be the problem?

ANSWER

Well, any hardware problem inside your computer. But things that are easy to check should be checked first. For example, all your cards should be correctly seated in the motherboard.

QUESTION

Why is the Red Hat install bombing on me?

ANSWER

The Red Hat 5.x, 6.x and 7.x install has problems on some machines. Try running the install with only 32M. This can usually be done with mem=32M as a boot parameter.
It could be that there is a read error on the CD. The installer handles this less than perfectly... Make sure that your CD is flawless! It seems that the installer will bomb on marginal CDs!
People report, and I've seen with my own eyes, that Red Hat installs can go wrong (crash with signal 7 or signal 11) on machines that are perfectly in order. My machine was and still is 100% reliable (actually the machine I tested this on, is by now reliably dead). People are getting into trouble by wiping the old "working just fine" distribution, and then wanting to install a more recent Red Hat distribution. Going back is then no longer an option, because going back to 5.x also results in the same "crashes while installing".
Patrick Haley (haleyp@austin.rr.com) reports that he tried all memory configurations up to 96Mb (32 & 64) and found that only when he had 96Mb installed, the install would work. This is also consistent with my own experience (of Red Hat installs failing): I tried the install on a 32M machine.
NEW: It seems that this may be due to a kernel problem. The kernel may (temporarily) run low on memory and kill the current process. The fix by Hubert Mantel (mantel@suse.de) is at: https://juanjox.linuxhq.com/patch/20-p0459.html.
If this is actually the case, try switching to the second virtual console (ctrl-alt-F2) and type "sync" there every few seconds. This reduces the amount of memory taken by harddisk-buffers... I would really appreciate hearing from you if you've seen the Red Hat install crash two or more times in a row, and then were able to finish the install using this trick!!!
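If typing "sync" by hand every few seconds is too tedious, a small loop on that second console can do it for you. A sketch (the count and 1-second interval here are demonstration values; during a real install you'd use "while true" and something like 5 seconds):

```shell
# Flush dirty disk buffers at a fixed interval, to keep them from
# eating the little RAM the installer has left.
count=3      # demonstration value; loop forever during a real install
interval=1   # seconds; 5 or so is fine in practice
i=0
while [ $i -lt $count ]
do
    sync
    sleep $interval
    i=$(expr $i + 1)
done
echo "synced $i times"
```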
What do you do to get around this problem?...
* Use SuSE. It's better: It doesn't crash during the installation. (Moreover, it actually is better. ;-)
* Maybe you're running into a bad block on your CD. This can be drive-dependent. If that's the case, try making a copy of the CD in another drive. Or try borrowing someone else's copy of Red Hat.
* Try configuring a GIGABYTE of swap. I have two independent reports that a gig of swap got them through the install. Please report to me if it helps!
* Modify the "settings" for the harddisk. Changing the setting from "LBA" to "NORMAL" in the bios has helped for at least one person. If you try this, I'd really appreciate it if you'd EMail me: I would like to hear from you if it helps or not. (and what you exactly changed to get it to work)
* I got my machine to install by installing a minimal base system, and then adding packages to the installed system.
* Someone suggested that the machine might be out-of-memory when this happens. Try having a swap partition ready. Also, the install may be "prepared" to handle low mem situations, but misjudging the situation. For example, it may load a RAMDISK, leaving just 1M of free RAM, and then trying to load a 2M application. So if you have 16M of RAM, booting with mem=14M may actually help, as the "load RAMDISK" stage would then fail and the install would then know to run off the CD instead of off the RAMDISK. (installs used to work for >8M machines. Is that still true?)
* Try, in one session, to clear the disk of all the partitions that are going to be used by Linux. Reboot. Then try the install, either partitioning manually or letting the install program figure it out. (I take it that Red Hat has that possibility too; SuSE has it...) If this works for you, I'd appreciate it if you'd tell me.
* A corrupted download can also cause this. Duh.
* Someone reports that installs on 8Mb machines no longer work, and that the install ungracefully exits with a sig7. -- Chris Rocco (crocco@earthlink.net)
* One person reports that disabling "BIOS shadow" (system & VIDEO), helped for him. As Linux doesn't use the BIOS, shadowing it doesn't help. Some computers may even give you 384k of extra RAM if you disable the shadowing. Just disable it, and see what happens. -- Philippe d'Offay (pdoffay@pmdsoft.com).

QUESTION

I found that running ..... detects errors much quicker than just compiling kernels. Please mention this on your site.

ANSWER

Many people email me with notes like this. However, what many don't realize is that they encountered ONE case of problematic hardware. The person recommending "unzip -t" happened to have a certain broken DRAM stick. And unzip happened to "find" that much quicker than a kernel compile.
However, I'm sure that for many other problems the kernel compile WOULD find it, while other tests don't. I think that the kernel compile is good because it stresses lots of different parts of the computer, while many other tests exercise just one area. If that area happens to be broken in your case, the quick test will show the problem much sooner than a kernel compile will. But if your computer is OK in that area and broken in another, the "faster" test may just tell you your computer is OK, while the kernel compile test would have told you something was wrong.
In any case, I might just as well list what people think are good tests. They are, but they are not as general as the "try and compile a kernel" test:
* Run unzip while compiling kernels. Use a zipfile about as large as RAM.
* use "memtest86" found at: https://www.memtest86.com/.
* do dd if=/dev/hda of=/dev/null while compiling kernels.
* run md5sum on large trees.
Note that whatever fast method you may find to tell you that your computer is broken, it won't guarantee your computer is fine if such a test suddenly doesn't fail anymore. I always recommend that after fiddling with things to make it work, you should run a 24-hour kernel-compile test.

QUESTION

I don't believe this. To whom has this happened?

ANSWER

Well for one it happened to me personally. But you don't have to believe me. It also happened to:
* Johnny Stephens (icjps@asuvm.inre.asu.edu)
* Dejan Ilic (d92dejil@und.ida.liu.se)
* Rick Tessner (rick@myra.com)
* David Fox (fox@graphics.cs.nyu.edu)
* Darren White (dwhite@baker.cnw.com) (L2 cache)
* Patrick J. Volkerding (volkerdi@mhd1.moorhead.msus.edu)
* Jeff Coy Jr. (jcoy@gray.cscwc.pima.edu) (Temp problems)
* Michael Blandford (mikey@azalea.lanl.gov) (Temp problems: CPU fan failed)
* Alex Butcher (Alex.Butcher@bristol.ac.uk) (Memory waitstates)
* Richard Postgate (postgate@cafe.net) (VLB loading)
* Bert Meijs (L.Meijs@et.tudelft.nl) (bad SIMMs)
* J. Van Stonecypher (scypher@cs.fsu.edu)
* Mark Kettner (kettner@cat.et.tudelft.nl) (bad SIMMs)
* Naresh Sharma (n.sharma@is.twi.tudelft.nl) (30->72 converter)
* Rick Lim (ricklim@freenet.vancouver.bc.ca) (Bad cache)
* Scott Brumbaugh (scottb@borris.beachnet.com)
* Paul Gortmaker (paul.gortmaker@anu.edu.au)
* Mike Tayter (tayter@ncats.newaygo.mi.us) (Something with the cache)
* Benni ??? (benni@informatik.uni-frankfurt.de) (VLB Overloading)
* Oliver Schoett (os@sdm.de) (Cache jumper)
* Morten Welinder (terra@diku.dk)
* Warwick Harvey (warwick@cs.mu.oz.au) (bit error in cache)
* Hank Barta (hank@pswin.chi.il.us)
* Jeffrey J. Radice (jjr@zilker.net) (Ram voltage)
* Samuel Ramac (sramac@vnet.ibm.com) (CPU tops out)
* Andrew Eskilsson (mpt95aes@pt.hk-r.se) (DRAM speed)
* W. Paul Mills (wpmills@midusa.net) (CPU fan disconnected from CPU)
* Joseph Barone (barone@mntr02.psf.ge.com) (Bad cache)
* Philippe Troin (ptroin@compass-da.com) (delayed RAM timing trouble)
* Koen D'Hondt (koen@dutlhs1.lr.tudelft.nl) (more kernel error messages)
* Bill Faust (faust@pobox.com) (cache problem)
* Tim Middlekoop (mtim@lab.housing.fsu.edu) (CPU temp: fan installed)
* Andrew R. Cook (andy@anchtk.chm.anl.gov) (bad cache)
* Allan Wind (wind@imada.ou.dk) (P66 overheating)
* Michael Tuschik (mt2@irz.inf.tu-dresden.de) (gcc2.7.2p victim)
* R.C.H. Li (chli@en.polyu.edu.hk) (Overclocking: ok for months...)
* Florin (florin@monet.telebyte.nl) (Overclocked CPU by vendor)
* Dale J March (dmarch@pcocd2.intel.com) (CPU overheating on laptop)
* Markus Schulte (markus@dom.de) (Bad RAM)
* Mark Davis (mark_d_davis@usa.pipeline.com) (Bad P120?)
* Josep Lladonosa i Capell (jllado@arrakis.es) (PCI options overoptimization)
* Emilio Federici (mc9995@mclink.it) (P120 overheating)
* Conor McCarthy (conormc@cclana.ucd.ie) (Bad SIMM)
* Matthias Petofalvi (mpetofal@ulb.ac.be) ("Simmverter" problem)
* Jonathan Christopher Mckinney (jono@tamu.edu) (gcc2.7.2p victim)
* Greg Nicholson (greg@job.cba.ua.edu) (many old disks)
* Ismo Peltonen (iap@bigbang.hut.fi) (irq_unmasking)
* Daniel Pancamo (pancamo@infocom.net) (70ns instead of 60 ns RAM)
* David Halls (david.halls@cl.cam.ac.uk)
* Mark Zusman (marklz@pointer.israel.net) (Bad motherboard)
* Elizabeth Ayer (eca23@cam.ac.uk) (Power management features)
* Thorsten Kuehnemann (thorsten@actis.de)
* (Email me with your story, you might get to be mentioned here... :-)---- Update: I like to hear what happened to you. This will allow me to guess what happens most, and keep this file as accurate as possible. However I now have around 500 different Email addresses of people who've had sig-11 problems. I don't think that it is useful to keep on adding "random" people's names on this list. What do YOU think?

I'm interested in new stories. If you have a problem and are unsure about what it is, it may help to contact me. My curiosity will usually drive me to keep answering your questions until you find out what the problem is... (On the other hand, I do get pissed when your problem is clearly described above :-)

© 1996-2024 - BitWizard B.V. is a registered trademark