Major system instability problem with fedora 7/xen

JimNoble
Posts: 58
Joined: Thu Jul 29, 2004 12:12 am

Post by JimNoble »

I've been moving zoneminder over to a new machine, and in the process I think I've discovered a potential problem that can cause significant system instability, and not just with zoneminder.

I was beginning to think the new machine had a serious h/w problem, as I was experiencing a wide range of crashes, errors and just plain odd behaviour. Firefox and Thunderbird would only run for at best a couple of minutes before crashing out with a Floating Point Exception; Gnome desktop applets would vanish frequently for no apparent reason; and various system services would stop running abruptly several times a day leaving little or no clues as to how or why.

Mostly I noticed this with zoneminder, slimserver, and a custom process I run that collects data from 1-wire sensors. Some of the zm problems were caused by the mysqld process crashing on a "signal 8", which I've subsequently discovered is another Floating Point Exception.
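
For anyone wanting to check what a bare signal number in a log means, here is a minimal Python sketch. The helper name `signal_name` is made up for illustration, and the numbering assumed is the usual Linux one, where 8 is SIGFPE.

```python
import signal


def signal_name(number: int) -> str:
    """Return the symbolic name for a signal number on this platform."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # Not a signal number known to this platform.
        return f"unknown signal {number}"


print(signal_name(8))
```

On Linux this should confirm that the mysqld "signal 8" is the same Floating Point Exception that Firefox and Thunderbird were reporting.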

However, I think I've finally managed to track down the cause. Some web searching I did recently pointed me in the direction of glibc, and then I noticed that the old machine, which has always been relatively stable, was running glibc-2.6-3, whereas the new machine had been yum updated to 2.6-4 back in July.

First I tried upgrading again to 2.6-90 from the fedora development repo, but this had no discernible effect. I then tried to downgrade to 2.6-3, inadvertently hosing the machine by pulling the glibc-2.6-90 rug from beneath its feet in a most undignified fashion. :shock: Whoops. :roll: Composure re-gathered, I semi-gracefully recovered the situation with the help of the install CD, and managed to get it up and running again with glibc-2.6-3 in place.

The result? It's been up around 3 hours now, and so far none of the previous problems have resurfaced.

In fact the only thing that's appeared in zmdc.log in that time is an abnormal exit from zmc for a known-flaky remote IP webcam monitor which is connected via both a wi-fi bridge and a Homeplug Ethernet-over-mains link.

I'm not sure how I can best report this back to the glibc folks, but if any of you are running fedora-7 or one of its close relations, I'd avoid glibc-2.6-4 if I were you.

Hope that's useful to someone...

Jim
Last edited by JimNoble on Sat Sep 08, 2007 5:32 pm, edited 1 time in total.
andywright
Posts: 10
Joined: Mon Aug 20, 2007 7:15 pm
Location: Leeds, UK

Post by andywright »

Hi Jim,

I've been running FC7 on my desktop machine for a while and am also now on glibc 2.6-4, but I haven't seen any of the problems you describe. I use firefox & thunderbird a lot and often fire up XP inside vmware. About the only thing I've had crash on me lately is the new release of Google Earth. My machine is an Athlon 64 (but I run 32 bit FC7).

By the way, good to see another Slimserver user :) (Mine's running on a Centos server...)

Andy.
JimNoble

Post by JimNoble »

Just out of interest, do you have the i386 or the i686 version of the rpm installed? Looking back through my yum logs, it appears to have installed the i386 version for all of the glibc-<something> packages, and the i686 version of the main glibc package, even though there is an i386 version available. I've no idea if that's a problem or not though.
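
In case it helps with comparing machines, here is a rough Python sketch for listing which architecture each installed glibc package has. It assumes the stock `rpm` query options `-qa` and `--qf` with the `%{NAME}`, `%{VERSION}`, `%{RELEASE}` and `%{ARCH}` tags. The function names are made up for illustration.

```python
import subprocess
from typing import List, Tuple

# One package per line: name, version-release, architecture.
QUERY_FORMAT = "%{NAME} %{VERSION}-%{RELEASE} %{ARCH}\n"


def parse_glibc_rows(output: str) -> List[Tuple[str, str, str]]:
    """Keep only the glibc packages from rpm query output."""
    rows = []
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0].startswith("glibc"):
            rows.append((parts[0], parts[1], parts[2]))
    return rows


def installed_glibc_packages() -> List[Tuple[str, str, str]]:
    """Ask rpm for every installed package and filter for glibc."""
    result = subprocess.run(
        ["rpm", "-qa", "--qf", QUERY_FORMAT],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_glibc_rows(result.stdout)
```

Calling `installed_glibc_packages()` on an rpm-based system should return tuples of (name, version-release, arch), which makes an i386/i686 mix like the one described above easy to spot.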

I wonder if it's an intel-only or even processor specific bug? (The machine has a T2500 Core Duo Mobile cpu).

Jim
JimNoble

Post by JimNoble »

PS. Hate to say it but my slimp3 mainly gets used as a glorified clock in the lounge! :lol:

Jim
andywright

Post by andywright »

I compiled from source with -march=athlon rather than using rpms....

Andy.
JimNoble

Post by JimNoble »

An update:

I thought it was fixed, but it's not. It's certainly more stable with glibc-2.6-3 than with 2.6-4, but I'm still seeing signal-8 (FP exception) and signal-11 (segfault) in many different services and processes, and zma/zmc crashes with signal-255 (?).
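
On the "signal-255" question: Linux signal numbers only go up to 64, so one possible reading (an assumption, I haven't checked how zmdc reports it) is that 255 is an exit status rather than a signal. A minimal Python sketch of telling the two apart, using the `subprocess` convention that a negative return code -N means the child was killed by signal N. The function name `describe_returncode` is made up for illustration.

```python
import signal


def describe_returncode(returncode: int) -> str:
    """Say whether a subprocess return code means a signal or a normal exit."""
    if returncode < 0:
        # subprocess reports death by signal N as return code -N on POSIX.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exited with status {returncode}"


for code in (-8, -11, 255):
    print(code, "->", describe_returncode(code))
```

A process that calls exit(-1) ends up with status 255, which would look different from a real signal death when decoded this way.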

glibc was a red herring, but I think I'm getting closer to the real cause.

The problem seems to lie somewhere with Xen. If I boot with a non-xen kernel, the system is absolutely stable - at least it was for the few hours I ran it... (Ironically the Windows XP VM that Xen is running works flawlessly.)

I've found vague references on the web to problems with the Fedora-7 releases of Xen. Perhaps a side effect of the kernel being compiled against Xen 3.0.* when the installed Xen tools are v3.1.

There are some suggestions that using the 64-bit Hypervisor with 32-bit dom0 and domU avoids the problem - I tried that but the machine hangs as soon as xendomains starts, even if I stop the winxp.hvm auto-starting.

Similarly turning off SMP for dom0 is supposed to help, but I've yet to try that.

The symptoms are similar to a problem with previous versions of Xen whereby glibc's TLS would clash with Xen when they both try to use the top end of available memory, causing segmentation faults. But that is supposed to be fixed now - a xen kernel will use the "nosegneg" version of glibc, which avoids the negative segment offsets that TLS accesses would otherwise use.

If/when I manage to track down the actual problem I'll report back again...

Jim