Machine Lockups - Random
Posted: Mon Mar 13, 2006 8:13 pm
by Ruler
I've got 3 ZoneMinder machines, all running with the same hardware - AMD 2500+, a gig of RAM, 1500VA UPS, 350 watt PSU, 80 gig boot drive, 4x400 gig data drives RAIDed together, DVD+RW drive, Spectra-8 capture card w/expansion bracket. They all ran perfectly for the first 4 months or so of installation, but within the past week, 2 of these machines have started having problems; each has locked up twice. The screen will go black, though the monitor shows that it's still receiving a signal. No keyboard keys work, not even control-alt-delete. The machines refuse connections over the network. In short, the machine is frozen. The only way to access the machine is to press the reset button on the front of the case.
I have 30 in /proc/sys/kernel/panic, so if the machine panics, it'll reboot itself 30 seconds later. This is not happening though, so I assume that the machine isn't experiencing a kernel panic. (Further, even kernel panics have something displayed on the monitor.) The lockups don't happen at a given time each day, but are random throughout the day and night. The first of these three machines hasn't had a problem and it's running with the same hardware and software configuration. (Slackware 10.1, ZoneMinder 1.21.3.) Each has its own UPS connected to the server with a data communications cable; there are no power failure events listed in /var/log/apcupsd.events near the time of the lockups. The crashes on each machine happened on different days and at different times.
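A minimal sketch of checking the panic timeout described above. The /proc path is the one from the post; the `panic_timeout` function name and its optional file argument are made up for illustration.

```shell
#!/bin/sh
# Print the kernel's reboot-on-panic setting.
# panic_timeout is a hypothetical helper; the optional argument only exists
# so it can be pointed at a file other than the real /proc entry.
panic_timeout() {
    file="${1:-/proc/sys/kernel/panic}"
    if [ ! -r "$file" ]; then
        echo "cannot read $file"
        return 1
    fi
    secs=$(cat "$file")
    case "$secs" in
        0) echo "panic=0: no automatic reboot after a panic" ;;
        *) echo "panic=$secs: kernel reboots itself after a panic" ;;
    esac
}

# Usage:
#   panic_timeout
# To set 30 seconds until the next boot (as root):
#   echo 30 > /proc/sys/kernel/panic
```

This timer only fires on an actual kernel panic, so a hard freeze like the one described never reaches it.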
/var/log/messages doesn't show much of anything. Most of the time, there is a very long line of unprintable characters added to the log at the time of the lockup, but one time the log just indicated a system restart - the last item in the log is that an event started. (ZM is set to record continually in 5-minute chunks.)
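One way to locate those unprintable lines, so they can be lined up against the lockup times, is sketched below. `garbage_lines` is a hypothetical helper name, and the sketch assumes a grep that supports `-a` (GNU grep does).

```shell
#!/bin/sh
# Print the line numbers of log lines containing non-printable bytes.
# garbage_lines is a hypothetical helper name. LC_ALL=C makes the character
# classes byte-based (so non-ASCII bytes are flagged too); -a tells grep to
# treat the file as text even if it contains binary data.
garbage_lines() {
    LC_ALL=C grep -a -n '[^[:print:][:space:]]' "$1" | cut -d: -f1
}

# Usage:
#   garbage_lines /var/log/messages
```

The last timestamped entry before each reported line number gives an approximate time for that lockup.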
Anybody have any ideas as to how to proceed diagnosing a problem such as this?
Posted: Mon Mar 13, 2006 10:27 pm
by jameswilson
I had this too on a 2.6.11 kernel. I updated the kernel to 2.6.13 and it stopped. I also had out-of-memory errors, but it didn't panic, it just virtually stopped.
The other thing I'd check is ACPI.
Posted: Mon Mar 13, 2006 11:10 pm
by Ruler
These machines are running 2.4.29 kernel. What is acpi?
Posted: Mon Mar 13, 2006 11:20 pm
by jameswilson
2.4.29? Sorry, before my time.
ACPI is something to do with power management and has caused me tonnes of headaches. If I disable it in the BIOS, things go back to normal!
Not sure about the 2.4 kernel; does it fully support your hardware?
Posted: Tue Mar 14, 2006 7:59 am
by Flash_
ACPI is a pain in the arse on Linux - most kernels have a switch to disable it on boot.
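The boot switch in question is the `acpi=off` kernel parameter. A hedged sketch of checking whether the running kernel was booted with it follows; `acpi_disabled` and its optional file argument are hypothetical, and the LILO lines in the comments assume a standard /etc/lilo.conf.

```shell
#!/bin/sh
# Report whether the running kernel was booted with ACPI switched off.
# acpi_disabled is a hypothetical helper; the optional argument lets it read
# a file other than the real /proc/cmdline.
acpi_disabled() {
    cmdline="${1:-/proc/cmdline}"
    if grep -q -w 'acpi=off' "$cmdline"; then
        echo "ACPI disabled at boot"
    else
        echo "ACPI not disabled at boot"
    fi
}

# With LILO, the switch goes in the kernel's image section of /etc/lilo.conf:
#   append="acpi=off"
# then rerun /sbin/lilo and reboot. It can also be typed once at the boot
# prompt after the image label:  <label> acpi=off
```

Trying it once from the boot prompt first avoids editing lilo.conf before knowing whether it helps.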
Other suggestions: 5 HDDs and a hungry processor might be pushing a 350W supply; I'd be checking voltages or upgrading the PSU for testing.
Temperatures. Have sensord running for a bit; phpsysinfo is a useful widget too. hddtemp is also good. Again, 5 HDDs and a hungry CPU will be kicking out some heat. (No idea of your case layout/design/cooling - apologies if this is simplistic.)
A HDD might be failing - sometimes they don't fail cleanly and "hang" when sent data, especially when writing. I had a Seagate do just that here. Unlikely if two machines are doing the same thing, though.
Anyway, hope you can fix it.
Posted: Tue Mar 14, 2006 5:09 pm
by Ruler
I'm going to look into how to disable ACPI - what does it do anyways? Disabling it won't have any negative impacts, will it?
I'd thought of the power supply being maxed out too, but given that I have another identical machine next to it and two others with as much or more stuff in them, thought this was unlikely. Now I'm thinking that this is probably the most likely cause (other than a malfunctioning MB, RAM, etc). I don't have a bigger PSU available at the moment - maybe I'll connect another 350 watter and just run the 4 data drives off that one and leave everything else on the main PS. It'll at least tell me if that's the cause or not. If it is the PSU, I can buy bigger ones for all the servers.
I'm relatively certain that it's not heat. I've got 2x80mm fans blowing air through the hard drives, one 80mm side intake blowing air right onto the HSF, and a 120mm exhaust at the top rear. Heh, nobody has ever accused me of building quiet computers, but the only one that's overheated is the one somebody put in a wood cupboard and piled stuff around it so that it couldn't breathe at all.
A failing HD is a good thought; I'll run an fsck on them the next time I'm near the machines. (Don't want to do it remotely just in case something goes wrong.)
Posted: Tue Mar 14, 2006 5:14 pm
by jameswilson
ACPI is something to do with power management and how things are spoken to, I think. No, it won't have any bad effects. Well, none that are worse than you currently have lol
Posted: Wed Mar 22, 2006 7:05 pm
by Ruler
OK, I have more information about this.
I went out to the remote site last week, jumped a power supply to run constantly, and connected half the equipment in the box to the second PSU. (Unconventional, but I figured that this would eliminate insufficient power as a possible cause.) I also ran an fsck on all hard drives in the machine, including the 1.6 TB raid. Note to self: do this when you have other things to do - it takes about for-freakin-ever. Everything checked out OK, so I restarted and left. The machine crashed before I got back to my office.
I went out to the remote site again yesterday, intending to spend 10 minutes upgrading the kernel to 2.6.13.
5 hours later...
Did you know that RAID functions aren't enabled by default in 2.6.13? Also, udev does not create /dev/md0 automatically like it does under 2.4. (For other slackware users who encounter this, edit /etc/rc.d/rc.S and add 'mknod /dev/md0 b 9 0' and 'raidstart /dev/md0' after udev loads, then uncomment the '/sbin/modprobe raidn' line in /etc/rc.d/rc.modules and add '/sbin/modprobe md' immediately before it. This will remedy the situation and allow you to use RAID in 2.6.)
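A sketch of the rc.S additions described above, wrapped so they can be previewed before being run for real. The `run` and `setup_md0` helpers, the `DRY_RUN` variable, and the check for an existing device node are additions for illustration and are not part of the original fix.

```shell
#!/bin/sh
# Sketch of the rc.S additions described above, wrapped so they can be
# previewed first. run and setup_md0 are hypothetical helper names.

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Create the md device node if it is missing, then start the array.
setup_md0() {
    dev="${1:-/dev/md0}"
    [ -b "$dev" ] || run mknod "$dev" b 9 0
    run raidstart "$dev"
}

# In /etc/rc.d/rc.S, after udev has started:
#   setup_md0
# In /etc/rc.d/rc.modules, per the post:
#   /sbin/modprobe md
#   /sbin/modprobe raidn    # as written in the post
```

Setting `DRY_RUN=1` before calling `setup_md0` prints the commands instead of running them, which is a cheap check before rebooting a remote box.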
Anyways, I finally got the kernel upgraded and had high hopes - the machine hadn't puked the entire time, even with me throwing random objects at it.
Seriously though, I was ready to put my fist through the front panel - nothing would work and booting the old kernel caused a panic before it got fully booted; I seriously don't understand booting multiple kernels in linux. (I have lilo all configured to boot separate kernels, just as I have at home to boot DOS, windoze, and linux. Can't seem to boot two separate kernel versions though, no matter what I do.) Anyways, after I got everything up and going, I re-enabled the mysql and zoneminder start scripts and rebooted. BAM! Machine locked up as soon as zoneminder tried to start. I did this three times with the same results each time.
I disabled the start script for zoneminder and booted the machine, this time successfully. I decided to upgrade ZoneMinder from 1.21.3, just as a last resort before chucking the machine into the middle of the highway out front. Note: there should really be notes with the announcement of the new version that specify any additional dependencies. I downloaded two perl modules that ./configure was complaining about being missing, then gave up when these started bitching about not having other perl modules that they needed.
I yanked the machine and it's currently sitting on my desk running memtest86. It passed all the tests I ran on it for a weekend when I built it, and it hasn't failed yet. I'm guessing something in the motherboard doesn't like what ZoneMinder is doing?! (It's a Gigabyte board - 2 out of 15 such boards didn't work out of the box when I bought them. I really don't have a good solution for a basic, stable, nForce Socket A motherboard anymore - I used AOpen boards for a long time with great success, one failure in 80+ machines, but of course they don't make those anymore.)
Anybody see anything that I'm missing here? I'm planning on finishing the zoneminder update and when that doesn't work, replacing the motherboard and reinstalling everything so that it's exactly like the other servers again, although I know that this is a microshaft solution. I'm hoping somebody will see a simple solution for me. (Other than to quit my job, that is.)
Posted: Wed Mar 22, 2006 7:26 pm
by jameswilson
I have had issues before with a machine that became unstable after ZM started but would be stable for weeks without ZM. Turned out I had the wrong card number for it. Once I put the right card number in, it was great. I also had one machine that I built with the wrong size mobo spacers, and the card didn't sit correctly. Other than that, I suppose you gotta try another mobo. Or capture card. Do you have an IP cam to test ZM without the card and see?
Posted: Wed Mar 22, 2006 8:19 pm
by Ruler
The capture card is the spectra-8, the same as all the other ZM boxes I have, and seated in the same PCI slot as the others. (The other box that crashed above hasn't done it again, knock on wood, and the other box I have set up with this same configuration hasn't had a problem since a power failure caused a reboot 127 days ago.)
I don't have any ip cameras to test that with.
Posted: Wed Mar 22, 2006 8:23 pm
by jameswilson
Just a straw grab, mate. Did your last post imply that it's been better since you updated the kernel?
I have had issues with a P4 Abit board that I was pulling my hair out over. I thought it was memory to start with, even though it passed memtest. I tried everything. I even found the northbridge fan was failing and I thought it was that. It wasn't. It wasn't crashing, just going really slowly. Turns out (I think!) that something in the BIOS was corrupted. I pulled the battery on it and it's been fine since (played safe and updated the kernel to 2.6.14), but I think it was the BIOS.
Posted: Wed Mar 22, 2006 8:33 pm
by Ruler
All the kernel upgrade did (aside from pissing me off because the RAID wouldn't work) was make it so that ZM won't run at all. Under 2.4.29, ZM would start and run for an indeterminate period of time before locking up - not causing a kernel panic or turning off, but freezing and locking the keyboard. (It was connected to a KVM and the hotkey for the switch wouldn't even work!) Under 2.6.13, ZM won't even start without causing a lockup like this.
I'll yank the battery and see if that has an effect. It's a heck of a lot quicker and easier than replacing the motherboard.
Posted: Wed Mar 22, 2006 8:36 pm
by jameswilson
I'm sorry mate, I have no idea. The battery thing with me wasn't locking up though, just causing something to slow down after about 24 hours of highish ZM load. Good luck though, matey!
Posted: Wed Mar 22, 2006 9:00 pm
by Ruler
Oh. My. Dear. God.
I pulled the battery for 5 minutes and slapped it back in. I booted, then started zoneminder. It didn't lock up! I went into the interface and found several lines of 'incorrect key file for table: 'E'Try to repair it' before the regular interface began. I stopped zoneminder and went into mysql - sure enough, the events and frames tables both had errors. I repaired them and restarted zoneminder successfully. The system is now running; I'm going to let it be and see if it dies as it did before.
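For anyone hitting the same table errors, the repair step might look like the sketch below. `repair_sql` is a hypothetical helper; the database name `zm` and the capitalised table names are assumptions about a default ZoneMinder install, and REPAIR TABLE applies to MyISAM tables.

```shell
#!/bin/sh
# Emit CHECK/REPAIR statements for the named tables.
# repair_sql is a hypothetical helper; REPAIR TABLE applies to MyISAM tables.
repair_sql() {
    for table in "$@"; do
        printf 'CHECK TABLE %s;\nREPAIR TABLE %s;\n' "$table" "$table"
    done
}

# Usage, with ZoneMinder stopped (database name "zm" and the capitalised
# table names are assumptions; adjust to your install):
#   repair_sql Events Frames | mysql -u root -p zm
```

Printing the statements first makes it easy to review them before piping them into mysql.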
If this is all it was, I'm not sure whether to be very thankful or extremely mad.
Posted: Wed Mar 22, 2006 9:01 pm
by jameswilson
lol be thankful. I couldn't believe it on my rig when I found it.