
Posted: Tue Jul 12, 2005 8:57 pm
by cordel
Here's what runs daily on my video server. These are mostly all defaults in Fedora:
/etc/cron.daily/prelink
/etc/cron.daily/logrotate
/etc/cron.daily/yum.cron
/etc/cron.daily/slocate.cron
/etc/cron.daily/0anacron
/etc/cron.daily/00-makewhatis.cron
/etc/cron.daily/rpm
/etc/cron.daily/00webalizer
/etc/cron.daily/00-logwatch
/etc/cron.daily/tmpwatch

I would think that if anything is going to load the system, it would be slocate.
The only one I started myself was yum.
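If you want to see what kind of load any of them actually causes, you could just run them by hand under time - a rough sketch using the script names from my list above:

time /etc/cron.daily/slocate.cron
time /etc/cron.daily/logrotate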

Posted: Tue Jul 12, 2005 10:05 pm
by Ruler
I just checked and the CPU is running between 60 and 65% idle. No idea what that drops to during the log rotation, though.

What does slocate do? I looked through the man page for the updatedb command that it calls, and it didn't sound like it was too big a deal. While googling, I found a site discussing logrotate and how it places a heavy burden on the server - it was discussed in terms of a laptop that was never on when the daily cron fired, and the guy's wife being POed that her system was slow in the morning because he'd put the job in the startup script.

Posted: Tue Jul 12, 2005 11:01 pm
by cordel
slocate scans all the files on your drive and places the location info for each file into a database. Then you can use locate <name> and it returns anything with that name in it.
I have never watched to see what kind of a load it places, though I might now.
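Roughly like this, if you want to try it by hand (the file name is just a made-up example):

updatedb             # roughly what the cron job does: rebuild the file location database
locate somefile.avi  # lookups then come straight out of the database, almost instantly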

Posted: Wed Jul 13, 2005 3:42 pm
by Ruler
Since I don't use locate, I think I'll disable this cron job and see what happens over the next couple of weeks.

Does logrotate place a heavy burden on the system? I'd think that it'd just delete the .5, rename .4 to .5, .3 to .4 and so on, then rename and recreate the main log file - not something that would be particularly stressful, IMHO. http://www.thisishull.net/archive/index.php/t-7470.html seems to disagree with this logic, though. Maybe I'll add a couple of echo commands to the script to log to a file exactly when it begins and ends, so that I can tell how long it runs. The good (and bad) thing about this is that it takes 6-8 minutes after the daily cron jobs fire for the machine to panic, so I know for a fact that if it is something from cron causing the problems, it's got to be a fairly major task. (Then again, it could just be coincidence, in which case I'm screwed. ;) :lol: )
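Something like this is what I have in mind for the echo wrapper - just a sketch, and /var/log/cron-times.log is a name I'm making up:

# first line added at the top of /etc/cron.daily/logrotate, last line at the very bottom
echo "logrotate start: `date`" >> /var/log/cron-times.log
... existing contents of the script ...
echo "logrotate end: `date`" >> /var/log/cron-times.log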

Posted: Wed Jul 13, 2005 4:07 pm
by Ruler
Upon thinking about it some more, I'm going to leave slocate enabled and simply add echoes around it to log when it starts/stops. This way, I'll be able to tell for certain which of these two processes is causing the problem.

I had another thought. I have 8 cameras recording 24x7, all recording to a RAID array. /video is understandably huge - 500 gig. Could slocate be choking because there is such a huge number of files? (I'm thinking either strangling because of processing time or, more likely, the 4 gig file barrier?) The only problem with this theory is that my other ZM box has 16 cameras on 24x7 and a terabyte RAID in it, and that machine has hardly had any trouble. That one is recording at 320x240@2.5fps versus this one's 640x480@1fps, so the problem should be worse on the box with the larger number of files.

Posted: Wed Jul 13, 2005 8:57 pm
by SyRenity
Hi Ruler.

Care to list the specs of your terabyte server? It would be very interesting to know what hardware one needs for such a load.

Thanks!

Posted: Fri Jul 15, 2005 4:04 pm
by Ruler
I believe that I've found the source of this kernel panic problem.

I was scheduled to go out of town yesterday morning, so of course this is when the problem became severe enough to effectively troubleshoot. It became bad enough that the machine wouldn't stay up for 5 minutes before panicking. During my troubleshooting, I replaced the heatsink, then the CPU (thinking that the large load had caused it to overheat and become damaged). This theory didn't make too much sense to me, as it had no problems at all during the memory test, but I didn't know what else to do. As soon as I booted the machine up with the new CPU in it, no more problem. "Hmmmm....", I thought, "Must be that the chip got too hot." I started thinking of a different heatsink.

While verifying that zoneminder was indeed up and running, I found that there was an error in the web interface. It said something about events.myi being inaccessible. "Greeeeeaaaaattt...... database corruption." By this time, I was already quite late for my trip, but pressed on. I used the check table function in mysql to verify that the events table was indeed corrupt, then the repair table command to fix the problem. Wouldn't you know it - about 20 seconds after the repair table command finished, BLAM! Another kernel panic.
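For the record, the two mysql commands were basically just these - I'm assuming the default zm database name and Events table here, so adjust for your own install:

mysql> USE zm;
mysql> CHECK TABLE Events;    -- confirmed the table was corrupt
mysql> REPAIR TABLE Events;   -- rebuilt it in a few seconds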

This pointed me in the right direction. I booted the machine and was able to log in and remove the execute bit from the zoneminder startup script before the machine panicked again. I then rebooted, mounted the RAID array as read-only, and ran an fsck on the file system. Lo and behold, it failed and said block number blahblahblah was bad. I unmounted the RAID, then ran fdisk on the drives. By doing this I was able to determine which drive had the problem.
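Roughly the sequence, in case it helps anyone else - the device names here are just placeholders, so substitute your own array and member drives:

mount -o ro /dev/md0 /video   # mount the array read-only
fsck -n /dev/md0              # read-only check; this is where the bad block showed up
umount /video
fdisk -l /dev/hde             # poking at each member drive showed which one was sick
fdisk -l /dev/hdg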

After replacing the drive, rebuilding the RAID, reconstructing the file system, restoring the setup that I'd done initially in zoneminder, and fixing all the permission problems, the problem appears to be gone.

I'm hypothesizing that slocate, running in conjunction with zoneminder, stressed the drive enough while building its index that the underlying problem with the drive manifested itself as a kernel panic. Thank you to everybody who contributed here - I appreciate the help. :)


As a side note, does anybody know how to copy a directory tree while preserving the permissions of the items therein? I did a cp -R /video /home/Setup just after I'd gotten everything set up and running (recording events to /video and using that as the apache root path as well), and this allowed me to quickly restore the /video file system after I rebuilt the RAID, but all the files ended up being owned by root:root since that was the ID of the user who copied them. I had to go through and manually fix all the permissions before zoneminder could successfully record video again.
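Looking at the cp man page afterwards, it appears the -a / --archive flag would have preserved ownership and permissions - something like the line below - though I haven't gone back and tested it:

cp -a /video /home/Setup   # -a keeps ownership, permissions and timestamps intact

If somebody knows a better way though, I'm still all ears.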



SyRenity - I have 16 cameras running at 320x240 resolution at about 2.5 fps. They record 24x7 in 5-minute increments with no motion detection, and a terabyte can hold approximately 14-18 days of footage. The box has an Athlon Barton 2500+ in it with a gig of dual-channel DDR RAM. I also have an 80 gig drive acting as the boot drive, with everything but /video on it. The PSU is either a 300 or 350 watt - I honestly can't remember at the moment.

The 80 gig boot drive is the primary master; I've got a DVD+-RW as the secondary master. The four 250 gig drives are all master devices on 2 separate Promise UltraATA TX2 PCI controllers. I have them set up in a striped (RAID-0) configuration, so each drive physically writes 1/4 of the data recorded to the file system. I do lose the data on the entire file system if any one of the drives fails, though; add another drive to make it RAID-5 and any single drive could fail with all the data remaining intact. Under RAID-5, the computer would even rebuild the data on the failed drive when it's replaced - very cool indeed. (I don't have room in the case for more drives though - one of the four is wedged into a floppy drive bay! ;) ) Seagate has a 400-gig model that I'm most likely going to get for any new systems, and their tech support just got back to me - 500 gig drives should be out September-October of this year! :)
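For what it's worth, if I ever find room for a fifth drive, building the array as RAID-5 with Linux software RAID would look roughly like this - purely a sketch with made-up device names, and assuming md/mdadm rather than a hardware controller:

mdadm --create /dev/md0 --level=5 --raid-devices=5 /dev/hde1 /dev/hdg1 /dev/hdi1 /dev/hdk1 /dev/hdm1
mkfs.ext3 /dev/md0   # then a filesystem on top, mounted as /video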

Posted: Fri Jul 15, 2005 7:08 pm
by cordel
Very cool, I am really glad you got that sorted out. The last thing I would have even mentioned would be a bad/corrupt sector, and I've only ever seen a kernel panic once from that, but I have seen many databases take a dive. Good show.
Regards,
Cordel

Posted: Mon Jul 18, 2005 7:03 pm
by Ruler
I found this amusing. The databases became corrupt because of all the kernel panics. I honestly had no idea how to repair a corrupt database and was pissing and moaning about it royally. (It actually turned out to be as simple as typing 2 commands in mysql and waiting a few seconds, but I didn't know that beforehand.) It was only because of this problem, though, that I figured out what the real problem was. (Like you, a bad sector would have been the last thing I'd think of.)

So I'm actually very lucky that the databases became corrupt. ;) :D

Posted: Mon Jul 18, 2005 9:53 pm
by zoneminder
I'm glad you got it sorted out. I've had a couple of (non-ZM) machines go down because of bad disks; luckily they were RAIDed so nothing was lost. I've got a box at work which is flashing up DMA/hd errors, so I'm just waiting for that one to drop as well.

They seem to come in bunches, though. I've never lost a home machine, but of about four machines I supplied to someone, at least three have had disk failures - all disks from the same batch. Very expensive to replace!!

Phil

what kind

Posted: Tue Nov 22, 2005 8:39 am
by brenden
What kind of drive was it that failed?

We have had _so_ many hard drive problems that, sadly and conservatively, we've probably lost man-decades to those bastards (drive makers collectively).

Posted: Tue Nov 22, 2005 9:40 am
by jameswilson
Just to start a poll on hard disks: the biggest problem I have had with drives is Maxtor. I use WD now for all my PCs.

Posted: Tue Nov 22, 2005 7:16 pm
by Ruler
The drive that failed was a WD RAID edition. I've also had a significant percentage of PC drives fail from WD. I'm now using Seagate for the zoneminder servers and so far, so good. :)

Posted: Tue Nov 22, 2005 7:25 pm
by jameswilson
Oh, wonderful news.
Thank you very much.
lol, I was converted to WD by my Raptors, but I used to love IBM drives before they sold their granny.

Posted: Wed Nov 23, 2005 10:20 am
by zoneminder
I've also had issues with Maxtors, as have some of my colleagues. However, I've also had problems with other components causing disk failures. I have a couple of Shuttle boxes with Maxtor disks in them which run 24x7, or are switched on and off every day, and have no trouble. I also sold a client 3 or 4 of the same kind of systems a few months later, and they have all had disks fail, sometimes repeatedly. I suspect that in these cases there may be an issue with the disk controllers or something else in the pipeline.

Phil