Page 1 of 1

Strange server crash

Posted: Fri Oct 07, 2011 8:08 pm
by Kleine
Hello,

I hope someone will help me to start debugging this strange behaviour. The situation is the following:
I have a new Dell poweredge T110 server with Intel Xeon X3430 (2.40GHz, 4C, 8M Cache) with 2 * 4GB memory.
Coupled to a QNap network storage to store the events and wich has the mysql database running on 1 networkcard in our first VLAN (through a gigabit switch).
His second networkcord is connected to the second VLAN wich contains 3 foscam camera's.
The server only runs the apache webserver and zoneminder v1.24.4. And as operating system it has Fedora 2.6.38.8-32.fc15.x86_64

Most of the days the server runs normally. Except the occasionally bufferoverrun it is fairly stable. (the system is in use for 1 month now) But once a week it stops working. So badly that I can't connect to it by ssh. And I have to hard reboot it.

When I look in the /var/log/messages it suddenly starts to give more messages like: Restarting analysis daemon for CAMx, time since last analysis 15 seconds
And Buffer overrun on all the camera's. (this is often when 2 camera's are configured to run as mocord, at night they wil do modect. Never got the problem at night)

than after a while this occures (there are peaces missing, but I attached my complete /var/log/messages file from yesterday untill today, the day the crash occured again.):

Code: Select all

Oct  7 11:48:14 scecamserver1 kernel: [756365.762188] zmc invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
Oct  7 11:48:14 scecamserver1 kernel: [756365.762192] zmc cpuset=/ mems_allowed=0
Oct  7 11:48:14 scecamserver1 kernel: [756365.762195] Pid: 25041, comm: zmc Not tainted 2.6.38.8-32.fc15.x86_64 #1
Oct  7 11:48:14 scecamserver1 kernel: [756365.762197] Call Trace:
...
Oct  7 11:48:14 scecamserver1 kernel: [756365.762274] Mem-Info:
Oct  7 11:48:14 scecamserver1 kernel: [756365.762275] Node 0 DMA per-cpu:
Oct  7 11:48:14 scecamserver1 kernel: [756365.762277] CPU    0: hi:    0, btch:   1 usd:   0
Oct  7 11:48:14 scecamserver1 kernel: [756365.762279] CPU    1: hi:    0, btch:   1 usd:   0
Oct  7 11:48:14 scecamserver1 kernel: [756365.762281] CPU    2: hi:    0, btch:   1 usd:   0
Oct  7 11:48:14 scecamserver1 kernel: [756365.762283] CPU    3: hi:    0, btch:   1 usd:   0
Oct  7 11:48:14 scecamserver1 kernel: [756365.762284] Node 0 DMA32 per-cpu:
Oct  7 11:48:14 scecamserver1 kernel: [756365.762286] CPU    0: hi:  186, btch:  31 usd:   0
Oct  7 11:48:14 scecamserver1 kernel: [756365.762288] CPU    1: hi:  186, btch:  31 usd:   0
Oct  7 11:48:14 scecamserver1 kernel: [756365.762290] CPU    2: hi:  186, btch:  31 usd:   0
Oct  7 11:48:14 scecamserver1 kernel: [756365.762292] CPU    3: hi:  186, btch:  31 usd:   0
Oct  7 11:48:14 scecamserver1 kernel: [756365.762293] Node 0 Normal per-cpu:
Oct  7 11:48:14 scecamserver1 kernel: [756365.762295] CPU    0: hi:  186, btch:  31 usd:  31
Oct  7 11:48:14 scecamserver1 kernel: [756365.762297] CPU    1: hi:  186, btch:  31 usd:  28
Oct  7 11:48:14 scecamserver1 kernel: [756365.762299] CPU    2: hi:  186, btch:  31 usd: 145
Oct  7 11:48:14 scecamserver1 kernel: [756365.762301] CPU    3: hi:  186, btch:  31 usd:   0
Oct  7 11:48:14 scecamserver1 kernel: [756365.762305] active_anon:29320 inactive_anon:7613 isolated_anon:64
Oct  7 11:48:14 scecamserver1 kernel: [756365.762306]  active_file:8 inactive_file:54 isolated_file:0
...
Oct  7 11:48:14 scecamserver1 kernel: [756365.785646] [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
Oct  7 11:48:14 scecamserver1 kernel: [756365.785654] [  392]     0   392     4285        1   1     -17         -1000 udevd
Oct  7 11:48:14 scecamserver1 kernel: [756365.785657] [  441]     0   441     4304        1   2     -17         -1000 udevd
Oct  7 11:48:14 scecamserver1 kernel: [756365.785660] [  443]     0   443     4304        1   3     -17         -1000 udevd
Oct  7 11:48:14 scecamserver1 kernel: [756365.785664] [  622]    38   622     7629       34   0       0             0 ntpd
Oct  7 11:48:14 scecamserver1 kernel: [756365.785667] [  649]     0   649    29550       25   0       0             0 crond
Oct  7 11:48:14 scecamserver1 kernel: [756365.785670] [  658]    81   658     5364        0   0     -13          -900 dbus-daemon
Oct  7 11:48:14 scecamserver1 kernel: [756365.785673] [  672]     0   672    61261       65   2       0             0 rsyslogd
Oct  7 11:48:14 scecamserver1 kernel: [756365.785676] [ 1087]     0  1087     6408        0   0     -17         -1000 auditd
Oct  7 11:48:14 scecamserver1 kernel: [756365.785679] [ 1116]     0  1116     4754       17   0       0             0 rpcbind
Oct  7 11:48:14 scecamserver1 kernel: [756365.785682] [ 1188]    29  1188     5797        0   1       0             0 rpc.statd
Oct  7 11:48:14 scecamserver1 kernel: [756365.785685] [ 1203]     0  1203    18777        0   1     -17         -1000 sshd
Oct  7 11:48:14 scecamserver1 kernel: [756365.785688] [ 1214]     0  1214     7953        1   1       0             0 rpc.idmapd
Oct  7 11:48:14 scecamserver1 kernel: [756365.785691] [ 1232]     0  1232     1035        0   3       0             0 agetty
Oct  7 11:48:14 scecamserver1 kernel: [756365.785694] [ 1233]     0  1233     1035        0   0       0             0 agetty
Oct  7 11:48:14 scecamserver1 kernel: [756365.785696] [ 1234]     0  1234     1035        0   2       0             0 agetty
Oct  7 11:48:14 scecamserver1 kernel: [756365.785699] [ 1235]     0  1235     1035        0   0       0             0 agetty
Oct  7 11:48:14 scecamserver1 kernel: [756365.785702] [ 1236]     0  1236    83579       12   0       0             0 httpd
Oct  7 11:48:14 scecamserver1 kernel: [756365.785705] [ 1238]     0  1238     1035        0   3       0             0 agetty
Oct  7 11:48:14 scecamserver1 kernel: [756365.785708] [ 1239]     0  1239     1035        0   3       0             0 agetty
Oct  7 11:48:14 scecamserver1 kernel: [756365.785711] [ 1249]     0  1249    21508       40   0       0             0 sendmail
Oct  7 11:48:14 scecamserver1 kernel: [756365.785714] [ 1257]    51  1257    19552        0   0       0             0 sendmail
Oct  7 11:48:14 scecamserver1 kernel: [756365.785717] [17508]   501 17508    83727        1   0       0             0 httpd
Oct  7 11:48:14 scecamserver1 kernel: [756365.785720] [18408]   501 18408    83924        1   0       0             0 httpd
Oct  7 11:48:14 scecamserver1 kernel: [756365.785722] [18418]   501 18418    83915        1   0       0             0 httpd
Oct  7 11:48:14 scecamserver1 kernel: [756365.785725] [18420]   501 18420    83918        1   0       0             0 httpd
Oct  7 11:48:14 scecamserver1 kernel: [756365.785728] [18901]   501 18901    84098        1   0       0             0 httpd
Oct  7 11:48:14 scecamserver1 kernel: [756365.785731] [19051]   501 19051    83959        0   0       0             0 httpd
Oct  7 11:48:14 scecamserver1 kernel: [756365.785733] [19054]   501 19054    83974        0   0       0             0 httpd
Oct  7 11:48:14 scecamserver1 kernel: [756365.785736] [19060]   501 19060    84817        0   0       0             0 httpd
Oct  7 11:48:14 scecamserver1 kernel: [756365.785739] [19062]   501 19062    83909        1   1       0             0 httpd
Oct  7 11:48:14 scecamserver1 kernel: [756365.785742] [19063]   501 19063    83917        0   0       0             0 httpd
Oct  7 11:48:14 scecamserver1 kernel: [756365.785745] [19068]   501 19068    83916        1   1       0             0 httpd
Oct  7 11:48:14 scecamserver1 kernel: [756365.785748] [19071]   501 19071    83955        1   1       0             0 httpd
Oct  7 11:48:14 scecamserver1 kernel: [756365.785750] [19074]   501 19074    83763        1   1       0             0 httpd
Oct  7 11:48:14 scecamserver1 kernel: [756365.785753] [19075]   501 19075    83763        1   0       0             0 httpd
Oct  7 11:48:14 scecamserver1 kernel: [756365.785756] [19076]   501 19076    83974        1   0       0             0 httpd
Oct  7 11:48:14 scecamserver1 kernel: [756365.785759] [19077]   501 19077    84485        1   0       0             0 httpd
Oct  7 11:48:14 scecamserver1 kernel: [756365.785762] [19078]   501 19078    83763        1   0       0             0 httpd
Oct  7 11:48:14 scecamserver1 kernel: [756365.785765] [19083]   501 19083    84485        1   0       0             0 httpd
Oct  7 11:48:14 scecamserver1 kernel: [756365.785767] [19084]   501 19084    84496        0   0       0             0 httpd
Oct  7 11:48:14 scecamserver1 kernel: [756365.785770] [19085]   501 19085    83920        1   0       0             0 httpd
Oct  7 11:48:14 scecamserver1 kernel: [756365.785773] [24388]   501 24388    44437      303   1       0             0 zmdc.pl
Oct  7 11:48:14 scecamserver1 kernel: [756365.785776] [24442]   501 24442    48732      339   1       0             0 zmfilter.pl
Oct  7 11:48:14 scecamserver1 kernel: [756365.785779] [24447]   501 24447    46420     2788   1       0             0 zmaudit.pl
Oct  7 11:48:14 scecamserver1 kernel: [756365.785782] [24452]   501 24452    45447      720   1       0             0 zmwatch.pl
Oct  7 11:48:14 scecamserver1 kernel: [756365.785785] [24457]   501 24457    46041        0   0       0             0 zmupdate.pl
Oct  7 11:48:14 scecamserver1 kernel: [756365.785788] [24973]   501 24973    81866     9778   1       0             0 zmc
Oct  7 11:48:14 scecamserver1 kernel: [756365.785790] [24989]   501 24989    81224     6190   1       0             0 zmc
Oct  7 11:48:14 scecamserver1 kernel: [756365.785793] [25041]   501 25041    82032     6030   0       0             0 zmc
Oct  7 11:48:14 scecamserver1 kernel: [756365.785796] [25059]   501 25059    80430     4221   0       0             0 zma
Oct  7 11:48:14 scecamserver1 kernel: [756365.785799] [25062]   501 25062    44437     2296   1       0             0 zmdc.pl
Oct  7 11:48:14 scecamserver1 kernel: [756365.785801] Out of memory: Kill process 24973 (zmc) score 2 or sacrifice child
Oct  7 11:48:14 scecamserver1 kernel: [756365.785819] Killed process 24973 (zmc) total-vm:327464kB, anon-rss:5408kB, file-rss:33704kB
...
Oct  7 11:48:52 scecamserver1 zma_m2[25067]: INF [In mode 3/1, warming up]

Sortlike messages occure every few minutes.

2 hours later the log suddenly stops. Below the last few rules in the log file:

Code: Select all

Oct  7 13:32:15 scecamserver1 zmaudit[24447]: INF [Filesystem event '4/26832' does not exist in database]
Oct  7 13:32:15 scecamserver1 zmaudit[24447]: INF [deleting]
Oct  7 13:32:16 scecamserver1 zmaudit[24447]: INF [Filesystem event '3/26833' does not exist in database]
Oct  7 13:32:16 scecamserver1 zmaudit[24447]: INF [deleting]
Oct  7 13:32:23 scecamserver1 zmdc[24388]: INF ['zmc -m 4' stopping at 11/10/07 13:32:22]
Oct  7 13:32:25 scecamserver1 zmc_m4[25356]: INF [Got signal 15 (Terminated), exiting]
Oct  7 13:32:29 scecamserver1 zmdc[24388]: INF ['zmc -m 4' crashed, signal 8]
Oct  7 13:32:29 scecamserver1 zmdc[24388]: INF [Starting pending process, zmc -m 4]
Oct  7 13:32:29 scecamserver1 zmdc[25409]: INF ['zmc -m 4' started at 11/10/07 13:32:29]
Oct  7 13:32:29 scecamserver1 zmdc[24388]: INF ['zmc -m 4' starting at 11/10/07 13:32:29, pid = 25409]
Oct  7 13:32:34 scecamserver1 zmc_m4[25409]: INF [Debug Level = 0, Debug Log = <none>]
Oct  7 13:33:02 scecamserver1 zmc_m4[25409]: INF [Starting Capture]
Oct  7 13:33:54 scecamserver1 zmc_m2[25381]: INF [SCECAM2: 1000 - Capturing at 3.00 fps]
Oct  7 13:34:01 scecamserver1 zmc_m3[25251]: INF [SCECAM3: 16000 - Capturing at 4.85 fps]
Oct  7 13:35:57 scecamserver1 zmc_m3[25251]: INF [SCECAM3: 17000 - Capturing at 8.70 fps]
Oct  7 13:36:43 scecamserver1 zmc_m2[25381]: INF [SCECAM2: 2000 - Capturing at 5.92 fps]
Oct  7 13:37:29 scecamserver1 zmc_m4[25409]: INF [SCECAM4: 1000 - Capturing at 3.75 fps]
Oct  7 13:38:09 scecamserver1 zmc_m3[25251]: INF [SCECAM3: 18000 - Capturing at 7.52 fps]

My linux knowledge is that of a somewath experienced beginner, so now I was wondering if there is someone who could help me to get a startingpoint of where I'm supposed to start looking for the problem. I'm am reading this forum for days now. Searching and reading a lot of posts. But until now with no results. (this crash is todays crash) the only thing I just noticed is that in the pid list of the second error, there are 20+ httpd processes running. Now after the reboot there are 8 httpd processes. I have the idea that it's a memory problem. The camera's are capturing at 15 frames per second 640x480. 24 bit color.

But I can't understand that my 3 camera's are causing problems when there are people with more than 10 camera's (and more) Ok, I need a good buffer because I write images over the network to the network storage. So I set my ringbuffer to 75 frames. Pre event buffer is 10, post event buffer 10.

Ok, I hope that I gave good information. Sorry, my english is not very good. If you need more or different information, just ask. I will be very hapy that someone tries to help me.


With kind regards,

Kleine

Oh yeah, it was a very rainy day today. And there is 1 outdoor camera and its not ideally configured (SCECam2). It could be usefull information. Thanks.

Re: Strange server crash

Posted: Sat Oct 08, 2011 7:59 am
by Flasheart
Hi

Oct 7 11:48:14 scecamserver1 kernel: [756365.762188] zmc invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
That tells the tale. Your computer has run out of memory.

Now, there is are a few things to bear in mind with this:

1, It rarely means your computer doesnt have enough ram. You have plenty.
2, The process that is killed is irrelevant. zmc was killed here, but I'll bet it's not the misbehaving app.
3, It can be an utter pig to find out which process is causing the problem.
4, Most likely, your core OS, hardware, network etc are all fine. This is a software problem.

Linux's oomkiller is invoked by the kernel when a process cannot successfully allocate free memory. oomkiller has some logic to decide which process it wants to kill, and will chomp away until it frees enough ram, or until it eats something critical and locks your machine. Google oomkiller, I think there's a wiki that explains it fairly well too.

One of the apps running on this server, or is scheduled to run before the crash, has a memory leak. It's growing and eating all the ram until oomkiller is called into play. By that time it's usually game over. oomkiller is very much a last ditch attempt and only rarely manages to kill the right app in my experience.

How to solve? Monitor things - munin is a good app to run for overall memory usage. Check your swap settings. Then it's a slow and painful task of disabling apps until things stop crashing. (You can crontab memory usage logs for apps, but sometimes it happens so quickly that'll miss it).

I had exactly this issue on a virtual machine which persisted across hosts and even an os rebuild. Took me nigh 8 months of crashes to locate the issue as a cronned webstats script.

Good luck!

Re: Strange server crash

Posted: Sat Oct 08, 2011 8:45 am
by Kleine
Hi,

Thanks for this information. You confirmed what I was already fearing. The server is a clean install without a userinterface. I only installed the packages used by Zoneminder, Zoneminder itself and Apache.

I will try to install a monitoring application such as Munin to look at the problem. Hoping to tackle the problem faster than 8 month's

With kind regards,

Kleine