Strange server crash
Posted: Fri Oct 07, 2011 8:08 pm
Hello,
I hope someone will help me to start debugging this strange behaviour. The situation is the following:
I have a new Dell poweredge T110 server with Intel Xeon X3430 (2.40GHz, 4C, 8M Cache) with 2 * 4GB memory.
Coupled to a QNap network storage to store the events and wich has the mysql database running on 1 networkcard in our first VLAN (through a gigabit switch).
His second networkcord is connected to the second VLAN wich contains 3 foscam camera's.
The server only runs the apache webserver and zoneminder v1.24.4. And as operating system it has Fedora 2.6.38.8-32.fc15.x86_64
Most of the days the server runs normally. Except the occasionally bufferoverrun it is fairly stable. (the system is in use for 1 month now) But once a week it stops working. So badly that I can't connect to it by ssh. And I have to hard reboot it.
When I look in the /var/log/messages it suddenly starts to give more messages like: Restarting analysis daemon for CAMx, time since last analysis 15 seconds
And Buffer overrun on all the camera's. (this is often when 2 camera's are configured to run as mocord, at night they wil do modect. Never got the problem at night)
than after a while this occures (there are peaces missing, but I attached my complete /var/log/messages file from yesterday untill today, the day the crash occured again.):
Sortlike messages occure every few minutes.
2 hours later the log suddenly stops. Below the last few rules in the log file:
My linux knowledge is that of a somewath experienced beginner, so now I was wondering if there is someone who could help me to get a startingpoint of where I'm supposed to start looking for the problem. I'm am reading this forum for days now. Searching and reading a lot of posts. But until now with no results. (this crash is todays crash) the only thing I just noticed is that in the pid list of the second error, there are 20+ httpd processes running. Now after the reboot there are 8 httpd processes. I have the idea that it's a memory problem. The camera's are capturing at 15 frames per second 640x480. 24 bit color.
But I can't understand that my 3 camera's are causing problems when there are people with more than 10 camera's (and more) Ok, I need a good buffer because I write images over the network to the network storage. So I set my ringbuffer to 75 frames. Pre event buffer is 10, post event buffer 10.
Ok, I hope that I gave good information. Sorry, my english is not very good. If you need more or different information, just ask. I will be very hapy that someone tries to help me.
With kind regards,
Kleine
Oh yeah, it was a very rainy day today. And there is 1 outdoor camera and its not ideally configured (SCECam2). It could be usefull information. Thanks.
I hope someone will help me to start debugging this strange behaviour. The situation is the following:
I have a new Dell poweredge T110 server with Intel Xeon X3430 (2.40GHz, 4C, 8M Cache) with 2 * 4GB memory.
Coupled to a QNap network storage to store the events and wich has the mysql database running on 1 networkcard in our first VLAN (through a gigabit switch).
His second networkcord is connected to the second VLAN wich contains 3 foscam camera's.
The server only runs the apache webserver and zoneminder v1.24.4. And as operating system it has Fedora 2.6.38.8-32.fc15.x86_64
Most of the days the server runs normally. Except the occasionally bufferoverrun it is fairly stable. (the system is in use for 1 month now) But once a week it stops working. So badly that I can't connect to it by ssh. And I have to hard reboot it.
When I look in the /var/log/messages it suddenly starts to give more messages like: Restarting analysis daemon for CAMx, time since last analysis 15 seconds
And Buffer overrun on all the camera's. (this is often when 2 camera's are configured to run as mocord, at night they wil do modect. Never got the problem at night)
than after a while this occures (there are peaces missing, but I attached my complete /var/log/messages file from yesterday untill today, the day the crash occured again.):
Code: Select all
Oct 7 11:48:14 scecamserver1 kernel: [756365.762188] zmc invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
Oct 7 11:48:14 scecamserver1 kernel: [756365.762192] zmc cpuset=/ mems_allowed=0
Oct 7 11:48:14 scecamserver1 kernel: [756365.762195] Pid: 25041, comm: zmc Not tainted 2.6.38.8-32.fc15.x86_64 #1
Oct 7 11:48:14 scecamserver1 kernel: [756365.762197] Call Trace:
...
Oct 7 11:48:14 scecamserver1 kernel: [756365.762274] Mem-Info:
Oct 7 11:48:14 scecamserver1 kernel: [756365.762275] Node 0 DMA per-cpu:
Oct 7 11:48:14 scecamserver1 kernel: [756365.762277] CPU 0: hi: 0, btch: 1 usd: 0
Oct 7 11:48:14 scecamserver1 kernel: [756365.762279] CPU 1: hi: 0, btch: 1 usd: 0
Oct 7 11:48:14 scecamserver1 kernel: [756365.762281] CPU 2: hi: 0, btch: 1 usd: 0
Oct 7 11:48:14 scecamserver1 kernel: [756365.762283] CPU 3: hi: 0, btch: 1 usd: 0
Oct 7 11:48:14 scecamserver1 kernel: [756365.762284] Node 0 DMA32 per-cpu:
Oct 7 11:48:14 scecamserver1 kernel: [756365.762286] CPU 0: hi: 186, btch: 31 usd: 0
Oct 7 11:48:14 scecamserver1 kernel: [756365.762288] CPU 1: hi: 186, btch: 31 usd: 0
Oct 7 11:48:14 scecamserver1 kernel: [756365.762290] CPU 2: hi: 186, btch: 31 usd: 0
Oct 7 11:48:14 scecamserver1 kernel: [756365.762292] CPU 3: hi: 186, btch: 31 usd: 0
Oct 7 11:48:14 scecamserver1 kernel: [756365.762293] Node 0 Normal per-cpu:
Oct 7 11:48:14 scecamserver1 kernel: [756365.762295] CPU 0: hi: 186, btch: 31 usd: 31
Oct 7 11:48:14 scecamserver1 kernel: [756365.762297] CPU 1: hi: 186, btch: 31 usd: 28
Oct 7 11:48:14 scecamserver1 kernel: [756365.762299] CPU 2: hi: 186, btch: 31 usd: 145
Oct 7 11:48:14 scecamserver1 kernel: [756365.762301] CPU 3: hi: 186, btch: 31 usd: 0
Oct 7 11:48:14 scecamserver1 kernel: [756365.762305] active_anon:29320 inactive_anon:7613 isolated_anon:64
Oct 7 11:48:14 scecamserver1 kernel: [756365.762306] active_file:8 inactive_file:54 isolated_file:0
...
Oct 7 11:48:14 scecamserver1 kernel: [756365.785646] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
Oct 7 11:48:14 scecamserver1 kernel: [756365.785654] [ 392] 0 392 4285 1 1 -17 -1000 udevd
Oct 7 11:48:14 scecamserver1 kernel: [756365.785657] [ 441] 0 441 4304 1 2 -17 -1000 udevd
Oct 7 11:48:14 scecamserver1 kernel: [756365.785660] [ 443] 0 443 4304 1 3 -17 -1000 udevd
Oct 7 11:48:14 scecamserver1 kernel: [756365.785664] [ 622] 38 622 7629 34 0 0 0 ntpd
Oct 7 11:48:14 scecamserver1 kernel: [756365.785667] [ 649] 0 649 29550 25 0 0 0 crond
Oct 7 11:48:14 scecamserver1 kernel: [756365.785670] [ 658] 81 658 5364 0 0 -13 -900 dbus-daemon
Oct 7 11:48:14 scecamserver1 kernel: [756365.785673] [ 672] 0 672 61261 65 2 0 0 rsyslogd
Oct 7 11:48:14 scecamserver1 kernel: [756365.785676] [ 1087] 0 1087 6408 0 0 -17 -1000 auditd
Oct 7 11:48:14 scecamserver1 kernel: [756365.785679] [ 1116] 0 1116 4754 17 0 0 0 rpcbind
Oct 7 11:48:14 scecamserver1 kernel: [756365.785682] [ 1188] 29 1188 5797 0 1 0 0 rpc.statd
Oct 7 11:48:14 scecamserver1 kernel: [756365.785685] [ 1203] 0 1203 18777 0 1 -17 -1000 sshd
Oct 7 11:48:14 scecamserver1 kernel: [756365.785688] [ 1214] 0 1214 7953 1 1 0 0 rpc.idmapd
Oct 7 11:48:14 scecamserver1 kernel: [756365.785691] [ 1232] 0 1232 1035 0 3 0 0 agetty
Oct 7 11:48:14 scecamserver1 kernel: [756365.785694] [ 1233] 0 1233 1035 0 0 0 0 agetty
Oct 7 11:48:14 scecamserver1 kernel: [756365.785696] [ 1234] 0 1234 1035 0 2 0 0 agetty
Oct 7 11:48:14 scecamserver1 kernel: [756365.785699] [ 1235] 0 1235 1035 0 0 0 0 agetty
Oct 7 11:48:14 scecamserver1 kernel: [756365.785702] [ 1236] 0 1236 83579 12 0 0 0 httpd
Oct 7 11:48:14 scecamserver1 kernel: [756365.785705] [ 1238] 0 1238 1035 0 3 0 0 agetty
Oct 7 11:48:14 scecamserver1 kernel: [756365.785708] [ 1239] 0 1239 1035 0 3 0 0 agetty
Oct 7 11:48:14 scecamserver1 kernel: [756365.785711] [ 1249] 0 1249 21508 40 0 0 0 sendmail
Oct 7 11:48:14 scecamserver1 kernel: [756365.785714] [ 1257] 51 1257 19552 0 0 0 0 sendmail
Oct 7 11:48:14 scecamserver1 kernel: [756365.785717] [17508] 501 17508 83727 1 0 0 0 httpd
Oct 7 11:48:14 scecamserver1 kernel: [756365.785720] [18408] 501 18408 83924 1 0 0 0 httpd
Oct 7 11:48:14 scecamserver1 kernel: [756365.785722] [18418] 501 18418 83915 1 0 0 0 httpd
Oct 7 11:48:14 scecamserver1 kernel: [756365.785725] [18420] 501 18420 83918 1 0 0 0 httpd
Oct 7 11:48:14 scecamserver1 kernel: [756365.785728] [18901] 501 18901 84098 1 0 0 0 httpd
Oct 7 11:48:14 scecamserver1 kernel: [756365.785731] [19051] 501 19051 83959 0 0 0 0 httpd
Oct 7 11:48:14 scecamserver1 kernel: [756365.785733] [19054] 501 19054 83974 0 0 0 0 httpd
Oct 7 11:48:14 scecamserver1 kernel: [756365.785736] [19060] 501 19060 84817 0 0 0 0 httpd
Oct 7 11:48:14 scecamserver1 kernel: [756365.785739] [19062] 501 19062 83909 1 1 0 0 httpd
Oct 7 11:48:14 scecamserver1 kernel: [756365.785742] [19063] 501 19063 83917 0 0 0 0 httpd
Oct 7 11:48:14 scecamserver1 kernel: [756365.785745] [19068] 501 19068 83916 1 1 0 0 httpd
Oct 7 11:48:14 scecamserver1 kernel: [756365.785748] [19071] 501 19071 83955 1 1 0 0 httpd
Oct 7 11:48:14 scecamserver1 kernel: [756365.785750] [19074] 501 19074 83763 1 1 0 0 httpd
Oct 7 11:48:14 scecamserver1 kernel: [756365.785753] [19075] 501 19075 83763 1 0 0 0 httpd
Oct 7 11:48:14 scecamserver1 kernel: [756365.785756] [19076] 501 19076 83974 1 0 0 0 httpd
Oct 7 11:48:14 scecamserver1 kernel: [756365.785759] [19077] 501 19077 84485 1 0 0 0 httpd
Oct 7 11:48:14 scecamserver1 kernel: [756365.785762] [19078] 501 19078 83763 1 0 0 0 httpd
Oct 7 11:48:14 scecamserver1 kernel: [756365.785765] [19083] 501 19083 84485 1 0 0 0 httpd
Oct 7 11:48:14 scecamserver1 kernel: [756365.785767] [19084] 501 19084 84496 0 0 0 0 httpd
Oct 7 11:48:14 scecamserver1 kernel: [756365.785770] [19085] 501 19085 83920 1 0 0 0 httpd
Oct 7 11:48:14 scecamserver1 kernel: [756365.785773] [24388] 501 24388 44437 303 1 0 0 zmdc.pl
Oct 7 11:48:14 scecamserver1 kernel: [756365.785776] [24442] 501 24442 48732 339 1 0 0 zmfilter.pl
Oct 7 11:48:14 scecamserver1 kernel: [756365.785779] [24447] 501 24447 46420 2788 1 0 0 zmaudit.pl
Oct 7 11:48:14 scecamserver1 kernel: [756365.785782] [24452] 501 24452 45447 720 1 0 0 zmwatch.pl
Oct 7 11:48:14 scecamserver1 kernel: [756365.785785] [24457] 501 24457 46041 0 0 0 0 zmupdate.pl
Oct 7 11:48:14 scecamserver1 kernel: [756365.785788] [24973] 501 24973 81866 9778 1 0 0 zmc
Oct 7 11:48:14 scecamserver1 kernel: [756365.785790] [24989] 501 24989 81224 6190 1 0 0 zmc
Oct 7 11:48:14 scecamserver1 kernel: [756365.785793] [25041] 501 25041 82032 6030 0 0 0 zmc
Oct 7 11:48:14 scecamserver1 kernel: [756365.785796] [25059] 501 25059 80430 4221 0 0 0 zma
Oct 7 11:48:14 scecamserver1 kernel: [756365.785799] [25062] 501 25062 44437 2296 1 0 0 zmdc.pl
Oct 7 11:48:14 scecamserver1 kernel: [756365.785801] Out of memory: Kill process 24973 (zmc) score 2 or sacrifice child
Oct 7 11:48:14 scecamserver1 kernel: [756365.785819] Killed process 24973 (zmc) total-vm:327464kB, anon-rss:5408kB, file-rss:33704kB
...
Oct 7 11:48:52 scecamserver1 zma_m2[25067]: INF [In mode 3/1, warming up]
2 hours later the log suddenly stops. Below the last few rules in the log file:
Code: Select all
Oct 7 13:32:15 scecamserver1 zmaudit[24447]: INF [Filesystem event '4/26832' does not exist in database]
Oct 7 13:32:15 scecamserver1 zmaudit[24447]: INF [deleting]
Oct 7 13:32:16 scecamserver1 zmaudit[24447]: INF [Filesystem event '3/26833' does not exist in database]
Oct 7 13:32:16 scecamserver1 zmaudit[24447]: INF [deleting]
Oct 7 13:32:23 scecamserver1 zmdc[24388]: INF ['zmc -m 4' stopping at 11/10/07 13:32:22]
Oct 7 13:32:25 scecamserver1 zmc_m4[25356]: INF [Got signal 15 (Terminated), exiting]
Oct 7 13:32:29 scecamserver1 zmdc[24388]: INF ['zmc -m 4' crashed, signal 8]
Oct 7 13:32:29 scecamserver1 zmdc[24388]: INF [Starting pending process, zmc -m 4]
Oct 7 13:32:29 scecamserver1 zmdc[25409]: INF ['zmc -m 4' started at 11/10/07 13:32:29]
Oct 7 13:32:29 scecamserver1 zmdc[24388]: INF ['zmc -m 4' starting at 11/10/07 13:32:29, pid = 25409]
Oct 7 13:32:34 scecamserver1 zmc_m4[25409]: INF [Debug Level = 0, Debug Log = <none>]
Oct 7 13:33:02 scecamserver1 zmc_m4[25409]: INF [Starting Capture]
Oct 7 13:33:54 scecamserver1 zmc_m2[25381]: INF [SCECAM2: 1000 - Capturing at 3.00 fps]
Oct 7 13:34:01 scecamserver1 zmc_m3[25251]: INF [SCECAM3: 16000 - Capturing at 4.85 fps]
Oct 7 13:35:57 scecamserver1 zmc_m3[25251]: INF [SCECAM3: 17000 - Capturing at 8.70 fps]
Oct 7 13:36:43 scecamserver1 zmc_m2[25381]: INF [SCECAM2: 2000 - Capturing at 5.92 fps]
Oct 7 13:37:29 scecamserver1 zmc_m4[25409]: INF [SCECAM4: 1000 - Capturing at 3.75 fps]
Oct 7 13:38:09 scecamserver1 zmc_m3[25251]: INF [SCECAM3: 18000 - Capturing at 7.52 fps]
My linux knowledge is that of a somewath experienced beginner, so now I was wondering if there is someone who could help me to get a startingpoint of where I'm supposed to start looking for the problem. I'm am reading this forum for days now. Searching and reading a lot of posts. But until now with no results. (this crash is todays crash) the only thing I just noticed is that in the pid list of the second error, there are 20+ httpd processes running. Now after the reboot there are 8 httpd processes. I have the idea that it's a memory problem. The camera's are capturing at 15 frames per second 640x480. 24 bit color.
But I can't understand that my 3 camera's are causing problems when there are people with more than 10 camera's (and more) Ok, I need a good buffer because I write images over the network to the network storage. So I set my ringbuffer to 75 frames. Pre event buffer is 10, post event buffer 10.
Ok, I hope that I gave good information. Sorry, my english is not very good. If you need more or different information, just ask. I will be very hapy that someone tries to help me.
With kind regards,
Kleine
Oh yeah, it was a very rainy day today. And there is 1 outdoor camera and its not ideally configured (SCECam2). It could be usefull information. Thanks.