[Solved] Zoneminder constantly crashing entire system

Forum for questions and support relating to the 1.29.x releases only.
XaFFaX
Posts: 18
Joined: Wed Aug 17, 2016 7:00 am

[Solved] Zoneminder constantly crashing entire system

Post by XaFFaX »

Hi all,

I am posting to this forum as a last resort. I think I have googled the entire internet in search of a cause for the issues I have with Zoneminder. I will try to describe them as fully and accurately as possible; if more information is needed, please let me know and I will post it.

The problem:
When running Zoneminder, every few hours (on average; sometimes more, sometimes less) I get either a kernel panic or some other total crash that is not even recorded in the system logs. This only happens when Zoneminder is running.

My setup:

Mainboard: MSI H81M-P33
CPU: Celeron G1820T
RAM: 2x2GB
Drive: WD Red Series
System: Arch Linux (Antergos)
Zoneminder: AUR version, most recent (1.29.0-2)
Camera: one HD camera (Ubiquiti airCam), in "modect" mode, capturing 1280x720 @ 20fps (this probably does not matter though)

How installed:
I followed the Arch Linux guide on installing Zoneminder from AUR.

More detailed view on the problem:
The issue is difficult for me to track down, because I usually do not get logs/traces/dumps when the system hangs. A lot seems to point to MySQL issues, but I have not been able to find anything on that or to reproduce it. MySQL has segfaulted a few times, and so has zmdc.pl. I have also seen the CPU being stuck for 20+ seconds, low memory corruption, and various other things. Reading this, you might think it is a hardware issue, but I do not think so; please continue reading.

What I have done to try to find/fix the problem:
Apart from googling the entire internet :) I did the following checks:
- ran Memtest for about 10 passes - no errors,
- ran mprime for 24+ hours (probably more like 30), with 100% CPU load and a lot of memory used - no errors,
- changed the memory mapping to get rid of the "Low memory corruption" issues I had, following this forum post: https://bbs.archlinux.org/viewtopic.php?id=189483 - no difference, still crashing; my Arch now has an updated memory map and does not use low memory (I even reserved the entire 640k to be sure),
- updated the CPU microcode using this guide: https://wiki.archlinux.org/index.php/Microcode and regenerated the Grub config - again, no change,
- changed some MySQL buffering and memory settings - no change,
- checked the disk drive with fsck for filesystem and surface problems - no change (there were some filesystem errors caused by the system hangs, but they were fixed),
- checked the MySQL DB for errors - none found,
- changed Zoneminder monitor settings (buffers, sources etc.) - no change,
- changed "nph-zms" to "zms" just for the sake of checking; I did not expect this to change anything, and it did not (when it works, it works fine with both though).

I think this is most of what I did. Without Zoneminder, the machine runs with uptimes of weeks without a problem, so I expect this to be a Zoneminder-related problem; I may be wrong though.

Some observations/comments:
- this is not a free disk space issue; according to Zoneminder I have 4% disk used,
- this is not an insufficient memory issue either; /dev/shm has 1.9G and Zoneminder uses up to 700MB from what I have seen,
- I have noticed that the Zoneminder log contains a lot (and I mean A LOT) of entries like this:

Code:

08/16/16 20:53:57.336369 zmf[936].DB1-zmf.cpp/244 [Select timed out]
This is logged every second or so. I have been trying to find out what it actually means, but found nothing useful (one of the things I found was the part of the source code that generates it, but that does not help me, since I do not know why the "if" triggers). Other than that Zoneminder works fine: it records events, live view works, and downloading video files from events works (when it works of course...), so access to the database seems to be ok.
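For anyone wanting to see how much of a log is taken up by one repeated message, here is a minimal Python sketch that counts entries by process, level and message. It assumes every line follows the format of the entry quoted above; the function name summarise and the regular expression are made up for illustration, and in the sample only the first line comes from the actual log (the second is a copy with an altered timestamp, the third is deliberately malformed).

```python
import re
from collections import Counter

# Pattern modelled on the single log line quoted in this post (an assumption:
# other ZoneMinder versions or log targets may format lines differently).
LINE_RE = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<proc>\w+)\[(?P<pid>\d+)\]\.(?P<level>\w+)-"
    r"(?P<src>[\w.]+)/(?P<line>\d+) \[(?P<msg>.*)\]$"
)

def summarise(lines):
    """Count log entries grouped by (process, level, message)."""
    counts = Counter()
    for raw in lines:
        m = LINE_RE.match(raw.strip())
        if m:  # lines that do not fit the pattern are skipped
            counts[(m["proc"], m["level"], m["msg"])] += 1
    return counts

sample = [
    "08/16/16 20:53:57.336369 zmf[936].DB1-zmf.cpp/244 [Select timed out]",
    "08/16/16 20:53:58.336369 zmf[936].DB1-zmf.cpp/244 [Select timed out]",  # invented copy
    "this line does not match the expected format",
]

counts = summarise(sample)
for key, n in counts.most_common():
    print(n, key)
```

To run it on a real file, pass an open file object instead of the sample list, for example summarise(open(path)) with path pointing at one of the files under /var/log/zoneminder.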

Logs:
I am posting a log from /var/log/zoneminder. The structure there is very strange to me (many different files with seemingly random numbers at the end), but I found the most recent and largest one, with all those "Select timed out" errors/warnings. I have trimmed it, because the log is > 2MB and pastebin does not allow more than 512kb. There is nothing apart from this message in the part I removed.

http://pastebin.com/UJjiVqRK

I have kept the last log entry though; it is cut off in the log exactly as you see it pasted. It seems the machine hung in the middle of it being written.

Please let me know what other logs would be helpful if any.

Any help is appreciated, the application is very good, I just would like to have it running...

Regards,
XaFFaX
SteveGilvarry
Posts: 494
Joined: Sun Jun 29, 2014 1:12 pm
Location: Melbourne, AU

Re: Zoneminder constantly crashing entire system

Post by SteveGilvarry »

Do you have the OPT_FRAME_SERVER option turned on under System? If so, turn it off, as it won't be helping.
Production Zoneminder 1.37.x (Living dangerously)
Random Selection of Cameras (Dahua and Hikvision)
XaFFaX
Posts: 18
Joined: Wed Aug 17, 2016 7:00 am

Re: Zoneminder constantly crashing entire system

Post by XaFFaX »

Thanks, I have started Zoneminder, disabled the option and restarted. I will keep it running, maybe it will help...

On another note, does anyone know when this "Select timed out" is generated? The only thing I remember about it was related to the HTTP_TIMEOUT setting in Zoneminder. I have increased this value, but I still get timeouts. Besides, it does not seem like an HTTP issue, more like a database problem. Nonetheless, if anyone knows what triggers it, that may be helpful. Logging the same line every second is not a good thing...
SteveGilvarry
Posts: 494
Joined: Sun Jun 29, 2014 1:12 pm
Location: Melbourne, AU

Re: Zoneminder constantly crashing entire system

Post by SteveGilvarry »

Sorry I should have said zmf was generating it. May be hiding a different issue that we will now see. I suspect zmf will be deprecated at some point.
https://github.com/ZoneMinder/ZoneMinde ... f.cpp#L244
SteveGilvarry
Posts: 494
Joined: Sun Jun 29, 2014 1:12 pm
Location: Melbourne, AU

Re: Zoneminder constantly crashing entire system

Post by SteveGilvarry »

Save some power and lower fps, analysing 20fps vs 5fps won't make much difference to detection, but massive difference in load.
XaFFaX
Posts: 18
Joined: Wed Aug 17, 2016 7:00 am

Re: Zoneminder constantly crashing entire system

Post by XaFFaX »

This is the part of the code I have seen. The problem is I do not know what this "select" is supposed to select in the DB, and why it does not find it. Anyway, I have started Zoneminder again with the suggested change, and I will check the logs to see if there is any difference.
XaFFaX
Posts: 18
Joined: Wed Aug 17, 2016 7:00 am

Re: Zoneminder constantly crashing entire system

Post by XaFFaX »

SteveGilvarry wrote:Save some power and lower fps, analysing 20fps vs 5fps won't make much difference to detection, but massive difference in load.
I know, I read about that as well. The load is not bad, I get 0.3-0.4 load with this setting (already lowered from 25fps anyway :) ). I suppose it should work at either rate, and I expect that lowering the load might just "conceal" the issue by making it come up less often. Once there is a solution to the problem I will probably lower the fps. For now I suppose it will not hurt to run at this fps.
SteveGilvarry
Posts: 494
Joined: Sun Jun 29, 2014 1:12 pm
Location: Melbourne, AU

Re: Zoneminder constantly crashing entire system

Post by SteveGilvarry »

Not a DB command. Not across that code and pretty sure no one is testing it.
http://man7.org/linux/man-pages/man2/select.2.html
XaFFaX
Posts: 18
Joined: Wed Aug 17, 2016 7:00 am

Re: Zoneminder constantly crashing entire system

Post by XaFFaX »

SteveGilvarry wrote:Not a DB command. Not across that code and pretty sure no one is testing it.
http://man7.org/linux/man-pages/man2/select.2.html
Ok, I see it now, some kind of file operation. I did not go in-depth with the code, since I did not think I would figure out what it does anyway, but it is good to know.

Anyway, after the proposed change in the settings I do not see any "Select timeouts" (at least for now). It may be it fixed that problem. I will let Zoneminder run for more time though, because maybe this will show up later. I will also see if it crashes in coming hours.
mikb
Posts: 678
Joined: Mon Mar 25, 2013 12:34 pm

Re: Zoneminder constantly crashing entire system

Post by mikb »

As a general principle, Linux does not fall on its face due to user processes going awry.

Even if your Zoneminder ate your whole system resources, I'd expect the system to crawl to an unusable halt, but not kernel panic and crash :(

As to not getting logs of the kernel panic, there are methods to get syslog to log to a remote machine, or to a serial port, which may help you. The reason you won't see the crash in the logs is that the machine crashed (kind of circular problem there) and didn't flush the log files out. Annoying as this is, you do not WANT a braindamaged kernel trying to write to your file systems!
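As a rough sketch of the receiving side of the remote-logging idea, the Python below listens for UDP datagrams on a second machine and appends each one to a file, forcing it to disk immediately so the last lines before a crash survive. This only assumes the crashing machine has somehow been set up to send its log lines as plain UDP text; how that is configured is not shown here. The function names, the port and the file path are hypothetical, and the self-test at the bottom only sends one datagram over loopback.

```python
import os
import socket

def open_listener(host, port):
    """Bind a UDP socket that remote log datagrams can be sent to."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock

def receive_one(sock, timeout_s):
    """Return (sender_ip, text) for one datagram, or None on timeout."""
    sock.settimeout(timeout_s)
    try:
        data, (sender_ip, _port) = sock.recvfrom(65535)
    except socket.timeout:
        return None
    return sender_ip, data.decode("utf-8", errors="replace").rstrip("\n")

def log_forever(sock, path):
    """Append every received datagram to `path`, syncing after each write."""
    with open(path, "a", encoding="utf-8") as out:
        while True:
            item = receive_one(sock, timeout_s=60.0)
            if item is None:
                continue  # nothing arrived in this window, keep waiting
            out.write(f"{item[0]} {item[1]}\n")
            out.flush()
            os.fsync(out.fileno())  # make sure the line reaches the disk

# Loopback self-test: send one datagram to ourselves and read it back.
listener = open_listener("127.0.0.1", 0)  # port 0 = let the OS pick a free port
port = listener.getsockname()[1]
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"kernel: example message\n", ("127.0.0.1", port))
received = receive_one(listener, timeout_s=2.0)
sender.close()
listener.close()
print(received)
```

On the real logging machine one would bind to "0.0.0.0" and a fixed port of one's choosing, then call log_forever(listener, path) instead of the self-test.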

P.S. the select() call is to let a process wait efficiently for read/write activity on a file, a TCP socket, etc. Rather than furious polling, or getting stuck waiting for input/output that never moves. It is nothing to do with SELECT * FROM TABLE WHERE .... SQLspeak. And probably not connected to your problem.
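To make the select() point concrete, here is a minimal Python sketch (not ZoneMinder's code; the helper name wait_readable and the 0.2 second timeout are invented for the example). It waits on one end of a socket pair twice: once with nothing to read, so select() gives up after the timeout, and once after data has been written to the other end, so it returns straight away.

```python
import select
import socket

def wait_readable(sock, timeout_s):
    """Block until `sock` has data to read or `timeout_s` seconds pass.

    Returns True if data is ready, False if the wait timed out.
    """
    readable, _, _ = select.select([sock], [], [], timeout_s)
    return bool(readable)

a, b = socket.socketpair()
try:
    # Nothing has been written to `b` yet, so this wait times out.
    first = wait_readable(a, 0.2)
    # After writing to `b`, `a` becomes readable and select() returns at once.
    b.sendall(b"frame")
    second = wait_readable(a, 0.2)
finally:
    a.close()
    b.close()

print("first wait ready:", first)
print("second wait ready:", second)
```

A message like "Select timed out" would presumably correspond to the first case: the process waited for input for the allowed time and nothing arrived.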
XaFFaX
Posts: 18
Joined: Wed Aug 17, 2016 7:00 am

Re: Zoneminder constantly crashing entire system

Post by XaFFaX »

mikb wrote:As a general principle, Linux does not fall on its face due to user processes going awry.

Even if your Zoneminder ate your whole system resources, I'd expect the system to crawl to an unusable halt, but not kernel panic and crash :(

As to not getting logs of the kernel panic, there are methods to get syslog to log to a remote machine, or to a serial port, which may help you. The reason you won't see the crash in the logs is that the machine crashed (kind of circular problem there) and didn't flush the log files out. Annoying as this is, you do not WANT a braindamaged kernel trying to write to your file systems!

P.S. the select() call is to let a process wait efficiently for read/write activity on a file, a TCP socket, etc. Rather than furious polling, or getting stuck waiting for input/output that never moves. It is nothing to do with SELECT * FROM TABLE WHERE .... SQLspeak. And probably not connected to your problem.
Yes, I am aware of this; today no operating system should "go nuts" from just a problem in a user process (in theory at least; using say C++ you can write an app that will kill the operating system, but I do not think that is a goal for Zoneminder :D ). This is why I am pulling my hair out trying to understand what the problem is, especially since all other apps seem to work fine, and the HW seems to be fine too. Maybe there is some kind of driver issue? But then Zoneminder does not use anything "special" that needs a specific driver (especially as I use an IP camera), and the network card is used regardless of Zoneminder, with 100s of GB of data every month or so (I use this machine as a DLNA, backup, samba, torrent and a few other kinds of server; it used to be a router too, but then my provider changed...). Unless there is some strange bug in either Zoneminder or the way it interacts with MySQL or Apache that causes this, I do not know how to explain its behaviour. Or perhaps I have something messed up with, for example, MySQL, but I do not think so...

As for logs - yes, it does make sense that there may be no logs of the crash, which makes it even more difficult and annoying to track down. This is why I wrote to the forum; maybe someone knows what is going on, or what to check and how, to find the cause and fix it.

As for the select() function - yeah, I read the docs that were posted a few posts back, and I have an idea of what it does. I understand that this is not related to the DB. The interesting thing, however, is that after deselecting the option mentioned (OPT_FRAME_SERVER) I no longer get those "Select timed out" warnings/errors in the log, so this has improved at least. I just have the regular "Analysing at x fps" etc. entries.

Another thing is that I started Zoneminder after this change this morning (about 12 hours ago) and it has worked fine until now. That does not mean anything yet, since it has run for 10h+ before, but at least my logs are not overwhelmed by useless entries. I will leave it running tonight and see if it is still working in the morning.

Thanks again for all the help; I will post again if I have anything useful to add.

Regards,
XaFFaX
XaFFaX
Posts: 18
Joined: Wed Aug 17, 2016 7:00 am

Re: Zoneminder constantly crashing entire system

Post by XaFFaX »

As an update - after changing the recommended option (OPT_FRAME_SERVER), not only am I no longer getting "select timeouts", but Zoneminder has now been running for almost 26h without a problem. I am cautiously optimistic that this may have been the problem. Obviously this does not prove anything yet, but things seem to have improved.

I still do not know, however, why I was getting those, and what this select() which times out is supposed to do. I think someone should take a look at this and at least let us know how to actually fix it. Others may hit this issue too, and it may be difficult to track down.
XaFFaX
Posts: 18
Joined: Wed Aug 17, 2016 7:00 am

Re: Zoneminder constantly crashing entire system

Post by XaFFaX »

So unfortunately it crashed again. I got a kernel panic, but this time I do have some logs of what happened. I do not know if they will be of any help, but maybe:

Code:

Aug 19 09:24:27 xaffax-router kernel:  [<ffffffff811791d4>] __alloc_pages_nodemask+0x144/0xc10
Aug 19 09:24:27 xaffax-router kernel:  [<ffffffffa01c84e5>] ? __ext4_handle_dirty_metadata+0x45/0x200 [ext4]
Aug 19 09:24:27 xaffax-router kernel:  [<ffffffff8122d664>] ? __find_get_block+0xa4/0x120
Aug 19 09:24:27 xaffax-router kernel:  [<ffffffff811c8f9c>] alloc_pages_current+0x8c/0x110
Aug 19 09:24:27 xaffax-router kernel:  [<ffffffff8116f94a>] __page_cache_alloc+0xca/0xe0
Aug 19 09:24:27 xaffax-router kernel:  [<ffffffff8116fa40>] pagecache_get_page+0xe0/0x220
Aug 19 09:24:27 xaffax-router kernel:  [<ffffffff8116fba6>] grab_cache_page_write_begin+0x26/0x40
Aug 19 09:24:27 xaffax-router kernel:  [<ffffffffa019952d>] ext4_da_write_begin+0x9d/0x3a0 [ext4]
Aug 19 09:24:27 xaffax-router kernel:  [<ffffffff812f73b9>] ? memcpy_from_page+0x49/0x90
Aug 19 09:24:27 xaffax-router kernel:  [<ffffffff8116f78d>] generic_perform_write+0xcd/0x1c0
Aug 19 09:24:27 xaffax-router kernel:  [<ffffffff8121195f>] ? file_update_time+0x5f/0x110
Aug 19 09:24:27 xaffax-router kernel:  [<ffffffff81170d1c>] __generic_file_write_iter+0x14c/0x1c0
Aug 19 09:24:27 xaffax-router kernel:  [<ffffffffa018dd33>] ext4_file_write_iter+0xe3/0x3e0 [ext4]
Aug 19 09:24:27 xaffax-router kernel:  [<ffffffff810c433a>] ? mutex_spin_on_owner.isra.0+0x6a/0x80
Aug 19 09:24:27 xaffax-router kernel:  [<ffffffff811f6237>] vfs_iter_write+0x77/0xc0
Aug 19 09:24:27 xaffax-router kernel:  [<ffffffff81227f9e>] iter_file_splice_write+0x25e/0x390
Aug 19 09:24:27 xaffax-router kernel:  [<ffffffff81229476>] SyS_splice+0x326/0x790
Aug 19 09:24:27 xaffax-router kernel:  [<ffffffff815c7232>] entry_SYSCALL_64_fastpath+0x1a/0xa4
Aug 19 09:24:27 xaffax-router kernel: Disabling lock debugging due to kernel taint
Aug 19 09:24:26 xaffax-router zmdc[1240]: INF ['zmc -m 1' crashed, signal 134]
Aug 19 09:24:27 xaffax-router systemd-coredump[11402]: Process 11396 (zmc) of user 33 dumped core.

                                                       Stack trace of thread 11396:
                                                       #0  0x00007fb33e0e704f raise (libc.so.6)
                                                       #1  0x00007fb33e0e847a abort (libc.so.6)
                                                       #2  0x00007fb33e124c50 __libc_message (libc.so.6)
                                                       #3  0x00007fb33e1acf17 __fortify_fail (libc.so.6)
                                                       #4  0x00007fb33e1acee0 __stack_chk_fail (libc.so.6)
                                                       #5  0x00007fb33fbdaa1f n/a (libavcodec.so.57)
                                                       #6  0x0000000000000fab n/a (n/a)
Aug 19 09:24:33 xaffax-router zmwatch[1310]: INF [Restarting capture daemon for Monitor-1, time since last capture 7 seconds (1471591473-1471591466)]
Aug 19 09:24:34 xaffax-router zmdc[1240]: INF ['zmc -m 1' starting at 16/08/19 09:24:34, pid = 11417]
Aug 19 09:24:34 xaffax-router zmdc[11417]: INF ['zmc -m 1' started at 16/08/19 09:24:34]
Aug 19 09:24:34 xaffax-router undef[11417]: INF [No Server ID or Name specified in config.  Not using Multi-Server Mode.]
Aug 19 09:24:34 xaffax-router zmc_m1[11417]: INF [Starting Capture version 1.29.0]
Aug 19 09:24:34 xaffax-router zmc_m1[11417]: INF [Priming capture from rtsp://192.168.1.5:554/live/ch00_0]
Does anyone have any idea what could be the cause? I recall now that libavcodec.so has had a problem and crashed before.

If anyone can give me a hint on what I can check I will appreciate it.

Regards,
XaFFaX
XaFFaX
Posts: 18
Joined: Wed Aug 17, 2016 7:00 am

Re: Zoneminder constantly crashing entire system

Post by XaFFaX »

Just as a closing post - when I ran out of options I decided to check the hardware of my system more thoroughly, to be sure this was not a hardware issue. I started replacing components to see whether any of them caused the problem. It turned out that when I replaced the RAM the system stopped crashing! The most interesting thing is that when I later ran MEMTEST on that RAM in the same machine for over 13h (15 full passes) there were NO ERRORS... It seems one cannot trust anybody now... It looks like some strange incompatibility at the OS level caused it. Since changing the RAM I have had several days of uptime without an issue.

I just wanted to post that to close down the thread. Thanks to all for all the help!

EDIT: I cannot seem to close the thread. Please close it if possible to keep things clean.
akg1508
Posts: 26
Joined: Sun Aug 02, 2015 9:06 am

Re: Zoneminder constantly crashing entire system

Post by akg1508 »

You can mark the thread solved by editing your original post, which will allow you to edit the subject, and insert [SOLVED] at the start.