Page 1 of 1
Data corruption with mmapped shm (shared_data size conflict)
Posted: Thu Sep 10, 2015 6:53 pm
by jcrews
I've been using ZM with success for quite some time. Recently I bought an h.264 camera, and had quite a few log messages complaining about shared_data size mismatches (e.g., expected 336, got <large integer>). It mostly seemed to work fine, otherwise, however. zmwatch, which was complaining about the size mismatch, was not functional and would never catch a dead capture daemon.
The cause was found to be that zmdc.pl, which spawns every other zm process with in a fork()ed context, closes *all* of its file descriptors prior to exec. That, combined with libvlc, which likes to print to sterr even when -q is passed, would overwrite an arbitrary amount of the memory mapped ipc blob whenever it generated log messages (which was often). The first casualty is the shared_data struct, and could extend all the way to image buffers if the message is relatively long.
I solved this on my servers by altering zmdc.pl's start() to start with my $fd = 3 instead of 0, to skip 0,1,2 (stdin, stdout, stderr). This way, traffic to/from stdio won't overwrite private data accidentally. This is at line 463 in the 1.28.1 distribution.
Since stdio is system managed anyway, there doesn't appear to be any obvious harm in holding onto those fds, to prevent undefined behavior from arising in cases where external library functions write to them, and those fds (0-2) happen to be assigned to private endpoints.
Re: Data corruption with mmapped shm (shared_data size confl
Posted: Fri Sep 11, 2015 1:57 pm
by iconnor
I'm a little shocked that writing to a closed fd is causing random memory overwrites.
I think some more research is in order.
Good find though.
Re: Data corruption with mmapped shm (shared_data size confl
Posted: Fri Sep 11, 2015 2:13 pm
by iconnor
So... you are saying that instead of closing all file descriptors, you are closing everything except STDIN,STDOUT,STDERR? which would leave them inherited by the sub processes.
Which is not what we want.
Just thoughts...
Re: Data corruption with mmapped shm (shared_data size confl
Posted: Fri Sep 11, 2015 2:39 pm
by iconnor
The SQLite guys seem to have seen this as well....
https://www.sqlite.org/howtocorrupt.html
Sounds like, because we are closing them all, when the subprocess mmap's the shm file, it gets fd 2.
I'm not really sure how to fix this, other than leaving them open. The reason to close them is to prevent hang when the subprocess does in/out on stdin/stdout/stderr. I think the hang only happens on stdin though.
Alternatively we could check at mmap time that we get an fd > 2.
Or maybe we should have the zm logger re-open stderr and do something with it.
Re: Data corruption with mmapped shm (shared_data size confl
Posted: Mon Sep 14, 2015 9:08 pm
by jcrews
I thought I had email notification on, sorry about the delay.
Yes, because all fds are closed by the parent, fd2 could be assigned to the shm file. That neatly defines the problem.
in zmdc, there's a fork() that sets up a child process to do the server work (select, accept...). In that context, the stdio fds are still closed, since that process doesn't need them, and would otherwise cause a blocking read against stdin.
In the parent context, the stdio fds are left open, expressly so that child processes will inherit them, so they are unavailable for assignment when calling open(). It seemed reasonable considering the stdio handles can be used by an external library without warning. I would handle stdio requirements in the child context, as needed (I didn't look for, but also didn't run into any issues so far).
I messed around with the idea of just checking the fd and using dup() until an fd >=3 was assigned, but leaving stdio alone felt more like a best practice, since this class of bug may exist just about anywhere else. As I understand it, it should be up to the child to do what it wants with stdio, since I don't know of a way to re-open them once closed, without making some bad assumptions.
Picking up stdout/err in the logger was another idea I tossed around, but it could produce an interleaved mess should more than one external library get used at the same time. That may be more of an academic concern than a practical one, however, especially given the specialization of each component. The second question to answer is how to assign severity: is stderr warning, and stdout info? That feels reasonable, but they seem to get used somewhat inconsistently in the real world.
Re: Data corruption with mmapped shm (shared_data size confl
Posted: Wed Sep 16, 2015 12:53 pm
by iconnor
The reason that stdin is closed is that if any of the children somehow read from stdin they would hang quietly. Shoudn't ever happen. I would like to hear from the original author (Phil) why he did things this way.
Anyways, I really think the logger code should pick up writes to stderr and send them into where we have logging set to. Otherwise they just get lost.
Also, I think we need the code to check the fd at mmap time. Just for robustness sake. It's on my todo list. Also accepting patches and PRs.
Re: Data corruption with mmapped shm (shared_data size confl
Posted: Wed Sep 16, 2015 1:04 pm
by iconnor