Netio socket unsubscribe terminates the connection to Felixcore and crashes scanConsole at long latencies or many FEs
When Strips run on many FEs, we sometimes get a socket/network error at the end of the scan that terminates scanConsole
prematurely. It is usually the "wrong file descriptor" error, but sometimes it is a "could not connect to" error or something like that.
Commenting out the m_sub_sockets[chn]->unsubscribe call in NetioHandler::delChannel fixes it and does not seem to cause other problems:
void NetioHandler::delChannel(uint64_t chn){
    ...
    if(it!=m_channels.end()){
        nlog->debug("### NetioHandler::delChannel({}) -> unsubscribe", chn);
        m_channels.erase(it);
        //SHIT: please do not unsubscribe: because felixcore/netio doesn't like it
        m_sub_sockets[chn]->unsubscribe(chn, netio::endpoint(m_felixHost, m_felixRXPort));
        delete m_sub_sockets[chn];
        m_sub_sockets.erase(chn);
    }
}
It looks like when one FE deletes its channel and calls m_sub_sockets[chn]->unsubscribe, it causes Felixcore to close the whole TCP connection to scanConsole, or something like that. The other FEs then send their Netio unsubscribe to a closed connection and fail.
If the latency is short enough, you may not notice it, because all the unsubscribes reach Felixcore before the connection has been shut down. But if the latency is long or you have many FEs, it can crash the scanConsole run at the very end.
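To make the suspected sequence concrete, here is a minimal C++ sketch of it. It does not use netio: FakeConnection, FakeSubSocket and teardownFailures are hypothetical names invented for illustration, and the rule "the first unsubscribe closes the shared connection" is exactly the assumption from this report, not confirmed Felixcore behaviour. It models the worst case (long latency), where the connection is already gone before the second unsubscribe is sent.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <stdexcept>
#include <utility>

// Hypothetical stand-in for the single TCP connection shared by all channels.
struct FakeConnection {
    bool open = true;
};

// Hypothetical stand-in for a per-channel subscribe socket (not the netio API).
class FakeSubSocket {
public:
    explicit FakeSubSocket(std::shared_ptr<FakeConnection> conn)
        : m_conn(std::move(conn)) {}

    // Assumption being modelled: the first unsubscribe makes the peer close
    // the whole connection, so every later unsubscribe hits a dead socket.
    void unsubscribe(std::uint64_t /*chn*/) {
        if (!m_conn->open) {
            throw std::runtime_error("bad file descriptor");
        }
        m_conn->open = false;
    }

private:
    std::shared_ptr<FakeConnection> m_conn;
};

// Tears down n_channels channels one after another, as repeated delChannel
// calls would, and returns how many unsubscribe calls failed.
std::size_t teardownFailures(std::size_t n_channels, bool do_unsubscribe) {
    auto conn = std::make_shared<FakeConnection>();
    std::map<std::uint64_t, std::unique_ptr<FakeSubSocket>> sub_sockets;
    for (std::uint64_t chn = 0; chn < n_channels; ++chn) {
        sub_sockets[chn] = std::make_unique<FakeSubSocket>(conn);
    }

    std::size_t failures = 0;
    for (std::uint64_t chn = 0; chn < n_channels; ++chn) {
        if (do_unsubscribe) {
            try {
                sub_sockets[chn]->unsubscribe(chn);
            } catch (const std::runtime_error&) {
                ++failures;  // in scanConsole this would be the fatal error
            }
        }
        sub_sockets.erase(chn);  // the unique_ptr frees the socket
    }
    return failures;
}
```

Under this model, teardownFailures(4, true) reports a failure for every channel after the first, while teardownFailures(4, false), which corresponds to commenting out the unsubscribe, reports none. A single FE never fails, which would be consistent with the problem only showing up with many FEs.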
I think that at some point I asked to un-comment this unsubscribe, because leaving it out was causing issues with AMAC OPC communication. But that problem is long gone, and AMAC OPC now gets Netio by other means.
Still, I may be missing something about Netio. But if not, then I can open a merge request on this line.