Repeatedly connecting and breaking the client connection can lead to 1. unreliable messages, 2. server crash #1
I did some "stress testing" (aka "I was doing stupid things to see if something breaks" :) ). I was using the example `examples/combinedexample/combinedexample.dpr`, only minimally modified to display some additional information (patch below, although it probably doesn't matter). If I exit the client gracefully (by pressing Enter, so it disconnects nicely), things always work OK in my tests.
But if I start to kill the client processes (with Ctrl + C in the console), I can break things. My test: I execute `combinedexample Server` in one console, and then repeatedly execute `combinedexample Client` in the other console, killing the client with Ctrl + C before executing `combinedexample Client` again. There are two issues I can reproduce (after ~21 runs of the client):

1. Sometimes (seldom, but sometimes) a message is lost for a new client. It seems that sometimes the client does not receive all 4 messages. See how "Hello another world in an world! Yet another hello world with an yet another hello world!" is missing for some client invocations in the attached client.txt. I waited a bit -- I was not killing the process immediately with Ctrl + C. So it seems that the message was really lost. If I understand correctly, this should not happen -- channel 0 is RNL_PEER_RELIABLE_ORDERED_CHANNEL, so the message should always eventually be delivered?

2. After ~21 executions of the client this way, the server crashes. A malicious user could crash the server this way (by repeatedly connecting and killing the client).

I attach full logs from the server and client consoles. Note that, even though I was killing the client processes sequentially (I always killed the previous client before starting the new one), the server was receiving the "A client disconnected" message with some delay, sometimes after the next client had already connected. Possibly this is related to the observed problems.
My environment is FPC 3.0.2 on Linux x86_64.
Attachments:
client.txt
server.txt
unimportant_id_log.patch.txt
I've tried to fix the crash issue; can you test it again? Once with the RNL_LINEAR_PEER_LIST global define and once without it. Without the RNL_LINEAR_PEER_LIST global define, the peer list is now a circular doubly linked list instead of a linear TRNLObjectList list.
And regarding the other issue: you have probably not waited long enough, because on each missed message acknowledgement (packet loss), the resend timeout interval is doubled. But maybe I should change that to a fixed fast resend timeout interval value (for example 10 ms or 20 ms), at least as an optional option. Can you test it again with a little resend-timeout waiting patience? :-)
Yes, as the default channel type, but you can also change it per channel via aPeer.ChannelTypes[aChannelIndex], assigning any TRNLPeerChannelType enum value.
And here is my own test of your issue report: https://youtu.be/6UatPk4qG7c where I've tested it on the Windows 10 Linux Subsystem with an installed Ubuntu 16.04. But I've also tested it on my root server with Debian 9 64-bit, on my vserver also with Debian 9 64-bit, on my MacBook Pro with Arch Linux 64-bit, and on my ThinkPad X230 with the older Debian 8 64-bit, with exactly the same results as in the test on the Windows 10 Linux Subsystem.
But I must really say, the Windows 10 Linux Subsystem is very useful for fast debugging and testing of such Linux stuff under Windows, without needing to start a Linux VM or boot a second computer with Linux. ;-)
I tested the latest code with and without `-dRNL_LINEAR_PEER_LIST`. In both cases, both of the issues I reported are now fixed :) That is, I now always receive all 4 messages immediately, and I am not able to crash the server anymore. I did not have to wait any noticeable time for all 4 messages to appear.
The version without `-dRNL_LINEAR_PEER_LIST` exhibited a new issue: sometimes the new client seemingly hung at "Client: Connecting". I waited a bit (~30 seconds) but nothing happened. On the server console, I noticed that the server was still disconnecting a few old client connections when I started the new client process. It disconnected them OK, but the client seemed to wait forever (well, at least ~30 seconds) for the connection.
This is not a big issue for practical applications, as killing the "stuck" client and then starting a new client worked OK.
The version with `-dRNL_LINEAR_PEER_LIST` did not exhibit this issue. So it's 100% perfect in my tests :)

BTW, with `-dRNL_LINEAR_PEER_LIST` the RNL unit does not compile. It's just a typo, an extra "f" inside; I changed `OtherfPeerListIndex` -> `OtherPeerListIndex` :)

Thanks for such a quick fix, and the info!
And yet another fix, for a non-crash issue: a possible event-related memory leak and a forgotten socket shutdown => github.com/BeRo1985/rnl@de8b4c68b2

I retested today the latest versions, both with and without `-dRNL_LINEAR_PEER_LIST`. The original issues I reported here (unreliable messages and server crash) are definitely gone now :)
Very, very seldom I can get a situation where a new client is "stuck" waiting for the connection to happen at "Client: Connecting". It turns out this happens regardless of whether the `-dRNL_LINEAR_PEER_LIST` symbol is defined at compilation. It requires some patience to reproduce (run the client, kill it with Ctrl + C, run the client again, repeat...). And it's a completely unimportant issue -- one can kill the "stuck" client, and another client connects OK. So the server is 100% working for new clients. If you cannot reproduce this issue on your systems, please close this ticket :)
It definitely happens when the new client is run (it prints "Client: Connecting") while the server is still processing disconnects from the previous clients. So on the server console I see... and the new client is "stuck" at the "Client: Connecting" stage (I waited 3 minutes). I attach the full log from the console where I ran the server and client(s).

client2.txt
server2.txt
Can you test it also with -dRNL_DEBUG, -dRNL_DEBUG_COMPRESS and -dRNL_DEBUG_SECURITY, for a more verbose debug output?
Sure, here's the log with the above flags, when I reproduced the problem (the client is stuck at "Client: Connecting" now).
client3.txt
server3.txt
OK hm, the last important line in the txt files on the server side is DispatchReceivedHandshakePacketConnectionApprovalAcknowledge, and on the client side DispatchReceivedHandshakePacketConnectionApprovalResponse, so the connection is actually established then -- unless something in TRNLHost.DispatchReceivedHandshakePacketConnectionApprovalResponse goes wrong here, which can't be true, because the server receives the ApprovalAcknowledge packet. Hmmmmmm, can you put more debug writelns into the TRNLHost.DispatchReceivedHandshakePacketConnectionApprovalAcknowledge and TRNLHost.DispatchReceivedHandshakePacketConnectionApprovalResponse procedures and retest it then? Because it does not seem to happen on my computers, but I still want to know what the cause is in order to fix it.
And hm, also "Peer 0: compressed 76 => uncompressed 234 (32.5%)" on the client side, so this means that the client has received the 4 messages too, but then doesn't write any events into the event queue, hmm.
Could you attach gdb or lldb to a connect-hanging client instance, and then see what's going on? => https://stackoverflow.com/questions/2308653/can-i-use-gdb-to-debug-a-running-process
And you could also try to disable the poll() codepath in the TRNLRealNetwork.SocketWait procedure by changing `{$if defined(fpc) and defined(Unix) and declared(fppoll) and not defined(Darwin)}` to `{$if false}`, so that the select() codepath is used instead, just for testing whether the poll function is broken on your Linux kernel version or libc version.

Disabling the poll() codepath -- this didn't help, I can still reproduce the same problem.
Putting Writelns in DispatchReceivedHandshakePacketConnectionApprovalAcknowledge and DispatchReceivedHandshakePacketConnectionApprovalResponse -- they both always finish OK. The code is not stuck inside them.

Your notes suggesting that "it should not happen" inspired me to test ConsoleOutput. Maybe everything is working OK, and only the output to the console is "stuck"? Indeed! If I replace the ConsoleOutput implementation to simply do `Writeln(s);`, then everything works!

What's more -- it means that I can press Enter on the "stuck" client, and see that it's not "stuck" at all. Only the output was stuck. Once I press Enter, all the messages are printed OK. (Previously, I was always killing the "stuck" client with Ctrl + C after some time, instead of pressing Enter -- not smart of me.)
So it's just that the output from ConsoleOutput is sometimes not displayed in any finite time. At the end you always do `FlushConsoleOutput`, and it shows that all the messages are OK.

I'm attaching client5.txt. The ConsoleOutput looked like this for this test:
As you can see in client5.txt, the messages are received OK, because I get the compression line early. But then I have to press "Enter" to see the rest, and then all the other client messages are displayed (without the `Direct Writeln` prefix). In this case even the last message is displayed only after the client actually disconnected.
It seems that `ConsoleOutputConditionVariable.Signal` is not delivered when it should be. I placed the `writeln('wrSignaled');` right where we get `wrSignaled`, but in client5.txt it does not occur until we actually disconnect. I'm attaching testing.patch.txt so you know what code modifications I did for testing.
client5.txt
testing.patch.txt
OK, since it seems that it is only an issue of the combinedexample project itself on some systems, I'll close this issue now :-)