Repeatedly connecting and breaking the client connection can lead to 1. unreliable messages, 2. server crash #1

Closed
opened 2017-10-09 13:36:19 +00:00 by michaliskambi · 14 comments
michaliskambi commented 2017-10-09 13:36:19 +00:00 (Migrated from github.com)

I did some "stress testing" (aka "I was doing stupid things to see if something breaks" :) ). I was using the example `examples/combinedexample/combinedexample.dpr`, only minimally modified to display some additional information (patch below, although it probably doesn't matter).

If I exit the client gracefully (by pressing Enter, so it disconnects nicely), things always work OK in my tests.

But if I start to kill the client processes (by Ctrl + C in the console), I can break things. My test: I execute `combinedexample Server` in one console, and then repeatedly execute `combinedexample Client` in the other console, killing the client with Ctrl + C before executing `combinedexample Client` again. There are two issues I can reproduce (after ~21 runs of the client):

  1. Sometimes (seldom, but sometimes) some message is lost for a new client. It seems that sometimes the client does not receive all 4 messages. See how _"Hello another world in an world! Yet another hello world with an yet another hello world!"_ is missing for some client invocations in the attached `client.txt`.

    I waited a bit -- I was not killing the process immediately with Ctrl + C. So it seems that the message was lost. If I understand correctly, this should not happen -- channel 0 is RNL_PEER_RELIABLE_ORDERED_CHANNEL, so the message should always eventually be delivered?

  2. After ~21 executions of the client this way, the server crashes:

    Server thread failed with exception class ERangeError
    Exception Message: Out of index range
    Stack trace:
      $00000000004EC8D6 line 8272 of ../../src/RNL.pas
    

    A malicious user could crash the server this way (by repeatedly connecting + killing the client).

I attach a full log from the server and client consoles. Note that, even though I was killing the client processes sequentially (I always killed the previous client before starting the new one), the server was receiving the "A client disconnected" events with some delay, sometimes after the next client had already connected. Possibly this is related to the observed problems.

My environment is FPC 3.0.2 on Linux x86_64.

Attachments:

[client.txt](https://github.com/BeRo1985/rnl/files/1368611/client.txt)

[server.txt](https://github.com/BeRo1985/rnl/files/1368612/server.txt)

[unimportant_id_log.patch.txt](https://github.com/BeRo1985/rnl/files/1368615/unimportant_id_log.patch.txt)

BeRo1985 commented 2017-10-09 15:19:15 +00:00 (Migrated from github.com)

I've tried to fix the crash issue, can you test it again? Once with the RNL_LINEAR_PEER_LIST global define and once without it, because without RNL_LINEAR_PEER_LIST the peer list is now a circular doubly linked list instead of a linear TRNLObjectList<TRNLPeer> list.

As for the other issue, you probably did not wait long enough: on each missing-acknowledgement packet loss, the resend timeout interval is doubled. Maybe I should change that to a fixed, fast resend timeout interval (for example 10 ms or 20 ms), at least as an optional setting. Can you test it again with a little resend-timeout patience? :-)
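The doubling behavior described above can be sketched roughly like this (the function name, parameters, and base value are illustrative assumptions, not RNL's actual internals):

```pascal
// Hypothetical sketch of exponential backoff for reliable resends.
// Names and the base timeout are illustrative; RNL's real code differs.
function NextResendTimeout(const aBaseTimeout:Int64;const aAttempt:Int32):Int64;
begin
 // Double the interval on every unacknowledged resend attempt:
 // base, 2*base, 4*base, ... so repeated packet loss quickly
 // stretches the wait into seconds.
 result:=aBaseTimeout shl aAttempt;
end;
```

For example, with a 500 ms base, attempt 5 would already wait 500 shl 5 = 16000 ms, i.e. 16 seconds -- which would match the observation that "lost" reliable messages do eventually arrive if one waits long enough.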

BeRo1985 commented 2017-10-09 15:39:42 +00:00 (Migrated from github.com)

> channel 0 is RNL_PEER_RELIABLE_ORDERED_CHANNEL , so message should be eventually always delivered?

As the default channel type, yes, but you can also change it via aPeer.ChannelTypes[aChannelIndex] to any TRNLPeerChannelType enum value:

 TRNLPeerChannelType=
  (
   RNL_PEER_RELIABLE_ORDERED_CHANNEL=0,
   RNL_PEER_RELIABLE_UNORDERED_CHANNEL=1,
   RNL_PEER_UNRELIABLE_ORDERED_CHANNEL=2,
   RNL_PEER_UNRELIABLE_UNORDERED_CHANNEL=3
  );
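For example, switching a peer's channel 1 to fire-and-forget semantics would look roughly like this (a minimal sketch; `aPeer` is assumed to be an already-connected TRNLPeer, and the channel index is arbitrary):

```pascal
// Configure per-channel delivery guarantees on a peer.
// Channel 0 keeps the reliable-ordered default; channel 1 is switched
// to unreliable-unordered, e.g. for frequently refreshed position updates
// where a lost packet is cheaper than a retransmission.
aPeer.ChannelTypes[1]:=RNL_PEER_UNRELIABLE_UNORDERED_CHANNEL;
```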
BeRo1985 commented 2017-10-09 15:53:37 +00:00 (Migrated from github.com)

And here is my own test of your issue report: https://youtu.be/6UatPk4qG7c where I've tested it on the Windows 10 Linux Subsystem with an installed Ubuntu 16.04. I've also tested it on my root server with Debian 9 64-bit, on my vserver, also with Debian 9 64-bit, on my MacBook Pro with Arch Linux 64-bit, and on my ThinkPad X230 with the older Debian 8 64-bit, with exactly the same results as in the test on the Windows 10 Linux Subsystem.

But I must really say, the Windows 10 Linux Subsystem is very useful for quickly debugging and testing such Linux stuff under Windows, without needing to start a Linux VM or boot a second computer into Linux. ;-)

michaliskambi commented 2017-10-09 17:26:07 +00:00 (Migrated from github.com)
  1. I tested the latest code with and without -dRNL_LINEAR_PEER_LIST. In both cases, both of the issues I reported are now fixed :) That is, I now always receive all 4 messages immediately, and I am no longer able to crash the server.

    I did not have to wait for any noticeable time for all 4 messages to appear.

  2. The version without -dRNL_LINEAR_PEER_LIST exhibited a new issue: sometimes the new client seemingly hung at "Client: Connecting":

    lib/x86_64-linux/combinedexample Client
    Console output: Thread started
    Client: Thread started
    Client: Connecting
    # ... hangs here
    

    I waited a bit (~30 seconds) but nothing happened. On the server console, I noticed that the server was still disconnecting a few of old client connections when I started the new client process. And it disconnected them OK, but the client seemed to still wait forever (well, at least ~30 seconds) for the connection.

    This is not a big issue for practical applications, as killing the "stuck" client and then starting a new client worked OK.

    The version with -dRNL_LINEAR_PEER_LIST did not exhibit this issue. So it's 100% perfect in my tests :)

  3. BTW, with -dRNL_LINEAR_PEER_LIST the RNL unit does not compile:

    Compile Project, Target: lib/x86_64-linux/combinedexample: Exit code 256, Errors: 1
    RNL.pas(15930,20) Error: Identifier not found "OtherfPeerListIndex"
    

    Just a typo, extra "f" inside, I changed OtherfPeerListIndex -> OtherPeerListIndex :)

Thanks for such a quick fix, and for the info!

BeRo1985 commented 2017-10-09 18:05:29 +00:00 (Migrated from github.com)

And yet another fix: a non-crash issue, a possible event-related memory leak, and a forgotten socket shutdown => https://github.com/BeRo1985/rnl/commit/de8b4c68b20de75118cc8ee3ecd10d8b2d43288b
michaliskambi commented 2017-10-10 07:28:25 +00:00 (Migrated from github.com)

I retested today the latest versions, both with and without -dRNL_LINEAR_PEER_LIST.

The original issues I reported here (unreliable messages or server crash) are definitely gone now :)

Very, very seldom I can get a situation where a new client is "stuck" waiting for the connection to happen at `Client: Connecting`. It turns out this happens regardless of whether the -dRNL_LINEAR_PEER_LIST symbol is defined at compilation.

It requires some patience to reproduce (run the client, kill it with Ctrl + C, run the client again, repeat...). And it's not an important issue at all -- one can kill the "stuck" client, and another client connects OK. So the server is 100% working for new clients. If you cannot reproduce this issue on your systems, please close this ticket :)

It definitely happens when the new client is run (it does Client: Connecting), while the server is still processing disconnects from the previous clients. So on the server console I see

....
Server: A client disconnected, local peer ID 10, remote peer ID 0, channels count 4
Server: A client disconnected, local peer ID 9, remote peer ID 0, channels count 4
# and now the new client is already at "Client: Connecting" state....
Server: A client disconnected, local peer ID 5, remote peer ID 0, channels count 4
Server: A client disconnected, local peer ID 13, remote peer ID 0, channels count 4
Server: A client disconnected, local peer ID 14, remote peer ID 0, channels count 4
Server: A client disconnected, local peer ID 11, remote peer ID 0, channels count 4
Server: A client disconnected, local peer ID 6, remote peer ID 0, channels count 4
Server: A client disconnected, local peer ID 12, remote peer ID 0, channels count 4
Server: A client disconnected, local peer ID 3, remote peer ID 0, channels count 4
Server: A client disconnected, local peer ID 8, remote peer ID 0, channels count 4

... and the new client is "stuck" at Client: Connecting stage (waited 3 minutes). I attach the full log from the console where I run server and client(s).

[client2.txt](https://github.com/BeRo1985/rnl/files/1370922/client2.txt)
[server2.txt](https://github.com/BeRo1985/rnl/files/1370923/server2.txt)

BeRo1985 commented 2017-10-10 07:35:56 +00:00 (Migrated from github.com)

Can you also test it with -dRNL_DEBUG, -dRNL_DEBUG_COMPRESS and -dRNL_DEBUG_SECURITY, for more verbose debug output?

michaliskambi commented 2017-10-10 08:16:52 +00:00 (Migrated from github.com)

> Can you test it also with -dRNL_DEBUG and -dRNL_DEBUG_COMPRESS and -dRNL_DEBUG_SECURITY ? For a more verbose debug output

Sure, here's the log with the above flags, when I reproduced the problem (the client is stuck at "Client: Connecting" now).
[client3.txt](https://github.com/BeRo1985/rnl/files/1371059/client3.txt)
[server3.txt](https://github.com/BeRo1985/rnl/files/1371060/server3.txt)

BeRo1985 commented 2017-10-10 08:38:26 +00:00 (Migrated from github.com)

Ok hm, the last important line in the txt files on the server side is DispatchReceivedHandshakePacketConnectionApprovalAcknowledge, and on the client side DispatchReceivedHandshakePacketConnectionApprovalResponse, so the connection is actually established then -- unless something in TRNLHost.DispatchReceivedHandshakePacketConnectionApprovalResponse goes wrong here, which can't be true, because the server receives the ApprovalAcknowledge packet. Hmmmmmm, can you put more debug writelns into the TRNLHost.DispatchReceivedHandshakePacketConnectionApprovalAcknowledge and TRNLHost.DispatchReceivedHandshakePacketConnectionApprovalResponse procedures and retest? It does not seem to happen on my computers, but I still want to know the cause of it in order to fix it.

BeRo1985 commented 2017-10-10 08:45:33 +00:00 (Migrated from github.com)

And hm, also "Peer 0: compressed 76 => uncompressed 234 (32.5%)" on the client side, so this means that the client has received the 4 messages too, but then doesn't write any events into the event queue, hmm.

BeRo1985 commented 2017-10-10 08:49:05 +00:00 (Migrated from github.com)

Could you attach gdb or lldb to a connect-hanging client instance, and then see what's going on? => https://stackoverflow.com/questions/2308653/can-i-use-gdb-to-debug-a-running-process

BeRo1985 commented 2017-10-10 08:58:18 +00:00 (Migrated from github.com)

And you could also try to disable the poll() codepath in the TRNLRealNetwork.SocketWait procedure by changing `{$if defined(fpc) and defined(Unix) and declared(fppoll) and not defined(Darwin)}` to `{$if false}`, so that the select() codepath is used instead -- just to test whether the poll function is broken on your Linux kernel version or libc version.

michaliskambi commented 2017-10-10 09:48:49 +00:00 (Migrated from github.com)
  1. Disabling the poll() codepath - this didn't help, I can still reproduce the same problem.

  2. Putting Writelns in DispatchReceivedHandshakePacketConnectionApprovalAcknowledge and DispatchReceivedHandshakePacketConnectionApprovalResponse -- they both always finish OK. Code is not stuck inside them.

  3. Your notes suggesting that "it should not happen" inspired me to test ConsoleOutput. Maybe everything is working OK, and only the output to the console is "stuck"? Indeed! If I replace the ConsoleOutput implementation with a simple `Writeln(s);` then everything works!

What's more -- it means that I can press "Enter" on the "stuck" client, and see that it's not "stuck" at all. Only the output was stuck. Once I press "Enter", all the messages are printed OK. (Previously, I was always killing the "stuck" client with Ctrl + C after some time, instead of pressing "Enter" -- not smart of me.)

So it's just that the output from ConsoleOutput is sometimes not displayed in any finite time. At the end you always do FlushConsoleOutput, and it shows that all the messages are OK.

I'm attaching client5.txt. The ConsoleOutput looked like this for this test:

procedure ConsoleOutput(const s:string);
begin
 Writeln('Direct Writeln: ', s);
 ConsoleOutputQueue.Enqueue(s);
 ConsoleOutputConditionVariable.Signal;
end;

As you can see in client5.txt, the messages are received OK, because I get the

Direct Writeln: Client: A message received on channel 0: "Hello world!"

early. But then I have to press "Enter" to see

Direct Writeln: Client: Disconnecting

...and then all other client messages are displayed (without Direct Writeln prefix). In this case even the

Console output: Thread started

is displayed after the client actually disconnected.

It seems that ConsoleOutputConditionVariable.Signal is not delivered when it should be. I placed a writeln('wrSignaled'); right where we get wrSignaled, but in client5.txt it does not occur until we have actually disconnected.
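This looks like a classic lost wakeup: if Signal fires before the console thread has entered its wait, the notification is dropped and the thread sleeps until the next event (here, the Enter key). A common defense is to drain the queue in a loop and wait with a bounded timeout. A minimal sketch, assuming the example's ConsoleOutputQueue offers a boolean-returning Dequeue and the condition variable a millisecond WaitFor (both hypothetical names here):

```pascal
// Hypothetical consumer loop that tolerates lost wakeups.
// ConsoleOutputQueue and ConsoleOutputConditionVariable as in the
// combinedexample; Dequeue/WaitFor signatures are assumptions.
procedure ConsoleOutputThreadLoop;
var s:string;
begin
 while not Terminated do begin
  // Drain everything already queued, so a Signal sent before we
  // started waiting cannot strand a message in the queue.
  while ConsoleOutputQueue.Dequeue(s) do begin
   Writeln(s);
  end;
  // Bounded wait: even if a Signal is lost, we re-check the queue
  // at most 100 ms later instead of sleeping indefinitely.
  ConsoleOutputConditionVariable.WaitFor(100);
 end;
end;
```

With such a loop, a dropped Signal costs at most one timeout period of latency rather than an apparent hang.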

I'm attaching testing.patch.txt so you know what code modifications I did for testing.

[client5.txt](https://github.com/BeRo1985/rnl/files/1371370/client5.txt)

[testing.patch.txt](https://github.com/BeRo1985/rnl/files/1371374/testing.patch.txt)

BeRo1985 commented 2017-10-10 14:58:04 +00:00 (Migrated from github.com)

Ok, since it seems that this is only an issue of the combinedexample project itself on some systems, I'll close this issue now :-)
