About the TCP MSS and wrong checksums
Over the last few days, I had to refresh my knowledge of the gory details of TCP. I'm far from versed enough to get into something like this or that, but at least my current problem was solved.
I spent a few hours troubleshooting a strange TCP connection problem. HTTP connections to one particular host had stopped working since we got a new Cisco ASA 5510 firewall, managed by an external provider and set up to NAT outgoing connections. The initial connection to that host and the first HTTP request, which returned a redirect, worked, but the following request would only return a few KB of HTML and then stall. Only that host was affected; at least we had not observed the problem anywhere else.
I did a tcpdump of the traffic between the firewall and our SDSL uplink and got the following.
mp@blackbook:~$ sudo tcpdump -v -ttt -s0 -n host remote.host
tcpdump: listening on en0, link-type EN10MB (Ethernet), capture size 65535 bytes
000000 IP ([...], length 48) firewall.26167 > remote.host.80: S, cksum 0x0b7b (correct), 3325973421:3325973421(0) win 65535 <mss 1380,nop,nop,nop,nop>
213304 IP ([...], length 44) remote.host.80 > firewall.26167: S, cksum 0xe020 (correct), 83736657:83736657(0) ack 3325973422 win 32768 <mss 1380>
000320 IP ([...], length 40) firewall.26167 > remote.host.80: ., cksum 0x778e (correct), ack 1 win 65535
000555 IP ([...], length 396) firewall.26167 > remote.host.80: P, cksum 0x3f23 (correct), 1:357(356) ack 1 win 65535
225193 IP ([...], length 317) remote.host.80 > firewall.26167: P, cksum 0xdd51 (correct), 1:278(277) ack 357 win 32768
017471 IP ([...], length 1420) remote.host.80 > firewall.26167: P, cksum 0x9fce (incorrect (-> 0x9fcd), 278:1658(1380) ack 357 win 32768
016391 IP ([...], length 1420) remote.host.80 > firewall.26167: P, cksum 0x766d (incorrect (-> 0x766c), 1658:3038(1380) ack 357 win 32768
203816 IP ([...], length 40) firewall.26167 > remote.host.80: ., cksum 0x762a (correct), ack 278 win 65258
226371 IP ([...], length 1420) remote.host.80 > firewall.26167: ., cksum 0x6120 (incorrect (-> 0x611f), 3038:4418(1380) ack 357 win 32768
2. 398071 IP ([...], length 1420) remote.host.80 > firewall.26167: P, cksum 0x9fce (incorrect (-> 0x9fcd), 278:1658(1380) ack 357 win 32768
2. 947090 IP ([...], length 1420) remote.host.80 > firewall.26167: P, cksum 0x9fce (incorrect (-> 0x9fcd), 278:1658(1380) ack 357 win 32768
5. 479496 IP ([...], length 1420) remote.host.80 > firewall.26167: P, cksum 0x9fce (incorrect (-> 0x9fcd), 278:1658(1380) ack 357 win 32768
That looked as if the remote host was sending a lot of packets with wrong TCP checksums. Our firewall would just drop those packets, send no ACKs for them, and wait for a retransmission. Those retransmissions can be seen at the end of the trace, trickling in at geometrically increasing intervals. As the retransmissions show the same error, the bottom line is that the connection makes no further progress.
Interestingly, the checksums were always off by just one. Had the packets really been altered (for example, corrupted by bit errors on the wire) during transport, the deviations should be arbitrary.
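To make the "off by one" observation concrete, here is a small Python sketch that pulls the reported and expected checksums out of tcpdump lines like the ones above and prints the difference. The function name and the regular expression are my own, and the sample lines are abbreviated copies of three lines from the trace.

```python
import re

# Abbreviated copies of three lines from the trace above.
SAMPLE_LINES = [
    "remote.host.80 > firewall.26167: P, cksum 0x9fce (incorrect (-> 0x9fcd), 278:1658(1380) ack 357 win 32768",
    "remote.host.80 > firewall.26167: P, cksum 0x766d (incorrect (-> 0x766c), 1658:3038(1380) ack 357 win 32768",
    "remote.host.80 > firewall.26167: ., cksum 0x6120 (incorrect (-> 0x611f), 3038:4418(1380) ack 357 win 32768",
]

# Matches "cksum 0xAAAA (incorrect (-> 0xBBBB)" as printed by tcpdump -v.
BAD_CKSUM = re.compile(r"cksum 0x([0-9a-f]+) \(incorrect \(-> 0x([0-9a-f]+)\)")


def checksum_deltas(lines):
    """Return (seen - expected) for every line flagged as incorrect."""
    deltas = []
    for line in lines:
        match = BAD_CKSUM.search(line)
        if match:
            seen = int(match.group(1), 16)
            expected = int(match.group(2), 16)
            deltas.append(seen - expected)
    return deltas


print(checksum_deltas(SAMPLE_LINES))  # prints [1, 1, 1]
```

Every flagged packet carries a checksum exactly one higher than the value tcpdump computed itself.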
So my first guess was that the remote host or some component along the path had a bug in the protocol stack and was putting in incorrectly calculated checksums. But OTOH it's hard to imagine that we would be the first to discover such a bug.
When taking the firewall out of the mix the connections worked fine and all packets were ok. This is strange, as the TCP checksum is calculated based on a single TCP packet with some additional information from the IP layer. So as to the correctness and verification of the checksum, it is irrelevant what has been sent in the other direction before and whether such previous packets have passed the firewall or not.
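For reference, this is roughly how the TCP checksum over IPv4 is computed: a 16-bit ones' complement sum over a pseudo-header (source address, destination address, protocol number, TCP length) followed by the TCP header and payload. The Python below is only a sketch to illustrate that the inputs all come from the one packet being checked. The function names are my own, and it has nothing to do with how the ASA or the remote host actually implement it.

```python
import socket
import struct


def ones_complement_sum(data: bytes) -> int:
    """Add 16-bit big-endian words, folding carries back in (end-around carry)."""
    if len(data) % 2:
        data += b"\x00"  # odd-length input is padded with one zero byte
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def pseudo_header(src_ip: str, dst_ip: str, tcp_length: int) -> bytes:
    """The IP-layer information that is mixed into the TCP checksum."""
    return (
        socket.inet_aton(src_ip)
        + socket.inet_aton(dst_ip)
        + struct.pack("!BBH", 0, socket.IPPROTO_TCP, tcp_length)
    )


def tcp_checksum(src_ip: str, dst_ip: str, segment: bytes) -> int:
    """Checksum for a segment whose checksum field (bytes 16-17) is zero."""
    data = pseudo_header(src_ip, dst_ip, len(segment)) + segment
    return ~ones_complement_sum(data) & 0xFFFF


def tcp_checksum_ok(src_ip: str, dst_ip: str, segment: bytes) -> bool:
    """A segment with a valid checksum in place sums to 0xFFFF."""
    data = pseudo_header(src_ip, dst_ip, len(segment)) + segment
    return ones_complement_sum(data) == 0xFFFF
```

Note that the source and destination addresses are part of the sum, so a NAT device has to adjust the checksum whenever it rewrites an address. Nothing from earlier packets in the connection enters the calculation.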
Now one important observation was that the incorrect checksums only appeared in packets with a total size of 1420 bytes. This is the maximum possible size one can observe in this connection, because an MSS of 1380 is negotiated during the TCP handshake and 40 bytes for the IP and TCP headers come on top of that.
Using a test host placed outside (before) the firewall would result in working connections with an MSS of 1460, which is the Ethernet MTU of 1500 minus the 40 bytes. Setting the MTU to 1420 on that host yielded an MSS of 1380, but the returning packets were correctly checksummed in this case.
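The size arithmetic from the last two paragraphs, written out as a tiny Python sketch. It assumes IPv4 and TCP headers of 20 bytes each, i.e. no IP or TCP options on the data packets, which matches the 40 bytes of overhead seen in the trace. The names are my own.

```python
IPV4_HEADER_BYTES = 20  # assumes no IP options
TCP_HEADER_BYTES = 20   # assumes no TCP options on data segments


def mss_for_mtu(mtu: int) -> int:
    """Largest TCP payload that fits into one IP packet of the given MTU."""
    return mtu - IPV4_HEADER_BYTES - TCP_HEADER_BYTES


def max_packet_size(mss: int) -> int:
    """Largest IP packet a connection with this MSS will produce."""
    return mss + IPV4_HEADER_BYTES + TCP_HEADER_BYTES


print(mss_for_mtu(1500))      # 1460, plain Ethernet
print(mss_for_mtu(1420))      # 1380, the test host with a reduced MTU
print(max_packet_size(1380))  # 1420, the size of the broken packets
```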
So although the MSS alone did not seem to be a sufficient condition to trigger the problem, I started to investigate why the firewall would use a value of 1380 where 1460 should be possible. We found a setting called "force maximum segment size for TCP proxy connections" in the Cisco admin interface that was set to 1380. We unset that option and voilà, everything worked with the MSS going up to 1460.
What rankles me is that although the symptoms are cured, I still haven't found the cause for the wrong checksums. I don't see why setting the MSS to a lower value than necessary should be a problem at all (performance issues aside). And, having the checksums always off-by-one smells like another bug somewhere along the connection's path. I googled a lot for this one, but the only relevant results were in a discussion in the context of SMTP (but that's TCP after all) over there and there.
Hopefully, this article will save someone some headaches. If you have any ideas regarding the cause of this problem, I'd be glad to hear from you. Feel free to leave a comment!






Comments
The checksum errors could be a result of your network card having TCP offload features.
TCP offload is a feature where the network card takes care of calculating the checksum for packets it sends. When TCP offload is enabled on a host, a network trace of packets sent on this host will show wrong checksums (often just "0") for all packets, as the checksum won't be in place until the packet actually hits the wire.
When observing traffic between two other hosts - as I did in this case - it doesn't matter whether one of them uses TCP offload.
Hi Matthias,
I am having a similar issue, with a Cisco ASA 5505 terminating a site-to-site VPN. The ASA replaced a PIX. Since the upgrade if you SSH to a server behind the ASA, through the encrypted tunnel, and issue a command such as dmesg, the command output breaks. The connection stays in the ASA's connection table until eventually it times out.
The packet capture is showing:
No. Time Source Destination Protocol Info
6957 606.635514 172.17.48.31 192.168.183.25 TCP 59493 > ssh [SYN] Seq=0 Win=49640 Len=0 MSS=1460 WS=0
6958 606.641434 192.168.183.25 172.17.48.31 TCP ssh > 59493 [SYN, ACK] Seq=0 Ack=1 Win=5840 Len=0 MSS=1380 WS=6
6959 606.641483 172.17.48.31 192.168.183.25 TCP 59493 > ssh [ACK] Seq=1 Ack=1 Win=49680 Len=0
Packet 6959 has a bad IP header checksum.
I've disabled the "force mss" option, but have to wait until Monday to confirm if this changed anything. I have a TAC case opened with Cisco, and will see what they come up with. I'll keep you posted.
Vlad
So do you see the same error symptoms here with the checksum being always off by one? The packet capture you posted does not contain enough detail to tell.
No, the checksum was not off just by one. It was 0x0000, which seems to indicate offloading on the server side. It's clearly not the same issue, since the problem was not corrected by disabling the "force mss" feature.
I did not include the entire capture because I didn't mean to take too much space here. I have a different example of problems with this new ASA, that does not involve the checksum error. This is from an FTP transfer, and the capture was taken on the server. Please note how packet 45 from the client has a size of 2760 bytes. The summary does not show it, but it also has the DF bit set. Then, you see 5 retransmissions of smaller packets, and then it stops.
No. Time Source Destination Protocol Info
40 13:56:15.525749 172.17.48.20 192.168.183.25 TCP ftp-data > 43852 [SYN] Seq=0 Win=49640 Len=0 MSS=1380 WS=0
41 13:56:15.525775 192.168.183.25 172.17.48.20 TCP 43852 > ftp-data [SYN, ACK] Seq=0 Ack=1 Win=5840 Len=0 MSS=1460 WS=6
42 13:56:15.531871 172.17.48.20 192.168.183.25 TCP ftp-data > 43852 [ACK] Seq=1 Ack=1 Win=49680 Len=0
43 13:56:15.532159 172.17.48.20 192.168.183.25 FTP Response: 150 Opening BINARY mode data connection for testupload.txt.
44 13:56:15.575477 192.168.183.25 172.17.48.20 TCP 46079 > ftp [ACK] Seq=90 Ack=1472 Win=8384 Len=0 TSV=672847 TSER=649342848
45 13:56:15.579004 192.168.183.25 172.17.48.20 FTP-DATA FTP Data: 2760 bytes
46 13:56:18.579498 192.168.183.25 172.17.48.20 FTP-DATA [TCP Retransmission] FTP Data: 1380 bytes
47 13:56:24.575481 192.168.183.25 172.17.48.20 FTP-DATA [TCP Retransmission] FTP Data: 1380 bytes
48 13:56:36.575497 192.168.183.25 172.17.48.20 FTP-DATA [TCP Retransmission] FTP Data: 1380 bytes
49 13:57:00.575498 192.168.183.25 172.17.48.20 FTP-DATA [TCP Retransmission] FTP Data: 1380 bytes
50 13:57:48.575496 192.168.183.25 172.17.48.20 FTP-DATA [TCP Retransmission] FTP Data: 1380 bytes
51 13:59:24.575502 192.168.183.25 172.17.48.20 FTP-DATA [TCP Retransmission] FTP Data: 1380 bytes
52 19:00:00.000000 Ethernet [Malformed Packet]
Now, the corresponding transfer as seen by the ASA inside interface:
No. Time Source Destination Protocol Info
39 09:51:44.594848 192.168.183.25 172.17.48.20 FTP Request: STOR testupload.txt
40 09:51:44.617033 172.17.48.20 192.168.183.25 TCP ftp-data > 43852 [SYN] Seq=0 Win=49640 Len=0 MSS=1380 WS=0
41 09:51:44.617353 192.168.183.25 172.17.48.20 TCP 43852 > ftp-data [SYN, ACK] Seq=0 Ack=1 Win=5840 Len=0 MSS=1460 WS=6
42 09:51:44.623151 172.17.48.20 192.168.183.25 TCP ftp-data > 43852 [ACK] Seq=1 Ack=1 Win=49680 Len=0
43 09:51:44.623441 172.17.48.20 192.168.183.25 FTP Response: 150 Opening BINARY mode data connection for testupload.txt.
44 09:51:44.666972 192.168.183.25 172.17.48.20 TCP 46079 > ftp [ACK] Seq=90 Ack=1472 Win=8384 Len=0 TSV=672847 TSER=649342848
45 09:51:44.671107 192.168.183.25 172.17.48.20 FTP-DATA FTP Data: 1380 bytes
46 09:51:44.671260 192.168.183.25 172.17.48.20 FTP-DATA FTP Data: 1380 bytes
47 09:51:47.670940 192.168.183.25 172.17.48.20 FTP-DATA [TCP Retransmission] FTP Data: 1380 bytes
48 09:51:53.665660 192.168.183.25 172.17.48.20 FTP-DATA [TCP Retransmission] FTP Data: 1380 bytes
49 09:52:05.663188 192.168.183.25 172.17.48.20 FTP-DATA [TCP Retransmission] FTP Data: 1380 bytes
50 09:52:29.658062 192.168.183.25 172.17.48.20 FTP-DATA [TCP Retransmission] FTP Data: 1380 bytes
51 09:53:17.647885 192.168.183.25 172.17.48.20 FTP-DATA [TCP Retransmission] FTP Data: 1380 bytes
If you are interested in this puzzle, I can certainly email you the detailed captures. At the same time I would not want to "spam" your entry with comments that are not necessarily related to the original topic.
Cheers!
Vlad
What was the fix for the FTP issue? I found your blog based on a similar issue I am seeing on my network.
TIA
Hello,
Cisco TAC was unable to tell for sure if this behaviour was because of a bug on the ASA, or because of something else in the middle. However, the workaround was to lower the TCP MSS on the ASA to 1300 bytes.
Thanks,
Vlad