WHO IS TO BLAME FOR THIS OUTRAGE? (Network error: XXX)


This is a very basic set of troubleshooting steps for remote TCP services, including a couple of tests to determine the important thing in these situations...

"WHO IS TO BLAME FOR THIS OUTRAGE?"

So it's not working.

What is not working? Just some random TCP service on the network: MySQL on port 3306, MS SQL on 1433, Tomcat on 8080, Apache on 80. You can check the IANA database of registered port assignments for 57333 million other examples.

(In fact this has run a bit long, so this is just Part 1: the common error messages. The payoff, i.e. who to blame, has been pushed back to Part 2, or 3...)



Caveat - there are a whole bunch of reasons why your network thing might not be working such as authentication failure or protocol mismatch, but these steps only cover the initial TCP handshake. After that you are on your own...

So we have all seen these various connection errors, and bashed the keyboard in furious anger:

Network error: Connection timed out
Network error: Connection refused
Network error: No gateway
Network error: No such route
connect: Network is unreachable

 
The reason you have seen these is that they are the types of errors returned when a TCP connection fails to open, and it's likely that you are reading this because you are a sysadmin or a developer.

(Obviously some of this applies to UDP and other protocols, but I'm talking TCP here, and IPv4 at that.)
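If you want to see where those strings come from, here is a rough Python sketch that prints the text your local C library attaches to the errno values I can map with confidence to the messages above. The list name CONNECT_ERRNOS and the function name are my own inventions for illustration, and the exact wording will vary between operating systems.

```python
import errno
import os

# errno names that commonly come back from a failed connect().
# The names are standard; the message text is supplied by the local
# C library, so the wording differs between platforms.
CONNECT_ERRNOS = ["ETIMEDOUT", "ECONNREFUSED", "EHOSTUNREACH", "ENETUNREACH"]


def describe_connect_errors():
    """Return a dict mapping each errno name to its local message text."""
    return {name: os.strerror(getattr(errno, name)) for name in CONNECT_ERRNOS}


if __name__ == "__main__":
    for name, text in describe_connect_errors().items():
        print(f"{name:13} {text}")
```

Whatever tool you use to connect is, in most cases, just printing one of these strings on your behalf.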

So how do we find out who to send that grotty email to?

1) The first step: use an IP address, not a hostname.

We are trying to test the network, not some random DNS installation or one of the 6 different IP addresses that your "working" hostname resolves to;

$ dig @ns1.google.com www.google.com +short
www.l.google.com.
173.194.67.104
173.194.67.105
173.194.67.99
173.194.67.147
173.194.67.103
173.194.67.106
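If dig is not to hand, the same check can be sketched in Python with the standard resolver. The function name resolve_ipv4 is mine, and this asks whatever resolver your machine is configured with rather than a specific name server like the dig example does.

```python
import socket


def resolve_ipv4(host):
    """Return the sorted, de-duplicated IPv4 addresses a name resolves to."""
    infos = socket.getaddrinfo(
        host, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for IPv4 the sockaddr is an (address, port) pair.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    for address in resolve_ipv4("www.google.com"):
        print(address)
```

Pick one address from the list and stick with it for every test that follows.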

2) Ping the host (see the appendix, "The problem with ping", below if you get zilch back);
$ ping -c1 173.194.67.104
PING 173.194.67.104 (173.194.67.104) 56(84) bytes of data.
64 bytes from 173.194.67.104: icmp_req=1 ttl=50 time=21.2 ms

3) Establish the "error message"

Use telnet, netcat, ssh or whatever you have available. The underlying message is returned by the networking stack, so it doesn't really matter which tool.
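To make the point that the message comes from the stack and not the tool, here is a minimal Python sketch of the same test. The function name probe and its 5 second default timeout are my own choices, not anything standard. Note that if our own timeout fires first you get a Python-level timeout rather than the kernel's "Connection timed out", so the sketch reports that case separately.

```python
import errno
import os
import socket


def probe(ip, port, timeout=5.0):
    """Attempt one TCP connection and return (ok, message)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect((ip, port))
        return True, f"Connected to {ip}."
    except OSError as exc:
        if exc.errno is None:
            # Python's own socket timeout carries no errno: we gave up
            # before the kernel did.
            return False, f"no answer within {timeout}s (our timeout, not the kernel's)"
        # Otherwise report exactly what the network stack told us.
        name = errno.errorcode.get(exc.errno, "unknown")
        return False, f"{name}: {os.strerror(exc.errno)}"
    finally:
        sock.close()


if __name__ == "__main__":
    print(probe("123.123.123.22", 20000))
```

The errno name in the failure message is the thing to hang on to, since it is the same whichever tool reports it.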

Obviously we are looking for that glory of glories "Connected to xxx.yyy.zzz.aaa"

# telnet 123.123.123.22 20000
Trying 123.123.123.22...
Connected to 123.123.123.22.
Escape character is '^]'.

But if you are reading this, it's more likely you are seeing something other than that.

So let's step through some of the various possibilities and examine the messages.


The simple case is trying to connect to a local box that is either powered off, or you have the wrong address etc.

Trying to connect to a LAN address, and the server is "missing"

$ telnet 192.168.1.77
Trying 192.168.1.77...
telnet: connect to address 192.168.1.77: No route to host

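So what's going on there?

"No route to host" (EHOSTUNREACH) means that nothing answered for that address. On your own LAN segment that is usually your own machine giving up after its ARP requests went unanswered; for an address beyond your subnet it is an ICMP "host unreachable" message from a router that has no record of the host and nowhere to forward the packet.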


Think of it like a postman with a letter for number 77 Acacia Drive, but there are only 12 houses in the road, and he is standing there outside number 12.

Trying to connect to a WAN address, and the server is missing...?

This is where it gets tricky, because the default CentOS (and presumably RHEL 6) iptables package provides a policy that is configured to send "reject-with icmp-host-prohibited" messages, which look mightily similar to the "missing server" message above. For example, an iptables rule like this;


[root@server-64664 ~]# iptables --list -v -n
....
   89  4352 REJECT     all  --  *      *       0.0.0.0/0            0.0.0.0/0           reject-with icmp-host-prohibited
 



results in the same error message as for the missing local server;

$ telnet 123.123.123.22 80
Trying 123.123.123.22...
telnet: connect to address 123.123.123.22: No route to host



Network error: Connection timed out







However, if you are using Ubuntu or some other distro set up with a default DROP policy for iptables, then you will see a different message in an otherwise identical situation, like so;


# iptables --list -n -v | grep INPUT
Chain INPUT (policy DROP 751K packets, 46M bytes)








When you try to connect to something blocked by the firewall, you will instead see this;


$ telnet 123.123.33.101 8080
Trying 123.123.33.101...
telnet: connect to address 123.123.33.101: Connection timed out


The "Connection timed out" message is returned to you by your network stack after it has sent the initial SYN, retransmitted it "tcp_syn_retries" times, and still heard nothing back from the remote host.

So basically "Connection timed out" indicates that some device between you and the destination has discarded your request like so much garbage. How very rude!
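How long you sit there waiting depends on tcp_syn_retries. As a back-of-the-envelope sketch in Python, assuming a 1 second initial retransmission timeout that doubles after every unanswered SYN (both are my assumptions about a typical current Linux kernel, and I ignore any upper cap on the interval), the total wait works out like this. The function name is made up for illustration.

```python
def syn_timeout_seconds(syn_retries, initial_rto=1.0):
    """Estimate how long connect() waits before 'Connection timed out'.

    Assumes one initial SYN plus `syn_retries` retransmissions, with the
    wait after each SYN doubling from `initial_rto` seconds (no cap).
    """
    attempts = syn_retries + 1  # the initial SYN plus the retransmissions
    return sum(initial_rto * 2 ** i for i in range(attempts))


if __name__ == "__main__":
    for retries in (3, 5, 6):
        print(retries, "retries ->", syn_timeout_seconds(retries), "seconds")
```

With 6 retries that comes to 127 seconds, which is why a dropped connection feels like it hangs forever while a rejected one fails almost at once.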

@todo - Connection refused, and any other obvious messages...













Appendix - The problem with ping

First ping the remote host. However, just because you can't ping it doesn't mean it's down. Some network administrators block ping for security reasons.

I suspect it might have been a response to the popularity of nmap, which would ping first and then run a port scan on the hosts that answered. Blocking ICMP, and hence ping, would send a wide enumeration scan skipping on to the next network, providing some respite from port scanners via security through obscurity.






TCP flow diagram with appropriate timeouts and state changes;
http://drupal.star.bnl.gov/STAR/blog-entry/jeromel/2009/feb/18/tcp-parameters-linux-kernel