...making Linux just a little more fun!
By David Dorgan
A quick and dirty guide to debugging TCP/IP
This is a small guide I wrote on debugging TCP/IP networks. It assumes you are using Linux or another Unix-like OS.
So it's 5pm on a Friday and a user says he/she cannot connect to $some-web-site. What do you do? There are a few paths, and internal and external sites will often require a different approach.
Some things won't change. Think of it this way: if layer 1 (physical) isn't working, nothing on top of it is going to work. It's a good idea to use a number of tools, like telnet, ping, traceroute, tracepath etc., to see where the problem is and, if you can, what layer it's on. Say you can ping and traceroute fine to a host, and you can get to port 25 but not port 80; then the chances are it's just their webserver dying. Say you can ping your local gateway but can't get past it; it could be that it's not forwarding, that it's a firewall configuration issue, etc.
Try to telnet to thesite.com on port 80 to see if you can connect. If you get connection refused and you don't have any firewalls or proxies blocking outbound communication, the chances are they are having some sort of service outage. If it just sits there for a while and then says 'cannot connect', continue.
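If you'd rather script the telnet test, here is a minimal Python sketch of the same idea. The function name probe_tcp and the result strings are my own inventions for illustration, and the 5 second timeout is an arbitrary assumption. It only separates the outcomes described above: connected, refused straight away, or hung until the timeout.

```python
import socket


def probe_tcp(host, port, timeout=5.0):
    """Try one TCP connect and say how it ended.

    Returns 'open', 'refused', 'timeout', or 'error: <reason>'.
    (Hypothetical helper; the names and strings are illustrative only.)
    """
    try:
        # create_connection resolves the name and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The far end answered with a reset: host is up, nothing is listening
        return "refused"
    except socket.timeout:
        # No answer at all: packets are probably being dropped somewhere
        return "timeout"
    except OSError as exc:
        # Anything else, e.g. name resolution failure or no route to host
        return f"error: {exc}"
```

Calling probe_tcp("thesite.com", 80) would stand in for the telnet above: 'refused' points at the service, 'timeout' means keep going with the steps that follow.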
Try to ping the remote side. Although this is very useful, some places do block ICMP (they must be under the impression that while creating TCP/IP, for the first six days most of the work was done, and they had nothing to do on Sunday, so they invented ICMP as a joke). So you could try a traceroute; once again this could be blocked, but it'll generally give you a picture of where the problem is. At this point, look and see where it stops. You should see your local gateway and maybe some internal routers or gateway devices, and after this you should see it going through ISP networks. If you don't see your ISP anywhere and it stops within the first few hops, it could be a problem with the link to your ISP. If it shows your ISP and then shows an outage just after your ISP's name, maybe the ISP has lost the link to one of its upstreams and the old paths are still valid, so packets are being stopped there. Finally, you may see it go all the way to the other side and finish completely, or stop on somecompany-gw.customer.isp.net, in which case the other side may be blocking inbound ICMP.
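To make "see where it stops" concrete, here is a rough Python sketch that reads saved traceroute output and reports the last hop that answered. It assumes the usual Linux traceroute layout (hop number, then a name or address, then timings, with '* * *' for silent hops). The function name last_responding_hop is hypothetical.

```python
import re

# A hop line starts with the hop number; the header line does not.
HOP_RE = re.compile(r"^\s*(\d+)\s+(.*)$")


def last_responding_hop(traceroute_output):
    """Return (hop_number, name_or_address) of the last hop that replied,
    or None if no hop replied at all. Illustrative helper only."""
    last = None
    for line in traceroute_output.splitlines():
        match = HOP_RE.match(line)
        if not match:
            continue  # header or blank line
        tokens = match.group(2).split()
        if not tokens or all(tok == "*" for tok in tokens):
            continue  # every probe for this hop timed out
        # First token that is not '*' is the hop's name (or bare address)
        name = next(tok for tok in tokens if tok != "*")
        last = (int(match.group(1)), name)
    return last
```

If the name that comes back is still one of your own internal routers, suspect the link to your ISP; if it is something like somecompany-gw.customer.isp.net, the far side may simply be filtering.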
If they can't seem to connect to anything, ask them to try with an IP address. If this works, get them to check their DNS settings.
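A quick way to check the DNS side on its own is to ask the resolver directly. This is only a sketch using Python's standard socket module; the function name resolve is made up for the example. An empty list for a name you know exists, while connections by IP address work, points at the DNS settings.

```python
import socket


def resolve(hostname):
    """Return the sorted addresses the system resolver gives for hostname,
    or an empty list if the lookup fails. Illustrative helper only."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        # Resolver could not answer: bad DNS settings, dead server, or no such name
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address
    return sorted({info[4][0] for info in infos})
```

Passing a literal IP address just returns that address without touching DNS, which makes it a handy sanity check that the function itself is working.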
Some common traits of certain events.
A service dying but the network being fine: You can ping and traceroute fine, and you can connect to other open ports. Say the machine does mail and remote access: if you can get to port 22 (ssh) fine but not port 25 (smtp), there could be a problem with the MTA only.
A firewall blocking a port: This doesn't work so well when routers or firewalls have a 'deny all' by default, but most people don't do that. Let's say you can't get to port 25 on this machine; when you telnet, it just times out, but you know it's a mailserver. Try to telnet to a few random high-numbered ports, like 8274 or 9274, and see if you get connection refused. If you do, the chances are a firewall is blocking just port 25 for your IP, because the machine itself responded to tell you that you couldn't get on those other ports.
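The reasoning in that last case is easy to get backwards, so here it is written down as a small Python sketch. It assumes you have already noted each connection attempt as one of the strings 'open', 'refused' or 'timeout'; those labels, the verdict strings and the function name interpret_probes are all my own invention.

```python
def interpret_probes(target_result, other_results):
    """Guess what is wrong from the result on the port you care about
    (target_result) and results on a few random ports (other_results,
    a dict of port -> 'open' | 'refused' | 'timeout').
    Illustrative helper only; the verdict strings are arbitrary."""
    if target_result == "open":
        return "reachable"
    if target_result == "refused":
        # The host itself said no: network is fine, service is not running
        return "service-down"
    # From here on the target port timed out.
    if any(result == "refused" for result in other_results.values()):
        # The host answers on other ports, so it is up and reachable;
        # something is silently dropping traffic to the target port only.
        return "filtered"
    # Everything timed out: host down, link dead, or a 'deny all' firewall
    return "inconclusive"
```

For the mailserver example, interpret_probes("timeout", {8274: "refused", 9274: "refused"}) gives "filtered". With a 'deny all' firewall in the way the high ports time out as well and you get "inconclusive", which is the caveat mentioned at the start of that case.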
The link to your ISP is dead: Try to traceroute to anywhere and you will see that the last hop that doesn't time out is an internal one from your company, and that you never see anything like link-whatever.isp.net.
The link is gone on the other side: In this case, maybe you know they don't block traceroute packets, so when you traceroute, it goes through your ISP, to a carrier, to their ISP, and then stops on what is normally their ISP's link.
Your ISP is having a routing issue: This often happens with some providers *ahem*. I have accounts on a few machines based in physically different locations and using different providers, so if I can't get to a resource I want, I'll try to traceroute to it from a machine hanging straight off LINX and from a few just off some US ISPs. If they all seem to work from there, and a traceroute from you shows timeouts or unusually high latency in, say, london.isp.net, then their links to London may be overused or having issues.
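If you are comparing several traceroutes by eye, a small script can pull out the slow hops for you. This Python sketch assumes Linux-style traceroute output with timings written as 'N ms'; the 150 ms threshold is an arbitrary figure I picked, and the function name slow_hops is hypothetical. Treat the result as a hint only, since routers often answer traceroute probes more slowly than they forward traffic.

```python
import re

# Matches a round-trip time such as '12.345 ms'
RTT_RE = re.compile(r"(\d+(?:\.\d+)?)\s+ms\b")


def slow_hops(traceroute_output, threshold_ms=150.0):
    """Return [(hop_number, name, average_rtt_ms), ...] for hops whose
    average round-trip time is above threshold_ms. Illustrative only."""
    slow = []
    for line in traceroute_output.splitlines():
        parts = line.split()
        if len(parts) < 2 or not parts[0].isdigit():
            continue  # not a hop line
        rtts = [float(value) for value in RTT_RE.findall(line)]
        if not rtts:
            continue  # hop gave no timings at all ('* * *')
        average = sum(rtts) / len(rtts)
        if average > threshold_ms:
            # First token after the hop number that is not '*' is the name
            name = next(tok for tok in parts[1:] if tok != "*")
            slow.append((int(parts[0]), name, round(average, 1)))
    return slow
```

Run it over the output from your own machine and from each outside machine; a hop that is slow only from your side is the one worth raising with the ISP.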
It's taking about two minutes to log in to a machine: When you do log in, type w or who and check whether it says you are coming from an IP address or a hostname. It probably won't show a hostname; basically, the login is waiting that long on DNS. The machine should use an internal DNS server that will reply quickly, or else you should set up reverse lookups for the IP addresses involved.
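You can measure the delay directly by timing a reverse lookup from the slow machine. A minimal Python sketch, assuming nothing beyond the standard library; time_reverse_lookup is a made-up name.

```python
import socket
import time


def time_reverse_lookup(ip_address):
    """Do a reverse (address to name) lookup and time it.

    Returns (hostname_or_None, seconds_taken). Illustrative helper only."""
    start = time.monotonic()
    try:
        hostname = socket.gethostbyaddr(ip_address)[0]
    except OSError:
        # socket.herror / socket.gaierror: no reverse record, or no answer
        hostname = None
    return hostname, time.monotonic() - start
```

Feed it the address you are logging in from. If it comes back with no name after many seconds, that is most likely the same wait your login is sitting through.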
Somebody is complaining they can't connect to a service: You try a manual connection from an outside host and it doesn't work. Then go onto the machine and try to telnet to the port, or do a netstat -an | grep LISTEN and look for the port number it should be listening on. If it is there, it could be filtered somewhere along the path, or even at the local host. If it isn't listening, then running fstat or lsof and grepping for the process name may show an IPv4 or internet entry, with the IP address and port it's actually listening on.
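For what it's worth, the netstat check can be scripted too. This Python sketch takes the text of netstat -an and lists the local addresses listening on a given port. It assumes the local address is the fourth column and ends in ':port' (Linux) or '.port' (BSD style); listening_addresses is a hypothetical name.

```python
import re

# Local address looks like '0.0.0.0:22', ':::80' or (BSD style) '*.22'
LOCAL_RE = re.compile(r"^(.*)[.:](\d+)$")


def listening_addresses(netstat_output, port):
    """Return the local addresses with a LISTEN socket on the given port.
    An empty list means nothing is listening there. Illustrative only."""
    found = []
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) < 4 or "LISTEN" not in fields:
            continue  # header line or a socket in some other state
        match = LOCAL_RE.match(fields[3])
        if match and int(match.group(2)) == port:
            found.append(match.group(1))
    return found
```

One thing worth looking for in the result: a service that is listening on 127.0.0.1 only will work when you telnet to it from the machine itself and fail from everywhere else.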
$Id: debugging-tcpip.html,v 1.5 2003/09/18 18:33:47 davidd Exp $
David has been a very productive writer and plans to contribute more of his
work in the future.