Tuesday, April 6, 2010

Troubleshooting Lessons Learned

I’ve told this story many times; it has long been one of my go-to anecdotes for job interviews and for chatting with fellow geeks. It takes place early in my career, about two years out of college, a time when I was still a little wet behind the ears and probably a little more cocky than I should have been. I was a Network Administrator for a small company and had recently enabled Internet email on a Microsoft Exchange Server. Prior to this, Exchange was only used for sending internal messages. It was a very simple network: a Cisco PIX 520 firewall and a Cisco router with an Internet T1.

The problem I faced was that random users would complain that emails they sent were not being received. I noticed that the messages were being held in the outbound message queue. I asked to be copied on some of these messages and did receive them internally; however, when they were sent to my external email address they were held up in the queue. I also noticed that these messages had attachments. At this point I opened a trouble ticket with Microsoft to work on the issue.

We discovered that when Exchange was configured for MIME encoding the message would get hung up, and when set to UUencoding it would work fine. We made the change to UUencoding and left it, figuring we had solved the issue with a workaround. A month or so later I discovered that messages would still randomly get hung up. It happened less frequently than with MIME encoding, but the problem did persist. So I went back to Microsoft for additional troubleshooting. At this point Microsoft had me perform some sniffing to see where the problem was. They found that something was missing from the network side of things and suggested a firewall issue. This didn’t sound right to me, since port 25 was open for outbound traffic and the PIX doesn’t allow filtering at the application layer (attachments). Still, I opened a case with Cisco. Maybe it was a bug or something else strange.

Cisco had me go a little further with the sniffing, and I monitored traffic both inside and outside the firewall for comparison. They found the same information internally and externally and thus told me that I was missing traffic from the Internet. At this point I was a little perplexed, so I decided to build an Exchange server outside my firewall for testing. Within DNS I created a bogus zone to send messages to and configured this domain on the test Exchange server. I then sent mail from the production Exchange server to the test server and, lo and behold, it worked fine. By now I had also discovered one more unusual piece of information: with MIME encoding enabled, the only attachments that had an issue were Microsoft Office documents. I could send PDF or JPG files without issue.

So the challenge now was to convince the ISP that they were blocking MIME encoded Microsoft Office documents while no other Internet traffic was impacted. I opened a ticket with them and described my troubleshooting process to the Engineer. They did not believe they could be the cause, and although the evidence pointed in their direction I was a little concerned that I had missed something along the way. Their only idea was that it was some form of virus, so they had me build a laptop with a CD installation of Windows and Outlook Express. This system was never attached to my internal network, to avoid infection. I used a crossover cable to connect it directly to the Internet router and used Outlook Express to send email using both MIME and UU encoding. Sure enough, UU encoding worked and MIME encoding did not. Still unconvinced and determined to prove me wrong, the Engineer and his boss drove from Virginia (I think; I just remember it was far) to NJ with a router and laptop to test with. When they arrived in the office they connected to our T1 and experienced the same issue. I must say that this took all of 15 minutes, much less time than it took them to drive up. Surprised but convinced, they contacted the LEC and initiated a call with them to check the circuit.

When the LEC engineer arrived he attached his T-Bird (T-BERD) to run test patterns on the circuit. Every single test passed until he ran a 1’s and 8’s test from the CO. This last test failed, so he tried to re-punch the demarc. At this point the circuit completely failed, and he could not get it back up reusing the original pair of wires between the multiplexer and the demarc. He simply replaced the pair of wires with a free pair and reran his tests successfully. Once the line came up I also tested sending MIME encoded Office documents through the Exchange server successfully.
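That only one test pattern failed hints that the fault may have been sensitive to particular bit patterns, which is one plausible reason the choice of encoding mattered. What follows is only a minimal sketch of that idea, not a reconstruction of what actually happened on the circuit. It encodes the same bytes with base64 (the usual MIME transfer encoding for binary attachments) and with uuencode, then compares what would go on the wire. The zero-filled `sample` payload and the `ones_density` helper are my own illustrative assumptions, not something measured at the time.

```python
import binascii


def encode_both(payload: bytes) -> tuple[bytes, bytes]:
    """Return (base64_line, uuencoded_line) for one line's worth of payload."""
    chunk = payload[:45]  # uuencode carries at most 45 bytes per line
    return binascii.b2a_base64(chunk), binascii.b2a_uu(chunk)


def ones_density(data: bytes) -> float:
    """Fraction of bits set to 1 in data (hypothetical helper for comparison)."""
    total_bits = len(data) * 8
    if total_bits == 0:
        return 0.0
    return sum(bin(byte).count("1") for byte in data) / total_bits


# Hypothetical payload: a run of zero bytes, chosen only for illustration.
sample = bytes(45)
b64_line, uu_line = encode_both(sample)

print("base64  :", b64_line)
print("uuencode:", uu_line)
print("ones density:", ones_density(b64_line), "vs", ones_density(uu_line))
```

Identical input produces two quite different byte streams, so a fault that only trips on certain bit sequences could in principle pass one encoding and choke on the other.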

A number of months later I was discussing a different issue with an engineer at the same ISP. During our conversation he mentioned that the issue we were working on was strange. Of course I mentioned the email issue as being significantly stranger. When I started the story he exclaimed, “You’re the email guy?” They had found this problem so intriguing that they incorporated it into their customer service training. The key point being that sometimes the customer is right.

Think about the probability of an issue with a pair of wires filtering MIME encoded Microsoft Office documents sent via SMTP, with no other noticeable issues, and I hope you will come to the same conclusion as I did. At some point, when you’ve exhausted all logical explanations, the illogical becomes possible and even probable.

I hope you’ve enjoyed my little trip down troubleshooting lane. As always, comments and feedback are welcome.
