I'm not actually exaggerating. Windows updates rolled through our network on Wednesday and hit one of a pair of our development servers. The second server, which had recently been spun up and was just starting to be configured, had unknowingly been set up in such a way that it depended on the first system. Once that unit went offline for the typical reboot required after most Windows updates, the second server went crazy trying to find it. Email after email shot out of it as it retried again and again to contact the first unit. Over 6,000 emails later, the MIS team was able to catch it and silence the noise.
So what's the problem? Well, information overload is clearly an issue. It's great to get one email when there's a problem, or perhaps one every 5 minutes, but computers can retry operations so quickly that sorting through that kind of flow becomes a nightmare.
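To make the "one email every 5 minutes" idea concrete, here is a minimal sketch of a throttle that could sit in front of whatever sends the notification. It is not what our applications do today; the class name `AlertThrottle`, its methods, and the alert key are all made up for illustration.

```python
import time


class AlertThrottle:
    """Allow at most one alert per key in each interval and count the rest.

    Hypothetical helper: names and defaults are assumptions, not existing code.
    """

    def __init__(self, interval_seconds=300.0, clock=time.monotonic):
        self.interval = interval_seconds  # 300 seconds = 5 minutes
        self.clock = clock                # injectable so it can be tested
        self.last_sent = {}               # key -> time of the last alert sent
        self.suppressed = {}              # key -> alerts held back since then

    def should_send(self, key):
        """Return True if an alert for this key may go out right now."""
        now = self.clock()
        last = self.last_sent.get(key)
        if last is None or now - last >= self.interval:
            self.last_sent[key] = now
            return True
        # Too soon: remember that we swallowed one instead of emailing it.
        self.suppressed[key] = self.suppressed.get(key, 0) + 1
        return False

    def take_suppressed(self, key):
        """Return and reset the number of alerts held back for this key."""
        return self.suppressed.pop(key, 0)
```

The retry loop would call `should_send("dev-server-unreachable")` before each email and skip the send when it returns False. The next email that does go out can include `take_suppressed(...)` so the reader knows how many failures happened in between.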
Secondly, a problem upstream creates another one downstream. Our email server certainly isn't scaled to handle that level of traffic. We only have so much CPU and storage capacity on the system, and other messages end up getting lost (in the worst case) or taking a back seat (at the very least) while the spasm of email continues.
What to do? Well, we can limit who gets the email. Formerly, it went to an entire department. Narrowing the list means fewer people are hit by the notifications, but it still creates downstream problems for others. We can also look at filtering what gets sent by developing triggers based on logged activity rather than application activity. This would be a significant architectural shift for our applications, but it is probably something we're going to have to do in order to deal with these issues better over the long term.
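As a rough sketch of what a trigger based on logged activity might look like: the application only writes to its log, and a separate job reads the log and sends one summary per host. The log format (`<timestamp> <host> <message>`), the marker text, and the function name `summarize_failures` are assumptions for illustration, not our actual setup.

```python
from collections import Counter


def summarize_failures(log_lines, marker="connection failed", threshold=1):
    """Collapse repeated failure lines into one summary line per host.

    Assumes a hypothetical log format of "<timestamp> <host> <message>".
    """
    counts = Counter()
    for line in log_lines:
        if marker in line:
            parts = line.split(maxsplit=2)
            # Fall back to "unknown" if the line doesn't match the format.
            host = parts[1] if len(parts) >= 3 else "unknown"
            counts[host] += 1
    return [
        f"{host}: {n} occurrences of '{marker}'"
        for host, n in sorted(counts.items())
        if n >= threshold
    ]
```

Run on a schedule, something like this would turn thousands of retry failures into a single line in a single email, and raising `threshold` would keep one-off blips from generating any mail at all.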
Life is pretty interesting sometimes.