Dec. 29, 2015 - Post Mortem for 12-28-15 Mail Delay Event

8:38 AM Pacific

We found a server with a bad clock chip. It caused the timestamps on various files to be written incorrectly, which interfered not only with mail flow but also with the failover of that mail to other servers. It also kept the issue from showing up in our regular monitors.


Our engineers removed the server from service, manually moved the mail to a different cluster, removed the mistimed timestamps from the mail headers, and re-processed all of the mail. It was all delivered by around 8 PM last night.

Nothing was lost, but a portion of mail was delayed: about 5-10% of total mail volume overall, though of course that number varies by customer.

The server will be repaired or replaced. We are adding new checks to our monitoring software to catch abnormal time jumps, and we are going to rework some of the failover code so that it does not rely on file timestamps. These changes should be completed within the next few days.
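
For readers curious what a check for abnormal time jumps could look like, here is a minimal sketch in Python. It is an illustration only, not the check we are deploying. It assumes the host's monotonic clock is still trustworthy when the wall clock misbehaves, which may not hold for every hardware fault, so comparing against an external time source is another option. The names make_jump_detector and max_drift_seconds are made up for this example.

```python
import time
from typing import Callable, Optional


def make_jump_detector(
    max_drift_seconds: float = 2.0,
    wall: Callable[[], float] = time.time,
    mono: Callable[[], float] = time.monotonic,
) -> Callable[[], Optional[float]]:
    """Return a check() function that reports wall-clock jumps.

    Hypothetical helper for illustration. check() compares how far the
    wall clock moved since the previous call with how much time elapsed
    on the monotonic clock. It returns the difference in seconds when
    that difference exceeds max_drift_seconds, and None otherwise.
    """
    last_wall = wall()
    last_mono = mono()

    def check() -> Optional[float]:
        nonlocal last_wall, last_mono
        now_wall, now_mono = wall(), mono()
        # Wall-clock movement minus elapsed monotonic time.
        # Positive: the wall clock jumped forward. Negative: backward.
        drift = (now_wall - last_wall) - (now_mono - last_mono)
        last_wall, last_mono = now_wall, now_mono
        return drift if abs(drift) > max_drift_seconds else None

    return check
```

A monitor could call check() on a fixed interval and raise an alert whenever it returns a value. The clock functions are passed in as parameters so the check can be exercised with simulated clocks.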

We have never had this kind of issue before, and we do not expect it to happen again in this form. If it does, we will be able to catch and fix it automatically before it affects anything at all.


2 Comments

  • Daniel Penley

    Anyone know how to subscribe to these as they are released?  The last one said to do so, but lacked any link or instructions.

  • John

    Wow, a bad clock chip... so random, how do you plan for that? And on Christmas Eve. 

    Well, they say lightning doesn't strike twice (OK, scientifically that's completely not true), so you should be safe from gremlin attacks for the new year! And a Happy New Year to all your staff!
