8:38 AM Pacific
We found a server with a bad clock chip. It caused the timestamps on various files to be written incorrectly, which interfered with both the mail flow and the failover of that mail to other servers. It also kept the issue from showing up in our regular monitors.
Our engineers removed the server from service, manually moved the mail to a different cluster, removed the mistimed mail header timestamps, and re-processed all the mail. It was all delivered by around 8pm last night.
Nothing was lost, but a portion of mail was delayed. Overall, it was about 5-10% of total mail volume, though of course that number varies by customer.
The server will be repaired or replaced. We are adding new checks to our monitoring software to catch abnormal time jumps, and we are going to rework some of the failover code so that it does not rely on file timestamps. These changes should be completed within the next few days.
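For readers curious what a time-jump check can look like, here is a minimal sketch in Python. It is an illustration only, not our actual monitoring code: the function names, the 2-second tolerance, and the 10-second sampling interval are all assumptions made for the example. The idea is to compare how far the wall clock moved between two samples with how far a monotonic clock moved over the same period; a large disagreement suggests the wall clock jumped. A faulty clock chip could in principle disturb both clocks, so a real check would likely also compare against an external time source.

```python
import time


def clock_jump_seconds(prev_wall, prev_mono, wall, mono):
    """Return how much the wall clock moved beyond the monotonic clock.

    All arguments are timestamps in seconds. A result near zero means the
    two clocks agree about how much time passed between the samples.
    """
    return (wall - prev_wall) - (mono - prev_mono)


def is_abnormal_jump(prev_wall, prev_mono, wall, mono, tolerance=2.0):
    """True if the two clocks disagree by more than `tolerance` seconds.

    The 2-second default is an assumed value, not a recommendation.
    """
    return abs(clock_jump_seconds(prev_wall, prev_mono, wall, mono)) > tolerance


def watch_clock(interval=10.0, tolerance=2.0, samples=3):
    """Sample both clocks `samples` times and collect any abnormal jumps."""
    jumps = []
    prev_wall, prev_mono = time.time(), time.monotonic()
    for _ in range(samples):
        time.sleep(interval)
        wall, mono = time.time(), time.monotonic()
        if is_abnormal_jump(prev_wall, prev_mono, wall, mono, tolerance):
            jumps.append(clock_jump_seconds(prev_wall, prev_mono, wall, mono))
        prev_wall, prev_mono = wall, mono
    return jumps
```

A monitor built along these lines would call something like `watch_clock()` on each server and raise an alert whenever the returned list is not empty.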
We have never had this kind of issue before, and we do not expect it to happen again in this form. If it does, the new checks should let us catch and fix the issue automatically before it affects mail delivery.