On Thursday morning, one of our replica database clusters failed over from master to backup. This was most likely due to an unexpected hardware failure. Twenty-seven minutes later, the backup database cluster also had an issue and failed over again.
Initial calculations put the estimate at 10 hours. One of our team members threw up, literally.
While the first resync was still progressing, two new methods were coded, tested to the extent possible under the time constraints we had, and started in parallel. Both of them were indeed faster, and one of them saved us a number of precious hours in the end. The process still took a long, long time. The short version is that the full team worked continuously for 34 hours, overcoming many difficulties, to bring the system back online as quickly as possible.
The real helper was Mr. McAfee, who posted an obviously fake image about us being hacked. Everyone pitched in to help defend us. He united the community for us and rallied such support at a time when we needed it the most. Sometimes, things that look negative are actually positive. Looking at his previous posts, I now think he was completely innocent, and was just asking the wrong (or right, depending on how you look at it) questions. I can understand that he hadn’t read my article about FUD, which explains how innocent questions are the best carriers of FUD. Regardless, my thanks go to Mr. McAfee. I will buy him a drink if we ever meet.