Search Offline
Incident Report for Sifter
Postmortem

Lately, we've had too many incidents of search downtime. These incidents resulted from an upgrade to our search functionality (Sphinx and Thinking Sphinx, if you're curious) and the adjustments we needed to make afterward. We apologize, and we're working hard to ensure that these problems are behind us.

What happened?

We released the updated search code on April 13th.

Search first went offline on April 14th. It took some troubleshooting and a couple of hours to get a fix deployed. At the time, we didn't think to update our status page. Since then, we've added search as a component on our status page, and we now record any incidents of search downtime there.

The second incident was on April 22nd and was resolved in under an hour.

Today, we had our third period of search downtime. Our monitoring system failed to alert us, so search appears to have been offline for most of the morning and afternoon. Once we were alerted to the problem, we tracked down the cause and fixed it in a matter of minutes.

Why did this happen?

As with any upgrade like this, there were several moving parts and new configuration settings to update. Alongside the updated search functionality, we also changed our release process, both to work with the new search code and to make it more resilient.

Like most release processes, ours keeps several previous versions of the application on the servers in case we need to quickly roll back after a major problem with a release. With each release, we remove the oldest version so that we don't accumulate too many old releases on the server. All of these versions of the application share some folders, one of which holds key files for our search functionality.

Unfortunately, the combination of the new release process and the updates to search meant there were new settings that we needed to configure explicitly. Once things were corrected, everything appeared to be working, but in reality search still wasn't sharing all of the necessary files, because we had updated only a couple of the several new settings that we needed to configure.

After the fix for each incident, things worked fine until we had deployed enough new versions to delete one of the folders that search was still using. When that happened, the search process couldn't start back up after the deploy because key files were missing. We were able to track down the missing files and update the configuration accordingly.
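As a rough illustration of the failure mode described above, here is a minimal Python sketch. It is not our actual deploy tooling: the directory layout (a releases folder next to a shared folder), the function names, and the `keep` parameter are all assumptions made for the example. It shows how a file that lives inside a release folder, rather than in a shared one, survives for a few deploys and then disappears when that release is pruned.

```python
import os
import shutil


def releases_oldest_first(releases_dir):
    """Return release folder names sorted oldest first.

    Assumes (for this sketch) that folder names sort chronologically,
    e.g. timestamps such as 20140413.
    """
    return sorted(
        name for name in os.listdir(releases_dir)
        if os.path.isdir(os.path.join(releases_dir, name))
    )


def files_inside_releases(releases_dir, required_files):
    """Return the required files whose real location is inside a release folder.

    Anything returned here will be deleted once its release is pruned,
    so it should probably live in a shared folder instead.
    """
    root = os.path.realpath(releases_dir) + os.sep
    return [p for p in required_files if os.path.realpath(p).startswith(root)]


def prune_releases(releases_dir, keep):
    """Delete all but the newest `keep` releases and return the removed names."""
    names = releases_oldest_first(releases_dir)
    doomed = names[:-keep] if keep > 0 else names
    for name in doomed:
        shutil.rmtree(os.path.join(releases_dir, name))
    return doomed
```

Running something like `files_inside_releases` against the files the search process needs, before any pruning happens, would flag the misconfiguration on the first deploy instead of several deploys later.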

What are we doing to fix it?

We've now taken a couple of steps to ensure that our updated search functionality is correctly configured for our environment. We've reviewed every setting and option available to us to make sure that nothing that needs to be explicitly configured has been missed. We're also exhaustively testing our deploy process and search configuration.

The final problem, which turned what should have been a 5-minute hiccup into a 5-hour outage, was that our monitoring failed to alert us, so we took too long to respond. Email filtering rules were preventing the notifications from being seen. We've updated those settings and verified that the notifications are getting through.
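For readers setting up something similar, a basic reachability check for a search daemon can be sketched as follows. This is an illustrative Python example and not the monitoring we actually run; the function name, the host, and the port are assumptions, and a check like this only helps if its alert reaches someone.

```python
import socket


def search_is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to the search daemon can be opened.

    This only proves that something is listening on the port; it does not
    prove that queries succeed.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

A scheduled job could call `search_is_reachable("127.0.0.1", 9312)` (the host and port here are placeholders for your own configuration) and send a notification through a channel that is not subject to the same email filters.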

Again, we sincerely apologize. We expect more and take Sifter's reliability very seriously. We're constantly working to make sure that incidents like this don't happen. Of course, if you have any questions or concerns, or need any help configuring Sphinx or Thinking Sphinx for your own app, please don't hesitate to contact us.

Posted Apr 29, 2014 - 18:32 CDT

Resolved
Search is back up and running. Please let us know if you have any further problems.
Posted Apr 29, 2014 - 14:55 CDT
Identified
We're having some problems with search, but we've identified the problem and are working to fix it. We'll write up a full post-mortem once everything is back on track.
Posted Apr 29, 2014 - 14:50 CDT