Service Incident and Updates

Dec 18, 2014

11:07 PST: Some users may have experienced a temporary interruption in accessing their data starting around 11:07 Pacific Time.

13:30 PST: Service has been fully restored using our failover processes.

23:26 PST: We are about to implement a few additional changes to streamline our return to multiple layers of redundancy – unfortunately, that will mean that all users will have limited access for approximately the next 3 hours. We will post throughout. This is a result of today’s earlier infrastructure issues – we’d rather have a short period of inaccessibility in order to more quickly return to full capacity.

Dec 19, 2014

00:25 PST: We are 1 hour in and still on target to restore full access in another 2 hours. We will keep you posted. Thanks for your patience.

01:20 PST: ...and we’re back! Herd, thanks again for your patience. We’ve completed the initial work and re-enabled full access!

Summary of Events (non-technical)

We experienced some issues with a core service instance yesterday and noticed that the underlying hardware was underperforming. Our operational plan is designed to deal with such contingencies (hey – we’re a backup company, so we’ve got some serious backup plans). We keep redundant systems available and up to date to enable a nearly seamless failover, even when a critical component isn’t functioning. This design allows us to restore service in a very short time, but we also like to double-check that things are running smoothly. We executed our failover, and after confirming all data was sound, we restored service by 13:30 PST. This completed the first step of our contingency plan.

Having employed one of our redundant systems, we were now operating with one fewer safety mechanism than normal. Our next step is always to focus on immediately re-integrating a new system to add that extra layer of protection back in, but Murphy’s Law is always in effect. The automated process was moving too slowly for our taste, so our team continued to work late into the night to create an additional copy of the affected component. To exercise the maximum amount of caution during this window, our team chose to limit access to the service for a short period between 23:26 PST and 01:20 PST while snapshots and copies completed. The new redundancy layer was created and service returned to normal by 01:20 PST.

All data is intact, and no data was lost or corrupted during this incident. We apologize again for any inconvenience and want to assure you that our engineering team is continuing to improve our processes and infrastructure to minimize, and ideally eliminate, future incidents like this.
