Incident Report – September 18, 2007

In our effort to provide full transparency to our users, this is the first in an ongoing series of posts discussing the technology supporting the ElephantDrive service, our product development roadmap, and any issues that could affect or have affected the user experience. While we will try to keep our descriptions accessible to the average user, we’ll provide as much technical detail as is relevant.

Unfortunately, our first post concerns the last category: an issue that has affected our users’ experience. Last night at around 11:35 PM PST, a cluster of web service machines supporting authentication and file transfer went down, leaving over 10% of our users unable to upload, download, or authenticate. We found the problem at approximately 6:16 AM, removed the cluster from the production service, and fixed the problem. As of 10:50 AM PST, all server clusters were again operational. During this time, we were unable to accept new backup data from affected users, but no backed-up data was lost.

To provide maximum uptime for our service, we have designed an architecture that tolerates numerous failures within our network infrastructure without affecting the user experience. Last night, a group of requests causing errors cascaded through one group of servers and caused a shutdown of each web server in that cluster. We have identified the vulnerability that these requests were exploiting and have fixed it. While this event has a signature very similar to a Distributed Denial-of-Service attack, at this point we cannot confirm that it was one.
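To illustrate the general idea of an architecture that keeps serving users when one cluster fails, here is a minimal sketch. It is not ElephantDrive's actual implementation: the `Cluster` and `Router` names, the failure threshold, and the round-robin policy are all assumptions made for the example. The point is only that a cluster which fails repeatedly is taken out of rotation automatically, so requests go to the remaining healthy clusters.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    # Hypothetical stand-in for a group of web service machines.
    name: str
    consecutive_failures: int = 0
    in_rotation: bool = True


class Router:
    """Round-robin router that ejects a cluster after repeated failures."""

    def __init__(self, clusters, max_failures=3):
        self.clusters = list(clusters)
        self.max_failures = max_failures  # assumed threshold, not from the post
        self._next = 0

    def record(self, cluster, ok):
        """Record the outcome of one request or health probe."""
        if ok:
            cluster.consecutive_failures = 0
            return
        cluster.consecutive_failures += 1
        if cluster.consecutive_failures >= self.max_failures:
            # Remove the cluster from production traffic until it is fixed.
            cluster.in_rotation = False

    def pick(self):
        """Return the next healthy cluster, skipping ejected ones."""
        healthy = [c for c in self.clusters if c.in_rotation]
        if not healthy:
            raise RuntimeError("no healthy clusters available")
        cluster = healthy[self._next % len(healthy)]
        self._next += 1
        return cluster
```

In a sketch like this, putting a repaired cluster back into service would be a deliberate manual step (setting `in_rotation` back to `True`), which mirrors the "remove, fix, restore" sequence described above.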

We welcome your comments on how this outage has affected you.

4 thoughts on “Incident Report – September 18, 2007”

  1. It’s nice to see that the team at Elephant Drive is making an attempt to keep an open dialog with their user base. In my humble opinion this marks a very important aspect of a truly agile business model: one that demonstrates taking responsibility for errors and asking users how to improve in the future. Thanks Elephant Drive!

    As a paying customer of the service, I didn’t personally see any effects on my end at all. I’ve uploaded nearly 6GB so far, albeit via the desktop client mostly, without a hitch. (For some reason I lose the network connection when transferring files using the Trunk Drive.)

    On a side note, would it be possible at all to have an Elephant Drive forum where users could get together and discuss issues like this one, and maybe have a way for us to better interact with the company? I for one would like to see Elephant Drive stick around for as long as possible, and would like to help in any way that I can.

    Thanks again for the incredible service that you provide!

  2. @Kory: Thanks for the feedback. There are new versions of ElephantDesktop and TrunkDrive that are currently in QA and will be released within the next 2 weeks. Both are going to provide better speed and reliability. We’re constantly developing to optimize the software on both fronts…

    Also – your idea for an open forum for our users is an excellent one. While we’re committed to continuing the one-on-one dialogs, a community round-table that puts feature requests and general questions out into the open can only assist us in refining the product roadmap and having relevant conversations with users. Expect to see something soon!

  3. I’m impressed with the transparency of this incident report. Perhaps this is a reflection of a healthy work and management philosophy at Elephant? I hope so.
    I am a potential new customer for Elephant. This incident is worrisome because my biggest fear is theft or sabotage of my (potential) Elephant data stores (mostly our family digital photo album). So the news that you are receiving effective attacks on your system is troubling.
    I am also concerned that an incident like this will impact my initial 32GB upload, causing me to restart the upload. I have not yet researched the details of your upload client. I am hopeful that your client can resume failed uploads.
    I do have a question about your systems monitoring. It seems unusual to have file transfer functionality on one cluster down for 7 hours before you are aware of it. What gives?
    However, your clear communication and receptiveness to customer feedback as evidenced in this incident report are signs of corporate intelligence.

  4. @Chicago Dave:

    You are asking some intelligent and thoughtful questions that may be on the minds of many users, so we wanted to address them publicly.

    First, with regard to the perceived attack: Further analysis has revealed that although the signature resembled a classic distributed denial-of-service attack, the actual cause was an unexpected use case of a beta version of our software.

    Second, with regard to upload: The ElephantDesktop backup software has built-in intelligence to resume failed (or paused or stopped) uploads.

    Third, with regard to our systems monitoring: The delay before we reacted to the problem was unusually long. We expect strange behavior from our beta products and consequently limit their interaction to isolated nodes of our production architecture. We mitigate these unforeseen scenarios with systems designed to react quickly in the event of problems. In this regard, we had a notable failure and are already working to implement changes to avoid this in the future. We would like to stress, however, that while a small number of users were unable to effect transfers during the incident, no user data was lost or compromised.

    We appreciate your questions and your feedback and hope these answers help address the concerns.
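For readers curious how resuming a failed upload can work in principle, here is a minimal sketch. It does not describe the ElephantDesktop client itself: the `FakeServer` class, the `upload` function, the tiny chunk size, and the `fail_after` parameter are all hypothetical, invented only to show the idea. The client asks the server how many bytes it already holds and continues from that offset instead of starting over.

```python
import io

CHUNK_SIZE = 4  # deliberately tiny so the example is easy to follow


class FakeServer:
    """Hypothetical in-memory stand-in for a backup server."""

    def __init__(self):
        self.received = bytearray()

    def offset(self):
        # Number of bytes the server has safely stored so far.
        return len(self.received)

    def append(self, offset, chunk):
        # Reject chunks that do not continue exactly where the data ends.
        if offset != len(self.received):
            raise ValueError("offset mismatch")
        self.received.extend(chunk)


def upload(source, server, fail_after=None):
    """Send `source` to `server`, resuming from what the server already has.

    `fail_after` simulates a network drop after that many chunks.
    """
    source.seek(server.offset())  # skip bytes that were already uploaded
    sent = 0
    while True:
        chunk = source.read(CHUNK_SIZE)
        if not chunk:
            return True  # nothing left to send
        if fail_after is not None and sent >= fail_after:
            raise ConnectionError("simulated network drop")
        server.append(server.offset(), chunk)
        sent += 1
```

Calling `upload` a second time with the same file and server, after a simulated drop, sends only the remaining bytes, so an interrupted transfer of a large backup would not need to restart from zero.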
