As part of our ongoing effort to provide full transparency to our users, this is the first in a series of posts discussing the technology supporting the ElephantDrive service, our product development roadmap, and any issues that could affect or have affected the user experience. While we will keep our descriptions accessible to the average user, we’ll provide as much technical detail as is relevant.
Unfortunately, our first post falls into the last category: an issue that affected the user experience. Last night at around 11:35 PM PST, a cluster of web service machines supporting authentication and file transfer went down, leaving over 10% of our users unable to upload, download, or authenticate. We found the problem at approximately 6:16 AM, removed the cluster from the production service, and fixed the problem. As of 10:50 AM PST, all server clusters were again operational. During this time, we were unable to accept new backup data, but no previously backed-up data was lost.
To provide maximum uptime for our service, we have designed an architecture that tolerates numerous failures within our network infrastructure without affecting the user experience. Last night, a group of error-inducing requests cascaded through one group of servers, causing a shutdown of each web server in that cluster. We have identified the vulnerability that these requests were exploiting and have fixed the issue. While this event has a signature very similar to a Distributed Denial-of-Service attack, at this point we cannot confirm that it was one.
We welcome your comments on how this outage affected you.