Downtime 14/08/11



Beer Baron
August 14th, 2011, 08:09
More downtime. This time a hard disk failed and brought the entire cloud down. How this is possible I have no idea. I thought that was the whole point of a cloud: to protect against hardware failure.

Anyone else want to take it all on and deal with the whole shambles of dealing with hosting companies?

bwfhosting
August 15th, 2011, 07:53
Hello

I wanted to take a little time, on behalf of the hosting company, to explain what happened to the server this website is hosted on and to offer our sincere apologies for the outage.

Yesterday at 7.30pm the cluster this server is built on developed a SAN (network storage) fault, causing the drives to become unreachable. The SAN is RAID 6, which tolerates two simultaneous drive failures, so in theory a single drive failure cannot take out an entire array. The SAN is brand new, less than 4 weeks old.

We replaced the drive and started a rebuild of the array. At the same time we needed to rebuild certain partitions on the SAN. Once that was completed, all servers were rebooted and a Linux file system check (fsck) was performed.

Tonight at 9pm the same thing happened. This time we had a different tech on site, and he determined the issue was actually caused by a faulty part on the brand new SAN. That part was replaced from stock and all servers were brought back online.

It is most unfortunate this happened. Whilst a cloud-based cluster removes many points of failure, the SAN is unfortunately not one of them. The cluster should keep a virtual server running if a physical server develops a fault, but the SAN is different. We have access to a similar SAN in our Orlando data centre and it has been working continuously with zero issues for a long time.

Whilst there are no guarantees when it comes to web servers, we are pretty confident we found and solved the root cause of the problem this evening.

Thanks for taking the time to read this and please understand we are truly sorry about this issue.

www.bigwetfish.co.uk

Sciby
August 15th, 2011, 08:37
If the SAN failed, why didn't the server migrate over to a second cluster? The whole point of 'high availability' in a virtual environment is for exactly this kind of situation.

Alphabet
August 15th, 2011, 12:42
Yeah it was down for me for about five or six hours today.

bwfhosting
August 15th, 2011, 20:49
Hello

Thanks for the response. The misconception about cloud hosting is that there can never be any downtime. A cloud-based cluster removes certain points of failure. If a power supply on a hypervisor fails, for example, the virtual machines will hot migrate to a spare hypervisor.

In this case the SAN developed a fault. SANs are usually very resilient, but this one failed at only 4 weeks old. A SAN is off-server storage, so when the SAN fails the servers unfortunately go down with it. Another SAN in our Orlando data centre has had 100% uptime in 12 months, so this is rare.

We have been emailing the site owner regarding this and have sent about 3 update emails in the past 24 hours.

Trust me when I say I am as frustrated as you are with this. We were let down by another supplier and it is disappointing for us that this new cluster we deployed has developed a fault so early in its life. We are doing everything we can to ensure this will not happen again.

Thanks for allowing me to come online and explain our situation. Unfortunately it is our cloud that is affected. We have many other servers that have had zero downtime in months.

If there are any more developments I will update this thread.

Sciby
August 16th, 2011, 08:03
The problem is that "cloud" is a loosely defined term that means different things to different people. BWF's 'Cloud' product description says: "the system gives High Availability and in the event of hardware failure your site will automatically be hot migrated". I'm pretty sure that gives customers the impression that if anything fails, downtime will be minimal because the service fails over.

I don't know what your infrastructure is, nor which hypervisor software you're using, but having one SAN die shouldn't affect services, because surely you would have an extended or second resource cluster (a second physical SAN, etc.) with mirrored customer virtual hosts to provide proper redundancy. If your storage isn't redundant, and your 'high availability' rests only on having multiple physical hosts in the resource pool, then it's really not what people are paying for when they read your product description.

Regardless of 'cloud' stuff, having a single point of failure when your goal is to provide redundancy is just bad planning.
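
To put rough numbers on that, here is a minimal Python sketch of the usual series/parallel availability arithmetic. Everything in it is an assumption for illustration: the availability figures are made up (they are not BWF's), the three-hypervisor pool is hypothetical, and failures are treated as independent, which real hardware from one batch often is not.

```python
from math import prod


def series(*parts):
    """Availability when every part must be up (any one failing is an outage)."""
    return prod(parts)


def parallel(*parts):
    """Availability when at least one of the redundant parts must be up."""
    return 1 - prod(1 - p for p in parts)


# Hypothetical per-component availabilities, chosen only for illustration.
HYPERVISOR = 0.99
SAN = 0.999

# Three interchangeable hypervisors: the redundant part of the cluster.
hypervisor_pool = parallel(HYPERVISOR, HYPERVISOR, HYPERVISOR)

# Everything depends on one SAN, so it sits in series with the pool.
single_san = series(hypervisor_pool, SAN)

# Same pool, but with a second mirrored SAN.
mirrored_san = series(hypervisor_pool, parallel(SAN, SAN))

print(f"hypervisor pool alone:  {hypervisor_pool:.6f}")
print(f"pool + single SAN:      {single_san:.6f}")
print(f"pool + mirrored SANs:   {mirrored_san:.6f}")
```

Under these assumptions the single-SAN setup can never be more available than the SAN itself, no matter how many hypervisors are added to the pool.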

While I'm probably coming across as accusatory, well, it's probably because I am, albeit in a soft way. I know it's not your direct fault, and I certainly can't fault your customer service in coming onto the site and talking to us directly, but if your marketing people are going to write promises that your infrastructure can't keep... just wait until it fails for someone whose entire livelihood relies on your service.

bwfhosting
August 19th, 2011, 03:55
Hello

Sorry I did not update you before now and I do take those comments on board. We have actually removed the 'Cloud' page from our website while we review the wording of our cluster service.

There is hypervisor redundancy, and we have tested the 'hot migrate' feature by manually failing a server: the sites do hot migrate. The SAN by its nature should be much more stable than the simple RAID 1 or RAID 10 arrays in our standard servers. That is why we call it a 'high availability' service. Perhaps we need to review the wording and explain that better.

We discovered a number of 4 week old drives with many errors on them on the SAN. A batch of new drives arrived at the data centre this morning (24 hours late unfortunately) and we have started the replacement. Every drive on the SAN is being replaced by brand new Seagate Enterprise Drives.

As I type this we have just replaced a disk and the RAID rebuild has kicked off. There has been zero client impact.

We are of course replacing the disks showing the most errors first. The last 36 hours have been really stable.

We do not have a second redundant SAN but a SAN by its nature can have multiple disk failures and still function. We do have an identical SAN in Orlando that has had 100% uptime for 12 months.

We offered the site owner a full refund if he chose to move from us but I hope we can move forward.

Again thanks for letting me come on the forum and tell you what we are doing to resolve this issue.

www.bigwetfish.co.uk

Beer Baron
August 19th, 2011, 08:29
Hi BWF,
Sorry, another quick post from me as it's way past my bedtime and I'm up in 5hrs to go to work :)

Thanks for the updates. Also thanks for the offer of a refund. However, as I said elsewhere (directly to you), I'm seeing progress and am happy to wait it out as things now appear to be improving. It's still early days though. Uptime today appeared OK to me, but I wasn't checking it all day.

Please keep me updated, either directly via email or, if you wish, you can continue to give updates to our users here as well. It is the first time I have seen a host come on a customer's site and interact with the community like this. I like it. It's that extra mile that other hosts wouldn't have gone to. Thanks

Scrotty
August 19th, 2011, 10:29
"We have access to a similar SAN in our Orlando data centre and it has been working continuously with zero issues for a long time."

"Another SAN in our Orlando data centre has had 100% uptime in 12 months so this is rare."

"We do have an identical SAN in Orlando that has had 100% uptime for 12 months."

dear big wet fish guy,

thankyou for your updates.

just a quick question - do you have a SAN running in your orlando data center, and if so, what kind of uptime is it recording?

your pal,
socratty

Sciby
August 19th, 2011, 10:56
It's hard to tell, Scrotty, it's so vaguely referenced.


"We do not have a second redundant SAN but a SAN by its nature can have multiple disk failures and still function."

... until a controller or a fibre channel switch dies. ;) Fair cop that a SAN is inherently fault-tolerant to a certain extent, but it just seems a little odd not to have a second SAN to provide external redundancy... but maybe that's an acceptable level of risk for you, especially since SANs aren't cheap, so replicating one costs BWF more than the pain of an outage.
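
As a rough illustration of the controller/switch point, here is a small Python sketch that fails each component in turn and reports which ones take the service down on their own. The layout is entirely hypothetical (I don't know what BWF actually runs), and all component names such as `hv1`, `switch1` and `ctrl1` are invented. Disk-level RAID 6 tolerance inside the array is deliberately not modelled.

```python
def service_up(groups, failed):
    """The service needs at least one working component in every role."""
    return all(
        any(component not in failed for component in members)
        for members in groups.values()
    )


def single_points_of_failure(groups):
    """Return the components whose failure alone brings the service down."""
    components = {c for members in groups.values() for c in members}
    return sorted(c for c in components if not service_up(groups, {c}))


# Hypothetical layout: redundant hypervisors, one storage path.
one_san = {
    "hypervisor": ["hv1", "hv2", "hv3"],
    "fc_switch": ["switch1"],
    "san_controller": ["ctrl1"],
    "disk_array": ["array1"],
}

# Same layout with a second, mirrored storage path (also hypothetical).
two_sans = {
    "hypervisor": ["hv1", "hv2", "hv3"],
    "fc_switch": ["switch1", "switch2"],
    "san_controller": ["ctrl1", "ctrl2"],
    "disk_array": ["array1", "array2"],
}

print("one SAN: ", single_points_of_failure(one_san))
print("two SANs:", single_points_of_failure(two_sans))
```

In this toy model no hypervisor is a single point of failure, but every piece of the lone storage path is; duplicating that path is what clears the list.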