Thread: Downtime 14/08/11

  1. #1

    Downtime 14/08/11

    More downtime. This time a hard disk failed and brought the entire cloud down. How this is possible I have no idea. I thought that was the whole point of a cloud: to protect against hardware failure.

    Anyone else want to take ITIL on and deal with the whole shambles of dealing with hosting companies?
    Last edited by Beer Baron; August 15th, 2011 at 15:07.
    Need a website hosted? ITIL runs on webhosting provided by BigWetFish. They're really great - check them out at bigwetfish.com. Sign up using the link and you'll also be helping ITIL with commission!

  2. #2
    Junior Member
    Join Date
    Aug 2011
    Posts
    3

    Re: Downtime 14/08/11

    Hello

    I wanted to take a little time, on behalf of the hosting company, to explain what happened to the server this website is located on and to offer our sincere apologies for the outage.

    Yesterday at 7.30pm the cluster this server is built on developed a fault in its SAN (storage area network) storage, causing the drives to become unreachable. The SAN is RAID 6, so in theory it is impossible for one drive failure to take out an entire array. The SAN is brand new and is less than 4 weeks old.

    We replaced the drive and started a rebuild of the array. At the same time we needed to rebuild certain partitions on the SAN. Once that was completed, all servers were rebooted and a Linux fsck (file system check) was performed.

    Tonight at 9pm the same thing happened. Tonight we had a different tech on site, and he determined the issue was actually caused by a faulty part on the brand new SAN. This part was replaced with a part from stock and all servers were brought back online.

    It is most unfortunate this happened. Whilst a cloud-based cluster removes many points of failure, the SAN is unfortunately not one of them. The cluster should keep a site up if a server develops a fault, but the SAN is different. We have access to a similar SAN in our Orlando data centre and it has been working continuously with zero issues for a long time.

    Whilst there are no guarantees when it comes to web servers, we are pretty confident we have found and solved the root cause of the problem this evening.

    Thanks for taking the time to read this and please understand we are truly sorry about this issue.

    www.bigwetfish.co.uk
    Last edited by bwfhosting; August 19th, 2011 at 03:59.

  3. #3
    Australian Sciby's Avatar
    Join Date
    Nov 2006
    Location
    Brisbane, Australia.
    Posts
    6,434

    Re: Downtime 14/08/11

    If the SAN failed, why didn't the server migrate over to a second cluster? The whole point of 'high availability' in a virtual environment is for exactly this kind of situation.
    For relaxing times when ITIL explodes, make it Japanistan time. (Actually, don't, it's broken forever)

    Quote Originally Posted by lego
    My hobby is play sex play masturbation hand shikoshikoshiko.

  4. #4
    Bad Guy Alphabet's Avatar
    Join Date
    Jul 2008
    Location
    Doomworld
    Posts
    6,366

    Re: Downtime 14/08/11

    Yeah it was down for me for about five or six hours today.
    if you want proof, try math or alcohol

  5. #5
    Junior Member
    Join Date
    Aug 2011
    Posts
    3

    Re: Downtime 14/08/11

    Hello

    Thanks for the response. The misconception about cloud hosting is that there can never be any downtime. A cloud-based cluster will remove certain points of failure. For example, were a power supply on a hypervisor to fail, the virtual machines would hot migrate to a spare hypervisor.

    In this case the SAN developed a fault. SANs are usually very resilient, but in this case a 4-week-old SAN developed a fault. A SAN is off-server storage, so when the SAN fails the servers will unfortunately go down. Another SAN in our Orlando data centre has had 100% uptime in 12 months so this is rare.

    We have been emailing the site owner regarding this and have sent about 3 update emails in the past 24 hours.

    Trust me when I say I am as frustrated as you are with this. We were let down by another supplier, and it is disappointing for us that this new cluster we deployed has developed a fault so early in its life. We are doing everything we can to ensure this will not happen again.

    Thanks for allowing me to come on here and explain our situation. Unfortunately, it is our cloud that is affected. We have many other servers that have had zero downtime in months.

    If there are any more developments I will update this thread.
    Last edited by bwfhosting; August 15th, 2011 at 20:52.

  6. #6
    Australian Sciby's Avatar
    Join Date
    Nov 2006
    Location
    Brisbane, Australia.
    Posts
    6,434

    Quote Originally Posted by bwfhosting
    Hello

    Thanks for the response. The misconception about cloud hosting is that there can never be any downtime. A cloud-based cluster will remove certain points of failure. For example, were a power supply on a hypervisor to fail, the virtual machines would hot migrate to a spare hypervisor.

    In this case the SAN developed a fault. SANs are usually very resilient, but in this case a 4-week-old SAN developed a fault. A SAN is off-server storage, so when the SAN fails the servers will unfortunately go down. Another SAN in our Orlando data centre has had 100% uptime in 12 months so this is rare.

    The problem is that "cloud" is a very loosely defined term, which can mean many things to many people. BWF's 'Cloud' product description says: "the system gives High Availability and in the event of hardware failure your site will automatically be hot migrated" - I'm pretty sure that gives customers the impression that if anything fails, the downtime will be minimal as it fails over.

    I don't know what your infrastructure is, nor do I know what particular hypervisor software you're using, but having one SAN die shouldn't affect services, because surely you would have an extended or second resource cluster (a second physical SAN, etc.) with mirrored customer virtual hosts to provide proper redundancy. If your storage isn't redundant, and you're basing your 'high availability' only around having multiple physical hosts presented into the resource pool, then it's really not what people are paying for when they see your product description.

    Regardless of 'cloud' stuff, having a single point of failure when your goal is to provide redundancy is just bad planning.
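
    To put rough numbers on the single-point-of-failure argument, here is a minimal Python sketch. It assumes failures are independent, and the availability figures (HOST, SAN) and the three-host layout are made up purely for illustration; they are not measurements of BWF's or anyone else's kit.

```python
def parallel(*availabilities):
    """Redundant components: the group is down only if every member is down."""
    unavailable = 1.0
    for a in availabilities:
        unavailable *= (1.0 - a)
    return 1.0 - unavailable


def series(*availabilities):
    """Chained components: the service is up only if every link is up."""
    total = 1.0
    for a in availabilities:
        total *= a
    return total


# Hypothetical figures, for illustration only.
HOST = 0.99   # availability of one hypervisor host
SAN = 0.999   # availability of one storage array (controller, switches, disks)

hosts = parallel(HOST, HOST, HOST)                 # three hosts with hot migration
single_san = series(hosts, SAN)                    # redundant hosts, one shared SAN
mirrored_san = series(hosts, parallel(SAN, SAN))   # redundant hosts, two SANs

print(f"redundant hosts alone:       {hosts:.6f}")
print(f"hosts + single shared SAN:   {single_san:.6f}")
print(f"hosts + mirrored SAN pair:   {mirrored_san:.6f}")
```

    With these assumed numbers, the stack with one shared SAN can never be more available than the SAN itself, however many hosts sit in front of it; only duplicating the storage lifts that ceiling.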

    While I'm probably coming across as accusatory, well, it's probably because I am, albeit in a soft way - I know it's not your direct fault, and I certainly can't fault your customer service in coming onto the site and talking to us directly, but if your marketing people are going to write promises that your infrastructure can't keep... just wait until it fails for someone whose entire livelihood relies on your service.
    Last edited by Sciby; August 16th, 2011 at 08:05.

  7. #7
    Junior Member
    Join Date
    Aug 2011
    Posts
    3

    Re: Downtime 14/08/11

    Hello

    Sorry I did not update you before now and I do take those comments on board. We have actually removed the 'Cloud' page from our website while we review the wording of our cluster service.

    There is hypervisor redundancy, and we have tested the 'hot migrate' feature by manually failing a server: the sites do hot migrate. The SAN by its nature should be much more stable than the simple RAID 1 or RAID 10 arrays in our standard servers. That is why we call it a 'high availability' service. Perhaps we need to review the wording and explain that better.

    We discovered a number of 4-week-old drives on the SAN with many errors on them. A batch of new drives arrived at the data centre this morning (24 hours late, unfortunately) and we have started the replacement. Every drive on the SAN is being replaced with a brand new Seagate enterprise drive.

    As I type this we have just replaced a disk and the RAID rebuild has kicked off. There has been zero client impact.

    We are of course replacing the disks showing the most errors first. The last 36 hours have been really stable.

    We do not have a second redundant SAN but a SAN by its nature can have multiple disk failures and still function. We do have an identical SAN in Orlando that has had 100% uptime for 12 months.

    We offered the site owner a full refund if he chose to move from us but I hope we can move forward.

    Again thanks for letting me come on the forum and tell you what we are doing to resolve this issue.

    www.bigwetfish.co.uk
    Last edited by bwfhosting; August 19th, 2011 at 04:17.

  8. #8

    Re: Downtime 14/08/11

    Hi BWF,
    Sorry, another quick post from me as it's way past my bedtime and I'm up in 5 hours to go to work.

    Thanks for the updates. Also thanks for the offer of a refund. However, as I said elsewhere (directly to you), I'm seeing progress and am happy to wait it out as things now appear to be improving. It's still early days though. Uptime today appeared OK to me, but I wasn't checking it all day.

    Please keep me updated, either directly via email or, if you wish, you can continue to give updates to our users here as well. It is the first time I have seen a host come on a customer's site and interact with the community like this. I like it. It's that extra mile that other hosts wouldn't have gone. Thanks.

  9. #9
    Global Moderator Scrotty's Avatar
    Join Date
    Aug 2008
    Location
    сводник
    Posts
    4,689

    Quote Originally Posted by bwfhosting
    We have access to a similar SAN in our Orlando data centre and it has been working continuously with zero issues for a long time.
    Quote Originally Posted by bwfhosting
    Another SAN in our Orlando data centre has had 100% uptime in 12 months so this is rare.
    Quote Originally Posted by bwfhosting
    We do have an identical SAN in Orlando that has had 100% uptime for 12 months.

    dear big wet fish guy,

    thankyou for your updates.

    just a quick question - do you have a SAN running in your orlando data center, and if so, what kind of uptime is it recording?

    your pal,
    socratty
    look all i'm saying is if you go here and you die, it's not my fault

  10. #10
    Australian Sciby's Avatar
    Join Date
    Nov 2006
    Location
    Brisbane, Australia.
    Posts
    6,434

    It's hard to tell, Scrotty, it's so vaguely referenced.

    Quote Originally Posted by bwfhosting
    We do not have a second redundant SAN but a SAN by its nature can have multiple disk failures and still function.

    ... until a controller or a fibre channel switch dies. Fair cop that a SAN is inherently fault-tolerant to a certain extent, but it just seems a little odd not to have a second SAN to provide external redundancy... but maybe that's an acceptable risk level for you, especially since SANs aren't cheap, so replicating one costs BWF more than the pain of an outage.
