My day job is in a big Fortune 500 company. We do have systems that have 100% uptime.
- Mainframes. That's what they're designed for. Heck, about 80% of the kernel is error-checking, everything executes through multiple paths, RAM is RAID, etc. Pricetag: millions + a lot of people.
- HP Nonstop. We retired this but it was genuinely 100% available, even while being patched, etc. Of course, it's ridiculously expensive, a unique OS, etc. Pricetag: millions.
- Oracle RAC. Take a node out, patch it, put it back in, all day long. Designed for 0% downtime. Pricetag: DB + software is about $60K per CPU list price. So, fewer millions but still millions.
- And even if you take the DC off line with a meteor strike or something, we'd still be up because we have metro DR, then continental DR on top of that. Pricetag: many millions.
And guess what...not only is this stuff fabulously, ridiculously expensive, it's also fabulously, ridiculously complex. We have a big IT department. More millions!
You won't pay for that with $5/month plans.
Really, getting to 95%-99.9% is not that hard. The vast majority of our systems have failover protection, and failover only takes a few minutes, even when a big database has to recover. Most of our systems are not in the "needs 100%" tier: if an app is down for 5 minutes once every 5 years because it has to fail over, that's not a big deal for most. For some it is.
Going from 99.9 to 99.99 or 99.999 is a huge jump for each 9, in both complexity and cost.
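To put numbers on what each extra 9 means, here's a quick back-of-the-envelope sketch. It only converts an uptime percentage into the downtime it allows per year; the function name is mine, and it assumes a 365.25-day year.

```python
# Downtime budget per year for a given availability percentage.
# Assumption: a year is 365.25 days (525,960 minutes).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(availability_percent):
    """Return how many minutes per year a service may be down."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for pct in (95, 99, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(pct)
    print(f"{pct}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

So 99.9% leaves you roughly 8.8 hours a year, 99.99% about 53 minutes, and 99.999% a bit over 5 minutes. Each 9 cuts your budget by a factor of ten, which is why a failover that takes "a few minutes" stops being good enough somewhere around the fourth or fifth 9.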
I believe it is possible to engineer for 100%, but it's extremely expensive and you can only do it if you control everything from start to finish. Within our datacenter, yes, we can do it. Within our WAN, yes - and it's really expensive to maintain a second datacenter, etc. To someone in China looking at us...no, because we can't control the network in between.
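A rough way to see why the network in between matters: if a user has to get through several things in a row to reach you, the end-to-end availability is roughly the product of each piece, assuming they fail independently (a simplification). The figures below are made up for illustration, not our real numbers.

```python
from math import prod


def serial_availability(components):
    """End-to-end availability of components that must all be up at once.

    Assumes failures are independent, which is a simplification.
    """
    return prod(components)


# Hypothetical availabilities, for illustration only.
chain = {
    "our datacenter": 0.99999,
    "our WAN": 0.9999,
    "public internet path": 0.999,  # the part we don't control
}

end_to_end = serial_availability(chain.values())
print(f"End-to-end availability: {end_to_end:.3%}")
```

With these numbers the result comes out a little under 99.9%, which is worse than the weakest link. No matter how many millions go into the datacenter, the remote user can never see better than the worst piece of the path.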
BTW, a few years ago, an operator panicked at a smoke alert, ripped off the "are you really really sure you want to do this" tape over the emergency power-off switches (to be used if the building is on fire or something) and pushed a button...and the DC went completely dark. There is no way to protect against that, or against some suicidal nut, etc.
Fundamentally, the deeper you get into the nines, the more control over the gear and environment you need to have. I think you can still have an excellent service - either a VPS provider who can move your node around live, or having a failover cluster if you're running your own gear, etc. (though really how many do that?) But there are limits.
BTW, from my perspective, what really is 99.99% uptime? Most hosts say "excluding routine maintenance"...great, but I'm still down.
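Here's a small sketch of the gap between the advertised number and what the customer actually sees. The downtime figures are hypothetical, and I'm assuming a 30-day month.

```python
# Advertised vs. experienced uptime when maintenance is excluded from the SLA.
# Assumption: a 30-day month (43,200 minutes).
MINUTES_PER_MONTH = 30 * 24 * 60


def availability_percent(downtime_minutes, period_minutes=MINUTES_PER_MONTH):
    """Return availability as a percentage over the given period."""
    return 100 * (1 - downtime_minutes / period_minutes)


# Hypothetical figures for one month.
unplanned_minutes = 4   # counted against the SLA
planned_minutes = 60    # "routine maintenance", excluded from the SLA

advertised = availability_percent(unplanned_minutes)
experienced = availability_percent(unplanned_minutes + planned_minutes)

print(f"Advertised:  {advertised:.3f}%")
print(f"Experienced: {experienced:.3f}%")
```

With these numbers the host can claim better than 99.99% for the month while the customer actually got less than 99.9%. One hour of "routine maintenance" is enough to eat a full 9.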
I'm working on a hosting company. I've selected a top VPS provider, and if there is a problem with the server they can move me to another node. Inevitably I'll need some short maintenance windows to upgrade a web server or MySQL or something. I think that's the best I can provide. Anything more would require either a SAN ($$$$$) and cluster, or some kind of heartbeat/DRBD/etc.
Well. That post got carried away