Re: 99.999% Uptime For A Cluster - Real World Comments



Simon
It does sound like your problems are planning related, not technically
related. You need to figure out what your clusters are doing. What are the
CPU cycles, memory usage, access trends, etc.? Were the blue screens simple
overloads?

40 resources? Are we supposed to know what that means? Do you know what that
means? If the 40 are file shares being accessed once a day, you're good. If
the 40 are SQL databases getting hammered with thousands of transactions per
second, you're beat.

Between the Microsoft sites and the sites of the guys in this newsgroup
(Rodney's for one), you should be able to become a cluster expert. I strongly
recommend you set up a cluster in a lab so you can test and learn. I've put
some info on my site on how to do that in VMware (it's free), so all you need
is a workstation and you're learning clustering. Don't try things on your
production network!!

A file share cluster should be able to cruise through a year with less than
32 minutes of downtime, without worry. If not, go back to the drawing board.
Remember, if you're going to do an active/active cluster, you need
**less than** 40% utilization on each node, so that one node can carry both
loads after a failover.
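
To put numbers on the two rules of thumb above, here is a minimal Python
sketch. It assumes a 365-day year, and it assumes load simply adds up when one
node takes over the other's resources (real workloads rarely scale that
cleanly). The function names are made up for illustration and are not part of
any cluster tool.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_budget_minutes(availability_pct: float) -> float:
    """Minutes of downtime per year allowed at a given availability (%)."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


def availability_pct(downtime_minutes: float) -> float:
    """Availability (%) implied by a given number of downtime minutes per year."""
    return 100 * (1 - downtime_minutes / MINUTES_PER_YEAR)


def survives_failover(util_a: float, util_b: float, limit: float = 1.0) -> bool:
    """Two-node active/active check.

    util_a and util_b are each node's normal utilization as a fraction
    (0.40 means 40%). Assumption: after a failover the surviving node
    carries the simple sum of both loads.
    """
    return util_a + util_b <= limit


if __name__ == "__main__":
    print(f"99.999% allows {downtime_budget_minutes(99.999):.2f} min/year")
    print(f"32 min/year is {availability_pct(32):.4f}% availability")
    print("40% + 40% survives failover:", survives_failover(0.40, 0.40))
    print("60% + 50% survives failover:", survives_failover(0.60, 0.50))
```

By this arithmetic, 99.999% allows only about 5.26 minutes a year, and 32
minutes a year works out to roughly 99.994%, so the 32-minute figure sits
between four and five nines. Two nodes at 40% each leave the survivor at 80%
after a failover, which is the headroom the 40% rule buys you.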

--
--------
Hope It Helps!

dw
_______________________________
Don Wilwol
Distributed Application Technologies.
dwilwol(DELETE)@datbusiness.com
www.AtTheDataCenter.com (personal website)
www.skysphere.com (hosting available)
"Simon" <Simon@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote in message
news:EAF35876-7204-483E-88A6-5A89545872ED@xxxxxxxxxxxxxxxx
John
But this is what is happening. One node has over 40 resources, and I don't
know whether that is good or bad. Finding any really good working
documentation outside of Microsoft is extremely hard. Even a book is hard to
come by here in the UK.

Is there a limit to the number of resources a node can host? I do plan to
add another node, but was concerned I may be ADDING to the problem, rather
than relieving it.

And 2 nodes have failed with a BSOD - which I did post about on here before,
but I never really received a satisfactory response, as the message is
extremely vague.

Also John, when a failure occurs, not all resources come back online. Worse
still, the resources move onto another node, then all go offline and attempt
to go back to their host node, only to go down again and revert to the node
they initially attempted to move to! In other words, it's playing ping-pong!

Simon

"John Toner [MVP]" wrote:

Simon,

If you plan out your clusters correctly, you should never get into most of
the situations you describe. You should rarely experience a hardware failure
on both nodes at the same time, though it can surely happen. Cluster has the
solution for this, though...add another node :)

One node should never be in the situation where it dies because "it cannot
take on all the resources." This is clearly a planning issue rather than a
cluster issue.

Regards,
John

"Simon" <Simon@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote in message
news:4EA4153E-16A9-4056-BF90-E4908D7C0400@xxxxxxxxxxxxxxxx
Thanks Rodney
The cluster in question is a file server on 2 nodes. We don't have an SLA for
99.999% uptime - someone said that in a year, you could achieve 32 mins
downtime.

I sort of disagree - because if you have a hardware problem, or both nodes
fail and it's a complete disaster, you're stuffed really. And if one of your
nodes dies because it cannot take on all the resources, you're also stuffed.

Again, it all depends on setup as well I guess. I wish there was a better
book out there!!!!

But I wanted to get a feel from other people who work day in and day out
with clusters. Surely you have come across sites where the cluster is in a
mess, and you were able to sort things out?

I think that is part of my problem - fully understanding the cluster, the
hardware config, why there are SO many resources on each node!

Simon


"Rodney R. Fournier [MVP]" wrote:

Short answer - Yes

Medium answer - Depends. How do you define uptime? Application uptime? Per
server? If you define it per server at the hardware layer, then probably
not. Patches, BIOS/firmware updates, etc. will kill a few 9s.

Long answer - I sure hope so.
What does your SLA define for uptime? You have an SLA, right?
Hopefully you monitor it with user application availability.
Hopefully you have a monitoring system in place that can send out alerts,
both proactive and reactive.
Hopefully you have standard, well-defined maintenance windows, patch
management, virus protection, firewalls, policies, etc.
Hopefully you have a fully trained staff on hand 24x7.
Hopefully you have vendor support and a good working relationship with them
already.
Hopefully you have hardware from the Clustering HCL.
Hopefully your organization from the top down understands and wants to
maintain an HA environment.
Hopefully you have configuration and change management.
Hopefully you have complete and accurate documentation for every component.
Documentation is very important.

If everything you do throughout the entire organization is with HA in mind,
you can indeed achieve 5 9's. I know we do on most of our clusters here (we
have 30 clusters).

Cheers,

Rodney R. Fournier

MVP - Windows Server - Clustering
http://www.nw-america.com - Clustering Website
http://msmvps.com/clustering - Blog
http://www.clusterhelp.com - Cluster Training
ClusterHelp.com is a Microsoft Certified Gold Partner


"Simon" <Simon@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote in message
news:382FFAA7-86D7-4B80-A2D6-08D4CCBDD2FA@xxxxxxxxxxxxxxxx
Hi guys
I just wanted to get an idea if anyone really believes that it's possible to
have a 99.999% uptime for a Win2k3 cluster.

Our cluster has been quite unreliable, and in fact our standalone servers
are well behaved compared to the cluster!

Any comments would be appreciated - it's just to get an overall picture :-)
Greetings
Simon







