Re: Fault Tolerence on SBS2003 Prem.

From: Jeff Middleton [SBS-MVP] (jeff_at_cfisolutions.com)
Date: 05/17/04


Date: Mon, 17 May 2004 09:20:14 -0500

Too often this topic is approached without defining any scale or costs. It
leads to some interesting debate, but not nearly as much useful strategic
information for a practical decision. Fault Tolerance and Disaster Recovery
are important topics, no doubt, but they are topics that need to be mapped
on a scale of cost management, business priorities, and technical resource
realities.

One thing I noticed in your post, Chris, before I started: you are an MS
partner, so you are probably not asking this question for your own
consumption, but rather for a strategic policy or plan for your customers,
right?

There are no absolute answers when it comes to spending other people's
money, or managing other people's risks if you don't bother to find out what
THEY think about it. Therefore, I would recommend a really practical set of
questions you should ask your customers in order for you to design an FT and
DR plan that suits their needs. The key point that is so often missed is
that most really small businesses are not actually going to prefer to pay a
contractor $12000 a year to avoid 1 lost day of work for the business. I'm
not saying that it's never going to be the case, I'm offering that for most
small businesses, spending money on low probability risk protections isn't
necessarily better than accepting the possibility of an unforeseen "snow
day" in which the network is down due to a failure.

Of course, prevent any preventable condition that is easily cost justified,
but going to the extremes of labor-intensive preventative steps isn't the
best answer. There is always a compromise involved in any business
decision, and it's the IT consultant's job to include the owner of the
business in arriving at a suitable answer.

So, let's take a look at a list of questions you could ask the customer:

1. Hypothetically, if a technical failure were likely to occur once in each
year that caused the total loss of function for your business for a period
of 4 hrs, how much would you be willing to spend in costs paid out monthly
to make that time be reduced to 1 hr?
    $200/mo?
    $500/mo?
    $1000/mo?
2. Same question, but what if the cost were a one-time expense when you
bought your new server, essentially something that you could spend on the
server equipment that improved the recovery time from an annual event from
4hrs to 1hr?
    $200/mo?
    $500/mo?
    $1000/mo?
3. Same question as the first one, but this time, what if the risk was a
failure that would cause you to lose an entire day of work, and the
improvement was only reducing the downtime from the full day to a 4hr
downtime?
    $200/mo?
    $500/mo?
    $1000/mo?
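
One way to compare the answers to these three questions is to convert each
one into an implied price per hour of downtime avoided. Here is a rough
Python sketch of that arithmetic. It assumes exactly one incident per year,
as the questions do, and the function name is only illustrative:

```python
def implied_value_per_hour(monthly_spend, hours_before, hours_after,
                           incidents_per_year=1):
    """Annual spend divided by the downtime hours it removes per year."""
    hours_saved = (hours_before - hours_after) * incidents_per_year
    if hours_saved <= 0:
        raise ValueError("the spend must reduce downtime")
    return monthly_spend * 12 / hours_saved

# Question 1: a 4 hr outage reduced to 1 hr, once a year
for monthly in (200, 500, 1000):
    print(monthly, implied_value_per_hour(monthly, 4, 1))
```

An owner who says yes to $200/mo in question 1 is valuing an hour of
downtime at about $800. Question 3 can be run the same way once you agree
with the owner on how many hours a "full day" means for that business.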

Next set of questions, this time on data loss and loss of operations:

1. If your company suffered an unpreventable incident that caused your server
to crash at some point during the day, and recovery of the system required
a choice between a longer downtime to recover that day's data changes and a
shorter downtime without them, how would you prioritize the following:

4 hrs into the day's work, the crash occurs. In order to recover the
first 4 hrs of data changes, you must keep the server down for 4 hrs, or you
can forfeit that technical data recovery process, return to the start of
day condition within an hour, and have your staff reconstruct the data by
re-entry. Would you prefer to miss the rest of the day's work, or give up
the data changes recovery?

Same scenario, but what if the crash implied loss of 1 week of data, and
recovering it required 1 day of technical work?

What is the maximum number of hours or days of work you would feel
comfortable having your company reconstruct by hand if a technical
recovery was cost prohibitive or simply unavailable?

2. List the types of information that your business maintains as data you
expect to keep stored on your server, and assign each a value on a scale of
1 - 5 (5 is critical) for its priority in recovery if the cost of recovery
was excessively high, but unavoidable:

Email
Electronic Faxes
Word, Excel, Powerpoint files
Accounting/Line of business application based data
Technical or creative business records (Autocad files, scanned documents,
graphic design)
Contact lists, electronic calendar schedules
Legal documents
Records with Federal/State/Local requirements to maintain
Files and Data you possess which represent your own customer's
investment/expense to create
Photographs, Digital Video

3. If it were possible to keep your business operating in some partial sense
for a day, 2 days, 3 days or a week, identify from the list you created just
above, how long your business could operate without the ability to update or
use each of those items, but assuming that you would regain use without loss
of the historical information, just the time delay?
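
The answers to questions 2 and 3 can be combined into a simple recovery
order: most critical data first, and among equally critical items, the one
the business can do without for the shortest time. A minimal Python sketch,
where the scores and day counts are made-up answers from an imaginary
customer, not recommendations:

```python
from dataclasses import dataclass

@dataclass
class DataCategory:
    name: str
    priority: int          # 1-5, 5 is critical (question 2)
    tolerable_days: float  # how long the business can run without it (question 3)

def recovery_order(categories):
    """Most critical first; ties broken by the shortest tolerable outage."""
    return sorted(categories, key=lambda c: (-c.priority, c.tolerable_days))

# Hypothetical answers, for illustration only
answers = [
    DataCategory("Email", 4, 1),
    DataCategory("Accounting/Line of business data", 5, 2),
    DataCategory("Photographs, Digital Video", 2, 7),
    DataCategory("Legal documents", 5, 0.5),
]

for category in recovery_order(answers):
    print(category.name)
```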

From these questions, and others like them, you should be able to develop a
profile of the customer's needs that helps you to understand what costs and
tradeoffs for downtime they are willing to choose....if they have the option
to make the choice.

Many small businesses will prefer to take a chance on missing a day of work
if it saves them $5000 per year simply because many small businesses operate
on a basis where a delay of a day in work isn't that expensive to them. Not
many businesses would prefer to miss a day or a week of work, but that's not
the question here. The question for the owner is if you must pay in
advance...forever losing that money invested for a risk with only a low
probability of impact...would you simply pay or would you take your chances?

At some point, this translates back to risk aversion and return on
investment information that helps decide the budgets and expectations this
owner has.
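
As a rough illustration of that tradeoff, the expected yearly loss from an
outage can be put next to the yearly spend meant to prevent it. This is only
a sketch: the $5000 figure is the one used above, while the incident
frequency and the cost of a lost day are hypothetical numbers you would have
to get from the owner.

```python
def expected_annual_loss(incidents_per_year, loss_per_incident):
    """Average yearly cost of the risk if nothing is spent to prevent it."""
    return incidents_per_year * loss_per_incident

def worth_paying(annual_spend, incidents_per_year, loss_per_incident):
    """True if the protection costs less than the loss it is expected to avoid."""
    return annual_spend < expected_annual_loss(incidents_per_year,
                                               loss_per_incident)

# Hypothetical: one lost day every four years, costing the business $10,000
print(expected_annual_loss(0.25, 10_000))
print(worth_paying(5_000, 0.25, 10_000))
```

A pure expected-value comparison like this ignores how risk averse the owner
is, which is exactly why the questions have to be put to the owner rather
than answered for them.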

The IT consultant's job is to then translate the options back to available
technology and strategic planning initiatives.

Clearly, most IT consultants should make some basic decisions going in that
are just part of a baseline assumption if at all possible:
- a UPS on the server
- a regular system and data backup
- hardware that can be maintained by a vendor who will be around in a year
or more
- a reasonably standard installation that can be recreated and repaired

But when it comes to the FT and DR issues, most of the issues are going to
be measured as
- downtime for maintenance
- invested cost of equipment which provides no added value, only FT/DR
functionality
- routine fee costs of preventative maintenance

...that vs.
- cost of critical response
- response time for a loss of operations event
- downtime for recovery
- data loss tradeoffs for technical recovery
- unavoidable data loss due to a "window of time" during which there is no
data protection in-place
- emergency expedite cost for equipment replacement vs. stocking spare
parts.
- unforeseen downtime that an FT/DR plan doesn't address

When you present all this information, if the FT/DR plan calls for taking a
customer from a $3000 server up to needing another $3000 server, plus
another $3000 worth of other hardware and software, plus $12,000/yr in
preventative labor....you might find the owner just doesn't see this as a
great idea to invest in so that you have a "nothing can go wrong" plan which,
in fact, isn't really a fact anyway. You know, if the power goes off, even
if you have that backup generator in the yard to run the server, if you
can't run the workstations, telephones and air conditioning/heat....chances
are the owner is sending the staff home anyway.

If the DR plan calls for rebuilding a server in 1hr by spending $15000/yr in
prep work for that event, it could just be that the owner would rather take a
day to go golfing and let the staff go home while you do a $2000 repair day on
the
server. The owner might even forfeit the previous day's work rather than
paying $3000 for more stuff or services. You don't know if you don't ask.

And in the final level of details, the ones that others posted thoughts on
in this thread, there are many good practical steps you can take to
improve FT/DR, including better hardware to begin with, reliable backup
operations, and strategic DR snapshots with drive images.

As a rule of thumb, and a somewhat arbitrary one, I assume that most of my
customers are able to survive a 4hr downtime, and I start by getting the owner
to validate that or identify why it's not the case. A single server
environment with a contractor as IT support should generally be able to
address a four hour recovery in most situations, and the server should be
designed with that thought in mind. However, if you look at the nightly
backup, you may well realize that if it takes 6 hrs to restore from tape, a 4
hr recovery may be hard to hit, right?
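
That check is easy to run ahead of time for each customer. A small Python
sketch, assuming you have measured the restore throughput of the tape drive
yourself; the 60 GB and 10 GB/hr figures are hypothetical and chosen only to
reproduce the 6 hr restore mentioned above:

```python
def estimate_restore_hours(data_gb, restore_gb_per_hour):
    """Rough restore time from backup size and measured restore throughput."""
    if restore_gb_per_hour <= 0:
        raise ValueError("throughput must be positive")
    return data_gb / restore_gb_per_hour

def fits_recovery_target(restore_hours, target_hours, other_work_hours=0.0):
    """True if the restore plus any other repair work fits inside the target."""
    return restore_hours + other_work_hours <= target_hours

# Hypothetical: 60 GB on tape restoring at 10 GB/hr, against a 4 hr target
restore = estimate_restore_hours(60, 10)
print(restore, fits_recovery_target(restore, target_hours=4))
```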

In this case, having a second server and splitting the roles of the servers
is probably the most likely way to cut the risks in half, or at least, split
the risk by improving survival of some more critical operations. I rarely
find that having a duplicate server sitting cold at a customer office is
more valuable than having that second server operating in a valuable role,
but that's not an absolute situation. In offices where I have 4 or more
servers, I usually do have a strategic plan for switching roles of servers,
or bringing in a suitable alternate package of hardware as needed.

In the long run, the single most valuable skill an IT person can have in a
DR role is experience in rebuilding an installation on different hardware,
and the experience to know how long that will take them given a specific set
of tools. Identify what those tools are for your experience and technical
level, then practice with them. Quote your customer based upon this
experience and set of tools. For instance, it's my baseline preference to
have the following available to me at every customer server site:

- FT drives, preferably RAID5 because a drive failure still isn't an
emergency if there's a hot spare, and a mirror is likely to cause a boot
failure even if the system/data is still preserved
- nightly backups to removable media such as tapes
- UPS on the server
- Server is not used as a workstation or by local logons
- A system partition drive image has been prepared at some point in the last
yr., or at the last major Service Pack update.
- The server is running AV on the local system
- The server is running a backup program that provides job by job logging
history, not just the last job
- I have either at my office or at the customer's location, another computer
which is reasonably suitable to load that disk image as needed.
- I have either at my office or at the customer's location, another drive
suitable to boot that drive image.

The last two items in the list deserve a little more detail.

Many people do not have the knowledge or technical skill to reliably or
consistently implement a server recovery on different server hardware, or
even from a different set of boot drive hardware. For instance, I prepare
all my servers to boot from a drive image either on the native production
boot controller (typically a SCSI RAID) or, in addition, from the onboard
EIDE controller. It's pretty simple to make this happen. Once you finish
installing the server OS, or at any time in the future, if you simply plug
in an EIDE drive and then perform a complete boot cycle and shutdown, you
will probably now be able to restore a drive image of the RAID onto an EIDE
drive in that same computer and boot from the EIDE with no additional steps.
(This assumes that the SCSI subsystem drives are not attached at the time;
otherwise you do need to indicate boot preference for one of the two
bootable subsystems.)

Furthermore, booting a drive image on different server hardware (different
motherboard) isn't really that complicated either. In fact, I think it's a
good idea for an IT consultant to be aware of what similar and dissimilar
server hardware they handle that is compatible to transfer and boot from. As
a general rule, if you have the same boot controller (SCSI card or PCI EIDE
or SATA card), you probably can boot most any dual processor based
motherboard with the same Windows install provided the motherboard is recent
within the last 4 yrs and therefore ACPI compliant. Similarly, a
single CPU P4 is going to boot from a Windows Dual CPU install. Note that if
the motherboard is dual CPU socket, it's not relevant that you have only a
single CPU; it's a Dual CPU motherboard.

The point at which you realize that most of the over-zealous FT/DR plans are
just overkill is when you see the look on a customer's face when you find
any means to get them back up and running quickly, using whatever resources
you have available.....and they are not paying a fortune for those
resources. Most customers really don't care if you use a PIII workstation
running on EIDE as a temporary workaround when their production Dual
Xeon server with RAID5 goes down from a lightning strike, a burst water pipe,
or a fire in the server closet. All that matters is that you can get them
back up
and running on something.

Building a technical nerve center that involves multiple servers adding to
the upgrade costs, the maintenance costs, and the purchase costs isn't
really the right answer for most businesses. The right answer is letting
them have a way to contribute to determining the costs to run their
business, to participate in the risk analysis, and to have an IT contractor
who is both competent technically, and has a sense of business reality as
well.

"Chris White - Stirling" <cw@NOSPAMstirtech.co.uk> wrote in message
news:da6201c43b76$30e6fa80$a401280a@phx.gbl...
> Hi,
>
> I had my questions sort of answered in my last post a few
> days ago. Anyway i wondered what is the best advice for
> fault tolerence on SBS2003 Premium?
>
> Cheers.
>
> Regards
>
> Chris White
> Stirling Technical Engineering Ltd.
> Microsoft Partners
>
>


