Re: Failure recovery for ADAM master



On 13 Mar, 06:00, "Lee Flight" <l...@xxxxxxxxxxxxxxx> wrote:
Hi

So, if I understand this correctly, you have created site links between
(subnetted) sites to enforce this topology?

On DR I think you have covered the bases, although there did not seem to be
any mention of (tape/media) backup for your instances, in particular the
central site instance.
In (1), by "unrecoverable" do you mean finding that your backups are no good?
On (2), the only valid FSMO roles in ADAM are naming master and schema master;
if you totally lose those roles then you will need to seize them before you
can create naming contexts or modify the schema. You should be able to bring
up a replica by pointing at any reachable instance in the configuration set;
have you tried this when the FSMO roles are not reachable? However, you would
need to think about the impact on replication if you lose your hub site:
since you are forcing all replication through that site (server), the NTDS
settings will be tied to the hub site instance GUIDs. Use repadmin to see
what's in place.
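
As a rough aid for checking where the two roles currently live: each role
holder is recorded in the fSMORoleOwner attribute, on the Partitions container
for the naming master and on the Schema container for the schema master, and
the value is the DN of the holder's NTDS Settings object. The Python sketch
below only builds the DNs to look at and picks the server and site out of such
a value; it does not talk to ADAM, so you would read the attribute yourself
with whatever LDAP client you use. The function names, the sample GUID and the
sample server and site names are made up for illustration.

```python
# Illustrative helpers only: function names and sample values are hypothetical.
# The container names and the fSMORoleOwner attribute are the standard ones.


def fsmo_lookup_targets(config_nc):
    """Map each ADAM FSMO role to the object whose fSMORoleOwner names its holder."""
    return {
        "naming master": "CN=Partitions," + config_nc,
        "schema master": "CN=Schema," + config_nc,
    }


def parse_role_owner(ntds_settings_dn):
    """Return (server, site) from an NTDS Settings DN.

    Assumes the usual layout
    CN=NTDS Settings,CN=<server>,CN=Servers,CN=<site>,CN=Sites,...
    and that no RDN value contains an escaped comma.
    """
    parts = [p.strip() for p in ntds_settings_dn.split(",")]
    well_formed = (
        len(parts) >= 5
        and parts[0].lower() == "cn=ntds settings"
        and parts[2].lower() == "cn=servers"
        and parts[4].lower() == "cn=sites"
    )
    if not well_formed:
        raise ValueError("not an NTDS Settings DN: " + ntds_settings_dn)
    server = parts[1].split("=", 1)[1]
    site = parts[3].split("=", 1)[1]
    return server, site


if __name__ == "__main__":
    # Made-up configuration naming context and role owner value.
    config_nc = "CN=Configuration,CN={00000000-0000-0000-0000-000000000000}"
    for role, dn in fsmo_lookup_targets(config_nc).items():
        print(role, "-> read fSMORoleOwner on", dn)
    owner = "CN=NTDS Settings,CN=HUB01$instance1,CN=Servers,CN=Central,CN=Sites," + config_nc
    print(parse_role_owner(owner))
```

If the site that comes back is not your central site, the roles have already
moved (or been seized) somewhere you may not want them.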

On (3) I think most folks would run two instances at the hub site even if
one were only a virtual machine.

I guess you need to spend a few days in your test dev environment to
simulate your failure scenarios.

HTH
Lee Flight

<compurhyt...@xxxxxxxxx> wrote in message

news:333f75ec-1957-4cf1-afdd-8be7978a8858@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx



I'm trying to develop a failure recovery plan for our master ADAM
instance.  We have 170+ machines running ADAM replicating our
partitions with our sites configured in a "hub and spoke" type
fashion.  For us this means that there is one Site that contains the
master instance (I mean 'Site' in AD/ADAM jargon, a logical collection
of instances close together on the network) and then there are some
number of other Sites that contain all the other instances.  The
replication links only run between each individual Site and the master
Site, thus the "hub and spoke" analogy.  This was not our intent but
we must do this because there are different departments that run the
networks at each site and they only want replication traffic from the
central site.

We now need a failure recovery option for the master instance, e.g.,
what happens if it spontaneously combusts? So I have a few questions:

1. If my master is unrecoverable, can I add a new instance to the
configuration set in the absence of the master? If I could somehow do
this then I could just bring up a new instance at the master site and
then assign FSMO roles to it.  But I know my replica installs require
specifying the name of the computer with the master instance (which is
theoretically dead), so I'm assuming adding a new instance without a
master may be problematic.  That leads me to...

2. Is my only option to switch FSMO roles to another instance, bring
up a new central site box initially as a replica and then switch the
FSMO roles back to the new box?

Again, the goal here is for the FSMO roles to stay at the central
site, but the master is the only instance there and now it's dead
(theoretically). And with the replication topology being key, I'm
skittish about even temporarily assigning master roles to a machine not
at the central site, but if that is my only option I can try to swing that.
I suppose option 3 is to have one machine dedicated as a replica at
the central site and switch FSMO roles to that machine, but getting
the customer to accept new hardware at this point may be a problem.

Thanks in advance

Lee, thanks for the response.

We are definitely spending some time getting this right in our test
environment.

1. Yes, all remote sites have been subnetted and site links have been
created accordingly between each remote site and the central site.
This works great (although there are zero good tools to do this in
ADAM; I had to write a bare-bones C# app to get some of the work done).
2. We do backups. I've had problems w/restore from backup for ADAM
which I will describe below.
3. Sometimes I get myopic about these things so let me just make sure
I'm clear: You can bring a replica instance online by pointing it at
any instance in the configuration set, not just the master? It may
seem like a naive question... I've just never tried it (even with 170+
instances in production for 1+ years) :) I will in our lab. If this
works, it actually alleviates a lot of my fears.
4. Impact on replication is exactly what we're concerned about. As
you said there are going to be stale NTDS entries in all the replica
instances if the hub server goes away. I guess it boils down to this:
If I bring another instance online after failure of the hub server and
give FSMO roles to that instance, as long as that instance resides in
my "central" Site in the replication topology, will the configuration
set be able to recalculate the new replication topology over a period
of time? Will the stale NTDS entries from the old hub just "go away"
or will manual removal be necessary?
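
To make the replication concern above concrete, here is a minimal Python
sketch of the site-link layout as a plain graph. It is not ADAM tooling and
says nothing about what the topology calculation will actually do; it only
shows that with links running solely between each remote Site and the central
Site, losing the hub leaves every remote Site with no replication path until
something at the central Site is back. The site names are made up.

```python
from collections import deque


def build_hub_and_spoke(hub, spokes):
    """Return an adjacency map with one link between the hub and each spoke."""
    links = {hub: set()}
    for spoke in spokes:
        links[hub].add(spoke)
        links.setdefault(spoke, set()).add(hub)
    return links


def remove_site(links, site):
    """Return a copy of the map with one site and all of its links removed."""
    return {
        name: {n for n in neighbours if n != site}
        for name, neighbours in links.items()
        if name != site
    }


def components(links):
    """Group sites that can still reach each other over the remaining links."""
    seen = set()
    groups = []
    for start in sorted(links):
        if start in seen:
            continue
        group = set()
        queue = deque([start])
        seen.add(start)
        while queue:
            current = queue.popleft()
            group.add(current)
            for neighbour in links[current]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        groups.append(group)
    return groups


if __name__ == "__main__":
    # Hypothetical site names standing in for the central Site and three remote Sites.
    links = build_hub_and_spoke("Central", ["SiteA", "SiteB", "SiteC"])
    print(len(components(links)), "group(s) with the hub up")
    print(len(components(remove_site(links, "Central"))), "group(s) with the hub gone")
```

With the hub present everything is one connected group; with it removed, each
remote Site is its own isolated group, which is why a replacement instance
inside the central Site matters more than which machine holds the FSMO roles.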

On the subject of backups, we've never had good luck with restored
ADAM images. In our test environment we've done multiple restores of
Ghost images for both the master and replicas and there has been more
than one occasion where replication stops functioning for the
re-imaged machine. It has been a while since I looked into it so I can't
remember the exact errors repadmin commands were giving me but it has
become our standard practice to completely remove the affected
instance from the configuration set and to reinstall the instance if
we need to get an instance back online from some sort of outage or
failure. This works without fail. Mind you this has only been for
replica instances (not master) and so here we find ourselves trying to
essentially make sure we can have the same level of confidence for the
master/hub instance.

I know Ghost is a bit heavy-handed, but there's precious little out
there about ADAM backup/restore procedures. Our production
hub server backup is more sophisticated, but my concern about reliable
restoration of the ADAM image still stands.

Thanks for your comments, they're always insightful.

Mike
