Re: Data Warehousing
- From: "AMDRIT" <amdrit@xxxxxxxxxxx>
- Date: Wed, 17 May 2006 14:16:00 -0500
Kevin Spencer's response is more eloquent than mine in explaining the guid
contention issue.
Essentially, the guid's used on the client side are just place holders so
that work can be completed. guid's on the server side are assigned as the
stored procedures are processed. Since the server is assigning the guid's,
the likely chance of a duplicate guid in the same table is like 2 billion to
1.
The paradigm that I proposed treats each survey as "unprocessed" work and it
remains unprocessed until the server side logic processes the data and the
data is then sent back to the client. Essentially, it is a round trip, just
a very slow round trip.
possible flow
1. create a dataset for a new survey
2. modify the data (insert, update, delete)
3. save the dataset to the disk
4. copy the file(s) to the thumb drive
5. load the file(s) from the thumb drive into the server control program
6. process the data
7. commit the data to the data warehouse
8a. save the processed data back the thumb drive
8b. save lookup and reference data to disk
9. copy the data back onto the client drive from the thumb drive
10. load the file(s) back into the client application
11. commit the data to the local client
The client code will manage the initial guid assignment on the client side.
The dataset itself, along with the stored procedures will manage the guid
assignment on the server side, ignoring the initial guids assigned by the
client.
Since the data was not initially committed to a local data store (SQL or
Access), when you store the data in the dataset on the local server, the
guids will be assigned the values the server had given them. There is no
contention on guid assignment, since the data warehouse is assigning the
guid's.
I do not think that log shipping is a viable solution here
I do not think that replicated databases are a viable solution here
I do not think that hitting the local SQL server, prior to commiting data to
the warehouse is a feesible solution
If the XML files become to large to handle via the thumb drive, there is
always csv and binary formats to work with. By using either the XML, CSV,
and binary solutions, you are only moving affected data and not the entire
database.
I hope that makes it more intelligable.
"David Harris" <DavidHarris@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote in message
news:94D6D7E7-5624-4390-8F1C-2E8310E4C01E@xxxxxxxxxxxxxxxx
Unfortunately, a secondary problem is that our server is not located
physically near the machines, it must be in a separate building.
Radioactive
environments are usually dusty and noisy, so our server and processing
station is located outside the buildings being surveyed.
A tertiary problem is that this is sensitive information and must 1) stay
on
the machine until it has been processed and finalized, 2) must stay in the
data warehouse until all cleanup efforts have completed, and 3) must have
the
ability to put back on the client machines even after deletion. The reason
this is a problem is that if we only had to keep the survey up long enough
to
process it, there would be no GUID duplicates. And it must remain on the
machine also, until the survey is complete, processed, and marked as ok
(no
contamination).
Not sure what you mean about there being no GUID contention. Currently,
the
lowest level identity is an Acquisition, a member of Strip, which is a
member
of Survey. There may be tens of thousands Acquisitions in a survey,
sometimes
more. For a few seqential surveys on the same machine, this is a
non-issue.
However, when you take into account there are ten machines, each with a
hundred surveys, each with 10000 acquisitions, your chances of duplicating
one of those Acquisition GUIDs increase by a large enough margin to
matter.
So just forcing the whole dataset over without redoing all the GUIDs is
not
an option. It doesn't collide on just a few surveys, but the average seems
tobe eight surveys from various machines, before one Acquisition GUID is
duplicated.
Hence my current solution of serializing them as-is, and replacing all the
GUIDs on the server side to new ones with no collision. I guess the only
way
to speed that up would be first checking if any of the GUIDs actually
collide, and only go through the replacement if they do, otherwise just
pop
the serialized dataset into the server's database. Only when we get to
about
a thousand surveys in that data warehouse (not uncommon, our largest
project
had a few hundred thousand surveys), those collisions will happen more and
more often.
Unless I'm missing the point somewhere?
Thanks again for the help,
David
"AMDRIT" wrote:
This solution still applies, only the transport is SneakerNet, you would
simply serialize the datasets to the hard drive. Pick them up with your
thumb drive, process them, push them back to the thumb drive, then copy
them
back to the client machine. GUID or Autonumber contention is not an
issue
here, especially since this process seems more serial and sequential than
random updates.
I could make an argument for parallel port, USB, or serial port
communication similar to the old laplink days, and using it to push
update
in batch once per day. Certainly, that would be better than manually
touching every computer and inserting a "floppy disk"
.
- References:
- Re: Data Warehousing
- From: AMDRIT
- Re: Data Warehousing
- From: AMDRIT
- Re: Data Warehousing
- Prev by Date: Re: ISortable
- Next by Date: Re: Data Warehousing
- Previous by thread: Re: Data Warehousing
- Next by thread: Re: Data Warehousing
- Index(es):
Relevant Pages
|