Re: Generating one table with a terabyte of data




See comments inline

--
Jim Vierra
"lee" <lee@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote in message news:F0589343-FBE1-4699-8F78-FE8395E11676@xxxxxxxxxxxxx...
> Hey Jim,
>
> Thanks for all the information. I'll start analyzing the perflogs. I'm doing
> 90% of what you said already, did you ever solve the Index creation problem.
> If you're inserting 100 million rows in 4 hours (I do that with my config in
> about 90 minutes and I run for days on in...so as you can imagine <<I got me
> a BIG A** DB>>) then you must have had the same problem as me when it comes
> time to create the index. Should I just leave the index(es) on and take the
> hit with the update of the index instead of the speed of inserts with out the
> index? It's my next test.
>
 
The perflogs are generally a best place to start as they help to eliminate guess work.
I have inserted in excess of 700 million rows in less that 2 hours and index time was about one hour. Tis was on a Redbrick DW but todays processors are so much faster tat I believe 100 M is more than possible.  Again, it's perfmon that will tell you much of the answer.
 
Indexing is faster after the fact in my experience.  Less IO conflict.
What kind of indexes have been declared?  You should avoid using any indexes until you know why they are needed.
Be dareful of using clustered indexes in this scenario unless you have tested them fo performance.

> I really hate N(N-1) problems...especially when N is BIGGER then a 32 bit
> number.
>
> I didn't know we could split a database file group over multiple HDs.
 
I am not sure about spliting the file group but you can cretainly have multiple file groups.  Look into partitioning.  I don't remember exactly how SQS 2000 handles this but it is standard for most DBs.  In most cases you can release one thread for each file and, I believe, on SQS it is also limited by logical drive channel.  One drive, one thread.  Look at the tuning documents to verify this.  If perfmon shows processor pegged it is not threads at this point.
 
I'll
> try that since I'm reaching HD limits also. I've got me a process that dumps
> about 200 GIGS in 1 table every day. You would be surprised at how much code
> duplication most enterprises have. It's insane...40 million lines of code
> will on average have 12-19% duplication and they're not straight forward copy
> paste. You have to catch all those subtle differences and that generates TONS
> AND TONS AND TONS of data.
 
Given the task at hand I would suggest a solution I have used in the past.  Take any simple string matching issues and run them as external C code on the files before loading.  If you can tokenize common elements it can reduce the size of teh problem. Databases are not designed to do what you are doing.  This is exactly how I got into the last issue trying to optimize a database.  Much of what was being done in the dtabase should have been done with C in an external prep process.
 
Consider deploying workstaions (XP) tuned to parse the files before feeding them to the SQS server.
Definitely go to the 4Gb limit.  Memory is cheaper than time.
 
How many RAID channels?
What is the RAID channel? 160, 320 or less.
What model Proliant?  Sometimes the hardware may not match the job but can be adjusted for greater performance by changing memory type or adjusting IO layouts.  The bigger Proliants are very flexible and the sales reps give out deals that configure the box at it's minimal level then they show of the specs for the maxed out box.  Non HP sales reps tend to be charlatans when they are not HP Certified distributors.
 
At 3Ghz Xeon I suspect you have a G4 series. Is it 1 DL580 or better.  If so add two more processors.  Go to 8 Gb memory and upgrade to Windows Datacenter edition so you can take advantage of the extra memory.
 
Of course start with the permon logs as ther may be an easier answer.  My thumbnail calculatins say that a dual 3ghz should be able to pump more data with more tuning.  Adding hardware is easier and faster though.

>
> Thanks for all the help, if you have any suggestions on the index let me know
>
> Lee
>
> "Jim Vierra" wrote:
>
>> Consider un-raiding the disks for the bulk insert tables.  It will give you
>> a 15% or greater push.  Ad the full 4Gb.
>>
>> Run perfmon counters to a log and analyze them after a bulk load.  Try to
>> see if memory or processor is in need.  I suspect that with two 3Ghz
>> processors that you are having an IO problem which can be solved by
>> partitioning the table.onto multiple disks depending on the SCSI card.  I
>> hope it is one the dual channel 320  Proliant cards.
>>
>> Be sure to run the Bulk loader on the SQLServer and not over the network
>> unless you have an optimized 1Gb net.  The UI slowdown seems to indicate
>> that the bottleneck is either with the processor or memory.  As you clear up
>> each bottle neck you will probably find a new one further down the line.
>> You will have to do some analysis to chase this chain of events.
>>
>> Remember - you can use multiple copies of BCP to load different sections of
>> the file by filtering the file.  Be sure that each copy of BCP is targeting
>> a different set of output files for maximum load speed.
>>
>> Run perfmon to analyze IO channel usage on you raid.  If the card is to slow
>> it will not keep up with a 15K disk set.  If write buffering is turned on
>> the buffer may fill up to often but this usually shows as an active UI due
>> to IO stalling.
>>
>> Watch the page file usage.  Ideally it should not show any major activity.
>> If it does something is wrong.  Be sure to not allow any transactional
>> relations to occur on the instance during loading as they can block the
>> load, intermittently, but enough to degenerate performance. Monitor locks
>> (you should have only a table lock) for interference.
>>
>> --
>> Jim Vierra
>>
>> "Jim Vierra" <
jvierra@xxxxxxx> wrote in message
>>
news:OSFJuJNVFHA.1508@xxxxxxxxxxxxxxxxxxxx...
>> > On fast bulk loads you should make sure that the database is not logging.
>> > If you keep track of an identity or some key field you can undo the load
>> > if necessary.  Another thing is to split the updates across file sets.
>> > Each file can get it's own thread where, I believe, only one thread can be
>> > release per file for IO.  This can be done by portioning the data.
>> >
>> > You are running up against the classic issues of Data Warehousing.  Fast
>> > loading of daily data in the fastest time.  The rules for a data warehouse
>> > fir perfectly here.  No transactions and many small file sets to allow
>> > parallel loading of a table.
>> >
>> > The more memory you can add to the box the better.  More processors - at
>> > least two.  Use /3GB to give SQLServer a larger address space but only if
>> > you have 4GB of memory or more on the Datacenter Editions.
>> >
>> > The slowness of the UI says that the database is working.  On SQS this is
>> > what should happen. The UI is low priority on a server.
>> >
>> > You failed to say which OS and hardware specs.  This is very important to
>> > tuning.
>> > Consider that you can set the database and OS up for loading and then
>> > reboot with an operational configuration.
>> >
>> > If you use multiple files for partitioned data be sure the files sets are
>> > on different drives.  The drives should NOT be raided for maximum
>> > performance. If the bulk load is fast enough you can always reload (like
>> > we do in a DW) instead of doing backups.  The raw data is archived to a
>> > fast tape and a loader can reload it faster than you can restore a
>> > database from tape.
>> >
>> > In the past I have split the databases up.  Fast Loader.Read only
>> > datawarehouse tables and transactional tables in different databases. The
>> > read-only DW databases are not backed up but the others are.
>> >
>> > On a 3 Tb DW we could loaded close to over 1.5 G records per day with an
>> > average record greater than 700 bytes.  The loading had to be complete by
>> > 7 AM and did not start until after 2 AM.  4 1.2Ghz (64bit) Processors and
>> > a RAID farm.  We didn't have to tear down the RAID 5 farm because we had
>> > hundreds of file sets and could optimize parallel loading.  We also had 8
>> > Channels of Raid and 4Gb of memory.  Building this beast we learned a lot
>> > about optimizing loading.
>> >
>> > I applied the lessons to a SQLServer 7.0 instance on  a 4 Processor
>> > Proliant with 4Gb and one RAID channel.  We were able to blast a few
>> > hundred million records in daily within a four hour window something
>> > everyone said would be impossible.  Like you we needed to do repeated
>> > passes on these records to find subtle matches in the data. (Data Mining).
>> > Applying the DW rules worked for this application.
>> >
>> > --
>> > Jim Vierra
>> >
>> > "Kevin3NF" <
KHill@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote in message
>> >
news:%23i42snMVFHA.628@xxxxxxxxxxxxxxxxxxxx...
>> >> I've locked SQL server to 1/3 physical memory usage. DB growth is 512Meg
>> >> chunks
>> >>
>> >> You need to free it up to fly, and pre-allocate the db size.  In-process
>> >> file growth will kill performance
>> >>
>> >> --
>> >> Kevin Hill
>> >> President
>> >> 3NF Consulting
>> >>
>> >>
www.3nf-inc.com/NewsGroups.htm
>> >>
>> >>
www.DallasDBAs.com/forum - new DB forum for Dallas/Ft. Worth area DBAs.
>> >>
>> >>
www.experts-exchange.com - experts compete for points to answer your
>> >> questions
>> >>
>> >>
>> >> "lee" <
lee@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote in message
>> >>
news:B772A8C2-F009-4E29-816C-5B6C2F45151E@xxxxxxxxxxxxx...
>> >>> Hi All,
>> >>>
>> >>> I need ideas on how to improve a process that can generate a terabyte of
>> >>> data in a week. Our company produces software that analyzes code and one
>> >>> of
>> >>> the things we do is Code Duplication. We're processing hundreds of
>> >>> millions
>> >>> of lines against each other and generate gigabytes of results. In
>> >>> reality we
>> >>> generate over 20 gigabytes of data a day. We are noticing major slow
>> >>> downs in
>> >>> the system. Here's how we're going about it
>> >>>
>> >>> 1. Non-Indexed table residing on it's own FileGroup (which is not on the
>> >>> same drive as the SQL server).
>> >>> 2. A process that generates Text files that we use BULK INSERT to load
>> >>> the
>> >>> data (any other way of doing inserts is just to slow).
>> >>>
>> >>> This process should generate 10 million rows in the table per hour. Only
>> >>> problem is that we're noticing a considerable slow down as we pummel the
>> >>> Database. We're making sure our Bulk Insert has exclusive table locks
>> >>> and
>> >>> nobody else accesses the system while this is happening. We're also
>> >>> noticing
>> >>> a strange behavior at the OS level. The system just slows down to a
>> >>> crawl.
>> >>> Even mouse movement becomes jagged (Win2K Server).
>> >>>
>> >>> I should also mention that once we're done inserting we create indexes
>> >>> at
>> >>> the end. Average time to create 1 indexed field (on a different
>> >>> filegroup) on
>> >>> 100 million rows is about 3 to 4 hours. Seems a little excessive to me,
>> >>> must
>> >>> be doing something wrong.
>> >>>
>> >>> I've gone over this 100 times and I can't see anything else to improve
>> >>> performance. I have a friend that is an ORACLE DBA and he keeps telling
>> >>> me
>> >>> that SQL server just can't handle this type of volume and that I should
>> >>> move
>> >>> to ORACLE. I refuse to believe ORACLE can do more then SQL Server. He
>> >>> actually took one of our 500 meg text files and loaded it in 1/10 the
>> >>> time it
>> >>> took SQL Server. My answer to that is "I must be configuring it
>> >>> incorrectly".
>> >>>
>> >>> I've locked SQL server to 1/3 physical memory usage. DB growth is 512Meg
>> >>> chunks. We don't need any transaction logs, so we're using RECOVERY
>> >>> MODEL =
>> >>> SIMPLE (any other mode kills SQL server anyway, LOG fills to fast with
>> >>> our
>> >>> data creation speed). I've thrown everything at this except the kitchen
>> >>> sink.
>> >>>
>> >>> Any suggestions on how I can improve my Bulk Inserts and how to improve
>> >>> my
>> >>> Index creations? Should I consider SQL Server 2005? Will it handle my
>> >>> needs
>> >>> better?
>> >>>
>> >>
>> >>
>> >
>> >
>>
>>
>>


Relevant Pages

  • Re: Generating one table with a terabyte of data
    ... I didn't know we could split a database file group over multiple HDs. ... > Run perfmon counters to a log and analyze them after a bulk load. ... Another thing is to split the updates across file sets. ... >> loading of daily data in the fastest time. ...
    (microsoft.public.sqlserver.dts)
  • Re: Performance extreme deterioration
    ... programs from loading at startup. ... Pentium M 1.4 GHz processor and 512 MB of memory. ... If I click on my system tray, there are 24 icons of programs ... programs that automatically load into memory from doing so. ...
    (microsoft.public.windowsxp.general)
  • Memory limitation with owc?
    ... spreadsheets that I want to load into memory. ... loading a number of large xml files. ...
    (microsoft.public.office.developer.web.components)
  • Re: thread texture loading
    ... it just seem to be the inverse for me, load file from disk to memory using ... // creation de la texture en memoire ... >loading thread, just read the whole file into memory using your favourite ...
    (microsoft.public.win32.programmer.directx.graphics)
  • Re: Browser Issue : IE 6 -Error message "Memory could not be read"
    ... The memory could not be "read". ... Quite often when I’m loading any page, like Amazon.com, I find that all the images won’t load and come up as boxes with red crosses inside. ... Other sites don't even load, I get the initial "Run Time Error Message" followed by the "Memory could not be read message...and then IE is terminated. ...
    (microsoft.public.windows.inetexplorer.ie6.browser)