Re: StreamReader->regex->Win32 api marshalling
- From: Tim <Tim@xxxxxxxxxxxxxxxxxxxxxxxxx>
- Date: Thu, 1 Sep 2005 09:39:12 -0700
Thats the problem - the text isn't quoted. So I'm using meta-data definitions
of the SQLServer table columns that the csv data is destined for to decide
what sort of char patterns to expect in each field. But where the field is
text all bets are off - you just keep going 'til you reach whatever's defined
as the delimiter char; which is fine unless the last field in the row is text
and potentially includes the row delimiter (\r\n) in its data. As far as I
can see the only way to semi-reliably determine the end of the row,
therefore, is to find the start of the next row. Originally I was using the
entire regex definition of a record to find the start of the next row, but
I've optimised it to just look for the first column of the next row to save
time. I can't think of anything else to do - apart from drowning the person
responsible for the data format. Won't solve the problem but I'll feel a lot
better.
"shriop" wrote:
> I'm pretty sure this is right in line with what my parser is made to
> handle. If you find it can't handle it, then I'd like to see an example
> of the data so I can add the functionality to handle it. You can set
> the delimiter to anything you want. And end of row is commonly ended
> with a /r/n, so it handles that. the /r/n in the data is no problem as
> long as there are double quotes around the field itself, so the parser
> has something to base the logic on and know that this is data, and not
> the end of a row. I've got a free demo version that you can try it out
> with. You can also feel free to email me directly with any questions.
> This kind of data is why I generally recommend people not try to use
> regular expressions to parse the data, the expression just gets too
> hairy, leading to poor performance.
>
> Here's what I'm picturing you're data looking like, and what it can
> handle, although I don't know what specific delimiter you're using.
>
> 1,Bruce,Dunwiddie,"in here, I can have commas and
> \r\n"\r\n2,Bob,Jones,some other quote here
>
>
> Tim wrote:
> > When I said the csv data was rough, I meant it! As an example: it uses a
> > specific configurable field delimiter for all the fields in a record except
> > the last field in the record which uses a \r\n. BUT ... the last field (and
> > in fact any other field) may also contain \r\n in its data! My solution to
> > this was to use the regex definition of the format of the first field in each
> > record to find the end of the last field. Would I be able to handle this with
> > your csv parser?
> > Thanks
> > Tim
> >
> > "shriop" wrote:
> >
> > > Lots of info here, but I'll try to split it out.
> > >
> > > Regex's are a notoriously bad performance wise way to accomplish what
> > > you're trying to do.
> > > http://www.codeproject.com/cs/database/CsvReader.asp
> > > http://www.csvreader.com/csv_benchmarks.html
> > >
> > > I would suggest you either implement the code yourself using raw char
> > > array handling to find the commas, etc, etc, or purchase a commercial
> > > csv parser like the one I sell, http://www.csvreader.com .
> > >
> > > bcp is not what I would recommend for this because it does not properly
> > > handle escape sequences, and does not necessarily have a performance
> > > benefit that I've seen over dts. I would recommend calling a dts job
> > > instead to import this data. If you don't need to do any data
> > > manipulation to the original csv file, then there's no reason to even
> > > do any csv parsing, just let dts do it for you. If you do want to try
> > > the batching of insert statements like one of the previous comments
> > > suggested, make sure you start a transaction before doing the chunks of
> > > inserts, then commit at the end of a chunk. Otherwise, you'll get a
> > > performance hit as the db tries to commit after every single insert.
> > >
> > > Scott wrote:
> > > > Tim wrote:
> > > >
> > > > > db. It all works fine, but the performance is just not good enough -
> > > > > I need to at least halve the time its taking to process the data.
> > > > > I've profiled the code and it looks like the overhead is more or less
> > > > > evenly split between the reading/parsing on the one hand and the data
> > > > > loading/bcp/Win32 api marshalling on the other hand - so I'm looking
> > > >
> > > > Are you using the Compile flag on the regex?
> > > >
> > > > I would also try to insert to the database directly from your code by
> > > > batching the statements: insert xxx xxxx xxx;insert xxx xxxx xxx;insert
> > > > xxx xxxx xxx;... I would start by batching into groups of a 100 and
> > > > then try to tweak from there.
> > > >
> > > > I've had very good performance success with these two strategies.
> > > >
> > > > Scott C.
> > >
> > >
>
>
.
- Follow-Ups:
- Re: StreamReader->regex->Win32 api marshalling
- From: shriop
- Re: StreamReader->regex->Win32 api marshalling
- References:
- Re: StreamReader->regex->Win32 api marshalling
- From: Scott
- Re: StreamReader->regex->Win32 api marshalling
- From: shriop
- Re: StreamReader->regex->Win32 api marshalling
- From: Tim
- Re: StreamReader->regex->Win32 api marshalling
- From: shriop
- Re: StreamReader->regex->Win32 api marshalling
- Prev by Date: Re: StreamReader->regex->Win32 api marshalling
- Next by Date: Re: StreamReader->regex->Win32 api marshalling
- Previous by thread: Re: StreamReader->regex->Win32 api marshalling
- Next by thread: Re: StreamReader->regex->Win32 api marshalling
- Index(es):
Relevant Pages
|