Re: Memory Management extremely poor in C# when manipulating strin



This is for Jon's & Willy's replies.

1.) No, I'm not loading the whole file into memory. Originally I tried
reading it in 50M chunks. I slowly whittled that down to 1M chunks and still
found no relief. Now I'm reading it line by line.

2.) I'm not sure I can post a "short but complete" version that
demonstrates what I'm seeing. I can post a trimmed-down version, but it
involves some other data structures (Hashtables) that I think are affecting
the memory management. I guess I could post the basic code with some stats
on the hash tables.

The entire algorithm is only about 300 lines (with comments), so I'm going to
go ahead and post it here (minus some of the superfluous stuff).

Please let me know if it would be more beneficial to post an actual working
program. I would think the idea here is not to identify bugs, but to identify
where memory is not getting released.

-----------BEGIN CODE SNIPPET------------------
private void ProcessInputFile(FileInfo f) {
    /* NOTE: the UPC object has been previously instantiated & populated
       prior to this call. */
    string rejectFilePath = null;
    string outputFilePath = null;

    int RecordsConverted = 0;
    int RejectedRecords = 0;
    int foundBySKU = 0;
    int foundByToy = 0;
    int foundByLocalSKU = 0;
    int foundByLocalToy = 0;
    string inLine = null;
    Hashtable lSKU_UPC_HASH = new Hashtable();
    Hashtable lTOY_UPC_HASH = new Hashtable();

    StreamReader IF = new StreamReader(f.FullName);

    //this is how we get the DATE_END value so we can name our output file
    //accordingly. First line is junk. 2nd line is what we want.
    IF.ReadLine();
    inLine = IF.ReadLine();
    inLine = Regex.Replace(inLine, @"\s+","");
    string[] fieldvalues = Regex.Split(inLine, ",");
    string date = fieldvalues[4];
    outputFilePath = this.process_path + "\\" + date + "_output_" + f.Name;
    IF.Close();

    //now start in on the actual processing.
    rejectFilePath = this.reject_file_path + "\\" + date + "_reject_" + f.Name;
    StreamWriter OFR = new StreamWriter(rejectFilePath ,false);
    StreamWriter OF = new StreamWriter(outputFilePath,false);

    IF = new StreamReader(f.FullName);
    //header information. Need to append "UPC".
    inLine = IF.ReadLine();

    OFR.WriteLine(inLine);
    OFR.Flush();

    inLine = Regex.Replace(inLine, @" +, +","\t");
    inLine += "\tUPC";

    OF.WriteLine(inLine);
    OF.Flush();

    string prevSKU = null;
    string prevToy = null;
    string curSKU = null;
    string curToy = null;
    string sUPC = null;
    while((inLine = IF.ReadLine()) != null) {
        RecordsConverted++;
        StringBuilder buf = new StringBuilder(1024);

        string[] fields = Regex.Split(Regex.Replace(inLine,@" +",""), @",");
        //split
        curSKU = fields[2];
        curToy = fields[6];
        /*
         * The following bit of code is a somewhat vain attempt at some
         * performance improvements for speed. What we have here is a lookup
         * against two hashes. The first hash, the SKU hash, is only about
         * 2200 items long. The Toy hash, though, is over 100,000. The files
         * are organized mostly sorted by SKU. So first, I'll store our SKU
         * and TOY # value in temp variables. Look up the values for them,
         * then continue on to the next loop. On the next loop, if my SKU or
         * TOY are the same, we know we'll get the same UPC from it, so we'll
         * just use the stored UPC from the last loop. If we find a new Toy
         * or SKU, then we'll look up that new value and store it in the two
         * dynamically built hashes lSKU_UPC_HASH or lTOY_UPC_HASH. These guys
         * will be much smaller than the full hashes for the same type. Since
         * they are smaller they'll be faster. Finally, if we can't find our
         * UPC based on previous value or values which we've looked up in our
         * smaller local hashes, we'll go to the global hashes to find our
         * UPC. Once we get it, we'll store the TOY and SKU in our local
         * Hashes for use next time.
         */
        if(! (curSKU == prevSKU || curToy == prevToy)) {
            if(lSKU_UPC_HASH.ContainsKey(curSKU)) {
                foundByLocalSKU++;
                sUPC = lSKU_UPC_HASH[curSKU].ToString();
            } else if(lTOY_UPC_HASH.ContainsKey(curToy)) {
                sUPC = lTOY_UPC_HASH[curToy].ToString();
                foundByLocalToy++;
            } else {
                //the SKU's have a . behind them in the text file, so we need
                //to strip it out
                string sSKU = Regex.Replace(curSKU,@"\.","");
                //UPC.GetUPCBySKUShared(sSKU) is just a Hash Lookup by sSKU
                sUPC = UPC.GetUPCBySKUShared(sSKU);
                if(sUPC != null) {
                    lSKU_UPC_HASH.Add(curSKU, sUPC);
                    foundBySKU++;
                } else {
                    string sToy = curToy.Length == 4 ? " " + curToy : curToy;
                    //UPC.GetUPCShared(sToy) is just a Hash Lookup by sToy
                    sUPC = UPC.GetUPCShared(sToy);
                    if(sUPC != null) {
                        lTOY_UPC_HASH.Add(curToy,sUPC);
                        foundByToy++;
                    }
                }
            }
            prevSKU = curSKU;
            prevToy = curToy;

            //if we can't find a UPC, we need to reject the record. Do this by
            //writing the record to the reject file, bump up our reject record
            //counter and continue.
            if(sUPC == null ||sUPC.Length < 1) {
                RejectedRecords++;
                OFR.WriteLine(inLine);
                continue;
            }
        } // if(! (curSKU == prevSKU || curToy == prevToy)) {
        OF.WriteLine(string.Join("\t", fields) + "\t" + sUPC);
    } //While IF.ReadLine
    OF.Close();
    OFR.Close();
    IF.Close();

}
-------------END CODE SNIPPET---------------
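
Stripped of the file handling, the lookup order described in the long comment
inside the loop comes down to roughly the following. This is only a minimal
sketch, not the actual program: the class UpcLookupSketch and its two
dictionary parameters are hypothetical stand-ins for the UPC object's hash
lookups (UPC.GetUPCBySKUShared and UPC.GetUPCShared), and it uses the generic
Dictionary and string.Replace in place of the Hashtable and Regex.Replace
calls in the snippet.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the program above. It only illustrates
// the lookup order: previous record, then local caches, then global tables.
public sealed class UpcLookupSketch
{
    // Assumed stand-ins for the UPC object's two global hash lookups.
    private readonly IDictionary<string, string> globalBySku;
    private readonly IDictionary<string, string> globalByToy;

    // Small per-file caches, playing the role of lSKU_UPC_HASH / lTOY_UPC_HASH.
    private readonly Dictionary<string, string> localBySku = new Dictionary<string, string>();
    private readonly Dictionary<string, string> localByToy = new Dictionary<string, string>();

    private string prevSku;
    private string prevToy;
    private string prevUpc;

    public UpcLookupSketch(IDictionary<string, string> globalBySku,
                           IDictionary<string, string> globalByToy)
    {
        this.globalBySku = globalBySku;
        this.globalByToy = globalByToy;
    }

    // Returns the UPC for a record, or null if no table knows it.
    public string Find(string sku, string toy)
    {
        // 1. Same SKU or toy number as the previous lookup: reuse its result.
        if (sku == prevSku || toy == prevToy)
            return prevUpc;

        string upc;

        // 2. Try the small local caches first.
        if (!localBySku.TryGetValue(sku, out upc) && !localByToy.TryGetValue(toy, out upc))
        {
            // 3. Fall back to the global tables. SKUs carry a trailing '.'
            //    in the input file, so strip it before the global lookup.
            string cleanSku = sku.Replace(".", "");
            if (globalBySku.TryGetValue(cleanSku, out upc))
            {
                localBySku[sku] = upc;
            }
            else
            {
                // Four-character toy numbers are padded as in the snippet.
                string paddedToy = toy.Length == 4 ? " " + toy : toy;
                if (globalByToy.TryGetValue(paddedToy, out upc))
                    localByToy[toy] = upc;
            }
        }

        prevSku = sku;
        prevToy = toy;
        prevUpc = upc;
        return upc;
    }
}
```

A null return from Find corresponds to a rejected record. Unlike the loop in
the snippet, the sketch hands the result back on every call, so the caller
can check it even when the previous record's value is reused.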
"Jon Skeet [C# MVP]" wrote:

> Segfahlt <Segfahlt@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
> > I have a fairly simple C# program that just needs to open up a fixed width
> > file, convert each record to tab delimited and append a field to the end of
> > it.
> >
> > The input files are between 300M and 600M. I've tried every memory
> > conservation trick I know in my conversion program, and a bunch I picked up
> > from reading some of the MSDN C# blogs, but still my program ends up using
> > hundreds and hundreds of megs of ram. It is also taking excessively long to
> > process the files. (between 10 and 25 minutes). Also, with each successive
> > file I process in the same program, performance goes way down, so that by the
> > 3rd file, the program comes to a complete halt and never completes.
> >
> > I ended up rewriting the process in perl which takes only a couple minutes
> > and never really gets above a 40 M footprint.
> >
> > What gives?
>
> It's very hard to say without seeing any of your code. It sounds like
> you don't actually need to load the whole file into memory at any time,
> so the memory usage should be relatively small (aside from the overhead
> for the framework itself).
>
> > I'm noticing this very poor memory handling in all my programs that need to
> > do any kind of intensive string processing.
> >
> > I have a 2nd program that just implements the LZW decompression
> > algorithm(pretty much copied straight out of the manuals.) It works great on
> > files less than 100K, but if I try to run it on a file that's just 4.5M
> > compressed, it runs up to 200+ Megs footprint and then starts throwing Out of
> > Memory exceptions.
> >
> > I was wondering if somebody could look at what I've got down and see if I'm
> > missing something important? I'm an old school C programmer, so I may be
> > doing something that is bad.
> >
> > Would appreciate any help anybody can give.
>
> Could you post a short but complete program which demonstrates the
> problem?
>
> See http://www.pobox.com/~skeet/csharp/complete.html for details of
> what I mean by that.
>
> --
> Jon Skeet - <skeet@xxxxxxxxx>
> http://www.pobox.com/~skeet
> If replying to the group, please do not mail me too
>