Re: Text Script

"Jim Vierra" <jvierra@xxxxxxx> wrote in message
news:Oxf85s%23UFHA.1152@xxxxxxxxxxxxxxxxxxxxxxx
> Al
>
> I agree that simplicity is the real issue. I suppose it's really a matter
> of what you are comfortable with. The dictionary hash is really not much
> different for speed due to the smallness of the "Key" but is a simple way
> of doing it.
>
> The point I wanted to make, primarily, is that re-reading a stream has no
> more overhead than reading from an array as it is an in-memory operation.

Agreed. But how does one know when limits have been exceeded that cause the
file to be re-read?

> String matches for short strings, less than 1K, are very fast in VBS.

True, but "very" is relative. Performing a slower operation but performing
it fewer times *can* result in faster code. Also consider the geometric
increase in the number of comparisons in this particular problem.
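
To put rough numbers on that, here is a minimal sketch that counts the string comparisons a nested-loop match performs. The keys are made up (200 twelve-character values, with the second list holding the same keys in reverse order), so treat the count as an illustration and not as a measurement of the files in this thread.

```vbscript
' Hypothetical illustration: count comparisons made by a brute force match.
' The keys below are invented; they are not taken from the real files.
Option Explicit
Dim n, i, j, comparisons, keysA(), keysB()

n = 200
ReDim keysA(n - 1), keysB(n - 1)

' Build two lists holding the same 12-character keys in opposite order.
For i = 0 To n - 1
    keysA(i) = "ACCT" & Right("00000000" & i, 8)
    keysB(n - 1 - i) = keysA(i)
Next

' Brute force: scan list B from the top for every key in list A,
' stopping as soon as the matching key is found.
comparisons = 0
For i = 0 To n - 1
    For j = 0 To n - 1
        comparisons = comparisons + 1
        If keysA(i) = keysB(j) Then Exit For
    Next
Next

WScript.Echo "Brute force comparisons for " & n & " keys: " & comparisons
```

With both lists the same length, the count grows roughly with the square of the number of lines, so doubling the files to 400 lines roughly quadruples the work. A keyed lookup needs one lookup per line of the second file.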

> For each match of a key in the Dictionary we have to generate the hash for
> the "query" value. We also have to generate an index table when building
> the Dictionary. For small files this adds much overhead. For large files it
> may gain speed for us down the road.

"Down the road" is a code word for "where we will be so far into the future
that we need not worry about it just yet". In my experience, we often find
ourselves down that road before we know it.

I do not know how much additional overhead is created when using dictionary
objects with a small number of keys. But, even if significant, it is only
significant for small data sets, which, simply by definition, are less
likely to take a long time to run. Putting it this way, if it is good for a
large problem, it can't be terrible for a small one, just perhaps not *the*
fastest.

Conversely, when the problem scales up to the larger end, that is where the
inefficiency of the brute force method starts to be a real problem. Without
doing the math, I would consider it an order of magnitude (at least) more
significant than a bit of dictionary overhead.

That is why I would recommend the most scalable approach, and more
especially because it is simpler in execution.
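
For what it's worth, the dictionary approach might look something like the sketch below. It assumes the file names and column positions from Scott's script quoted further down (account number at column 310 of "79600" and at column 3 of "0241_2005A"). The output line is only the test format from that script, and the handling of duplicate and unmatched accounts is an assumption about what is wanted.

```vbscript
' A sketch only: file names and column offsets are copied from Scott's
' script; the output layout and the error handling are assumptions.
Option Explicit
Const ForReading = 1
Dim fso, dict, file1, file2, Tip, rec, key

Set fso = CreateObject("Scripting.FileSystemObject")
Set dict = CreateObject("Scripting.Dictionary")

' Pass 1: index every file1 record by its 12-character account number.
' If an account number repeats, only the first record is kept.
Set file1 = fso.OpenTextFile("0241_2005A", ForReading)
Do While Not file1.AtEndOfStream
    rec = file1.ReadLine
    key = Mid(rec, 3, 12)
    If Not dict.Exists(key) Then dict.Add key, rec
Loop
file1.Close

' Pass 2: read file2 once and look each account number up by key.
Set Tip = fso.CreateTextFile("Tip.txt", True)
Set file2 = fso.OpenTextFile("79600", ForReading)
Do While Not file2.AtEndOfStream
    rec = file2.ReadLine
    key = Mid(rec, 310, 12)
    If dict.Exists(key) Then
        ' dict(key) returns the matching file1 record.
        Tip.WriteLine "File2" & RTrim(key) & "File1" & RTrim(Mid(dict(key), 3, 12))
    Else
        WScript.Echo "No match in file1 for account " & key
    End If
Loop
file2.Close
Tip.Close
```

Each file is read exactly once, and neither file needs to be sorted or rewound.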

> In the end, as I stated above, whatever seems easiest for the user is
> going to be the best approach.

Yes. Quite often what seems easiest is the first method we pick out of our
toolkit. In the case of the newbie, this may include only what appears to
him to be straightforward. When I first heard of the dictionary object I
looked at it a long time before it finally dawned on me that it was much
more than the documentation indicates.

/Al

> --
> Jim Vierra
>
> "Al Dunbar [MS-MVP]" <alan-no-drub-spam@xxxxxxxxxxx> wrote in message
> news:%23aKKBv9UFHA.3312@xxxxxxxxxxxxxxxxxxxxxxx
> >
> > "Jim Vierra" <jvierra@xxxxxxx> wrote in message
> > news:eKlmnEgUFHA.3312@xxxxxxxxxxxxxxxxxxxxxxx
> >> For a file with a few hundred lines the buffering mechanism in W2K and
> >> after will be faster than an array.
> >
> > Having not done any timing tests I'm not sure about *faster*, but if a
> > file can be re-read from a system buffer, this will no doubt improve the
> > performance of the "brute force" approach.
> >
> >> VBScript arrays of more than a few tens of lines tend to be slow.
> >> Dictionaries are even slower. Rewinding a file is instantaneous and the
> >> file is already in memory after the first pass.
> >
> > Rewinding a file might indeed be instantaneous as you say, but you are
> > comparing apples to oranges here. After rewinding the file, it must be
> > processed again, line by line, and compared against some value to find a
> > match. The rewind part of this operation becomes insignificant next to
> > the number of comparisons. IMHO, this type of operation is where the
> > dictionary object becomes more efficient.
> >
> >> RegEx can match by template for file1 and the retrieved value can be
> >> quickly matched against file2.
> >>
> >> Using arrays is nice but I don't see any real performance gains due to
> >> array behavior and slow dictionary response.
> >
> > In all honesty, I was not considering so much the runtime efficiency as
> > the design time efficiency, and the simplicity of using the dictionary
> > object. Of course, design time efficiency may be in the eye of the
> > beholder - there is no point in using someone else's approach if you
> > think it is not the most natural one.
> >
> >> I will say this. A dictionary could be a nice method for comparing but
> >> the number of comparisons is still the same.
> >
> > Not necessarily. The dictionary object uses a hashing technique so that,
> > for example:
> >
> > If dicob.Exists(someKey) Then
> >
> > does NOT compare all of the keys against the value given.
> >
> > In your method, and even given that the files are only read from disk to
> > memory once, each record read from the second file must be compared to
> > as many records in the first file as it takes before a match is found.
> >
> >> We should try both just to see.
> >
> > I think that it might be difficult to do a valid timing test for such a
> > small problem. I won't bother mainly because I do not see it
> > specifically as an issue of coming up with the fastest solution possible,
> > as that would be better done by a compiled language anyway.
> >
> > /Al
> >
> >
> >>
> >> --
> >> Jim Vierra
> >>
> >> "Al Dunbar [MS-MVP]" <alan-no-drub-spam@xxxxxxxxxxx> wrote in message
> >> news:u5wu%23SfUFHA.3244@xxxxxxxxxxxxxxxxxxxxxxx
> >> > Regardless of the actual format, and given that the ordering of
> >> > accounts is going to be different in the two files, I would strongly
> >> > (strongly) recommend against doing this:
> >> >
> >> >> >> So you need to compare line1 of file1 with every line in file2 and
> >> >> >> so on for every line in file1.
> >> >
> >> > by brute force, i.e., for each line you read from one file,
> >> > rewind/re-open the other and read in and check every line against the
> >> > "master" line.
> >> >
> >> > There are two methods that would be useful to consider here:
> >> >
> >> > a) read both files in their entirety using .readall, and then process
> >> > them as arrays of lines using the split function;
> >> >
> >> > b) read in one file, building up a dictionary object with the key
> >> > being the account number, and the data being, well, whatever it is.
> >> > Then read the second file, extract the account number, check to see if
> >> > it is present in the dictionary object, flag an error if it is not,
> >> > otherwise "merge" or "append" the data from this second file into the
> >> > dictionary object. Then write out the records in the dictionary object
> >> > to a file, noting which were not updated with data from the second
> >> > file.
> >> >
> >> > /Al
> >> >
> >> > "Jim Vierra" <jvierra@xxxxxxx> wrote in message
> >> > news:eMu%238yYUFHA.3152@xxxxxxxxxxxxxxxxxxxxxxx
> >> >> A sample set of lines would be more helpful. You can mask anything
> >> >> private and only send 5 or 6 lines that may have different structures.
> >> >>
> >> >> --
> >> >> Jim Vierra
> >> >>
> >> >> "Scott Burns" <ScottBurns@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote in message
> >> >> news:76A9598C-6B51-4375-8176-2B380752CCD8@xxxxxxxxxxxxxxxx
> >> >> > Basically this is an account number that is 12 digits long found in
> >> >> > different parts of each of the files and I need to be able to verify
> >> >> > that they are the same lines, you know, apples for apples, and then
> >> >> > merge the data in the two files together to one line.
> >> >> > --
> >> >> > Scott Burns
> >> >> >
> >> >> >
> >> >> > "Jim Vierra" wrote:
> >> >> >
> >> >> >> So you need to compare line1 of file1 with every line in file2 and
> >> >> >> so on for every line in file1.
> >> >> >> What is the template for the item you are searching for?
> >> >> >> Is it a stable template (e.g. 999999 or 999-99-9999 or
> >> >> >> (999)999-9999) or is it variable (e.g. sometimes it's 99, sometimes
> >> >> >> 9999), where the "9" is a template for any digit?
> >> >> >> Is it always in the same place in the line?
> >> >> >>
> >> >> >> Please understand that you have not given enough information to be
> >> >> >> able to design a method for doing this. Suppose more than one number
> >> >> >> is in the line and the number you want is the second or third number.
> >> >> >> How do we know which number to match?
> >> >> >> --
> >> >> >> Jim Vierra
> >> >> >>
> >> >> >> "Scott Burns" <ScottBurns@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote in message
> >> >> >> news:F3AF02FC-FBAA-4AC7-867B-2F33464BE190@xxxxxxxxxxxxxxxx
> >> >> >> > Below is what I have written so far. I have file1 that comes from
> >> >> >> > one computer with dollar figures in it listed by account number in
> >> >> >> > no sequential order. I have File2 from another computer that is in
> >> >> >> > no order either. I need to readline in file2 to find an account
> >> >> >> > number. I then need to find the same account number in file1 and
> >> >> >> > print specific information in a specific format to one text file.
> >> >> >> >
> >> >> >> > Both files contain about 200 or so lines and could contain more than
> >> >> >> > that. I know I need to use one file as my master to query the second
> >> >> >> > file with. I have gotten as far as the first line of data but I
> >> >> >> > can't get it to go to the next line of both files. It only queries
> >> >> >> > the file for the first line.
> >> >> >> >
> >> >> >> > Please help.
> >> >> >> >
> >> >> >> > Dim fso, Tip, file1, file2
> >> >> >> > Set fso = CreateObject("Scripting.FileSystemObject")
> >> >> >> > set Tip = fso.CreateTextFile("Tip.txt")
> >> >> >> > set file2 = fso.OpenTextFile("79600")
> >> >> >> > set file1 = fso.OpenTextFile("0241_2005A")
> >> >> >> > Do While Not file2.AtEndOfStream
> >> >> >> > str=file2.ReadLine
> >> >> >> > str2=file1.Readline
> >> >> >> > if Mid(str,310,12)=Mid(str2,3,12) then
> >> >> >> > 'Tip.WriteLine "P"&"B"&rtrim(Mid(str,310,12))&"241"&rtrim(Mid(str,12,10))&rtrim(Mid(str,437,9))&rtrim(Mid(str2,10,9))
> >> >> >> > 'above is what I want but below is my test of duplicating file names
> >> >> >> > 'to make sure it works
> >> >> >> > Tip.WriteLine "File2"&rtrim(Mid(str,310,12))&"File3"&rtrim(Mid(str2,3,12))
> >> >> >> > End if
> >> >> >> > Loop
> >> >> >> > Tip.Close
> >> >> >> > file2.Close
> >> >> >> > file3.Close
> >> >> >> > Wscript.Quit
> >> >> >> >
> >> >> >> > --
> >> >> >> > Scott Burns
> >> >> >> >
> >> >> >> >
> >> >> >> > "Al Dunbar [MS-MVP]" wrote:
> >> >> >> >
> >> >> >> >>
> >> >> >> >> "Scott Burns" <ScottBurns@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote in message
> >> >> >> >> news:A5255BB8-8C97-47F1-972E-CABF4ECA518C@xxxxxxxxxxxxxxxx
> >> >> >> >> > I am looking for some help with a script. I am trying to read two
> >> >> >> >> > text documents and read line by line and find similar data. After I
> >> >> >> >> > find that similar data like an account number I want to then merge
> >> >> >> >> > the two to the same line. Please help out if possible.
> >> >> >> >>
> >> >> >> >> Your difficulty may lie in the vagueness of the description of the
> >> >> >> >> problem and of the problem set.
> >> >> >> >>
> >> >> >> >> By "merge the two to the same line" do you mean to write out the two
> >> >> >> >> similar records to a third file on the same line, or to cause the
> >> >> >> >> original similar lines to be changed in the two input files such that
> >> >> >> >> they both contain a copy of each of the two similar values?
> >> >> >> >>
> >> >> >> >> By similar data, do you mean numerical values that are mathematically
> >> >> >> >> close to each other? How close? Or do you mean words that sound the
> >> >> >> >> same, like "bow" and "bough"? Or would "vvvvvvvv" be similar to
> >> >> >> >> "wwwwwww"?
> >> >> >> >>
> >> >> >> >> If the value in line 10 in file A is similar to the value in line 15
> >> >> >> >> in file B would you want to match line 15 in file A with line 10 in
> >> >> >> >> file B if they happened to be similar?
> >> >> >> >>
> >> >> >> >> Or... is it a line by line match: if line 1 in both files do not
> >> >> >> >> match, do nothing. if line 2 does match, write the output to a third
> >> >> >> >> file?
> >> >> >> >>
> >> >> >> >> Perhaps a sample set of data showing us which data you consider
> >> >> >> >> similar and which you do not, plus the desired result.
> >> >> >> >>
> >> >> >> >> /Al
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >>
> >> >>
> >> >
> >> >
> >>
> >>
> >
> >
>
>

