Re: What is a good tool to search for files/text?

From: David Candy (david_at_mvps.org)
Date: 04/05/04


Date: Tue, 6 Apr 2004 08:43:58 +1000


Most likely those text files are Unicode (UTF-16), so each character is stored as two bytes.

In Basic syntax, the stored string looks like this:

"E" & Chr(0) & "d" & Chr(0) & "i" & Chr(0) (et cetera)

However, the inbuilt search for plain text won't find Unicode text either unless the file starts with a Unicode header (Unicode text files have a two-byte binary header). So in a database, which won't have that header, the inbuilt plain text search won't find EditPad if it's stored as a Unicode string (nor will the 95/98 search).
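To make the point about the header concrete, here is a minimal Python sketch (not part of the original post; the function name `has_utf16_bom` and the sample byte strings are made up for illustration). It shows that a word stored as UTF-16 has a zero byte after each letter, so a byte-for-byte search for the plain ASCII word misses it, and it checks for the two-byte header (byte order mark) at the start of the data.

```python
import codecs

def has_utf16_bom(data: bytes) -> bool:
    """Return True if data starts with a UTF-16 byte order mark (FF FE or FE FF)."""
    return data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))

word = "EditPad"
ansi = word.encode("ascii")        # the seven plain bytes of the word
utf16 = word.encode("utf-16-le")   # each letter followed by a zero byte

# A Unicode text file as an editor would save it: header, then the text.
text_file = codecs.BOM_UTF16_LE + utf16
# The same string embedded in a binary file such as a database: no header.
database_blob = b"\x01\x02" + utf16 + b"\x03"

plain_hit = ansi in database_blob      # the zero bytes break the plain match
unicode_hit = utf16 in database_blob   # searching for the encoded form works
```

Here `plain_hit` comes out False and `unicode_hit` True, which is the behaviour described above for data without the header.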

I've attached a RegExp searcher. It searches files for a string (it's very powerful and very slow). Syntax is below. If searching non-Unicode files (such as a database) for a Unicode string, use

E\x00d\x00i\x00t\x00P\x00a\x00d
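The same kind of pattern can be tried out in Python. This is only a sketch under assumptions: it uses Python's `re` module on raw bytes, not the attached RegExp searcher, and the names `utf16_pattern` and `search_tree` are hypothetical. It builds the zero-byte pattern from a word and walks a folder tree looking for the word in either plain or UTF-16 (little-endian) form.

```python
import os
import re

def utf16_pattern(word: str) -> bytes:
    r"""Build a byte regex with \x00 between the characters of an ASCII word,
    for example 'Ed' becomes E\x00d."""
    return rb"\x00".join(re.escape(ch.encode("ascii")) for ch in word)

def search_tree(root: str, word: str):
    """Yield paths under root whose raw bytes contain word as ASCII or UTF-16LE."""
    plain = re.escape(word.encode("ascii"))
    pattern = re.compile(plain + b"|" + utf16_pattern(word))
    for folder, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            try:
                with open(path, "rb") as handle:
                    # Reads the whole file into memory: simple, but slow on big files.
                    if pattern.search(handle.read()):
                        yield path
            except OSError:
                continue  # unreadable file: skip it
```

A call such as `list(search_tree(".", "EditPad"))` would list the matching files under the current folder. The match is case-sensitive, so the pattern has to use the same capitalisation as the stored text.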

      Character Description
      \ Marks the next character as either a special character or a literal. For example, "n" matches the character "n". "\n" matches a newline character. The sequence "\\" matches "\" and "\(" matches "(".
      ^ Matches the beginning of input.
      $ Matches the end of input.
      * Matches the preceding character zero or more times. For example, "zo*" matches either "z" or "zoo".
      + Matches the preceding character one or more times. For example, "zo+" matches "zoo" but not "z".
      ? Matches the preceding character zero or one time. For example, "a?ve?" matches the "ve" in "never".
      . Matches any single character except a newline character.
      (pattern) Matches pattern and remembers the match. The matched substring can be retrieved from the resulting Matches collection, using Item [0]...[n]. To match parentheses characters ( ), use "\(" or "\)".
      x|y Matches either x or y. For example, "z|wood" matches "z" or "wood". "(z|w)oo" matches "zoo" or the "woo" in "wood".
      {n} n is a nonnegative integer. Matches exactly n times. For example, "o{2}" does not match the "o" in "Bob," but matches the first two o's in "foooood".
      {n,} n is a nonnegative integer. Matches at least n times. For example, "o{2,}" does not match the "o" in "Bob" and matches all the o's in "foooood." "o{1,}" is equivalent to "o+". "o{0,}" is equivalent to "o*".
      {n,m} m and n are nonnegative integers. Matches at least n and at most m times. For example, "o{1,3}" matches the first three o's in "fooooood." "o{0,1}" is equivalent to "o?".
      [xyz] A character set. Matches any one of the enclosed characters. For example, "[abc]" matches the "a" in "plain".
      [^xyz] A negative character set. Matches any character not enclosed. For example, "[^abc]" matches the "p" in "plain".
      [a-z] A range of characters. Matches any character in the specified range. For example, "[a-z]" matches any lowercase alphabetic character in the range "a" through "z".
      [^m-z] A negative range of characters. Matches any character not in the specified range. For example, "[^m-z]" matches any character not in the range "m" through "z".
      \b Matches a word boundary, that is, the position between a word and a space. For example, "er\b" matches the "er" in "never" but not the "er" in "verb".
      \B Matches a non-word boundary. "ea*r\B" matches the "ear" in "never early".
      \d Matches a digit character. Equivalent to [0-9].
      \D Matches a non-digit character. Equivalent to [^0-9].
      \f Matches a form-feed character.
      \n Matches a newline character.
      \r Matches a carriage return character.
      \s Matches any white space including space, tab, form-feed, etc. Equivalent to "[ \f\n\r\t\v]".
      \S Matches any nonwhite space character. Equivalent to "[^ \f\n\r\t\v]".
      \t Matches a tab character.
      \v Matches a vertical tab character.
      \w Matches any word character including underscore. Equivalent to "[A-Za-z0-9_]".
      \W Matches any non-word character. Equivalent to "[^A-Za-z0-9_]".
      \num Matches num, where num is a positive integer. A reference back to remembered matches. For example, "(.)\1" matches two consecutive identical characters.
      \n (octal) Matches n, where n is an octal escape value. Octal escape values must be 1, 2, or 3 digits long. For example, "\11" and "\011" both match a tab character. "\0011" is the equivalent of "\001" & "1". Octal escape values must not exceed 256. If they do, only the first two digits comprise the expression. Allows ASCII codes to be used in regular expressions.
      \xn Matches n, where n is a hexadecimal escape value. Hexadecimal escape values must be exactly two digits long. For example, "\x41" matches "A". "\x041" is equivalent to "\x04" & "1". Allows ASCII codes to be used in regular expressions.
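
A few rows of the table can be checked quickly. The sketch below uses Python's `re` module rather than the VBScript-style RegExp engine the table describes, so treat it as an approximation: the two agree on the constructs used here, but not on everything (octal escapes such as "\11" are handled differently in Python, for instance). The helper name `first` is made up for illustration.

```python
import re

def first(pattern: str, text: str):
    """Return the first substring of text matched by pattern, or None."""
    found = re.search(pattern, text)
    return found.group(0) if found else None

checks = {
    "star":      first(r"zo*", "z"),               # zero or more o's
    "plus":      first(r"zo+", "z"),               # needs at least one o
    "optional":  first(r"a?ve?", "never"),         # the "ve" in "never"
    "either":    first(r"(z|w)oo", "wood"),        # the "woo" in "wood"
    "exactly":   first(r"o{2}", "foooood"),        # the first two o's
    "negset":    first(r"[^abc]", "plain"),        # first char not a, b or c
    "boundary":  first(r"er\b", "verb"),           # "er" is not at a word end
    "nonbound":  first(r"ea*r\B", "never early"),  # the "ear" in "early"
    "backref":   first(r"(.)\1", "foooood"),       # two identical characters
    "hex":       first(r"\x41", "ABC"),            # hex 41 is "A"
}

for name, value in checks.items():
    print(name, repr(value))
```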

-- 
----------------------------------------------------------
http://www.g2mil.com/Dec2003.htm
"jerry s" <js@spam.com> wrote in message news:uOffFq1GEHA.3164@TK2MSFTNGP11.phx.gbl...
> --
> In your sample (via the command line search script), the search engine only
> looks for "html" matches.  For example, it found text containing *.html (such
> as web addresses).
> 
> My sample:
> I changed the path in my cmd prompt:
> c:\program files\document and settings\<user>\my documents\
> 
>   for /r  %A in (*.*) do find /n "EditPad" "%A"&&pause
> 
> The result:
> It found text files that contain a web address of "...EditPadLite.com".
> 
> However, a text file that doesn't have a web address but contains the word
> "EditPad" was "not" listed as a match.
> 
> Also, the command prompt search disregards WinZip files (*.zip). I have
> "thousands of files" and if I need to look up a document that talks about
> "EditPad" (without a web address in it), my result is zero.
> 
> Side notes:
> I think a lot of Users want things that are simple.  Perhaps a User may
> post his comments with "attitude" (as seen from the original poster of this
> thread) due to his frustrations, as opposed to an intent to offend the
> volunteers of this forum.  Personally, I look directly at a poster's
> question (or his intent) and ignore all his "frustrated add-on remarks".
> --
> ---------------
> "David Candy" <david@mvps.org> wrote in message
> news:%231fuzA1GEHA.1180@TK2MSFTNGP09.phx.gbl...
> 
> It's actually to stop irrelevant hits.
> 
> EG If you're a bridge designer you may deal with spans. But span is also an
> HTML keyword. If this user searches for span they will find nearly every
> HTML document on their computer (in a 95 search), none will be relevant. XP
> searches for user data not formatting or other internal binary data.
> 
> Best way to think of it is like Google. Google does the same things.
> 
> And like Google it can do summaries of pages. XP also supports metadata
> searching (eg pictures 640 px wide).
> 
> You can always use a command prompt
> for /r c:\ %A in (*.*) do find /n "XP" "%A"&&pause
> 
> Which searches all files for XP, displaying results and waiting on each file
> with the term found.
> -- 
> ----------------------------------------------------------
> http://www.g2mil.com/Dec2003.htm
> "jerry s" <js@spam.com> wrote in message
> news:ubl1910GEHA.3068@TK2MSFTNGP11.phx.gbl...
> > --
> > I understand the XP team designed it that way.  I believe one rationale
> > behind that design is to "limit" the number of hits. The XP design team
> > assumes many of the results would be unrelated to the User's query.
> >
> > The trade off for this scheme is the OS will also omit legit matches.
> >
> > I prefer the opposite theory.  Let the OS do the shotgun approach. Look for
> > all matches, then let the User "refine" the query by his/her filter
> > criteria.
> > --
> > --------------
> > "David Candy" <david@mvps.org> wrote in message
> > news:e4rf9a0GEHA.704@tk2msftngp13.phx.gbl...
> > It has no problems. It works exactly as designed.
> > -- 
> > ----------------------------------------------------------
> > http://www.g2mil.com/Dec2003.htm
> > "jerry s" <js@spam.com> wrote in message
> > news:u5gnDY0GEHA.4012@TK2MSFTNGP09.phx.gbl...
> >
> > Unfortunately, Win XP's "text" search engine has some problems with
> > some filter components that ignore certain "text" words or phrases.  You may
> > need to register some of the filter components.
> >
> > To answer your question, there is an alternative utility that can do
> > searches for "text" as in all previous versions of Windows.
> >
> > It's called  Salamander File Manager.
> >
> >     http://www.altap.cz/download.html#salrel
> >
> >  Look for version 1.52 (free version).
> > --
> > --------------
> > "P. Burrows" <me@privacy.net> wrote in message
> > news:MPG.1adbb1b2e51d5636989bbe@news.usenetserver.com...
> >
> > I was searching for some text in a file and the windows (lame) find
> > files/text system didn't list any files? Then i opened some at random in
> > an editor and found the file i was looking for. It was there, in the
> > directories i had searched - but still their find function didn't show
> > anything (why am i not surprised)
> >
> > Can anyone suggest a good replacement for it?
> >
> 
> 



