Re: Searching web sub-pages




You could probably just parse the HTML yourself, but it
might be easier to use the DOM (after loading the page
into a WebBrowser control or IE instance). If you get
the document.all collection, you can walk it by index,
checking each element with something like:
If all(i).tagname = "A" Then

to filter links. An A tag has an "href" property that gives
you the link URL.
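
Here's a rough sketch of that walk, in case it helps. It
assumes a VB6 form with a WebBrowser control named
WebBrowser1; that name and the Links collection are just
made up for the example, and I haven't tried it against
your page.

```vb
' Assumes a VB6 form with a WebBrowser control named WebBrowser1.
' The Links collection is a made-up name for this sketch.
Private Links As New Collection

Private Sub WebBrowser1_DocumentComplete(ByVal pDisp As Object, URL As Variant)
    Dim doc As Object
    Dim el As Object
    Dim i As Long

    ' DocumentComplete also fires once per frame; wait for the top-level one.
    If Not (pDisp Is WebBrowser1.Object) Then Exit Sub

    Set doc = WebBrowser1.Document
    For i = 0 To doc.All.length - 1
        Set el = doc.All(i)
        If el.tagName = "A" Then
            ' Named anchors (<a name=...>) have no href; skip them.
            If Len(el.href) > 0 Then Links.Add el.href
        End If
    Next
End Sub
```

Call WebBrowser1.Navigate with the URL of your index page
and the collection should be filled once the page has
finished loading.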

Then you just download each page, or navigate
to it with a WebBrowser control. I assume you
weren't thinking of using the DOM to look for the
keywords. You could make the search a little cleaner by
getting document.body.innerText first, so that you only
search the visible text, but unless you're searching
for words like "TABLE" or "DIV" that also turn up in the
markup, there's probably not much point. If it were me,
I'd just download the 100 pages as files, to avoid the
bloat of loading them in IE and the security risk of
loading 100 pages into IE without knowing for sure that
they're safe. Then I would just search them with InStr.
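
A minimal sketch of the download-and-search approach,
assuming you're in VB6. URLDownloadToFile is a real API
in urlmon.dll, but the function name PageHasKeyword and
its parameters are invented for the example. Note that
it searches the raw HTML, tags and all.

```vb
' URLDownloadToFile lives in urlmon.dll. PageHasKeyword and its
' parameter names are invented for this sketch.
Private Declare Function URLDownloadToFile Lib "urlmon" _
    Alias "URLDownloadToFileA" (ByVal pCaller As Long, _
    ByVal szURL As String, ByVal szFileName As String, _
    ByVal dwReserved As Long, ByVal lpfnCB As Long) As Long

' Returns True if sKeyword occurs anywhere in the page at sURL.
Private Function PageHasKeyword(ByVal sURL As String, _
        ByVal sKeyword As String, ByVal sTempFile As String) As Boolean
    Dim f As Integer
    Dim sText As String

    ' A return value of 0 (S_OK) means the download succeeded.
    If URLDownloadToFile(0, sURL, sTempFile, 0, 0) <> 0 Then Exit Function

    ' Read the whole downloaded file into a string.
    f = FreeFile
    Open sTempFile For Binary Access Read As #f
    sText = Space$(LOF(f))
    Get #f, , sText
    Close #f

    ' vbTextCompare makes the search case-insensitive.
    PageHasKeyword = (InStr(1, sText, sKeyword, vbTextCompare) > 0)
End Function
```

You'd call it once per link URL, passing the keyword and
a scratch file path of your choosing, and note which URLs
come back True. A failed download simply returns False
here; you may want to log those separately.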


Okay, here's the scenario:

I've got a web page. That web page has many, MANY links (hundreds, I
think...I haven't counted). I want to search those pages for specific
keywords.

Having defined the requirement, I have NO clue where to start. This
doesn't HAVE to be done programmatically, necessarily, so I could use
some kind of web-crawling software if there's something free & easy to
use, but I thought it might not be a bad idea to try it
programmatically, just to get my feet wet. I'm sort of assuming I'd
want to use DOM, but I am utterly clueless to ANYTHING about it at all.

Am I generally on the right track here, or what would people suggest?
This is entirely a personal project, so I'm open to any solution of any
kind.



Rob



