Re: Convert HTML to XML or Parse HTML

From: David Elliott (DavidElliott_at_BellSouth.net.nospam)
Date: 02/11/04


Date: Wed, 11 Feb 2004 11:02:11 -0500

I have tried the SgmlReader but am having difficulty with some sites, such as www.msn.com.

If I could find a way to parse HTML using C/C++/C# I would be happy. All I really
need is an array of <tag> and <data> entries. Finer granularity is not necessary, just
the raw information. I do need the entire page, though, from the opening <html> to the closing </html>.

I would prefer an HTML to XML conversion, but as time is limited, any solution would be
appreciated.
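
To make the request concrete, here is a rough sketch in C++ of the kind of flat <tag>/<data> array described above. The names Token and tokenize are hypothetical, made up for illustration. This is a naive scanner, not a real HTML parser: it splits only on '<' and '>', so it will get comments, script contents, and a '>' inside an attribute value wrong.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record: one entry per tag, or per run of text between tags.
struct Token {
    bool is_tag;       // true for <...>, false for character data
    std::string text;  // tag contents without the angle brackets, or the raw text
};

// Naive scanner: walks the input once and splits it at '<' and '>' only.
std::vector<Token> tokenize(const std::string& html) {
    std::vector<Token> out;
    std::size_t pos = 0;
    while (pos < html.size()) {
        if (html[pos] == '<') {
            std::size_t end = html.find('>', pos);
            if (end == std::string::npos) {
                // Unterminated tag: keep the remainder as data and stop.
                out.push_back({false, html.substr(pos)});
                break;
            }
            out.push_back({true, html.substr(pos + 1, end - pos - 1)});
            pos = end + 1;
        } else {
            std::size_t end = html.find('<', pos);
            if (end == std::string::npos) {
                end = html.size();
            }
            out.push_back({false, html.substr(pos, end - pos)});
            pos = end;
        }
    }
    return out;
}
```

Calling tokenize on the full page text returns the entries in document order, from the opening <html> tag through the closing </html>, with whitespace between tags kept as data entries. Turning that into well-formed XML would still need a real parser that closes unclosed tags and quotes attributes.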

Thanks,
Dave

On Fri, 09 Jan 2004 03:23:29 GMT, v-schang@online.microsoft.com (Steven Cheng[MSFT]) wrote:

>Hi Q.Z,
>
>
>Thank you for using Microsoft Newsgroup Service. Based on your description,
>you are looking for some COM or dotnet components which can convert the
>html document into XML (XHTML) style document. Is my understanding correct?
>
>If so, I think Ken Cox has provided some good sites on this topic; they
>show two COM components. You may want to try them to see whether they
>help.
>
>Steven Cheng
>Microsoft Online Support
>
>Get Secure! www.microsoft.com/security
>(This posting is provided "AS IS", with no warranties, and confers no
>rights.)


