How to parse XML files with HTML text in them

I'm trying to parse some XML files that contain newspaper articles. Each
file is a separate article, and each element in the file will be posted
to a database. I previously wrote code to read XML files that were laid
out rigidly, and it worked without trouble, but these files are not
cooperating. They contain a lot of whitespace, are not organized neatly
line by line, and some of the elements contain HTML tags (for example,
the article text itself has <p>, <b>, <i> and other formatting tags in
it). I need to read the XML elements into variables that I can post to
the database, but my old code for reading XML does not work in this
situation. I've tried several different examples from various sites,
but nothing has worked so far.

Here is a sample file:

<company_main>
<articles>
<id>
558960
</id>
<location_id>
1
</location_id>
<title>
<p>NY Times counsel</p>
<p>speaks at MSU Law</p>
</title>
<summary>
This is just a bunch of summary information about the article that is in
this file.......
</summary>
<author_id>
1
</author_id>
<text>
<p>
This is<i> paragraph</i> 1 of the article itself. Lorem ipsum dolor sit
amet, consectetur adipiscing elit. Duis nec lorem a tellus pulvinar dapibus.
Proin ut lectus magna. Morbi velit mi, faucibus a malesuada non, vehicula a
leo. Nam dolor elit, adipiscing blandit aliquet non, pellentesque sit amet
justo. Nulla tempor risus in sapien rhoncus mollis. Suspendisse potenti.
Integer vel pulvinar risus.
</p>
<p>
This is<i> paragraph</i> 1 of the article itself. Mauris non dolor erat,
vitae elementum nisl. <b>Sed ac ante ac purus</b> hendrerit tincidunt quis
eget augue. Nam orci mauris, pulvinar vitae faucibus ac, varius quis nunc.
Vestibulum sed feugiat magna.
</p>
<p>
This is<i> paragraph</i> 1 of the article itself. Nam bibendum aliquam
adipiscing. Sed congue rutrum sagittis. Ut neque felis, scelerisque a
adipiscing sit amet, pulvinar sed nisl. Praesent metus tortor, iaculis vitae
tempor at, rhoncus eu felis. Proin luctus, magna sit amet dapibus bibendum,
leo urna semper velit, venenatis dictum quam enim at sem.
</p>
<p>
This is<i> paragraph</i> 1 of the article itself. Proin quis dolor vel
mauris vehicula lobortis in vel nunc. Nullam neque neque, auctor et rutrum
vitae, ultrices in nunc. Sed adipiscing interdum risus et euismod.
</p>
</text>
<date>
10/27/09
</date>
<type>
Published
</type>
<url>
</url>
</articles>
</company_main>

I'm sure it's obvious but I need to read the following:

id
location_id
title
summary
author_id
text
date
type
url

This didn't work (it kept finding tags that are not actually XML elements):

    Dim xrdr As New XmlTextReader(textFilesLocation & sArticleToPost)
    xrdr.WhitespaceHandling = WhitespaceHandling.None

    While xrdr.Read()

        If String.Compare(xrdr.Name, "id", True) = 0 Then
            ArticleID = Trim(xrdr.ReadElementString())
        End If

        If String.Compare(xrdr.Name, "location_id", True) = 0 Then
            LocationID = Trim(xrdr.ReadElementString())
        End If

        If String.Compare(xrdr.Name, "title", True) = 0 Then
            ArticleTitle = Trim(xrdr.ReadElementString())
        End If

        If String.Compare(xrdr.Name, "summary", True) = 0 Then
            ArticleSummary = Trim(xrdr.ReadElementString())
        End If

        If String.Compare(xrdr.Name, "text", True) = 0 Then
            ArticleText = Trim(xrdr.ReadElementString())
        End If

        If String.Compare(xrdr.Name, "author_id", True) = 0 Then
            AuthorID = Trim(xrdr.ReadElementString())
        End If

        If String.Compare(xrdr.Name, "date", True) = 0 Then
            ArticleDate = Trim(xrdr.ReadElementString())
        End If

        If String.Compare(xrdr.Name, "type", True) = 0 Then
            ArticleType = Trim(xrdr.ReadElementString())
        End If

        If String.Compare(xrdr.Name, "url", True) = 0 Then
            ArticleURL = Trim(xrdr.ReadElementString())
        End If

    End While

    xrdr.Close()

This also failed (the error said it found an invalid encoding):

    Dim m_xmld As XmlDocument
    Dim m_nodelist As XmlNodeList
    Dim m_node As XmlNode

    'Create the XML Document
    m_xmld = New XmlDocument()

    'Load the Xml file
    m_xmld.Load(textFilesLocation & sArticleToPost)

    'Get the list of name nodes
    m_nodelist = m_xmld.SelectNodes("/company_main/articles")

    'Loop through the nodes
    For Each m_node In m_nodelist

        ArticleID = m_node.Attributes.GetNamedItem("id").Value
        LocationID = m_node.Attributes.GetNamedItem("location_id").Value
        ArticleTitle = m_node.Attributes.GetNamedItem("title").Value

    Next
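
One thing worth noting about the sample file: id, location_id, title and
the rest are child elements of <articles>, not attributes, so
Attributes.GetNamedItem would return Nothing for them. Likewise,
ReadElementString only handles elements with simple text content, so it
fails on elements such as <title> and <text> that contain nested tags.
Below is a minimal sketch of reading the child elements through
XmlDocument instead. ReadChild, the module name and the shortened inline
sample are hypothetical and only for illustration, and the sketch does not
address the invalid encoding error, which would still have to be resolved
for the real files.

```vbnet
Imports System
Imports System.Xml

Module ArticleParseSketch

    ' Hypothetical helper: returns the trimmed content of a child element,
    ' or an empty string when the element is missing.
    ' keepMarkup = True keeps nested tags such as <p>, <b>, <i> (InnerXml);
    ' keepMarkup = False returns only the text content (InnerText).
    Function ReadChild(ByVal parent As XmlNode, ByVal name As String, _
                       ByVal keepMarkup As Boolean) As String
        Dim child As XmlNode = parent.SelectSingleNode(name)
        If child Is Nothing Then Return String.Empty
        If keepMarkup Then Return child.InnerXml.Trim()
        Return child.InnerText.Trim()
    End Function

    Sub Main()
        ' Shortened stand-in for one article file (assumption: the real
        ' files have the same /company_main/articles layout).
        Dim sampleXml As String = _
            "<company_main><articles>" & _
            "<id> 558960 </id>" & _
            "<title><p>NY Times counsel</p><p>speaks at MSU Law</p></title>" & _
            "<text><p>This is<i> paragraph</i> 1.</p></text>" & _
            "<url></url>" & _
            "</articles></company_main>"

        Dim doc As New XmlDocument()
        doc.LoadXml(sampleXml)   ' doc.Load(path) would read from a file instead

        For Each article As XmlNode In doc.SelectNodes("/company_main/articles")
            ' Plain values: text only. Formatted values: keep the HTML markup.
            Dim articleId As String = ReadChild(article, "id", False)
            Dim articleTitle As String = ReadChild(article, "title", True)
            Dim articleText As String = ReadChild(article, "text", True)
            Dim articleUrl As String = ReadChild(article, "url", False)

            Console.WriteLine(articleId)
            Console.WriteLine(articleTitle)
            Console.WriteLine(articleText)
            Console.WriteLine(articleUrl)
        Next
    End Sub

End Module
```

InnerXml is used for title and text so that the formatting tags survive
into the database value; InnerText would drop the tags and join the text
of adjacent <p> elements with no separator.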

Any help would be greatly appreciated!

Keith

