Re: Junk characters when using StreamReader and StreamWriter

á is just a single-byte character (code 225 in Windows-1252 and
ISO-8859-1, so outside the 7-bit ASCII range) and Word just puts it
into the html file without encoding it as an entity.
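
To see which bytes are involved, a small console sketch can decode the same bytes under two code pages. The sample bytes are an assumption: 0xA0, 0x93 and 0x94 are the Windows-1252 codes for a non-breaking space and a pair of curly quotes, and under the OEM code page 437 those same bytes read as á, ô and ö. Whether that is what happens in this particular case is a guess, not something the posts confirm.

```vbnet
Imports System
Imports System.Text

Module CodePageDemo
    Sub Main()
        ' On .NET Core / .NET 5+ the legacy code pages must be registered first:
        ' Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)

        ' Hypothetical sample bytes, not taken from the poster's file.
        Dim raw As Byte() = New Byte() {&HA0, &H93, &H94}

        Dim asWindows As String = Encoding.GetEncoding(1252).GetString(raw)
        Dim asOem As String = Encoding.GetEncoding(437).GetString(raw)

        ' Print the code points so that invisible characters show up too.
        For Each ch As Char In asWindows
            Console.Write("U+{0:X4} ", Convert.ToInt32(ch))
        Next
        Console.WriteLine()   ' U+00A0 U+201C U+201D
        For Each ch As Char In asOem
            Console.Write("U+{0:X4} ", Convert.ToInt32(ch))
        Next
        Console.WriteLine()   ' U+00E1 U+00F4 U+00F6 (á ô ö)
    End Sub
End Module
```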

Why would you want to use System.Text.Encoding.Default?

Then you do not know what you are getting. You might get something
to work - until you run the application somewhere else in the world.
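
A minimal sketch of the alternative is to name the encoding explicitly and use the same one for reading and writing. Windows-1252 is an assumption here; the charset declared in the file's own meta tag is the thing to check. The module and function names are made up for illustration.

```vbnet
Imports System.IO
Imports System.Text

Module ExplicitEncodingSketch
    ' Assumption: the Word-generated file is Windows-1252.
    ' On .NET Core / .NET 5+ this call needs
    ' Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) first.
    Private ReadOnly SourceEncoding As Encoding = Encoding.GetEncoding(1252)

    ' Reads the whole file using the explicitly chosen encoding.
    Function ReadHtml(ByVal path As String) As String
        Using sr As New StreamReader(path, SourceEncoding)
            Return sr.ReadToEnd()
        End Using
    End Function

    ' Overwrites the target file using the same encoding.
    Sub WriteHtml(ByVal path As String, ByVal html As String)
        Using sw As New StreamWriter(path, False, SourceEncoding)
            sw.Write(html)
        End Using
    End Sub
End Module
```

With a fixed code page the result no longer depends on the regional settings of the machine that runs the application.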

If you write the data out without going through the CleanHTML
function, do you then get a file which byte for byte is identical to
the original?
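
One way to run that check is sketched below: read the file, write it straight back out with the same encoding, and compare the bytes. The two command-line arguments are hypothetical paths, and Encoding.Default is used only because the posted code uses it.

```vbnet
Imports System
Imports System.IO
Imports System.Text

Module RoundTripCheck
    ' Returns True when both files contain exactly the same bytes.
    Function FilesAreIdentical(ByVal pathA As String, ByVal pathB As String) As Boolean
        Dim a As Byte() = File.ReadAllBytes(pathA)
        Dim b As Byte() = File.ReadAllBytes(pathB)
        If a.Length <> b.Length Then Return False
        For i As Integer = 0 To a.Length - 1
            If a(i) <> b(i) Then Return False
        Next
        Return True
    End Function

    Sub Main(ByVal args As String())
        ' args(0): original file, args(1): scratch copy (hypothetical paths)
        Dim enc As Encoding = Encoding.Default   ' same choice as the posted code
        Dim text As String
        Using sr As New StreamReader(args(0), enc)
            text = sr.ReadToEnd()
        End Using
        Using sw As New StreamWriter(args(1), False, enc)
            sw.Write(text)   ' Write, not WriteLine, so no extra line break is appended
        End Using
        Console.WriteLine(FilesAreIdentical(args(0), args(1)))
    End Sub
End Module
```

If this prints False, the reader/writer pair is already changing the data before CleanHTML is involved.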

Regards,

Joergen Bech



On Wed, 20 Jun 2007 06:46:38 -0700, Rob <robert@xxxxxxxxxxx> wrote:

Hi,
I have a VB.Net application that parses an HTML file. This file was an
MS Word document that was saved as a web page.

My application removes all unnecessary code generated by MS Word and
does some custom formatting needed by my client.

I use a StreamReader to read in the file, regular expressions to parse
and clean up the file, and a StreamWriter to write the new file. On
some HTML files that I parse, I get this character "á" showing up in a
lot of places.
I've tried different encodings for my StreamReader and StreamWriter,
which sometimes works, but then my application doesn't work when I
parse a French HTML file.

This is what I'm doing:

Dim filename As String = myfileinfo.OriginalFileName
Dim sr As New StreamReader(filename, System.Text.Encoding.Default)

Dim textstream As String = sr.ReadToEnd()
'close the stream...don't need it anymore
sr.Close()

Dim newtext As String

'send the stream for cleaning
newtext = CleanHTML(textstream)

Dim OutputFileName As String = myfileinfo.NewFolderPathToFile

Dim fs As New FileStream(OutputFileName, FileMode.Create, FileAccess.Write)

Dim sw As New StreamWriter(fs, System.Text.Encoding.Default)

sw.WriteLine(newtext)
sw.Close()

These are the results:

<p class="MsoToc1" ><span
<a href="#_Toc169072483"><span>99(15)0<span>á </span>INSTALMENT
PROGRAM</span></a></span></p>

<p class="MsoToc1" ><span
<a href="#_Toc169072484"><span>99(15)1<span>á </span>OVERVIEW OF THE
INSTALMENT PROGRAM</span></a></span></p>

<p class="MsoNormal" ><span>áá </span>OP:<span>á
</span>BB<span>áááááááááááá </span>ACCT:<span>á </span>123456789
YR:<span>á </span>2004<span>áááááááááááááááááááááá </span>PG:<span>á
</span>1 of 1<span>áááááááááááááááááá </span>26FEB
2004 USER-ID</p>

I also have different characters representing quotes such as "ô" and "ö"
(not shown here).

When I do a "Save as Web Page" from MS Word 2000, the characters aren't
there, but they show up after running the file through my program.

I'm assuming they represent characters that it doesn't understand, but
does anyone know how to resolve this?


Thanks
Rob


