Re: Junk characters when using StreamReader and StreamWriter
- From: Joergen Bech <jbech<NOSPAM>@<NOSPAM>post1.tele.dk>
- Date: Wed, 20 Jun 2007 20:22:55 +0200
á is just a single-byte character (strictly speaking it lies outside
7-bit ASCII, in the upper half of an 8-bit code page) and Word just
puts it into the html file without encoding.
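One way to see what is really behind the "á" is to decode the same raw bytes under two code pages. This is only a sketch, and it rests on an assumption not confirmed anywhere in this thread: that the bytes in question are &HA0, &H93 and &H94. In Windows-1252 these are a non-breaking space and the two curly double quotes; in the OEM code page 437 they are "á", "ô" and "ö". On .NET Core and later the code-page encodings have to be registered before GetEncoding(437) will work.

```vbnet
Imports System
Imports System.Text

Module DecodeDemo
    Sub Main()
        ' Assumption: these are the raw bytes in the Word-generated file.
        Dim raw() As Byte = {&HA0, &H93, &H94}
        Dim ansi As Encoding = Encoding.GetEncoding(1252) ' Windows Western European
        Dim oem As Encoding = Encoding.GetEncoding(437)   ' US console (OEM) code page

        For Each b As Byte In raw
            Dim one() As Byte = {b}
            ' Print the Unicode code point each code page maps the byte to.
            Console.WriteLine("&H{0:X2}: 1252 -> U+{1:X4}, 437 -> U+{2:X4}", _
                              b, AscW(ansi.GetString(one)), AscW(oem.GetString(one)))
        Next
    End Sub
End Module
```

If the two columns differ for a byte, then whichever tool shows "á" may simply be displaying the file with a different code page than the one it was written in.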
Why would you want to use System.Text.Encoding.Default?
Then you do not know what you are getting. You might get something
to work - until you run the application somewhere else in the world.
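As a minimal sketch of the alternative: name the encoding explicitly and use the same one for reading and writing, so the result does not depend on the machine's regional settings. The choice of code page 1252 is an assumption (check the charset declared in the file's meta tag), and the file names and the helper names ReadAllWith / WriteAllWith are made up for illustration.

```vbnet
Imports System.IO
Imports System.Text

Module ExplicitEncodingDemo
    ' Reads a whole file using the encoding passed in, not Encoding.Default.
    Function ReadAllWith(ByVal fileName As String, ByVal enc As Encoding) As String
        Using sr As New StreamReader(fileName, enc)
            Return sr.ReadToEnd()
        End Using
    End Function

    ' Writes the text back with the same encoding. Write (rather than
    ' WriteLine) avoids appending an extra line terminator.
    Sub WriteAllWith(ByVal fileName As String, ByVal content As String, ByVal enc As Encoding)
        Using sw As New StreamWriter(fileName, False, enc)
            sw.Write(content)
        End Using
    End Sub

    Sub Main()
        ' Assumption: Word saved the page as windows-1252.
        Dim enc As Encoding = Encoding.GetEncoding(1252)
        ' Hypothetical file names.
        Dim content As String = ReadAllWith("input.htm", enc)
        WriteAllWith("output.htm", content, enc)
    End Sub
End Module
```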
If you write the data out without going through the CleanHTML
function, do you then get a file which byte for byte is identical to
the original?
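A rough way to run that check, assuming both files are small enough to load into memory (the file names and the FilesIdentical helper are hypothetical):

```vbnet
Imports System
Imports System.IO

Module CompareDemo
    ' Returns True when both files have the same length and identical bytes.
    Function FilesIdentical(ByVal pathA As String, ByVal pathB As String) As Boolean
        Dim a() As Byte = File.ReadAllBytes(pathA)
        Dim b() As Byte = File.ReadAllBytes(pathB)
        If a.Length <> b.Length Then Return False
        For i As Integer = 0 To a.Length - 1
            If a(i) <> b(i) Then Return False
        Next
        Return True
    End Function

    Sub Main()
        ' Hypothetical names: the Word export and the copy written
        ' without going through CleanHTML.
        Console.WriteLine(FilesIdentical("original.htm", "roundtrip.htm"))
    End Sub
End Module
```

Note that the posted code uses WriteLine, which appends a line terminator, so the copy would differ by at least that trailing newline unless Write is used instead.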
Regards,
Joergen Bech
On Wed, 20 Jun 2007 06:46:38 -0700, Rob <robert@xxxxxxxxxxx> wrote:
Hi,
I have a VB.Net application that parses an HTML file. This file was an
MS Word document that was saved as a web page.
My application removes all unnecessary code generated by MS Word and
does some custom formatting needed by my client.
I use a StreamReader to read in the file...regular expressions to parse
and clean up the file...and a StreamWriter to write the new file. On
some HTML files that I parse, I get this character "á" showing up in a
lot of places.
I've used different types of encoding for my StreamReader and
StreamWriter which works sometimes but then my application doesn't work
when I parse a French HTML file.
This is what I'm doing:
Dim filename As String = myfileinfo.OriginalFileName
Dim sr As New StreamReader(filename, System.Text.Encoding.Default)
Dim textstream As String = sr.ReadToEnd()
'close the stream...don't need it anymore
sr.Close()
Dim newtext As String
'send the stream for cleaning
newtext = CleanHTML(textstream)
Dim OutputFileName As String = myfileinfo.NewFolderPathToFile
Dim fs As New FileStream(OutputFileName, FileMode.Create, FileAccess.Write)
Dim sw As New StreamWriter(fs, System.Text.Encoding.Default)
sw.WriteLine(newtext)
sw.Close()
These are the results:
<p class="MsoToc1" ><span
<a href="#_Toc169072483"><span>99(15)0<span>á </span>INSTALMENTPROGRAM</span></a></span></p>
<p class="MsoToc1" ><span
<a href="#_Toc169072484"><span>99(15)1<span>á </span>OVERVIEW OF THEINSTALMENT PROGRAM</span></a></span></p>
<p class="MsoNormal" ><span>áá </span>OP:<span>á
</span>BB<span>áááááááááááá </span>ACCT:<span>á </span>123456789
YR:<span>á </span>2004<span>áááááááááááááááááááááá </span>PG:<span>á
</span>1 of 1<span>áááááááááááááááááá </span>26FEB
2004 USER-ID</p>
I also have different characters representing quotes such as "ô" and "ö"
(not shown here).
When I do a "Save as Web Page" from MS Word 2000, the characters aren't
there but they show up after running through my program.
I'm assuming they represent characters that it doesn't understand but
does anyone know how to resolve this?
Thanks
Rob