Re: c# .net write html to word special characters not writing



On Thu, 21 May 2009 12:13:46 -0700, rhitam <rhitamsanyal@xxxxxxxxx> wrote:

Hi all,

I am trying to write HTML to a Word file. Basically I have a list of
HTML files and I need their content accumulated and put into one Word
file. That is not the problem. The problem is handling special
characters. My code is something like this:

string Htmltext = somehtml;
WriteToFile("somefile.doc", Htmltext);

You've got a problem right there. You are using the ".doc" extension but saving a file that is HTML (well-formed or otherwise).

In this case, you're lucky because Word has a lot of code in it to deal with users that lie to it, so once it fails to open the file you gave it as an actual Word document, it goes through other file formats it understands, detects the data as HTML, and interprets it that way. But you can save it a lot of time and trouble by using a proper extension, such as ".html" or ".htm", in the first place.

Also, other applications may not be so forgiving, so you might as well get into the habit of using the file extensions correctly.

public static void WriteToFile(string strPath, ref string strData,
FileMode FM, FileAccess FA, FileShare FSHR)

This is not the same method signature as the one you claim to be calling.

{
FileStream FS = File.Open(strPath, FM, FA, FSHR);
byte[] b = Encoding.UTF8.GetBytes(strData);
FS.Write(b, 0, b.Length);
FS.Close();
}

Now there are characters like the trademark character (&trade;) and
many others which are encoded with the correct escape sequence in the
original file. When they are written into the Word file, they show up
as junk characters. But if I just open the existing HTML file in
Word, then it shows correctly. Really need urgent help with this.

Hard to say without a specific example of the data and a concise-but-complete code example that reliably reproduces the problem.

Given what little code you've posted, there's no reason that the character entities such as "&trade;" shouldn't be preserved correctly. After all, your code is doing nothing more than (possibly, depending on the original source of data) converting from one character encoding to another and then changing the file extension from one that's correct to one that's incorrect. Word should wind up interpreting the document the same either way.

But without knowing exactly what you're doing, it's impossible to say for sure. If the "existing html file" is not in fact exactly the same characters you wind up writing to the new ".doc" file, then any number of differences in the way that Word ultimately winds up parsing the file data could explain what you're seeing. For example, if you don't write a BOM at the beginning of your UTF-8 file (which you don't), and the character is not in fact written in the file as "&trade;" but rather as the actual character, then Word may not realize the file is UTF-8 and instead interpret the bytes as something else (e.g. some ANSI code page).
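
To illustrate the BOM point, here is a minimal sketch, assuming the trademark character reaches your string as the literal character rather than as "&trade;". The class and method names (HtmlWriterSketch, WriteHtmlWithBom) are made up for this example and are not part of your code. Passing an explicit UTF8Encoding(true) to File.WriteAllText() writes the three-byte UTF-8 preamble (EF BB BF) ahead of the text, which gives Word something to detect the encoding from.

```csharp
using System.IO;
using System.Text;

public static class HtmlWriterSketch
{
    // Hypothetical helper for illustration; not the WriteToFile() from the post.
    public static void WriteHtmlWithBom(string path, string html)
    {
        // UTF8Encoding(true) asks for the byte order mark; File.WriteAllText()
        // emits that preamble (EF BB BF) before the encoded text.
        var utf8WithBom = new UTF8Encoding(true);
        File.WriteAllText(path, html, utf8WithBom);
    }
}
```

Calling HtmlWriterSketch.WriteHtmlWithBom("somefile.html", "<p>Acme\u2122</p>") would produce a file that starts with EF BB BF, with the trademark character stored as the UTF-8 bytes E2 84 A2. Whether that fixes what you're seeing depends on what your data actually contains, which is why a complete example is still needed.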

By the way, if your WriteToFile() method doesn't modify the value of the strData argument, there's no reason to pass the argument as "ref".
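
For what it's worth, a sketch of the method without "ref" might look like the following. It assumes the method really does nothing beyond what you posted; the class name FileWriterSketch is invented for the example. The "using" block is an extra change on my part: it closes the stream even if Write() throws.

```csharp
using System.IO;
using System.Text;

public static class FileWriterSketch
{
    // Same parameters as the posted method, minus "ref" on strData:
    // the method only reads the string, so passing by value is enough.
    public static void WriteToFile(string strPath, string strData,
        FileMode fm, FileAccess fa, FileShare fshr)
    {
        // "using" disposes (and so closes) the stream even on an exception.
        using (FileStream fs = File.Open(strPath, fm, fa, fshr))
        {
            // Encoding.UTF8.GetBytes() returns the encoded text only, no BOM.
            byte[] b = Encoding.UTF8.GetBytes(strData);
            fs.Write(b, 0, b.Length);
        }
    }
}
```

A call would then read something like FileWriterSketch.WriteToFile("somefile.html", Htmltext, FileMode.Create, FileAccess.Write, FileShare.None), where the three enum values are only my guess at what you'd want.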

Pete