Re: c# .net write html to word special characters not writing
- From: "Peter Duniho" <NpOeStPeAdM@xxxxxxxxxxxxxxxx>
- Date: Thu, 21 May 2009 12:34:49 -0700
On Thu, 21 May 2009 12:13:46 -0700, rhitam <rhitamsanyal@xxxxxxxxx> wrote:
Hi all,
I am trying to write HTML to a Word file. Basically I have a list of
HTML files and I need their content accumulated and put into one Word
file. That is not the problem. The problem is handling special
characters. My code is something like this:
string Htmltext = somehtml;
WriteToFile("somefile.doc", Htmltext);
You've got a problem right there. You are using the ".doc" extension but saving a file that is HTML (well-formed or otherwise).
In this case, you're lucky because Word has a lot of code in it to deal with users that lie to it, so once it fails to open the file you gave it as an actual Word document, it goes through other file formats it understands, detects the data as HTML, and interprets it that way. But you can save it a lot of time and trouble if you'd just use a proper extension, such as ".html" or ".htm" in the first place.
Also, other applications may not be so forgiving, so you might as well get into the habit of using the file extensions correctly.
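As a minimal sketch of the extension fix, assuming you want to keep the base file name and only swap the suffix, Path.ChangeExtension() does the work. The class and method names here are hypothetical, chosen only for illustration:

```csharp
using System;
using System.IO;

static class ExtensionSketch
{
    // Hypothetical helper: gives HTML content an extension that matches it.
    public static string ToHtmlPath(string path)
    {
        // Replaces whatever extension the path has (e.g. ".doc") with ".html".
        return Path.ChangeExtension(path, ".html");
    }
}
```

Passing "somefile.doc" through this helper yields a path ending in ".html", which Word and other applications can then treat as what it really is.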
public static void WriteToFile(string strPath, ref string strData,
FileMode FM, FileAccess FA, FileShare FSHR)
This is not the same method signature as the one you claim to be calling.
{
    FileStream FS = File.Open(strPath, FM, FA, FSHR);
    byte[] b = Encoding.UTF8.GetBytes(strData);
    FS.Write(b, 0, b.Length);
    FS.Close();
}
Now there are characters like the trademark character (&trade;) and
many others which are encoded with the correct escape sequence in the
original file. When they are written into the Word file, they show up
as junk characters. But if I just open the existing HTML file in
Word, then it shows correctly. Really need urgent help with this.
Hard to say without a specific example of the data and a concise-but-complete code example that reliably reproduces the problem.
Given what little code you've posted, there's no reason that the character entities such as "&trade;" shouldn't be preserved correctly. After all, your code is doing nothing more than (possibly, depending on the original source of data) converting from one character encoding to another and then changing the file extension from one that's correct to one that's incorrect. Word should wind up interpreting the document the same either way.
But without knowing exactly what you're doing, it's impossible to say for sure. If the "existing html file" is not in fact exactly the same characters you wind up writing to the new ".doc" file, then any number of differences in the way that Word ultimately winds up parsing the file data could explain what you're seeing. For example, if you don't write a BOM at the beginning of your UTF-8 file (which you don't), and the character is not in fact written in the file as "&trade;" but rather as the actual character, then Word may not realize the file is UTF-8 and instead interpret the bytes as something else (e.g. some ANSI code page).
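To make the BOM point concrete, here is a rough sketch, not a drop-in replacement for your method. It assumes the text may contain the literal trademark character (U+2122) rather than the entity, and the names BomSketch, WriteUtf8WithBom, EntityBytes and LiteralBytes are made up for this example. Encoding.GetBytes() never emits a BOM on its own, so the preamble has to be written explicitly:

```csharp
using System;
using System.IO;
using System.Text;

static class BomSketch
{
    // Hypothetical helper: writes text as UTF-8 with a leading byte order mark.
    public static void WriteUtf8WithBom(string path, string text)
    {
        Encoding utf8 = new UTF8Encoding(true); // true: GetPreamble() returns the BOM
        byte[] bom = utf8.GetPreamble();        // the three bytes EF BB BF
        byte[] body = utf8.GetBytes(text);      // GetBytes() itself never adds a BOM

        using (FileStream fs = File.Open(path, FileMode.Create, FileAccess.Write, FileShare.None))
        {
            fs.Write(bom, 0, bom.Length);
            fs.Write(body, 0, body.Length);
        }
    }

    // The entity is plain ASCII, so it survives any ASCII-compatible decoding.
    public static byte[] EntityBytes() { return Encoding.UTF8.GetBytes("&trade;"); }

    // The literal character becomes three non-ASCII bytes (E2 84 A2) in UTF-8,
    // which is what shows up as junk if the reader guesses an ANSI code page.
    public static byte[] LiteralBytes() { return Encoding.UTF8.GetBytes("\u2122"); }
}
```

If the file really contains only entities, the BOM should make no difference; it matters only when non-ASCII characters are written directly. As far as I know, File.WriteAllText() with Encoding.UTF8 passed explicitly also writes the preamble, which would be a shorter route to the same result.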
By the way, if your WriteToFile() method doesn't modify the value of the strData argument, there's no reason to pass the argument as "ref".
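For illustration, the same method without "ref" might look like the sketch below. It assumes nothing else in your code depends on the original signature; the wrapper class name FileSketch and the parameter names are mine, and the using block is an addition so the stream is closed even if Write() throws:

```csharp
using System;
using System.IO;
using System.Text;

static class FileSketch
{
    // Same behavior as the posted method, but the string is passed by value.
    public static void WriteToFile(string path, string data,
        FileMode mode, FileAccess access, FileShare share)
    {
        // "using" disposes the stream even when an exception occurs.
        using (FileStream fs = File.Open(path, mode, access, share))
        {
            byte[] bytes = Encoding.UTF8.GetBytes(data);
            fs.Write(bytes, 0, bytes.Length);
        }
    }
}
```

Strings are reference types and immutable, so passing one by value copies only the reference; "ref" buys nothing unless the method assigns a new string to the parameter.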
Pete