C# ANTLR can't parse Java code with non-Latin/non-ASCII characters


Hi all,

I am using ANTLR 2.7.5 for parsing Java files. I have the following
code in one of the Java files:

(ch == 'ä') ||
(ch == 'ü') ||
(ch == 'ö') ||
(ch == 'ß') ||
(ch == 'Ä') ||
(ch == 'Ö') ||
(ch == 'Ü') ||
(ch == 'é'))

I can't parse that file using the ANTLR parser. I get the following
error:

parser exception: Test1.java:17:33: unexpected char: '''
   at JavaLexer.nextToken() in c:\maulin\csharp\javaparser\javalexer.cs:line 541
   at antlr.TokenBuffer.fill(Int32 amount)
   at antlr.TokenBuffer.LA(Int32 i)
   at antlr.LLkParser.LA(Int32 i)
   at JavaRecognizer.identPrimary() in c:\maulin\csharp\javaparser\javarecognizer.cs:line 5663
   at JavaRecognizer.primaryExpression() in c:\maulin\csharp\javaparser\javarecognizer.cs:line 5180
   at JavaRecognizer.postfixExpression() in c:\maulin\csharp\javaparser\javarecognizer.cs:line 4828


I did some research and found that this happens because those
characters are outside the ASCII range. The Java compiler (javac)
converts them to Unicode at compile time, so the code works fine in
Java, but this parser fails on the file. Running 'native2ascii.exe' on
the file to convert the native characters to ASCII escapes made the
parse succeed, BUT I can't rely on that: I have tons of such files that
are parsed programmatically, and converting them all first is not
feasible for me.
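
For illustration, the kind of conversion native2ascii performs can also be done in code rather than by running the tool on each file. The Java sketch below is only an assumption about how that might look: the class `UnicodeEscaper` and the method `escapeNonAscii` are hypothetical names, not part of ANTLR or the JDK, and the sketch assumes the source text has already been decoded correctly into a string.

```java
public class UnicodeEscaper {
    // Replace every character outside 7-bit ASCII with a six-character
    // Unicode escape (backslash, 'u', four hex digits). ASCII is left as-is.
    // Hypothetical helper, written for illustration only.
    public static String escapeNonAscii(String source) {
        StringBuilder out = new StringBuilder(source.length());
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (c < 0x80) {
                out.append(c);
            } else {
                // Four lowercase hex digits, zero-padded.
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The string literal below contains a-umlaut between the quotes.
        String line = "(ch == '\u00e4') ||";
        System.out.println(escapeNonAscii(line));
    }
}
```

Run on the first comparison from the snippet above, this turns the a-umlaut into its six-character ASCII escape and leaves the rest of the line unchanged. The same loop should port to C# with little change, but it only helps if the file was read with the right encoding in the first place.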

So I tried the following, which is supposed to decode the input stream
using the encoding specified in the constructor. It still doesn't work:
I no longer get the exception, BUT those characters come out as spaces
instead of the Unicode characters that 'javac' produces (which I can
see in the .class file after decompiling it).

JavaLexer lexer = new JavaLexer(
    new StreamReader(s, System.Text.Encoding.GetEncoding("ISO-8859-1")));

I tried ASCII and Unicode encoding as well but nothing works. I keep
getting blank spaces in the parsing results.
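
Which reader encoding yields the right characters depends on how the file's bytes were actually written. The Java sketch below decodes the bytes of 'ä' under a few charsets to show the effect; the C# `StreamReader` in this post should behave analogously, though its replacement character may differ. `DecodingDemo` and `decode` are hypothetical names used only for illustration, and the byte values assume the file was saved as either ISO-8859-1 or UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodingDemo {
    // Decode raw bytes with the given charset. With this String constructor,
    // bytes the charset cannot decode are replaced (U+FFFD for US-ASCII).
    public static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        byte[] latin1 = {(byte) 0xE4};             // a-umlaut saved as ISO-8859-1
        byte[] utf8 = {(byte) 0xC3, (byte) 0xA4};  // a-umlaut saved as UTF-8

        // Matching charset: one character, U+00E4.
        printCodePoints(decode(latin1, StandardCharsets.ISO_8859_1));
        // 7-bit ASCII cannot represent byte 0xE4: it becomes U+FFFD.
        printCodePoints(decode(latin1, StandardCharsets.US_ASCII));
        // UTF-8 bytes read as ISO-8859-1: two wrong characters, U+00C3 U+00A4.
        printCodePoints(decode(utf8, StandardCharsets.ISO_8859_1));
    }

    // Print each UTF-16 code unit of the string in U+XXXX form.
    private static void printCodePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("U+%04X ", (int) c));
        }
        System.out.println(sb.toString().trim());
    }
}
```

If none of the encodings tried gives the expected characters, checking the file's raw bytes in a hex viewer shows which case applies before the text ever reaches the lexer.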

Does anybody know how to resolve this issue? It would be a great help.

Regards
Maulin
