Re: [jdom-interest] UTF8 charset issues...

Patrick JUSSEAU

2003-10-10


Alex,

Well, I am pretty sure it is not working, because if I save my XML
document and then try to read it back in my Java app, I get the
following exception:

java.io.UTFDataFormatException: Invalid byte 1 of 1-byte UTF-8 sequence.
     at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
     at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
     at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
....

The scenario to get this exception is:

1 - Create a JDOM Document and call element.setText("Æ") to set an
element's text value.

2 - Save this Document (i.e. create a local XML file) as test.xml.

3 - Read this XML document back, which leads to the above exception.
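
One plausible explanation for the exception, offered as an assumption since the saving code is not shown: the file was written through a Writer that used the platform default charset instead of UTF-8, while the XML declaration still says UTF-8. The sketch below uses only the JDK (7 or later, for StandardCharsets) and takes ISO-8859-1 as a stand-in for the platform default. The class name EncodingMismatchDemo and its helpers are hypothetical. It shows that "Æ" encoded that way is not a valid UTF-8 byte sequence, which is the kind of input a strict UTF-8 reader rejects.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class EncodingMismatchDemo {

    // Strictly decode the bytes as UTF-8; false means the input is malformed.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Render bytes as space-separated uppercase hex, e.g. "C3 86".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String text = "\u00C6"; // the character Æ

        // Assumption: ISO-8859-1 stands in for a non-UTF-8 platform default.
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

        System.out.println("ISO-8859-1 bytes: " + hex(latin1)
                + ", valid UTF-8: " + isValidUtf8(latin1));
        System.out.println("UTF-8 bytes:      " + hex(utf8)
                + ", valid UTF-8: " + isValidUtf8(utf8));
    }
}
```

If this is what happened, the usual remedy is to hand the serializer an OutputStream, or a Writer constructed with an explicit UTF-8 charset, so that the bytes on disk agree with the encoding declared in the file.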


Note: If I use an XML-aware tool like oXygen to look at test.xml, the
'Æ' character shows up as '�'.
However, if I save my document using:
String text = "Æ";
byte[] bytes = text.getBytes("UTF8");
text = new String(bytes);
setText(text);


In that case, my document is properly saved and I am able to read it
back in my Java app.
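
A possible reason the workaround appears to help, again assuming the file is written with the platform default charset: new String(bytes) decodes the UTF-8 bytes with that default, producing two "wrong" characters in memory, and the writer then encodes those two characters back into the original UTF-8 bytes. The two mismatches cancel out. The sketch below makes the default explicit as a parameter and uses ISO-8859-1 as a stand-in; WorkaroundDemo and its method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class WorkaroundDemo {

    // Mimics text.getBytes("UTF8") followed by new String(bytes), with the
    // platform default charset passed in explicitly.
    static String workaround(String text, Charset platformDefault) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, platformDefault);
    }

    // Mimics a Writer that encodes characters with the platform default.
    static byte[] writeWithDefault(String text, Charset platformDefault) {
        return text.getBytes(platformDefault);
    }

    public static void main(String[] args) {
        // Assumption: ISO-8859-1 stands in for the real platform default.
        Charset assumedDefault = StandardCharsets.ISO_8859_1;
        String original = "\u00C6"; // the character Æ

        String mangled = workaround(original, assumedDefault);
        byte[] onDisk = writeWithDefault(mangled, assumedDefault);
        String readBack = new String(onDisk, StandardCharsets.UTF_8);

        System.out.println("chars in memory after workaround: " + mangled.length());
        System.out.println("reads back as the original: " + readBack.equals(original));
    }
}
```

Under this assumption the in-memory string no longer contains 'Æ' at all, so the workaround only holds as long as the writer keeps using the same default charset, and only if that charset maps every byte value back to itself on the way out.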

I am using Java 1.4.1 on Mac OS X.

Thanks again

Patrick



On 10 Oct 2003, at 6:34 PM, Alex Rosen wrote:

> "just calling Element.setText("Æ") does not generate a correct UTF-8
> encoded document."
>
> How did you determine this? I.e. what tool did you use to look at the
> document? What I'm getting at is, I think that the document was right,
> but the tool you used to look at it made it look "wrong". Realize that
> the *bytes* of the UTF-8 encoding of Æ are going to look like garbage
> characters. If you view the file using a tool that uses any encoding
> other than UTF-8, it'll look mangled, even though it's not. The viewer
> you used (e.g. maybe Notepad or another text editor) probably read it
> using your machine's default encoding (such as Latin 1), so it looked
> garbled even though it was really OK (i.e. if your viewer used UTF-8
> to show it to you, it would be fine.)
>
> Encoding issues are really confusing, unfortunately.
>
> Alex
>
>>>> Patrick JUSSEAU <patrick@(protected) >>>
> Hi all,
>
> I am trying to understand how jdom handles character encodings. Here is
> what I am doing:
>
> I have a java app which reads data from a xml file (UTF-8 encoded). I
> am able to get text just fine using
> String str = anElement.getText();
>
> The resulting str string (Unicode encoded) contains exactly what was
> defined in my xml file. The charset translation is here transparent for
> me. For example if my xml document is:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <!DOCTYPE DOCUMENT SYSTEM "annonce.dtd">
> <DOCUMENT>
>    <TEXT>Æ</TEXT>
> </DOCUMENT>
>
> I get Æ in my str string.
>
>
> However when I am trying to generate a xml document with this exact
> same Æ value, just calling Element.setText("Æ") does not generate a
> correct UTF-8 encoded document. I have first to manually do this in my
> code:
>    String text = "Æ";
>    try{
>      byte[] bytes = text.getBytes("UTF8");
>      String newText = new String(bytes);
>      setText(newText);
>    }catch(UnsupportedEncodingException uee){
>      uee.printStackTrace();
>    }
>
> Why do I have to do this for the xml generation to work. Why isn't jdom
> taking care of the charset translation for me since the resulting file
> has UTF-8 encoding specified in it?
>
> Thanks for any help
>
> Patrick
>
> _______________________________________________
> To control your jdom-interest membership:
> http://lists.denveronline.net/mailman/options/jdom-interest/
> youraddr@(protected)
>
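
Alex's point in the quoted message, that correct UTF-8 bytes look mangled in a viewer using another encoding, can be illustrated with a small JDK-only sketch. ViewerDemo and its view method are hypothetical names, and ISO-8859-1 (Latin 1) is only one example of a non-UTF-8 viewer encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ViewerDemo {

    // Decode file bytes the way a viewer configured with this charset would.
    static String view(byte[] fileBytes, Charset viewerCharset) {
        return new String(fileBytes, viewerCharset);
    }

    public static void main(String[] args) {
        String element = "<TEXT>\u00C6</TEXT>"; // contains the character Æ

        // A correctly encoded UTF-8 file stores Æ as two bytes.
        byte[] fileBytes = element.getBytes(StandardCharsets.UTF_8);

        String asLatin1 = view(fileBytes, StandardCharsets.ISO_8859_1);
        String asUtf8 = view(fileBytes, StandardCharsets.UTF_8);

        // The Latin 1 view has one extra character where Æ should be.
        System.out.println("Latin 1 view length: " + asLatin1.length());
        System.out.println("UTF-8 view length:   " + asUtf8.length());
        System.out.println("UTF-8 view matches:  " + asUtf8.equals(element));
    }
}
```

The file is identical in both cases; only the decoding differs, which is why checking the raw bytes or parsing the file back with an XML parser is a more reliable test than looking at it in an editor.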