William McKee | 3 May 20:29

More on entities and Â

Hi all,

Well, the saga continues for me and the capital A circumflex. Most
recently, I am receiving the \302 character on a production server but
not on my test server. Both servers are running Debian Linux with Apache
1.3.29 and mod_perl 1.29.

In this case, I'm not using Petal::HTB which I thought was the culprit
in my previous posts. This time, with the help of Firefox, I was able to
determine that Petal is outputting an nbsp character (\240) but is
prepending it with a \302 character (the capital A circumflex, C2 in
hex). On my test server, the \302 character is not being output.

Chris had indicated this prepending behavior in his posts regarding the
copyright character and the capital A circumflex. At the time, I did not
realize that the same behavior was occurring with the nbsp entity. So it
appears to be affecting more than just the nbsp entity.

None of the previous recommendations such as changing the file encoding
or setting the meta tags are helping. For now, I'm going to do a global
search and replace on the output of the process command to remove this
character. However, this is not a good long-term solution due to the
hackish nature and the performance hit. Any suggestions or advice for
tracking down this bug would be most appreciated.

Regards,
William

-- 
Knowmad Services Inc.
http://www.knowmad.com

Grant McLean | 3 May 21:09

Re: More on entities and Â

William,

If your output encoding is UTF8 then every character beyond
0x7F will be two or more bytes.  The non-breaking space
character should be A2 A0 (I think).  So as long as you give
the browser the correct charset setting in your headers, it
should do exactly the right thing.

Regards
Grant

William McKee wrote:

> Hi all,
>
> Well, the saga continues for me and the capital A circumflex. Most
> recently, I am receiving the \302 character on a production server but
> not on my test server. Both servers are running Debian Linux with Apache
> 1.3.29 and mod_perl 1.29.
>
> In this case, I'm not using Petal::HTB which I thought was the culprit
> in my previous posts. This time, with the help of Firefox, I was able to
> determine that Petal is outputting an nbsp character (\240) but is
> prepending it with a \302 character (the capital A circumflex, C2 in
> hex). On my test server, the \302 character is not being output.
>
> Chris had indicated this prepending behavior in his posts regarding the
> copyright character and the capital A circumflex. At the time, I did not
> realize that the same behavior was occurring with the nbsp entity. So it
> appears to be affecting more than just the nbsp entity.
>
> None of the previous recommendations such as changing the file encoding
> or setting the meta tags are helping. For now, I'm going to do a global
> search and replace on the output of the process command to remove this
> character. However, this is not a good long-term solution due to the
> hackish nature and the performance hit. Any suggestions or advice for
> tracking down this bug would be most appreciated.
>
> Regards,
> William

William McKee | 3 May 22:01

Re: More on entities and Â


On Tue, May 04, 2004 at 07:09:05AM +1200, Grant McLean wrote:
> If your output encoding is UTF8 then every character beyond
> 0x7F will be two or more bytes. The non-breaking space
> character should be A2 A0 (I think). So as long as you give
> the browser the correct charset setting in your headers, it
> should do exactly the right thing.

Hi Grant,

Thanks for the quick response. How do I know what my output
encoding is? I can set the encoding of the file and the meta tag.
Should I be modifying the configuration of my Apache server?

The output I'm getting right now is C2A0. According to this table[1],
nbsp is 00A0 and A2A0 is not defined.

Thanks,
William

[1] http://www.columbia.edu/kermit/utf8-t1.html

-- 
Knowmad Services Inc.
http://www.knowmad.com

Grant McLean | 3 May 22:35

Re: More on entities and Â

William McKee wrote:

 > On Tue, May 04, 2004 at 07:09:05AM +1200, Grant McLean wrote:
 >
 >>If your output encoding is UTF8 then every character beyond
 >>0x7F will be two or more bytes.  The non-breaking space
 >>character should be A2 A0 (I think).  So as long as you give
 >>the browser the correct charset setting in your headers, it
 >>should do exactly the right thing.
 >
 >
 > Hi Grant,
 >
 > Thanks for the quick response. How do I know what my output
 > encoding is?

It will be UTF8 unless you do something to change it.
For example (assuming Perl 5.8):

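   my $html = $template->process(%args);
   # Lexical filehandle; the encoding layer converts characters to Latin-1 bytes
   open(my $fh, '>:encoding(iso-8859-1)', $path) or die "open($path): $!";
   print $fh $html;
   close($fh) or die "close($path): $!";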

 > I can set the encoding of the file and the meta tag.

Yes, this tells the browser how it should interpret the document:

   <meta http-equiv="Content-type" content="text/html; charset=utf-8">

Obviously this needs to match the encoding used to create the file.

 > Should I be modifying the configuration of my Apache server?

The 'meta http-equiv' tag above is equivalent to sending this header:

   Content-type: text/html; charset=utf-8

I've heard that not all browsers honour the charset suffix on the 
Content-type header so it might not be worth the effort.  The meta
tag has the advantage of staying with the document if the user does a 
'Save-as', whereas the HTTP header would be lost.

 > The output I'm getting right now is C2A0. According to this table[1],
 > nbsp is 00A0 and A2A0 is not defined.

Oops, my bad, I meant C2A0 but inexplicably typed A2A0.
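
For anyone who wants to check the byte values for themselves, here is a minimal sketch using the core Encode module (Perl 5.8 or later). The helper name hex_bytes is made up for this example; it just prints the bytes that U+00A0 turns into under each encoding.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the bytes of $char in the given encoding as uppercase hex pairs.
# (hex_bytes is a hypothetical helper name, not part of Encode or Petal.)
sub hex_bytes {
    my ($encoding, $char) = @_;
    my $octets = encode($encoding, $char);
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $octets;
}

my $nbsp = chr(0xA0);    # U+00A0 NO-BREAK SPACE, i.e. &nbsp;

print 'UTF-8:      ', hex_bytes('UTF-8', $nbsp), "\n";
print 'ISO-8859-1: ', hex_bytes('iso-8859-1', $nbsp), "\n";
```

In octal, C2 A0 is \302 \240, which matches the pair of bytes reported at the start of the thread.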

Cheers
Grant

Michele Beltrame | 4 May 10:15

Re: More on entities and Â

Hi!

> Thanks for the quick response. How do I know what my output encoding is?

Perl does exactly what it wants, that is to say it sets the encoding
depending on the input: if there are wide characters it goes with UTF8,
otherwise it stays with ISO8859-1. At least, this is what happens in
my Slackware 9.1's Perl 5.8.3.

> I can set the encoding of the file and the meta tag. Should I be
> modifying the configuration of my Apache server?

First of all you need to ensure your output is UTF8:

use Encode;
my $string = $template->process (%stuff);
$string = Encode::encode ('utf8', $string);

If you don't want to use this every time you output a template, you
can subclass Petal and override the process() method. See a recent
message from Jean-Michel Hiver about this.
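
A rough sketch of what such a subclass could look like follows. So that it runs without Petal installed, it inherits from a small stand-in class (Stub::Template, invented for this example); the class name My::Utf8Template is also made up. With Petal loaded you would presumably inherit from Petal instead, but treat that as an assumption and check it against Jean-Michel's message.

```perl
use strict;
use warnings;
use Encode ();

# Stand-in for Petal so this sketch runs on its own (hypothetical class).
{
    package Stub::Template;
    sub new     { my $class = shift; return bless {}, $class }
    sub process { my ($self, %args) = @_; return "Price:\x{A0}$args{price}" }
}

# Subclass whose process() always returns UTF-8 encoded bytes.
{
    package My::Utf8Template;
    our @ISA = ('Stub::Template');    # assumption: ('Petal') in real use
    sub process {
        my $self = shift;
        my $text = $self->SUPER::process(@_);     # character string
        return Encode::encode('UTF-8', $text);    # byte string
    }
}

my $bytes = My::Utf8Template->new->process(price => 42);
print length($bytes), " bytes\n";
```

The calling code stays the same as before; only the class name passed to new() changes.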

There are two ways, and you probably should use both. First of all there's
the header:

Content-type: text/html; charset=utf-8;

However, the most important thing is the META tag, as it overrides the
header settings. This will seem like a scandal to purists, but that's
the way it goes. ;-) Here's the tag:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
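
Putting the pieces together, a CGI-style sketch might look like the one below. The helper name utf8_response and the sample page are invented for illustration, and under mod_perl you would set the content type through the request object rather than printing the header yourself. The point is only that the header, the meta tag and the actual bytes all agree.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: build a CGI-style response whose HTTP header and
# meta element both declare UTF-8, and whose body really is UTF-8 bytes.
# No HTML escaping is done here; this is only a sketch.
sub utf8_response {
    my ($title, $body_text) = @_;
    my $html = join "\n",
        '<html><head>',
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />',
        "<title>$title</title>",
        '</head>',
        "<body>$body_text</body></html>",
        '';
    return "Content-Type: text/html; charset=utf-8\r\n\r\n"
         . encode('UTF-8', $html);
}

binmode STDOUT;    # emit the bytes untouched, with no PerlIO encoding layer
print utf8_response('Test', "before\x{A0}after");
```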

	Talk to you soon, Michele.

-- 
Michele Beltrame
http://www.italpro.net/mb/
ICQ# 76660101 - e-mail: mb@...

Jean-Michel Hiver | 4 May 10:50

Re: More on entities and Â



> There are two ways, and you probably should use both. First of all there's
> the header:
>
> Content-type: text/html; charset=utf-8;
>
> However, the most important thing is the META tag, as it overrides the
> header settings. This will seem like a scandal to purists, but that's
> the way it goes. ;-) Here's the tag:

It is an important tag, however a lot of browsers give precedence to the
HTTP headers.

Bottom line is, you need to declare your charset in both your HTTP
headers and your HTML template to be on the safe side.

Cheers,
Jean-Michel.

Chris Croome | 4 May 15:35

Re: More on entities and Â

Hi

On Tue 04-May-2004 at 08:35:13 +1200, Grant McLean wrote:

> I've heard that not all browsers honour the charset suffix on the
> Content-type header so it might not be worth the effort. The meta tag
> has the advantage of staying with the document if the user does a
> 'Save-as', whereas the HTTP header would be lost.

IE _never_ takes _any_ notice of the charset in the HTTP headers,
therefore it is essential to use a meta element in the document (in the
same way that it doesn't really care about what mime type things are
served as, file extensions are the only things apart from the content of
the file it looks at).

However, as far as I'm aware all other browsers do follow the HTTP
specification (where the charset in the headers takes precedence over
the charset set in the document) so it is important to use both and to
have their values the same!

Chris

-- 
Chris Croome <chris@...>
web design http://www.webarchitects.co.uk/
web content management http://mkdoc.com/

Michele Beltrame | 4 May 17:02

Re: More on entities and Â

Hi!

> Grant seems to be saying the default is UTF8 whereas Michele says it is
> iso8859-1.

It really depends on Perl, as it has a "use UTF8 if I find a UTF8 character"
behaviour. The only way you can be sure output is *always* UTF8 or
*always* ISO8859-1 is to use the Encode module, as per the example I
posted in my previous message.

> The next thing that confuses me is that I have Perl 5.8.3 installed on
> both systems. Only one is showing the extra character.

This is, of course, a mystery. ;-)

> Finally, my reading of utf8 docs says that a 00 should be appended to
> ANSI characters. Where is the A0 character coming from?

The 00 is not actually prepended to characters with code points 0-127
in UTF8. This is one of the things that make UTF8 different from
UCS2 (the fixed-width forerunner of UTF16), which always uses two bytes
per character. UTF8 characters occupy a variable number of bytes, and
that allows characters 0-127 to remain the same, thus maintaining
perfect compatibility with US ASCII documents.
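
To make the difference concrete, here is a small sketch with the core Encode module. The helper name byte_length and the three sample code points (A, nbsp and the euro sign) are just choices made for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Number of bytes $char occupies in the given encoding.
# (byte_length is a hypothetical helper name.)
sub byte_length {
    my ($encoding, $char) = @_;
    return length(encode($encoding, $char));
}

# An ASCII letter, the non-breaking space, and the euro sign.
for my $cp (0x41, 0xA0, 0x20AC) {
    my $char = chr($cp);
    printf "U+%04X  UTF-8: %d byte(s)  UTF-16BE: %d byte(s)\n",
        $cp, byte_length('UTF-8', $char), byte_length('UTF-16BE', $char);
}
```

The UTF-8 column grows from one byte to three as the code point rises, while the two-byte encoding stays at two for all three samples.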

	Michele.

-- 
Michele Beltrame
http://www.italpro.net/mb/
ICQ# 76660101 - e-mail: mb@...

William McKee | 4 May 17:50

Re: More on entities and Â

On Tue, May 04, 2004 at 05:02:43PM +0200, Michele Beltrame wrote:
> It really depends on Perl, as it has a "use UTF8 if I find a UTF8 character"
> behaviour. The only way you can be sure output is *always* UTF8 or
> *always* ISO8859-1 is to use the Encode module, as per the example I
> posted in my previous message.

OK, this explanation makes sense.

> > The next thing that confuses me is that I have Perl 5.8.3 installed on
> > both systems. Only one is showing the extra character.
> 
> This is, of course, a mystery. ;-)

Figures... :-/

> The 00 is not actually prepended to characters with code points 0-127
> in UTF8. This is one of the things that make UTF8 different from
> UCS2 (the fixed-width forerunner of UTF16), which always uses two bytes
> per character. UTF8 characters occupy a variable number of bytes, and
> that allows characters 0-127 to remain the same, thus maintaining
> perfect compatibility with US ASCII documents.

Thanks for the lesson. Can you explain what is happening that makes the
A0 character have a C2 prepended to it when output as utf-8? My
understanding of utf-8 was that it was compatible with latin1. This
behavior is *not* very compatible from my point of view.

One more point which may be at the root of my problems. I'm trying to
get Apache to add the Content-Type header using the following
declaration in my httpd.conf per the Apache docs:

    AddDefaultCharset utf-8

No matter if I have this in my main server configuration or the virtual
host configuration, if I do a `HEAD http://servername`, I get back a
Content-Type of iso-8859-1. If I view the page in Firefox and manually
tell Firefox to display it as UTF-8, all is well. Any ideas why Apache
isn't playing nice?

Thanks,
William

-- 
Knowmad Services Inc.
http://www.knowmad.com

Chris Croome | 4 May 17:59

Re: More on entities and Â

Hi

On Tue 04-May-2004 at 11:50:14AM -0400, William McKee wrote:
> 
> > > The next thing that confuses me is that I have Perl 5.8.3 installed on
> > > both systems. Only one is showing the extra character.
> > 
> > This is, of course, a mystery. ;-)
> 
> Figures... :-/

Perhaps your environment is different?

  $ printenv | grep LANG

?

> My understanding of utf-8 was that it was compatible with latin1.

No, UTF-8 is compatible with US ASCII not Latin 1.
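
A quick sketch of why the stray capital A circumflex shows up: take the UTF-8 bytes for nbsp and decode them both ways with the core Encode module. The helper name code_points is invented for this example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# List the code points you get when $octets is read in the given charset.
# (code_points is a hypothetical helper name.)
sub code_points {
    my ($encoding, $octets) = @_;
    my $chars = decode($encoding, $octets);
    return join ' ', map { sprintf('U+%04X', ord($_)) } split //, $chars;
}

my $bytes = encode('UTF-8', chr(0xA0));    # the two bytes C2 A0

print 'read as UTF-8:      ', code_points('UTF-8', $bytes), "\n";
print 'read as ISO-8859-1: ', code_points('iso-8859-1', $bytes), "\n";
```

Read as Latin 1, the same two bytes become U+00C2 (capital A circumflex) followed by U+00A0, which is what a browser shows when it is told, or guesses, iso-8859-1 for a page that was written out as UTF-8.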

> One more point which may be at the root of my problems. I'm trying
> to get Apache to add the Content-Type header using the following
> declaration in my httpd.conf per the Apache docs:
> 
>     AddDefaultCharset utf-8
> 
> No matter if I have this in my main server configuration or the
> virtual host configuration, if I do a `HEAD http://servername`, I
> get back a Content-Type of iso-8859-1. If I view the page in
> Firefox and manually tell Firefox to display it as UTF-8, all is
> well. Any ideas why Apache isn't playing nice?

Hmm, that's odd. 

I usually do it like this:

  AddType 'text/html; charset=UTF-8' .html

Chris

-- 
Chris Croome                               <chris@...>
web design                             http://www.webarchitects.co.uk/ 
web content management                               http://mkdoc.com/   

