William McKee | 25 Feb 00:08

Re: &nbsp; to Â mystery

Chris,

Thanks for your feedback on this character encoding mystery and for the
info about Apache Bench and lynx. Those tips will prove useful in my
education about character encodings.

I've finally had a chance to look into this issue more. When I set the
meta tag, all works as expected. Without it I get the funny A character.

I'm hypothesizing that my test script fails when run in my shell because
LC_CTYPE or LANG or one of the other locale settings is not in utf8
(it's en_US). I tried to change it to check my assumption that setting
it to utf8 would correct the error but had difficulties so have
abandoned the effort.
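
A quick way to check which locale setting is actually in effect is to look at the codeset suffix of the locale name. The following is a minimal sketch, not anything from Petal; the helper names locale_codeset and effective_ctype are made up for illustration. It relies on the usual POSIX precedence (LC_ALL, then LC_CTYPE, then LANG). A name such as en_US carries no codeset at all, which on many systems means a Latin-1 default rather than UTF-8.

```perl
use strict;
use warnings;

# Extract the codeset part of a POSIX locale name such as "en_GB.UTF-8".
# Returns undef when the name carries no codeset (for example "en_US").
# The general shape is language[_territory][.codeset][@modifier].
sub locale_codeset {
    my ($locale) = @_;
    return unless defined $locale;
    return $locale =~ /\.([^@]+)/ ? $1 : undef;
}

# Pick the locale that governs character handling: LC_ALL overrides
# LC_CTYPE, which overrides LANG. Empty values count as unset.
sub effective_ctype {
    my (%env) = @_;
    for my $name (qw(LC_ALL LC_CTYPE LANG)) {
        return $env{$name} if defined $env{$name} && length $env{$name};
    }
    return;
}

my $locale  = effective_ctype(%ENV);
my $codeset = locale_codeset($locale);
printf "locale: %s, codeset: %s\n",
    defined $locale  ? $locale  : '(unset)',
    defined $codeset ? $codeset : '(none given)';
```

Run from the shell in question, this reports whether the session advertises a UTF-8 codeset before any test script is blamed.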

The thing that still baffles me is why I would get the Â when using
Petal v2.02 with Petal::Parser::HTB and not get it with straight Petal.
If, as Jean-Michel says, Petal is outputting everything in UTF8, it
seems that I'd be getting the Â in both.

Perhaps that character is being generated by HTB. Petal::Entities is
converting nbsp to \240 which is decimal 160. Is there a way to print
the 160 character on the command line to see what it generates? I guess
a simple Perl script would do the job....

    perl -e 'print "-\240-\n"'

Indeed, the above one-liner prints a space between the dashes. At this
point, I'm betting that it must be HTML::TreeBuilder converting that
character into something it thinks is printable which is causing the
[...]
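
The one-liner above can be extended to show the bytes involved. This is a minimal sketch using the core Encode module; the helper names hex_bytes and misread_as_latin1 are invented for illustration and are not part of Petal or HTML::TreeBuilder. It shows that U+00A0 (the character &nbsp; stands for) is one byte in ISO-8859-1 but two bytes in UTF-8, and that reading those two bytes as ISO-8859-1 gives Â followed by a non-breaking space.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+00A0 NO-BREAK SPACE, the character that &nbsp; stands for (octal \240).
my $nbsp = "\x{A0}";

# Return the bytes of a string in the given encoding as space-separated hex.
sub hex_bytes {
    my ($encoding, $string) = @_;
    my $bytes = encode($encoding, $string);
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

# Encode as UTF-8, then decode the bytes as ISO-8859-1, which is what a
# reader does when UTF-8 output is labelled (or assumed to be) Latin-1.
sub misread_as_latin1 {
    my ($string) = @_;
    return decode('ISO-8859-1', encode('UTF-8', $string));
}

print 'ISO-8859-1 bytes: ', hex_bytes('ISO-8859-1', $nbsp), "\n";   # A0
print 'UTF-8 bytes:      ', hex_bytes('UTF-8', $nbsp), "\n";        # C2 A0

# Print code points rather than raw characters so the result does not
# depend on the terminal's own encoding.
my $misread = misread_as_latin1($nbsp);
print 'misread as:       ',
    join(' ', map { sprintf('U+%04X', ord($_)) } split //, $misread), "\n";
```

U+00C2 is Â, so a single non-breaking space sent as UTF-8 and displayed as Latin-1 shows up as Â plus a space.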

Chris Croome | 25 Feb 15:01

Re: &nbsp; to Â mystery

Hi William

On Tue 24-Feb-2004 at 06:08:31 -0500, William McKee wrote:
> 
> I'm hypothesizing that my test script fails when run in my shell
> because LC_CTYPE or LANG or one of the other locale settings is not in
> utf8 (it's en_US). I tried to change it to check my assumption that
> setting it to utf8 would correct the error but had difficulties so
> have abandoned the effort.

Mail the script  -- I have a UTF-8 env and would be happy to test it.

> The thing that still baffles me is why I would get the Â when using
> Petal v2.02 with Petal::Parser::HTB and not get it with straight
> Petal.  If, as Jean-Michel says, Petal is outputting everything in
> UTF8, it seems that I'd be getting the Â in both.

I don't have any answer but I know that MKDoc has a bug very much like
this at the moment:

  Copyright Â© 2001-2002 MKDoc Ltd.

  http://mkdoc.com/help/

I haven't been able to work out how to reproduce this but lots of pages
with a (c) symbol in the rights metadata field end up with Â©
_sometimes_ ...

Chris


Mark Holland | 25 Feb 15:09

Re: &nbsp; to Â mystery

Chris Croome wrote:

>I don't have any answer but I know that MKDoc has a bug very much like
>this at the moment:
>
>  Copyright Â© 2001-2002 MKDoc Ltd.
>
>  
>
I had the same error on one of my pages. I converted the source template 
file from utf8 encoding to iso-8859-1 and it cleared things up. It's 
really weird but utf8 encoded template files always screw up.

-mark

William McKee | 25 Feb 17:32

Re: &nbsp; to Â mystery

On Wed, Feb 25, 2004 at 02:01:56PM +0000, Chris Croome wrote:
> Mail the script  -- I have a UTF-8 env and would be happy to test it.

It's small so I've attached the script and the template. Let me know if
you have any questions about it. I'd love to know if you see the Â
character in the second example (there's a &nbsp; char between the words
Sticky and Space).

> I don't have any answer but I know that MKDoc has a bug very much like
> this at the moment:
> 
>   Copyright Â© 2001-2002 MKDoc Ltd.

I see it as well. In fact, I went back to look at my test site and am
now seeing that character showing up again despite the meta tag. Dunno
why it went away for me yesterday but it's definitely there again
instead of a sticky space.
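
One mechanism that would produce this symptom even when the page is declared as UTF-8 is double encoding: text that is already UTF-8 bytes gets treated as Latin-1 characters and encoded a second time. Nothing in this thread confirms that Petal or HTML::TreeBuilder does this. The sketch below uses the core Encode module and two hypothetical helpers (double_encode, hex_dump) only to show what that failure would look like at the byte level.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical failure mode: encode to UTF-8, then encode the resulting
# bytes again as though each byte were a Latin-1 character.
sub double_encode {
    my ($text) = @_;
    my $once = encode('UTF-8', $text);   # U+00A9 becomes bytes C2 A9
    return encode('UTF-8', $once);       # those become C3 82 C2 A9
}

# Space-separated hex dump of a byte string.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

my $copyright = "\x{A9}";                # U+00A9 COPYRIGHT SIGN
my $bytes     = double_encode($copyright);
print hex_dump($bytes), "\n";

# A browser honouring charset=utf-8 decodes those four bytes into two
# characters: U+00C2 (the stray capital A with circumflex) and U+00A9.
my $shown = decode('UTF-8', $bytes);
print join(' ', map { sprintf('U+%04X', ord($_)) } split //, $shown), "\n";
```

If something like this were happening only on some code paths, it would also be consistent with the character appearing on some pages and not others.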

I even fired up Windows on my laptop to check; it's also showing the
character in both Firefox & IE6. Are you using Petal::Parser::HTB on the
scripts that produce these pages?

> I haven't been able to work out how to reproduce this but lots of pages
> with a (c) symbol in the rights metadata field end up with Â©
> _sometimes_ ...

The inconsistency is definitely the most annoying aspect of this whole
problem.

William

William McKee | 25 Feb 17:35

Re: &nbsp; to Â mystery

On Wed, Feb 25, 2004 at 02:09:29PM +0000, Mark Holland wrote:
> I had the same error on one of my pages. I converted the source template 
> file from utf8 encoding to iso-8859-1 and it cleared things up. It's 
> really weird but utf8 encoded template files always screw up.

Hi Mark,

I think that you've mentioned this before in response to this thread.
Now that I have a better understanding of what you mean, I changed the
meta tag that sets the Content-Type to iso-8859-1 as follows:

  <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

Sure enough the Â character is gone. Does it work for you Chris?

William

-- 
Knowmad Services Inc.
http://www.knowmad.com

Chris Croome | 25 Feb 18:01

Re: &nbsp; to Â mystery

Hi

On Wed 25-Feb-2004 at 11:35:05 -0500, William McKee wrote:
> 
>   <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
> 
> Sure enough the Â character is gone. Does it work for you Chris?

Not really, if you change the charset using mozilla you get this:

  © 

I'm going to try your script now...

Chris  

-- 
Chris Croome                               <chris@...>
web design                             http://www.webarchitects.co.uk/ 
web content management                               http://mkdoc.com/   

Chris Croome | 25 Feb 18:14

Re: &nbsp; to Â mystery

Hi

On Wed 25-Feb-2004 at 11:32:05 -0500, William McKee wrote:
> 
> It's small so I've attached the script and the template. Let me know
> if you have any questions about it. I'd love to know if you see the Â
> character in the second example (there's a &nbsp; char between the
> words Sticky and Space).

UTF-8 env:

  $ printenv | grep LANG
  LANG=en_GB.UTF-8

Running the script I get these two results:

  <title>Petal Test</title>
  <p>Sticky�Space</p>

  <title>Petal Test with Petal::Parser::HTB</title>
  <p>Sticky Space</p>

Is that what you expected?

Chris

-- 
Chris Croome                               <chris@...>
web design                             http://www.webarchitects.co.uk/ 
web content management                               http://mkdoc.com/   

William McKee | 25 Feb 18:39

Re: &nbsp; to Â mystery

On Wed, Feb 25, 2004 at 05:14:35PM +0000, Chris Croome wrote:
> Is that what you expected?

Not really. I'm surprised you get the ??? with straight Petal since I
thought the \240 character was a sticky space in utf8. Does the ??? mean
that the terminal just can't print the character? That could make sense.
I don't understand why the Â is not printed when using
Petal::Parser::HTB in a UTF8 environment. More mysteries!
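
On the question of the unprintable character: a lone byte 0xA0 (octal \240) is a valid non-breaking space in ISO-8859-1 but is not a valid sequence in UTF-8, where the same character is the two bytes C2 A0. A UTF-8 terminal typically shows a replacement character for the lone byte. The sketch below is an illustration only, with a hypothetical helper name; it uses the core Encode module, whose decode function substitutes U+FFFD for malformed input when no CHECK argument is given.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode raw bytes as UTF-8. With no CHECK argument, Encode replaces
# malformed input with U+FFFD REPLACEMENT CHARACTER instead of dying.
sub decode_utf8_lenient {
    my ($bytes) = @_;
    return decode('UTF-8', $bytes);
}

# The same visible text with the non-breaking space stored two ways.
my $latin1_style = decode_utf8_lenient("Sticky\xA0Space");      # lone A0
my $utf8_style   = decode_utf8_lenient("Sticky\xC2\xA0Space");  # C2 A0

# Character 6 (zero-based) is the one between "Sticky" and "Space".
printf "lone A0 decodes to U+%04X\n", ord(substr($latin1_style, 6, 1));
printf "C2 A0   decodes to U+%04X\n", ord(substr($utf8_style, 6, 1));
```

The first line reports U+FFFD and the second U+00A0, which would fit output that contains a bare 0xA0 byte being viewed in a UTF-8 environment.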

Here's my results:

$ printenv|grep LANG
LANG=en_US

    <title>Petal Test</title>
    <p>Sticky Space</p>

    ------------------------------------------------------------------------

    <title>Petal Test with Petal::Parser::HTB</title>
    <p>StickyÂ Space</p>

In the second example, I get the space but the Â character precedes it,
much like your problem with the copyright entity. I had originally
thought the Â was replacing the sticky space.

-Wm

-- 
Knowmad Services Inc.

Jean-Michel Hiver | 25 Feb 19:53

Re: &nbsp; to Â mystery

William McKee wrote:

>On Wed, Feb 25, 2004 at 05:14:35PM +0000, Chris Croome wrote:
>
>>Is that what you expected?
>
>Not really. I'm surprised you get the ??? with straight Petal since I
>thought the \240 character was a sticky space in utf8. Does the ??? mean
>that the terminal just can't print the character? That could make sense.
>I don't understand why the Â is not printed when using
>Petal::Parser::HTB in a UTF8 environment. More mysteries!
>
Just a quick note to tell everybody that I'm not dead, but that I have 
absolutely no idea what's going on. Our software can manage pages and 
pages of Arabic / Urdu / Chinese but the copyright symbol and the
non-breaking space seem to behave funny sometimes :-/

Maybe a s/Â / / somewhere along the line is what's _really_ needed :-)

William McKee | 25 Feb 21:18
Favicon

Re: &nbsp; to Â mystery

Hey Jean-Michel,

Good to hear from you! The devil is in the details.

> Maybe a s/Â / / somewhere along the line is what's _really_ needed :-)

I'm all for this kind of solution. However, I suggest you change it to
the following regex or it will not work in Chris' example:

    s/Â//g;
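
To make the difference concrete, here is a small sketch comparing the two substitutions on strings modelled on the examples in this thread. The sub names are hypothetical, and \x{C2} stands in for a literal Â so the script does not depend on the encoding of its own source file. The narrower pattern (shown here with /g added) only fires when a plain space follows; the broader one removes every Â, including any legitimate one in real text.

```perl
use strict;
use warnings;

# Narrow pattern: drop U+00C2 only when a plain space follows it.
sub strip_before_space {
    my ($text) = @_;
    $text =~ s/\x{C2} / /g;
    return $text;
}

# Broad pattern: drop every U+00C2, whatever follows it.
sub strip_everywhere {
    my ($text) = @_;
    $text =~ s/\x{C2}//g;
    return $text;
}

my $sticky    = "Sticky\x{C2} Space";              # stray char, then a space
my $copyright = "Copyright \x{C2}\x{A9} 2001";     # stray char, then (c) sign

# The narrow pattern fixes the first string but leaves the second alone.
print "narrow fixes sticky:    ",
    (strip_before_space($sticky) eq "Sticky Space" ? "yes" : "no"), "\n";
print "narrow fixes copyright: ",
    (strip_before_space($copyright) eq "Copyright \x{A9} 2001" ? "yes" : "no"), "\n";

# The broad pattern fixes both.
print "broad fixes copyright:  ",
    (strip_everywhere($copyright) eq "Copyright \x{A9} 2001" ? "yes" : "no"), "\n";

# Caveat: the broad pattern also eats a genuine capital A with circumflex.
print "broad damages real text: ",
    (strip_everywhere("\x{C2}ge") eq "ge" ? "yes" : "no"), "\n";
```

Either way this treats the symptom rather than the cause, so it is best seen as a stopgap until the source of the extra character is found.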

-Wm

-- 
Knowmad Services Inc.
http://www.knowmad.com

