Re: Wrong charset convert
André Warnier <aw <at> ice-sa.com>
2009-07-01 12:11:18 GMT
ejirkae <at> seznam.cz wrote:
> This is that problem: http://sgo.happyforever.com/test.php
> Try it please, thanks.
> ------------ Original message ------------
> From: <ejirkae <at> seznam.cz>
> Subject: [users <at> httpd] Wrong charset convert
> Date: 01.7.2009 00:03:06
> I have installed Apache 2.2.11 with PHP 5.2.8 on Windows XP SP3.
> Windows uses the Windows-1250 charset (Czech localization). I want
> to install the MediaWiki software, which uses the utf-8 charset.
> When I upload a file with non-English characters in its name,
> its name is saved in utf-8 format. When I try to open such a file in
> a web browser, the server sends a 404 Not Found status.
> Upload a file by using simple html upload form, which is encoded in
> <!-- this is only part of whole code --!>
> <meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
> <form enctype="multipart/form-data" action="uploader.php" method=
> <input type="hidden" name="MAX_FILE_SIZE" value="100000" />
> Choose a file to upload: <input name="uploadedfile" type="file" /><
> br />
> <input type="submit" value="Upload File" />
> File named for example "složka.png" is saved to hard drive with name
> "sloĹľka.png" in Windows-1250 encoding.
(This is not true, see below)
> If that upload form was
> encoded with charset=Windows-1250 then it'll be right named "složka.
> png", but charset must be utf-8.
> So suppose that we have server with uploaded file: http://something.
> com/složka.png. On linux it is working fine. But on Windows server
> you must use address like that: http://something.com/sloĹľka.png and
> that's not good for MediaWiki.
> I don't know if this is understandable enough. I need to set up Apache to
> ignore the windows-1250 charset and use the original utf-8 for decoding URLs.
> httpd.conf is original (with php installation).
> Thanks for help
> Jiri Eichler
The issue you describe above is not an easy one.
It will only really be solved when the powers-that-be on the
Internet finally decide to move to an HTTP version 2.0 in which
everything is, by default, Unicode encoded as UTF-8.
Until then, there will be confusion and difficulties for anyone who does
not use English as their main language.
--- Part I -------
First, about your last paragraph :
Apache will not use UTF-8 to decode a URL, because that would be wrong
according to the current RFCs that specify how the WWW works.
The "law" in that respect is defined here :
See section : 1.5. URI Transcribability
It is all a bit obscure, but basically what it boils down to is :
when a server receives a URL :
- it first decodes the URL, to convert the "percent-escaped" characters
back into single characters. That means, for instance, that a "%20" is
decoded into a space.
- then it does *no further decoding*, it takes the bytes *as they are*.
They are *not supposed* to be decoded any further, using iso-8859-1,
cp-1250, UTF-8 or whatever.
(If Apache did that, then Apache would not respect the RFC).
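As a minimal sketch of the decoding step described above (in Python, using the standard urllib.parse module; the helper name decode_path and the sample paths are only illustrations, not anything Apache itself exposes): undoing the percent-escapes yields bytes, and nothing in that step says which charset those bytes are in.

```python
from urllib.parse import unquote_to_bytes

def decode_path(raw_url_path: str) -> bytes:
    """Undo the percent-escapes only; the result is raw bytes, not text."""
    return unquote_to_bytes(raw_url_path)

print(decode_path("/my%20file.html"))   # b'/my file.html'
print(decode_path("/slo%C5%BEka.png"))  # b'/slo\xc5\xbeka.png'
print(decode_path("/slo%9Eka.png"))     # b'/slo\x9eka.png'
```

The last two paths could both be meant as the same name by two different browsers, but after decoding they are simply two different byte strings.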
Now, let's say that this URL contains a path pointing to some resource,
which in this case is a file on disk.
Well then, the webserver should take this path exactly as received, and
look for a file on disk whose name matches exactly that path, byte by byte.
But, between the webserver and the disk, there is an operating system.
The webserver does not read the disk directly. It does that through the
OS I/O interface calls. So, it is possible that when the webserver
looks for a file called "xyz123.html", the OS interface translates that
to "XYZ123.HTML" for example, and returns /that/ file.
That is the case for Windows, for example. For "xyz123.html", Windows
will return any file that is named "Xyz123.html", or "xYz123.html", or
"XYZ123.html" etc., because when looking for files, Windows is
case-insensitive. If the webserver does not double-check this (some do),
then it may thus return the wrong file.
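One way such a double-check could look is sketched below in Python. This is an assumption about the general idea, not how Apache or any particular webserver implements it, and the function name exists_with_exact_name is hypothetical. It compares the requested name against the names actually stored in the directory, instead of trusting the OS lookup.

```python
import os
import tempfile

def exists_with_exact_name(directory: str, name: str) -> bool:
    """True only if `directory` has an entry spelled exactly `name`.

    os.listdir returns the names as stored, so the comparison stays
    case-sensitive even when the filesystem lookup itself is not.
    """
    return name in os.listdir(directory)

with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "XYZ123.html"), "w").close()
    print(exists_with_exact_name(d, "XYZ123.html"))  # True
    print(exists_with_exact_name(d, "xyz123.html"))  # False
```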
The same kind of thing can happen with "diacritic" characters, such as
--------- Part II -----------
Uploading files and writing them to disk.
This is a separate issue.
The script that handles the <form> which is used to upload the file,
knows that the filename is Unicode, encoded as UTF-8.
(It knows that, because you wrote the <form> and the script, and in your
<form>, you have told the browser to send information in UTF-8).
In the UTF-8 encoding, the filename "složka.png", consists of *10
characters*, but of *11 bytes*. That is because the "ž" in the middle,
is encoded using 2 bytes in UTF-8.
If you look at this filename with an editor which understands UTF-8, you
will see this as "složka.png".
If you look at this same filename with an editor which does not
understand UTF-8 (or is set to iso-8859-2), then you will see this same
string as something like "sloĹľka.png" (or something else like that, I
have not really checked).
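The character and byte counts above can be checked with a few lines of Python (a sketch; it assumes the "ž" is the single precomposed character U+017E, and it uses Python's cp1250 codec to stand in for an editor that reads the bytes as Windows-1250):

```python
name = "slo\u017eka.png"           # "složka.png", with ž written as U+017E
utf8_bytes = name.encode("utf-8")

print(len(name))                   # 10 (characters)
print(len(utf8_bytes))             # 11 (bytes)
print(utf8_bytes)                  # b'slo\xc5\xbeka.png'

# The same 11 bytes, read by something that assumes Windows-1250:
print(utf8_bytes.decode("cp1250"))  # sloĹľka.png

# And read as iso-8859-2, for comparison:
print(utf8_bytes.decode("iso8859_2"))
```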
But back to your upload script.
It has this uploaded file name, in Unicode UTF-8, as "složka.png".
Now it wants to create this file on disk.
For that, it tells the OS : create file "složka.png".
The OS takes this file name, and depending on several conditions (**),
understands this name literally as either a series of *bytes* (11 of
them), or as a series of *characters* (10 of them) in UTF-8 encoding.
And the OS, according to its understanding, creates a directory entry on
disk for this filename.
In your case, it creates an entry in the disk directory, containing the
/bytes/ (or /characters/) "sloĹľka.png".
It does that, because your script does it wrong :
The script "knows" that this filename is encoded in UTF-8.
But the OS does not know that.
The script /should know/ how the OS is going to understand that, and
should, if needed, re-encode this filename in the proper encoding, so
that the OS understands it correctly, and creates a file named "složka.png".
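The re-encoding step could be sketched like this. It is written in Python only to show the idea; the original script is PHP, and the function name reencode_filename and the default target charset are assumptions for this example. The filename bytes received from the form are decoded as UTF-8 and then encoded again in the charset the OS is assumed to expect.

```python
def reencode_filename(raw: bytes,
                      received_as: str = "utf-8",
                      fs_encoding: str = "cp1250") -> bytes:
    """Convert filename bytes from the form's charset to the OS charset.

    Raises UnicodeEncodeError if a character does not exist in
    fs_encoding; such names need some other treatment.
    """
    text = raw.decode(received_as)    # bytes -> characters
    return text.encode(fs_encoding)   # characters -> bytes the OS expects

uploaded = b"slo\xc5\xbeka.png"       # "složka.png" as sent by a UTF-8 form
print(reencode_filename(uploaded))    # b'slo\x9eka.png'
```

The result has 10 bytes, one per character, which is what a Windows-1250 system would itself write for "složka.png".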
It is not that a file named "sloĹľka.png" is wrong. It is, in itself, a
perfectly valid filename.
But the problem is that, considering Part I above :
- your users are going to type a URL in the location bar of their browser
- for that, they are going to use the keyboard that they have, on their
workstation, with their OS and their browser etc...
(for example, I could never type it, because I don't have a key for "ž"
on my keyboard; so I have to cut and paste from your email )
- So they are going to type, for example :
- The browser is going to URL-encode that, probably replacing the "ž" by
a 3-character "percent-sequence" like %9E (its Windows-1250 code), or even
by two 3-character sequences, %C5%BE, if the browser thinks it must encode
the URL as UTF-8.
- the browser is then going to "send this URL" to Apache.
- Apache will receive this URL, decode the %-sequences into *bytes*, and
ask the OS for this file.
------ Part III ----
Now, IF the two translations match (the one which happened when you
uploaded the file, and the one which happens between the user and the
server disk), then the file will be found.
And otherwise, it will not be.
Your case is that the two translations do not match.
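A byte-level sketch of the match and the mismatch follows, in Python. It only models the two translations as byte strings and ignores whatever the OS adds on top; the helper name request_bytes is hypothetical, and the stored bytes are assumed to be the UTF-8 bytes written at upload time.

```python
from urllib.parse import quote, unquote_to_bytes

def request_bytes(typed_name: str, browser_encoding: str) -> bytes:
    """Bytes the server ends up with after the browser percent-encodes
    the typed name and the server undoes the %-sequences."""
    return unquote_to_bytes(quote(typed_name, encoding=browser_encoding))

name = "slo\u017eka.png"            # "složka.png"
stored = name.encode("utf-8")       # bytes written when the file was uploaded

print(quote(name, encoding="utf-8"))    # slo%C5%BEka.png
print(quote(name, encoding="cp1250"))   # slo%9Eka.png

print(request_bytes(name, "utf-8") == stored)    # True: the translations match
print(request_bytes(name, "cp1250") == stored)   # False: file not found
```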
----- Part IV : how to resolve this --------
My suggestion :
Do /not/ allow the users to decide under which name the file is really
stored on the disk.
Create an "alias" for the filename, containing only US-ASCII characters,
and store the file under that name.
Then arrange things so that when the users ask for the file "složka.png"
(this name appears for example on an index page that you create), your
webserver actually looks for the alias name. (*)
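One possible way to build such an alias is sketched below in Python. Hashing the name is just one choice made for this example, not something prescribed here, and the names ascii_alias and index are hypothetical. The alias is derived from the UTF-8 bytes of the user's name, and a table maps the displayed name to the stored one.

```python
import hashlib
import os

def ascii_alias(display_name: str) -> str:
    """Return a lowercase, US-ASCII-only storage name for display_name."""
    digest = hashlib.sha1(display_name.encode("utf-8")).hexdigest()
    ext = os.path.splitext(display_name)[1].lower()
    # Keep the extension only if it is plain ASCII letters/digits.
    if not (ext.isascii() and ext[1:].isalnum()):
        ext = ""
    return digest + ext

# Table from the name shown to users to the name used on disk.
index = {}
for shown in ("slo\u017eka.png", "Slo\u017eka.png"):   # složka.png, Složka.png
    index[shown] = ascii_alias(shown)

for shown, stored in index.items():
    print(ascii(shown), "->", stored)
```

Because the alias uses only lowercase hexadecimal digits plus an ASCII extension, two names that differ only in case still get two different aliases, and neither the charset of the OS nor its case handling matters any more.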
This is the only way to make your application really portable, because
in the end, on the WWW, you never know who or where the user is, what
his workstation is, what his OS is, etc..
So the user could upload a file under a name that gives you a lot of
trouble on your server (as you have discovered already, but not entirely).
For example, one user could upload a file named "složka.png", and
another user could upload a file called "Složka.png". If your server is
Windows, and if you are not careful, the second file will overwrite the first.
There are many other such problematic cases.
And if MediaWiki does not do that, then MediaWiki is not a portable
application, sorry. The problem is not the webserver, the problem is
(and, in part, HTTP 1.x)
(*) you show for example an index page like :
(**) which can be, for example, the "locale" under which the Apache
process is running.