Re: parsing raw downloaded content thats on file in arbitrary encodings
Luís Manuel dos Santos Gomes <luismsgomes <at> gmail.com>
2006-03-04 04:07:21 GMT
Hi,
The charset parameter of the constructor Parser(String, String) is
simply what getEncoding() returns afterwards. As far as I know, it
has no other effect.
To read text from an InputStream (accessing a file, socket, etc) a
Reader should be used.
To create a Reader, an explicit charset should be given (letting the
Reader use the system's default is asking for problems...)
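A minimal sketch of that advice, using only the standard library. The
class and method names (ExplicitCharsetReader, readAll) are made up for
illustration and are not part of HTMLParser:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

final class ExplicitCharsetReader {
    /** Decodes the whole stream with the given charset; the platform default is never used. */
    static String readAll(InputStream in, Charset charset) {
        StringBuilder text = new StringBuilder();
        // The charset is fixed here, before the first byte is read.
        try (Reader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }
}
```

The same bytes passed with a different Charset argument give a
different String, which is why the choice has to be explicit.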
Because the Reader must be created before any text is read, the
encoding has to be known in advance. This is why the charset
parameter of the HTTP Content-Type header is useful. However, this
header is not always correct (consider it a hint), and sometimes it
is not even available (!), in which case we have to consult an oracle...
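For the case where the header is present, pulling the charset out of a
Content-Type value could look roughly like this. The names
(ContentTypeCharset, fromContentType) and the fallback argument are
assumptions of this sketch; the fallback plays the role of the
"oracle" when the header is missing or names an unknown charset:

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ContentTypeCharset {
    // Matches charset=NAME or charset="NAME" anywhere in the header value.
    private static final Pattern CHARSET_PARAM =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in a Content-Type value, or the fallback if absent or unsupported. */
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET_PARAM.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return fallback;
        }
    }
}
```

The result is still only a hint, so it should be treated as a first
guess rather than as the truth.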
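If the wrong charset was used, the String can often be FIXED by
converting it back into bytes (with the same charset used for
decoding) and then decoding those bytes with the correct charset.
This only works if the first decoding was lossless: ISO-8859-1 maps
every byte to a character, but a UTF-8 decoder replaces malformed
sequences, and then the original bytes cannot be recovered.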
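The round trip can be sketched in a couple of lines. CharsetRepair and
redecode are hypothetical names chosen for this example:

```java
import java.nio.charset.Charset;

final class CharsetRepair {
    /**
     * Re-encodes a wrongly decoded String with the charset that was used by
     * mistake, then decodes the resulting bytes with the charset that should
     * have been used in the first place.
     */
    static String redecode(String garbled, Charset usedByMistake, Charset actual) {
        byte[] originalBytes = garbled.getBytes(usedByMistake);
        return new String(originalBytes, actual);
    }
}
```

This recovers the text when UTF-8 bytes were read as ISO-8859-1. It is
not reliable in the opposite direction, because decoding arbitrary
bytes as UTF-8 may already have discarded information.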
How to tell if THE correct charset was used?
Well, for now you can look for an http-equiv meta tag that specifies
the charset. If you find such a tag and its charset is the same one
you used before, then you may trust your conversion.
Otherwise you should choose to believe one of them (the HTTP header
or the HTTP-EQUIV tag) and discard the other.
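A rough sketch of that check, run on text that has already been decoded
with the first guess. It relies on the meta tag itself being plain
ASCII, which holds for ASCII-compatible encodings. The names
(MetaCharset, declared, agrees) are hypothetical and the regular
expression is deliberately loose; it is not a substitute for a real
HTML parse:

```java
import java.nio.charset.Charset;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetaCharset {
    // Loose pattern: a <meta ...> tag with charset=NAME somewhere inside it.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta\\b[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the charset name declared in a meta tag, if any. */
    static Optional<String> declared(String html) {
        Matcher m = META_CHARSET.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    /** True only if a meta tag is present and names the charset that was used for decoding. */
    static boolean agrees(String html, Charset used) {
        Optional<String> name = declared(html);
        if (!name.isPresent()) {
            return false;
        }
        try {
            return Charset.forName(name.get()).equals(used);
        } catch (IllegalArgumentException e) {
            return false; // illegal or unsupported charset name in the page
        }
    }
}
```

When agrees() returns false and a tag was found, the caller still has
to decide which of the two sources to believe.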
Beyond that, how can someone detect THE correct charset? The short
answer: it's not easy and not always possible.
I hope this helps you, Antony.
By the way, I too have a related question for the developers:
I want to decouple the HTMLParser from the URLConnection where the
network IO is done.
I still want the parser to resolve links against the original URL of
the page and to use the HTTP headers to parse the data (gunzipping
data and charset decoding).
I think that the available constructors for Parser don't allow this
decoupling in a straightforward fashion without losing some of
these features.
My current solution is to extend URLConnection and then use that
object to feed the parser.
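For illustration, such a subclass might look roughly like the sketch
below. It uses only java.net and java.io; the class name
StoredResponseConnection and the factory method of() are invented for
this example, and no HTMLParser calls are shown:

```java
import java.io.InputStream;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.net.URLConnection;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** A URLConnection that serves an already-downloaded body and headers; it does no network IO. */
final class StoredResponseConnection extends URLConnection {
    private final InputStream body;
    private final Map<String, List<String>> headers;

    StoredResponseConnection(URL url, InputStream body, Map<String, List<String>> headers) {
        super(url);
        this.body = body;
        this.headers = Collections.unmodifiableMap(headers);
    }

    /** Convenience factory taking the page URL as a String. */
    static StoredResponseConnection of(String url, InputStream body, Map<String, List<String>> headers) {
        try {
            return new StoredResponseConnection(URI.create(url).toURL(), body, headers);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    @Override public void connect() { connected = true; } // nothing to open

    @Override public InputStream getInputStream() { return body; }

    @Override public Map<String, List<String>> getHeaderFields() { return headers; }

    // getContentType() and getContentEncoding() in the base class delegate to this method.
    @Override public String getHeaderField(String name) {
        for (Map.Entry<String, List<String>> e : headers.entrySet()) {
            String key = e.getKey();
            if (key != null && key.equalsIgnoreCase(name) && !e.getValue().isEmpty()) {
                return e.getValue().get(e.getValue().size() - 1); // last value wins
            }
        }
        return null;
    }
}
```

Because getURL() still returns the original page address, code that
resolves relative links against the connection's URL keeps working.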
A, perhaps cleaner, solution would be to have a constructor taking
three args:
URL (for link resolving)
InputStream for the data
HTTP headers
The HTTP headers could be in the form returned by
URLConnection.getHeaderFields(), for interoperability:
public Map<String,List<String>> getHeaderFields();
Returns an unmodifiable Map of the header fields. The Map keys are
Strings that represent the response-header field names. Each Map
value is an unmodifiable List of Strings that represents the
corresponding field values.
The signature of the constructor I'm proposing is:
public Parser(String url, InputStream input, Map<String,List<String>>
httpHeaders);
Until I learn of a better solution, I will proceed with extending
URLConnection and feeding it into the Parser with the setter
setConnection() (I reuse the Parser to parse several documents).
Best Regards
Luís Gomes
On Mar 4, 2006, at 1:51 AM, Antony Sequeira wrote:
> Hi
>
> I am thinking of using htmlparser for a project.
> I have the content of URLs available in files on disk.
> Each file contains the headers, followed by the rest of the content as
> received from the webserver (so it's just a series of bytes).
> I'll need something that can read and parse the headers, figure out
> the encoding for the rest of the content and then parse the rest of
> the content.
>
> I have seen the javadocs and done some digging.
> Here is what I think I need to do
> Write my own code to read through headers to figure out encoding
> Then call the following
> http://htmlparser.sourceforge.net/javadoc/org/htmlparser/Parser.html#createParser(java.lang.String,%20java.lang.String)
>
> The questions I have on this approach are:
> 1. The 'html' parameter is of type 'String', so I'd think it
> automatically implies that the string's content is already in Java's
> format (utf-16?). So what is the point of having the charset argument?
> I know utf-16 is an encoding and not a charset, but I don't understand
> the relevance of a charset once something is in a 'java String', which
> can only be unicode AFAIK.
> It would have made sense to me if the html parameter was byte array or
> some such thing.
>
> 2. I guess I could convert to String myself from the byte buffer once
> I have the code for encoding detection. But then what would I pass for
> the charset. It makes no sense to me in Java to say I have some data
> sitting in a 'java String' with charset iso-8859-1. I guess I am just
> confused about the need for charset specification when something is
> already in 'String'.
>
> Thanks in advance for any ideas and help.
>
> -Antony Sequeira
>
>
> -------------------------------------------------------
> This SF.Net email is sponsored by xPML, a groundbreaking scripting
> language
> that extends applications into web and mobile media. Attend the
> live webcast
> and join the prime developer group breaking into this new coding
> territory!
> http://sel.as-us.falkag.net/sel?cmd=lnk&kid0944&bid$1720&dat1642
> _______________________________________________
> Htmlparser-user mailing list
> Htmlparser-user <at> lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/htmlparser-user
>