Vincent Mallet | 1 Mar 01:47 2006

Lexer differences between 1.4 and 1.6

Hello,

I have some code that uses htmlparser 1.4 and I am looking at
upgrading it to the latest 1.6 integration build. However, I am seeing
differences in the way the input is processed that make the work more
difficult.

Given the input (note it's missing a quote):
Hello <a href="http://www.foo.com>World</a>

With htmlparser 1.4, I get the following nodes:
Text: Hello
Begin tag: a href="http://www.foo.com"
Text: World
End tag: a

With htmlparser 1.6, I get these:
Text: Hello
LinkTag: link to http://www.foo.com>link</a>

The 1.6 behavior makes error recovery a lot more difficult. Is there a
way to have 1.6 behave like 1.4 in this case?

Thanks for your help,

    Vince.

-------------------------------------------------------
This SF.Net email is sponsored by xPML, a groundbreaking scripting language
that extends applications into web and mobile media. Attend the live webcast

Derrick Oswald | 1 Mar 02:58 2006

Re: Lexer differences between 1.4 and 1.6


No, sorry, there is no 'backwards compatibility' switch.


Vincent Mallet | 1 Mar 18:34 2006

Re: Lexer differences between 1.4 and 1.6

Thanks Derrick.

Are "changes.txt" and "release.txt" all the documents about the evolution of htmlparser between 1.4 and 1.5/1.6, or is there something else that would talk about changes in concepts and design between the different releases?

Thanks,

    Vince.

_______________________________________________
Htmlparser-user mailing list
Htmlparser-user <at> lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/htmlparser-user

Derrick Oswald | 2 Mar 03:35 2006

Re: Lexer differences between 1.4 and 1.6

Those are the primary resources. Beyond them, it's mostly the Javadocs;
for example, there's a good summary of the biggest difference (the
underlying lexer) in the lexer package:
http://htmlparser.sourceforge.net/javadoc/org/htmlparser/lexer/package-summary.html


Konstantine | 2 Mar 19:49 2006

is htmlparser1_6_20051112 what I need?

Greetings
I have beginner-level knowledge of Java, so please be gentle with me
:-) I am trying to build a small application where the program makes
a number of POST requests and processes the results of the requests.

I have got as far as creating a separate thread for each request and
storing the response (the full HTML document) in a StringBuffer
belonging to the thread, using standard packages. Now I want to scan
the buffered document for various strings.

Is HTMLParser the right tool to use for this, or is there a standard
package I can use to achieve this?

many thanks in advance
K.

FYI, the Wiki link[1] in the left frame of the home page and the
frequently asked questions link[2] on the request support page seem to
have problems.

[1] http://htmlparser.sourceforge.net/wiki/index.php
[2] http://htmlparser.sourceforge.net/faq.html
abhijeetawasthi | 3 Mar 12:02 2006

Reading HTML doc


Hi,
I am going through the HtmlParser classes to develop a utility that
reads HTML from a Java program.

My HTML doc has info like this:

<H2>My Name</H2><H3>Address</H3>
<P>It is not useful</P><H3>Age</H3>
<P>It is important</P>

I have to read the content between <H1><H2> and the corresponding <P>
tags. How can I do this, or how should I get started?

Thanks in advance
Abhijeet

************************************************************
HSBC Software Development (India) Pvt Ltd
HSBC Center Riverside,West Avenue ,
25 B Kalyani Nagar Pune  411 006 INDIA

Telephone: +91 20 26683000
Fax: +91 20 26681030
************************************************************

-----------------------------------------
***********************************************************************
This e-mail is confidential. It may also be legally privileged.
If you are not the addressee you may not copy, forward, disclose
or use any part of it. If you have received this message in error,
please delete it and all copies from your system and notify the
sender immediately by return e-mail.

Internet communications cannot be guaranteed to be timely,
secure, error or virus-free. The sender does not accept liability
for any errors or omissions.
***********************************************************************


Derrick Oswald | 3 Mar 13:31 2006

Re: is htmlparser1_6_20051112 what I need?


If you just want to scan for strings, you can do that with pure Java.
If you want to extract specific tagged pieces, then HTML Parser is for you.
Call parser.setInputHTML(String), and then the whole parser API
becomes available.
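
For the pure-Java route, a minimal sketch of scanning a buffered
response might look like the following. The names ResponseScanner and
findAll are made up for illustration; only StringBuffer.indexOf comes
from the standard library, and matches are counted without overlap.

```java
import java.util.ArrayList;
import java.util.List;

class ResponseScanner {
    /** Returns the start index of every non-overlapping occurrence of needle. */
    static List<Integer> findAll(StringBuffer response, String needle) {
        List<Integer> hits = new ArrayList<>();
        if (needle.isEmpty()) {
            return hits; // an empty needle would otherwise loop forever
        }
        int from = 0;
        int at;
        // StringBuffer.indexOf(String, int) returns -1 when nothing more is found.
        while ((at = response.indexOf(needle, from)) >= 0) {
            hits.add(at);
            from = at + needle.length();
        }
        return hits;
    }

    public static void main(String[] args) {
        // Stand-in for a response body collected by one of the request threads.
        StringBuffer response =
                new StringBuffer("<html><body>status: OK, retry: OK</body></html>");
        System.out.println("Offsets of \"OK\": " + findAll(response, "OK"));
    }
}
```

This only tells you where raw text occurs; it knows nothing about tags,
which is where the parser becomes the better fit.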


Derrick Oswald | 3 Mar 13:35 2006

Re: Reading HTML doc

I would suggest trying the FilterBuilder utility.
You'll want things like TagNameFilter to get the <H3> and 
HasParent/HasChild/HasSibling filters to navigate around the node tree.


Antony Sequeira | 4 Mar 02:51 2006

parsing raw downloaded content thats on file in arbitrary encodings

Hi

I am thinking of using htmlparser for a project.
I have the content of URLs available in a file on disk.
The file contains the headers, followed by the rest of the content as
received from the web server (so it's just a series of bytes).
I'll need something that can read and parse the headers, figure out
the encoding for the rest of the content, and then parse the rest of
the content.

I have looked at the Javadocs and done some digging.
Here is what I think I need to do:
write my own code to read through the headers to figure out the
encoding, then call the following:
http://htmlparser.sourceforge.net/javadoc/org/htmlparser/Parser.html#createParser(java.lang.String,%20java.lang.String)
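
As a rough sketch of that plan, the split between headers and body and
the charset lookup might be done as below, using only the standard
library. It assumes the file holds one raw, uncompressed, non-chunked
HTTP response with CRLF line endings. RawResponse and its fallback
parameter are hypothetical names, not part of htmlparser.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RawResponse {
    // Finds the charset parameter on a Content-Type header line.
    private static final Pattern CHARSET = Pattern.compile(
            "(?im)^content-type:.*?charset\\s*=\\s*\"?([^\\s;\"]+)");

    final String headers;
    final Charset charset;
    final String body;

    private RawResponse(String headers, Charset charset, String body) {
        this.headers = headers;
        this.charset = charset;
        this.body = body;
    }

    /** Splits raw bytes at the first blank line and decodes the body. */
    static RawResponse parse(byte[] raw, Charset fallback) {
        int split = indexOfBlankLine(raw);
        int bodyStart = (split < 0) ? 0 : split + 4;
        // Header bytes are decoded as ISO-8859-1 so that no byte is lost.
        String headers = (split < 0)
                ? ""
                : new String(raw, 0, split, StandardCharsets.ISO_8859_1);

        Charset charset = fallback;
        Matcher m = CHARSET.matcher(headers);
        if (m.find()) {
            try {
                charset = Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Unknown or malformed charset name: keep the fallback.
            }
        }
        byte[] bodyBytes = Arrays.copyOfRange(raw, bodyStart, raw.length);
        return new RawResponse(headers, charset, new String(bodyBytes, charset));
    }

    /** Index of the first CR LF CR LF sequence, or -1 if there is none. */
    private static int indexOfBlankLine(byte[] raw) {
        for (int i = 0; i + 3 < raw.length; i++) {
            if (raw[i] == '\r' && raw[i + 1] == '\n'
                    && raw[i + 2] == '\r' && raw[i + 3] == '\n') {
                return i;
            }
        }
        return -1;
    }
}
```

The decoded body and the charset name could then be handed to the
createParser(String, String) call linked above.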

The questions I have about this approach are:
1. The 'html' parameter is of type 'String'; I'd think that would
automatically imply that the string's content is already in Java's
format (utf-16?). So what is the point of having the charset argument?
I know utf-16 is an encoding and not a charset, but I don't understand
the relevance of a charset once something is in a 'java String', which
can only be Unicode AFAIK.
It would have made sense to me if the html parameter were a byte array
or some such thing.

2. I guess I could convert to a String myself from the byte buffer once
I have the code for encoding detection. But then what would I pass for
the charset? It makes no sense to me in Java to say I have some data
sitting in a 'java String' with charset iso-8859-1. I guess I am just
confused about the need for a charset specification when something is
already in a 'String'.

Thanks in advance for any ideas and help.

-Antony Sequeira


Re: parsing raw downloaded content thats on file in arbitrary encodings

Hi,

The charset parameter of Parser.createParser(String, String) will
be returned when you call getEncoding(). It has no other effect besides
this, I believe.

To read text from an InputStream (accessing a file, socket, etc.), a
Reader should be used.
To create a Reader, an explicit charset should be given (letting the
Reader use the system's default is asking for problems...).
Because the creation of the Reader precedes the reading, the text
encoding must be known prior to reading it. This is why the charset
parameter of the HTTP "Content-Type" header is useful. However, this
header is not always correct (consider it a hint), and sometimes it is
not even available (!), and then we have to consult an oracle...
If the charset used was not the proper one, the String can be
FIXED by converting it back into bytes (with the same charset used for
decoding) and then into a String again using the correct charset.
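
The round trip described above can be sketched in a few lines.
CharsetRepair and redecode are illustrative names only. One assumption
matters here: the repair is only lossless when the charset used first
maps every byte to a character, as ISO-8859-1 does. A wrong decode as
UTF-8 can replace invalid bytes with U+FFFD, and those cannot be
recovered this way.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetRepair {
    /**
     * Re-encodes text with the charset it was wrongly decoded with,
     * then decodes the resulting bytes with the charset believed correct.
     */
    static String redecode(String text, Charset usedCharset, Charset correctCharset) {
        byte[] originalBytes = text.getBytes(usedCharset);
        return new String(originalBytes, correctCharset);
    }

    public static void main(String[] args) {
        // Bytes as they might arrive on the wire, encoded as UTF-8.
        byte[] wire = "Lu\u00eds".getBytes(StandardCharsets.UTF_8);
        // Decoding with the wrong charset produces garbled text.
        String garbled = new String(wire, StandardCharsets.ISO_8859_1);
        String repaired =
                redecode(garbled, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println("repaired matches original: " + repaired.equals("Lu\u00eds"));
    }
}
```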

How to tell if THE correct charset was used?
Well, for now you can look for an http-equiv meta tag that specifies
the charset. If you find such a tag and the charset is the same one
you used before, then you may trust your conversion.
Otherwise you should choose to believe one of them (the HTTP header
or the HTTP-EQUIV tag) and discard the other.
Beyond that, how can someone detect THE correct charset? The short
answer: it's not easy and not always possible.
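
One rough way to run the meta tag check on an already decoded String is
a regular-expression scan, sketched below. This is a heuristic rather
than real HTML parsing. MetaCharset and its methods are hypothetical
names, and the comparison is a plain case-insensitive match that does
not recognise charset aliases.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaCharset {
    // A <meta ...> tag whose http-equiv attribute is "Content-Type".
    private static final Pattern META = Pattern.compile(
            "(?is)<meta\\b[^>]*\\bhttp-equiv\\s*=\\s*[\"']?content-type[\"']?[^>]*>");
    // The charset=... part inside that tag's content attribute.
    private static final Pattern CHARSET = Pattern.compile(
            "(?i)charset\\s*=\\s*([^\\s\"';>]+)");

    /** Returns the charset named by the first matching meta tag, or null. */
    static String declaredCharset(String html) {
        Matcher meta = META.matcher(html);
        while (meta.find()) {
            Matcher cs = CHARSET.matcher(meta.group());
            if (cs.find()) {
                return cs.group(1);
            }
        }
        return null;
    }

    /** True when the document declares the same charset that was used to decode it. */
    static boolean agrees(String html, String usedCharset) {
        String declared = declaredCharset(html);
        return declared != null && declared.equalsIgnoreCase(usedCharset);
    }

    public static void main(String[] args) {
        String html = "<html><head><META HTTP-EQUIV=\"Content-Type\" "
                + "CONTENT=\"text/html; charset=iso-8859-1\"></head></html>";
        System.out.println("declared: " + declaredCharset(html));
        System.out.println("agrees with ISO-8859-1: " + agrees(html, "ISO-8859-1"));
    }
}
```

When agrees returns false, the choice between the header and the tag is
still a judgment call, as noted above.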

I hope this helps you Antony.

By the way, I too have a related question for the developers:

I want to decouple the HTMLParser from the URLConnection where the
network IO is done.
I still want the parser to resolve links against the original URL of
the page and to use the HTTP headers to parse the data (gunzipping
data and charset decoding).

I think that the available constructors for Parser don't allow this
decoupling in a straightforward fashion without losing some of
these features.

My current solution is to extend URLConnection and then use that
object to feed the parser.

A perhaps cleaner solution would be to have a constructor taking
three args:
	URL (for link resolving)
	InputStream for the data
	HTTP headers

The HTTP headers could be as returned from
URLConnection.getHeaderFields(), for interoperability:
public Map<String,List<String>> getHeaderFields();
Returns an unmodifiable Map of the header fields. The Map keys are
Strings that represent the response-header field names. Each Map
value is an unmodifiable List of Strings that represents the
corresponding field values.

The signature of the constructor I'm proposing is:
public Parser(String url, InputStream input, Map<String,List<String>> httpHeaders);

Until I know of a better solution, I will proceed with extending
URLConnection and feeding it into the Parser with the setter
setConnection() (I reuse the Parser to parse several documents).

Best Regards

Luís Gomes


