Steve McCann | 26 Feb 2004 21:16

Getting title tag text


Using the following code, the assert for the title fails (getTitle()
returns an empty string). Is it not possible to retrieve that
information using the lexer rather than the parser? I am using HTML
Parser Integration Release 1.4-20040125.

Thank you,
Steve

    public void testTitleScan() throws ParserException
    {
        String inputHTML =
            "<html><!--remark--><head><title>Yahoo!</title></head>";
        Lexer lexer = new Lexer (new Page (inputHTML));

        PrototypicalNodeFactory factory =
            new PrototypicalNodeFactory (new TitleTag ());
        lexer.setNodeFactory (factory);

        Node node;
        while (null != (node = lexer.nextNode ()))
        {
            if (node instanceof TitleTag)
            {
                TitleTag titleTag = (TitleTag) node;
                String test = titleTag.getTitle();
                assertEquals("Title","Yahoo!",titleTag.getTitle());
            }

Derrick Oswald | 27 Feb 2004 02:16

Re: Getting title tag text

Steve,

I think you've hit on the nub of the difference between the lexer and 
the parser.
The lexer simply returns nodes, in order, and doesn't try to match end 
tags with start tags. So yes, you will get a TitleTag, but it hasn't 
been fed its children.
The parser on the other hand will cause collection of the nodes between 
start and end tags so as to "know" that the thing between the TITLE and 
/TITLE tag is the "title of the document". See the home page for another 
explanation: http://htmlparser.sourceforge.net/

What you can do with the Lexer is get the next node *after* the TITLE 
tag and assume it's a plain text title in a string node. People do funny 
things with HTML, so you're bound to see <TITLE><B>My Title</B></TITLE> 
and stuff like that, which I'm not sure is even completely handled by 
the parser code, so you have to be careful. Or perhaps get the *next* 
StringNode from the lexer, which is presumably the title for the same 
reasons as outlined before, but you have to watch out for empty 
<TITLE></TITLE> constructs. Or you can use the Parser and hope it does 
the 'right thing'. If it doesn't, let us know.

Derrick
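
[Editor's sketch] The lexer-side workaround described above can be
illustrated with a toy example. This is not the HTML Parser library's API:
the class TitleSketch and its methods tokenize, isStartTag and titleText
are hypothetical names invented for illustration, and the tokenizer is a
deliberately naive stand-in for a real lexer. It only shows the idea of
walking a flat, in-order node stream, collecting the text that follows the
TITLE start tag, and coping with nested tags and empty titles.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy stand-in for a lexer; NOT the HTML Parser library's API.
final class TitleSketch {

    // Splits HTML into a flat, in-order list of tokens: each token is either
    // a tag ("<...>") or a run of text. Naive: a '>' inside a comment or an
    // attribute value would confuse it.
    static List<String> tokenize(String html) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < html.length()) {
            int end;
            if (html.charAt(i) == '<') {
                int close = html.indexOf('>', i);
                end = (close < 0) ? html.length() : close + 1;
            } else {
                int open = html.indexOf('<', i);
                end = (open < 0) ? html.length() : open;
            }
            tokens.add(html.substring(i, end));
            i = end;
        }
        return tokens;
    }

    // True if the token is a start tag with the given name (case-insensitive).
    static boolean isStartTag(String token, String name) {
        String t = token.toLowerCase(Locale.ROOT);
        return t.equals("<" + name + ">") || t.startsWith("<" + name + " ");
    }

    // Returns the text found after the TITLE start tag, up to the TITLE end
    // tag (or end of input). Tags nested inside the title are skipped.
    // Returns "" for an empty title and null if there is no TITLE tag.
    static String titleText(String html) {
        boolean inTitle = false;
        StringBuilder text = new StringBuilder();
        for (String token : tokenize(html)) {
            boolean isTag = token.startsWith("<");
            if (!inTitle) {
                inTitle = isTag && isStartTag(token, "title");
            } else if (isTag && token.toLowerCase(Locale.ROOT).startsWith("</title")) {
                return text.toString();
            } else if (!isTag) {
                text.append(token);
            }
        }
        return inTitle ? text.toString() : null;
    }

    public static void main(String[] args) {
        String html = "<html><!--remark--><head><title>Yahoo!</title></head>";
        System.out.println(titleText(html));
    }
}
```

Collecting text until the end tag, rather than taking only the single next
node, is a design choice made here so that both caveats from the message
(markup inside the title, and an empty title) are handled in one pass.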


Steve McCann | 27 Feb 2004 04:48

RE: Getting title tag text

Derrick,
Thanks again for your help. By the way, I didn't have much luck with the
threading issue. I figured upgrading would be a good course of action. I
think I am getting the hang of the new parser now...

Once again, thanks and that's two I owe you...

Steve


Shantha Jayalal | 2 Mar 2004 10:41

ParserException: null;

Hi,
I am getting the following error when I try to extract links from a web site. 
Any help, please? Many thanks,
Shantha

D:\htmlparser1_4_2>java Robot http://www.keele.ac.uk/depts/cs/dake/vldb2000/panel2020/DeenVLDB2/index.htm
Crawlin Site http://www.keele.ac.uk/depts/cs/dake/vldb2000/panel2020/DeenVLDB2/index.htm 1
Exception in thread "main" org.htmlparser.util.ParserException: null;
sun.io.MalformedInputException
at sun.io.ByteToCharUTF8.convert(ByteToCharUTF8.java:152)
at java.io.InputStreamReader.convertInto(InputStreamReader.java:137)
at java.io.InputStreamReader.fill(InputStreamReader.java:186)
at java.io.InputStreamReader.read(InputStreamReader.java:249)
at org.htmlparser.lexer.Source.fill(Source.java:239)
at org.htmlparser.lexer.Source.read(Source.java:322)
at org.htmlparser.lexer.Source.read(Source.java:347)
at org.htmlparser.lexer.Page.setEncoding(Page.java:698)
at org.htmlparser.tags.MetaTag.doSemanticAction(MetaTag.java:115)
at org.htmlparser.scanners.TagScanner.scan(TagScanner.java:69)
at org.htmlparser.scanners.CompositeTagScanner.scan(CompositeTagScanner.java:162)
at org.htmlparser.util.IteratorImpl.nextNode(IteratorImpl.java:92)
at Robot.crawl(Robot.java:200)
at Robot.main(Robot.java:106)
