Shantha Jayalal | 2 Mar 10:41 2004

ParserException: null;

Hi,
I am getting the following error when I try to extract links from a web site.
Any help, please? Many thanks,
Shantha

D:\htmlparser1_4_2>java Robot http://www.keele.ac.uk/depts/cs/dake/vldb2000/panel2020/DeenVLDB2/index.htm
Crawlin Site http://www.keele.ac.uk/depts/cs/dake/vldb2000/panel2020/DeenVLDB2/index.htm 1
Exception in thread "main" org.htmlparser.util.ParserException: null;
sun.io.MalformedInputException
at sun.io.ByteToCharUTF8.convert(ByteToCharUTF8.java:152)
at java.io.InputStreamReader.convertInto(InputStreamReader.java:137)
at java.io.InputStreamReader.fill(InputStreamReader.java:186)
at java.io.InputStreamReader.read(InputStreamReader.java:249)
at org.htmlparser.lexer.Source.fill(Source.java:239)
at org.htmlparser.lexer.Source.read(Source.java:322)
at org.htmlparser.lexer.Source.read(Source.java:347)
at org.htmlparser.lexer.Page.setEncoding(Page.java:698)
at org.htmlparser.tags.MetaTag.doSemanticAction(MetaTag.java:115)
at org.htmlparser.scanners.TagScanner.scan(TagScanner.java:69)
at org.htmlparser.scanners.CompositeTagScanner.scan(CompositeTagScanner.java:162)
at org.htmlparser.util.IteratorImpl.nextNode(IteratorImpl.java:92)
at Robot.crawl(Robot.java:200)
at Robot.main(Robot.java:106) 
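
A note on the trace above: it shows MetaTag.doSemanticAction calling Page.setEncoding, after which the UTF-8 converter (ByteToCharUTF8) rejects the input. One plausible reading, and it is only an assumption since the page itself is not shown, is that the page's META tag declares UTF-8 while its bytes are really in a single-byte encoding. The sketch below uses only the standard java.nio API (not htmlparser) to show how such bytes fail a strict UTF-8 decode. The class and method names are hypothetical, and current JDKs report java.nio.charset.MalformedInputException rather than the old sun.io one.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of htmlparser.
class Utf8Check
{
    // Returns true only if the whole byte array is well-formed UTF-8.
    static boolean isValidUtf8 (byte[] bytes)
    {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder ()
            .onMalformedInput (CodingErrorAction.REPORT)
            .onUnmappableCharacter (CodingErrorAction.REPORT);
        try
        {
            decoder.decode (ByteBuffer.wrap (bytes));
            return (true);
        }
        catch (CharacterCodingException cce)
        {   // MalformedInputException is a subclass of CharacterCodingException
            return (false);
        }
    }

    public static void main (String[] args)
    {
        String text = "caf\u00e9 au lait";
        // The same text stored in two encodings; only the second is valid UTF-8.
        byte[] latin1 = text.getBytes (StandardCharsets.ISO_8859_1);
        byte[] utf8 = text.getBytes (StandardCharsets.UTF_8);
        System.out.println ("ISO-8859-1 bytes valid as UTF-8: " + isValidUtf8 (latin1));
        System.out.println ("UTF-8 bytes valid as UTF-8: " + isValidUtf8 (utf8));
    }
}
```

If a check like this fails for the downloaded page, the declared encoding and the actual bytes disagree, and the page would have to be read with a different encoding than the one in its META tag.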


Anthony Labarre | 6 Mar 17:30 2004

Two problems regarding attributes (assignment and empty attributes detection)

Hi,

I have the following two problems when dealing with attributes:

1) I use the following piece of code in the visitTag(Tag tag) method of 
my visitor inherited from NodeVisitor to detect whether an attribute has 
a multi-word or a single-word value, and in the latter case to replace 
it with a non-quoted value:

          // Vector attribute_set = tag.getAttributesEx();
          // tmp = (Attribute) attribute_set.elementAt(i);
          StringTokenizer ST = new StringTokenizer(tmp.getValue(), " ");
          if(ST.countTokens() == 1) { // one word, quotes are useless
            System.out.println("Removing quotes for: "+tmp.getValue());
            tmp = new Attribute(tmp.getName(), tmp.getValue());
            attribute_set.setElementAt(tmp, i);
          }

It works fine actually, but not for attributes with values like 
"text/css" or "windowTitle();", and I don't know why - I guess quotes 
are added automatically no matter what, because in those two cases I 
just described, I still get the messages "Removing quotes for: text/css" 
and "Removing quotes for: windowTitle();". By the way, I tried to do it 
in a simpler way with the setQuote method, but it seems to only set the 
ending quote, creating code like:

<a href="somelink>[text]</a>

2) I didn't succeed in detecting empty attributes like alt="" or alt= , 
though I've read the javadoc and played around with tmp.isEmpty() and 
!tmp.isValued() ... could someone give me the condition to use to detect 
them?

Good afternoon everyone,
Anthony

Marcin | 6 Mar 19:25 2004

EncodingChangeException: character mismatch

Hi there,

I get the following error:

org.htmlparser.util.EncodingChangeException: character mismatch (new: ? != old: ¬) for encoding change from ISO-8859-2 to ISO-8859-1 at character offset 4162
        at org.htmlparser.lexer.Page.setEncoding(Page.java:702)
        at org.htmlparser.tags.MetaTag.doSemanticAction(MetaTag.java:115)
        at org.htmlparser.scanners.TagScanner.scan(TagScanner.java:69)
        at org.htmlparser.scanners.CompositeTagScanner.scan(CompositeTagScanner.java:162)
        at org.htmlparser.util.IteratorImpl.nextNode(IteratorImpl.java:92)
        at org.htmlparser.Parser.extractAllNodesThatMatch(Parser.java:744)
        at LinkExtractor.main(LinkExtractor.java:46)

Output from LinkExtractor example.

If I wrap the call in a try-catch, I don't get any result. What can I do about it?

Regards,
B

-------------------------------------------------------
This SF.Net email is sponsored by: IBM Linux Tutorials
Free Linux tutorial presented by Daniel Robbins, President and CEO of
GenToo technologies. Learn everything from fundamentals to system
administration.http://ads.osdn.com/?ad_id70&alloc_id638&op=click

Derrick Oswald | 7 Mar 21:12 2004

Re: Two problems regarding attributes (assignment and empty attributes detection)

Hi Anthony,

1) The Attribute constructor attempts to determine correct quoting for 
attribute values based on the HTML 4.01 definition:
  http://www.w3.org/TR/html4/intro/sgmltut.html#h-3.2.2
which, in part, says:

> By default, SGML requires that all attribute values be delimited using 
> either double quotation marks (ASCII decimal 34) or single quotation 
> marks (ASCII decimal 39)... In certain cases, authors may specify the 
> value of an attribute without any quotation marks. The attribute value 
> may only contain letters (a-z and A-Z), digits (0-9), hyphens (ASCII 
> decimal 45), periods (ASCII decimal 46), underscores (ASCII decimal 
> 95), and colons (ASCII decimal 58). We recommend using quotation marks 
> even when it is possible to eliminate them.

The 'useless' quotes you are trying to eliminate are required by the 
standard for those cases you mention because of the slash and brackets.
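
To make the quoted rule concrete, here is a minimal sketch of the check it implies. It is not the code inside the Attribute class; the class and method names are hypothetical, and treating an empty value as always needing quotes is an assumption of this sketch.

```java
// Hypothetical helper for illustration; not htmlparser's actual implementation.
class UnquotedValueCheck
{
    // True if the value contains only the characters HTML 4.01 allows
    // in an unquoted attribute value: letters, digits, '-', '.', '_' and ':'.
    static boolean mayOmitQuotes (String value)
    {
        if (value.isEmpty ())
            return (false); // assumption: an empty value always keeps its quotes
        for (int i = 0; i < value.length (); i++)
        {
            char c = value.charAt (i);
            boolean allowed = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                || (c >= '0' && c <= '9')
                || c == '-' || c == '.' || c == '_' || c == ':';
            if (!allowed)
                return (false);
        }
        return (true);
    }

    public static void main (String[] args)
    {
        String[] samples = { "center", "text/css", "windowTitle();", "two words" };
        for (int i = 0; i < samples.length; i++)
            System.out.println (samples[i] + " -> " + mayOmitQuotes (samples[i]));
    }
}
```

Under this rule "text/css" fails on the slash and "windowTitle();" fails on the brackets and semicolon, which matches the two cases reported.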

As for the setQuote problem, I presume you are trying to set the Quote 
to zero. Can you provide a problem case? It seems to work for this:

        attribute = new Attribute ("href", "http://www.cbc.ca");
        System.out.println (attribute);
        attribute.setQuote ((char)0);
        System.out.println (attribute);

output:
href="http://www.cbc.ca"
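href=http://www.cbc.ca

2) I'm afraid you've stumbled on a bug there. I've filed it as bug 
#911565 "isValued() and isNull() don't work" so that you can track it.

Derrick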

Derrick Oswald | 7 Mar 21:17 2004

Re: EncodingChangeException: character mismatch

Marcin,

The exception is thrown because some of the nodes already given out are 
in error.  You can try a second time after discarding the information 
you've gained so far, like StringBean does:

            try
            {
                try
                {
                    mParser.visitAllNodesWith (this);
                    updateStrings (mBuffer.toString ());
                }
                finally
                {
                    mBuffer = new StringBuffer (4096);
                }
            }
            catch (EncodingChangeException ece)
            {
                mIsPre = false;
                mIsScript = false;
                mIsStyle = false;
                try
                {   // try again with the encoding now in force
                    mParser.reset ();
                    mBuffer = new StringBuffer (4096);
                    mParser.visitAllNodesWith (this);
                    updateStrings (mBuffer.toString ());
                }
[...]
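
For readers wondering why nodes "already given out" can be in error: the same bytes can decode to different characters under the old and the new encoding. The sketch below is an assumption based on the wording of the exception message, not a copy of Page.setEncoding. It decodes a byte array under both encodings and reports the first position, within the part already read, where the characters disagree. All names are hypothetical.

```java
import java.nio.charset.Charset;

// Hypothetical helper for illustration; not part of htmlparser.
class EncodingMismatch
{
    // Returns the index of the first character, among the first alreadyRead
    // characters, that differs between the two decodings, or -1 if none does.
    static int firstMismatch (byte[] bytes, int alreadyRead, String oldName, String newName)
    {
        String before = new String (bytes, Charset.forName (oldName));
        String after = new String (bytes, Charset.forName (newName));
        int limit = Math.min (alreadyRead, Math.min (before.length (), after.length ()));
        for (int i = 0; i < limit; i++)
            if (before.charAt (i) != after.charAt (i))
                return (i);
        return (-1);
    }

    public static void main (String[] args)
    {
        // Byte 0xB1 is U+0105 (a with ogonek) in ISO-8859-2 but U+00B1 (plus-minus) in ISO-8859-1.
        byte[] bytes = { 'a', 'b', (byte)0xB1, 'c' };
        System.out.println (firstMismatch (bytes, bytes.length, "ISO-8859-2", "ISO-8859-1"));
    }
}
```

If the mismatch lies beyond what has been read so far, switching encodings is harmless. Otherwise everything collected up to that point is suspect, which is why the retry starts over with a fresh buffer.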

Anthony Labarre | 8 Mar 15:55 2004

Re: Two problems regarding attributes (assignment and empty attributes detection)

On 7/03/2004 21:12, Derrick Oswald wrote:

>Hi Anthony,
>
>1) The 'useless' quotes you are trying to eliminate are required by the 
>standard for those cases you mention because of the slash and brackets.
>  
>
Thanks!

>As for the setQuote problem, I presume you are trying to set the Quote 
>to zero. Can you provide a problem case, it seems to work for this:
>
>        attribute = new Attribute ("href", "http://www.cbc.ca");
>        System.out.println (attribute);
>        attribute.setQuote ((char)0);
>        System.out.println (attribute);
>
>output:
>href="http://www.cbc.ca"
>href=http://www.cbc.ca
>  
>
Your raw code does work (setQuote('\0') does too, by the way), which makes 
this even stranger. I tested it with a multi-word value, adding the quote 
and then removing it, and that also worked. Since I do it in exactly the 
same way, I suppose it could be due to the cast from an Object ... could 
it? Here's a piece of my code:

  public void visitTag(Tag tag) {
[...]

Anthony Labarre | 8 Mar 16:40 2004

Re: Two problems regarding attributes (assignment and empty attributes detection)

Sorry, I forgot the test case ... I used the page packages.html in the archive available on your site.

On 7/03/2004 21:12, Derrick Oswald wrote:

>Hi Anthony,
>
>1) The Attribute constructor attempts to determine correct quoting for 
>attribute values based on the HTML 4.01 definition:
>  http://www.w3.org/TR/html4/intro/sgmltut.html#h-3.2.2
>which, in part, says:
>
>> By default, SGML requires that all attribute values be delimited using 
>> either double quotation marks (ASCII decimal 34) or single quotation 
>> marks (ASCII decimal 39)... In certain cases, authors may specify the 
>> value of an attribute without any quotation marks. The attribute value 
>> may only contain letters (a-z and A-Z), digits (0-9), hyphens (ASCII 
>> decimal 45), periods (ASCII decimal 46), underscores (ASCII decimal 
>> 95), and colons (ASCII decimal 58). We recommend using quotation marks 
>> even when it is possible to eliminate them.
>
>The 'useless' quotes you are trying to eliminate are required by the 
>standard for those cases you mention because of the slash and brackets.
>
>As for the setQuote problem, I presume you are trying to set the Quote 
>to zero. Can you provide a problem case, it seems to work for this:
>
>        attribute = new Attribute ("href", "http://www.cbc.ca");
>        System.out.println (attribute);
>        attribute.setQuote ((char)0);
>        System.out.println (attribute);
>
>output:
>href="http://www.cbc.ca"
>href=http://www.cbc.ca
>
>2) I'm afraid you've stumbled on a bug there. I've filed it as bug 
>#911565 "isValued() and isNull() don't work" so that you can track it.
>
>Derrick
>
>Anthony Labarre wrote:
>
>>Hi,
>>
>>I have the following two problems when dealing with attributes:
>>
>>1) I use the following piece of code in the visitTag(Tag tag) method of 
>>my visitor inherited from NodeVisitor to detect whether an attribute has 
>>a multi-word or a single word value, and in the latter case to replace 
>>it with a non quoted value:
>>
>>          // Vector attribute_set = tag.getAttributesEx();
>>          // tmp = (Attribute) attribute_set.elementAt(i);
>>          StringTokenizer ST = new StringTokenizer(tmp.getValue(), " ");
>>          if(ST.countTokens() == 1) { // one word, quotes are useless
>>            System.out.println("Removing quotes for: "+tmp.getValue());
>>            tmp = new Attribute(tmp.getName(), tmp.getValue());
>>            attribute_set.setElementAt(tmp, i);
>>          }
>>
>>It works fine actually, but not for attributes with values like 
>>"text/css" or "windowTitle();", and I don't know why - I guess quotes 
>>are added automatically no matter what, because in those two cases I 
>>just described, I still get the messages "Removing quotes for: text/css" 
>>and "Removing quotes for: windowTitle();". By the way, I tried to do it 
>>in a simpler way with the setQuote method, but it seems to only set the 
>>ending quote, creating code like:
>>
>><a href="somelink>[text]</a>
>>
>>2) I didn't succeed in detecting empty attributes like alt="" or alt= , 
>>though I've read the javadoc and played around with tmp.isEmpty() and 
>>!tmp.isValued() ... could someone give me the condition to use to detect 
>>them?
>>
>>Good afternoon everyone,
>>Anthony

_______________________________________________
Htmlparser-user mailing list
Htmlparser-user <at> lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/htmlparser-user

HO Chi-wai, Zero | 9 Mar 05:34 2004

Any sample showing how to convert a HTML to DOM ?

I plan to convert the returned HTML object to XML first and then parse it as a DOM object.

Is there any sample code?
 
Many thanks
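
No reply appears in this part of the archive. As a partial illustration only, the second half of the plan (parsing XML text into a DOM) can be sketched with the standard JAXP API that ships with Java. This assumes the HTML has already been turned into well-formed XML by some other step, which is not shown here; the class name and sample markup are hypothetical, and nothing below uses htmlparser.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; uses only the JDK's built-in XML parser.
class XmlToDom
{
    // Parses well-formed XML text into a DOM Document.
    static Document parse (String xml)
    {
        try
        {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance ().newDocumentBuilder ();
            return (builder.parse (new InputSource (new StringReader (xml))));
        }
        catch (Exception e)
        {   // not well-formed, or the parser could not be configured
            throw new IllegalArgumentException ("could not parse XML", e);
        }
    }

    public static void main (String[] args)
    {
        String xhtml = "<html><body><a href=\"a.html\">A</a><a href=\"b.html\">B</a></body></html>";
        Document doc = parse (xhtml);
        NodeList anchors = doc.getElementsByTagName ("a");
        for (int i = 0; i < anchors.getLength (); i++)
            System.out.println (((Element)anchors.item (i)).getAttribute ("href"));
    }
}
```

Real-world HTML is rarely well-formed XML, so the conversion step is where most of the work lies.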

 
Marcin | 12 Mar 21:32 2004

Re: EncodingChangeException: character mismatch

Dear Derrick,

> >I get the following error:
> >
> >org.htmlparser.util.EncodingChangeException: character mismatch (new: ? != old: ¬) for encoding change from ISO-8859-2 to ISO-8859-1 at character offset 4162
> >Output from LinkExtractor example.
> >
> >If I wrap the call in a try-catch, I don't get any result. What can I do about it?

> The exception is thrown because some of the nodes already given out are
> in error.  You can try a second time after discarding the information
> you've gained so far, like StringBean does:

Thank you for the answer, but it is not a good solution for me :( Please try 
the LinkBean example with this code:

import java.net.URL;
import org.htmlparser.beans.LinkBean;

public class LinkDemo
{
    public static void main (String[] args)
    {
        LinkBean lb = new LinkBean ();
        lb.setURL ("http://www.puszta.pl");
        URL[] urls = lb.getLinks ();
        for (int i = 0; i < urls.length; i++)
            System.out.println (urls[i]);
    }
}

Exception in thread "main" java.lang.NullPointerException
        at LinkDemo.main(LinkDemo.java:11)

I can deal with that page using the low-level lexer, but there must be a way to
extract links from pages with mixed-up encodings using NodeVisitor. Is there?

Greets,
B

Derrick Oswald | 12 Mar 23:51 2004

Re: EncodingChangeException: character mismatch

Try overriding the LinkBean.extractLinks() method like so:

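import java.net.MalformedURLException;
import java.net.URL;
import java.util.Vector;

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.beans.LinkBean;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.EncodingChangeException;
import org.htmlparser.util.ParserException;
import org.htmlparser.visitors.ObjectFindingVisitor;

class MyLinkBean extends LinkBean
{
    protected URL[] extractLinks (String url) throws ParserException
    {
        Parser parser;
        ObjectFindingVisitor visitor;
        Vector vector;
        LinkTag link;
        URL[] ret;

        parser = new Parser (url);
        visitor = new ObjectFindingVisitor (LinkTag.class);
        try
        {
            parser.visitAllNodesWith (visitor);
        }
        catch (EncodingChangeException ece)
        {
            // start over with the encoding now in force,
            // using a fresh visitor so the partial results are discarded
            parser.reset ();
            visitor = new ObjectFindingVisitor (LinkTag.class);
            parser.visitAllNodesWith (visitor);
        }
        Node [] nodes = visitor.getTags();
        vector = new Vector();
        for (int i = 0; i < nodes.length; i++)
            try
            {
                link = (LinkTag)nodes[i];
                vector.add(new URL (link.getLink ()));
            }
            catch (MalformedURLException murle)
            {
                // skip links that are not valid URLs
            }
        ret = new URL[vector.size ()];
        vector.copyInto (ret);

        return (ret);
    }
}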

Then use it like so:

        LinkBean lb = new MyLinkBean ();

If it works for you, I'll incorporate the fix.

Derrick
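
The essential point of the override above is the order of operations on retry: reset the parser, throw away whatever was collected, then run the pass again. A library-free sketch of that pattern follows. FakeParser and EncodingChanged are hypothetical stand-ins for Parser and EncodingChangeException, and the fake is rigged to fail once so the retry path is exercised.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the retry pattern; nothing here is htmlparser code.
class RetryOnEncodingChange
{
    // Stand-in for EncodingChangeException.
    static class EncodingChanged extends Exception
    {
    }

    // Stand-in for a parser whose first pass fails part-way through.
    static class FakeParser
    {
        private boolean encodingSettled = false;

        void reset ()
        {   // a real parser would rewind its input here
        }

        void visitAll (List<String> collector) throws EncodingChanged
        {
            if (!encodingSettled)
            {
                collector.add ("garbled link"); // produced under the wrong encoding
                encodingSettled = true;
                throw new EncodingChanged ();
            }
            collector.add ("clean link 1");
            collector.add ("clean link 2");
        }
    }

    static List<String> collectLinks (FakeParser parser)
    {
        List<String> links = new ArrayList<String> ();
        try
        {
            parser.visitAll (links);
        }
        catch (EncodingChanged first)
        {
            parser.reset ();
            links = new ArrayList<String> (); // discard the partial results
            try
            {
                parser.visitAll (links);
            }
            catch (EncodingChanged second)
            {
                throw new IllegalStateException ("encoding changed twice", second);
            }
        }
        return (links);
    }

    public static void main (String[] args)
    {
        System.out.println (collectLinks (new FakeParser ()));
    }
}
```

Reusing the first collector on the second pass would mix the garbled entry in with the clean ones, which is why the override creates a new ObjectFindingVisitor before retrying.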

Marcin wrote:

>Dear Derrick,
>
>Thank you for the answer, but it is not a good solution for me :( Please try 
>the LinkBean example with this code:
>
>import java.net.URL;
>import org.htmlparser.beans.LinkBean;
>
>public class LinkDemo
>{
>    public static void main (String[] args)
>    {
>        LinkBean lb = new LinkBean ();
>        lb.setURL ("http://www.puszta.pl");
>        URL[] urls = lb.getLinks ();
>        for (int i = 0; i < urls.length; i++)
>            System.out.println (urls[i]);
>    }
>}
>
>Exception in thread "main" java.lang.NullPointerException
>        at LinkDemo.main(LinkDemo.java:11)
>
>I can deal with that page using the low-level lexer, but there must be a way to
>extract links from pages with mixed-up encodings using NodeVisitor. Is there?
>
>Greets,
>B
>
>  
>

