Kevin Hamilton | 4 Apr 2008 20:18

spider threads with basic auth

I didn't look at it in any depth, so I'm not sure if this is a bug in Venus or in httplib2, but I thought I'd start here.

If I try to fetch a feed which requires authentication, using the URL structure http://username:password@www.example.com, it works when spider_threads=0 in config.ini but fails when spider_threads=1 (or more).

Here's the output from the spider.py test program when spider_threads=1:
1207331661.923120 Socket timeout set to 20 seconds
1207331662.161323 Fetching http://username:password@www.example.com/extrss.php?type=custom&forumids=16%2C39&lastpost=1&fulldesc=1 via 0
1207331662.162117 Error processing http://username:password@www.example.com/extrss.php?type=custom&forumids=16%2C39&lastpost=1&fulldesc=1
1207331662.219738 InvalidURL: nonnumeric port: 'password@www.example.com'
1207331662.220118   File "/homepages/41/d94174740/htdocs/home/v/planet/spider.py", line 312, in httpThread
    (resp, content) = h.request(idna, 'GET', headers=headers)
1207331662.220417   File "/homepages/41/d94174740/htdocs/home/v/planet/vendor/httplib2/__init__.py", line 780, in request
    conn = self.connections[scheme+":"+authority] = connection_type(authority)
1207331662.220724   File "/kunden/homepages/41/d94174740/htdocs/home/lib/python2.4/httplib.py", line 586, in __init__
    self._set_hostport(host, port)
1207331662.221006   File "/kunden/homepages/41/d94174740/htdocs/home/lib/python2.4/httplib.py", line 598, in _set_hostport
    raise InvalidURL("nonnumeric port: '%s'" % host[i+1:])
1207331662.262189 Error 500 while updating feed http://username:password@www.example.com/extrss.php?type=custom&forumids=16%2C39&lastpost=1&fulldesc=1
1207331662.283040 Finished threaded part of processing.

and here is the output when spider_threads=0:
1207331728.759372 Socket timeout set to 20 seconds
1207331728.760210 Building work queue
1207331731.451615 Updating feed http://username:password@www.example.com/extrss.php?type=custom&forumids=16%2C39&lastpost=1&fulldesc=1 @ http://www.example.com/extrss.php?type=custom&forumids=16%2C39&lastpost=1&fulldesc=1
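
The traceback shows the whole "username:password@host" string reaching httplib as the host, where everything after the first colon is read as a port. One possible workaround, sketched here and not taken from Venus itself, is to strip the credentials out of the URL before the request is made and hand them to the HTTP client separately. The helper name split_credentials is hypothetical, and the code targets current Python 3 rather than the Python 2.4 in the traceback.

```python
from urllib.parse import urlsplit, urlunsplit, unquote

def split_credentials(url):
    """Return (url_without_userinfo, username, password).

    Hypothetical helper, not part of Venus or httplib2.
    IPv6 literal hosts are not handled in this sketch.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Rebuild the authority without the "user:pass@" prefix
    netloc = host if parts.port is None else "%s:%d" % (host, parts.port)
    clean = urlunsplit((parts.scheme, netloc, parts.path,
                        parts.query, parts.fragment))
    username = unquote(parts.username) if parts.username else None
    password = unquote(parts.password) if parts.password else None
    return clean, username, password

clean, user, pw = split_credentials(
    "http://username:password@www.example.com/extrss.php?type=custom")
print(clean)
```

With httplib2 the extracted pair could then be registered through Http.add_credentials(user, pw) before calling request() on the cleaned URL. Whether that is the right place to fix it in spider.py is an open question.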



-- 
devel mailing list
devel@lists.planetplanet.org
http://lists.planetplanet.org/mailman/listinfo/devel
Mary Gardiner | 11 Apr 2008 04:31

Possible bug in Venus

An author forward-dated a post by about three months. He then seems to
have moved it to the correct date, which involved changing the URL, but
the forward-dated post hung around in the cache.

I'll see if I can come up with a test case, but I'm not sure what the
actual correct solution would be.

-Mary

Sam Ruby | 11 Apr 2008 04:47

Re: Possible bug in Venus

Mary Gardiner wrote:
> An author forward-dated a post by about three months. He then seems to
> have moved it to the correct date, which involved changing the URL, but
> the forward-dated post hung around in the cache.
> 
> I'll see if I can come up with a test case, but I'm not sure what the
> actual correct solution would be.

Perhaps the "future_dates" configuration parameter may be of help:

   http://intertwingly.net/code/venus/docs/normalization.html#overrides

This configuration parameter can be set either planet wide or on a per 
feed basis.

- Sam Ruby

P.S.  The changing of the id is a crucial part of the problem you are 
seeing.  Essentially what is happening is that the person is creating a 
new entry as opposed to updating the time on the existing entry.


Mary Gardiner | 16 Apr 2008 04:54

xpath regular expressions example?

When Sam first described the xpath-sifter filter, one of the advantages
people talked about was the ability to use regular expressions in the
filters. Is this in fact possible? I know XPath 2.0 introduces regular
expressions, but I can't figure out if or how to use the matches()
function.

Something like this does not work:

require:
  //atom:title[matches(.,'Release')]

I get the following Python error:

ERROR:planet.runner:xmlXPathCompOpEval: function matches not found
Unregistered function
xmlXPathEval: 3 object left on the stack

-Mary

Mary Gardiner | 16 Apr 2008 04:51

Venus merge request: xpath-sifter configuration example

I am not fluent in XPath and yet wanted to use the xpath-sifter.py
function. In case there are others in my boat I've done a slightly more
verbose example. bzr branch is at
http://users.puzzling.org/users/mary/bzr/venus/branches/venus-xsltdoc/
and the actual documentation file is attached.

-Mary
# The xpath_sifter filter allows you to stop entries from a feed being displayed
# if they do not match a particular pattern.

# It is useful for things like only displaying entries in a particular category
# even if the site does not provide per category feeds, and displaying only entries
# that contain a particular string in their title.

# The xpath_sifter filter applies only after all feeds are normalised to Atom 1.0.
# Look in your cache to see what entries look like.

[Planet]
filters = xpath_sifter.py

# We are only interested in entries in the category "two" from this blogger, but
# he does not provide a per-category feed.
# The Atom for categories looks like this: <category term="two"/>, so here
# we filter the http://example.com/uncategorised.xml file for entries with a
# category tag with the term attribute equal to 'two'
[http://example.com/uncategorised.xml]
name = Category 'two' (from Site Without a Categorised Feed)
[xpath_sifter.py]
require:
  //atom:category[@term='two']

# The verbose blogger whose feed is below blogs about many subjects but we are
# only interested in entries about Venus. She does not use categories but
# fortunately her titles are very consistent, so we search within the title
# tag's text for the text 'Venus'
[http://example.com/verbose.xml]
name = Venus (from Verbose Site)
[xpath_sifter.py]
require:
  //atom:title[contains(.,'Venus')]
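
To try a "require" expression against a cached entry before putting it in the config, something like the following sketch may help. It uses the standard library's ElementTree instead of the libxml2 that the filter itself uses, so it understands only a subset of XPath: attribute predicates such as [@term='two'] work, contains() does not, and paths must be written relative to the entry (".//" in place of "//"). The function name passes_require and the sample entry are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Prefix-to-namespace map so "atom:" can be used in expressions
ATOM = {"atom": "http://www.w3.org/2005/Atom"}

def passes_require(entry_xml, xpath):
    """True if at least one node in the entry matches the expression."""
    entry = ET.fromstring(entry_xml)
    return len(entry.findall(xpath, ATOM)) > 0

# Made-up entry in the normalised Atom form
entry = """<entry xmlns="http://www.w3.org/2005/Atom">
  <title>Diving notes</title>
  <category term="one"/>
  <category term="two"/>
</entry>"""

print(passes_require(entry, ".//atom:category[@term='two']"))    # True
print(passes_require(entry, ".//atom:category[@term='three']"))  # False
```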

Mary Gardiner | 16 Apr 2008 05:33

Problem with translation to Atom for multiple categories?

http://feeds.feedburner.com/weblogsinc/gadling uses multiple categories
in its RSS file, but only one of them shows up in the Atom normalised
cached form.

-Mary

Sam Ruby | 17 Apr 2008 20:26

Re: Venus merge request: xpath-sifter configuration example

Mary Gardiner wrote:
> I am not fluent in XPath and yet wanted to use the xpath-sifter.py
> function. In case there are others in my boat I've done a slightly more
> verbose example. bzr branch is at
> http://users.puzzling.org/users/mary/bzr/venus/branches/venus-xsltdoc/
> and the actual documentation file is attached.

Pulled.  Thanks!

> -Mary

- Sam Ruby


Sam Ruby | 17 Apr 2008 20:44

Re: xpath regular expressions example?

Mary Gardiner wrote:
> When Sam first described the xpath-sifter filter, one of the advantages
> people talked about was the ability to use regular expressions in the
> filters. Is this in fact possible? I know XPath 2.0 introduces regular
> expressions, but I can't figure out if or how to use the matches()
> function.
> 
> Something like this does not work:
> 
> require:
>   //atom:title[matches(.,'Release')]
> 
> I get the following Python error:
> 
> ERROR:planet.runner:xmlXPathCompOpEval: function matches not found
> Unregistered function
> xmlXPathEval: 3 object left on the stack

Apparently, libxslt only supports XSLT 1.0, and with it XPath 1.0, which 
has no matches() function.

   http://www.xmlsoft.org/XSLT.html

In your case, something like contains would suffice:

   http://www.w3.org/TR/xpath#function-contains

As near as I can tell, if you require the full power of regular 
expressions, something like the following would be required:

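   import sys, re
   import libxml2
   # Read one normalised Atom entry from stdin
   doc = libxml2.parseDoc(sys.stdin.read())
   xp = doc.xpathNewContext()
   xp.xpathRegisterNs("atom", "http://www.w3.org/2005/Atom")
   # xpathEval returns a list of nodes, not a single node
   titles = xp.xpathEval("/atom:entry/atom:title")
   # Print the entry only when its title matches the pattern
   if titles and re.search('Release', titles[0].content):
       print(doc)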
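
The same check can be sketched with only the standard library, which makes it easy to try a pattern without libxml2 installed. This is an illustration written for current Python 3, not the filter shipped with Venus. The function name title_matches and the sample entries are made up.

```python
import re
import xml.etree.ElementTree as ET

ATOM = {"atom": "http://www.w3.org/2005/Atom"}

def title_matches(entry_xml, pattern):
    """True if the entry's atom:title text matches the regular expression."""
    root = ET.fromstring(entry_xml)
    # The root element is atom:entry, so the title is a direct child
    title = root.find("atom:title", ATOM)
    if title is None:
        return False
    # itertext() also collects text nested inside child elements
    text = "".join(title.itertext())
    return re.search(pattern, text) is not None

release = ('<entry xmlns="http://www.w3.org/2005/Atom">'
           '<title>Venus Release 1.2</title></entry>')
notes = ('<entry xmlns="http://www.w3.org/2005/Atom">'
         '<title>Release notes</title></entry>')

# A pattern that contains() cannot express: "Release" followed by a version
print(title_matches(release, r"Release\s+\d+\.\d+"))  # True
print(title_matches(notes, r"Release\s+\d+\.\d+"))    # False
```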

- Sam Ruby


Sam Ruby | 17 Apr 2008 21:35

Re: Problem with translation to Atom for multiple categories?

Mary Gardiner wrote:
> http://feeds.feedburner.com/weblogsinc/gadling uses multiple categories
> in its RSS file, but only one of them shows up in the Atom normalised
> cached form.

When I run

   python tests/reconstitute.py \
     http://feeds.feedburner.com/weblogsinc/gadling

I get back

...
     <updated>2008-04-17T15:00:00Z</updated>
     <category term="airline miles"/>
     <category term="AirlineMiles"/>
     <category term="first class"/>
     <category term="FirstClass"/>
     <category term="frequent flier miles"/>
     <category term="frequent flyer miles"/>
     <category term="FrequentFlierMiles"/>
     <category term="FrequentFlyerMiles"/>
     <category term="upgrade"/><feedburner:origLink 
xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">http://www.gadling.com/2008/04/17/how-do-i-upgrade-my-airline-ticket-with-miles/</feedburner:origLink>
...

Is this consistent with what you are seeing?
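
For comparing what ends up in the cache against output like the above, a small sketch such as this one lists the category terms of a single normalised entry. It is not part of Venus. The function name category_terms is made up, and it assumes the file holds one atom:entry as its root element.

```python
import xml.etree.ElementTree as ET

ATOM = {"atom": "http://www.w3.org/2005/Atom"}

def category_terms(entry_xml):
    """Return the term attribute of every atom:category directly under the entry."""
    root = ET.fromstring(entry_xml)
    return [c.get("term") for c in root.findall("atom:category", ATOM)]

# Cut-down entry for illustration only
sample = """<entry xmlns="http://www.w3.org/2005/Atom">
  <title>American Airlines: Tell its pilots your travel horror stories</title>
  <category term="American Airlines"/>
  <category term="AmericanAirlines"/>
</entry>"""

print(category_terms(sample))
```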

- Sam Ruby


Mary Gardiner | 18 Apr 2008 01:08

Re: Problem with translation to Atom for multiple categories?

On Thu, Apr 17, 2008, Sam Ruby wrote:
> Is this consistent with what you are seeing?

Yes. In one case in my cache I only got a few of the category tags, but
it's fallen out of the feed. Cache file attached, the blog entry is at
http://www.gadling.com/2008/04/15/american-airlines-tell-its-pilots-your-travel-horror-stories

Given the funny way that they turn the tags into categories in their
feed and not the actual... categories they have, I will probably filter
the text for what I'm looking for.

-Mary
<?xml version="1.0" ?><entry xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"
xmlns:planet="http://planet.intertwingly.net/"><id>http://www.gadling.com/2008/04/15/american-airlines-tell-its-pilots-your-travel-horror-stories/</id><link
href="http://feeds.feedburner.com/~r/weblogsinc/gadling/~3/270688017/" rel="alternate"
type="text/html"/><title>American Airlines: Tell its pilots your travel horror
stories</title><summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>Filed
under: <a href="http://www.gadling.com/category/airlines/" rel="tag">Airlines</a>, <a
href="http://www.gadling.com/category/consumer-activism/" rel="tag">Consumer
Activism</a></p><a href="http://flickr.com/photos/bcorreira/2117120002/"><img align="right"
alt="" border="0" height="148" hspace="4"
src="http://www.blogsmithmedia.com/www.gadling.com/media/2008/04/aa.jpg" vspace="4"
width="200"/></a>American Airlines customers: Are you angry at the company? Well, AA pilots want to
hear from you.<br/><br/>Following a week that saw the carrier cancel more than 3,000 flights, leaving
tens of thousands of passengers stranded, a group of AA pilots has just launched a Web site -- <a
href="http://www.tellyouraastory.com/">Tell Your AA Story</a> -- where they want you to air your
frustrations about the airline's recent and future performance. The site has already generated more
than 56,000 hits.<br/> <br/> Check out this welcome message, taken directly from the site (the caps are
not mine):<br/> <br/> <em>Had your travel plans destroyed by the actions of AA lately? Even if they're not
listening--we are. Whether you were traveling on business, for pleasure, or for an emergency, we realize
that the mismanagement of American Airlines has cost you dearly. It doesn't matter if you lost a day at
Disney with your family, a day of work for your business, or a major family event, the unfortunate truth is
that your life has been disrupted, your plans destroyed, your business derailed--all for one bad reason:
THE PROFIT OF A FEW AMR EXECUTIVES.</em><br/> <br/> AA's 12,000 pilots are in the middle of contract
negotiations, so the site's launch seems as much a collective bargaining move as a response to last week's
cancellations, even as the Allied Pilots Association, the pilots' union, says it is
neither.<br/><br/>If you're flying out of Boston, New York, Miami, San Francisco and a handful of other
cities today, you are likely to encounter a few dozen AA pilots outside protesting the company's recent
performance. At key hubs, 30-50 AA pilots will be passing out literature promoting their new Web site
between 11 a.m. and 2:30 p.m.<br/><br/>This just as American says it is back up and running at a full
schedule (the pilots say their actions will not translate into flight delays). AA clearly has more than
just frustrated customers to be concerned about. I'm reminded of Grant's <a
href="http://www.gadling.com/2008/04/04/honored-american-airlines-flight-attendant-rejects-award-comp/">post</a>
earlier this month about an AA flight attendant who used an awards ceremony as a forum to take AA management
to the woodshed over a whole host of consumer and safety issues.<p style="clear: both; padding: 8px 0 0 0;
height: 2px; font-size: 1px; border: 0; margin: 0; padding: 0;"> </p><p><a
href="http://www.gadling.com/2008/04/15/american-airlines-tell-its-pilots-your-travel-horror-stories/"
rel="bookmark" title="Permanent link to this entry">Permalink</a> | <a
href="http://www.gadling.com/forward/1167837/" title="Send this entry to a friend via
email">Email this</a> | <a
href="http://www.technorati.com/cosmos/search.html?rank=&amp;fc=1&amp;url=http://www.gadling.com/2008/04/15/american-airlines-tell-its-pilots-your-travel-horror-stories/"
title="Linking Blogs">Linking Blogs</a> | <a
href="http://www.gadling.com/2008/04/15/american-airlines-tell-its-pilots-your-travel-horror-stories/#comments"
title="View reader comments on this entry">Comments</a></p><div class="feedflare">
<a href="http://feeds.feedburner.com/~f/weblogsinc/gadling?a=xcFqAng"><img border="0"
src="http://feeds.feedburner.com/~f/weblogsinc/gadling?i=xcFqAng"/></a> <a
href="http://feeds.feedburner.com/~f/weblogsinc/gadling?a=ufmZmOg"><img border="0" src="http://feeds.feedburner.com/~f/weblogsinc/gadling?i=ufmZmOg"/></a>
</div><img height="1" src="http://feeds.feedburner.com/~r/weblogsinc/gadling/~4/270688017"
width="1"/></div></summary><updated planet:format="April 15, 2008 08:30
AM">2008-04-15T08:30:00Z</updated><category term="American Airlines"/><category
term="AmericanAirlines"/><feedburner:origLink
xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">http://www.gadling.com/2008/04/15/american-airlines-tell-its-pilots-your-travel-horror-stories/</feedburner:origLink><author><name>Jeffrey
White</name></author><source><id>http://www.gadling.com</id><logo>http://www.gadling.com/media/feedlogo.gif</logo><link
href="http://www.gadling.com" rel="alternate" type="text/html"/><link
href="http://www.gadling.com/rss.xml" rel="self"
type="application/rss+xml"/><rights>Copyright 2008 Weblogs, Inc. The contents of this feed are
available for non-commercial use
only.</rights><subtitle>Gadling</subtitle><title>Gadling</title><updated
planet:format="April 16, 2008 03:27
AM">2008-04-16T03:27:41Z</updated><planet:format>rss20</planet:format><planet:activity_threshold>60</planet:activity_threshold><planet:name>Gadling
(SCUBA
only)</planet:name><planet:bozo>false</planet:bozo><planet:http_etag>1mLjS4AlrF4/nXhIn/29P1oPf0s</planet:http_etag><planet:css-id>gadling-scuba-only</planet:css-id><planet:http_last_modified>Wed,
16 Apr 2008 03:17:48 GMT</planet:http_last_modified><planet:http_status>200</planet:http_status><planet:days_per_page>5</planet:days_per_page></source></entry>