Siddhartha Jonnalagadda | 1 Aug 02:26 2010

How do we extract actual text in html?

Is it trivial to extract the title and relevant text (ignoring the ads and other irrelevant stuff)? For example, in the website: http://tvnz.co.nz/world-news/chelsea-clinton-marries-in-ny-3680168

I am only interested in extracting the title: "Chelsea Clinton marries in NY"
and the subject below. How easy is this?

"Bill and Hillary Clinton's daughter married her long-time boyfriend in the picturesque New York village of Rhinebeck today in what has been dubbed America's royal wedding.

Chelsea Clinton - the only child of the former US president and the US secretary of state - wed Marc Mezvinsky at Astor Courts, an historic 50-acre (20-hectare) estate on the Hudson River, about 160 km north of New York City.

"Today, we watched with great pride and overwhelming emotion as Chelsea and Marc wed in a beautiful ceremony at Astor Courts, surrounded by family and their close friends," Bill and Hillary Clinton said in a statement.

"We could not have asked for a more perfect day to celebrate the beginning of their life together, and we are so happy to welcome Marc into our family," the statement said.

"On behalf of the newlyweds, we want to give special thanks to the people of Rhinebeck for welcoming us and to everyone for their well-wishes on this special day."

The statement, sent just after 7:30 pm (12:30pm NZT today), did not indicate exactly when the nuptials took place.

On Friday night, Bill and Hillary Clinton waved to crowds of onlookers as they arrived at the historic Beekman Arms Inn in the center of Rhinebeck for a late-night cocktail party for some of the wedding guests.

Apart from the parents of the bride, the only other high profile guests seen in Rhinebeck have been Bill Clinton's former secretary of state, Madeleine Albright, actors Ted Danson and Mary Steenburgen and fashion designer Vera Wang.

Also spotted was real estate scion and movie producer billionaire Steve Bing. Bing lent Bill Clinton his jet to fly to North Korea in August of last year to bring home American journalists Laura Ling and Euna Lee after they spent four months imprisoned in the reclusive communist state.

Guests boarded buses in Rhinebeck to be taken to Astor Courts about 5 pm EDT (10am NZT)

Chelsea Clinton, 30, and Mezvinsky, 32, have known each other since they were teenagers. He is an investment banker, whose parents Marjorie Margolies-Mezvinsky and Edward Mezvinsky were once Democratic US House of Representatives members.

Chelsea Clinton, who worked at a New York hedge fund and has more recently studied health policy at Columbia University, has kept a low profile since her father left the White House in January 2001, although she campaigned for her mother during her failed run for the 2008 Democratic presidential nomination.

Signs and pictures congratulating the newlyweds hang in many shop windows in Rhinebeck, which has been swarmed by media around the world for an event that experts estimate to have cost between $US3 million and $US5 million.

Airspace above Rhinebeck has been closed for 12 hours from 3 pm EDT (7pm NZT) today for the wedding and media were kept well away from the entrance to Astor Courts. Security in the area was comparable to that surrounding state visits.

The guest list was reported to be between 400 and 500, but did not include a very understanding President Barack Obama.

"Hillary and Bill properly want to keep this as a thing for Chelsea and her soon-to-be husband," Obama said on The View talk show on Thursday. "It would be tough enough to have one president at a wedding. You don't want two presidents."

"
_______________________________________________
Corpora mailing list
Corpora <at> uib.no
http://mailman.uib.no/listinfo/corpora
Nitin Madnani | 1 Aug 02:57 2010

Re: How do we extract actual text in html?

http://www.crummy.com/software/BeautifulSoup/

- Nitin


Siddhartha Jonnalagadda | 1 Aug 05:18 2010

Re: How do we extract actual text in html?

something in Java?

Andrew.Lampert | 1 Aug 13:09 2010

Re: How do we extract actual text in html?

In Java, I've used Jericho (http://jerichohtml.sourceforge.net/) to good effect for pulling out plain
text from HTML. It won't do everything you need, but it might be a good starting point.
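
As a rough illustration of what "pulling out plain text" involves, here is a minimal sketch in Python using only the standard library's `html.parser`. It is not Jericho and not its API; the class and function names (`TitleAndTextExtractor`, `extract_title_and_text`) are made up for this example. It collects the `<title>` text and the visible text while skipping script and style content. Note that it keeps ads and navigation text, which is why plain-text extraction alone is only a starting point.

```python
from html.parser import HTMLParser


class TitleAndTextExtractor(HTMLParser):
    """Collect the <title> text and the visible text of an HTML page."""

    # Elements whose content is never visible text (an assumption of this sketch).
    SKIPPED = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title_parts = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return
        text = " ".join(data.split())  # normalise whitespace
        if not text:
            return
        if self._in_title:
            self.title_parts.append(text)
        else:
            self.text_parts.append(text)


def extract_title_and_text(html):
    """Return (title, visible_text) for an HTML string."""
    parser = TitleAndTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.title_parts), "\n".join(parser.text_parts)
```

Calling `extract_title_and_text(page_source)` on a downloaded page returns the title and one line per text run; separating the article body from the rest still needs a further filtering step.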

Cheers,
Andrew
________________________________________
From: corpora-bounces <at> uib.no [corpora-bounces <at> uib.no] On Behalf Of Siddhartha Jonnalagadda [sid.kgp <at> gmail.com]
Sent: Sunday, 1 August 2010 1:18 PM
To: Nitin Madnani
Cc: corpora <at> uib.no
Subject: Re: [Corpora-List] How do we extract actual text in html?

something in Java?

On Sat, Jul 31, 2010 at 5:57 PM, Nitin Madnani <nmadnani <at> gmail.com> wrote:
http://www.crummy.com/software/BeautifulSoup/

- Nitin


Oltramari ISTC-CNR | 2 Aug 10:33 2010

Final Program: OntoLex 2010 <at> COLING

***APOLOGIES FOR MULTIPLE POSTINGS - PLEASE DISTRIBUTE***
==================================================================
                  COLING 2010 Workshop

  The 6th Workshop on "Ontologies and Lexical Resources (OntoLex 2010)" 
			http://www.loa-cnr.it/ontolex2010
-----------------------------------------------------------------

             Beijing, China, August 21, 2010
                        COLING 2010
FINAL PROGRAM
Check it out at: http://www.loa-cnr.it/ontolex2010 

INVITED TALK
Prof. Chu-Ren Huang, The Hong Kong Polytechnic University

REGISTRATION
http://www.coling-2010.org/Registration.htm

Introduction

As human linguistic practice reveals, accessing concepts through natural
language is the implicit pathway for enabling mutual comprehension and
effective meaning negotiation between agents in a community. But, in order
to exchange knowledge, we need to share the conceptual models underlying the
lexicon, namely ontologies. These remarks become even more crucial when
focusing on human-computer interaction. In this context, computational
ontologies and human-language technologies converge in the task of providing
the semantic description of knowledge contents (e.g. multimedia, web
resources, services, etc.): underlying intended models need to be made
explicit in order to become accessible by artificial agents and sharable
with humans. According to this picture, 1) computational lexicons, whose aim
is to make lexical-content machine-understandable, constitute a fundamental
component to foster the (mono- and multi-linguistic) access to any knowledge
content; 2) computational ontologies, on the other side, are necessary to
capture the logical structure of those knowledge contents: both contribute
to dig out the basic elements of a given semantic space (domain-dependent or
general), characterizing the different relations holding among them.
In this general framework, the contributions presented under the scope of
OntoLex 2010 (Ontologies and Lexical Resources) show a variety of
approaches in many respects. Some of the papers describe
the different construction processes of semantic resources (e.g., Daoud et
al. and Nagata deal with two approaches based on Wikipedia); other papers
are especially concerned with specific tasks and applications. Regarding the
latter aspect, some contributions present proposals to enhance
interoperability within the various standardization formats for linguistic
and terminological descriptions (Peters, Vossen et al.) as well as
exploiting specific algorithms for ontology matching. Some papers also focus
on formal ontology, both at the level of theoretical analysis and at the
level of specific categories and relations (see for example the paper by
Bogulaslavsky). The investigated domains span from bio-surveillance (Conway
et al.) through medicine; sentiment/opinion mining proves to be an
emerging area of interest too (see Cadilhac et al.). Automatic techniques
and algorithms to extract terms and taxonomies are also introduced (Van der
Plas, Nagata et al., vor der Brück).
Originating in 2000, OntoLex is recognized as a common "meeting place" by a
constantly growing interdisciplinary community of lexicographers,
ontologists and computational linguists. Traditionally represented by
researchers and practitioners from a variety of backgrounds (acquisition of
lexical knowledge, ontology-based approaches to information extraction,
ontology learning, ontology matching, etc.), OntoLex 2010's contributions
confirm this trend in the Sixth edition of the workshop too, hosted by
COLING conference for the first time. We think that the comprehensive
perspective emerging from the 10 articles collected in these proceedings can
help in progress towards next-generation knowledge systems based on the
integration between ontologies and lexical resources.


Beatrice Alex | 1 Aug 20:08 2010

Re: How do we extract actual text in html?

You might want to check out Boilerpipe:

http://code.google.com/p/boilerpipe/

Best,

Bea

------------------
Beatrice Alex
Research Fellow and Project Manager at the School of Informatics, University of Edinburgh.


The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
Ralf Krestel | 1 Aug 13:29 2010

Re: How do we extract actual text in html?

Hi,
I recommend http://code.google.com/p/boilerpipe/ based on the paper:
Christian Kohlschütter, Peter Fankhauser and Wolfgang Nejdl,
Boilerplate Detection using Shallow Text Features,
WSDM 2010 -- The Third ACM International Conference on Web Search and Data Mining New York City, NY USA.
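
To give a feel for what "shallow text features" can mean in practice, here is a small Python sketch using only the standard library. It is not boilerpipe's algorithm or API: it simply splits a page into blocks and keeps those that are long enough and contain few link words. The two features (word count and link density), the thresholds, the tag list and all names (`BlockCollector`, `content_blocks`) are assumptions chosen for illustration.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries (an assumption of this sketch).
BLOCK_TAGS = {"p", "div", "td", "li", "h1", "h2", "h3", "h4", "h5", "h6"}


class BlockCollector(HTMLParser):
    """Split a page into text blocks, counting words inside and outside links."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []        # tuples of (text, word_count, link_word_count)
        self._words = []
        self._link_words = 0
        self._in_link = 0
        self._skip = 0

    def _flush(self):
        # Close the current block, if it has any words, and start a new one.
        if self._words:
            self.blocks.append(
                (" ".join(self._words), len(self._words), self._link_words)
            )
        self._words = []
        self._link_words = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
        elif tag == "a" and self._in_link:
            self._in_link -= 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._link_words += len(words)

    def close(self):
        super().close()
        self._flush()


def content_blocks(html, min_words=10, max_link_density=0.3):
    """Keep blocks with at least min_words words and a low share of link words."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    kept = []
    for text, n_words, n_link_words in collector.blocks:
        if n_words >= min_words and n_link_words / n_words <= max_link_density:
            kept.append(text)
    return kept
```

Short blocks (menus, "Share this" buttons) and link-heavy blocks (navigation bars) are dropped, while longer running paragraphs survive. The thresholds would need tuning per site.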

Cheers,
Ralf

Wouter Weerkamp | 2 Aug 11:51 2010

Re: How do we extract actual text in html?

In 2007 there was a workshop on content extraction from web pages. You 
could have a look at the papers presented there:
http://cleaneval.sigwac.org.uk/

If you intend to follow feeds, and need to extract content from these, 
you can use a learning approach. For each feed you collect a certain 
number of pages, and you learn which part of the page changes, and which 
parts don't. From that it shouldn't be hard to determine "real" content.
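
The idea in the paragraph above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: each page is already split into a list of text blocks (for example one string per paragraph or menu entry), blocks are compared by exact string equality, and the function name and the `max_share` threshold are made up for the example.

```python
from collections import Counter


def split_template_and_content(pages, max_share=0.5):
    """For pages from one feed, keep the blocks that vary between pages.

    pages: list of pages, each a list of text blocks (strings).
    A block that occurs on more than max_share of the pages is treated as
    template (navigation, footer, ...) and dropped from every page.
    """
    n_pages = len(pages)
    seen_on = Counter()
    for blocks in pages:
        for block in set(blocks):  # count each block at most once per page
            seen_on[block] += 1
    return [
        [b for b in blocks if seen_on[b] / n_pages <= max_share]
        for blocks in pages
    ]
```

With a handful of pages per feed, menus and footers recur on every page and are filtered out, while the article text, which differs from page to page, is kept. Real pages would likely need fuzzier matching than exact equality (dates and counters in the template change between pages).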

You could also have a look at fivefilters; it works pretty well given 
the simple approach it uses:
http://fivefilters.org/content-only/
(following a few links, you can get to the (php) code).

Wouter

On 8/1/10 8:08 PM, Beatrice Alex wrote:
> You might want to check out Boilerpipe:
>
> http://code.google.com/p/boilerpipe/
>
> Best,
>
> Bea
>
> ------------------
> Beatrice Alex
> Research Fellow and Project Manager at the School of Informatics, University of Edinburgh.

-- 
ISLA * University of Amsterdam * http://ilps.science.uva.nl


Siddhartha Jonnalagadda | 2 Aug 13:24 2010

Re: How do we extract actual text in html?

Thanks all for your replies. I am trying BoilerPipe now; will also look into the other things mentioned.

thanks again,
siddhartha

Anil Singh | 2 Aug 16:13 2010

Re: How do we extract actual text in html?

Cleaneval is a good place to learn about the problems and many of the solutions. However, my experience is that it ultimately depends on your exact needs. The methods can be broadly divided into two classes: deterministic and learning-based. Unless you want to work on data with completely arbitrary formats, learning doesn't seem to be a good idea.

There is some code for text extraction from HTML documents, along with one or two utilities, in Sanchay, but there is no documentation and it is not connected to the current public GUI. The available code will have to be slightly modified for specific formats: it is simple code that uses the HTML parser library to effectively create a template for extracting a specific format. For a single format, this is not very time-consuming.
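
To illustrate what such a template can look like (this is not the Sanchay code, only a rough Python sketch using the standard html.parser module), the class below assumes one fixed layout: the title is in the first h1 element and the article paragraphs sit inside a div with class "article-body". Both of these are hypothetical; they have to be read off the markup of the actual site:

```python
from html.parser import HTMLParser


class ArticleTemplate(HTMLParser):
    """Extract a title and body paragraphs for one known page layout.

    TITLE_TAG and BODY_CLASS are hypothetical; inspect the target
    site's markup and adjust them.
    """

    TITLE_TAG = "h1"
    BODY_CLASS = "article-body"

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._in_title = False
        self._body_depth = 0  # > 0 while inside the body container
        self._in_para = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.TITLE_TAG and not self.title:
            self._in_title = True
            self._buffer = []
        elif tag == "div":
            # Track nesting so the matching closing div ends the body.
            if self._body_depth or self.BODY_CLASS in classes:
                self._body_depth += 1
        elif tag == "p" and self._body_depth:
            self._in_para = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == self.TITLE_TAG and self._in_title:
            self.title = " ".join("".join(self._buffer).split())
            self._in_title = False
        elif tag == "div" and self._body_depth:
            self._body_depth -= 1
        elif tag == "p" and self._in_para:
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._in_para = False

    def handle_data(self, data):
        if self._in_title or self._in_para:
            self._buffer.append(data)


def extract_article(html):
    """Return (title, list of paragraphs) for a page in the assumed layout."""
    parser = ArticleTemplate()
    parser.feed(html)
    parser.close()
    return parser.title, parser.paragraphs
```

Supporting another site then means changing only the two class attributes, or subclassing, which is why a single format takes little time. Note that anything nested inside the body container, such as an embedded ad block, is still picked up and would need an extra rule.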
