Joe, Cameron, thank you for your help!
I did not know the sed and tr commands; they are totally new to me and the
syntax is really complex, and I don't know perl either. I did not understand
exactly what you meant, but I tried copying the syntax you sent, pretty much as it is.
The only thing that worked a bit seems to be this:
sed -n -e 's/\(<a[^h]*href="\)\([^"]*\)"\([^>]*>\)/\2/g; p'
so I was able to pull the URL out of the <a> tag, which is good :)
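To make the effect of that command concrete, here is a minimal sketch run on a made-up one-line sample. The sample text, the example.com URL and the function name extract_href are assumptions for illustration only, not taken from the real file.

```shell
# Made-up sample line; the text and URL are assumptions for illustration.
sample='<p>read <a href="http://example.com/page">this page</a> now</p>'

# Hypothetical wrapper around the same substitution as above:
# the whole opening <a ...> tag is replaced by the value of its href.
extract_href() {
    sed -n -e 's/\(<a[^h]*href="\)\([^"]*\)"\([^>]*>\)/\2/g; p'
}

printf '%s\n' "$sample" | extract_href
# prints: <p>read http://example.com/pagethis page</a> now</p>
```

Note that the URL ends up glued to the link text, and that the closing </a> and the other tags are still there.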
I could not split at the ends of lines, because it is one long text with no
"\n" characters.
I just need to strip out all the other HTML tags (I don't need them) and
keep only the plain text outside the tags.
Then I will have a file like
url texttext text text text text url text text text text text
and then I should be ok to handle it.
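One possible way to get there, as a rough sketch only: flatten the whitespace with tr, replace each opening <a> tag with its URL, then delete every remaining tag. The function name html_to_text is hypothetical, and the sketch assumes that href values are double-quoted and that no ">" appears inside an attribute value. It uses a slightly stricter anchor pattern than the command above, and adds a space after the URL so that it does not run into the link text.

```shell
# Hypothetical helper: reduce an HTML file to "url text url text ..." on one line.
# Assumes double-quoted href values and no '>' inside attribute values.
html_to_text() {
    # 1. turn TABs and newlines into spaces; echo adds one final newline
    # 2. squeeze runs of spaces into a single space
    # 3. replace each <a ... href="URL" ...> opening tag with "URL "
    # 4. delete every remaining tag
    ( tr '\011\012' '  ' < "$1"; echo ) |
        tr -s ' ' |
        sed -e 's/<a [^>]*href="\([^"]*\)"[^>]*>/\1 /g' \
            -e 's/<[^>]*>//g'
}

# Usage (data.html is the file name used elsewhere in this thread):
#   html_to_text data.html > data.txt
```

This is regex-based stripping, so it will be wrong for unusual markup (comments, scripts, single-quoted attributes); for the simple case described here it may be enough.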
Is there maybe a GUI for using sed on the Mac?
Thank you for your help!
Luigi
On Wed, Apr 25, 2012 at 12:14 AM, Cameron Simpson <cs <at> zip.com.au> wrote:
> Joe has the truth here. I'm just adding a few remarks.
>
> On 24Apr2012 13:05, Joe Gain <joe.gain <at> gmail.com> wrote:
> | one of sed's problems is dealing with newline characters
>
> Remark: specifically because it deals with a line at a time.
>
> | and you have
> | your html anchor tags split over new lines. There's probably a smart
> | way to overcome this using sed (and sed's line buffer), but it's not
> | an easy alternative.
> |
> | I think the best way for you is to:
> |
> | 1. remove all the new line characters, or at least don't split tags
> | with new lines (there are many ways to do this, depending on which
> | version of sed you are using, could be as easy as # sed -n -e
> | 's/\n//g; p' data.html)
>
> Or, not using sed:
>
> ( tr -d '\012' < data.html ; echo ) | sed other-sed-work-now....
>
> This removes all the newlines using tr, then adds one at the end with
> echo, thus putting all the text on a single line.
>
> Personally I'd turn newlines into spaces, otherwise a word break across a
> line like this:
>
> <p> some words
> here</p>
>
> would become
>
> <p> some wordshere</p>
>
> So:
>
> ( tr '\012' ' ' < data.html ; echo ) | sed other-sed-work-now....
>
> Remember, sed is not your only tool.
>
> | 2. remove the surrounding tags from the text that you want in your
> | anchor attributes with something like:
> | sed -n -e 's/\(<a[^h]*href="\)\([^"]*\)"\([^>]*>\)/\2/g; p'
>
> I have my concerns here, specifically the "[^h]". It will match "<a href"
> as intended, but it will also match all sorts of other undesired things.
>
> <a *href=
>
> is more reliable, matching spaces. It does presume no TAB characters.
>
> I would be inclined to use even more tr at the outset, to hammer the
> text flatter:
>
> ( tr '\011\012' ' ' < data.html ; echo ) | tr -s ' ' | sed ...
>
> So:
> - turn newlines and TABs into spaces
> - turn multiple spaces into a single space
>
> That way you can write patterns like this:
>
> <a href=
>
> without having to cope with more than one space; it makes the regexps
> much easier to write and read.
>
> Cheers,
> --
> Cameron Simpson <cs <at> zip.com.au> DoD#743
> http://www.cskk.ezoshosting.com/cs/
>
> My venus fly trap is higher up the food chain than I am.
> - simon <at> ohm.york.ac.uk (Simon Klyne)
>
--
Luigi Assom
Skype contact: oggigigi