ruby - Extract text between <br > tags

January 15, 2010

I am able to extract URLs using the following:

require 'open-uri'
html = open('http://lab/links.html').read  # read the page body into a String
urls = URI.extract(html)

This works great.

Now I need to extract a list of URLs without the http or https prefix, which sit between <br > tags. Since there is no http or https prefix, URI.extract doesn't work.

domain1.com/index.html<br >domain2.com/home/~john/index.html<br >domain3.com/a/b/c/d/index.php

Each unprefixed URL is between <br > tags.
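To make the problem concrete, here is a minimal sketch using only the standard library. The variable names `sample`, `unprefixed`, and `prefixed` are made up for illustration, and the sample string is the one shown above. As far as I can tell, URI.extract only matches text that starts with a scheme followed by a colon, which would explain why nothing comes back:

```ruby
require 'uri'

# Hypothetical variable holding the sample markup from the question.
sample = "domain1.com/index.html<br >domain2.com/home/~john/index.html<br >domain3.com/a/b/c/d/index.php"

# No "scheme:" anywhere in the string, so nothing is extracted.
unprefixed = URI.extract(sample)
p unprefixed  #=> []

# The same kind of URL with a scheme in front is picked up.
prefixed = URI.extract("http://domain1.com/index.html")
p prefixed    #=> ["http://domain1.com/index.html"]
```

Note that newer Ruby versions print an "obsolete" warning for URI.extract when run in verbose mode.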

~~I have been looking at using Nokogiri XPath to retrieve the text after <br> within <td>, but couldn't get it to work.~~

Output

domain1.com/index.html
domain2.com/home/~john/index.html
domain3.com/a/b/c/d/index.php

~~Intermediate solution~~

~~doc = Nokogiri::HTML(open("http://lab/noprefix_domains.html")); doc.search('br').each { |n| n.replace("\n") }; puts doc~~

~~I still need to strip out the rest of the HTML tags (!DOCTYPE, html, body, p)...~~

Solution

str = ""
# Keep only text nodes and <br> elements, dropping every other tag.
doc.traverse { |n| str << n.to_s if n.name == "text" || n.name == "br" }
puts str.split(/\s*<\s*br\s*>\s*/)

Thanks.

Assuming you already have a way to extract the example string shown in the question, you can use split on the string:

str = "domain1.com/index.html<br >domain2.com/home/~john/index.html<br >domain3.com/a/b/c/d/index.php"
str.split(/\s*<\s*br\s*>\s*/)
#=> ["domain1.com/index.html",
#    "domain2.com/home/~john/index.html",
#    "domain3.com/a/b/c/d/index.php"]

This will split the string at every <br> tag. It will also remove whitespace before and after the <br>, and allow whitespace inside the <br> tag, e.g. <br > or < br>. If you need to handle self-closing tags as well (e.g. <br />), use this regex instead:

/\s*<\s*br\s*\/?\s*>\s*/
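As a quick check of the difference between the two patterns, here is a small sketch. The `closed` string is a made-up variant of the question's sample with self-closing tags, and all variable names are only for illustration:

```ruby
# The sample from the question, plus a hypothetical self-closing variant.
plain  = "domain1.com/index.html<br >domain2.com/home/~john/index.html<br >domain3.com/a/b/c/d/index.php"
closed = "domain1.com/index.html<br />domain2.com/home/~john/index.html <br/> domain3.com/a/b/c/d/index.php"

br_open = /\s*<\s*br\s*>\s*/       # first regex: <br>, <br >, < br>
br_any  = /\s*<\s*br\s*\/?\s*>\s*/ # second regex: also <br/> and <br />

expected = ["domain1.com/index.html",
            "domain2.com/home/~john/index.html",
            "domain3.com/a/b/c/d/index.php"]

plain_parts  = plain.split(br_any)   # the second regex still handles plain tags
closed_parts = closed.split(br_any)  # and the self-closing ones
unsplit      = closed.split(br_open) # the first regex finds nothing to split on

p plain_parts == expected   #=> true
p closed_parts == expected  #=> true
p unsplit.length            #=> 1
```

Both patterns are case-sensitive as written, so markup using <BR> would need the i flag added.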
