python - How to strip characters interfering with Beautiful Soup returning links with specified text? -

February 15, 2013

I am trying to do two things:

find and print divs with a certain class
find and print links that contain certain text

The first part is working. The second part is returning an empty list, that is, []. In trying to troubleshoot this, I created the following, which works as intended:

    from bs4 import BeautifulSoup

    def my_funct():
        # Sample markup: one div with two classes and one link.
        content = "<div class=\"class1 class2\">some text</div> \
            <a href='#' title='text blah5454' onclick='blahblahblah'>text blah5454</a>"
        soup = BeautifulSoup(content)
        # Part 1: divs with the given class attribute.
        thing1 = soup("div", "class1 class2")
        # Part 2: links whose text is "text".
        thing2 = soup("a", text="text")
        print(thing1)
        print(thing2)

    my_funct()

I then looked at the source of the original content (of the actual implementation) in the SciTE editor. The one difference is that there is an LF and four ->'s on a new line between text and blah5454 in the link text, for example:

[Screenshot: the link text in SciTE, showing the LF and the four -> markers between text and blah5454]

I therefore think this is the reason I am getting an empty [].

My questions are:

Is this the cause?
If so, what is the best solution? Should I 'strip' these characters, and if so, what is the best way to do that?

The text parameter matches on the whole text content. You need to use a regular expression instead:

    import re

    thing2 = soup("a", text=re.compile(r"\btext\b"))

The \b word boundary anchors make sure you match the whole word only, not a partial word. Mind the r'' raw string literal used here: \b means something different when interpreted as a normal string, so you'd have to double the backslashes if you don't use a raw string literal.
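To make the raw string point concrete, here is a small sketch using only the standard re module. The variable names are made up for illustration; the link text is the one from the question.

```python
import re

# In a normal string literal, "\b" is the backspace character (ASCII 8).
plain = "\btext\b"
# In a raw string literal the backslash is kept, so the regex engine sees \b.
raw = r"\btext\b"
# Doubling the backslashes in a normal string gives the same pattern.
doubled = "\\btext\\b"

link_text = "text blah5454"

print(len(plain), len(raw))               # 6 8
print(raw == doubled)                     # True
print(bool(re.search(raw, link_text)))    # True: whole word "text" found
print(bool(re.search(plain, link_text)))  # False: looks for literal backspaces
print(bool(re.search(raw, "context")))    # False: "text" is only part of a word
```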

Demo:

    >>> import re
    >>> from bs4 import BeautifulSoup
    >>> content = "<div class=\"class1 class2\">some text</div> \
    ...         <a href='#' title='wooh!' onclick='blahblahblah'>text blah5454</a>"
    >>> soup = BeautifulSoup(content)
    >>> soup("a", text='text')
    []
    >>> soup("a", text=re.compile(r"\btext\b"))
    [<a href="#" onclick="blahblahblah" title="wooh!">text blah5454</a>]
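As for the LF and tabs in the real page, the sketch below uses plain strings rather than Beautiful Soup to show why the regular expression is unaffected by them. The link_text value is an assumption that mimics the whitespace described in the question, and the whitespace normalization at the end is just one possible way to 'strip' those characters if you need the visible text.

```python
import re

# Hypothetical link text resembling the real page: an LF and four tabs
# between "text" and "blah5454".
link_text = "text\n\t\t\t\tblah5454"

# Exact string comparisons fail, both against the single word and
# against the single-space version of the text.
exact_word = link_text == "text"
exact_phrase = link_text == "text blah5454"

# A regex search only needs the word to appear somewhere in the string.
pattern = re.compile(r"\btext\b")
found = pattern.search(link_text) is not None

# Collapsing every run of whitespace into one space gives the visible text.
normalized = " ".join(link_text.split())

print(exact_word, exact_phrase)  # False False
print(found)                     # True
print(normalized)                # text blah5454
```

So stripping is not needed just to find the link; it only matters if you want to compare or display the link text afterwards.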
