python - How to strip characters interfering with Beautiful Soup returning links with specified text?
I am trying to do two things with Beautiful Soup:
- Find and print divs with a certain class
- Find and print links that contain certain text
The first part is working. The second part is returning an empty list, that is, []. In trying to troubleshoot this, I created the following, which works as intended:
    from bs4 import BeautifulSoup

    def my_funct():
        # Sample markup: one div with two classes and one link
        content = "<div class=\"class1 class2\">some text</div> \
        <a href='#' title='text blah5454' onclick='blahblahblah'>text blah5454</a>"
        soup = BeautifulSoup(content)
        # Calling the soup object is shorthand for find_all()
        thing1 = soup("div", "class1 class2")
        thing2 = soup("a", text="text")
        print(thing1)
        print(thing2)

    my_funct()
However, after looking at the source of the original content (from the actual implementation) in the SciTE editor, the one difference is that there is an LF and four ->'s on a new line between text and blah5454 in the link text, and I therefore think that is the reason I am getting the empty [].
My questions are:
- Is this the cause?
- If so, is the best solution to 'strip' these characters, and if so, what is the best way to do that?
The text parameter matches on the whole text content. You need to use a regular expression instead:
    import re

    thing2 = soup("a", text=re.compile(r"\btext\b"))
The \b word boundary anchors make sure you match the whole word, not a partial word. Mind the r'' raw string literal used here: \b means something different when interpreted as a normal string, so you'd have to double the backslashes if you don't use a raw string literal here.
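To make the raw string point concrete, here is a minimal sketch using only the standard re module. The sample strings are made up for illustration; the pattern is the one from the answer.

```python
import re

# In a normal string literal, "\b" is a single backspace character (0x08).
normal = "\btext\b"
# In a raw string literal the backslash is kept, so the regex engine sees \b.
raw = r"\btext\b"
# Doubling the backslashes in a normal string produces the same pattern.
doubled = "\\btext\\b"

sample = "text blah5454"        # illustrative link text
partial = "context blah5454"    # "text" appears only inside a longer word

print(len(normal), len(raw))    # the two literals differ in length
print(re.search(normal, sample))   # looks for literal backspaces around "text"
print(re.search(raw, sample))      # matches the whole word "text"
print(re.search(raw, partial))     # no match: "text" is part of "context"
print(raw == doubled)
```

The normal-string version compiles without error, which is why this mistake is easy to miss: it simply never matches.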
Demo:

    >>> from bs4 import BeautifulSoup
    >>> import re
    >>> content = "<div class=\"class1 class2\">some text</div> \
    ... <a href='#' title='wooh!' onclick='blahblahblah'>text blah5454</a>"
    >>> soup = BeautifulSoup(content)
    >>> soup("a", text='text')
    []
    >>> soup("a", text=re.compile(r"\btext\b"))
    [<a href="#" onclick="blahblahblah" title="wooh!">text blah5454</a>]
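As for the line feed and the -> markers in the real link text, the sketch below checks the idea on a plain string, without Beautiful Soup. It assumes the four -> markers shown by the editor are tab characters; link_text is a hypothetical stand-in for the real link's text, not taken from the original page.

```python
import re

# Hypothetical link text: assumes the LF and four "->" markers seen in the
# editor are a newline followed by four tab characters.
link_text = "text\n\t\t\t\tblah5454"

# An exact comparison fails because of the embedded whitespace.
exact_match = link_text == "text blah5454"

# The word-boundary pattern still finds the word, whitespace or not.
pattern = re.compile(r"\btext\b")
regex_match = pattern.search(link_text) is not None

# If the cleaned-up text is needed, collapse every run of whitespace
# (spaces, tabs, newlines) into a single space.
normalized = " ".join(link_text.split())

print(exact_match, regex_match, repr(normalized))
```

So under that assumption there is probably no need to strip anything before searching: the compiled pattern can be passed to text= as shown above, and the whitespace only needs collapsing if you want to print or compare the link text afterwards.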