Python: How to strip characters interfering with Beautiful Soup returning links with specified text?
I am trying to do two things with Beautiful Soup:
- Find and print divs with a certain class
- Find and print links that contain certain text
The first part is working. The second part is returning an empty list, that is, []. In trying to troubleshoot this, I created the following, which works as intended:
    from bs4 import BeautifulSoup

    def my_funct():
        content = "<div class=\"class1 class2\">some text</div> \
    <a href='#' title='text blah5454' onclick='blahblahblah'>text blah5454</a>"
        soup = BeautifulSoup(content)
        # Calling the soup object directly is shorthand for soup.find_all(...)
        thing1 = soup("div", "class1 class2")
        thing2 = soup("a", text="text")
        print(thing1)
        print(thing2)

    my_funct()

After looking at the source of the original content (of the actual implementation) in the SciTE editor, however, I found one difference: there is an LF and four ->'s on a new line between text and blah5454 in the link text, for example:

And therefore I think that is the reason I am getting the empty [].
My questions are:
- Is this the cause?
- If so, is the best solution to 'strip' these characters, and if so, what is the best way to do that?
The text parameter matches on the whole text content. You need to use a regular expression instead:
    import re

    thing2 = soup("a", text=re.compile(r"\btext\b"))

The \b word boundary anchors make sure you only match the whole word, not a partial word. Mind the r'' raw string literal used here; \b means something different when interpreted in a normal string, so you'd have to double the backslashes if you don't use a raw string literal here.
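To make the raw string point concrete, here is a minimal sketch that uses only the standard re module. The variable names are purely illustrative and not part of the original answer; the sample link text is the one from the question.

```python
import re

# In a normal string literal, "\b" is the backspace character (ASCII 8).
plain = "\btext\b"
# In a raw string literal, r"\b" stays a backslash followed by "b", which
# the re module interprets as a word-boundary anchor.
raw = r"\btext\b"
# Doubling the backslashes in a normal string gives the same pattern.
doubled = "\\btext\\b"

link_text = "text blah5454"

raw_match = re.search(raw, link_text)        # matches the whole word "text"
plain_match = re.search(plain, link_text)    # looks for literal backspaces
partial_match = re.search(raw, "context")    # "text" is not a whole word here

print(raw_match is not None, plain_match is not None, partial_match is not None)
```

Only the raw (or doubled-backslash) pattern finds the word; the plain string searches for backspace characters that are not present in the link text.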
demo:
    >>> import re
    >>> from bs4 import BeautifulSoup
    >>> content = "<div class=\"class1 class2\">some text</div> \
    ... <a href='#' title='wooh!' onclick='blahblahblah'>text blah5454</a>"
    >>> soup = BeautifulSoup(content)
    >>> soup("a", text='text')
    []
    >>> soup("a", text=re.compile(r"\btext\b"))
    [<a href="#" onclick="blahblahblah" title="wooh!">text blah5454</a>]
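As for the LF and ->'s seen in SciTE: the sketch below assumes those markers are tab characters, which is a guess based on the question rather than something the original page confirms. It uses only the standard library to show why an exact comparison fails on such text, while a regular expression with \s+ (or collapsing the whitespace first) still works. The helper normalize_whitespace is a hypothetical name introduced for illustration.

```python
import re

# Hypothetical link text as it might appear in the real page: an LF and
# four tabs between the two words (an assumption, see above).
link_text = "text\n\t\t\t\tblah5454"

# An exact string comparison fails because of the embedded whitespace.
exact_match = (link_text == "text blah5454")

# \s+ matches any run of whitespace, including newlines and tabs.
pattern = re.compile(r"\btext\s+blah5454\b")
regex_match = pattern.search(link_text) is not None

def normalize_whitespace(s):
    """Collapse every run of whitespace into a single space."""
    return " ".join(s.split())

normalized = normalize_whitespace(link_text)

print(exact_match, regex_match, repr(normalized))
```

A compiled pattern like this one can be passed as the text argument in the same way as in the demo, so there should be no need to strip the characters from the markup beforehand.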