python - How to strip characters interfering with Beautiful Soup returning links with specified text?


I am trying to do two things with Beautiful Soup:

  1. Find and print divs with a certain class
  2. Find and print links that contain certain text

The first part is working. The second part is returning an empty list, that is, []. In trying to troubleshoot this, I created the following, which works as intended:

    from bs4 import BeautifulSoup

    def my_funct():
        content = "<div class=\"class1 class2\">some text</div> \
            <a href='#' title='text blah5454' onclick='blahblahblah'>text blah5454</a>"
        soup = BeautifulSoup(content)
        # Part 1: divs with the given classes
        thing1 = soup("div", "class1 class2")
        # Part 2: links whose text is "text"
        thing2 = soup("a", text="text")
        print(thing1)
        print(thing2)

    my_funct()

I then looked at the source of the original content (of the actual implementation) in the SciTE editor. The one difference is that there is an LF and four ->'s on a new line between "text" and "blah5454" in the link text, for example:

[screenshot: link source in SciTE, with an LF and four ->'s inside the link text]

I therefore think this is the reason I am getting the empty [].

My questions are:

  1. Is this the cause?
  2. If so, is the best solution to 'strip' these characters, and if so, what is the best way to do that?

The text parameter matches on the whole text content. You need to use a regular expression instead:

    import re

    thing2 = soup("a", text=re.compile(r"\btext\b"))
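The difference between the two kinds of match can be modelled without Beautiful Soup at all. The following is only a sketch using the standard `re` module: `link_text` is a made-up stand-in for the link text described in the question (assuming the ->'s shown by SciTE are tab characters), and `matches_exact` and `matches_pattern` are hypothetical helper names, not part of bs4.

```python
import re

# Hypothetical stand-in for the real link text: "text", a line feed,
# four tabs (assumed to be what SciTE shows as ->), then "blah5454".
link_text = "text\n\t\t\t\tblah5454"

def matches_exact(candidate, wanted):
    # A plain string only matches when it equals the whole text.
    return candidate == wanted

def matches_pattern(candidate, pattern):
    # A compiled pattern matches when it is found anywhere in the text.
    return pattern.search(candidate) is not None

word = re.compile(r"\btext\b")

print(matches_exact(link_text, "text"))        # False
print(matches_exact("text blah5454", "text"))  # False
print(matches_pattern(link_text, word))        # True
print(matches_pattern("context", word))        # False
```

In this model the exact comparison fails with or without the extra whitespace, while the pattern is found in both cases, so the line feed and tabs would not need to be stripped for the regular expression to match.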

The \b word boundary anchors make sure you match the whole word only, not a partial word. Mind the r'' raw string literal used here: \b means something different when interpreted as a normal string; you'd have to double the backslashes if you don't use a raw string literal here.
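A short illustration of that raw-string point, using only the standard library. The variable names are made up for this sketch.

```python
import re

normal = "\btext\b"      # "\b" here is the backspace character (0x08)
raw = r"\btext\b"        # the backslash and the "b" stay as two characters
doubled = "\\btext\\b"   # doubled backslashes give the same pattern as raw

print(len(normal), len(raw))   # 6 8
print(raw == doubled)          # True

# Only the raw (or doubled) form reaches the regex engine as \b anchors.
print(re.search(raw, "text blah5454") is not None)     # True
print(re.search(normal, "text blah5454") is not None)  # False
```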

Demo:

    >>> import re
    >>> from bs4 import BeautifulSoup
    >>> content = "<div class=\"class1 class2\">some text</div> \
    ...         <a href='#' title='wooh!' onclick='blahblahblah'>text blah5454</a>"
    >>> soup = BeautifulSoup(content)
    >>> soup("a", text='text')
    []
    >>> soup("a", text=re.compile(r"\btext\b"))
    [<a href="#" onclick="blahblahblah" title="wooh!">text blah5454</a>]
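If the stray line feed and tabs from the question do need to be removed, for example to compare or print the link text, one possible approach is to collapse whitespace in plain Python. This is a sketch only: `normalize_ws` is a hypothetical helper and `raw_text` is an invented sample, assuming the ->'s shown by SciTE are tabs.

```python
import re

def normalize_ws(text):
    # Collapse every run of whitespace (spaces, tabs, line feeds) into a
    # single space and trim both ends.
    return " ".join(text.split())

raw_text = "text\n\t\t\t\tblah5454"   # hypothetical link text
clean = normalize_ws(raw_text)

print(repr(clean))                # 'text blah5454'
print(clean == "text blah5454")   # True

# Alternatively, leave the text alone and allow any whitespace in the pattern.
pattern = re.compile(r"\btext\s+blah5454\b")
print(pattern.search(raw_text) is not None)   # True
```

Beautiful Soup also accepts a function for the text argument, so a helper like this could be used there; that wiring is left out of the sketch.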
