python - How to deal with unknown encoding when scraping webpages?
This question already has an answer here:
- Determine the encoding of text in Python (8 answers)
I'm scraping news articles from various sites, using GAE and Python.
The code, which scrapes one article URL at a time, leads to the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 8858: ordinal not in range(128)
Here's my code in its simplest form:
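    from google.appengine.api import urlfetch

    def fetch(url):
        headers = {'User-Agent': "Chrome/11.0.696.16"}
        # Pass headers by keyword: the second positional argument of
        # urlfetch.fetch is the request payload, not the headers.
        result = urlfetch.fetch(url, headers=headers)
        if result.status_code == 200:
            # result.content is the raw, undecoded response body
            return result.content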
Here is a variant I have tried, with the same result:
    def fetch(url):
        headers = {'User-Agent': "Chrome/11.0.696.16"}
        result = urlfetch.fetch(url, headers=headers)
        if result.status_code == 200:
            s = result.content
            # Assumes the page is UTF-8, which is not true for every site
            s = s.decode('utf-8')
            s = s.encode('utf-8')
            s = unicode(s, 'utf-8')
            return s
Here's an ugly, brittle one, which doesn't work either:
    def fetch(url):
        headers = {'User-Agent': "Chrome/11.0.696.16"}
        result = urlfetch.fetch(url, headers=headers)
        if result.status_code == 200:
            s = result.content
            # Try a handful of encodings one after another
            try:
                s = s.decode('iso-8859-1')
            except:
                pass
            try:
                s = s.decode('ascii')
            except:
                pass
            try:
                s = s.decode('gb2312')
            except:
                pass
            try:
                s = s.decode('windows-1251')
            except:
                pass
            try:
                s = s.decode('windows-1252')
            except:
                s = "did not work"
            s = s.encode('utf-8')
            s = unicode(s, 'utf-8')
            return s
The last variant returns s as the string "did not work" from the last except.
So, am I going to have to expand this clumsy try/except construction to encompass all possible encodings (will that even work?), or is there an easier way?
Why have I decided to scrape the entire HTML rather than use BeautifulSoup right away? Because I want to do the soupifying later, to avoid a DeadlineExceededError in GAE.
Have I read all the excellent articles about Unicode, and how it should be done? Yes. However, I have failed to find a solution that does not assume I know the incoming encoding, which I don't, since I'm scraping different sites every day.
I had the same problem some time ago, and there is nothing that is 100% accurate. What I did was:
- Get the encoding from the Content-Type header
- Get the encoding from the meta tags
- Detect the encoding with the chardet Python module
- Decode the text from the most common encoding to Unicode
- Process the text/HTML
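The steps above could be sketched roughly as follows. This is only one possible reading of the list, not the answerer's actual code. It is written for Python 3, all function names are hypothetical, chardet is treated as an optional third-party dependency, and the UTF-8 fallback is an assumption.

```python
import re

try:
    import chardet  # third-party; optional in this sketch
except ImportError:
    chardet = None

# Hypothetical helpers; none of these names come from the original answer.
HEADER_CHARSET_RE = re.compile(
    r'charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)
META_CHARSET_RE = re.compile(
    br'<meta[^>]+charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)', re.IGNORECASE)


def charset_from_content_type(content_type):
    """Step 1: pull charset=... out of a Content-Type header value."""
    if not content_type:
        return None
    match = HEADER_CHARSET_RE.search(content_type)
    return match.group(1) if match else None


def charset_from_meta(raw):
    """Step 2: look for a charset in <meta> tags near the top of the page."""
    match = META_CHARSET_RE.search(raw[:4096])
    return match.group(1).decode('ascii') if match else None


def charset_from_chardet(raw):
    """Step 3: statistical guess; None if chardet is not installed."""
    if chardet is None:
        return None
    return chardet.detect(raw).get('encoding')


def decode_html(raw, content_type=None, fallback='utf-8'):
    """Step 4: try each candidate in order; return (text, encoding_used)."""
    candidates = [
        charset_from_content_type(content_type),
        charset_from_meta(raw),
        charset_from_chardet(raw),
        fallback,  # assumed default, not taken from the answer
    ]
    for encoding in candidates:
        if not encoding:
            continue
        try:
            return raw.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            continue  # wrong guess or unknown codec name; try the next one
    # Last resort: never raise, replace undecodable bytes instead.
    return raw.decode(fallback, errors='replace'), fallback
```

With the fetch function from the question, this would presumably be called as decode_html(result.content, result.headers.get('Content-Type')), and the returned text handed to BeautifulSoup later. Since no step is fully reliable, the order of the candidates is a judgment call: declared encodings are tried first here because they are cheap to read, with detection only as a backup.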