python - How to deal with unknown encoding when scraping webpages?



I'm scraping news articles from various sites, using GAE and Python.

The code, which scrapes one article URL at a time, leads to the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 8858: ordinal not in range(128)
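The error itself is easy to reproduce outside GAE: `result.content` is a byte string, and any operation that forces an implicit ASCII decode fails on the first byte above 0x7f. A minimal illustration (the sample bytes here are invented; 0xe2 is the lead byte of many UTF-8 punctuation characters):

```python
# A byte string containing UTF-8 curly quotes; its first non-ASCII byte is 0xe2.
raw = b"curly quote: \xe2\x80\x9chi\xe2\x80\x9d"

try:
    raw.decode('ascii')  # fails, like the implicit decode in the scraper
except UnicodeDecodeError as e:
    print(e)

# The same bytes decode fine once the right encoding is known:
print(raw.decode('utf-8'))
```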

Here's the code in its simplest form:

from google.appengine.api import urlfetch

def fetch(url):
    headers = {'User-Agent': "Chrome/11.0.696.16"}
    result = urlfetch.fetch(url, headers=headers)
    if result.status_code == 200:
        return result.content

Here is a variant I have tried, with the same result:

def fetch(url):
    headers = {'User-Agent': "Chrome/11.0.696.16"}
    result = urlfetch.fetch(url, headers=headers)
    if result.status_code == 200:
        s = result.content
        s = s.decode('utf-8')
        s = s.encode('utf-8')
        s = unicode(s, 'utf-8')
        return s

Here's an ugly, brittle one, which also doesn't work:

def fetch(url):
    headers = {'User-Agent': "Chrome/11.0.696.16"}
    result = urlfetch.fetch(url, headers=headers)
    if result.status_code == 200:
        s = result.content
        try:
            s = s.decode('iso-8859-1')
        except:
            pass
        try:
            s = s.decode('ascii')
        except:
            pass
        try:
            s = s.decode('gb2312')
        except:
            pass
        try:
            s = s.decode('windows-1251')
        except:
            pass
        try:
            s = s.decode('windows-1252')
        except:
            s = "did not work"
        s = s.encode('utf-8')
        s = unicode(s, 'utf-8')
        return s

The last variant returns s as the string "did not work" from the last except clause.

So, am I going to have to expand this clumsy try/except construction to encompass all possible encodings (and will that even work?), or is there an easier way?
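The chain of try/except blocks can at least be collapsed into a loop that returns the first strict decode that succeeds. This is a sketch, not a full answer; the function name `decode_best_effort` and the candidate list are my own, not from the question. Order matters, because permissive codecs like iso-8859-1 accept any byte string and must come last:

```python
# Candidate encodings, strictest first: utf-8 rejects most wrong input,
# while iso-8859-1 maps every possible byte and therefore never fails.
CANDIDATES = ('utf-8', 'gb2312', 'windows-1251', 'windows-1252', 'iso-8859-1')

def decode_best_effort(raw, candidates=CANDIDATES):
    """Return the first successful strict decode of the byte string raw."""
    for enc in candidates:
        try:
            return raw.decode(enc)
        except (UnicodeDecodeError, LookupError):
            continue
    # Last resort: keep every byte, replacing undecodable ones with U+FFFD.
    return raw.decode('utf-8', 'replace')
```

This fixes the repetition, but not the underlying guessing problem: a wrong-but-permissive codec can still "succeed" and produce mojibake, which is why detection (below in the accepted answer) is still needed.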

Why have I decided to scrape the entire HTML rather than run it through BeautifulSoup right away? Because I want to do the soupifying later, to avoid a DeadlineExceededError in GAE.

Have I read the excellent articles on Unicode and how it should be done? Yes. However, I have failed to find a solution that does not assume the incoming encoding is known, which mine isn't, since I'm scraping different sites every day.

I had the same problem some time ago and there is nothing 100% accurate. What I did was:

  • Get the encoding from the Content-Type header
  • Get the encoding from the meta tags
  • Detect the encoding with the chardet Python module
  • Decode the text from the most common encoding found to Unicode
  • Process the text/html
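The steps above could be sketched roughly as follows. The function `guess_encoding`, its parameters, and the regexes are illustrative assumptions, not code from the answer; chardet is consulted only if it is installed:

```python
import re

def guess_encoding(content_type, body):
    """Guess the encoding of an HTML byte string: header charset first,
    then meta tags, then chardet if available, then a common fallback."""
    # 1. Content-Type header, e.g. "text/html; charset=utf-8"
    if content_type:
        m = re.search(r'charset=([\w-]+)', content_type, re.I)
        if m:
            return m.group(1).lower()
    # 2. <meta> tags in the first few KB of the document; the pattern
    # matches both <meta charset="..."> and the older http-equiv form.
    m = re.search(br'<meta[^>]+charset=["\']?([\w-]+)', body[:4096], re.I)
    if m:
        return m.group(1).decode('ascii').lower()
    # 3. chardet, if installed
    try:
        import chardet
        guess = chardet.detect(body).get('encoding')
        if guess:
            return guess.lower()
    except ImportError:
        pass
    # 4. Fall back to the most common web encoding
    return 'utf-8'
```

The result would then be used as `result.content.decode(guess_encoding(...), 'replace')`, with 'replace' guarding against a wrong guess.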
