Python - How to deal with unknown encoding when scraping webpages?

August 15, 2013

This question already has an answer here:

Determine the encoding of text in Python (8 answers)

I'm scraping news articles from various sites, using GAE and Python.

The code, which scrapes one article URL at a time, leads to the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 8858: ordinal not in range(128)

Here's the code in its simplest form:

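    from google.appengine.api import urlfetch

    def fetch(url):
        headers = {'user-agent': "chrome/11.0.696.16"}
        # Pass headers by keyword: the second positional argument of
        # urlfetch.fetch is the request payload, not the headers.
        result = urlfetch.fetch(url, headers=headers)
        if result.status_code == 200:
            return result.content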

Here is a variant I have tried, with the same result:

    def fetch(url):
        headers = {'user-agent': "chrome/11.0.696.16"}
        result = urlfetch.fetch(url, headers=headers)
        if result.status_code == 200:
            s = result.content
            s = s.decode('utf-8')    # assumes the page really is UTF-8
            s = s.encode('utf-8')
            s = unicode(s, 'utf-8')
            return s

Here's an ugly, brittle one, which doesn't work either:

    def fetch(url):
        headers = {'user-agent': "chrome/11.0.696.16"}
        result = urlfetch.fetch(url, headers=headers)
        if result.status_code == 200:
            s = result.content

            try:
                s = s.decode('iso-8859-1')
            except:
                pass
            try:
                s = s.decode('ascii')
            except:
                pass
            try:
                s = s.decode('gb2312')
            except:
                pass
            try:
                s = s.decode('windows-1251')
            except:
                pass
            try:
                s = s.decode('windows-1252')
            except:
                s = "did not work"

            s = s.encode('utf-8')
            s = unicode(s, 'utf-8')
            return s

The last variant returns s as the string "did not work", set in the last except.

So, am I going to have to expand this clumsy try/except construction to encompass all possible encodings (will that even work?), or is there an easier way?

Why have I decided to scrape the entire HTML rather than use BeautifulSoup? Because I want to do the soupifying later, to avoid a DeadlineExceededError in GAE.

Have I read all the excellent articles about Unicode and how it should be done? Yes. However, I have failed to find a solution that does not assume I know the incoming encoding, which I don't, since I'm scraping different sites every day.

I had the same problem some time ago, and there is nothing that is 100% accurate. What I did was:

Get the encoding from the Content-Type header
Get the encoding from the meta tags
Detect the encoding with the chardet Python module
Decode the text from the most likely encoding to Unicode
Process the text/HTML
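
The steps above might be sketched roughly as follows. This is only an illustration under stated assumptions, not the answerer's actual code: the names decode_html and candidate_encodings, the two regular expressions, the 4096-byte limit on the meta-tag search, and the final utf-8/windows-1252 fallbacks are all made up for this example. chardet is a third-party package, so the sketch treats it as optional. The code is written to run on Python 3; the question itself uses Python 2, where the same decode and chardet.detect calls exist.

```python
import re

try:
    import chardet  # third-party; this sketch skips detection if it is missing
except ImportError:
    chardet = None

# Hypothetical patterns for this sketch: charset=... in a header value,
# and charset=... inside a <meta> tag in the raw bytes.
HEADER_CHARSET_RE = re.compile(
    r'charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)', re.IGNORECASE)
META_CHARSET_RE = re.compile(
    br'<meta[^>]+charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)', re.IGNORECASE)


def candidate_encodings(raw, content_type=None):
    """List encoding names to try, most trusted first."""
    names = []
    # 1. Encoding declared in the Content-Type header, if any.
    if content_type:
        match = HEADER_CHARSET_RE.search(content_type)
        if match:
            names.append(match.group(1))
    # 2. Encoding declared in a <meta> tag near the top of the page.
    match = META_CHARSET_RE.search(raw[:4096])
    if match:
        names.append(match.group(1).decode('ascii'))
    # 3. Statistical guess from chardet, when it is installed.
    if chardet is not None:
        guess = chardet.detect(raw).get('encoding')
        if guess:
            names.append(guess)
    # 4. Common fallbacks (an assumption; adjust for the sites you scrape).
    names.extend(['utf-8', 'windows-1252'])
    return names


def decode_html(raw, content_type=None):
    """Return (text, encoding_name) using the first candidate that fits."""
    for name in candidate_encodings(raw, content_type):
        try:
            return raw.decode(name), name
        except (LookupError, UnicodeDecodeError):
            continue  # unknown codec name or bytes that do not fit
    # Nothing fit: keep what can be decoded and mark the rest.
    return raw.decode('utf-8', 'replace'), 'utf-8'
```

In the question's fetch function this would presumably be called with result.content and the Content-Type value from the response headers, and the returned text handed to BeautifulSoup later. Each candidate is tried on the raw bytes, never on an already decoded string, which is where the chained try/except version in the question goes wrong: iso-8859-1 accepts any byte sequence, so the first decode always succeeds and every later decode is applied to text rather than bytes.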
