Error while downloading images from Wikipedia via Python script
I am trying to download the images of a particular Wikipedia page. Here is my code snippet:
from bs4 import BeautifulSoup as bs
import urllib2
import urlparse
from urllib import urlretrieve

site = "http://en.wikipedia.org/wiki/Pune"
hdr = {'User-Agent': 'Mozilla/5.0'}
outpath = ""
req = urllib2.Request(site, headers=hdr)
page = urllib2.urlopen(req)
soup = bs(page)
tag_image = soup.findAll("img")
for image in tag_image:
    print "Image: %(src)s" % image
    urlretrieve(image["src"], "/home/mayank/Desktop/test")
After running the program, I see the following error stack:
Image: //upload.wikimedia.org/wikipedia/commons/thumb/0/04/pune_montage.jpg/250px-pune_montage.jpg
Traceback (most recent call last):
  File "download_images.py", line 15, in <module>
    urlretrieve(image["src"], "/home/mayank/Desktop/test")
  File "/usr/lib/python2.7/urllib.py", line 93, in urlretrieve
    return _urlopener.retrieve(url, filename, reporthook, data)
  File "/usr/lib/python2.7/urllib.py", line 239, in retrieve
    fp = self.open(url, data)
  File "/usr/lib/python2.7/urllib.py", line 207, in open
    return getattr(self, name)(url)
  File "/usr/lib/python2.7/urllib.py", line 460, in open_file
    return self.open_ftp(url)
  File "/usr/lib/python2.7/urllib.py", line 543, in open_ftp
    ftpwrapper(user, passwd, host, port, dirs)
  File "/usr/lib/python2.7/urllib.py", line 864, in __init__
    self.init()
  File "/usr/lib/python2.7/urllib.py", line 870, in init
    self.ftp.connect(self.host, self.port, self.timeout)
  File "/usr/lib/python2.7/ftplib.py", line 132, in connect
    self.sock = socket.create_connection((self.host, self.port), self.timeout)
  File "/usr/lib/python2.7/socket.py", line 571, in create_connection
    raise err
IOError: [Errno ftp error] [Errno 111] Connection refused
Can anyone help with what is causing this error?
// is shorthand for the current protocol. It seems Wikipedia is using this shorthand, so you have to explicitly specify HTTP instead of FTP (which Python is assuming for some reason):
for image in tag_image:
    src = 'http:' + image["src"]
    urlretrieve(src, "/home/mayank/Desktop/test")
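For what it's worth, you can confirm that the scheme really is missing. Here is a minimal check, assuming only Python 2's standard urlparse module, with the URL taken from the traceback above:

from urlparse import urlparse

url = "//upload.wikimedia.org/wikipedia/commons/thumb/0/04/pune_montage.jpg/250px-pune_montage.jpg"
# A scheme-relative URL parses with an empty scheme, so urllib cannot tell
# which protocol to use and (as the traceback shows) falls back through
# open_file to open_ftp.
print "scheme: %r" % urlparse(url).scheme   # prints: scheme: ''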
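Putting it together, a sketch of the whole corrected script might look like the following. It is untested against Wikipedia's current markup; urlparse.urljoin resolves scheme-relative (and any relative) src values against the page URL instead of hard-coding the prefix, and each image is saved under its own basename, since the original code overwrote the same file "test" on every iteration. The output directory is the asker's path and is assumed to exist:

from bs4 import BeautifulSoup as bs
import urllib2
import urlparse
from urllib import urlretrieve
import os

site = "http://en.wikipedia.org/wiki/Pune"
hdr = {'User-Agent': 'Mozilla/5.0'}
outpath = "/home/mayank/Desktop"

req = urllib2.Request(site, headers=hdr)
page = urllib2.urlopen(req)
soup = bs(page)

for image in soup.findAll("img"):
    if not image.get("src"):
        continue
    # urljoin turns "//upload.wikimedia.org/..." into an absolute
    # "http://upload.wikimedia.org/..." URL using the page's scheme
    src = urlparse.urljoin(site, image["src"])
    print "Image: %s" % src
    # save each image under its own file name instead of overwriting "test"
    filename = os.path.basename(urlparse.urlsplit(src).path)
    urlretrieve(src, os.path.join(outpath, filename))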