Error while downloading images from Wikipedia via Python script
I am trying to download the images of a particular Wikipedia page. Here is my code snippet:
from bs4 import BeautifulSoup as bs
import urllib2
import urlparse
from urllib import urlretrieve

site = "http://en.wikipedia.org/wiki/Pune"
hdr = {'User-Agent': 'Mozilla/5.0'}
outpath = ""
req = urllib2.Request(site, headers=hdr)
page = urllib2.urlopen(req)
soup = bs(page)
tag_image = soup.findAll("img")
for image in tag_image:
    print "Image: %(src)s" % image
    urlretrieve(image["src"], "/home/mayank/Desktop/test")
After running the program, I see the following error stack:
Image: //upload.wikimedia.org/wikipedia/commons/thumb/0/04/pune_montage.jpg/250px-pune_montage.jpg
Traceback (most recent call last):
  File "download_images.py", line 15, in <module>
    urlretrieve(image["src"], "/home/mayank/Desktop/test")
  File "/usr/lib/python2.7/urllib.py", line 93, in urlretrieve
    return _urlopener.retrieve(url, filename, reporthook, data)
  File "/usr/lib/python2.7/urllib.py", line 239, in retrieve
    fp = self.open(url, data)
  File "/usr/lib/python2.7/urllib.py", line 207, in open
    return getattr(self, name)(url)
  File "/usr/lib/python2.7/urllib.py", line 460, in open_file
    return self.open_ftp(url)
  File "/usr/lib/python2.7/urllib.py", line 543, in open_ftp
    ftpwrapper(user, passwd, host, port, dirs)
  File "/usr/lib/python2.7/urllib.py", line 864, in __init__
    self.init()
  File "/usr/lib/python2.7/urllib.py", line 870, in init
    self.ftp.connect(self.host, self.port, self.timeout)
  File "/usr/lib/python2.7/ftplib.py", line 132, in connect
    self.sock = socket.create_connection((self.host, self.port), self.timeout)
  File "/usr/lib/python2.7/socket.py", line 571, in create_connection
    raise err
IOError: [Errno ftp error] [Errno 111] Connection refused
Can anyone help with what is causing this error?
// is shorthand for the current protocol. It seems Wikipedia is using this shorthand, so you have to explicitly specify HTTP instead of FTP (which Python is assuming for some reason):
for image in tag_image:
    src = 'http:' + image["src"]
    urlretrieve(src, "/home/mayank/Desktop/test")
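For what it's worth, you can confirm that the scheme really is missing. Here is a minimal check, assuming only Python 2's standard urlparse module, with the URL taken from the traceback above:

from urlparse import urlparse

url = "//upload.wikimedia.org/wikipedia/commons/thumb/0/04/pune_montage.jpg/250px-pune_montage.jpg"
# A scheme-relative URL parses with an empty scheme, so urllib cannot tell
# which protocol to use and (as the traceback shows) falls back through
# open_file to open_ftp.
print "scheme: %r" % urlparse(url).scheme   # prints: scheme: ''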
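Putting it together, a sketch of the whole corrected script might look like the following. It is untested against Wikipedia's current markup; urlparse.urljoin resolves scheme-relative (and any relative) src values against the page URL instead of hard-coding the prefix, and each image is saved under its own basename, since the original code overwrote the same file "test" on every iteration. The output directory is the asker's path and is assumed to exist:

from bs4 import BeautifulSoup as bs
import urllib2
import urlparse
from urllib import urlretrieve
import os

site = "http://en.wikipedia.org/wiki/Pune"
hdr = {'User-Agent': 'Mozilla/5.0'}
outpath = "/home/mayank/Desktop"

req = urllib2.Request(site, headers=hdr)
page = urllib2.urlopen(req)
soup = bs(page)

for image in soup.findAll("img"):
    if not image.get("src"):
        continue
    # urljoin turns "//upload.wikimedia.org/..." into an absolute
    # "http://upload.wikimedia.org/..." URL using the page's scheme
    src = urlparse.urljoin(site, image["src"])
    print "Image: %s" % src
    # save each image under its own file name instead of overwriting "test"
    filename = os.path.basename(urlparse.urlsplit(src).path)
    urlretrieve(src, os.path.join(outpath, filename))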