python - urllib2.HTTPError while web scraping a huge list


The web page has a huge list of journal names and other details. I am trying to scrape the table content into a dataframe.

    #http://www.citefactor.org/journal-impact-factor-list-2015.html
    import bs4 as bs
    import urllib
    #using python 2.7
    import pandas as pd

    dfs = pd.read_html('http://www.citefactor.org/journal-impact-factor-list-2015.html/', header=0)
    for df in dfs:
        print(df)
        df.to_csv('citefactor_list.csv', header=True)

But I am getting the following error. I tried referring to previously raised questions but could not fix it.

Error:

    Traceback (most recent call last):
      File "scrape_impact_factor.py", line 7, in <module>
        dfs = pd.read_html('http://www.citefactor.org/journal-impact-factor-list-2015.html/', header=0)
      File "/usr/local/lib/python2.7/dist-packages/pandas/io/html.py", line 896, in read_html
        keep_default_na=keep_default_na)
      File "/usr/local/lib/python2.7/dist-packages/pandas/io/html.py", line 733, in _parse
        raise_with_traceback(retained)
      File "/usr/local/lib/python2.7/dist-packages/pandas/io/html.py", line 727, in _parse
        tables = p.parse_tables()
      File "/usr/local/lib/python2.7/dist-packages/pandas/io/html.py", line 196, in parse_tables
        tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
      File "/usr/local/lib/python2.7/dist-packages/pandas/io/html.py", line 450, in _build_doc
        return BeautifulSoup(self._setup_build_doc(), features='html5lib',
      File "/usr/local/lib/python2.7/dist-packages/pandas/io/html.py", line 443, in _setup_build_doc
        raw_text = _read(self.io)
      File "/usr/local/lib/python2.7/dist-packages/pandas/io/html.py", line 130, in _read
        with urlopen(obj) as url:
      File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
        return self.gen.next()
      File "/usr/local/lib/python2.7/dist-packages/pandas/io/common.py", line 60, in urlopen
        with closing(_urlopen(*args, **kwargs)) as f:
      File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
        return _opener.open(url, data, timeout)
      File "/usr/lib/python2.7/urllib2.py", line 410, in open
        response = meth(req, response)
      File "/usr/lib/python2.7/urllib2.py", line 523, in http_response
        'http', request, response, code, msg, hdrs)
      File "/usr/lib/python2.7/urllib2.py", line 448, in error
        return self._call_chain(*args)
      File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
        result = func(*args)
      File "/usr/lib/python2.7/urllib2.py", line 531, in http_error_default
        raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
    urllib2.HTTPError: HTTP Error 500: Internal Server Error

A 500 Internal Server Error means something went wrong on the server, and is therefore out of your control.

However, the real problem is that you are using the wrong URL.

If you go to http://www.citefactor.org/journal-impact-factor-list-2015.html/ in a browser, you get a 404 Not Found error. Remove the trailing slash, i.e. use http://www.citefactor.org/journal-impact-factor-list-2015.html , and it should work.
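The fix described above can be sketched as follows. This is a minimal illustration, not the only way to do it: the helper names `strip_trailing_slash`, `read_tables` and `main` are made up for this example, and writing one CSV file per table (instead of overwriting a single file on every loop iteration, as the question's code does) is an assumption about what was intended. The `HTTPError` import is written so that it works on Python 2.7 as well as Python 3.

```python
try:
    from urllib.error import HTTPError  # Python 3
except ImportError:
    from urllib2 import HTTPError       # Python 2.7, as used in the question


def strip_trailing_slash(url):
    """Hypothetical helper: drop any trailing '/' from a URL."""
    return url.rstrip('/')


def read_tables(url):
    """Hypothetical helper: parse every HTML table at `url` into DataFrames."""
    import pandas as pd  # imported here so the helper above needs no third-party package
    return pd.read_html(strip_trailing_slash(url), header=0)


def main():
    # The URL from the question, including the trailing slash that caused the error.
    url = 'http://www.citefactor.org/journal-impact-factor-list-2015.html/'
    try:
        dfs = read_tables(url)
    except HTTPError as err:
        # err.code holds the HTTP status (500 in the traceback above).
        print('Request for %s failed with HTTP %d' % (strip_trailing_slash(url), err.code))
        raise
    for i, df in enumerate(dfs):
        # Assumption: one file per table, so later tables do not overwrite earlier ones.
        df.to_csv('citefactor_list_%d.csv' % i, index=False)
```

Calling `main()` performs the actual request. Whether the site still serves the page at that address is not something this sketch can guarantee.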

