I'm trying to scrape the table on the NYSE website (http://www1.nyse.com/about/listed/IPO_Index.html) into a pandas DataFrame. To do this, I have a setup like the following:
import requests
import pandas
from bs4 import BeautifulSoup

def htmltodf(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.text)
    tables = soup.findAll('table')
    test = pandas.io.html.read_html(str(tables))
    return test  # returns a list of DataFrame objects
However, when I run this against the page, every table returned in the list is empty. Digging further, I found that the table is generated by JavaScript. In my browser's developer tools the table looks like any other HTML table, tags and all, but the source view shows something like this instead:
<script language="JavaScript">
.
.
.
<script>
var year = [["ICC","21st Century Oncology Holdings, Inc.","22 May 2014","/about/listed/icc.html" ],
... more entries here ...
,["ZOES","Zoe's Kitchen, Inc.","11 Apr 2014","/about/listed/zoes.html" ]] ;
if(year.length != 0)
{
    document.write ("<table width='619' border='0' cellspacing='0' cellpadding='0'><tr><td><span class='fontbold'>");
    document.write ('2014' + " IPO Showcase");
    document.write ("</span></td></tr></table>");
}
</script>
So when my HTML parser goes looking for the table tag, all it finds is the if condition, with no proper tags underneath it pointing at the content. How can I scrape this table? Can I search for the script tag instead of the table that displays the content? And since the code isn't in traditional HTML table form, how do I read it into pandas, or do I have to parse the data manually?
In this case you need something that will run that JavaScript code for you.
One option is to use selenium:
from pandas.io.html import read_html
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://www1.nyse.com/about/listed/IPO_Index.html')

# grab the element wrapping the inner table inside the "sp5" div and take its HTML
table = driver.find_element_by_xpath('//div[@class="sp5"]/table//table/..')
table_html = table.get_attribute('innerHTML')

df = read_html(table_html)[0]
print df

driver.close()
Prints:
                                                    0       1          2   3
0                                                Name  Symbol        NaT NaN
1                      Performance Sports Group Ltd.     PSG 2014-06-20 NaN
2                          Century Communities, Inc.     CCS 2014-06-18 NaN
3                       Foresight Energy Partners LP    FELP 2014-06-18 NaN
...
79  EGShares TCW EM Long Term Investment Grade Bon...    LEMF 2014-01-08 NaN
80  EGShares TCW EM Short Term Investment Grade Bo...    SEMF 2014-01-08 NaN

[81 rows x 4 columns]
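If you would rather not drive a real browser at all, the data is already sitting in the page source as the var year = [[...]] array shown in the question, so the manual-parsing route you mention is also possible. Here is a rough sketch of that approach; the column names are only my guess from the sample rows, and the JSON parse assumes the array literal keeps using plain double-quoted strings:

import re
import json

import requests
from bs4 import BeautifulSoup
import pandas as pd

page = requests.get('http://www1.nyse.com/about/listed/IPO_Index.html')
soup = BeautifulSoup(page.text)

rows = None
for script in soup.find_all('script'):
    text = script.string or ''
    # cut the "var year = [[...]] ;" array literal out of the script text
    match = re.search(r'var\s+year\s*=\s*(\[\[.*?\]\])\s*;', text, re.DOTALL)
    if match:
        # the literal uses double-quoted strings, so it happens to parse as JSON
        rows = json.loads(match.group(1))
        break

if rows is not None:
    # column names are an assumption based on the sample rows above
    df = pd.DataFrame(rows, columns=['Symbol', 'Name', 'Date', 'Link'])
    print(df.head())

That avoids the selenium dependency, but it is more fragile: if the site changes the variable name or the way the array is written, the regex and the JSON parse will need adjusting.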