I am trying to scrape this site: http://data.eastmoney.com/xg/xg/
So far I have used Selenium to execute the JavaScript and scrape the table. However, my code only gets me the first page. I would like to know if there is a way to access the other 17 pages, because when I click the next page the URL does not change, so I cannot simply iterate over a different URL each time.
Here is my code so far:
```python
from selenium import webdriver
import lxml
from bs4 import BeautifulSoup
import time

def scrape():
    url = 'http://data.eastmoney.com/xg/xg/'
    d = {}
    f = open('east.txt', 'a')
    driver = webdriver.PhantomJS()
    driver.get(url)
    lst = [x for x in range(0, 25)]
    htmlsource = driver.page_source
    bs = BeautifulSoup(htmlsource)
    heading = bs.find_all('thead')[0]
    hlist = []
    for header in heading.find_all('tr'):
        head = header.find_all('th')
        for i in lst:
            if i != 2:
                hlist.append(head[i].get_text().strip())
    h = '|'.join(hlist)
    print h
    table = bs.find_all('tbody')[0]
    for row in table.find_all('tr'):
        cells = row.find_all('td')
        d[cells[0].get_text()] = [y.get_text() for y in cells]
    for key in d:
        ret = []
        for i in lst:
            if i != 2:
                ret.append(d.get(key)[i])
        s = '|'.join(ret)
        print s

if __name__ == "__main__":
    scrape()
```
Or could I use webdriver.Chrome() instead of PhantomJS to click "next" in a real browser, and then run my Python code against each new page after the click?
This is not a trivial page to interact with: it requires using Explicit Waits to wait for the "loading" indicators to become invisible.
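Conceptually, an Explicit Wait is just a poll-until-truthy-or-timeout loop. The minimal stand-in below illustrates that mechanism without a browser; `wait_until`, `TimeoutError_`, and the simulated indicator are my own names for illustration, not part of Selenium's API.

```python
# Illustrative only: a WebDriverWait boils down to repeatedly evaluating
# a condition until it returns a truthy value, or raising on timeout.
# These names (wait_until, TimeoutError_) are NOT Selenium's API.
import time

class TimeoutError_(Exception):
    pass

def wait_until(condition, timeout=10, poll=0.5):
    """Poll `condition` until it returns truthy, or raise after `timeout` seconds."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(poll)
    raise TimeoutError_("condition not met within %s seconds" % timeout)

# Simulate a loading indicator that disappears on the third poll.
state = {"polls": 0}
def loading_gone():
    state["polls"] += 1
    return state["polls"] >= 3

print(wait_until(loading_gone, timeout=5, poll=0.01))  # -> True
```

In the real answer below, `WebDriverWait(driver, 10).until(EC.invisibility_of_element_located(...))` plays exactly this role, with the condition being "the loading indicator is no longer visible."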
Here is a complete and working implementation that you can use as a starting point:
```python
# -*- coding: utf-8 -*-
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
import time

url = "http://data.eastmoney.com/xg/xg/"
driver = webdriver.PhantomJS()
driver.get(url)

def get_table_results(driver):
    for row in driver.find_elements_by_css_selector("table#dt_1 tr[class]"):
        print [cell.text for cell in row.find_elements_by_tag_name("td")]

# initial wait for results
WebDriverWait(driver, 10).until(
    EC.invisibility_of_element_located((By.XPATH, u"//th[. = '加载中......']")))

while True:
    # print current page number
    page_number = driver.find_element_by_id("gopage").get_attribute("value")
    print "Page #" + page_number

    get_table_results(driver)

    next_link = driver.find_element_by_link_text("下一页")
    if "nolink" in next_link.get_attribute("class"):
        break

    next_link.click()
    time.sleep(2)  # TODO: fix?

    # wait for results to load
    WebDriverWait(driver, 10).until(
        EC.invisibility_of_element_located((By.XPATH, u"//img[contains(@src, 'loading')]")))

    print "------"
```
The idea is to have an endless loop which we exit only when the "next page" (下一页) link becomes disabled (no more pages available). On every iteration, get the table results (printed on the console for the sake of an example), click the next link, and wait for the invisibility of the "loading" spinning circle that appears on top of the grid.
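The loop-exit logic itself can be checked without a browser. The sketch below replays the same break-on-"nolink" pattern against a fake pager object; `FakePager`, its methods, and the sample rows are all inventions for illustration, not the site's actual behavior.

```python
# A browser-free sketch of the pagination loop above. Only the
# break-on-"nolink" control flow mirrors the Selenium code; FakePager
# and its fields are hypothetical stand-ins for the driver and page.
class FakePager(object):
    def __init__(self, pages):
        self.pages = pages      # list of pages, each a list of rows
        self.index = 0          # current page index

    def rows(self):
        return self.pages[self.index]

    def next_link_class(self):
        # On the last page the site marks the link with the "nolink" class.
        return "nolink" if self.index == len(self.pages) - 1 else "pagerlink"

    def click_next(self):
        self.index += 1

def scrape_all(pager):
    results = []
    while True:
        results.extend(pager.rows())            # get_table_results(driver)
        if "nolink" in pager.next_link_class(): # next link disabled -> done
            break
        pager.click_next()                      # next_link.click()
    return results

pager = FakePager([["row1", "row2"], ["row3"], ["row4"]])
print(scrape_all(pager))  # -> ['row1', 'row2', 'row3', 'row4']
```

Note that the `time.sleep(2)` in the real code is a crutch; if it proves flaky, one option worth exploring is waiting for staleness of a row from the previous page before reading the new one.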