I am trying to run headless Chrome with Selenium to scrape content from the web. I downloaded ChromeDriver with wget and unzipped it into the current folder.
    !wget "http://chromedriver.storage.googleapis.com/2.25/chromedriver_linux64.zip"
    !unzip chromedriver_linux64.zip
Now, when I load the driver
    from selenium import webdriver  # needed for webdriver.Chrome below
    from selenium.webdriver.chrome.options import Options
    import os

    # instantiate a chrome options object so you can set the size and headless preference
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    # Chrome expects the size as "width,height"
    chrome_options.add_argument("--window-size=1920,1080")

    # path to the chromedriver binary unzipped into the current folder
    chrome_driver = os.getcwd() + "/chromedriver"
    driver = webdriver.Chrome(chrome_options=chrome_options, executable_path=chrome_driver)
I get an error
    WebDriverException                        Traceback (most recent call last)
    <ipython-input-67-0aeae0cfd891> in <module>()
    ----> 1 driver = webdriver.Chrome(chrome_options=chrome_options, executable_path=chrome_driver)
          2 driver.get("https://www.google.com")
          3 lucky_button = driver.find_element_by_css_selector("[name=btnI]")
          4 lucky_button.click()
          5

    /usr/local/lib/python3.6/dist-packages/selenium/webdriver/chrome/webdriver.py in __init__(self, executable_path, port, chrome_options, service_args, desired_capabilities, service_log_path)
         60                 service_args=service_args,
         61                 log_path=service_log_path)
    ---> 62             self.service.start()
         63
         64         try:

    /usr/local/lib/python3.6/dist-packages/selenium/webdriver/common/service.py in start(self)
         84         count = 0
         85         while True:
    ---> 86             self.assert_process_still_running()
         87             if self.is_connectable():
         88                 break

    /usr/local/lib/python3.6/dist-packages/selenium/webdriver/common/service.py in assert_process_still_running(self)
         97             raise WebDriverException(
         98                 'Service %s unexpectedly exited. Status code was: %s'
    ---> 99                 % (self.path, return_code)
        100             )
        101

    WebDriverException: Message: Service /content/chromedriver unexpectedly exited. Status code was: -6
So, after some research, I tried another approach
    !apt install chromium-chromedriver

    # import the webdriver submodule explicitly; "import selenium" alone does not load it
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument('headless')
    driver = webdriver.Chrome(chrome_options=options)
On Google Colab, this gave me the same error again
    WebDriverException: Message: Service chromedriver unexpectedly exited. Status code was: -6
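As a side note on reading that message: the status code appears to be the return code of the chromedriver subprocess, and Python reports a negative return code when a child process was terminated by a signal. The sketch below decodes such a value; `describe_exit_status` is a hypothetical helper name, not part of Selenium, and the interpretation assumes a POSIX system such as the Colab VM, where signal 6 is SIGABRT.

```python
import signal


def describe_exit_status(status):
    """Turn a subprocess return code into a readable description.

    Hypothetical helper for illustration; assumes POSIX semantics, where a
    negative return code -N means the process was killed by signal N.
    """
    if status < 0:
        return "killed by signal " + signal.Signals(-status).name
    return "exited with code " + str(status)


# Decode the status code from the error message above.
print(describe_exit_status(-6))
```

An abort like this usually means the driver binary crashed at startup rather than that the Python code is wrong, which is consistent with the fix being a different driver installation.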
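I have found the answer to why I was getting the error. Install chromium-chromedriver and make sure the driver is on your PATH, for example by copying it into a bin directory.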
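A quick way to confirm that the driver is actually reachable is to look it up on PATH from Python before starting Selenium. This is a minimal sketch; `find_chromedriver` is a hypothetical helper name, and it only checks that an executable with that name can be found, not that it matches the installed browser version.

```python
import shutil


def find_chromedriver(name="chromedriver"):
    """Return the full path of the driver if it is on PATH, else None."""
    return shutil.which(name)


path = find_chromedriver()
if path:
    print("chromedriver found at", path)
else:
    print("chromedriver is not on PATH")
```

If this prints the "not on PATH" message, Selenium will not be able to locate the driver by name either.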
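Here is a complete solution for scraping data with Selenium on Colab. There is another way using PhantomJS, but Selenium has deprecated that API, and hopefully it will be removed in the next Selenium update.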
    # install chromium, its driver, and selenium
    !apt-get update
    !apt install chromium-chromedriver
    # copy the driver into a directory that is on PATH
    !cp /usr/lib/chromium-browser/chromedriver /usr/bin
    !pip install selenium

    # set options to be headless, ..
    from selenium import webdriver
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')

    # open it, go to a website, and get results
    # the driver is found on PATH, so no path argument is passed
    # (recent Selenium releases no longer accept it positionally)
    wd = webdriver.Chrome(options=options)
    wd.get("https://www.website.com")
    print(wd.page_source)  # results
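When scraping several pages in one notebook session, it can help to wrap the same setup in a function that always shuts the browser down. The sketch below is one possible arrangement, assuming the install steps above have already been run; `HEADLESS_FLAGS`, `make_driver`, and `fetch_page_source` are hypothetical names introduced for illustration.

```python
# Flags taken from the solution above.
HEADLESS_FLAGS = ["--headless", "--no-sandbox", "--disable-dev-shm-usage"]


def make_driver(flags=HEADLESS_FLAGS):
    """Create a headless Chrome driver; assumes chromedriver is on PATH."""
    # Imported here so the module loads even where selenium is not installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for flag in flags:
        options.add_argument(flag)
    return webdriver.Chrome(options=options)


def fetch_page_source(url):
    """Load a URL and return its HTML, closing the browser afterwards."""
    wd = make_driver()
    try:
        wd.get(url)
        return wd.page_source
    finally:
        # quit() ends the browser and the chromedriver process, so repeated
        # runs do not leave stray processes behind in the Colab VM.
        wd.quit()
```

A call such as `html = fetch_page_source("https://www.website.com")` would then replace the last three lines of the cell above.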
This works for anyone who wants to scrape data on Google Colab rather than on a local machine. Run the steps above one after another, in the same order.
You can find the notebook at https://colab.research.google.com/drive/1GFJKhpOju_WLAgiVPCzCGTBVGMkyAjtk.