Python urllib2.urlopen() is slow; need a better way to read multiple URLs

小编典典

python

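As the title suggests, I'm working on a website written in Python that makes several calls to the urllib2 module to read other websites. I then parse the results with BeautifulSoup.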

Because I have to read 5-10 sites, the page takes a while to load.

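Is there a way to read all the sites at the same time? Or are there any tricks to make it faster, for example should I close urllib2.urlopen after each read, or keep it open?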

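Added: also, if I simply switched to PHP, would fetching and parsing HTML and XML files from other sites be faster? I just want the page to load faster than the roughly 20 seconds it takes at the moment.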

2020-12-20

1 answer

小编典典

I'm rewriting Dumb Guy's code below using modern Python modules such as threading and Queue.

import threading, urllib2
import Queue

urls_to_load = [
'http://stackoverflow.com/',
'http://slashdot.org/',
'http://www.archive.org/',
'http://www.yahoo.co.jp/',
]

def read_url(url, queue):
    # Download one page and put its body on the shared, thread-safe queue.
    data = urllib2.urlopen(url).read()
    print('Fetched %s from %s' % (len(data), url))
    queue.put(data)

def fetch_parallel():
    # Start one thread per URL so the downloads overlap, then wait for all of them.
    result = Queue.Queue()
    threads = [threading.Thread(target=read_url, args=(url, result)) for url in urls_to_load]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result

def fetch_sequencial():
    # Baseline: download the URLs one after another.
    result = Queue.Queue()
    for url in urls_to_load:
        read_url(url, result)
    return result

The best time for fetch_sequencial() was 2 seconds. The best time for fetch_parallel() was 0.9 seconds.
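For readers on Python 3, where urllib2 became urllib.request and Queue became queue, the same comparison can be sketched with concurrent.futures.ThreadPoolExecutor. This is only a sketch under assumptions, not the code that produced the timings above: the names fetch, fetch_all_sequential, fetch_all_parallel and timed are hypothetical, the fetcher parameter exists only so the network call can be swapped out, and the 10-second timeout and the limit of 8 workers are arbitrary choices.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch(url, timeout=10):
    """Download one URL and return its body as bytes (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all_sequential(urls, fetcher=fetch):
    """Baseline: fetch the URLs one after another, in input order."""
    return [fetcher(url) for url in urls]


def fetch_all_parallel(urls, fetcher=fetch, max_workers=8):
    """Fetch the URLs in worker threads; results come back in input order."""
    urls = list(urls)
    if not urls:
        return []
    # Assumption: at most 8 threads; never create more threads than URLs.
    workers = min(max_workers, len(urls))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the order of `urls`, not completion order.
        return list(pool.map(fetcher, urls))


def timed(func, *args, **kwargs):
    """Call func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start
```

Calling timed(fetch_all_sequential, urls_to_load) and timed(fetch_all_parallel, urls_to_load) would reproduce the comparison; the actual numbers depend on the network and on the sites being fetched.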

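Also, it is not true that threads are useless in Python because of the GIL. This is one of the cases where threads are useful in Python, because the threads are blocked on I/O. As you can see from my results, the parallel case is about 2 times faster.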
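The point about blocking can be checked without any network access. The sketch below uses time.sleep as a stand-in for a slow download. That is an assumption: a real urlopen call waits on a socket rather than a timer, but in both cases the waiting thread releases the GIL so other threads can run. The names fake_download, run_sequential and run_threaded are hypothetical.

```python
import threading
import time


def fake_download(delay, results, index):
    """Pretend to download something: block for `delay` seconds, then store a result."""
    time.sleep(delay)  # stand-in for blocking I/O; the GIL is released while sleeping
    results[index] = delay


def run_sequential(delays):
    """Run the fake downloads one after another; return (results, elapsed_seconds)."""
    results = [None] * len(delays)
    start = time.perf_counter()
    for index, delay in enumerate(delays):
        fake_download(delay, results, index)
    return results, time.perf_counter() - start


def run_threaded(delays):
    """Run each fake download in its own thread; return (results, elapsed_seconds)."""
    results = [None] * len(delays)
    threads = [
        threading.Thread(target=fake_download, args=(delay, results, index))
        for index, delay in enumerate(delays)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every thread has stored its result
    return results, time.perf_counter() - start
```

With several equal delays, run_sequential takes about the sum of the delays, while run_threaded takes about as long as the single longest delay, because the waits overlap. CPU-bound work would not show the same gain under the GIL.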

2020-12-20