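Crawl Performance Analysis of the Requests Library

嵩天, posted on 2020-02-02

Although the Requests library is friendly and easy to develop with (apart from the import, a single main line of code is enough), its performance still falls somewhat short of dedicated crawlers. Write a small program, pick "any" URL, and measure how long it takes to fetch the page successfully 100 times. (Some sites block the IP of clients that crawl pages continuously, so avoid such sites.) Reply with your code, the URL, and the running time on your own machine.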
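One way to make the "100 successful fetches" requirement explicit is to count only the fetches that succeed. The following is a minimal sketch under that reading; the helper names fetch_ok and time_successful_fetches and the max_attempts cap are assumptions made for illustration, not part of Requests or of the exercise itself.

```python
import time


def fetch_ok(url, timeout=30):
    """Return True if one GET of url completes without a 4xx/5xx status."""
    # Imported lazily so the timing helper below has no hard dependency.
    import requests

    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()
        return True
    except requests.RequestException:
        return False


def time_successful_fetches(fetch, target=100, max_attempts=300):
    """Call fetch() until it has succeeded `target` times.

    Returns (elapsed_seconds, attempts). The max_attempts cap is a
    hypothetical safeguard so a blocked site cannot loop forever.
    """
    successes = attempts = 0
    start = time.perf_counter()
    while successes < target and attempts < max_attempts:
        attempts += 1
        if fetch():
            successes += 1
    elapsed = time.perf_counter() - start
    if successes < target:
        raise RuntimeError(f"only {successes} of {target} fetches succeeded")
    return elapsed, attempts
```

On a real run one might call time_successful_fetches(lambda: fetch_ok("https://www.baidu.com")) and report the elapsed seconds together with the URL.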

459 replies

Reply 1

龍mooc205, posted on 2020-03-17

import requests
import time


def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raises HTTPError for 4xx/5xx status codes
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return "An exception occurred"


if __name__ == "__main__":
    url = "https://www.baidu.com"
    begin = time.time()
    for i in range(0, 100):
        getHTMLText(url)
    print("{:.2f}".format(time.time() - begin))

Output:

D:\Python题库\Scripts\python.exe D:/Python题库/网络爬取通用代码框架.py
9.68

Process finished with exit code 0


Reply 2

mooc1528519975754, posted on 2020-03-17

<code class="brush:python;toolbar:false" >import requests
import time

t1 = time.time()
for i in range(100):
    r = requests.get("https://www.baidu.com")
t2 = time.time()
t = t2 - t1
print(t)</code><img src="https://nos.netease.com/edu-image/0d658e51e5674e81956b4cc7a6d268e3.png" />

mooc1528519975754 commented on 2020-03-17

<code class="brush:python;toolbar:false" >import requests
import time

t1 = time.time()
for i in range(5):
    r = requests.get("https://lq.fyxfw.gov.cn/display.php?id=35139")  # fetch a page that has a visit counter
    t2 = time.time()
    t = t2 - t1
    print(t)  # cumulative time elapsed since t1
    # Fetching a page with a visit counter lets you verify that each crawl really succeeded.
    # My example is the 临泉县先锋网 site; each fetch took 3.4 seconds.</code>
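To report a per-request figure such as "3.4 seconds per fetch" directly, each call can be timed on its own instead of printing the running total. This is a minimal sketch; time_each_call and summarize are hypothetical helper names, and fetch stands for any zero-argument callable that performs one request.

```python
import statistics
import time


def time_each_call(fetch, n=5):
    """Return a list with the duration in seconds of each of n calls to fetch()."""
    durations = []
    for _ in range(n):
        start = time.perf_counter()
        fetch()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return (total, mean, slowest) for a non-empty list of durations."""
    return sum(durations), statistics.mean(durations), max(durations)
```

With Requests installed, fetch could be something like lambda: requests.get(url, timeout=30), where url is the page being measured.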


Reply 3

nono_226, posted on 2020-03-17

import requests
import time

# JD.com product pages
prefix_url = "https://item.jd.com/{}.html"
urls = []
num = 100000768781
# build 100 arbitrary pages from consecutive item numbers
for i in range(100):
    urls.append(prefix_url.format(num))
    num += 1

# start the performance timer
start_time = time.time()
for url in urls:
    r = requests.get(url)
end_time = time.time()
delta_time = end_time - start_time
# print the elapsed time
print(delta_time)

Output:

D:\Python\python.exe E:/Code/python/web_crawler/test.py
36.75410223007202

Process finished with exit code 0


Reply 4

FishXxxxx, posted on 2020-03-17

<img src="https://nos.netease.com/edu-image/34a07db1d4444c378c217e6c36380410.png" />I crawled the 中国大学MOOC (China University MOOC) national quality courses page; it took 17.84 seconds.


Reply 5

CharltonZL, posted on 2020-03-17

import requests
import time

t1 = time.time()
for i in range(100):
    r = requests.get("https://www.baidu.com")
t2 = time.time()
t = t2 - t1
print(t)


Reply 6

小洪didi, posted on 2020-03-17

import requests
import time


def getHtmlText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raises HTTPError for 4xx/5xx status codes
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ''


if __name__ == "__main__":
    url = 'https://baidu.com'
    start = time.perf_counter()
    for i in range(100):
        getHtmlText(url)
    end = time.perf_counter()
    dur = end - start
    print(f'{dur = :.2f}')

url = 'https://baidu.com'; running time: dur = 9.78


Reply 7

201曹杨, posted on 2020-03-17

Reply 8

ryfecho163com, posted on 2020-03-17

import requests
import time


def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return "An exception occurred"


if __name__ == "__main__":
    start = time.perf_counter()
    url = "https://www.qq.com"
    for i in range(100):
        r = requests.get(url)  # calls requests.get directly; getHTMLText is not used here
    end = time.perf_counter()
    print("{:.2f}".format(end - start))

Result: 39.11


Reply 9

媛媛ykt54340324227634056, posted on 2020-03-17

Reply 10

黎瑞Clare, posted on 2020-03-17

import requests
import time


def getHtmlText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raises HTTPError for 4xx/5xx status codes
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return 'Crawl failed!'


if __name__ == "__main__":
    url = 'https://m.bilibili.com/'
    start = time.perf_counter()
    for i in range(100):
        getHtmlText(url)
    end = time.perf_counter()
    dur = end - start
    print(f'{dur = :.2f}')

url = 'https://m.bilibili.com/'; running time: 199.30


Reply 11

物流151-崔莹04, posted on 2020-03-17

Reply 12

细雨青铜, posted on 2020-03-17

# coding=gbk
from time import perf_counter

import requests


def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raises HTTPError for 4xx/5xx status codes
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return "An exception occurred"


if __name__ == "__main__":
    url = "https://www.danda.com.cn/index/index/page_title/cate_id/3/sub_style/0.html?device=pc&renqun_youhua=165944"
    start = perf_counter()
    for i in range(100):
        print(getHTMLText(url))
    # report the total once, after all 100 fetches
    print("Time to crawl the page 100 times: {}s".format(perf_counter() - start))


Reply 13

大方的头, posted on 2020-03-17

<code class="brush:python;toolbar:false" >import requests
import time

t = time.perf_counter()
for i in range(100):
    r = requests.get("https://www.baidu.com", timeout=10)
duri = time.perf_counter() - t
print(duri)</code>

Result: 16.93397787


Reply 14

太原理工大学软件1825班姬靖宇, posted on 2020-03-17

import requests
import time


def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return "Error returned"


if __name__ == "__main__":
    start = time.perf_counter()
    url = "https://www.baidu.com"
    for i in range(100):
        getHTMLText(url)
    dur = time.perf_counter() - start
    print("{:.2f}".format(dur))

Result: 4.57


Reply 15

Yuzx, posted on 2020-03-17

<code class="brush:python;toolbar:false" >import requests as rq
import time as t


def gethtml(url):
    try:
        r = rq.get(url)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r
    except rq.RequestException:
        return 'Crawl failed'


if __name__ == '__main__':
    start = t.perf_counter()
    url = 'https://baidu.com'
    for i in range(100):
        gethtml(url)
    end = t.perf_counter()
    print('Time for 100 crawls: {:.2f} seconds'.format(end - start))</code>

Time for 100 crawls: 10.38 seconds

dilantor commented on 2020-03-25

This is very readable.

≮阿狸的笑只为桃子绽﹎ commented on 2020-04-02

Could I get your contact details? I feel like I'm not very good at this and would appreciate some guidance from an expert.

prideszh commented on 2020-04-18

What does the condition in the if statement mean?

金晓-S20020081 commented on 2020-04-18

It means the code under the if runs only when the file is executed directly as the main program, that is, when __name__ equals "__main__". It does not run when the file is imported as a module.
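The behaviour of that guard can be illustrated with a small sketch. The file name demo.py and the functions crawl_100_times and guard_passes are hypothetical and exist only for this illustration.

```python
def crawl_100_times():
    """Placeholder for the timing loop; returns a message instead of crawling."""
    return "timing loop would run here"


def guard_passes(module_name):
    """Return True when module_name is the value Python gives a directly run script."""
    return module_name == "__main__"


if __name__ == "__main__":
    # Reached with "python demo.py"; skipped with "import demo",
    # because an imported module's __name__ is its module name ("demo").
    print(crawl_100_times())
```

Importing the file from another script leaves crawl_100_times available for reuse without starting the timing run.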


Reply 16

星河百穿, posted on 2020-03-17

Reply 17

程澜午, posted on 2020-03-17

Reply 18

星河百穿, posted on 2020-03-17

<code class="brush:python;toolbar:false" >import requests as rq
import time as t


def gethtml(url):
    try:
        r = rq.get(url)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r
    except rq.RequestException:
        return 'Crawl failed'


if __name__ == '__main__':
    start = t.perf_counter()
    url = 'https://www.icourse163.org'
    for i in range(100):
        gethtml(url)
    end = t.perf_counter()
    print('Time for 100 crawls: {:.2f} seconds'.format(end - start))</code>

Time for 100 crawls: 94.74 seconds


Reply 19

MF1915043徐慈寒, posted on 2020-03-17

<img src="https://nos.netease.com/edu-image/50b9bb8fedcc452189132bf41d619013.png" />85.53 seconds


Reply 20

这个昵称太抢手了_换一个, posted on 2020-03-17

import requests
import time as t


def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raises HTTPError for 4xx/5xx status codes
        r.encoding = r.apparent_encoding  # use the encoding guessed from the content to decode the text
        return r.text
    except requests.RequestException:
        return "An exception occurred"


if __name__ == "__main__":
    url = "https://www.duba.com"
    start = t.perf_counter()
    for i in range(100):
        getHTMLText(url)
    end = t.perf_counter()
    print(f"Time taken to crawl the page 100 times: {end - start}")

Result: 12.387908399999999
