- Class Discussion Area
- Post Details
46
Replies
-
<p>Wow, this is the first time I've learned that requests itself affects crawler performance. I thought crawler performance was limited only by network speed and thread count QAQ. When I tested this myself, if the site itself was slow, a single requests call took slightly less time than one page refresh; crawling 100 times, plus the data processing and export afterwards, usually took about two minutes, <span style="text-decoration: line-through;" >and to reduce server load I would usually add a sleep anyway</span>.</p><p><br ></p><p>Teacher, a follow-up question on this: do the "professional crawlers" you mentioned use other libraries, or are they written directly in C?</p>
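On the "professional crawler" question: before switching languages, a large share of the gap usually comes from connection reuse and concurrency. A plain requests.get() opens a fresh TCP (and, for HTTPS, TLS) connection on every call, while requests.Session keeps connections alive and reuses them. The sketch below compares the two against a throwaway local server (the server exists only so the example does not depend on an external site); it is an illustration, not a benchmark, and the gap on a real HTTPS site is typically much larger because TLS handshakes are expensive.

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # keep-alive, so a Session can reuse the socket

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging

# Throwaway local server on a free port.
server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

N = 20

start = time.perf_counter()
for _ in range(N):
    requests.get(url, timeout=5)   # fresh connection every iteration
naive = time.perf_counter() - start

start = time.perf_counter()
with requests.Session() as s:
    for _ in range(N):
        s.get(url, timeout=5)      # connection pooled and reused
pooled = time.perf_counter() - start

server.shutdown()
print(f"plain requests.get: {naive:.3f}s  requests.Session: {pooled:.3f}s")
```

Beyond sessions, production crawlers typically add concurrency (threads, or async clients) so that waiting on one response does not block the others.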
-
<p>import requests</p><p>import time</p><p><br ></p><p>def pertime(url):</p><p>    try:</p><p>        r = requests.get(url, timeout=30)</p><p>        r.raise_for_status()</p><p>        r.encoding = r.apparent_encoding</p><p>        return r.text</p><p>    except Exception:</p><p>        print('request failed')</p><p><br ></p><p>if __name__ == "__main__":</p><p>    url = 'https://www.baidu.com'</p><p>    totaltime = 0</p><p>    for i in range(100):</p><p>        starttime = time.perf_counter()</p><p>        pertime(url)</p><p>        endtime = time.perf_counter()</p><p>        totaltime = totaltime + endtime - starttime</p><p>        print('Total time: {:.4f} s'.format(totaltime))</p><p><br ></p><p>Total time: 49.2819 s...</p>
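One subtlety worth flagging for everyone timing this exercise: `r.raise_for_status` written without parentheses merely references the bound method and never calls it, so 4xx/5xx responses slip through silently. A quick offline check, using a hand-built Response object as a shortcut for illustration (real code would get the Response from requests.get):

```python
import requests

resp = requests.models.Response()
resp.status_code = 404

resp.raise_for_status      # no parentheses: a no-op, nothing is raised

try:
    resp.raise_for_status()  # with parentheses: raises for 4xx/5xx
    raised = False
except requests.HTTPError:
    raised = True

print(raised)  # True
```

Without the parentheses, a timing loop happily counts error pages as successful fetches.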
-
<p>import requests</p><p>import time</p><p><br ></p><p>def getTime(url):</p><p>    try:</p><p>        r = requests.get(url, timeout=30)</p><p>        r.raise_for_status()</p><p>        r.encoding = r.apparent_encoding</p><p>        return r.text</p><p>    except Exception:</p><p>        print("request failed")</p><p><br ></p><p>if __name__ == "__main__":</p><p>    url = 'https://www.baidu.com'</p><p>    totaltime = 0</p><p>    for i in range(100):</p><p>        start = time.perf_counter()</p><p>        getTime(url)</p><p>        totaltime = totaltime + time.perf_counter() - start</p><p>    print("Total time: {:.2f}s".format(totaltime))</p><p>Total time: 5.52s</p>
-
<p>Python source:</p><p>import requests<br >import time<br >def gettime(url):<br >    try:<br >        r = requests.get(url, timeout=30)<br >        r.raise_for_status()<br >        r.encoding = r.apparent_encoding<br >        return r.text<br >    except Exception:<br >        print('request failed')<br >if __name__ == "__main__":<br >    url = 'https://www.baidu.com'<br >    totaltime = 0<br >    for i in range(100):<br >        starttime = time.perf_counter()<br >        gettime(url)<br >        totaltime = totaltime + time.perf_counter() - starttime<br >    print('Crawling the page 100 times took {:.4f} s'.format(totaltime))</p><p>Output: crawling the page 100 times took 55.9501 s</p>
-
<p>import requests</p><p><br ></p><p>def get_html_text(url):</p><p>    try:</p><p>        r = requests.get(url, timeout=30)</p><p>        r.raise_for_status()</p><p>        r.encoding = r.apparent_encoding</p><p>        return r.text</p><p>    except Exception:</p><p>        return 'request failed'</p><p><br ></p><p>url = 'https://www.baidu.com'</p><p>for number in range(100):</p><p>    print(get_html_text(url))</p><p>Took 8.9s</p><p><br ></p><p>One last question for everyone: what is the point of the line if __name__ == '__main__'?</p>
-
This seems to be a kind of format, a convention: you set something fixed, and the code written after it can reference it.
-
<p>Oh, I sort of get it. Thanks for your answer!</p>
-
<p>To put it simply: once you put this code in a .py file, you can double-click the .py file and it runs directly.</p>
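To pin down the question above: when Python executes a file directly, it sets that module's `__name__` variable to the string `"__main__"`; when the same file is imported from another script, `__name__` is the module's own name instead. So the guard makes a block run only on direct execution, never on import, which is why reusable modules put their demo or test code under it. A minimal sketch (the file name is hypothetical):

```python
# timing_utils.py (illustrative file name)

def total_seconds(samples):
    """Sum a list of per-request timings."""
    return sum(samples)

# Runs only under `python timing_utils.py`; skipped when another script
# does `import timing_utils`, because then __name__ is "timing_utils"
# rather than "__main__".
if __name__ == "__main__":
    print(total_seconds([0.5, 0.4, 0.6]))
```

Importing the module gives the caller `total_seconds` without triggering the demo print.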
-
-
<p>import time<br >import requests<br >def gethtml(url):<br >    try:<br >        r = requests.get(url, timeout=30)<br >        r.raise_for_status()<br >        r.encoding = r.apparent_encoding<br >        return r.text<br >    except Exception:<br >        return "error"<br >start = time.time()<br >for i in range(100):<br >    url = "https://www.cnki.net/"<br >    gethtml(url)<br >end = time.time() - start<br >print(end)<br ></p><p>58.22875738143921</p>
-
-
<p>import time as t</p><p>import requests as r</p><p><br ></p><p>def getHtml(url):</p><p>    try:</p><p>        nn = r.get(url)</p><p>        print(nn.status_code)</p><p>        nn.raise_for_status()</p><p>        return nn.text</p><p>    except Exception:</p><p>        print("an error occurred")</p><p><br ></p><p>url = "https://www.baidu.com"</p><p>start = t.time()</p><p>for i in range(100):</p><p>    nr = getHtml(url)</p><p>end = t.time()</p><p>print("{} s".format(end - start))</p><p>print(nr)</p><p>Ran in 13.36 s</p>
-
<p>Ran 100 times in 17.22 seconds; all 100 crawls succeeded.</p><p><code class="brush:python;toolbar:false" >import requests<br >import time<br >def getHTML(url):<br >    try:<br >        r = requests.get(url, headers=hd)<br >        return "OK"<br >    except Exception:<br >        return "crawl failed!"<br >hd = {"user-agent": "chrome/10"}<br >url = 'https://www.shanghairanking.cn/rankings/bcur/2023'<br >start_time = time.time()<br >count = 0<br >for i in range(100):<br >    info = getHTML(url)<br >    if info == 'OK':<br >        count += 1<br >print("Ran 100 times in {:.2f} s; {} crawls succeeded".format(time.time() - start_time, count))</code></p>
-
<p>import requests<br >import time<br ><br ><br >def netCra(url):<br >    try:<br >        r = requests.get(url)<br >        r.raise_for_status()<br >        return True<br >    except Exception:<br >        return False<br ><br ><br >url = "https://www.tfrerc.cn/"<br >stime = time.perf_counter()<br >re = {"True": 0, "False": 0}<br >for i in range(100):<br >    if netCra(url):<br >        re["True"] = re.get("True", 0) + 1<br >    else:<br >        re["False"] = re.get("False", 0) + 1<br >runTime = time.perf_counter() - stime<br >print(f"Succeeded {re['True']} times,\nfailed {re['False']} times")<br >print(f"Total time: {runTime:.2f} s")<br ><br ></p><p>Succeeded 100 times,</p><p>failed 0 times</p><p>Total time: 6.01 s</p>
-
-
<p>import requests<br >import time<br >def getHTMLText(url):<br >    try:<br >        r = requests.get(url)<br >        r.raise_for_status()<br >        return True<br >    except Exception:<br >        return False<br ><br >t = 0<br >f = 0<br >url = "https://www.tfrerc.cn/"<br >stime = time.perf_counter()<br >for i in range(0, 100):<br >    if getHTMLText(url):<br >        t += 1<br >    else:<br >        f += 1<br >runTime = time.perf_counter() - stime<br >print("Response time: %0.2f s" % runTime)<br >print("Succeeded " + str(t) + " times, failed " + str(f) + " times")</p><p><br ></p><p>Response time: 36.92 s</p><p>Succeeded 100 times, failed 0 times</p><p><br ></p>
-
-
<p>import requests<br >import time<br ><br ><br >def spider1(url):<br >    try:<br >        r = requests.get(url, timeout=30)<br >        r.raise_for_status()<br >        return True<br >    except Exception:<br >        return False<br ><br ><br >url1 = "https://ssr1.scrape.center/"<br >url2 = "https://www.tfrerc.cn/"<br ><br >stime = time.perf_counter()<br >re = {"True": 0, "False": 0}<br >for i in range(100):<br >    if spider1(url1):<br >        re["True"] = re.get("True", 0) + 1<br >    else:<br >        re["False"] = re.get("False", 0) + 1<br ><br >runtime = time.perf_counter() - stime<br >print(f"Succeeded {re['True']} times,\nfailed {re['False']} times")<br >print(f"Total time: {runtime:.2f} s")</p><p><br ></p><p>Output:</p><p>Succeeded 100 times,</p><p>failed 0 times</p><p>Total time: 32.82 s</p><p>Process finished with exit code 0</p>
-
<p>import requests</p><p>import time</p><p>start = time.perf_counter()</p><p>for i in range(100):</p><p>    url = "https://item.jd.com/2967929.html"</p><p>    r = requests.get(url)</p><p>    T = time.perf_counter() - start</p><p>    if i != 99:</p><p>        continue</p><p>    print("time of crawling this website for 100 times: {}".format(T))</p><p><br ></p>
-
-
<p><code class="brush:python;toolbar:false" >import requests, time<br >def getHTML(url):<br >    try:<br >        r = requests.get(f'https://{url}.com')<br >        r.raise_for_status()<br >        return True<br >    except Exception:<br >        return False<br >def recordTime(url, Count):<br >    start = time.perf_counter()<br >    c = 0<br >    while c < int(Count):<br >        if getHTML(url):<br >            c += 1<br >    totalTime = time.perf_counter() - start<br >    print(f'Crawling https://{url}.com {Count} times took {round(totalTime, 2)} s')<br >url = input('Enter the domain of the site to crawl: ')<br >Count = input('Enter the number of crawls: ')<br >print('Starting crawl....')<br >recordTime(url, Count)</code></p>
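One caveat with retry-until-success loops like the one above: since the counter only advances on a successful fetch, a site that never responds makes the loop run forever. A defensive pattern is to cap total attempts; the sketch below uses illustrative names (`crawl_n_times`, `flaky`) and an injected fetch callable so it runs without any network:

```python
def crawl_n_times(fetch, n, max_attempts=None):
    """Call fetch() until n successes or max_attempts tries.

    fetch: a zero-argument callable returning True on success.
    Returns the number of successes actually achieved.
    """
    if max_attempts is None:
        max_attempts = n * 3  # generous retry budget
    successes = attempts = 0
    while successes < n and attempts < max_attempts:
        attempts += 1
        if fetch():
            successes += 1
    return successes

# Simulated fetcher that fails on every third call.
calls = {"i": 0}
def flaky():
    calls["i"] += 1
    return calls["i"] % 3 != 0

print(crawl_n_times(flaky, 10))  # reaches 10 successes despite failures
```

With a permanently dead target, `crawl_n_times` stops after `max_attempts` tries and reports how far it got, instead of hanging.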
-
-
Through this course I learned a great deal about the subject. It greatly increased my interest in the field, broadened my interests outside class, and fostered my creativity. I gained a lot from it!
-