urllib模块

urllib模块简介

urllib提供了一系列用于操作URL的功能。包含urllib.request,urllib.error,urllib.parse,urllib.robotparser四个子模块

  • urllib.request打开和浏览url中内容
  • urllib.error包含从 urllib.request发生的错误或异常
  • urllib.parse解析url
  • urllib.robotparser解析 robots.txt文件

urllib.request.urlopen()格式
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

urllib模块介绍

urlopen函数参数:

  •  url: 需要打开的网址
  •  data:Post提交的数据
  •  timeout:设置网站的访问超时时间

urlopen返回对象提供方法:

  •  read() , readline() ,readlines() , fileno() , close() :对HTTPResponse类型数据进行操作
  •   info():返回HTTPMessage对象,表示远程服务器返回的头信息
  •   getcode():返回Http状态码。如果是http请求,200请求成功完成 ; 404网址未找到
  •   geturl():返回请求的url

urlopen返回对象提供的属性:

- status:返回Http状态码。如果是http请求,200请求成功完成 ; 404网址未找到

- reason:返回数字,比如200

url参数的使用

get请求

例如对百度的一个URL https://www.baidu.com/进行抓取:

#demoe5.py
from urllib import request
url = 'https://www.baidu.com/' f = request.urlopen(url) data = f.read() print('Status:', f.status, f.reason) print('Data:', data)

运行程序可以得到如下:

C:\Pycham\venv\Scripts\python.exe C:/Pycham/demoe5.py
Status: 200 OK
Data: b'<html>\r\n<head>\r\n\t<script>\r\n\t\tlocation.replace(location.href.replace("https://","http://"));\r\n\t</script>\r\n</head>\r\n<body>\r\n\t<noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>\r\n</body>\r\n</html>'

Process finished with exit code 0

  

Data的数据格式为bytes类型,需要decode()解码,转换成str类型

我们将最后一句代码改为

print('Data:', data.decode('utf-8'))

#demoe5.py
from urllib import request
url = 'https://www.baidu.com/'
f = request.urlopen(url)
data = f.read()
print('Status:', f.status, f.reason)
print('Data:', data.decode('utf-8'))

运行程序可以得到如下:

C:\Pycham\venv\Scripts\python.exe C:/Pycham/demoe5.py
Status: 200 OK
Data: <html>
<head>
	<script>
		location.replace(location.href.replace("https://","http://"));
	</script>
</head>
<body>
	<noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>
</body>
</html>

Process finished with exit code 0

这样得到的内容就可以与网页编码内容一样了

data参数的使用

post请求

urlopen()的data参数默认为None,当data参数不为空的时候,urlopen()提交方式为Post

Post的数据必须是bytes或者iterable of bytes,不能是str,如果是str需要进行encode()编码

然后作为data参数传递给Request对象。编码是使用一个urllib.parse库中的函数完成的。

import urllib.parse
import urllib.request

url = 'http://www.someserver.com/cgi-bin/register.cgi'
values = {'name' : 'Michael Foord',
          'location' : 'Northampton',
          'language' : 'Python' }

data = urllib.parse.urlencode(values)
data = data.encode('ascii') # data should be bytes
req = urllib.request.Request(url, data)
with urllib.request.urlopen(req) as response:
   the_page = response.read()

如果不传递data参数,那urllib就会使用GET请求方式。GET方式和POST方式的其中一个区别在于POST请求经常有副作用:它们会以某种方式改变系统的状态(比如,在网上下订单,会有一英担的午餐肉罐头送到你家门口)。尽管HTTP标准明确说POST方式总是会造成副作用,而GET方式从来不会,但是并没有保证措施。数据也可以用GET方式传递,只要把它编码在url中。

timeout参数的使用
在某些网络情况不好或者服务器端异常的情况会出现请求慢的情况,或者请求异常,所以这个时候我们需要给
请求设置一个超时时间,而不是让程序一直在等待结果。例子如下:

#demoe5.py
from urllib import request

url = 'https://www.baidu.com/'
f = request.urlopen(url,timeout=0.1)
data = f.read()
print('Status:', f.status, f.reason)
print('Data:', data)

  

使用Request包装请求

有很多网站为了防止程序爬虫爬网站造成网站瘫痪或者会给不同的浏览器发送不同的版本,会需要携带一些headers头部信息才能访问,最长见的有user-agent参数

格式:

urllib.request.Request(url, data=None, headers={}, method=None)

使用request()来包装请求,再通过urlopen()获取页面

用来包装头部的数据:

  •         User-Agent :这个头部可以携带如下几条信息:浏览器名和版本号、操作系统名和版本号、默认语言
  •         Referer:可以用来防止盗链,有一些网站图片显示来源http://***.com,就是检查Referer来鉴定的
  •        Connection:表示连接状态,记录Session的状态。

第一种方法:

import urllib.request

url = "https://www.baidu.com/"
#创建Request对象
request = urllib.request.Request(url)
#添加http的header
request.add_header('User-Agent','Mozilla/5.0 (compatible; MSIE 5.5; Windows NT)')
#发送请求获取结果
response2 = urllib.request.urlopen(request)
data =response2.read()
print(response2.status,response2.reason)       #打印请求的状态码
print(len(data))                               #输出网页字符串的长度
print(data.decode())                           #输出网页内容

  

运行结果如下:

C:\Pycham\venv\Scripts\python.exe C:/Pycham/demoe5.py
200 OK
227
<html>
<head>
	<script>
		location.replace(location.href.replace("https://","http://"));
	</script>
</head>
<body>
	<noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>
</body>
</html>

Process finished with exit code 0

  

第二种方法:

import urllib.request
url= 'https://www.baidu.com/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
    'Referer': 'https://www.baidu.com/',
    'Connection': 'keep-alive'
}
request = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(request).read()
data = response.decode('utf-8')
print(data)

  

运行结果如下:

</script>

<script type="text/javascript">var Cookie={set:function(e,t,o,i,s,n){document.cookie=e+"="+(n?t:escape(t))+(s?"; expires="+s.toGMTString():"")+(i?"; path="+i:"; path=/")+(o?"; domain="+o:"")},get:function(e,t){var o=document.cookie.match(new RegExp("(^| )"+e+"=([^;]*)(;|$)"));return null!=o?unescape(o[2]):t},clear:function(e,t,o){this.get(e)&&(document.cookie=e+"="+(t?"; path="+t:"; path=/")+(o?"; domain="+o:"")+";expires=Fri, 02-Jan-1970 00:00:00 GMT")}};!function(){function save(e){var t=[];for(tmpName in options)options.hasOwnProperty(tmpName)&&"duRobotState"!==tmpName&&t.push('"'+tmpName+'":"'+options[tmpName]+'"');
var o="{"+t.join(",")+"}";bds.comm.personalData?$.ajax({url:"//www.baidu.com/ups/submit/addtips/?product=ps&tips="+encodeURIComponent(o)+"&_r="+(new Date).getTime(),success:function(){writeCookie(),"function"==typeof e&&e()}}):(writeCookie(),"function"==typeof e&&setTimeout(e,0))}function set(e,t){options[e]=t}function get(e){return options[e]}function writeCookie(){if(options.hasOwnProperty("sugSet")){var e="0"==options.sugSet?"0":"3";clearCookie("sug"),Cookie.set("sug",e,document.domain,"/",expire30y)
}if(options.hasOwnProperty("sugStoreSet")){var e=0==options.sugStoreSet?"0":"1";clearCookie("sugstore"),Cookie.set("sugstore",e,document.domain,"/",expire30y)}if(options.hasOwnProperty("isSwitch")){var t={0:"2",1:"0",2:"1"},e=t[options.isSwitch];clearCookie("ORIGIN"),Cookie.set("ORIGIN",e,document.domain,"/",expire30y)}if(options.hasOwnProperty("imeSwitch")){var e=options.imeSwitch;clearCookie("bdime"),Cookie.set("bdime",e,document.domain,"/",expire30y)}}function writeBAIDUID(){var e,t,o,i=Cookie.get("BAIDUID");
/FG=(\d+)/.test(i)&&(t=RegExp.$1),/SL=(\d+)/.test(i)&&(o=RegExp.$1),/NR=(\d+)/.test(i)&&(e=RegExp.$1),options.hasOwnProperty("resultNum")&&(e=options.resultNum),options.hasOwnProperty("resultLang")&&(o=options.resultLang),Cookie.set("BAIDUID",i.replace(/:.*$/,"")+("undefined"!=typeof o?":SL="+o:"")+("undefined"!=typeof e?":NR="+e:"")+("undefined"!=typeof t?":FG="+t:""),".baidu.com","/",expire30y,!0)}function clearCookie(e){Cookie.clear(e,"/"),Cookie.clear(e,"/",document.domain),Cookie.clear(e,"/","."+document.domain),Cookie.clear(e,"/",".baidu.com")
}function reset(e){options=defaultOptions,save(e)}var defaultOptions={sugSet:1,sugStoreSet:1,isSwitch:1,isJumpHttps:1,imeSwitch:0,resultNum:10,skinOpen:1,resultLang:0,duRobotState:"000"},options={},tmpName,expire30y=new Date;expire30y.setTime(expire30y.getTime()+94608e7);try{if(bds&&bds.comm&&bds.comm.personalData){if("string"==typeof bds.comm.personalData&&(bds.comm.personalData=eval("("+bds.comm.personalData+")")),!bds.comm.personalData)return;for(tmpName in bds.comm.personalData)defaultOptions.hasOwnProperty(tmpName)&&bds.comm.personalData.hasOwnProperty(tmpName)&&"SUCCESS"==bds.comm.personalData[tmpName].ErrMsg&&(options[tmpName]=bds.comm.personalData[tmpName].value)
}try{parseInt(options.resultNum)||delete options.resultNum,parseInt(options.resultLang)||"0"==options.resultLang||delete options.resultLang}catch(e){}writeCookie(),"sugSet"in options||(options.sugSet=3!=Cookie.get("sug",3)?0:1),"sugStoreSet"in options||(options.sugStoreSet=Cookie.get("sugstore",0));var BAIDUID=Cookie.get("BAIDUID");"resultNum"in options||(options.resultNum=/NR=(\d+)/.test(BAIDUID)&&RegExp.$1?parseInt(RegExp.$1):10),"resultLang"in options||(options.resultLang=/SL=(\d+)/.test(BAIDUID)&&RegExp.$1?parseInt(RegExp.$1):0),"isSwitch"in options||(options.isSwitch=2==Cookie.get("ORIGIN",0)?0:1==Cookie.get("ORIGIN",0)?2:1),"imeSwitch"in options||(options.imeSwitch=Cookie.get("bdime",0))
}catch(e){}window.UPS={writeBAIDUID:writeBAIDUID,reset:reset,get:get,set:set,save:save}}(),function(){var e="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/static/protocol/https/plugins/every_cookie_4644b13.js";("Mac68K"==navigator.platform||"MacPPC"==navigator.platform||"Macintosh"==navigator.platform||"MacIntel"==navigator.platform)&&(e="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/static/protocol/https/plugins/every_cookie_mac_82990d4.js"),setTimeout(function(){$.ajax({url:e,cache:!0,dataType:"script"})},0);var t=navigator&&navigator.userAgent?navigator.userAgent:"",o=document&&document.cookie?document.cookie:"",i=!!(t.match(/(msie [2-8])/i)||t.match(/windows.*safari/i)&&!t.match(/chrome/i)||t.match(/(linux.*firefox)/i)||t.match(/Chrome\/29/i)||t.match(/mac os x.*firefox/i)||o.match(/\bISSW=1/)||0==UPS.get("isSwitch"));
bds&&bds.comm&&(bds.comm.supportis=!i,bds.comm.isui=!0),window.__restart_confirm_timeout=!0,window.__confirm_timeout=8e3,window.__disable_is_guide=!0,window.__disable_swap_to_empty=!0,window.__switch_add_mask=!0;var s="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/static/protocol/https/global/js/all_async_search_7edb824.js",n="/script";document.write("<script src='"+s+"'><"+n+">"),bds.comm.newindex&&$(window).on("index_off",function(){$('<div class="c-tips-container" id="c-tips-container"></div>').insertAfter("#wrapper"),window.__sample_dynamic_tab&&$("#s_tab").remove()
}),bds.comm&&bds.comm.ishome&&Cookie.get("H_PS_PSSID")&&(bds.comm.indexSid=Cookie.get("H_PS_PSSID"));var a=$(document).find("#s_tab").find("a");a&&a.length>0&&a.each(function(e,t){t.innerHTML&&t.innerHTML.match(/新闻/)&&(t.innerHTML="资讯",t.href="//www.baidu.com/s?rtt=1&bsst=1&cl=2&tn=news&word=",t.setAttribute("sync",!0))})}();</script>



<script>
if(bds.comm.supportis){
    window.__restart_confirm_timeout=true;
    window.__confirm_timeout=8000;
    window.__disable_is_guide=true;
    window.__disable_swap_to_empty=true;
}
initPreload({
    'isui':true,
    'index_form':"#form",
    'index_kw':"#kw",
    'result_form':"#form",
    'result_kw':"#kw"
});
</script>

<script>
if(navigator.cookieEnabled){
	document.cookie="NOJS=;expires=Sat, 01 Jan 2000 00:00:00 GMT";
}
</script>



</body>
</html>







Process finished with exit code 0

  




10-13 15:46