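I'm writing a web crawler for a specific site. The application is a VB.Net Windows Forms application that does not use multiple threads; each web request is made sequentially. However, after ten successful page retrievals, every subsequent request times out.
I have reviewed the similar questions already posted here and have implemented the recommended techniques in my GetPage routine, shown below: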
Imports System.IO
Imports System.Net

Public Function GetPage(ByVal url As String) As String
    Dim result As String = String.Empty
    Dim uri As New Uri(url)
    ' Raise the per-host connection limit for this endpoint.
    Dim sp As ServicePoint = ServicePointManager.FindServicePoint(uri)
    sp.ConnectionLimit = 100
    Dim request As HttpWebRequest = DirectCast(WebRequest.Create(uri), HttpWebRequest)
    request.KeepAlive = False
    request.Timeout = 15000
    Try
        Using response As HttpWebResponse = DirectCast(request.GetResponse(), HttpWebResponse)
            Using dataStream As Stream = response.GetResponseStream()
                Using reader As New StreamReader(dataStream)
                    If response.StatusCode <> HttpStatusCode.OK Then
                        ' Use & so the status code is concatenated as text rather than added numerically.
                        Throw New Exception("Got response status code: " & response.StatusCode.ToString())
                    End If
                    result = reader.ReadToEnd()
                End Using
            End Using
            response.Close()
        End Using
    Catch ex As Exception
        Dim msg As String = "Error reading page """ & url & """. " & ex.Message
        Logger.LogMessage(msg, LogOutputLevel.Diagnostics)
    End Try
    Return result
End Function
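Have I missed something? Am I failing to close or dispose of something I should? It seems strange that it always happens after ten consecutive requests.
Notes: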
ServicePointManager.DefaultConnectionLimit = 100
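Edit
I added a delay of between 2 and 7 seconds between each web request so that I don't appear to be "hammering" the site or attempting a DoS attack. However, the problem still occurs.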
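Best answer
I think the site has some sort of DoS protection that kicks in when it is hit with a number of rapid requests. You may want to try setting the UserAgent on the web request.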
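As a minimal sketch of that suggestion: HttpWebRequest exposes a UserAgent property, and as far as I know no User-Agent header is sent unless it is set. The module name, the CreateRequest helper, and the user-agent string below are hypothetical placeholders, not something from the question; whether this actually gets past the site's protection is an assumption to test.

```vbnet
Imports System.Net

Module UserAgentSketch
    ' Hypothetical helper: builds a request that carries an explicit User-Agent header.
    Public Function CreateRequest(ByVal url As String) As HttpWebRequest
        Dim request As HttpWebRequest = DirectCast(WebRequest.Create(New Uri(url)), HttpWebRequest)
        ' Placeholder value; substitute a string that identifies your crawler.
        request.UserAgent = "MyCrawler/1.0 (+http://example.com/crawler-info)"
        ' Same timeout as in the question's GetPage routine.
        request.Timeout = 15000
        Return request
    End Function
End Module
```

In the GetPage routine from the question, the equivalent change would be a single request.UserAgent assignment placed before the call to GetResponse.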
Source: the Stack Overflow question ".net - HttpWebRequest times out after ten consecutive requests": https://stackoverflow.com/questions/1191926/