
Crawling the 阳光热线问政平台 (Sunshine Hotline) Platform with Python and Scrapy: A Walkthrough


Goal: for every post in the problem-report section of the Sunshine Hotline platform, crawl the post's title, content, number, and URL.

CrawlSpider version, step by step:

1. Create the crawler project dongguan:

            
scrapy startproject dongguan
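
Running this produces the standard Scrapy project skeleton; the layout below is the typical one (minor details vary by Scrapy version):

dongguan/
  scrapy.cfg
  dongguan/
    __init__.py
    items.py
    pipelines.py
    settings.py
    spiders/
      __init__.py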
          

設(shè)置items.py文件

            
# -*- coding: utf-8 -*-
import scrapy

class DongguanItem(scrapy.Item):
  # define the fields for your item here like:
  # name = scrapy.Field()
  # URL of each post
  url = scrapy.Field()
  # post title
  title = scrapy.Field()
  # post number
  number = scrapy.Field()
  # post content
  content = scrapy.Field()
          

3. In the spiders directory, create and edit the spider file sun.py:

            
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from dongguan.items import DongguanItem

class SunSpider(CrawlSpider):
  name = 'dg'
  allowed_domains = ['wz.sun0769.com']
  start_urls = ['http://wz.sun0769.com/html/top/report.shtml']
  # rules is a collection of Rule objects, and every rule is applied to each
  # response. If the web server has an anti-crawler measure such as returning
  # fake URLs, the process_links parameter of Rule can point to a custom
  # function that repairs each URL before it is requested.
  rules = (
    # Every URL has a unique fingerprint, and each crawl keeps a dedupe queue.
    # A Rule without a callback defaults to follow=True: each matched link is
    # requested, and links matched inside that response are followed in turn;
    # there is simply no callback to process the response data.
    # With follow=False the rule only extracts links from the current page
    # without following them; with follow=True it keeps following matches
    # until none are left. (Note that Python's default recursion limit is
    # 1000; exceeding it raises an exception.)
    Rule(LinkExtractor(allow="page=")),
    Rule(LinkExtractor(allow=r'http://wz.sun0769.com/html/question/\d+/\d+\.shtml'), callback='parse_item')
  )

  def parse_item(self, response):
    print(response.url)
    item = DongguanItem()
    item['url'] = response.url
    item['title'] = response.xpath('//div[@class="pagecenter p3"]//strong/text()').extract()[0]
    item['number'] = response.xpath('//div[@class="pagecenter p3"]//strong/text()').extract()[0].split(' ')[-1].split(':')[-1]
    # Handle posts that contain images: a post without images has no div with
    # class="contentext", so use that to decide where the content lives.
    if len(response.xpath('//div[@class="contentext"]')) == 0:
      item['content'] = ''.join(response.xpath('//div[@class="c1 text14_2"]/text()').extract())
    else:
      item['content'] = ''.join(response.xpath('//div[@class="contentext"]/text()').extract())
    yield item
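
The rules comment above mentions process_links. A minimal sketch of such a handler, written as a method on the spider, is below; the fix_links name and the URL substitution are hypothetical, purely to illustrate the hook:

def fix_links(self, links):
  # links is the list of Link objects the LinkExtractor matched;
  # rewrite each URL before Scrapy requests it (hypothetical transformation)
  for link in links:
    link.url = link.url.replace('/fake/', '/real/')
  return links

# registered on the rule as, e.g.:
# Rule(LinkExtractor(allow="page="), process_links='fix_links')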
          

4. Edit the pipeline file pipelines.py:

            
# -*- coding: utf-8 -*-
import json

class DongguanPipeline(object):
  def __init__(self):
    # open the file as UTF-8 so each item does not need to be encoded by hand
    self.file = open('dongguan.json', 'w', encoding='utf-8')

  def process_item(self, item, spider):
    content = json.dumps(dict(item), ensure_ascii=False) + '\n'
    self.file.write(content)
    return item

  def close_spider(self, spider):
    self.file.close()
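
As an aside, Scrapy ships a JsonLinesItemExporter that can replace the hand-rolled json.dumps loop. This is an equivalent sketch, not the version used in this article; the class name DongguanExporterPipeline is hypothetical:

from scrapy.exporters import JsonLinesItemExporter

class DongguanExporterPipeline(object):
  def open_spider(self, spider):
    self.file = open('dongguan.json', 'wb')  # the exporter writes bytes
    self.exporter = JsonLinesItemExporter(self.file, ensure_ascii=False)
    self.exporter.start_exporting()

  def process_item(self, item, spider):
    self.exporter.export_item(item)
    return item

  def close_spider(self, spider):
    self.exporter.finish_exporting()
    self.file.close()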
          

5. Edit the settings.py file:

            
# -*- coding: utf-8 -*-
BOT_NAME = 'dongguan'
SPIDER_MODULES = ['dongguan.spiders']
NEWSPIDER_MODULE = 'dongguan.spiders'
# The log file is saved in the current directory by default. LOG_LEVEL below
# sets the threshold: messages at INFO and above are written to the file.
LOG_FILE = 'dongguan.log'
LOG_LEVEL = 'INFO'
# crawl depth limit
# DEPTH_LIMIT = 1
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'dongguan (+http://www.yourdomain.com)'
# Obey robots.txt rules
# ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
  'dongguan.pipelines.DongguanPipeline': 300,
}
          

測(cè)試運(yùn)行爬蟲(chóng),終端執(zhí)行命令(只要在項(xiàng)目目錄內(nèi)即可)

            
scrapy crawl dg
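
Scrapy's built-in feed exports can also dump the items without the custom pipeline, e.g.:

scrapy crawl dg -o items.json

Depending on the Scrapy version, JSON feeds escape non-ASCII characters by default; setting FEED_EXPORT_ENCODING = 'utf-8' in settings.py keeps the Chinese text readable.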
          

Plain Spider version, step by step:

1. Create the crawler project newdongguan:

            
scrapy startproject newdongguan
          

設(shè)置items.py文件

            
# -*- coding: utf-8 -*-
import scrapy

class NewdongguanItem(scrapy.Item):
  # URL of each post
  url = scrapy.Field()
  # post title
  title = scrapy.Field()
  # post number
  number = scrapy.Field()
  # post content
  content = scrapy.Field()
          

3. In the spiders directory, create and edit the spider file newsun.py:

            
# -*- coding: utf-8 -*-
import scrapy
from newdongguan.items import NewdongguanItem

class NewsunSpider(scrapy.Spider):
  name = 'ndg'
  # Restrict crawling to this domain. Optional: if omitted, the crawl is not
  # limited by domain, which may let the spider run out of control.
  allowed_domains = ['wz.sun0769.com']
  offset = 0
  url = 'http://wz.sun0769.com/index.php/question/report?page=' + str(offset)
  start_urls = [url]

  def parse(self, response):
    link_list = response.xpath("//a[@class='news14']/@href").extract()
    for each in link_list:
      # request each post on the page; the callback extracts the target
      # fields and hands them to the pipeline
      yield scrapy.Request(each, callback=self.deal_link)
    self.offset += 30
    if self.offset <= 124260:
      url = 'http://wz.sun0769.com/index.php/question/report?page=' + str(self.offset)
      # request the next page; the response is handled by parse again
      yield scrapy.Request(url, callback=self.parse)

  # extract the data from each post page and return it to the pipeline
  def deal_link(self, response):
    item = NewdongguanItem()
    item['url'] = response.url
    item['title'] = response.xpath("//div[@class='pagecenter p3']//strong[@class='tgray14']/text()").extract()[0]
    item['number'] = response.xpath("//div[@class='pagecenter p3']//strong[@class='tgray14']/text()").extract()[0].split(' ')[-1].split(':')[-1]
    # a post without images has no div with class='contentext'; use that to
    # decide where the content lives
    if len(response.xpath("//div[@class='contentext']")) == 0:
      item['content'] = ''.join(response.xpath("//div[@class='c1 text14_2']/text()").extract())
    else:
      item['content'] = ''.join(response.xpath("//div[@class='contentext']/text()").extract())
    yield item
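
The split(' ')[-1].split(':')[-1] chain for the post number is brittle if the heading format shifts. A more defensive sketch (assuming the heading ends with the numeric post id, e.g. a heading ending in '编号:191166'; the extract_number helper is hypothetical) pulls the trailing digits with a regular expression:

import re

def extract_number(title):
  # grab the run of digits at the very end of the heading
  match = re.search(r'(\d+)\s*$', title)
  return match.group(1) if match else ''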
          

4. Edit the pipeline file pipelines.py:

            
# -*- coding: utf-8 -*-
import codecs
import json

class NewdongguanPipeline(object):

  def __init__(self):
    # write the file through codecs with an explicit encoding, which saves
    # encoding the content by hand on every write
    self.file = codecs.open('newdongguan.json', 'w', encoding='utf-8')
    # previous approach:
    # self.file = open('newdongguan.json', 'w')

  def process_item(self, item, spider):
    print(item['title'])
    content = json.dumps(dict(item), ensure_ascii=False) + '\n'
    # previous approach:
    # self.file.write(content.encode('utf-8'))
    self.file.write(content)
    return item

  def close_spider(self, spider):
    self.file.close()
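
On Python 3 the codecs module is no longer required for this: the built-in open() takes an encoding argument, and the constructor line above could equivalently read:

self.file = open('newdongguan.json', 'w', encoding='utf-8')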
          

5. Edit the settings.py file:

            
# -*- coding: utf-8 -*-
BOT_NAME = 'newdongguan'
SPIDER_MODULES = ['newdongguan.spiders']
NEWSPIDER_MODULE = 'newdongguan.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'newdongguan (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
  'newdongguan.pipelines.NewdongguanPipeline': 300,
}
          

測(cè)試運(yùn)行爬蟲(chóng),終端執(zhí)行命

            
scrapy crawl ndg
          

Note on writing this post: Markdown code-block indentation issues can be handled with the Tab key, while plain-text lines such as "Plain Spider version, step by step:" and "1. Create the crawler project newdongguan" just need a line break.

That is all for this article. I hope it helps your learning, and please continue to support 脚本之家.

