
Building a Simple Search Engine with Nutch and Tomcat on Ubuntu

System 2019-08-12 01:33:41

Setting up a simple search engine

My setup:

Nutch: 1.2

Tomcat: 7.0.57

1 Nutch setup

Modify the Nutch configuration.

1.1 Edit conf/nutch-site.xml

        
      <?xml version="1.0"?>
      <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

      <!-- Put site-specific property overrides in this file. -->

      <configuration>

          <!--property>
          <name>storage.data.store.class</name>
          <value>org.apache.gora.hbase.store.HBaseStore</value>
          <description>Default class for storing data</description>
          </property>
          <property>
          <name>http.agent.name</name>
          <value>xxx0624-ThinkPad-Edge</value>
          </property-->

      <property>
        <name>http.agent.name</name>
        <value>nutch1.0</value>
      </property>

      <property>
        <name>plugin.folders</name>
        <value>./plugins</value>
      </property>

      </configuration>

1.2 Edit conf/crawl-urlfilter.txt

      # accept hosts in MY.DOMAIN.NAME
      +^http://([a-z0-9]*\.)*sohu.com/

Find this rule and change it. I use Sohu as the example; the rule limits the crawl to URLs whose host is sohu.com or one of its subdomains.
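Before crawling, the rule can be tried against a few sample URLs. The following is a rough sketch that uses grep -E, whose extended regular expressions behave the same as Java's for this particular pattern; Nutch itself evaluates the rule with Java regular expressions. The check_url helper and the sample URLs are illustrative and are not part of Nutch.

```shell
# Pattern copied from conf/crawl-urlfilter.txt, without the leading "+".
pattern='^http://([a-z0-9]*\.)*sohu.com/'

# Hypothetical helper: prints "accept" or "reject" for a single URL.
check_url() {
  if printf '%s\n' "$1" | grep -Eq "$pattern"; then
    echo accept
  else
    echo reject
  fi
}

check_url 'http://www.sohu.com/'           # accept
check_url 'http://news.sohu.com/a/1.html'  # accept
check_url 'http://www.example.com/'        # reject
```

The dot in sohu.com is not escaped, so it matches any character; writing sohu\.com makes the rule stricter.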

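1.3 Add a seed directory

In the Nutch directory, use mkdir to create a new folder named urls, then create an empty text file named urls.txt inside it.

Write the addresses to crawl into urls.txt, for example http://www.sohu.com/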
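The two steps above can be sketched as a pair of shell commands, assuming the current directory is the Nutch directory and that a single seed URL is enough:

```shell
# Create the seed directory (no error if it already exists).
mkdir -p urls

# Write one seed URL per line; this overwrites any existing urls.txt.
printf '%s\n' 'http://www.sohu.com/' > urls/urls.txt

# Show the result.
cat urls/urls.txt
```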

1.4 Start the crawl

Command:

      bin/nutch crawl urls/urls.txt -dir crawled -depth 5 -threads 5 -topN 200

crawled is the directory where the crawl results are stored. When the crawl finishes, five folders are generated in it automatically: crawldb, index, indexes, linkdb, segments
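A quick way to confirm that the crawl produced all five folders is sketched below. The missing_crawl_dirs helper is hypothetical, and the directory name crawled is taken from the -dir argument of the command above.

```shell
# Hypothetical helper: prints the name of each expected subdirectory
# that does not exist under the directory given as $1.
missing_crawl_dirs() {
  for name in crawldb index indexes linkdb segments; do
    [ -d "$1/$name" ] || echo "$name"
  done
}

missing=$(missing_crawl_dirs crawled)
if [ -z "$missing" ]; then
  echo "crawl output looks complete"
else
  echo "missing under crawled/:" $missing
fi
```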

2 Tomcat setup

2.1 Copy the WAR file built from Nutch into Tomcat's webapps directory and start Tomcat. Then, in the nutch-1.2 folder that Tomcat unpacks, edit WEB-INF/classes/nutch-site.xml:

      <property>
        <name>searcher.dir</name>
        <value>/home/xxx0624/nutch-1.2/crawled</value>
      </property>

This tells the search web app where the crawled data is stored.

2.2 Fixes for garbled Chinese text

2.2.1 Edit Tomcat's conf/server.xml

      
      <Connector port="8080" protocol="HTTP/1.1"
                 connectionTimeout="20000"
                 redirectPort="8443"
                 URIEncoding="UTF-8"
                 useBodyEncodingForURI="true"/>

The URIEncoding and useBodyEncodingForURI attributes are the ones to add.
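After editing, a simple text search can confirm that both attributes made it into the file. This is only a sketch: the has_uri_encoding helper is hypothetical, the path assumes CATALINA_HOME points at the Tomcat installation, and the check matches the attributes anywhere in the file, including inside comments.

```shell
# Hypothetical helper: succeeds if the file given as $1 contains both attributes.
has_uri_encoding() {
  grep -q 'URIEncoding="UTF-8"' "$1" &&
    grep -q 'useBodyEncodingForURI="true"' "$1"
}

# Assumed location; adjust to wherever Tomcat is installed.
server_xml="${CATALINA_HOME:-.}/conf/server.xml"
if has_uri_encoding "$server_xml" 2>/dev/null; then
  echo "encoding attributes present"
else
  echo "encoding attributes missing in $server_xml"
fi
```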

2.2.2 Edit nutch-1.2/cache.jsp

Find this section:

        
      Metadata metaData = bean.getParseData(details).getContentMeta();
      ParseData parseData = bean.getParseData(details);
      String content = null;
      // Modified: read the content type through ParseData.
      // String contentType = (String) metaData.get(Metadata.CONTENT_TYPE);
      String contentType = parseData.getMeta(Metadata.CONTENT_TYPE);
      if (contentType.startsWith("text/html")) {
        // FIXME : it's better to emit the original 'byte' sequence
        // with 'charset' set to the value of 'CharEncoding',
        // but I don't know how to emit 'byte sequence' in JSP.
        // out.getOutputStream().write(bean.getContent(details)) may work,
        // but I'm not sure.
        // String encoding = (String) metaData.get("CharEncodingForConversion");
        String encoding = parseData.getMeta("CharEncodingForConversion");
        if (encoding != null) {
          try {
            content = new String(bean.getContent(details), encoding);
          }
          catch (UnsupportedEncodingException e) {
            // fallback to windows-1252
            content = new String(bean.getContent(details), "windows-1252");
          }
        }
        else {
          // Modified: no encoding recorded, so decode as GBK
          // instead of the platform default charset.
          content = new String(bean.getContent(details), "GBK");
          // content = new String(bean.getContent(details));
        }

3 Try it out

Restart Tomcat.

Open http://localhost:8080/nutch-1.2 in a browser.
