Crawling Ajax-driven dynamic web pages with Nutch
Published: 2019-06-27



Use the open-source HtmlUnit-based plugin for Nutch:

https://github.com/xautlx/nutch-htmlunit

Import the plugin into your Nutch environment.

During execution, however, various errors occur. The cause is a problem in lib-htmlunit's HttpWebClient class.

I modified it as follows:

package org.apache.nutch.protocol.htmlunit;

import org.apache.hadoop.conf.Configuration;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.net.URL;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlInput;
import com.gargoylesoftware.htmlunit.WebRequest;
import com.gargoylesoftware.htmlunit.AjaxController;
import com.gargoylesoftware.htmlunit.BrowserVersion;
/**
 * Htmlunit WebClient Helper
 * Use one WebClient instance per thread by ThreadLocal to support multiple threads execution
 */
public class HttpWebClient {
    private static final Logger LOG = LoggerFactory.getLogger("org.apache.nutch.protocol");
    private static ThreadLocal<WebClient> threadWebClient = new ThreadLocal<WebClient>();
    public static HtmlPage getHtmlPage(String url, Configuration conf) {
        try {
            WebClient webClient = threadWebClient.get();
            if (webClient == null) {
                LOG.info("Initializing web client for thread: {}", Thread.currentThread().getId());
                AjaxController ajaxController = new NicelyResynchronizingAjaxController();
                webClient = new WebClient(BrowserVersion.FIREFOX_17);
                webClient.getOptions().setCssEnabled(false);
                webClient.getOptions().setJavaScriptEnabled(true);
                webClient.setAjaxController(ajaxController);
                webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
                webClient.getOptions().setThrowExceptionOnScriptError(false);
                webClient.getOptions().setPrintContentOnFailingStatusCode(false);
                webClient.getOptions().setRedirectEnabled(true);
                webClient.getOptions().setPopupBlockerEnabled(true);
                webClient.setCache(new ExtHtmlunitCache());

                // Baidu Cloud (yun.baidu.com) is implemented almost entirely with Ajax,
                // so log in once with an account and password when the client is created.
                HtmlPage loginPage = webClient.getPage("http://yun.baidu.com");
                loginPage.getElementById("TANGRAM__PSP_4__userName").setAttribute("value", "280889189");
                loginPage.getElementById("TANGRAM__PSP_4__password").setAttribute("value", "123578951");
                loginPage = ((HtmlInput) loginPage.getElementById("TANGRAM__PSP_4__submit")).click();

                // Enhanced WebConnection based on urlfilter
                webClient.setWebConnection(new RegexHttpWebConnection(webClient, conf));
                threadWebClient.set(webClient);
            }
            HtmlPage page = webClient.getPage(url);
//            webClient.closeAllWindows();
            return page;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
    public static HtmlPage getHtmlPage(String url) {
        return getHtmlPage(url, null);
    }
}
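
The class above keeps one WebClient per thread in a ThreadLocal, as its header comment says. The following is a minimal sketch of that pattern on its own, using only the JDK so that it runs without Nutch or HtmlUnit. FakeClient, PerThreadClientDemo, current and distinctClients are hypothetical names introduced for illustration. They are not part of the plugin, and FakeClient only stands in for an expensive client object that should not be shared between threads.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class PerThreadClientDemo {

    /** Hypothetical stand-in for an expensive client such as WebClient. */
    static final class FakeClient {
        private static final AtomicInteger NEXT_ID = new AtomicInteger();
        final int id = NEXT_ID.incrementAndGet();
    }

    // Each thread lazily creates its own FakeClient the first time it calls get().
    private static final ThreadLocal<FakeClient> CLIENT =
            ThreadLocal.withInitial(FakeClient::new);

    /** Returns the calling thread's client, creating it on first use. */
    static FakeClient current() {
        return CLIENT.get();
    }

    /** Starts n threads, lets each look up its client twice, and counts distinct clients. */
    static int distinctClients(int n) {
        Set<Integer> ids = ConcurrentHashMap.newKeySet();
        Thread[] workers = new Thread[n];
        for (int i = 0; i < n; i++) {
            workers[i] = new Thread(() -> {
                ids.add(current().id); // first call creates this thread's client
                ids.add(current().id); // second call reuses it, so the id is the same
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RuntimeException(e);
            }
        }
        return ids.size();
    }

    public static void main(String[] args) {
        System.out.println("distinct clients for 4 threads: " + distinctClients(4));
    }
}
```

The same reasoning applies to the one-time login in getHtmlPage: because it sits inside the null check, it runs once per thread when that thread's client is created, not once per fetched URL.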

Reposted from: https://my.oschina.net/junfrank/blog/288033
