
Learning Java Web Scraping: A Simple Java Crawler with HttpClient and Jsoup

Introduction to HttpClient

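HttpClient began as a subproject of Apache Jakarta Commons and is now maintained under Apache HttpComponents. It is an efficient, up-to-date, feature-rich client-side toolkit for the HTTP protocol. The 4.x line used in this article implements HTTP/1.0 and HTTP/1.1. Its main features are: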

  • (1) Implements all HTTP methods (GET, POST, PUT, HEAD, etc.)
  • (2) Follows redirects automatically
  • (3) Supports HTTPS
  • (4) Supports proxy servers, among other things

Introduction to Jsoup

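jsoup is an HTML parser for Java that can parse a URL or a string of HTML directly. It provides a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods. Its main features are:

  • (1) Parses HTML from a URL, a file, or a string
  • (2) Finds and extracts data using DOM traversal or CSS selectors
  • (3) Manipulates HTML elements, attributes, and text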

Usage

Adding the dependencies to a Maven project

The dependencies in pom.xml are as follows:

<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.2</version>
</dependency>

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.8.3</version>
</dependency>

Writing the JUnit test

Code


import org.apache.http.HttpEntity;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.junit.Test;

/**
 * HttpClient & Jsoup library test class
 *
 * Created by xuyh at 2017/11/6 15:28.
 */
public class HttpClientJsoupTest {
    @Test
    public void test() {
        // Fetch the page with HttpClient and read the response body as plain text
        HttpGet httpGet = new HttpGet("http://sports.sina.com.cn/");
        httpGet.setConfig(RequestConfig.custom().setSocketTimeout(30000).setConnectTimeout(30000).build());
        CloseableHttpClient httpClient = null;
        CloseableHttpResponse response = null;

        String responseStr = "";
        try {
            httpClient = HttpClientBuilder.create().build();
            HttpClientContext context = HttpClientContext.create();
            response = httpClient.execute(httpGet, context);
            int state = response.getStatusLine().getStatusCode();
            HttpEntity entity = response.getEntity();
            // Keep the body only for a 200 response; otherwise responseStr stays empty
            if (state == 200 && entity != null)
                responseStr = EntityUtils.toString(entity, "utf-8");
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (response != null)
                    response.close();
                if (httpClient != null)
                    httpClient.close();
            } catch (Exception ex) {
                ex.printStackTrace();
            }
        }

        // Nothing to parse if the request failed or returned no body
        if (responseStr.isEmpty())
            return;

        // Parse the text into a Jsoup Document and work with it
        Document document = Jsoup.parse(responseStr);
        // first() returns null when no element matches, e.g. after a page redesign
        Element container = document.getElementsByAttributeValue("class", "phdnews_txt fr").first();
        if (container == null)
            return;
        Elements elements = container.getElementsByAttributeValue("class", "phdnews_hdline");
        elements.forEach(element -> {
            for (Element e : element.getElementsByTag("a")) {
                System.out.println(e.attr("href"));
                System.out.println(e.text());
            }
        });
    }
}

Walkthrough

  • Create an HttpGet object that will send a GET request to the URL http://sports.sina.com.cn/, and set both the socket timeout and the connect timeout to 30000 ms.
HttpGet httpGet = new HttpGet("http://sports.sina.com.cn/");
httpGet.setConfig(RequestConfig.custom().setSocketTimeout(30000).setConnectTimeout(30000).build());
  • Create a CloseableHttpClient with HttpClientBuilder and execute the request described by the HttpGet above. execute returns the response as a CloseableHttpResponse; the newly created HttpClientContext only carries execution state for the request (cookies, for example). Finally, read the response body as text from the response entity.
CloseableHttpClient httpClient = null;
CloseableHttpResponse response = null;

String responseStr = "";
try {
    httpClient = HttpClientBuilder.create().build();
    HttpClientContext context = HttpClientContext.create();

    response = httpClient.execute(httpGet, context);

    int state = response.getStatusLine().getStatusCode();
    HttpEntity entity = response.getEntity();
    // Keep the body only for a 200 response; otherwise responseStr stays empty
    if (state == 200 && entity != null)
        responseStr = EntityUtils.toString(entity, "utf-8");

} catch (Exception e) {
    e.printStackTrace();
} finally {
    try {
        if (response != null)
            response.close();
        if (httpClient != null)
            httpClient.close();
    } catch (Exception ex) {
        ex.printStackTrace();
    }
}
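  • Parse the response text with the Jsoup library and pick out the elements of interest
Document document = Jsoup.parse(responseStr);

// first() returns null when no element matches, e.g. after a page redesign
Element container = document.getElementsByAttributeValue("class", "phdnews_txt fr").first();
if (container == null)
    return;
Elements elements = container.getElementsByAttributeValue("class", "phdnews_hdline");

elements.forEach(element -> {
    for (Element e : element.getElementsByTag("a")) {
        System.out.println(e.attr("href"));
        System.out.println(e.text());
    }
});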
  • Jsoup's Document class extends org.jsoup.nodes.Element, so the two share methods such as:
public Element getElementById(String id); // get an element by id
public Elements getElementsByClass(String className); // get elements by class name
public Elements getElementsByAttributeValue(String key, String value); // get elements by attribute value
public Elements getElementsByTag(String tagName); // get elements by tag name
public String attr(String attributeKey); // get the value of an attribute of this element
public String text(); // get the text content of this element
  • The HTML elements these methods work on take the following form:
<div class="code">
    <div>
        <br>
            This is the first paragraph.
        <br>
    </div>
</div>
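
A note on the "utf-8" argument in the walkthrough: EntityUtils.toString(entity, "utf-8") uses the charset declared by the response entity when there is one, and falls back to the given default only when none is declared. The JDK-only sketch below illustrates that resolution order for a raw Content-Type value. ContentTypeCharset and fromContentType are hypothetical names made up for this illustration; this is not HttpClient's implementation.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Hypothetical helper for illustration; not part of HttpClient or jsoup. */
final class ContentTypeCharset {

    /** Returns the charset named in a Content-Type value, or UTF-8 if none is usable. */
    static Charset fromContentType(String contentType) {
        if (contentType == null) {
            return StandardCharsets.UTF_8;
        }
        // Parameters follow the media type, separated by ';' (e.g. "text/html; charset=GBK").
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Unknown or malformed charset name: use the default instead.
                    return StandardCharsets.UTF_8;
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(fromContentType("text/html; charset=ISO-8859-1"));
        System.out.println(fromContentType("text/html"));
    }
}
```

For pages that declare a non-UTF-8 charset only inside an HTML meta tag, the header carries no charset and the default is what gets used, so the default passed to EntityUtils.toString still matters.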

Results

  • The output is as follows
http://sports.sina.com.cn/sportsevents/3v3/2017-11-05/doc-ifynmzrs7218551.shtml
3X3黄金联赛冠军赛山西队夺冠!独享48
http://video.sina.com.cn/sports/k/cba/1105final3x3/
视频
http://video.sina.com.cn/p/sports/k/v/doc/2017-11-05/181467390769.html
黄金mvp集锦
http://video.sina.com.cn/p/sports/k/v/doc/2017-11-05/170167390621.html
直捣黄龙1v2
http://video.sina.com.cn/p/sports/k/v/doc/2017-11-05/183267390917.html
5佳球:库里式虚晃
http://video.sina.com.cn/p/sports/k/v/doc/2017-11-05/150067390331.html
大嫂徐冬冬亮相
http://video.sina.com.cn/p/sports/k/v/doc/2017-11-05/145367390313.html
现场众多美女云集
http://video.sina.com.cn/p/sports/c/zj/v/doc/2017-11-05/150867390337.html
啦啦队热舞表演
http://sports.sina.com.cn/nba/
哈登56分周琦暴扣火箭胜
http://sports.sina.com.cn/basketball/nba/2017-11-06/doc-ifynmzrs7300047.shtml
詹皇26分骑士负
  • The scraped region of the page was shown in a screenshot (image not available).
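
The links printed above all happen to be absolute URLs, but attr("href") returns the attribute exactly as it is written in the page, so other pages may yield relative links. The sketch below shows one way to resolve such values against the page URL using only java.net.URI. LinkResolver and resolve are hypothetical names introduced for this illustration. jsoup's own absUrl method covers the same need when the document is parsed with a base URI.

```java
import java.net.URI;

/** Hypothetical helper for illustration: turns href values into absolute URLs. */
final class LinkResolver {

    /** Resolves href against the page URL; returns null when href is not a valid URI. */
    static String resolve(String pageUrl, String href) {
        try {
            // URI.resolve handles absolute, root-relative, relative and scheme-relative references.
            return URI.create(pageUrl).resolve(href.trim()).toString();
        } catch (IllegalArgumentException e) {
            // Thrown for strings that are not valid URIs, such as text containing spaces.
            return null;
        }
    }

    public static void main(String[] args) {
        String page = "http://sports.sina.com.cn/nba/";
        System.out.println(resolve(page, "/basketball/nba/"));
        System.out.println(resolve(page, "doc-1.shtml"));
        System.out.println(resolve(page, "//video.sina.com.cn/p/"));
    }
}
```

Returning null for unparseable values is a choice made for this sketch; a crawler could equally skip or log such links.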

Writing a utility class

Wrapping HttpClient and Jsoup in a utility class gives the following:


import org.apache.http.HttpEntity;
import org.apache.http.NameValuePair;
import org.apache.http.client.CookieStore;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.cookie.Cookie;
import org.apache.http.entity.ContentType;
import org.apache.http.entity.StringEntity;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.ssl.SSLContextBuilder;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import javax.net.ssl.SSLContext;
import java.security.GeneralSecurityException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * HTTP utility, including:
 * a general HTTP request tool (sends http and https requests using HttpClient)
 *
 * Created by xuyh at 2017/7/17 19:08.
 */
public class HttpUtils {
    /** Request timeout, 20000 ms by default */
    private int timeout = 20000;
    /** Cookie table */
    private Map<String, String> cookieMap = new HashMap<>();
    /** Charset used to decode response bodies, UTF-8 by default */
    private String charset = "UTF-8";
    private static HttpUtils httpUtils;

    private HttpUtils() {
    }

    /** Get the instance */
    public static HttpUtils getInstance() {
        if (httpUtils == null)
            httpUtils = new HttpUtils();
        return httpUtils;
    }

    /** Clear cookieMap */
    public void invalidCookieMap() {
        cookieMap.clear();
    }

    public int getTimeout() {
        return timeout;
    }

    /** Set the request timeout */
    public void setTimeout(int timeout) {
        this.timeout = timeout;
    }

    public String getCharset() {
        return charset;
    }

    /** Set the charset used for requests */
    public void setCharset(String charset) {
        this.charset = charset;
    }

    /** Parse a page into a Document */
    public static Document parseHtmlToDoc(String html) throws Exception {
        return removeHtmlSpace(html);
    }

    private static Document removeHtmlSpace(String str) {
        Document doc = Jsoup.parse(str);
        // Strip non-breaking-space entities before re-parsing
        String result = doc.html().replace("&nbsp;", "");
        return Jsoup.parse(result);
    }

    /** Execute a GET request and return a Document */
    public Document executeGetAsDocument(String url) throws Exception {
        return parseHtmlToDoc(executeGet(url));
    }

    /** Execute a GET request */
    public String executeGet(String url) throws Exception {
        HttpGet httpGet = new HttpGet(url);
        httpGet.setHeader("Cookie", convertCookieMapToString(cookieMap));
        httpGet.setConfig(RequestConfig.custom().setSocketTimeout(timeout).setConnectTimeout(timeout).build());
        String str = "";
        HttpClientContext context = HttpClientContext.create();
        // try-with-resources closes the response and the client in every case
        try (CloseableHttpClient httpClient = HttpClientBuilder.create().build();
             CloseableHttpResponse response = httpClient.execute(httpGet, context)) {
            getCookiesFromCookieStore(context.getCookieStore(), cookieMap);
            int state = response.getStatusLine().getStatusCode();
            HttpEntity entity = response.getEntity();
            // A 404 yields an empty string instead of the error page
            if (state != 404 && entity != null) {
                str = EntityUtils.toString(entity, charset);
            }
        }
        return str;
    }

    /** Execute a GET request over https and return a Document */
    public Document executeGetWithSSLAsDocument(String url) throws Exception {
        return parseHtmlToDoc(executeGetWithSSL(url));
    }

    /** Execute a GET request over https */
    public String executeGetWithSSL(String url) throws Exception {
        HttpGet httpGet = new HttpGet(url);
        httpGet.setHeader("Cookie", convertCookieMapToString(cookieMap));
        httpGet.setConfig(RequestConfig.custom().setSocketTimeout(timeout).setConnectTimeout(timeout).build());
        String str = "";
        HttpClientContext context = HttpClientContext.create();
        try (CloseableHttpClient httpClient = createSSLInsecureClient();
             CloseableHttpResponse response = httpClient.execute(httpGet, context)) {
            getCookiesFromCookieStore(context.getCookieStore(), cookieMap);
            int state = response.getStatusLine().getStatusCode();
            HttpEntity entity = response.getEntity();
            if (state != 404 && entity != null) {
                str = EntityUtils.toString(entity, charset);
            }
        }
        return str;
    }

    /** Execute a POST request and return a Document */
    public Document executePostAsDocument(String url, Map<String, String> params) throws Exception {
        return parseHtmlToDoc(executePost(url, params));
    }

    /** Execute a POST request */
    public String executePost(String url, Map<String, String> params) throws Exception {
        String reStr = "";
        HttpPost httpPost = new HttpPost(url);
        httpPost.setConfig(RequestConfig.custom().setSocketTimeout(timeout).setConnectTimeout(timeout).build());
        httpPost.setHeader("Cookie", convertCookieMapToString(cookieMap));
        List<NameValuePair> paramsRe = new ArrayList<>();
        for (String key : params.keySet()) {
            paramsRe.add(new BasicNameValuePair(key, params.get(key)));
        }
        httpPost.setEntity(new UrlEncodedFormEntity(paramsRe));
        HttpClientContext context = HttpClientContext.create();
        try (CloseableHttpClient httpClient = HttpClientBuilder.create().build();
             CloseableHttpResponse response = httpClient.execute(httpPost, context)) {
            getCookiesFromCookieStore(context.getCookieStore(), cookieMap);
            HttpEntity entity = response.getEntity();
            if (entity != null) {
                reStr = EntityUtils.toString(entity, charset);
            }
        }
        return reStr;
    }

    /** Execute a POST request over https and return a Document */
    public Document executePostWithSSLAsDocument(String url, Map<String, String> params) throws Exception {
        return parseHtmlToDoc(executePostWithSSL(url, params));
    }

    /** Execute a POST request over https */
    public String executePostWithSSL(String url, Map<String, String> params) throws Exception {
        String re = "";
        HttpPost post = new HttpPost(url);
        List<NameValuePair> paramsRe = new ArrayList<>();
        for (String key : params.keySet()) {
            paramsRe.add(new BasicNameValuePair(key, params.get(key)));
        }
        post.setHeader("Cookie", convertCookieMapToString(cookieMap));
        post.setConfig(RequestConfig.custom().setSocketTimeout(timeout).setConnectTimeout(timeout).build());
        post.setEntity(new UrlEncodedFormEntity(paramsRe));
        HttpClientContext contextRe = HttpClientContext.create();
        try (CloseableHttpClient httpClientRe = createSSLInsecureClient();
             CloseableHttpResponse response = httpClientRe.execute(post, contextRe)) {
            HttpEntity entity = response.getEntity();
            if (entity != null) {
                re = EntityUtils.toString(entity, charset);
            }
            getCookiesFromCookieStore(contextRe.getCookieStore(), cookieMap);
        }
        return re;
    }

    /**
     * Send a POST request with a JSON body
     *
     * @param url      address
     * @param jsonBody json body
     */
    public String executePostWithJson(String url, String jsonBody) throws Exception {
        String reStr = "";
        HttpPost httpPost = new HttpPost(url);
        httpPost.setConfig(RequestConfig.custom().setSocketTimeout(timeout).setConnectTimeout(timeout).build());
        httpPost.setHeader("Cookie", convertCookieMapToString(cookieMap));
        httpPost.setEntity(new StringEntity(jsonBody, ContentType.APPLICATION_JSON));
        HttpClientContext context = HttpClientContext.create();
        try (CloseableHttpClient httpClient = HttpClientBuilder.create().build();
             CloseableHttpResponse response = httpClient.execute(httpPost, context)) {
            getCookiesFromCookieStore(context.getCookieStore(), cookieMap);
            HttpEntity entity = response.getEntity();
            if (entity != null) {
                reStr = EntityUtils.toString(entity, charset);
            }
        }
        return reStr;
    }

    /**
     * Send a POST request with a JSON body over https
     *
     * @param url      address
     * @param jsonBody json body
     */
    public String executePostWithJsonAndSSL(String url, String jsonBody) throws Exception {
        String re = "";
        HttpPost post = new HttpPost(url);
        post.setHeader("Cookie", convertCookieMapToString(cookieMap));
        post.setConfig(RequestConfig.custom().setSocketTimeout(timeout).setConnectTimeout(timeout).build());
        post.setEntity(new StringEntity(jsonBody, ContentType.APPLICATION_JSON));
        HttpClientContext contextRe = HttpClientContext.create();
        try (CloseableHttpClient httpClientRe = createSSLInsecureClient();
             CloseableHttpResponse response = httpClientRe.execute(post, contextRe)) {
            HttpEntity entity = response.getEntity();
            if (entity != null) {
                re = EntityUtils.toString(entity, charset);
            }
            getCookiesFromCookieStore(contextRe.getCookieStore(), cookieMap);
        }
        return re;
    }

    private void getCookiesFromCookieStore(CookieStore cookieStore, Map<String, String> cookieMap) {
        List<Cookie> cookies = cookieStore.getCookies();
        for (Cookie cookie : cookies) {
            cookieMap.put(cookie.getName(), cookie.getValue());
        }
    }

    private String convertCookieMapToString(Map<String, String> map) {
        String cookie = "";
        for (String key : map.keySet()) {
            cookie += (key + "=" + map.get(key) + "; ");
        }
        if (map.size() > 0) {
            cookie = cookie.substring(0, cookie.length() - 2);
        }
        return cookie;
    }

    /**
     * Create an SSL client.
     * It trusts every certificate and skips hostname verification, so it is insecure.
     */
    private static CloseableHttpClient createSSLInsecureClient() throws GeneralSecurityException {
        SSLContext sslContext = new SSLContextBuilder().loadTrustMaterial(null, (chain, authType) -> true).build();
        SSLConnectionSocketFactory sslConnectionSocketFactory = new SSLConnectionSocketFactory(sslContext, (s, sslSession) -> true);
        return HttpClients.custom().setSSLSocketFactory(sslConnectionSocketFactory).build();
    }
}

The utility class above can be used not only to fetch web page content but also to send HTTP requests in general.
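
Two string-building steps in the utility class are easy to try in isolation: joining the cookie table into a single Cookie header value, and encoding POST parameters as a form body. The sketch below reproduces both with the JDK alone. RequestStrings, cookieHeader, and formBody are hypothetical names introduced here, and UTF-8 encoding of the form fields is an assumption of this sketch rather than something taken from the class.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical helper for illustration; not part of HttpClient. */
final class RequestStrings {

    /** Joins cookies as "name=value; name2=value2", the layout of a Cookie request header. */
    static String cookieHeader(Map<String, String> cookies) {
        return cookies.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("; "));
    }

    /** Encodes parameters in application/x-www-form-urlencoded form (UTF-8 assumed). */
    static String formBody(Map<String, String> params) {
        return params.entrySet().stream()
                .map(e -> encode(e.getKey()) + "=" + encode(e.getValue()))
                .collect(Collectors.joining("&"));
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("sid", "abc");
        cookies.put("lang", "zh");
        System.out.println(cookieHeader(cookies));

        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "a b");
        params.put("name", "x&y");
        System.out.println(formBody(params));
    }
}
```

Collectors.joining avoids the trailing-separator trimming that manual concatenation needs, and a LinkedHashMap keeps the entries in insertion order.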

Source code

https://github.com/johnsonmoon/HttpUtils.git
