Step 1:

Install the pdfkit package: https://blog.csdn.net/qq_35865125/article/details/109920565
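
pdfkit is only a Python wrapper: it drives the separate wkhtmltopdf program, which has to be installed as well. A minimal sketch for checking this before running the later steps, assuming the executable is named `wkhtmltopdf` and is expected on PATH (the helper name `find_wkhtmltopdf` is made up for illustration):

```python
import shutil

def find_wkhtmltopdf():
    """Return the full path of the wkhtmltopdf executable, or None if it is not on PATH."""
    return shutil.which("wkhtmltopdf")

if __name__ == "__main__":
    path = find_wkhtmltopdf()
    if path is None:
        print("wkhtmltopdf not found; install it or give pdfkit its path explicitly")
    else:
        print("wkhtmltopdf found at", path)
```

If the program lives outside PATH, pdfkit accepts an explicit location: build `pdfkit.configuration(wkhtmltopdf=path)` and pass it as the `configuration=` argument of `pdfkit.from_string`.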

Step 2:

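Download a single article and convert it to PDF. First, fetch the full content of the page from the article's URL (using the urllib, bs4, and re modules). Then extract the article body from it, because the page also contains the comment section and much else.

Finally, convert the body to PDF.

Example (tested, runs):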

import pdfkit
import os
import urllib.request
import re
from bs4 import BeautifulSoup

def get_html(url):
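    '''
    Return the decoded page source for the given url.
    :param url: address of the page to fetch
    :return: the page source as a str
    '''
    req = urllib.request.Request(url)
    resp = urllib.request.urlopen(req)  # this call failed for the author while a VPN was on; turn the VPN off
    html_page = resp.read().decode('utf-8')
    return html_page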


def get_body_for_pdf(url):
    """
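    Return the HTML of the article body found at url.
    :param url: address of the article page
    :return: the body element serialized as a str
    """
    html_page = get_html(url)
    soup = BeautifulSoup(html_page,'html.parser')   # parse the HTML document
    # Pick out the element holding the article body: its id is "cnblogs_post_body" on cnblogs (博客园) and "article_content" on CSDN
    #div = soup.find(id = "cnblogs_post_body") # for cnblogs
    div = soup.find(id="article_content")      # for CSDN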
    return str(div)


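def save_single_file_to_PDF(url):
    title = "HackedTitle"   # placeholder title; it becomes the output file name
    body = get_body_for_pdf(url)

    options = {
        'page-size':'Letter',
        'encoding':"UTF-8",
        'custom-header':[('Accept-Encoding','gzip')]
    }
    try:
        filename = title + '.pdf'
        pdfkit.from_string(body, filename, options=options)  # written to the current working directory; any other path works too
        print(filename + " has been saved.")     # report that the article was saved
    except Exception as err:
        print("Failed to save " + filename + ": " + str(err))  # report the error instead of silently ignoring it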

if __name__ == '__main__':
    save_single_file_to_PDF('https://blog.csdn.net/qq_35865125/article/details/109837687')
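
The get_html helper above sends urllib's default headers, sets no timeout, and assumes the page is UTF-8. A sketch of a more defensive variant, assuming a generic browser-like User-Agent string and a 10-second timeout (both arbitrary choices, as are the helper names `build_request`, `decode_body`, and `get_html_robust`):

```python
import urllib.request

def build_request(url, user_agent="Mozilla/5.0"):
    """Build a request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

def decode_body(raw, charset=None):
    """Decode raw bytes, falling back to UTF-8 and replacing undecodable bytes."""
    return raw.decode(charset or "utf-8", errors="replace")

def get_html_robust(url, timeout=10):
    """Fetch url and return its decoded source (hypothetical variant of get_html)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        charset = resp.headers.get_content_charset()  # None if the server sent no charset
        return decode_body(resp.read(), charset)
```

Whether a given site needs the extra header is not something the original example establishes; the variant only removes two assumptions (encoding and unlimited waiting) from the fetch step.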

Step 3:

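Crawl all articles automatically.

Open the blogger's article-list page. Its source contains the title and URL of every article. Extract both with regular expressions, then call the Step 2 function on each article in turn.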
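
The plan in this step can be sketched as a small driver loop. In this sketch the three stages are passed in as callables, so the names `crawl_all`, `get_urls`, `get_title`, and `save_pdf` and their signatures are assumptions made for illustration (in particular, `save_pdf(url, title)` takes a title, unlike the Step 2 function):

```python
def crawl_all(list_url, pages, get_urls, get_title, save_pdf):
    """Collect article URLs from the list pages and convert each article once.

    get_urls(list_url, pages) -> iterable of article URLs
    get_title(article_url)    -> title string
    save_pdf(article_url, title) writes the PDF
    Returns the (url, title) pairs that were processed, in order.
    """
    done = []
    seen = set()
    for article_url in get_urls(list_url, pages):
        if article_url in seen:      # skip links that appear more than once
            continue
        seen.add(article_url)
        title = get_title(article_url)
        save_pdf(article_url, title)
        done.append((article_url, title))
    return done
```

Passing the stages in as parameters keeps the loop independent of any one blog platform: only the URL extraction and the title extraction depend on the site's markup.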

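In Firefox, view a page's HTML source by right-clicking and choosing View Page Source. The HTML of one CSDN blogger's article-list page (https://blog.csdn.net/qq_35865125/article/list/1) shows each article's title next to its URL.

Different blog platforms use different markup; cnblogs (博客园), for example, has its own format.

Also, if the blogger has many articles, the list spans several pages, and the last number in each page's URL is usually its index. For this CSDN blogger the list pages are https://blog.csdn.net/qq_35865125/article/list/1, https://blog.csdn.net/qq_35865125/article/list/2, and so on.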
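
The numbering scheme can be turned into a list of page URLs directly. A minimal sketch, assuming the number of list pages is known in advance and supplied by the caller (the helper name `list_page_urls` is made up for illustration):

```python
def list_page_urls(base_url, pages):
    """Return the URLs of list pages 1..pages by appending the page index to base_url."""
    return [base_url + str(i) for i in range(1, pages + 1)]

for page_url in list_page_urls("https://blog.csdn.net/qq_35865125/article/list/", 2):
    print(page_url)
```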

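Example: collect the titles and URLs of all articles on https://blog.csdn.net/qq_35865125/article/list/1 (CSDN):

In the page source, the article 《C++ STD标准模板库的泛型思想》 has the URL https://.../109889333.

Searching for <a href="https://blog.csdn.net/qq_35865125/article/details/109889333" finds two matches in the page. A regular expression can match this form, after which the duplicates have to be removed.

Searching for <a href="https://blog.csdn.net/qq_35865125/article/details/109889333" data-report-click finds only one match, so this form can be used directly in the regular expression:

Code (verified):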

import pdfkit
import os
import urllib.request
import re
from bs4 import BeautifulSoup

def get_urls(url, pages):
    total_urls = []

    for i in range(1, pages+1):      # visit every list page; the page index is appended to the base url

        url_temp = url + str(i)
        htmlContent = get_html(url_temp)   # page source via get_html from Step 2; requests_html is an alternative: https://blog.csdn.net/weixin_43790560/article/details/86617630

        # Ref:  https://blog.csdn.net/weixin_42793426/article/details/88545939, Python regular expressions: https://blog.csdn.net/qq_41800366/article/details/86527810
        # https://www.cnblogs.com/wuxunyan/p/10615260.html
        # Target: <a href="https://blog.csdn.net/qq_35865125/article/details/109920565"  data-report-click=
        # Dots are escaped, [0-9]+ requires at least one digit, \s+ tolerates any spacing before the attribute
        net_pattern = re.compile(r'<a href="https://blog\.csdn\.net/qq_35865125/article/details/[0-9]+"\s+data-report-click=')

        url_withExtra = re.findall(net_pattern, htmlContent)     # every article link on this list page

        # To drop duplicates (note that a set loses the page order):
        #url_withExtraNoDupli = set(url_withExtra)

        for _url in url_withExtra:
            stIdx = _url.find("https://")
            endIdx = _url.find('"', stIdx)           # the closing quote of the href value
            _url_sub = _url[stIdx:endIdx]
            total_urls.append(_url_sub)            # collect the urls from all pages in one list
            print(_url_sub)
    return total_urls

if __name__ == '__main__':
    #save_single_file_to_PDF('https://blog.csdn.net/qq_35865125/article/details/109837687')
    get_urls('https://blog.csdn.net/qq_35865125/article/list/', 1)
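
The other option mentioned earlier (match the bare link, then drop the duplicates) can be sketched with a capture group and an order-preserving de-duplication. The `sample` string is a made-up fragment that only imitates the list page, not real CSDN markup, and the names `ARTICLE_LINK` and `extract_article_urls` are hypothetical:

```python
import re

# The capture group keeps only the URL, so no slicing is needed afterwards.
ARTICLE_LINK = re.compile(
    r'<a href="(https://blog\.csdn\.net/qq_35865125/article/details/[0-9]+)"'
)

def extract_article_urls(html):
    """Return article URLs in first-seen order, without duplicates."""
    return list(dict.fromkeys(ARTICLE_LINK.findall(html)))

sample = (
    '<a href="https://blog.csdn.net/qq_35865125/article/details/109889333" target="_blank">x</a>'
    '<a href="https://blog.csdn.net/qq_35865125/article/details/109889333"  data-report-click="{}">y</a>'
    '<a href="https://blog.csdn.net/qq_35865125/article/details/109920565"  data-report-click="{}">z</a>'
)
print(extract_article_urls(sample))
```

`dict.fromkeys` is used instead of `set` because it keeps the order in which the articles appear on the page.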

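Extracting the article titles from the list page https://blog.csdn.net/qq_35865125/article/list/1 is awkward (it needs a closer look at Python regular expressions), so as a workaround the title is taken from each article's own page.

For example, the HTML of https://blog.csdn.net/qq_35865125/article/details/109889333 contains a title tag, which a simple regular expression can extract directly.

Code: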

# Given the link of one article, extract its title
def get_title_of_one_artical(url):
    htmlContent = get_html(url)
    # (.*?) captures the shortest text between the tags; re.S lets . match newlines too
    title_pattern = re.compile(r'<title>(.*?)</title>', re.S)
    match = title_pattern.search(htmlContent)
    if match is None:
        return 'NotFoundName'
    title = match.group(1).strip()
    return title
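
A page title may contain characters such as / or : that are not valid in file names, so it helps to clean the title before using it as the PDF name. A minimal sketch, assuming an 80-character limit and an "untitled" fallback (both arbitrary choices; the helper name `title_to_filename` is made up for illustration):

```python
import re

def title_to_filename(title, max_len=80):
    """Turn an article title into a .pdf file name that is safe on common file systems."""
    # Replace the characters Windows forbids in file names, plus tabs and line breaks.
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', "_", title)
    cleaned = cleaned.strip(" .")        # no leading or trailing spaces or dots
    if not cleaned:
        cleaned = "untitled"
    return cleaned[:max_len] + ".pdf"

print(title_to_filename("C++ STD: a/b?"))
```

The result can then replace the hard-coded title when building the output file name in the Step 2 function.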

Ref:

https://www.cnblogs.com/qsyll0916/p/8677151.html

https://www.cnblogs.com/qsyll0916/p/8678924.html

https://www.cnblogs.com/xingzhui/p/7881905.html

Article URL: https://blog.csdn.net/qq_35865125/article/details/109921762

