Scraping a Zongheng novel with Python 3 and writing it to a text file

Libraries used in this article:
requests
BeautifulSoup

Some methods of the requests library:

Scraping a web page comes down to the following key steps:

For a GET request, fetch the page with requests.get:

response = requests.get(book_url, headers=header)

soup = BeautifulSoup(response.text, 'lxml')  # Parse the page with BeautifulSoup; the result is the complete HTML document

content = soup.select('#readerFt > div > div.content > p')  # Use soup.select to locate the body text by tag
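
The three steps above can be put together in one small sketch. The helper names `fetch_html` and `extract_paragraphs`, the 10-second timeout, the shortened User-Agent, and the sample HTML are illustrative assumptions rather than part of the original script. The sketch uses Python's built-in `html.parser` so it runs without lxml installed.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical, shortened browser-like header used only for this illustration.
HEADERS = {"User-Agent": "Mozilla/5.0"}


def fetch_html(url, timeout=10):
    """Step 1: request the page and return its HTML text."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on a 4xx/5xx status
    return response.text


def extract_paragraphs(html_text, selector="div.content > p"):
    """Steps 2 and 3: parse the HTML, then select the body paragraphs."""
    soup = BeautifulSoup(html_text, "html.parser")
    return soup.select(selector)  # a list of bs4.element.Tag objects


# Made-up HTML standing in for response.text, so no network access is needed.
sample = '<div class="content"><p>First paragraph.</p><p>Second paragraph.</p></div>'
paragraphs = extract_paragraphs(sample)
print(len(paragraphs))
print(paragraphs[0])
```

Keeping the request separate from the parsing makes the selector easy to try out on a saved copy of the page before touching the live site.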

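When selecting by child tags, try not to use the full selector copied from the browser.

For example, on the chapter page the body text sits in the individual <p></p> tags under the element with class=content.

E.g., the selector copied for the second <p></p> tag looks like this: #readerFt > div > div.content > p:nth-child(2). Since we are scraping the whole chapter rather than a single paragraph, drop the :nth-child(2) suffix after p, which leaves #readerFt > div > div.content > p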
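
The difference between the two selectors can be checked on a small, made-up HTML fragment that mirrors the structure described above. The fragment and the variable names are assumptions for illustration, and the real page may contain more nesting.

```python
from bs4 import BeautifulSoup

# Made-up fragment that imitates the #readerFt > div > div.content layout.
html_text = (
    '<div id="readerFt"><div><div class="content">'
    "<p>One</p><p>Two</p><p>Three</p>"
    "</div></div></div>"
)
soup = BeautifulSoup(html_text, "html.parser")

# Selector as copied from the browser: matches only the second paragraph.
only_second = soup.select("#readerFt > div > div.content > p:nth-child(2)")

# Same selector without :nth-child(2): matches every paragraph.
all_paragraphs = soup.select("#readerFt > div > div.content > p")

# A shorter selector that does not depend on the full ancestor chain.
shorter = soup.select("div.content > p")

print([p.get_text() for p in only_second])
print([p.get_text() for p in all_paragraphs])
print([p.get_text() for p in shorter])
```

A shorter selector such as `div.content > p` is less likely to break if the site changes the wrapper elements around the content block.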

The complete code:

# -*- coding: utf-8 -*-
import re
import requests
from bs4 import BeautifulSoup
from requests.exceptions import RequestException

def get_page(book_url):
    '''
        Request the page and parse it with BeautifulSoup.
        A failed request or an HTTP error status (4xx/5xx) raises a RequestException.
    '''
    # Build a header that imitates a browser; some sites restrict access and will not return normal data without one
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'}
    try:
        response = requests.get(book_url, headers=header, timeout=10)
        response.raise_for_status()  # HTTPError is a subclass of RequestException
    except RequestException:
        print('Request failed!')
        raise
    soup = BeautifulSoup(response.text, 'lxml')  # Parse the page with BeautifulSoup; the result is the complete HTML document
    print(type(soup))  # <class 'bs4.BeautifulSoup'>
    return soup

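def download():
    html = get_page(book_url)
    content = html.select('#readerFt > div > div.content > p')  # Use soup.select to locate the body text by tag
    # print(content)  # prints a list
    # Write as UTF-8 explicitly so Chinese text does not depend on the platform default encoding
    with open('E:\\pyProject\\test1\\content.txt', 'w', encoding='utf-8') as f:
        for i in content:
            i = str(i)  # convert <class 'bs4.element.Tag'> to str
            f.write(i + '\n')  # write each paragraph on its own line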

'''
To strip the <p></p> tags, use the approach below: a regular expression that keeps only the text inside the <p></p> tags
'''
def download1():
    html = get_page(book_url)
    content_html = html.select('#readerFt > div > div.content')
    # print(content_html)
    content = re.findall(r'<p>(.*?)</p>', str(content_html), re.S)  # capture the text inside each <p></p> tag
    # print(content)
    with open('E:\\pyProject\\test1\\content.txt', 'w', encoding='utf-8') as f:
        for n in content:
            f.write(str(n) + '\n')

if __name__=='__main__':
    book_url = 'http://book.zongheng.com/chapter/681832/37860473.html'
    download()
    # download1()
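
As an alternative to the regular expression in download1(), BeautifulSoup's `get_text()` returns a tag's text without the markup, and it also works for `<p>` tags that carry attributes, which the pattern `<p>(.*?)</p>` would not match. The following is a minimal sketch, not the original author's method: the function name `save_paragraphs`, the sample HTML, and the temporary output path are assumptions for illustration.

```python
import os
import tempfile

from bs4 import BeautifulSoup


def save_paragraphs(html_text, path):
    """Write the text of each <p> under div.content to path, one per line."""
    soup = BeautifulSoup(html_text, "html.parser")
    paragraphs = soup.select("div.content > p")
    # Explicit UTF-8 so Chinese text does not depend on the platform default.
    with open(path, "w", encoding="utf-8") as f:
        for p in paragraphs:
            f.write(p.get_text(strip=True) + "\n")
    return len(paragraphs)


# Made-up HTML standing in for a downloaded chapter page.
sample = '<div class="content"><p>  第一段。</p><p class="x">Second paragraph.</p></div>'
out_path = os.path.join(tempfile.mkdtemp(), "content.txt")
count = save_paragraphs(sample, out_path)

with open(out_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(count, lines)
```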

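Calling download() writes each paragraph to the txt file with its <p></p> tags intact.

Calling download1() writes only the text inside the <p></p> tags.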

That completes a simple novel-scraping script. Hooray!

Article URL: https://blog.csdn.net/dhr201499/article/details/107317802
