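Baidu Wenku Scraping / Baidu Wenku Crawler (Part 1): TXT

Downloading txt documents is the simplest part of this series, so it comes first.

Introduction

This project uses Python to download documents that can be previewed on Baidu Wenku. It supports the following document formats: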

doc/docx
ppt/pptx
xls/xlsx
pdf
txt

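⚠️ All documents downloaded by this project are saved as PDF, except txt.

⚠️ This project is my own original work; please credit the source when reposting.

⚠️ This project was written as coursework; please do not use it commercially.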

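Implementation

Problem analysis

Search Baidu Wenku for any txt document and open it.

Downloading the document requires download vouchers (下载券). In fact, most documents on Baidu Wenku can only be downloaded by paying with vouchers.

Of course, users with a VIP subscription (the big spenders) can download directly.

Capturing the page traffic

Use the network inspection tool built into Chrome to analyze the page (press F12 on the site).

For more on traffic analysis in Chrome, see this article:

"Google Chrome抓包分析详解" (A Detailed Guide to Packet Capture Analysis in Google Chrome)

In the Network panel, check the Preserve log option so that requests are kept, then refresh the page.

Under the JS filter, a request to wkretype.bdimg.com shows up (its full URL is analyzed in the next section).

Opening that URL shows the following content: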

cb([{"parags":[{"c":"\u673a\u5668\u6570\uff1a\u4e00\u4e2a\u6570\u503c\u5728\u8ba1\u7b97\u673a\u4e2d\u4e8c\u8fdb\u5236\u8868\u793a\u5f62\u5f0f\u3002\r\n\r\n\u5fae\u7a0b\u5e8f:\u7531\u5fae\u6307\u4ee4\u5e8f\u5217\u7ec4\u6210\u7684\u7a0b\u5e8f\u3002\r\n\r\n\u52a8\u6001\u5fae\u7a0b\u5e8f\u8bbe\u8ba1\uff1a\u901a\u8fc7\u6539\u53d8\u5fae\u6307\u4ee4\u548c...

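These are clearly Unicode escape sequences. Decoding them with any online converter gives:

机器数：一个数值在计算机中二进制表示形式。\r\n\r\n微程序:由微指令序列组成的程序。\r\n\r\n动态微程序设计：通过改变微指令和

A comparison shows that this is exactly the content of the txt file. In other words, the data at this URL is the document content in JSON form, wrapped in a cb(...) callback.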
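
The online converter is not strictly needed, because a JSON parser decodes these escapes by itself. A minimal sketch, using the first few characters of the response above as the sample string (the variable names are only illustrative):

```python
import json

# A JSON string literal copied from the start of the "c" value above.
# The r prefix keeps the backslashes exactly as they appear in the response.
escaped = r'"\u673a\u5668\u6570\uff1a\u4e00\u4e2a\u6570\u503c"'

# json.loads turns every \uXXXX escape into the character it stands for.
decoded = json.loads(escaped)
print(decoded)
```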

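Constructing the URL and fetching the content

Looking at the URL of the request above, it consists of two main parts, the path and the query string: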

https://wkretype.bdimg.com/retype/text/df3abfc36137ee06eff9183f
md5sum=4b7be23adc08726ded99bc39f6a6d76b&sign=c180630292&callback=cb&pn=1&rn=5&type=txt&rsign=p_5-r_0-s_b2eb2&_=1585882305807
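
To see which values have to be found, the URL can be split with the standard library. This is only a sketch: it assumes the two parts above are joined by a "?" as in any ordinary URL, and the variable names are illustrative.

```python
from urllib.parse import parse_qs, urlsplit

# The two parts shown above, joined by "?" (assumed separator).
url = ("https://wkretype.bdimg.com/retype/text/df3abfc36137ee06eff9183f"
       "?md5sum=4b7be23adc08726ded99bc39f6a6d76b&sign=c180630292"
       "&callback=cb&pn=1&rn=5&type=txt&rsign=p_5-r_0-s_b2eb2&_=1585882305807")

parts = urlsplit(url)
# The last path segment is the document ID.
doc_id = parts.path.rsplit("/", 1)[-1]
# parse_qs returns a list per key; each key occurs once here, so keep item 0.
params = {key: values[0] for key, values in parse_qs(parts.query).items()}

print(doc_id)
for key, value in params.items():
    print(key, "=", value)
```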

So the next step is to find where each value in the URL comes from.

First, searching the page source for df3abfc36137ee06eff9183f finds:

'docId': 'df3abfc36137ee06eff9183f' // document ID
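
The same lookup can be done in code. The sketch below assumes the page source contains single-quoted key/value pairs like the line above; the sample snippet, the title value, and the helper name extract_doc_info are hypothetical and only illustrate the idea.

```python
import re

FIELDS = ("title", "docType", "docId", "totalPageNum")


def extract_doc_info(page_source: str) -> dict:
    """Collect the single-quoted values of the known fields from page source."""
    info = {}
    for field in FIELDS:
        # Matches e.g.  'docId': 'abc'  and captures abc.
        match = re.search(r"'" + field + r"'\s*:\s*'([^']*)'", page_source)
        if match:
            info[field] = match.group(1)
    return info


# Hypothetical fragment in the same shape as the line found in the page.
sample_source = """
    'title': 'sample title',
    'docType': 'txt',
    'docId': 'df3abfc36137ee06eff9183f',
    'totalPageNum': '1',
"""
info = extract_doc_info(sample_source)
print(info)
```

Capturing the value with a regex group avoids running eval on text that came from a web page.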

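Searching the source for md5sum and the other values finds nothing. Values like these are usually generated dynamically, so we go back to the captured traffic.

Under the XHR filter, a request to the getdocinfo API appears.

Opening it indeed shows the values we need, as JSON wrapped in a callback. Formatted with Sublime, the response looks like this: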

/**/
cb({
	"doc_id": "df3abfc36137ee06eff9183f",
	"smallFlowProfessionalDoc": 1,
	"smallFlowContent": {
		"isBaiduWiseGuideSmallFlow": 0,
		"isBaiduWiseGuideValid": 0,
		"isBaiduWiseGuideVipSmallFlow": 0,
		"isBaiduWiseGuideVipValid": 0
	},
	...
	"md5sum": "&md5sum=4b7be23adc08726ded99bc39f6a6d76b&sign=c180630292",
	"bcsParam": false,
	"rsign": "p_5-r_0-s_b2eb2",
	"downloadToken": "ba13d4912584b799e78b8bbbff158de5",
	"matchStatus": 1,
	"seoTitle": "\u8ba1\u7b97\u673a\u7ec4\u6210\u540d\u8bcd\u89e3\u91ca"
})
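
With md5sum and rsign in hand, the text URL from earlier can be put back together. The following is a sketch rather than the definitive recipe: the reply is trimmed to the three fields used, the "?" separator and the rn value of 5 are taken from the URL observed in the browser, and the function names are illustrative.

```python
import json


def parse_jsonp(payload: str) -> dict:
    """Parse a JSONP reply such as '/**/cb({...})' into a dict."""
    return json.loads(payload[payload.find("(") + 1:payload.rfind(")")])


def build_text_url(info: dict, page_count: int, stamp: int) -> str:
    # info["md5sum"] already has the form "&md5sum=...&sign=...", so the
    # leading "&" is dropped and the value goes right after the "?".
    query = (info["md5sum"].lstrip("&")
             + "&callback=cb&pn=1&rn=" + str(page_count)
             + "&type=txt&rsign=" + info["rsign"]
             + "&_=" + str(stamp))
    return ("https://wkretype.bdimg.com/retype/text/"
            + info["doc_id"] + "?" + query)


# The reply above, trimmed to the fields this sketch needs.
reply = ('/**/cb({"doc_id": "df3abfc36137ee06eff9183f", '
         '"md5sum": "&md5sum=4b7be23adc08726ded99bc39f6a6d76b'
         '&sign=c180630292", '
         '"rsign": "p_5-r_0-s_b2eb2"})')

text_url = build_text_url(parse_jsonp(reply), page_count=5,
                          stamp=1585882305807)
print(text_url)
```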

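So we instead construct https://wenku.baidu.com/api/doc/getdocinfo?callback=cb&doc_id=df3abfc36137ee06eff9183f&t=1585882305929&_=1585882305806 to obtain this information.

Apart from the fixed callback, only doc_id, t and _ are needed.

doc_id, as mentioned above, can be read directly from the page source, so the crawler only has to fetch the source and extract it.
t and _ are clearly time-related, and a quick test confirms that they are timestamps. Python provides time.time() for this. Note that it returns a timestamp in seconds, which has to be converted to milliseconds.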
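
Put together, the request URL can be built as follows. This is a sketch: the helper name build_docinfo_url and its now parameter (which only exists to make the result reproducible) are illustrative, and using t + 1 for _ simply mirrors the complete code further down.

```python
import time
from urllib.parse import urlencode

API = "https://wenku.baidu.com/api/doc/getdocinfo"


def build_docinfo_url(doc_id, now=None):
    """Build the getdocinfo URL with millisecond timestamps."""
    seconds = time.time() if now is None else now
    millis = round(seconds * 1000)  # seconds -> milliseconds
    query = urlencode({
        "callback": "cb",
        "doc_id": doc_id,
        "t": millis,
        "_": millis + 1,
    })
    return API + "?" + query


# A fixed time is passed here so the printed URL is always the same.
url = build_docinfo_url("df3abfc36137ee06eff9183f", now=1585882305.5)
print(url)
```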

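Some details

requests is used only for ordinary GET requests.
Before anything else, the page source has to be fetched to obtain the document information, such as:
- title, docType, docId, totalPageNum
After fetching a JSONP response, take the content between ( and ) and convert that JSON string into a Python dict, which is easier to work with.
Write the text content, i.e. the value stored under the key "c", to title.txt in append mode.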
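
The last two steps can be sketched as one small function. The assumptions here: the sample payload is a shortened version of the response shown earlier, the function name append_text is illustrative, and a temporary directory stands in for the real save path. Unlike the complete code below, this sketch writes every entry of "parags", not only the first.

```python
import json
import os
import tempfile


def append_text(jsonp: str, path: str) -> None:
    """Strip the callback wrapper, decode the JSON, append all "c" values."""
    body = jsonp[jsonp.find("(") + 1:jsonp.rfind(")")]
    pages = json.loads(body)
    # Append mode, so repeated calls keep adding to the same file.
    with open(path, "a", encoding="utf-8") as f:
        for page in pages:
            for parag in page["parags"]:
                f.write(parag["c"])


# Shortened sample in the same shape as the text response shown earlier.
sample = r'cb([{"parags":[{"c":"\u673a\u5668\u6570\uff1a"}]}])'

with tempfile.TemporaryDirectory() as folder:
    out_path = os.path.join(folder, "title.txt")
    append_text(sample, out_path)
    append_text(sample, out_path)  # second call appends, it does not overwrite
    with open(out_path, encoding="utf-8") as f:
        saved = f.read()

print(saved)
```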

Complete code

from ast import literal_eval
from requests import get
from requests.exceptions import ReadTimeout
from chardet import detect
from bs4 import BeautifulSoup
from os import getcwd
from re import findall
from json import loads
from time import time


class GetTxt:
    def __init__(self, url, savepath):
        self.url = url
        self.savepath = savepath if savepath != '' else getcwd()
        self.html = ''
        # Basic document info: title, docType, docId, totalPageNum
        self.wkinfo = {}
        self.txturls = []

        self.getHtml()
        self.getWkInfo()

    # Fetch the page source
    def getHtml(self):
        try:
            header = {'User-Agent': 'Mozilla/5.0 '
                                    '(Macintosh; Intel Mac OS X 10_14_6) '
                                    'AppleWebKit/537.36 (KHTML, like Gecko) '
                                    'Chrome/78.0.3904.108 Safari/537.36'}
            # ReadTimeout is only raised when a timeout is set
            response = get(self.url, headers=header, timeout=10)
            self.transformEncoding(response)
            self.html = BeautifulSoup(response.text, 'html.parser')  # parse

        except ReadTimeout as e:
            print(e)

    # Set the response encoding to the one detected from its content
    def transformEncoding(self, html):
        html.encoding = detect(html.content).get("encoding")

    # Extract basic document info: title, type, document ID, page count
    def getWkInfo(self):
        items = ["'title'", "'docType'", "'docId'", "'totalPageNum'"]
        for item in items:
            ls = findall(item + ".*'", str(self.html))
            if len(ls) != 0:
                # Split on the first ':' only, so a ':' in the value is kept
                key, value = ls[0].split(':', 1)
                # literal_eval parses quoted literals without running code
                self.wkinfo[literal_eval(key.strip())] = \
                    literal_eval(value.strip())

    # Fetch the JSON string
    def getJson(self, url):
        """
        :param url: URL that returns the JSONP data
        :return: the JSON string inside the callback wrapper
        """
        response = get(url)
        # Keep only the text between the first '(' and the last ')'
        jsonstr = response.text[response.text.find('(') + 1:
                                response.text.rfind(')')]
        return jsonstr

    # Convert the JSON string to the matching Python object
    def convertJsonToDict(self, jsonstr):
        """
        :param jsonstr: JSON string
        :return: the Python object decoded from the JSON string
        """
        textdict = loads(jsonstr)
        return textdict

    # Build the URL of the JSON data that contains the txt content
    def getTxtUrlForTXT(self):
        timestamp = round(time() * 1000)  # millisecond timestamp
        # Build the getdocinfo request URL, whose reply holds the
        # parameters needed for the text URL
        messageurlprefix = "https://wenku.baidu.com/api/doc/getdocinfo?" \
                           "callback=cb&doc_id="
        messageurlsuffix = self.wkinfo.get("docId") + "&t=" + \
                           str(timestamp) + "&_=" + str(timestamp + 1)

        textdict = self.convertJsonToDict(
            self.getJson(messageurlprefix + messageurlsuffix))

        # Build the text URL. md5sum already looks like
        # "&md5sum=...&sign=...", so its leading "&" is replaced by the "?"
        # that starts the query string
        self.txturls.append("https://wkretype.bdimg.com/retype/text/" +
                            self.wkinfo.get('docId') + "?" +
                            textdict.get('md5sum').lstrip('&') +
                            "&callback=cb&pn=1&rn=" +
                            str(textdict.get("docInfo").get("totalPageNum")) +
                            "&rsign=" + textdict.get("rsign") + "&_=" +
                            str(timestamp))

    # Append the text content to the output file
    def saveToTxt(self, content):
        savepath = self.savepath + '/' + self.wkinfo.get('title') + '.txt'
        # Explicit UTF-8 so Chinese text is written correctly on any platform
        with open(savepath, "a", encoding="utf-8") as f:
            f.write(content)

    def getTXT(self):
        self.getTxtUrlForTXT()
        for url in self.txturls:
            textls = self.convertJsonToDict(self.getJson(url))
            for text in textls:
                content = text.get("parags")[0].get("c")
                self.saveToTxt(content)


if __name__ == '__main__':
    # If the save path is empty, the file is created in the current directory
    GetTxt('https://wenku.baidu.com/view/df3abfc36137ee06eff9183f.html?from=search', '存储').getTXT()

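Testing

Running the program against the page above produces the correct txt file.

With the analysis above, scraping txt documents from Baidu Wenku is complete.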