Scraping PDFs and Documents from Web Pages with Python

1 Scraping PDFs from a web page

This section uses downloading the PDFs at https://reader.jojokanbao.cn/rmrb as the example.

1.1 Entering a date in the calendar widget

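Reference blog post: "selenium+Python(Js处理日历控件)" (handling a calendar widget with JS)
On this page, the date can be typed directly into the date box.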

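Find the tag that corresponds to the input box, then clear it and type into it, locating it by class name (if the tag has an id attribute, the input box can be located by id instead). The code is as follows: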

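from selenium.webdriver.common.by import By

# find_element(By.CLASS_NAME, ...) replaces find_element_by_class_name, which was removed in Selenium 4.3
browser.find_element(By.CLASS_NAME, 'el-input__inner').clear()
browser.find_element(By.CLASS_NAME, 'el-input__inner').send_keys('1976-10-09')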

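After the date is typed, Enter must be pressed before the page refreshes. For simulating keyboard events with selenium, see the blog post: "selenium-模拟键盘事件(回车、删除、刷新等)" (simulating keyboard events: Enter, Delete, refresh and so on)
The implementation is as follows: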

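from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# Press Enter in the date box so the page reloads for the chosen date
browser.find_element(By.CLASS_NAME, 'el-input__inner').send_keys(Keys.ENTER)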
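To fetch more than one issue, the same input step can be repeated with a different date string each time. The following is a minimal sketch, assuming the widget keeps accepting the YYYY-MM-DD format used above; the helper name date_strings and the sample range are hypothetical and not part of the original post.

```python
from datetime import date, timedelta


def date_strings(start, end):
    """Yield every date from start to end (inclusive) as 'YYYY-MM-DD'."""
    current = start
    while current <= end:
        yield current.isoformat()
        current += timedelta(days=1)


# Hypothetical range: three consecutive issues
dates = list(date_strings(date(1976, 10, 9), date(1976, 10, 11)))
print(dates)
```

Each string can then be passed to send_keys in place of the hard-coded date.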

1.2 Downloading the PDF file

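For downloading PDF files with selenium, see the blog post: "python selenium 下载pdf文件" (downloading PDF files with python selenium)
The plain browser = webdriver.Chrome() has to be replaced with the following code: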

import os
from selenium import webdriver

# Directory where the PDF files are saved
down_load_dir = os.path.abspath(".")
options = webdriver.ChromeOptions()
options.add_experimental_option("excludeSwitches", ['enable-automation'])
prefs = {
    "download.default_directory": down_load_dir,
    "download.prompt_for_download": False,
    "download.directory_upgrade": True,
    # Download PDFs instead of opening them in Chrome's built-in viewer
    "plugins.always_open_pdf_externally": True,
}
options.add_experimental_option('prefs', prefs)
options.add_argument("--disable-blink-features=AutomationControlled")
browser = webdriver.Chrome(options=options)
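A fixed sleep after starting a download may be too short for large files. As a hedged alternative, the sketch below polls the download directory instead. It assumes Chrome keeps unfinished downloads under a ".crdownload" suffix and that the directory holds no older PDF files; the helper name wait_for_download is hypothetical.

```python
import os
import time


def wait_for_download(directory, timeout=60.0, poll=0.5):
    """Return True once the directory holds a .pdf file and no
    '.crdownload' partial files; return False if the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        names = os.listdir(directory)
        in_progress = any(n.endswith(".crdownload") for n in names)
        finished = any(n.lower().endswith(".pdf") for n in names)
        if finished and not in_progress:
            return True
        time.sleep(poll)
    return False
```

It would be called right after the browser navigates to the PDF link, for example wait_for_download(down_load_dir).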

1.3 selenium blocked by the site's anti-scraping restrictions

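Reference blog post: "python之selenium访问网站被反爬限制封锁解决方法" (working around anti-scraping blocks when visiting a site with selenium)
Add the following code: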

# Make navigator.webdriver return undefined before any page script runs
browser.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
    "source": """Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"""
})

1.4 Complete code

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time
from bs4 import BeautifulSoup
import os

url = 'https://reader.jojokanbao.cn/rmrb'

# Directory where the PDF files are saved
down_load_dir = os.path.abspath(".")
options = webdriver.ChromeOptions()
options.add_experimental_option("excludeSwitches", ['enable-automation'])
prefs = {
    "download.default_directory": down_load_dir,
    "download.prompt_for_download": False,
    "download.directory_upgrade": True,
    "plugins.always_open_pdf_externally": True,
}
options.add_experimental_option('prefs', prefs)
options.add_argument("--disable-blink-features=AutomationControlled")
browser = webdriver.Chrome(options=options)
browser.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
    "source": """Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"""
})
browser.get(url)

# find_element(By.CLASS_NAME, ...) replaces find_element_by_class_name (removed in Selenium 4.3)
browser.find_element(By.CLASS_NAME, 'el-input__inner').clear()
browser.find_element(By.CLASS_NAME, 'el-input__inner').send_keys('1976-10-09')
# Once the date is typed, press Enter
browser.find_element(By.CLASS_NAME, 'el-input__inner').send_keys(Keys.ENTER)
time.sleep(5)

data = browser.page_source
# print(data)

# Get the download link of the document (the parser is named explicitly to avoid a bs4 warning)
soup = BeautifulSoup(data, 'html.parser')
body = soup.find('div', attrs={'class': 'el-col el-col-24 el-col-xs-24 el-col-sm-12 el-col-md-12 el-col-lg-12 el-col-xl-12'})
link = body.find_all("a")[0].get("href")
print(link)

# Thanks to the options set at the start, this step downloads the PDF directly
browser.get(link)
time.sleep(5)
browser.close()
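The href read from the page is passed straight to browser.get, which only works if it is an absolute URL. Whether this site emits absolute or relative links is not stated in the post, so the sketch below resolves the link against the page URL either way. The helper name absolute_link and the sample href values are hypothetical.

```python
from urllib.parse import urljoin

page_url = "https://reader.jojokanbao.cn/rmrb"


def absolute_link(href, base=page_url):
    """Resolve href against the page URL; absolute URLs pass through unchanged."""
    return urljoin(base, href)


# Hypothetical href values, for illustration only
print(absolute_link("/files/sample.pdf"))
print(absolute_link("https://example.com/sample.pdf"))
```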

2 Scraping documents from a web page

This section uses downloading the documents at https://www.laoziliao.net/rmrb/ as the example.

2.1 Problems encountered

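The overall approach is the same as for scraping a novel (https://blog.csdn.net/mycsdn5698/article/details/133465660). A few problems came up during implementation:
(1) How to get the text between tags with BeautifulSoup
To get an attribute of a tag, for example the href attribute of an a tag, the code is as follows: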

import requests
from bs4 import BeautifulSoup

data = requests.get(url=url, headers=headers)
data.encoding = 'UTF-8'
soup = BeautifulSoup(data.text, 'html.parser')
body = soup.find('div', attrs={'id': 'month_box'})
for item in body.find_all('a'):
    link = item.get("href")
    print(link)

To get the text between tags:
If the tag has few attributes, a regular expression can extract it. An example and its code follow:

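# The pattern on the next line is truncated in the source (the markup inside it was lost)
findTitle = re.compile(r'(.*" />,re.S)
for card in soup.find_all('div', class_="card mt-2"):
    # Extract the title
    card_title = re.findall(findTitle, str(card))[0]
    print(card_title)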
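For illustration, here is a self-contained sketch of the same regex technique. The markup and the pattern are hypothetical and are not the pattern from the original post; they only show the shape of a capture-group extraction with re.S and a guard against an empty result.

```python
import re

# Hypothetical markup; the real page structure may differ
sample_card = '<div class="card mt-2"><h3 class="card-title">Page 1\nFront page</h3></div>'

# Non-greedy capture group; re.S lets '.' match the newline inside the title
find_title = re.compile(r'<h3[^>]*>(.*?)</h3>', re.S)

matches = find_title.findall(sample_card)
# Guard against an empty result instead of indexing [0] blindly
card_title = matches[0].strip() if matches else ''
print(card_title)
```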

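If the tag has many attributes, see the blog post "beautifulsoup怎样获取标签间文本内容" (how to get the text between tags with beautifulsoup). An example and its code follow: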

data = requests.get(url=news_link, headers=headers)
data.encoding = 'UTF-8'
soup = BeautifulSoup(data.text, 'html.parser')
for context in soup.find_all('div', class_="card mt-2"):
    # Extract the title
    news_title = context.find('h2').string
    print(news_title)
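One caveat with .string is that it returns None when the tag has more than one child, for instance when the heading contains nested markup. The sketch below contrasts it with get_text() on two hypothetical fragments; neither fragment comes from the site.

```python
from bs4 import BeautifulSoup

# Hypothetical fragments for illustration
simple = BeautifulSoup('<h2>Plain title</h2>', 'html.parser').find('h2')
nested = BeautifulSoup('<h2>Title <span>with markup</span></h2>', 'html.parser').find('h2')

print(simple.string)      # the single text child
print(nested.string)      # None, because the tag has more than one child
print(nested.get_text())  # concatenates all descendant text
```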

(2) Replacing br tags with newline characters
Examples follow.

Method 1: use get_text()
Drawback: the br tags are not turned into line breaks; the lines run together, separated only by some spaces.

for news_context in context.find_all('div', class_="card-body lh-lg"):
    tmp_context = news_context.get_text()
    print(tmp_context)

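Method 2: see the blog post https://blog.csdn.net/u012587107/article/details/80543977
Drawback: because it uses str(news_context), the enclosing div tags show up in the output, and the br tags are serialized as <br/>.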

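for news_context in context.find_all('div', class_="card-body lh-lg"):
    # Replace both spellings of the br tag with a newline
    tmp_context = (str(news_context).replace('<br/>', '\n')).replace('<br>', '\n')
    # str(news_context) also brings in <div class="card-body lh-lg"> and </div>; remove them
    tmp_context = (tmp_context.replace('<div class="card-body lh-lg">', '')).replace('</div>', '')
    # The first line starts with two full-width spaces (U+3000); strip them
    tmp_context = tmp_context.replace('　　', '')
    print(tmp_context)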
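A third option, not used in the original post, avoids string surgery on the serialized HTML: replace each br node with a newline inside the parse tree and then call get_text(). This is a minimal sketch on a hypothetical fragment that imitates the article container.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking the article container
html = '<div class="card-body lh-lg">\u3000\u3000First line<br/>Second line<br>Third line</div>'
soup = BeautifulSoup(html, 'html.parser')

for block in soup.find_all('div', class_="card-body lh-lg"):
    # Swap each <br> node for a newline character in the parse tree
    for br in block.find_all('br'):
        br.replace_with('\n')
    # get_text() now yields the text only, with no div tags to strip
    text = block.get_text()
    print(text)
```

Because the div tags never enter the string, the second replace step of Method 2 is not needed with this variant.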

2.2 Complete code

Note: ANSI-encoded text shows some garbled characters when opened on a Kindle; UTF-8-encoded text does not.
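Files that were already saved in ANSI can be re-encoded rather than scraped again. The sketch below assumes that "ANSI" here means the GBK code page used by Simplified Chinese Windows; that is an assumption, not something the post states. The helper name convert_to_utf8 is hypothetical.

```python
from pathlib import Path


def convert_to_utf8(src, dst, source_encoding="gbk"):
    """Read src as source_encoding and write the same text to dst as UTF-8."""
    text = Path(src).read_text(encoding=source_encoding)
    Path(dst).write_text(text, encoding="utf-8")
```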

import os
import requests
import re
import time
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36'}
# The pattern on the next line is truncated in the source (the markup inside it was lost)
findTitle = re.compile(r'(.*" />,re.S)
url = "https://www.laoziliao.net/rmrb/1946-06"

# Make sure the output directory exists before any file is opened
os.makedirs("./TXTs", exist_ok=True)

# Get the links to every daily issue in that month
data = requests.get(url=url, headers=headers)
data.encoding = 'UTF-8'
soup = BeautifulSoup(data.text, 'html.parser')
body = soup.find('div', attrs={'id': 'month_box'})
for item in body.find_all('a'):
    link = item.get("href")
    # print(link)
    # Create one TXT file per daily issue, named after the link:
    # take the characters after the last slash
    last_slash_index = link.rfind("/")
    if last_slash_index != -1:
        TXT_name = link[last_slash_index + 1:]
        TXT_name = TXT_name.replace("-", "")
        print(TXT_name)
    # ANSI-encoded files show garbled characters on a Kindle, so write UTF-8
    with open("./TXTs/" + TXT_name + ".txt", "w", encoding='utf-8') as f:
        # Fetch that day's issue
        data = requests.get(url=link, headers=headers)
        data.encoding = 'UTF-8'
        soup = BeautifulSoup(data.text, 'html.parser')
        # The content of each page of the paper sits in a div with class="card mt-2"
        for card in soup.find_all('div', class_="card mt-2"):
            # Extract the title
            card_title = re.findall(findTitle, str(card))[0]
            f.write(card_title + '\n')
            # print(card_title)
            # Extract the news link and drop any '#...' fragment
            news = card.find_all('a')
            news_link = news[0].get('href')
            if "#" in news_link:
                index = news_link.index("#")
                news_link = news_link[:index]
            print(news_link)
            time.sleep(1)
            # Visit each page of that day's paper; every article sits in a div with class="card mt-2"
            data = requests.get(url=news_link, headers=headers)
            data.encoding = 'UTF-8'
            soup = BeautifulSoup(data.text, 'html.parser')
            for context in soup.find_all('div', class_="card mt-2"):
                # # Extract the title
                # news_title = context.find('h2').string
                # print(news_title)
                # Extract the article body, stored in a div with class="card-body lh-lg"
                for news_context in context.find_all('div', class_="card-body lh-lg"):
                    # Replace the br tags with newline characters
                    tmp_context = (str(news_context).replace('<br/>', '\n')).replace('<br>', '\n')
                    # str(news_context) also brings in <div class="card-body lh-lg"> and </div>; remove them
                    tmp_context = (tmp_context.replace('<div class="card-body lh-lg">', '')).replace('</div>', '')
                    # The first line starts with two full-width spaces (U+3000); strip them
                    tmp_context = tmp_context.replace('　　', '')
                    f.write(tmp_context + '\n')
                    # print(tmp_context)
            f.write('\n\n')
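The two pieces of string handling in the loop, deriving the file name and cutting off the "#" fragment, can also be written with standard-library helpers. This is a sketch under the assumption that the links have the shape shown in the sample calls; the function names and sample links are hypothetical.

```python
from urllib.parse import urldefrag


def txt_name_from_link(link):
    """Take the text after the last slash and drop the hyphens,
    e.g. '.../1946-06-01' -> '19460601'."""
    return link.rstrip('/').rsplit('/', 1)[-1].replace('-', '')


def strip_fragment(link):
    """Remove a trailing '#...' fragment, if there is one."""
    return urldefrag(link).url


# Hypothetical links in the shape the loop handles
print(txt_name_from_link("https://www.laoziliao.net/rmrb/1946-06-01"))
print(strip_fragment("https://www.laoziliao.net/rmrb/1946-06-01-1#12345"))
```

Unlike the rfind version, txt_name_from_link always returns a value, so the file name is defined even for a link without a slash.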

3 Some recommended resources

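Besides the two example sites above, there is also the 时光印记经典珍藏系列 (Shiguang Yinji classic collection series). Part of its material can be viewed for free; access to the full collection is paid.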
