Contents
I. Requirements
II. Code
I. Requirements
1. Crawl the news items on the news homepage, parsing out the title, content, news category, and image of each item, and store them in a database.
2. Only crawl news items that carry an image; one image per item is enough.
II. Code
Below is sample code for crawling the Xinhua news site (新华网).
import requests as rq
from bs4 import BeautifulSoup
import re, os

from gbase import GBASE_DB
from conf import IMGPATH, LOCALPATH, PICSIZE

# Map the sub-channel name in the URL to a numeric category id
xinhua_dict = {'politics': 1,  # current affairs
               'culture': 2,   # culture
               'health': 3,    # health
               'fortune': 4,   # finance
               'world': 5,     # international
               }


def classify_news(s, news_list):
    '''Return the first channel name that appears in s, or None.'''
    for li in news_list:
        if li in s:
            return li
    return None


def get_xinhua_news(url):
    '''Scrape the title, content, category, and image of a Xinhua news page.'''
    newsWeb = rq.get(url)
    newsWeb.encoding = 'utf-8'
    soup = BeautifulSoup(newsWeb.text, 'html.parser')
    # Title
    title_element = soup.find('span', class_='title')
    title = title_element.get_text(strip=True)
    # Category, derived from the sub-channel path in the URL
    news_type = xinhua_dict[classify_news(url, xinhua_dict.keys())]
    # Content: join the paragraphs, then clean up newlines and quotes
    content_element = soup.find('div', id='detail')
    paragraphs = content_element.find_all('p')
    content = '\n'.join(paragraph.get_text(strip=True) for paragraph in paragraphs)
    content = re.sub('\n+', '\n', content).replace('"', '\\"').replace('\n', '\\n')
    # Image: keep the first one that downloads and is larger than PICSIZE bytes
    jpg_element = soup.find_all('img')
    jpg_pattern = re.compile(r'src="([^"]*1n\.(jpg|jpeg))"')
    j_list = jpg_pattern.findall(str(jpg_element))
    return_path = None
    for j in j_list:
        jpg_path = os.path.basename(j[0])
        jpg_url = url[:url.find('c_')] + jpg_path
        picture = rq.get(jpg_url)
        if picture.status_code == 200 and len(picture.content) > PICSIZE:
            with open(LOCALPATH + jpg_path, 'wb') as f:
                f.write(picture.content)
            return_path = IMGPATH + jpg_path
            break
    if return_path is None:
        # No qualifying image: raise, so that main() skips this link
        raise ValueError('no qualifying image found')
    return title, news_type, content, return_path


def main():
    db = GBASE_DB()
    newsUrl = 'http://www.xinhuanet.com/'
    newsWeb = rq.get(newsUrl)
    newsWeb.encoding = 'utf-8'
    soup = BeautifulSoup(newsWeb.text, 'lxml')
    # Collect the list of news links on the homepage
    link_list = []
    li_elements = soup.find_all('li')
    for li_element in li_elements:
        a_element = li_element.find('a')
        if a_element:
            url = a_element.get('href')
            if (url and url.startswith('http://www.news.cn/') and url.endswith('.htm')
                    and 'c_' in url and classify_news(url, xinhua_dict.keys()) is not None):
                link_list.append(url)
    # Parse the news links one by one
    for link in link_list:
        try:
            title, news_type, content, jpg_path = get_xinhua_news(link)
            sql = '''insert into table_name(title, type, content, image)
                     values ('{}', {}, '{}', '{}')'''.format(title, news_type, content, jpg_path)
            db.execute_sql(sql)
            print('(success) crawled Xinhua:', title)
        except Exception as e:
            print('crawl failed:', link, ':', e)
            continue


if __name__ == '__main__':
    main()
First, the crawler fetches the Xinhua homepage and collects all the news links on the page into the link_list list.
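As a quick sanity check of the link filter, here is how classify_news decides which URLs to keep (the definitions are repeated from the script above; both URLs below are hypothetical, invented for illustration):

# Repeated from the script above
xinhua_dict = {'politics': 1, 'culture': 2, 'health': 3, 'fortune': 4, 'world': 5}

def classify_news(s, news_list):
    for li in news_list:
        if li in s:
            return li
    return None

# Hypothetical URLs, invented for illustration
kept = 'http://www.news.cn/politics/2023-09/01/c_1129000000.htm'
skipped = 'http://www.news.cn/video/2023-09/01/c_1129000001.htm'

print(classify_news(kept, xinhua_dict.keys()))     # 'politics' -> link is kept
print(classify_news(skipped, xinhua_dict.keys()))  # None -> link is skipped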
Then each news link is visited in turn, and the title and content are parsed out; whitespace and special characters need some cleaning. Each item is categorized by the sub-channel path in its URL, and the first image whose file size exceeds the PICSIZE threshold is downloaded (a byte-size check used as a stand-in for image dimensions, so that small images on the page, such as QR codes, are skipped) and saved to a local folder on the server. If no qualifying image is found, get_xinhua_news raises an exception, which is caught in main, and the news link is skipped.
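Since the PICSIZE comparison only approximates image dimensions by byte length, a true pixel-dimension check is also possible. A minimal sketch using Pillow (an extra dependency, not used in the script above; the thresholds are invented for illustration):

from io import BytesIO

from PIL import Image  # Pillow; extra dependency, not used in the original script

MIN_WIDTH, MIN_HEIGHT = 200, 200  # hypothetical pixel thresholds


def is_large_enough(img_bytes):
    '''Return True if the image's pixel dimensions reach the thresholds.'''
    with Image.open(BytesIO(img_bytes)) as img:
        width, height = img.size
    return width >= MIN_WIDTH and height >= MIN_HEIGHT

# Inside the download loop, this would replace the byte-size comparison:
# if picture.status_code == 200 and is_large_enough(picture.content):
#     ...save the image...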
Finally, the parsed fields are written to the database.
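One caveat: the INSERT statement is built with str.format, which is why the script escapes quotes in the content by hand. A parameterized query sidesteps both the manual escaping and SQL injection. The GBASE_DB wrapper's API is not shown in this article, so the sketch below uses sqlite3 purely as a stand-in to illustrate the pattern:

import sqlite3

# sqlite3 is only a stand-in here; the real script talks to GBase through
# the GBASE_DB wrapper, whose API is not shown in this article.
conn = sqlite3.connect(':memory:')
conn.execute('create table table_name (title text, type integer, content text, image text)')

row = ('Sample title', 1, 'Sample body text', '/img/sample1n.jpg')  # hypothetical values
conn.execute('insert into table_name(title, type, content, image) values (?, ?, ?, ?)', row)
conn.commit()
print(conn.execute('select * from table_name').fetchall())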