Contents
I. Requirements
II. Code
I. Requirements
1. Crawl the news items on the news homepage, parsing out the title, content, news category, and image of each item, and store them in a database.
2. Only crawl news items that carry an image; one image per item is enough.
II. Code
Below is sample code for crawling the Xinhua news site (新华网).
import requests as rq
from bs4 import BeautifulSoup
import re, os

from gbase import GBASE_DB
from conf import IMGPATH, LOCALPATH, PICSIZE

# Map the sub-channel name in the URL to a numeric category id
xinhua_dict = {'politics': 1,  # current affairs
               'culture': 2,   # culture
               'health': 3,    # health
               'fortune': 4,   # finance
               'world': 5,     # international
               }


def classify_news(s, news_list):
    '''Return the first channel name that appears in s, or None.'''
    for li in news_list:
        if li in s:
            return li
    return None


def get_xinhua_news(url):
    '''Scrape the title, content, category, and image of a Xinhua news page.'''
    newsWeb = rq.get(url)
    newsWeb.encoding = 'utf-8'
    soup = BeautifulSoup(newsWeb.text, 'html.parser')
    # Title
    title_element = soup.find('span', class_='title')
    title = title_element.get_text(strip=True)
    # Category, derived from the sub-channel path in the URL
    news_type = xinhua_dict[classify_news(url, xinhua_dict.keys())]
    # Content: join the paragraphs, then clean up newlines and quotes
    content_element = soup.find('div', id='detail')
    paragraphs = content_element.find_all('p')
    content = '\n'.join(paragraph.get_text(strip=True) for paragraph in paragraphs)
    content = re.sub('\n+', '\n', content).replace('"', '\\"').replace('\n', '\\n')
    # Image: keep the first one that downloads and is larger than PICSIZE bytes
    jpg_element = soup.find_all('img')
    jpg_pattern = re.compile(r'src="([^"]*1n\.(jpg|jpeg))"')
    j_list = jpg_pattern.findall(str(jpg_element))
    return_path = None
    for j in j_list:
        jpg_path = os.path.basename(j[0])
        jpg_url = url[:url.find('c_')] + jpg_path
        picture = rq.get(jpg_url)
        if picture.status_code == 200 and len(picture.content) > PICSIZE:
            with open(LOCALPATH + jpg_path, 'wb') as f:
                f.write(picture.content)
            return_path = IMGPATH + jpg_path
            break
    if return_path is None:
        # No qualifying image: raise, so that main() skips this link
        raise ValueError('no qualifying image found')
    return title, news_type, content, return_path


def main():
    db = GBASE_DB()
    newsUrl = 'http://www.xinhuanet.com/'
    newsWeb = rq.get(newsUrl)
    newsWeb.encoding = 'utf-8'
    soup = BeautifulSoup(newsWeb.text, 'lxml')
    # Collect the list of news links on the homepage
    link_list = []
    li_elements = soup.find_all('li')
    for li_element in li_elements:
        a_element = li_element.find('a')
        if a_element:
            url = a_element.get('href')
            if (url and url.startswith('http://www.news.cn/') and url.endswith('.htm')
                    and 'c_' in url and classify_news(url, xinhua_dict.keys()) is not None):
                link_list.append(url)
    # Parse the news links one by one
    for link in link_list:
        try:
            title, news_type, content, jpg_path = get_xinhua_news(link)
            sql = '''insert into table_name(title, type, content, image)
                     values ('{}', {}, '{}', '{}')'''.format(title, news_type, content, jpg_path)
            db.execute_sql(sql)
            print('(success) crawled Xinhua:', title)
        except Exception as e:
            print('crawl failed:', link, ':', e)
            continue


if __name__ == '__main__':
    main()
First, the crawler fetches the Xinhua homepage and collects all the news links on the page into the link_list list.
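As a quick sanity check of the link filter, here is how classify_news decides which URLs to keep (the definitions are repeated from the script above; both URLs below are hypothetical, invented for illustration):

# Repeated from the script above
xinhua_dict = {'politics': 1, 'culture': 2, 'health': 3, 'fortune': 4, 'world': 5}

def classify_news(s, news_list):
    for li in news_list:
        if li in s:
            return li
    return None

# Hypothetical URLs, invented for illustration
kept = 'http://www.news.cn/politics/2023-09/01/c_1129000000.htm'
skipped = 'http://www.news.cn/video/2023-09/01/c_1129000001.htm'

print(classify_news(kept, xinhua_dict.keys()))     # 'politics' -> link is kept
print(classify_news(skipped, xinhua_dict.keys()))  # None -> link is skipped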
Then each news link is visited in turn, and the title and content are parsed out; whitespace and special characters need some cleaning. Each item is categorized by the sub-channel path in its URL, and the first image whose file size exceeds the PICSIZE threshold is downloaded (a byte-size check used as a stand-in for image dimensions, so that small images on the page, such as QR codes, are skipped) and saved to a local folder on the server. If no qualifying image is found, get_xinhua_news raises an exception, which is caught in main, and the news link is skipped.
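Since the PICSIZE comparison only approximates image dimensions by byte length, a true pixel-dimension check is also possible. A minimal sketch using Pillow (an extra dependency, not used in the script above; the thresholds are invented for illustration):

from io import BytesIO

from PIL import Image  # Pillow; extra dependency, not used in the original script

MIN_WIDTH, MIN_HEIGHT = 200, 200  # hypothetical pixel thresholds


def is_large_enough(img_bytes):
    '''Return True if the image's pixel dimensions reach the thresholds.'''
    with Image.open(BytesIO(img_bytes)) as img:
        width, height = img.size
    return width >= MIN_WIDTH and height >= MIN_HEIGHT

# Inside the download loop, this would replace the byte-size comparison:
# if picture.status_code == 200 and is_large_enough(picture.content):
#     ...save the image...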
Finally, the parsed fields are written to the database.
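One caveat: the INSERT statement is built with str.format, which is why the script escapes quotes in the content by hand. A parameterized query sidesteps both the manual escaping and SQL injection. The GBASE_DB wrapper's API is not shown in this article, so the sketch below uses sqlite3 purely as a stand-in to illustrate the pattern:

import sqlite3

# sqlite3 is only a stand-in here; the real script talks to GBase through
# the GBASE_DB wrapper, whose API is not shown in this article.
conn = sqlite3.connect(':memory:')
conn.execute('create table table_name (title text, type integer, content text, image text)')

row = ('Sample title', 1, 'Sample body text', '/img/sample1n.jpg')  # hypothetical values
conn.execute('insert into table_name(title, type, content, image) values (?, ?, ?, ?)', row)
conn.commit()
print(conn.execute('select * from table_name').fetchall())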