ChatGPT in Practice: 100 Examples – (04) Automated Web Scraping

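I. Requirements and Approach

Requirement: parsing web page elements by hand is too tedious, so let ChatGPT write the parsing code for us.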

Steps:

Have ChatGPT write the script
Run it with Python

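Prerequisites: you have at least heard of the Python scraping libraries requests and bs4.
Never heard of them? In short:

requests is a Python HTTP library, used to fetch web page data.
bs4 is short for BeautifulSoup, an HTML/XML parsing library used to extract information from that page data.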
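
To make the division of labor concrete, here is a minimal sketch, assuming bs4 is installed (pip install beautifulsoup4). So that it runs offline, it parses a hard-coded HTML string that is made up for illustration; in a real scraper that string would come from requests instead.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page. A real scraper would
# obtain it with something like requests.get(url, timeout=10).text.
html = """
<html><body>
  <h1>Demo page</h1>
  <a href="https://example.com/a">First link</a>
  <a href="https://example.com/b">Second link</a>
</body></html>
"""

# bs4's job starts here: turn the raw markup into a searchable tree.
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag, find_all() returns every match.
heading = soup.find("h1").get_text(strip=True)
links = [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a")]

print(heading)
print(links)
```

requests only handles the download; everything after the BeautifulSoup(...) call is parsing and works the same whether the HTML came from the network, a file, or a string.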

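II. The Python Script

Question:
Write a Python parser using the bs4 library that tries to extract the following structured information from this HTML:
icon, category, title, URL, description
Represent the result as JSON, using English field names.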

<div class="url-card io-px-2 col-6 col-2a col-sm-2a col-md-2a col-lg-3a col-xl-5a col-xxl-6a ">
    <div class="url-body default ">
        <a href="/sites/4661.html?menu-id=2331&mininav-id=4543" target="_blank" data-id="4661" data-url="https://murf.ai" class="card no-c mb-4 site-4661" title="Murf AI">
            <div class="card-body url-content d-flex align-items-center">
                <div class="url-img rounded-circle mr-2 d-flex align-items-center justify-content-center">
                    <img class="lazy unfancybox loaded" src="https://api.iowen.cn/favicon/murf.ai.png" data-src="https://api.iowen.cn/favicon/murf.ai.png" height="auto" alt="Murf AI" data-was-processed="true">
                </div>
                <div class="url-info flex-fill">
                    <div class="text-sm overflowClip_1"> <strong>Murf AI</strong> </div>
                    <p class="overflowClip_1 m-0 text-muted text-xs">AI语音生成器在20种语言。120+逼真的文本到语音的声音，创造完美的AI画外音。轻松地从文字转换为语音。</p>
                </div>
            </div>
        </a>
        <a href="/go/?url=aHR0cHM6Ly9tdXJmLmFpLw%3D%3D" target="_blank" rel="external nofollow noopener" title="" class="togo text-center text-muted is-views" data-id="4661" data-toggle="tooltip" data-placement="right" data-original-title="直达"><i class="iconfont icon-goto"></i></a>
    </div>
</div>

Answer:

(ChatGPT replied in English.)
Here's some Python code to extract the structured information from the HTML and represent it in JSON format:

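import json

import requests
from bs4 import BeautifulSoup

url = "URL of the page"  # placeholder: replace with the page to scrape
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

results = []
# Each site on the page is rendered as one <div class="url-card">
for card in soup.find_all("div", class_="url-card"):
    icon = card.find("img")["data-src"]
    # The card carries no explicit category; data-id (the site ID) stands in for it
    category = card.find("a", class_="card")["data-id"]
    title = card.find("strong").text
    website = card.find("a", class_="card")["data-url"]
    # strip() drops the whitespace that surrounds the text inside the <p> tag
    description = card.find("p", class_="overflowClip_1").text.strip()
    result = {
        "icon": icon,
        "category": category,
        "title": title,
        "website": website,
        "description": description
    }
    results.append(result)

# ensure_ascii=False keeps non-ASCII descriptions readable in the output
json_result = json.dumps(results, ensure_ascii=False)
print(json_result)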

This code extracts the icon, category, title, website, and description of each tool listed on the page and saves them in a list of dictionaries. Finally, it converts the list to a JSON string and prints it.
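
The script indexes tags and attributes directly, so a card that lacks an <img> or a <p> would raise an exception, and it cannot be tried without a live URL. The following is a hedged offline variant rather than part of ChatGPT's answer: it assumes the same markup as the sample card, and the names SAMPLE_HTML and parse_cards are illustrative. The sample is trimmed (layout classes dropped, description shortened) to keep the sketch short.

```python
import json

from bs4 import BeautifulSoup

# Trimmed copy of the sample card from the question; only the parts the
# parser reads are kept, and the description is shortened for this sketch.
SAMPLE_HTML = """
<div class="url-card">
  <div class="url-body default">
    <a href="/sites/4661.html" data-id="4661" data-url="https://murf.ai"
       class="card no-c mb-4 site-4661" title="Murf AI">
      <img class="lazy" src="https://api.iowen.cn/favicon/murf.ai.png"
           data-src="https://api.iowen.cn/favicon/murf.ai.png" alt="Murf AI">
      <div class="text-sm overflowClip_1"><strong>Murf AI</strong></div>
      <p class="overflowClip_1 m-0 text-muted text-xs">
        AI voice generator (description shortened for this sketch).
      </p>
    </a>
  </div>
</div>
"""


def parse_cards(html):
    """Return one dict per url-card; missing pieces become None instead of raising."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for card in soup.find_all("div", class_="url-card"):
        link = card.find("a", class_="card")
        img = card.find("img")
        title = card.find("strong")
        desc = card.find("p", class_="overflowClip_1")
        records.append({
            "icon": img.get("data-src") if img else None,
            # As in the original script, data-id (the site ID) stands in for a category.
            "category": link.get("data-id") if link else None,
            "title": title.get_text(strip=True) if title else None,
            "website": link.get("data-url") if link else None,
            "description": desc.get_text(strip=True) if desc else None,
        })
    return records


records = parse_cards(SAMPLE_HTML)
print(json.dumps(records, ensure_ascii=False, indent=2))
```

Once the output looks right on the sample, the same function can be pointed at a real page by passing it the response body from requests.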