Xianyu Scraping for Fun: Packet Capture Analysis and Playwright in Practice

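Preface

I have been doing some e-commerce data analysis recently, and the goal was to scrape every item a specific Xianyu seller currently has on sale (including high-resolution images, listing text, and price). I first tried calling the API directly with requests, only to run into the Alibaba ecosystem's well-known x-sign signature check and the WUA device-fingerprint wall. Spending a lot of time reverse-engineering the JS was not worth it, so I switched to a Playwright + traffic-listening approach.

This post records in detail how to locate the key data packets among a mass of network requests, along with a few pitfalls I hit while writing the code.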

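I. Hardcore Packet Capture: Where Is the Data Actually Hidden?

Xianyu's web client loads data in a very typical way: an HTML skeleton plus data fetched asynchronously through APIs. This means scraping the HTML source directly yields no data; we have to analyze the requests in the Network panel.

1. Analyzing the list page

Open the seller's home page, press F12 to open the developer tools, and switch to Network -> Fetch/XHR. When you scroll down to load more items, one key request shows up:

API name: mtop.idle.web.xyh.item.list

Request method: GET/POST

Key payload: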

{
  "userId": "2217xxxxxxxxx",  // target seller ID
  "pageNumber": 2,            // page number
  "scene": "seller_home"
}

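Data structure: the response data is usually found in data.items or data.cardList. It contains each item's itemId (item ID), title, and soldPrice (price). Getting the itemId is only the first step, because the full listing text and the high-resolution images are only available from the detail page.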
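As a rough illustration of pulling item IDs out of one list response, the sketch below walks both container shapes mentioned above. The nested key paths (cardData.detailParams.itemId and data.itemId) are taken from the full script at the end of this post; the helper name extract_item_ids and the sample response are made up for the demo, so treat the layout as an assumption that may change on the platform side.

```python
def extract_item_ids(resp):
    """Collect item IDs from one list-API response (hypothetical helper)."""
    data = resp.get("data") or {}
    # The items may live under either key, depending on the response variant
    items = data.get("cardList") or data.get("items") or []
    ids = set()
    for item in items:
        pid = (item.get("cardData") or {}).get("detailParams", {}).get("itemId")
        if not pid:
            pid = (item.get("data") or {}).get("itemId")
        if pid:
            ids.add(str(pid))  # normalize to str so IDs compare consistently
    return ids


# Made-up sample response, only for illustration
sample = {"data": {"cardList": [
    {"cardData": {"detailParams": {"itemId": 1001}}},
    {"data": {"itemId": "1002"}},
    {"cardData": {}},  # entry without an ID is skipped
]}}
print(sorted(extract_item_ids(sample)))
```

Normalizing every ID to a string keeps later set comparisons predictable, whether the API returns numbers or strings.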

[Screenshot: packet capture 1]

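2. The detail page's "Russian nesting doll"

When you click through to an item's detail page (https://www.goofish.com/item?id=...), you will find there is no standard "description" field.

After going through the requests whose names start with mtop one by one, I pinned down the key packet:

API name: mtop.taobao.idle.pc.detail

The data trap: you might expect the data to be in data.desc. It is not. Apparently to reuse its mobile-side logic, Xianyu serializes the core data into a JSON string and stuffs it into another field.

Look at this "nesting doll" structure: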

{
  "data": {
    "itemDO": {
      "shareData": {
        // Note: the value of this field is a String, not an Object!
        "shareInfoJsonString": "{\"contentParams\":{\"mainParams\":{\"content\":\"the real listing text is here...\",\"images\":[...]}}}"
      }
    }
  }
}

[Screenshot: packet capture 2]

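Practical takeaway: the code has to parse twice. First parse the outer JSON and extract shareInfoJsonString, then run json.loads on that string to get the list of original high-resolution images (images) and the detailed description (content).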
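To make the two-step decode concrete, here is a minimal, self-contained sketch. The response is a made-up sample built to mirror the nesting in the capture above; the description text and the example.com URL are placeholders, and the shape of each images entry (an object with an image key) is assumed from the parsing code later in this post.

```python
import json

# Build a made-up response with the same nesting as the capture above.
inner = {"contentParams": {"mainParams": {
    "content": "sample description",
    "images": [{"image": "https://example.com/a.jpg"}],
}}}
response = {"data": {"itemDO": {"shareData": {
    # The inner object is serialized, so this field holds a str
    "shareInfoJsonString": json.dumps(inner, ensure_ascii=False),
}}}}

# First parse: already done once `response` is a dict (e.g. via response.json()).
share_str = response["data"]["itemDO"]["shareData"]["shareInfoJsonString"]
print(type(share_str).__name__)  # still a string at this point

# Second parse: decode the embedded string to reach the real fields.
main_params = json.loads(share_str)["contentParams"]["mainParams"]
content = main_params["content"]
image_urls = [img["image"] for img in main_params["images"]]
print(content, image_urls)
```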

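II. Pitfall Guide: The Four Big "Landmines"

I stepped on countless landmines during development. Here is the summary:

💥 Landmine 1: The "invisible wall" of headless mode

Symptom: when the code runs with headless=True (no visible browser), it cannot capture any data at all, and the number of scanned items is 0.

Cause: Alibaba's risk control is able to detect the characteristics of a headless browser (Headless Chrome).

Fix: force the browser UI on with headless=False. If you do not want to look at the window, minimize it manually.

💥 Landmine 2: Login prompts triggered by risk control

Symptom: sometimes a login prompt pops up partway through a crawl and interrupts the program.

Cause: the crawl has probably triggered risk control, which then requires the user to log in.

Fix: add a check when the script starts. If the page has been redirected to the login page, pause the script, scan the QR code by hand, and continue once the cookie has been obtained. If a slider CAPTCHA is triggered, however, there is no workaround; lower the request rate accordingly.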
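The pause-until-logged-in idea could be sketched as below. This is only an illustration under assumptions: wait_for_login is a hypothetical helper, the "login" substring check follows the full script at the end of this post, and the timeout is an addition so that the script cannot wait forever. With Playwright, get_url could be something like lambda: page.url; here a fake page stands in so the sketch runs without a browser.

```python
import asyncio
import time


async def wait_for_login(get_url, timeout=300.0, poll=1.0):
    """Wait until the current URL no longer contains 'login'.

    get_url: zero-argument callable returning the current page URL.
    Returns True once the login page is gone, False if `timeout` seconds pass first.
    """
    deadline = time.monotonic() + timeout
    while "login" in get_url():
        if time.monotonic() >= deadline:
            return False
        await asyncio.sleep(poll)
    return True


class FakePage:
    """Stand-in for a browser page whose URL changes after the user logs in."""

    def __init__(self, urls):
        self._urls = list(urls)

    @property
    def url(self):
        # Step through the scripted URLs, then stay on the last one
        return self._urls.pop(0) if len(self._urls) > 1 else self._urls[0]


page = FakePage([
    "https://example.com/login",
    "https://example.com/login",
    "https://example.com/personal",
])
logged_in = asyncio.run(wait_for_login(lambda: page.url, timeout=5, poll=0.01))
print(logged_in)
```

Returning False on timeout lets the caller decide whether to abort or retry, instead of hanging indefinitely when nobody scans the code.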

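💥 Landmine 3: Image loading slows everything down

Symptom: scraping 300 items takes half an hour, and most of that time goes to loading images.

Cause: we only need the JSON data, but by default the browser downloads every image on the page, wasting bandwidth and time.

Fix: use Playwright's route interception to block all image and font requests.

Only three landmines out of four? Perfectly normal (big lie).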

III. Key Python Code Walkthrough

1. Resource interception

Anything that is an image or a font gets rejected outright and is never downloaded.

# Block unneeded resources to speed up page loading considerably
await page.route(
    "**/*.{png,jpg,jpeg,gif,webp,svg,ttf,woff,woff2}",
    lambda route: route.abort()
)
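An alternative worth considering is to filter on the resource type the browser reports instead of on the file extension. The sketch below is an assumption-laden variant, not what the script in this post does: BLOCKED_TYPES and block_heavy_resources are hypothetical names, and the fake route classes exist only so that the handler can be exercised without launching a browser.

```python
import asyncio

# Resource types to drop; "media" is an extra assumption beyond images and fonts
BLOCKED_TYPES = {"image", "font", "media"}


async def block_heavy_resources(route):
    """Route handler: abort heavy resources, let everything else through."""
    if route.request.resource_type in BLOCKED_TYPES:
        await route.abort()
    else:
        await route.continue_()


# With Playwright this would be registered as:
#     await page.route("**/*", block_heavy_resources)

class FakeRequest:
    def __init__(self, resource_type):
        self.resource_type = resource_type


class FakeRoute:
    """Minimal stand-in that records which action the handler chose."""

    def __init__(self, resource_type):
        self.request = FakeRequest(resource_type)
        self.action = None

    async def abort(self):
        self.action = "abort"

    async def continue_(self):
        self.action = "continue"


async def demo():
    routes = [FakeRoute(t) for t in ("image", "xhr", "font", "document")]
    for r in routes:
        await block_heavy_resources(r)
    return [r.action for r in routes]


actions = asyncio.run(demo())
print(actions)
```

Matching on the type also covers image URLs that carry a query string or an unusual suffix, which an extension glob can miss. The trade-off is that every request now passes through the handler.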

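2. Passive traffic listening (core logic)

Instead of actively issuing requests.get, the script acts like a listener stationed next to the browser: as soon as the target packet passes by, it is captured.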

async def handle_detail(response):
    # Only handle the detail API, and only when the status code is 200
    if "mtop.taobao.idle.pc.detail" in response.url and response.status == 200:
        try:
            json_data = await response.json()
            # Hand the packet to the parsing function
            parsed_data = parse_detail_packet(json_data)
            if parsed_data:
                # Store it in the result container
                result_container["data"] = parsed_data
        except Exception:
            # Ignore bodies that are not valid JSON or that fail to parse
            pass

# Register the listener
page.on("response", handle_detail)
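If polling a shared container feels awkward, one option is to resolve an asyncio.Future from the listener and await it with a timeout. The following is a minimal sketch of that idea rather than the approach used in this post: make_capture is a hypothetical helper, the 6-second timeout mirrors the wait used in the full script, and FakeResponse replaces real Playwright responses so the example runs on its own.

```python
import asyncio


def make_capture(url_keyword):
    """Return (handler, future); the future resolves with the first matching JSON body."""
    future = asyncio.get_running_loop().create_future()

    async def handler(response):
        if url_keyword not in response.url or response.status != 200:
            return
        try:
            body = await response.json()
        except Exception as exc:  # e.g. the body is not valid JSON
            print(f"failed to decode {response.url}: {exc}")
            return
        if not future.done():  # keep only the first matching packet
            future.set_result(body)

    return handler, future


class FakeResponse:
    """Stand-in exposing the three members the handler uses."""

    def __init__(self, url, status, body):
        self.url, self.status, self._body = url, status, body

    async def json(self):
        return self._body


async def demo():
    handler, future = make_capture("mtop.taobao.idle.pc.detail")
    # With Playwright: page.on("response", handler), then navigate to the page
    await handler(FakeResponse("https://example.com/other", 200, {"x": 1}))
    await handler(FakeResponse(
        "https://example.com/mtop.taobao.idle.pc.detail/1.0/", 200, {"data": {"ok": True}}))
    try:
        return await asyncio.wait_for(future, timeout=6)
    except asyncio.TimeoutError:
        return None  # nothing captured in time


captured = asyncio.run(demo())
print(captured)
```

The waiter wakes up as soon as the packet arrives, and the timeout path is explicit instead of being hidden in a loop condition.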

3. Parsing the "nesting doll" data

This is the code that implements the capture analysis above; it deals specifically with shareInfoJsonString.

import json

def parse_detail_packet(data):
    # The capture above shows the payload nested under "data"; unwrap it if present
    data = data.get("data", data)

    # 1. Safety check: skip items from other sellers (avoids capturing recommended items)
    #    TARGET_USER_ID is the seller ID string from the configuration
    current_seller_id = str(data.get("sellerDO", {}).get("sellerId", ""))
    if current_seller_id != TARGET_USER_ID:
        return None

    # 2. Locate the nested string
    item_do = data.get("itemDO", {})
    share_json_str = item_do.get("shareData", {}).get("shareInfoJsonString", "")

    desc_text = ""
    images = []

    # 3. Second parse
    if share_json_str:
        inner_data = json.loads(share_json_str)  # the key step
        main_params = inner_data.get("contentParams", {}).get("mainParams", {})

        # Extract the real listing text
        desc_text = main_params.get("content", "")
        # Extract the list of high-resolution images
        images = [img['image'] for img in main_params.get("images", []) if 'image' in img]

    return {
        "title": item_do.get("title"),
        "desc": desc_text,
        "images": images
    }

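4. Async concurrency control

To stay fast without getting the IP banned, we use asyncio.Semaphore to cap the number of browser tabs open at the same time (3 is recommended).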

import asyncio
import random

# Cap concurrency at 3
semaphore = asyncio.Semaphore(3)

async def worker(context, pid):
    async with semaphore:  # runs only after acquiring the semaphore
        page = await context.new_page()
        try:
            await page.goto(f"https://www.goofish.com/item?id={pid}")
            # ... wait for the packet to be captured ...
        finally:
            await page.close()
            # Rest 1-2 seconds at random to mimic a human browsing pace
            await asyncio.sleep(random.uniform(1, 2))
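To check that the semaphore really caps concurrency, the pattern can be tried without a browser. In this sketch fake_worker and run_limited are hypothetical names, and a short sleep stands in for opening a page and waiting for the packet; the code records the highest number of workers that were inside the guarded section at once.

```python
import asyncio


async def run_limited(n_tasks, limit):
    """Run n_tasks fake workers with at most `limit` inside the guarded section."""
    semaphore = asyncio.Semaphore(limit)
    running = 0
    peak = 0

    async def fake_worker(pid):
        nonlocal running, peak
        async with semaphore:
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.01)  # stands in for page.goto + packet capture
            running -= 1
        return pid

    # gather returns results in the order the workers were passed in
    results = await asyncio.gather(*(fake_worker(i) for i in range(n_tasks)))
    return peak, results


peak, results = asyncio.run(run_limited(n_tasks=10, limit=3))
print(peak, results)
```

Creating the semaphore inside the running coroutine, as done here and in the full script's main(), also avoids event-loop binding problems that module-level asyncio primitives can cause on older Python versions.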

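Summary

With Playwright's "browser automation + traffic listening" pattern, we sidestepped the complex signature verification.

Results in practice:

Speed: with 3 concurrent tabs, scraping 300 item details and downloading more than 2,000 images took about 15 minutes
Stability: combined with QR-code login and resuming from what is already saved locally, the script can keep the shop data in sync reliably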
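The resume-and-sync behavior boils down to a set difference between the item IDs seen online and the folders already on disk. The sketch below isolates that step; plan_sync is a hypothetical helper, the folder names are invented, and the assumption that every folder name ends in an underscore followed by the item ID comes from the full script at the end of this post.

```python
def plan_sync(online_ids, local_folders):
    """Return (to_download, to_delete) from online IDs and local folder names.

    to_download: item IDs that are online but have no local folder yet.
    to_delete:   local folders whose item is no longer online.
    """
    local_map = {}
    for folder in local_folders:
        if "_" in folder:
            # The item ID is whatever follows the last underscore
            local_map[folder.rsplit("_", 1)[-1]] = folder
    local_ids = set(local_map)
    online = set(online_ids)
    to_download = sorted(online - local_ids)
    to_delete = sorted(local_map[pid] for pid in local_ids - online)
    return to_download, to_delete


online = {"1001", "1002", "1003"}
folders = ["red_hoodie_1001", "old_jacket_0999", "notes"]
to_download, to_delete = plan_sync(online, folders)
print(to_download)
print(to_delete)
```

Because deletion is driven by what the scan did not see, an incomplete scan would remove valid folders, which is why the full script refuses to continue when the scan returns suspiciously few items.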

[Screenshot: result 1]

You can also add simple keyword matching in Python to sort the items into categories:

# Category -> keywords. Keys and keywords stay in Chinese because they are matched
# against Chinese listing titles; a few common misspellings are included on purpose.
CATEGORIES = {
    "卫衣": ["卫衣", "帽衫", "HOODIE", "套头", "连帽"],  # hoodies
    "外套": ["外套", "夹克", "JACKET", "棉服", "羽绒服", "棒球服", "开衫", "皮衣", "底特律", "教练夹克", "飞行员", "帆布", "工装"],  # jackets
    "短袖": ["短袖", "T恤", "TEE", "半袖", "短䄂"],  # short sleeves
    "长袖": ["长袖"],  # long sleeves
    "衬衫": ["衬衫", "SHIRT"],  # shirts
    "裤子": ["裤子", "牛仔裤", "长裤", "短裤", "牛仔", "微喇"],  # pants
    "毛衣": ["毛衣", "针织", "马海毛", "海马毛", "圆领毛衣"],  # sweaters
}

def get_category_from_text(text):
    """Return the first category whose keyword appears in the text."""
    text = text.upper()
    for category, keywords in CATEGORIES.items():
        for keyword in keywords:
            if keyword.upper() in text:
                return category
    return "其他"  # "other"
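The same lookup could also be expressed with compiled regular expressions, one alternation per category. This is only a sketch of that variant: DEMO_CATEGORIES is a trimmed-down table with English labels chosen for the demo, and classify is a hypothetical name.

```python
import re

# Trimmed-down keyword table, for illustration only
DEMO_CATEGORIES = {
    "hoodie": ["卫衣", "帽衫", "HOODIE"],
    "jacket": ["外套", "夹克", "JACKET"],
    "t-shirt": ["短袖", "T恤", "TEE"],
}

# One compiled alternation per category; re.escape guards keywords
# that happen to contain regex metacharacters.
PATTERNS = {
    category: re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    for category, keywords in DEMO_CATEGORIES.items()
}


def classify(text, default="other"):
    """Return the first category whose pattern matches, checked in table order."""
    for category, pattern in PATTERNS.items():
        if pattern.search(text):
            return category
    return default


print(classify("Carhartt 底特律夹克 jacket"))
print(classify("vintage tee 短袖"))
print(classify("帆布包"))
```

As with the plain substring version, table order decides ties, and short ASCII keywords such as TEE also match inside longer words. Wrapping those keywords in \b word boundaries is one way to tighten the match.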

[Screenshot: result 2]

Full Code

import asyncio
from playwright.async_api import async_playwright
import json
import time
import os
import requests
import re
import random
import shutil
from concurrent.futures import ThreadPoolExecutor

# --- ⚙️ Configuration ---
BASE_DIR = "闲鱼店铺爬取"  # output folder name ("Xianyu shop scrape")
TARGET_USER_ID = "xxxxxxxxx" # ID of the seller to scrape: the number at the end of the seller page URL
HEADLESS = False       # the browser window must be visible
CONCURRENCY = 3        # number of pages open at once (2-3 recommended; going faster risks an IP ban)
WORKER_DELAY = (1, 2)  # rest time after a page finishes (seconds)

# Thread pool dedicated to image downloads
img_executor = ThreadPoolExecutor(max_workers=10)

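def sanitize_filename(name):
    # Strip characters that are invalid in file names, plus newlines
    name = re.sub(r'[\\/:*?"<>|]', '', name).replace('\n', '')
    return name[:50].strip()

def download_image_sync(url, save_path):
    """(Sync) image download task, handed to the thread pool."""
    try:
        if os.path.exists(save_path): return
        if url.startswith("http:"): url = url.replace("http:", "https:", 1)
        # Drop the thumbnail suffix that follows the extension (e.g. ".jpg_xxx") to get the original image.
        # Only the part after the extension is cut, so underscores earlier in the URL are preserved.
        url = re.sub(r'(\.(?:jpg|png|heic))_.*$', r'\1', url)

        headers = {"User-Agent": "Mozilla/5.0"}
        resp = requests.get(url, headers=headers, timeout=15)
        if resp.status_code == 200:
            with open(save_path, 'wb') as f:
                f.write(resp.content)
    except Exception as e:
        # Report the failure instead of swallowing it silently
        print(f"   ⚠️ [image] download failed: {url} ({e})")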

def parse_detail_packet(data):
    """Parse the detail packet."""
    try:
        # The payload sits under the top-level "data" key (see the capture analysis); unwrap it if present
        data = data.get("data", data)

        # Seller check
        seller_obj = data.get("sellerDO", {}) or data.get("seller", {})
        current_seller_id = str(seller_obj.get("sellerId", "")) or str(seller_obj.get("userId", ""))
        if current_seller_id and current_seller_id != TARGET_USER_ID:
            return None # filter out recommended items

        item_do = data.get("itemDO", {})
        share_data = item_do.get("shareData", {})
        share_json_str = share_data.get("shareInfoJsonString", "")

        desc_text = ""
        images = []

        # Second parse: the field holds a JSON string, not an object
        if share_json_str:
            inner_data = json.loads(share_json_str)
            main_params = inner_data.get("contentParams", {}).get("mainParams", {})
            desc_text = main_params.get("content", "")
            img_list = main_params.get("images", [])
            for img_obj in img_list:
                if "image" in img_obj: images.append(img_obj["image"])

        # Fallbacks when the nested string is missing or empty
        if not desc_text: desc_text = item_do.get("desc", "No description")
        if not images:
            for img in item_do.get("imageInfos", []):
                if "url" in img: images.append(img["url"])

        price = item_do.get("soldPrice", "0")
        if price == "0": price = item_do.get("priceInfo", {}).get("price", "0")

        return {
            "title": item_do.get("title", ""),
            "price": price,
            "desc": desc_text,
            "images": images,
            "itemId": item_do.get("itemId", "")
        }
    except Exception:
        return None

async def scan_online_ids(context):
    """[Step 1] Async full scan of the seller's listings."""
    page = await context.new_page()
    online_ids = set()

    async def handle_list(response):
        if "item.list" in response.url and response.status == 200:
            try:
                resp = await response.json()
                data = resp.get("data", {})
                items = data.get("cardList") or data.get("items") or []
                for item in items:
                    pid = item.get("cardData", {}).get("detailParams", {}).get("itemId")
                    if not pid: pid = item.get("data", {}).get("itemId")
                    if pid: online_ids.add(str(pid))
                print(f"   🔍 Scanning... total so far: {len(online_ids)}", end="\r")
            except Exception: pass

    # Register the listener before navigating so the first page of results is not missed
    page.on("response", handle_list)

    print("📡 Scanning the seller's home page...")
    await page.goto(f"https://www.goofish.com/personal?userId={TARGET_USER_ID}")

    # Login check
    await asyncio.sleep(3)
    if "login" in page.url:
        print("\n🔴🔴🔴 Not logged in! Please scan the QR code manually! 🔴🔴🔴")
        while "login" in page.url:
            await asyncio.sleep(1)
        print("✅ Logged in, resuming scan...")

    # Fast scrolling
    print("\n📜 Scrolling down...")
    no_change = 0
    last_len = 0

    while True:
        await page.keyboard.press("End")
        await asyncio.sleep(1.5) # short interval between scrolls

        curr_len = len(online_ids)
        if curr_len == last_len:
            no_change += 1
            print(f"   ⚠️ Reached the bottom? ({no_change}/6) current: {curr_len}   ", end="\r")
            # Stop after 6 consecutive scrolls that bring no new items
            if no_change >= 6:
                break
        else:
            no_change = 0
            last_len = curr_len

    await page.close()
    return online_ids

async def worker(context, pid, semaphore):
    """[Step 3] Concurrent unit of work."""
    async with semaphore: # cap concurrency so too many pages do not freeze the machine
        page = await context.new_page()
        # Block images to speed things up considerably
        await page.route("**/*.{png,jpg,jpeg,gif,webp,svg}", lambda route: route.abort())

        result_container = {"data": None}

        # Define the listener
        async def handle_detail(response):
            if "mtop.taobao.idle.pc.detail" in response.url and response.status == 200:
                try:
                    json_data = await response.json()
                    parsed = parse_detail_packet(json_data)
                    if parsed: result_container["data"] = parsed
                except Exception: pass

        page.on("response", handle_detail)

        try:
            # print(f"🚀 [fetching] ID: {pid}") # commented out to keep the output quiet
            await page.goto(f"https://www.goofish.com/item?id={pid}")

            # Wait for the packet, 6 seconds at most
            start_t = time.time()
            while result_container["data"] is None and time.time() - start_t < 6:
                await asyncio.sleep(0.2)

            item_info = result_container["data"]

            if item_info:
                # 1. Save the listing text
                title = item_info["title"]
                safe_title = sanitize_filename(title)
                folder_name = f"{safe_title}_{pid}"
                item_dir = os.path.join(BASE_DIR, folder_name)
                os.makedirs(item_dir, exist_ok=True)

                # "文案.txt" means "listing text"
                with open(os.path.join(item_dir, "文案.txt"), "w", encoding="utf-8") as f:
                    f.write(f"[Title]: {title}\n[Price]: {item_info['price']}\n[Link]: https://www.goofish.com/item?id={pid}\n\n[Description]:\n{item_info['desc']}")

                print(f"   ✅ [download] {safe_title[:10]}... | queued {len(item_info['images'])} images")

                # 2. Hand the images to the background thread pool
                for idx, url in enumerate(item_info["images"]):
                    save_path = os.path.join(item_dir, f"{idx+1}.jpg")
                    img_executor.submit(download_image_sync, url, save_path)
            else:
                print(f"   ⚠️ [skip] capture failed or not a target item: {pid}")

        except Exception as e:
            print(f"   ❌ [error] ID {pid}: {e}")
        finally:
            await page.close()
            # Fast mode: rest only 1-2 seconds
            await asyncio.sleep(random.uniform(*WORKER_DELAY))

async def main():
    os.makedirs(BASE_DIR, exist_ok=True)

    async with async_playwright() as p:
        print(f"🚀 Launching browser (concurrency: {CONCURRENCY})...")
        browser = await p.chromium.launch(headless=HEADLESS)
        context = await browser.new_context()

        # 1. Full scan of online IDs
        online_ids = await scan_online_ids(context)

        # Safety valve
        if len(online_ids) < 10:
            print("\n\n❌ Too few items were scanned; the login may have expired. Aborting to avoid accidental deletion.")
            await browser.close()
            return

        # 2. Compare with local folders
        local_ids = set()
        local_map = {} # ID -> Path
        for folder in os.listdir(BASE_DIR):
            if "_" in folder:
                pid = folder.split("_")[-1]
                local_ids.add(pid)
                local_map[pid] = os.path.join(BASE_DIR, folder)

        to_download = list(online_ids - local_ids)
        to_delete = local_ids - online_ids

        print(f"\n📊 Report: online {len(online_ids)} | local {len(local_ids)} | new {len(to_download)} | to delete {len(to_delete)}")

        # 3. Delete delisted items
        if to_delete:
            print(f"🧹 Cleaning up {len(to_delete)} delisted items...")
            for pid in to_delete:
                try:
                    shutil.rmtree(local_map[pid])
                    print(f"   ❌ Deleted: {pid}")
                except Exception as e:
                    print(f"   ⚠️ Could not delete {pid}: {e}")

        # 4. Download concurrently
        if to_download:
            print(f"\n⚡ Starting concurrent download (queue: {len(to_download)})...")
            semaphore = asyncio.Semaphore(CONCURRENCY) # controls how many pages are open at once
            tasks = []
            for pid in to_download:
                task = asyncio.create_task(worker(context, pid, semaphore))
                tasks.append(task)

            # Wait for all tasks to finish
            await asyncio.gather(*tasks)
        else:
            print("✨ No new items to download.")

        print("\n🎉 Sync finished! Waiting for the remaining image downloads...")
        img_executor.shutdown(wait=True)
        print("✅ Exiting.")
        await browser.close()

if __name__ == "__main__":
    asyncio.run(main())

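Legal Notice
Scraping techniques are for learning and exchange only. Do not use them for illegal commercial purposes, and strictly follow the platform's robots.txt rules.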