Scraping WeChat Official Account Content with puppeteer

Posted on 2020-09-06 14:29:53 | In: Code | Author: 可乐小可爱メ

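1. Background

Following the earlier post, "Scraping SPA (client-side rendered) InfoQ.cn content with puppeteer", my boss suggested trying WeChat official accounts next...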

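2. Target analysis

The following official accounts were chosen:

CSDN, 程序员成长指北, 前端大学, and InfoQ.

Open each account's home page and get the URL of its article list page.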

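3. Approach

At first I planned to follow the previous post: load the page, then parse and collect the data from the DOM. But the official account list pages only expose a title and a cover image for each article.

That is clearly not enough data.

So I changed approach: capture the AJAX request the page makes, analyze it, adjust its parameters, and replay it to get the response data directly.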

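4. Technology choices

puppeteer & superagent. (This post covers only the data scraping; writing to SQL and the scheduled job are not covered here, see the earlier post "Scraping SPA (client-side rendered) InfoQ.cn content with puppeteer".)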

5. Code

5.1 Dependencies

const puppeteer = require("puppeteer");
const qs = require("qs");
const superagent = require("superagent");
const Express = require("express");
const app = Express();

// Article list page of the InfoQ official account
const target = "http://mp.weixin.qq.com/mp/homepage?__biz=MjM5MDE0Mjc4MA==&hid=13&sn=462df247fb563cc7c339ebcf49244e3a&scene=18#wechat_redirect";

// User agent that identifies the browser as WeChat's built-in one (MicroMessenger)
const userAgent =
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36 MicroMessenger/10.24";
// List endpoint without the protocol, since accounts use both http and https
const listPath = "mp.weixin.qq.com/mp/homepage";

5.2 Visit the page and capture the endpoint and its parameters

const pageQuery = async () => {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.setUserAgent(userAgent);
    await page.goto(target);

    // Start watching only after the initial load, because the page URL
    // itself also contains listPath. Scrolling to the bottom triggers the
    // AJAX request that loads more of the list.
    const [request] = await Promise.all([
      page.waitForRequest((req) => req.url().indexOf(listPath) > 0),
      page.evaluate(() => window.scrollTo(0, document.body.scrollHeight)),
    ]);

    // Return the query string of the captured request as an object
    return qs.parse(request.url().split("?")[1]);
  } finally {
    // Scrolling causes no navigation, so do not wait for one;
    // always close the browser, even when something above fails.
    await browser.close();
  }
};
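The request filter in pageQuery matches on a substring of the URL. As a minimal sketch of a stricter alternative, the hypothetical helper below (isListRequest is not part of the original code) compares host and path regardless of protocol. It also assumes that the list AJAX call always carries action=appmsg_list, as in the sample query shown in 6.1.2, which would let it tell the AJAX call apart from the page load itself.

```javascript
// Hypothetical helper, not from the original post.
const LIST_HOST = "mp.weixin.qq.com";
const LIST_PATHNAME = "/mp/homepage";

function isListRequest(rawUrl) {
  let url;
  try {
    url = new URL(rawUrl); // throws on relative or malformed URLs
  } catch (err) {
    return false;
  }
  return (
    url.hostname === LIST_HOST &&
    url.pathname === LIST_PATHNAME &&
    // Assumption: the list AJAX call is marked by action=appmsg_list
    url.searchParams.get("action") === "appmsg_list"
  );
}

console.log(isListRequest("https://mp.weixin.qq.com/mp/homepage?action=appmsg_list&f=json")); // true
console.log(isListRequest("http://mp.weixin.qq.com/mp/homepage?hid=13&scene=18")); // false
```

A predicate like this could be passed wherever the request URL is checked; whether it is worth the extra strictness depends on how uniform the accounts' list pages really are.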

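5.3 Forge the request

const agent = async (query, cid = 0) => {
  // Copy the captured parameters instead of mutating the shared object,
  // because the three tab requests are built from it at the same time.
  const params = { ...query, r: Math.random(), begin: 1, cid, count: 50 };
  const result = await superagent
    .post(`http://${listPath}`) // listPath has no protocol; use http, like target
    .query(params);
  // Errors now propagate to the caller instead of leaving the promise pending
  return result.body.appmsg_list;
};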

5.4 Return the data

app.get("/", async (req, res) => {
  try {
    const query = await pageQuery();
    // Fetch the three tabs in parallel and merge them into one array
    const lists = await Promise.all([agent(query, 0), agent(query, 1), agent(query, 2)]);
    res.json(lists.flat());
  } catch (err) {
    // Report the failure instead of leaving the request hanging
    res.status(500).json({ error: err.message });
  }
});
app.listen(10001);

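6. Code notes and explanations

6.1 The code relies on what all the official account list pages have in common:

6.1.1 The same endpoint: const listPath = "mp.weixin.qq.com/mp/homepage"; (note that some accounts use http and others https, so the protocol is left out of the match)

6.1.2 The same query parameters: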

{
  __biz: "MjM5MDE0Mjc4MA==", 
  hid: "13",
  sn: "462df247fb563cc7c339ebcf49244e3a",
  scene: "18",
  cid: "0",
  begin: "6",
  count: "5",
  action: "appmsg_list",
  f: "json",
  r: "0.00018560432035696905",
  appmsg_token: ""
}

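6.1.3 Three of these parameters are the main ones to change: cid takes the values 0, 1, and 2, which select the three tabs of the list page, [技术头条] (Tech Headlines) / [5G] / [开发周刊] (Dev Weekly); begin is the starting position, like the start id of an SQL query; count is the number of items to return, like the row limit of an SQL query. In addition, r is a random number.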
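As a rough sketch of how these parameters combine, the hypothetical helper below (buildListQuery is not part of the original code) copies a captured query and overrides cid, begin, count, and r. The sample values are the ones listed in 6.1.2, and only Node built-ins are used.

```javascript
// Hypothetical helper: build the query for one tab and one slice of the list
// without modifying the captured object.
function buildListQuery(captured, { cid = 0, begin = 0, count = 5 } = {}) {
  return {
    ...captured,
    cid: String(cid), // which tab of the list page
    begin: String(begin), // starting position
    count: String(count), // number of items to return
    r: String(Math.random()), // fresh random number per request
  };
}

// Parameters as captured from the page (values from 6.1.2)
const captured = {
  __biz: "MjM5MDE0Mjc4MA==",
  hid: "13",
  sn: "462df247fb563cc7c339ebcf49244e3a",
  scene: "18",
  cid: "0",
  begin: "6",
  count: "5",
  action: "appmsg_list",
  f: "json",
  r: "0.00018560432035696905",
  appmsg_token: "",
};

const secondTab = buildListQuery(captured, { cid: 1, begin: 1, count: 50 });
// Serialize to a query string to see what would be sent
console.log(new URLSearchParams(secondTab).toString());
```

Building a new object for each request keeps the captured parameters intact, which matters when several tab requests are prepared from the same capture.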

6.2 To keep puppeteer from leaking memory, it is run only once, to capture the parameters, and the browser is closed right away.
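One way to take "run only once" further is to cache the captured parameters, so that puppeteer is not launched again for every incoming HTTP request. The sketch below is only an illustration: cacheOnce is a hypothetical helper, fakePageQuery stands in for the real pageQuery, and it assumes the captured parameters stay valid across requests, which this post does not verify.

```javascript
// Hypothetical helper: run an async loader once and share its result.
function cacheOnce(load) {
  let pending = null;
  return () => {
    if (!pending) {
      pending = Promise.resolve()
        .then(load)
        .catch((err) => {
          pending = null; // forget a failed attempt so the next call retries
          throw err;
        });
    }
    return pending;
  };
}

// Stand-in for pageQuery so the sketch runs without a browser
let launches = 0;
const fakePageQuery = async () => {
  launches += 1;
  return { hid: "13" };
};

const getQuery = cacheOnce(fakePageQuery);
Promise.all([getQuery(), getQuery()]).then(([a, b]) => {
  console.log(launches, a === b); // 1 true
});
```

In the server, the route handler would call getQuery() instead of pageQuery(). If the parameters turn out to expire, the cache would need a time limit as well.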

6.3 Promise.all fetches the three tabs in one go, each with begin: 1 and count: 50, and the results are merged and returned.
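Promise.all rejects as soon as any one of the three requests fails, so a single bad tab loses the other two as well. A possible variation, not what the code in this post does, is Promise.allSettled (Node 12.9+), which keeps the tabs that succeeded. In this sketch fetchTabs is a hypothetical helper and fakeAgent stands in for the real agent function.

```javascript
// Hypothetical helper: fetch several tabs and keep whatever succeeded.
async function fetchTabs(fetchTab, cids) {
  const settled = await Promise.allSettled(cids.map((cid) => fetchTab(cid)));
  const articles = [];
  const failed = [];
  settled.forEach((result, i) => {
    if (result.status === "fulfilled") {
      articles.push(...(result.value || [])); // tolerate a missing list
    } else {
      failed.push({ cid: cids[i], reason: String(result.reason) });
    }
  });
  return { articles, failed };
}

// Stand-in for agent(query, cid): tab 1 fails, the others return one item
const fakeAgent = async (cid) => {
  if (cid === 1) throw new Error("request failed");
  return [{ title: `article from tab ${cid}` }];
};

fetchTabs(fakeAgent, [0, 1, 2]).then(({ articles, failed }) => {
  console.log(articles.length, failed.map((f) => f.cid)); // 2 [ 1 ]
});
```

Whether partial results are acceptable depends on the consumer; failing the whole request, as Promise.all does, is the simpler contract.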

6.4 All of this depends on having the URL of the target account's article list page.
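Since everything starts from that URL, it can help to check it before launching a browser. The sketch below parses the InfoQ list URL used in this post with Node's built-in URL class. parseListUrl is a hypothetical helper, and treating __biz, hid, and sn as required is an assumption based only on this one sample URL.

```javascript
// Hypothetical helper: extract and sanity-check the list page parameters.
function parseListUrl(rawUrl) {
  const url = new URL(rawUrl);
  const params = Object.fromEntries(url.searchParams); // the #fragment is not included
  // Assumption: these keys identify the list page, as in the sample URL
  for (const key of ["__biz", "hid", "sn"]) {
    if (!params[key]) {
      throw new Error(`list page URL is missing "${key}"`);
    }
  }
  return { host: url.hostname, path: url.pathname, params };
}

const target =
  "http://mp.weixin.qq.com/mp/homepage?__biz=MjM5MDE0Mjc4MA==&hid=13&sn=462df247fb563cc7c339ebcf49244e3a&scene=18#wechat_redirect";

console.log(parseListUrl(target));
```

A URL that fails this check is probably an article link or a profile link rather than a list page, so it can be rejected early with a clear message.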

7. Results
