Python 采集 Facebook 评论插件、留言外挂程序

【 Python 网络爬虫 】 同时被 2 个专栏收录
30 篇文章 4 订阅
12 篇文章 20 订阅


【1x00】写在前面

本文的采集代码适用于 Facebook 评论插件的评论采集。仅用于 Python 编程技术交流!

Facebook 评论插件官网:https://developers.facebook.com/products/social-plugins/comments

本文以 https://www.chinatimes.com/realtimenews/20210529003827-260407 为例。


【2x00】逻辑分析

【2x01】第一页

在页面的 Facebook 评论插件位置右键查看框架源代码,我们就可以看到第一页评论页面的源码,直接访问这个 URL 就可以看到评论信息。

在这里插入图片描述
这个页面的 URL 为:https://www.facebook.com/plugins/feedback.php?app_id=1379575469016080&channel=https%3A%2F%2Fstaticxx.facebook.com%2Fx%2Fconnect%2Fxd_arbiter%2F%3Fversion%3D46%23cb%3Df22d8c81d4ce144%26domain%3Dwww.chinatimes.com%26origin%3Dhttps%253A%252F%252Fwww.chinatimes.com%252Ff5f738a4fa595%26relation%3Dparent.parent&container_width=924&height=100&href=https%3A%2F%2Fwww.chinatimes.com%2Frealtimenews%2F20210529003827-260407&locale=zh_TW&numposts=5&order_by=reverse_time&sdk=joey&version=v3.2&width

我们将其格式化后,得到有以下参数:

https://www.facebook.com/plugins/feedback.php?
app_id: 1379575469016080
channel: https://staticxx.facebook.com/x/connect/xd_arbiter/?version=46#cb=f22d8c81d4ce144
domain: www.chinatimes.com
origin: https%3A%2F%2Fwww.chinatimes.com%2Ff5f738a4fa595
relation: parent.parent
container_width: 924
height: 100
href: https://www.chinatimes.com/realtimenews/20210529003827-260407
locale: zh_TW
numposts: 5
order_by: reverse_time
sdk: joey
version: v3.2
width

以上参数中,app_id 需要我们去获取,domain 为该网站的域名,href 为该页面的 URL,剩下的其他参数经测试,对结果无影响,可直接复制过去。

直接在原页面搜索 app_id 的值,可以发现有个 meta 标签里面有这个值,直接使用 Xpath 匹配即可,注意,经过测试,部分使用了这个插件的页面是没有 app_id 的,不需要这个值也能获取,所以要注意报错处理。

try:
    app_id = content.xpath('//meta[@property="fb:app_id"]/@content')[0]
except IndexError:
    pass

在这里插入图片描述

对于第一页的所有评论,我们搜索评论文字的 Unicode 编码,可以在 response 中找到对应内容,直接将包含评论信息的这一段提取出来即可。

在这里插入图片描述


【2x02】下一页

点击载入其他留言,可以看到新的请求,类似于:https://www.facebook.com/plugins/comments/async/4045370158886862/pager/reverse_time/,请求方式为 post。URL 中 async 后面的一串数字为 targetID,可以在请求返回的数据中获取。

在这里插入图片描述
在这里插入图片描述
Form data 如下:

app_id: 1379575469016080
after_cursor: AQHReYdcksX9wFZEKA3MgNmN8PCRr7N3tFfZZuIKpCKnIuv-SxCycw4uZ1LqhtMr7RVkGyqACNdpkd9uJJ1jk6ne9g
limit: 10
__user: 0
__a: 1
__dyn: 7xe6EgU4e1QyUbFp62-m1FwAxu13wKxK7Emy8W3q322aewTwl8eU4K3a3W1DwUx60Vo1upE4W0LEK1pwo8swaq1xwEwhU1382gKi8wnU1e42C0BE1co3rw9O0RE5a1qw8W0b1w
__csr: 
__req: 1
__hs: 18777.PHASED:plugin_feedback_pkg.2.0.0.0
dpr: 1
__ccg: EXCELLENT
__rev: 1003879025
__s: ::lw3b8e
__hsi: 6968076253228168178
__comet_req: 0
locale: zh_TW
lsd: AVp5kXcGShk
jazoest: 2975
__sp: 1

app_id 和前面一样,after_cursor 的值通过搜索可以在上一页评论数据里面找到,换句话说,这一页的数据里面包含一个 after_cursor 的值,这个值是下一页请求 Form data 里面的参数。经测试其他参数的值不影响最终结果。

在这里插入图片描述


【2x03】回复别人的评论

回复别人的评论分为两种,第一种是直接可以看到的,第二种是需要点击“更多回复”才能看到的。第一种可以直接获取,第二种需要再次发送新的请求才能获取,新的请求的 URL 类似于:https://www.facebook.com/plugins/comments/async/comment/4045370158886862_4046939882063223/pager/ ,请求方式和下一页的请求方式一样,其中 URL comment 后面的一串数字仍然是 targetID, Form data 里的 after_cursor 参数可以在楼主的评论数据里面获取。


【3x00】完整代码

完整代码 Github 地址(点亮 star 有 buff 加成):
https://github.com/TRHX/Python3-Spider-Practice/tree/master/facebook-comments

# ====================================
# --*-- coding: utf-8 --*--
# @Time    : 2021-05-30
# @Author  : TRHX • 鲍勃
# @Blog    : www.itrhx.com
# @CSDN    : itrhx.blog.csdn.net
# @FileName: facebook.py
# @Software: PyCharm
# ====================================


import requests
import json
import time
from lxml import etree


# ==============================  测试链接  ============================== #
# https://www.chinatimes.com/realtimenews/20210529003827-260407
# https://tw.appledaily.com/life/20210530/IETG7L3VMBA57OD45OC5KFTCPQ/
# https://www.nownews.com/news/5281470
# https://www.thejakartapost.com/life/2019/06/03/how-to-lose-belly-fat-in-seven-days.html
# https://mcnews.cc/p/25224
# https://news.ltn.com.tw/news/world/breakingnews/3550262
# https://www.npf.org.tw/1/15857
# https://news.pts.org.tw/article/528425
# https://news.tvbs.com.tw/life/1518745
# ==============================  测试链接  ============================== #


PAGE_URL = 'https://www.chinatimes.com/realtimenews/20210529003827-260407'
PROXIES = {'http': 'http://127.0.0.1:10809', 'https': 'http://127.0.0.1:10809'}
# PROXIES = None # 如果不需要代理则设置为 None


class FacebookComment:
    def __init__(self):
        self.json_name = 'facebook_comments.json'
        self.domain = PAGE_URL.split('/')[2]
        self.iframe_referer = 'https://{}/'.format(self.domain)
        self.user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36'
        self.channel_base_url = 'https%3A%2F%2Fstaticxx.facebook.com%2Fx%2Fconnect%2Fxd_arbiter%2F%3Fversion%3D46%23cb%3Df17861604189654%26domain%3D{}%26origin%3Dhttps%253A%252F%252F{}%252Ff9bd3e89788d7%26relation%3Dparent.parent'
        self.referer_base_url = 'https://www.facebook.com/plugins/feedback.php?app_id={}&channel={}&container_width=924&height=100&href={}&locale=zh_TW&numposts=5&order_by=reverse_time&sdk=joey&version=v3.2&width'
        self.comment_base_url = 'https://www.facebook.com/plugins/comments/async/{}/pager/reverse_time/'
        self.reply_base_url = 'https://www.facebook.com/plugins/comments/async/comment/{}/pager/'
        self.target_id = ''
        self.referer = ''
        self.app_id = ''

    @staticmethod
    def find_value(html: str, key: str, num_chars: int, separator: str) -> str:
        pos_begin = html.find(key) + len(key) + num_chars
        pos_end = html.find(separator, pos_begin)
        return html[pos_begin: pos_end]

    @staticmethod
    def save_comment(filename: str, information: json) -> None:
        with open(filename, 'a+', encoding='utf-8') as f:
            f.write(information + '\n')

    def get_app_id(self) -> None:
        headers = {'user-agent': self.user_agent}
        response = requests.get(url=PAGE_URL, headers=headers, proxies=PROXIES)
        html = response.text
        content = etree.HTML(html)
        try:
            app_id = content.xpath('//meta[@property="fb:app_id"]/@content')[0]
            self.app_id = app_id
        except IndexError:
            pass

    def get_first_parameter(self) -> str:
        channel_url = self.channel_base_url.format(self.domain, self.domain)
        referer_url = self.referer_base_url.format(self.app_id, channel_url, PAGE_URL)
        headers = {
            'authority': 'www.facebook.com',
            'upgrade-insecure-requests': '1',
            'user-agent': self.user_agent,
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'sec-fetch-site': 'cross-site',
            'sec-fetch-mode': 'navigate',
            'sec-fetch-dest': 'iframe',
            'referer': self.iframe_referer,
            'accept-language': 'zh-CN,zh;q=0.9'
        }
        response = requests.get(url=referer_url, headers=headers, proxies=PROXIES)
        data = response.text

        after_cursor = self.find_value(data, "afterCursor", 3, separator='"')
        target_id = self.find_value(data, "targetID", 3, separator='"')
        # rev = find_value(data, "consistency", 9, separator='}')

        # 提取并保存最开始的评论
        tree = etree.HTML(data)
        script = tree.xpath('//body/script[last()]/text()')[0]
        html_begin = script.find('"comments":') + len('"comments":')
        html_end = script.find('"meta"')
        result = script[html_begin:html_end].strip()
        result_dict = json.loads(result[:-1])
        comment_type = 'first'
        self.processing_comment(result_dict, comment_type)
        self.target_id = target_id
        self.referer = referer_url
        return after_cursor

    def get_comment(self, after_cursor: str, comment_url: str) -> None:
        """
        :param after_cursor: 字符串,下一页的 cursor
        :param comment_url: 字符串,评论页面的 URL
        :return: None
        """
        num = 1
        while after_cursor:
            post_data = {
                'app_id': self.app_id,
                'after_cursor': after_cursor,
                'limit': 10,
                'iframe_referer': self.iframe_referer,
                '__user': 0,
                '__a': 1,
                '__dyn': '7xe6EgU4e3W3mbG2KmhwRwqo98nwgUbErxW5EyewSwMwyzEdU5i3K1bwOw-wpUe8hwem0nCq1ewbWbwmo62782CwOwKwEwhU1382gKi8wl8G0jx0Fw9q0B82swdK0D83mwkE5G0zE16o',
                '__csr': '',
                '__req': num,
                '__beoa': 0,
                '__pc': 'PHASED:plugin_feedback_pkg',
                'dpr': 1,
                '__ccg': 'GOOD',
                # '__rev': rev,
                # '__s': ':mfgzaz:f4if6y',
                # '__hsi': '6899699958141806572',
                '__comet_req': 0,
                'locale': 'zh_TW',
                # 'jazoest': '22012',
                '__sp': 1
            }
            headers = {
                'user-agent': self.user_agent,
                'content-type': 'application/x-www-form-urlencoded',
                'accept': '*/*',
                'origin': 'https://www.facebook.com',
                'sec-fetch-site': 'same-origin',
                'sec-fetch-mode': 'cors',
                'sec-fetch-dest': 'empty',
                'referer': self.referer,
                'accept-language': 'zh-CN,zh;q=0.9'
            }
            response = requests.post(url=comment_url, headers=headers, proxies=PROXIES, data=post_data)
            data = response.text

            if 'xml version' in data:
                html_data = data.split('\n', 1)[1]
            else:
                html_data = data
            if 'for (;;);' in html_data:
                json_text = html_data.split("for (;;);")[1]
                json_dict = json.loads(json_text)
                # print(json_dict)
                comment_type = 'second'
                self.processing_comment(json_dict, comment_type)
                try:
                    after_cursor = json_dict['payload']['afterCursor']
                except KeyError:
                    after_cursor = False
                # try:
                #     rev = json_dict['hsrp']['hblp']['consistency']['rev']
                # except KeyError:
                #     rev = ''
            else:
                after_cursor = False
            num += 1

    def processing_comment(self, comment_dict: dict, comment_type: str) -> None:
        """
        :param comment_dict: 字典,所有评论信息,不同页面传来的数据可能结构不一样
        :param comment_type: 字符串,用来标记第一页和非第一页的评论
        :return: None
        """
        try:
            comment_dict = comment_dict['payload']
        except KeyError:
            comment_dict = comment_dict

        # 如果为 first,表示是第一页评论,则全部储存,否则要去掉重复的第一个
        if comment_type == 'first':
            comment_ids = comment_dict['commentIDs']
        else:
            comment_ids = comment_dict['commentIDs'][1:]

        # 第一次储存,储存所有一级评论
        self.extract_comment(comment_dict, comment_ids)

    def extract_comment(self, comment_dict: dict, comment_ids: list) -> None:
        """
        :param comment_dict: 字典,所有的评论信息
        :param comment_ids: 列表,所有评论的 ID
        :return: None
        """
        for i in range(len(comment_ids)):

            # ==================  info  ================== #
            crawl_timestamp = int(time.time())
            crawl_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())

            # =================  comment  ================ #
            comment = comment_dict['idMap'][comment_ids[i]]
            comment_id = comment_ids[i]
            target_id = comment['targetID']
            created_timestamp = comment['timestamp']['time']
            created_time_text = comment['timestamp']['text']
            created_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(float(created_timestamp)))
            comment_type = comment['type']
            ranges = comment['ranges']
            like_count = comment['likeCount']
            has_liked = comment['hasLiked']
            can_like = comment['canLike']
            can_edit = comment['canEdit']
            hidden = comment['hidden']
            high_lighted_words = comment['highlightedWords']
            spam_count = comment['spamCount']
            can_embed = comment['canEmbed']
            try:
                reply_count = comment['public_replies']['totalCount']
            except KeyError:
                reply_count = 0
            report_uri = 'https://www.facebook.com' + comment['reportURI']
            content = comment['body']['text']

            # =================  author  ================= #
            author_id = comment['authorID']
            author = comment_dict['idMap'][author_id]
            author_name = author['name']
            thumb_src = author['thumbSrc']
            uri = author['uri']
            is_verified = author['isVerified']
            author_type = author['type']

            comment_result_dict = {
                'info': {
                    'pageURL': PAGE_URL,                      # 原始页面链接
                    'crawlTimestamp': crawl_timestamp,        # 爬取时间戳
                    'crawlTime': crawl_time                   # 爬取时间
                },
                'comment': {
                    'type': comment_type,                     # 类型
                    'commentID': comment_id,                  # 评论 ID
                    'targetID': target_id,                    # 目标 ID,若为回复 A 的评论,则其值为 A 的评论 ID
                    'createdTimestamp': created_timestamp,    # 评论时间戳
                    'createdTime': created_time,              # 评论时间
                    'createdTimeText': created_time_text,     # 评论时间(年月日)
                    'likeCount': like_count,                  # 该条评论获得的点赞数
                    'replyCount': reply_count,                # 该条评论下的回复数
                    'spamCount': spam_count,                  # 该条评论被标记为垃圾信息的次数
                    'hasLiked': has_liked,                    # 该条评论是否被你点赞过
                    'canLike': can_like,                      # 该条评论是否可以被点赞
                    'canEdit': can_edit,                      # 该条评论是否可以被编辑
                    'hidden': hidden,                         # 该条评论是否被隐藏
                    'canEmbed': can_embed,                    # 该条评论是否可以被嵌入到其他网页
                    'ranges': ranges,                         # 不知道啥含义
                    'highLightedWords': high_lighted_words,   # 该条评论被高亮的单词
                    'reportURI': report_uri,                  # 举报该条评论的链接
                    'content': content,                       # 该条评论的内容
                },
                'author': {
                    'type': author_type,                      # 类型
                    'authorID': author_id,                    # 该条评论作者的 ID
                    'authorName': author_name,                # 该条评论作者的用户名
                    'isVerified': is_verified,                # 该条评论作者是否已认证过
                    'uri': uri,                               # 该条评论作者的 facebook 主页
                    'thumbSrc': thumb_src                     # 该条评论作者的头像链接
                }
            }

            print(comment_result_dict)
            self.save_comment(self.json_name, json.dumps(comment_result_dict, ensure_ascii=False))

            # 第二次储存,储存所有二级评论(回复别人的评论,且不用点击“更多回复”就能看见的评论)
            # 判断依据,是否存在 commentIDs
            try:
                reply_ids = comment['public_replies']['commentIDs']
                self.extract_comment(comment_dict, reply_ids)
            except KeyError:
                pass

            # 第三次储存,储存所有三级评论(回复别人的评论,但是需要点击“更多回复”才能看见的评论)
            # 判断依据,是否存在 afterCursor
            try:
                reply_after_cursor = comment['public_replies']['afterCursor']
                reply_id = comment_ids[i]
                reply_url = self.reply_base_url.format(reply_id)
                self.get_comment(reply_after_cursor, reply_url)
            except KeyError:
                pass

    def run(self) -> None:
        self.get_app_id()
        after_cursor = self.get_first_parameter()
        if len(after_cursor) < 20:
            print('\n{} 评论采集完毕!'.format(PAGE_URL))
        else:
            comment_url = self.comment_base_url.format(self.target_id)
            self.get_comment(after_cursor, comment_url)
            print('\n{} 评论采集完毕!'.format(PAGE_URL))


if __name__ == '__main__':
    FC = FacebookComment()
    FC.run()

【4x00】数据截图

在这里插入图片描述
在这里插入图片描述

  • 3
    点赞
  • 5
    评论
  • 1
    收藏
  • 一键三连
    一键三连
  • 扫一扫,分享海报

相关推荐
©️2020 CSDN 皮肤主题: 数字20 设计师:CSDN官方博客 返回首页
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、C币套餐、付费专栏及课程。

余额充值