MANKA の Blog
How to Scrape Twitter (GraphQL)

2021-05-12

#Twitter#Twitter Monitor#Twitter Graphql#Twitter Api

Continued from How to Scrape Twitter

Current status of commonly used Twitter endpoints

Name | RESTful | GraphQL | Notes
UserInfo | o | o |
Search | o | ? | I recall Search briefly using GraphQL, but I'm not sure
Timeline | x | o | RESTful returns 429 endlessly
Status | o | o |

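Twitter has been mixing its GraphQL API (hereafter "graphql") with its RESTful API (hereafter "restful" or "rest") for a long time. When I first wrote this post only the timeline had been switched over, but Twitter has since been gradually tinkering with threads, user info and... NFT avatar info, so I think graphql will replace restful sooner or later. I also recently re-crawled all tweet data for Twitter Monitor and fixed quite a few old bugs along the way. Meanwhile the restful timeline started returning 429 endlessly, and none of the issues I dug through had an answer, so I figure it's time to prepare for the migration.

So I started putting this post in order.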

RATE LIMIT

Type | Requests | Notes
UserByRestId | 500 |
UserByScreenName | 1000 |
UserTweets | 1000 |
TweetDetail | 1000 | conversation

UserInfo

The functions here come with multi_curl built in, so I kept the script simple.

<?php
require(__DIR__ . '/init.php');
$fetch = new Tmv2\Fetch\Fetch();
$token = $fetch->tw_get_token();
$count = 0;
$change_count = 0;
$tmp = array_fill(0, 50, 783214);//"twitter"
$end = false;
for(;;) {
    $users = $fetch->tw_get_userinfo($tmp, $token);
    foreach ($users as $user) {
        if ($user === NULL || isset($user["errors"])) {
            $token = $fetch->tw_get_token();
            $change_count++;
            echo "change token $change_count: -->" . $token[1] . "<--\n";
            break;
        }
        $count++;
        $tmpInfo = path_to_array("user_info_legacy", $user);
        echo "-->" . $count . ' '. $tmpInfo["name"] .' ('. $tmpInfo["screen_name"] .")<--\n";
    }
}
//UserByRestId
//-->498 Twitter (Twitter)<--
//-->499 Twitter (Twitter)<--
//-->500 Twitter (Twitter)<--
//change token 1: -->1495801929439535111<--
//UserByScreenName
//-->998 Twitter (Twitter)<--
//-->999 Twitter (Twitter)<--
//-->1000 Twitter (Twitter)<--
//change token 1: -->1495802088294690816<--

TimeLine

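The GraphQL endpoint is fairly slow (my guess is that whatever optimization the generation process has can't keep up with mixing large amounts of data), so looping in a single thread takes a long time. I wrote a script that requests the latest 100 tweets with 10 concurrent requests to try to find the limit.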

<?php
require(__DIR__ . '/init.php');// this is Twitter Monitor's init.php
$fetch = new Tmv2\Fetch\Fetch();
$token = $fetch->tw_get_token();
$count = 0;
$tweet_count = 0;
$graphqlObject = [
    "userId" => 783214,
    "count" => 100,
    "withHighlightedLabel" => true,
    "withTweetQuoteCount" => true,
    "withQuickPromoteEligibilityTweetFields" => true,
    "withSuperFollowsUserFields" => true,
    "withSuperFollowsTweetFields" => true,
    "withDownvotePerspective" => false,
    "withReactionsMetadata" => false,
    "includePromotedContent" => true,
    "withReactionsPerspective" => false,
    "withTweetResult" => false,
    "withReactions" => false,
    "withUserResults" => false,
    "withVoice" => true,
    "withNonLegacyCard" => true,
    "withBirdwatchPivots" => false,
    "withV2Timeline" => false
];
// array_fill(start, count, value): 10 copies of the same URL, i.e. 10 concurrent requests per round
$tmp = array_fill(0, 10, "https://twitter.com/i/api/graphql/" . queryhqlQueryIdList["UserTweets"]["queryId"] . "/UserTweets?variables=" . urlencode(json_encode($graphqlObject)));
$end = false;
for(;;) {
    $tweets = $fetch->tw_fetch_multi($tmp, $token);
    foreach ($tweets as $tweet) {
        $generateTweetData = new Tmv2\Core\Core($tweet, true, [], false);
        echo "-->" . $count .'-' . $tweet_count . ' '. $generateTweetData->cursor["top"] .' '. $generateTweetData->cursor["bottom"] ."<--\n";
        if ($generateTweetData->errors[0] !== 0) {
            $end = true;
            break;
        }
        $tweet_count += 100;
        $count++;
    }
    if ($end) {
        break;
    }
}
// It's throwaway code, so who cares about performance or elegance; it only has to run
// Output
//...
//-->997-99700 HCaAgIDEm/+RuSkAAA== HBaQgLnJntzU4CUAAA==<--
//-->998-99800 HCaAgICkmv+RuSkAAA== HBaQgLnJntzU4CUAAA==<--
//-->999-99900  <--

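It turns out that 998 is the last request that still shows a cursor; at 999 it is gone. The displayed counter starts from 0, though, so for now I take the Timeline limit to be 999 requests. Beyond that you need a new guest-token. Changing guest-tokens too often may itself trigger 429, at which point you should consider a proxy pool, multiple IPs, a distributed setup, or something similar.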

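Token pool

A guest-token is limited both in number of uses and in lifetime (10800 s), so building a token pool is feasible. I'm working on one and will fill in this section once it is done.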
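Until that section is filled in, here is a minimal sketch of what such a pool could look like. It is not the Twitter Monitor implementation: the class name GuestTokenPool, its methods and the injectable clock are all hypothetical, the default of 1000 uses is simply taken from the rate-limit table above, and actually fetching tokens is left out.

```javascript
// Hypothetical guest-token pool: tracks how often each token was used and
// when it was issued, and hands out one that is still usable.
class GuestTokenPool {
  constructor({ maxUses = 1000, ttlSeconds = 10800, now = () => Date.now() } = {}) {
    this.maxUses = maxUses;
    this.ttlMs = ttlSeconds * 1000;
    this.now = now; // injectable clock, handy for testing
    this.tokens = [];
  }

  // Register a freshly fetched guest-token.
  add(token) {
    this.tokens.push({ token, uses: 0, issuedAt: this.now() });
  }

  // Return a usable token and count one use, or null if none is left.
  take() {
    const t = this.now();
    // Drop tokens that are used up or expired.
    this.tokens = this.tokens.filter(
      (e) => e.uses < this.maxUses && t - e.issuedAt < this.ttlMs
    );
    if (this.tokens.length === 0) return null;
    const entry = this.tokens[0];
    entry.uses += 1;
    return entry.token;
  }
}
```

When take() returns null the caller would fetch a new guest-token and add() it. Keeping several tokens in reserve spreads out those fetches, which matters because requesting tokens too often may itself lead to 429.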

queryId

These ids still live in the main file (main.*.js); the following script can serve as a reference for extracting them:

<?php
preg_match('/https:\/\/abs\.twimg\.com\/responsive-web\/client-web([^\/]+|)\/main\.[^.]+\.js/', file_get_contents("https://twitter.com/"), $link);//get js
$jsString = ($link[0]??"");
if ($jsString != "") {
    preg_match_all('/{queryId:"([^"]+)",operationName:"([^"]+)",operationType:"([^"]+)"/', file_get_contents($jsString), $queryIdList);
    $list = [];
    for ($x = 0; $x < count($queryIdList[0]); $x++) {
        $list[$queryIdList[2][$x]] = [
            "queryId" => $queryIdList[1][$x],
            "operationName" => $queryIdList[2][$x],
            "operationType" => $queryIdList[3][$x],
        ];
    }
    file_put_contents(__DIR__ . '/graphqlQueryIdList.json', json_encode($list));
}
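For anyone not working in PHP, the extraction step can be sketched in JavaScript roughly as follows. It uses the same regular expression on text that has already been downloaded; the function name extractQueryIds and the sample string are made up for illustration.

```javascript
// Extract {queryId, operationName, operationType} triples from the text of
// main.*.js, keyed by operationName. Downloading the file is left out.
function extractQueryIds(jsText) {
  const pattern = /\{queryId:"([^"]+)",operationName:"([^"]+)",operationType:"([^"]+)"/g;
  const list = {};
  for (const [, queryId, operationName, operationType] of jsText.matchAll(pattern)) {
    list[operationName] = { queryId, operationName, operationType };
  }
  return list;
}

// Made-up sample that only imitates the shape found in the bundle.
const sample =
  'e.exports={queryId:"LNhjy8t3XpIrBYM-ms7sPQ",operationName:"UserTweets",operationType:"query",metadata:{}}';
console.log(extractQueryIds(sample));
```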

The list is quite long, so I only list the few that Twitter Monitor needs; figure out what the others are good for yourself.

{
  "UserByRestId": {
    "queryId": "I5nvpI91ljifos1Y3Lltyg",
    "operationName": "UserByRestId",
    "operationType": "query"
  },
  "UserByScreenName": {
    "queryId": "7mjxD3-C6BxitPMVQ6w0-Q",
    "operationName": "UserByScreenName",
    "operationType": "query"
  },
  "UserTweets": {
    "queryId": "LNhjy8t3XpIrBYM-ms7sPQ",
    "operationName": "UserTweets",
    "operationType": "query"
  },
  "UserTweetsAndReplies": {
    "queryId": "Vg5aF036K40ST3FWvnvRGA",
    "operationName": "UserTweetsAndReplies",
    "operationType": "query"
  },
  "TweetDetail": {
    "queryId": "bRL1YYMraLIBpo1PGLeFcw",
    "operationName": "TweetDetail",
    "operationType": "query"
  }
}

The URL is assembled in the following format:

let url = `https://twitter.com/i/api/graphql/${queryId}/${operationName}/?variables=` + encodeURIComponent(JSON.stringify(Variables))
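Wrapped into a reusable helper, the same concatenation might look like the sketch below. The function name buildGraphqlUrl is hypothetical, the optional features argument mirrors the &features= parameter of the timeline request in this post, and the helper follows the form without a slash before the query string, as the request URLs in this post do.

```javascript
// Build a GraphQL endpoint URL from a queryId, an operation name and a
// variables object. features is optional and appended only when given.
function buildGraphqlUrl(queryId, operationName, variables, features) {
  let url =
    `https://twitter.com/i/api/graphql/${queryId}/${operationName}` +
    `?variables=${encodeURIComponent(JSON.stringify(variables))}`;
  if (features !== undefined) {
    url += `&features=${encodeURIComponent(JSON.stringify(features))}`;
  }
  return url;
}

const exampleUrl = buildGraphqlUrl("7mjxD3-C6BxitPMVQ6w0-Q", "UserByScreenName", {
  screen_name: "Twitter",
});
console.log(exampleUrl);
```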

These queryIds may be updated or removed, but so far I haven't seen any ill effects from using an old queryId.

Items marked with * are optional.

* csrf-token

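First off, I honestly don't know under what circumstances this is enforced; my guess is that it is only needed after logging in. It is not required, and it is generated locally.

//ct0 in cookie
//x-csrf-token in header
const t = (() => {
  const e = window.crypto || window.msCrypto;
  if (!e) return;
  const t = new Uint8Array(32);
  e.getRandomValues(t);
  let n = "";
  for (let e = 0; e < t.length; e++) n += t[e].toString(16).substr(-1);
  return n
})();

Judging from the result... isn't it just a 32-character random string? So I simply use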

echo md5(time());

Yes, cookies are set on the first visit, but none of them are required. I'll just leave a pattern here: /set-cookie: ([^;]+);/

guest_id_marketing: v1%3A164301325110776087
guest_id_ads: v1%3A164301325110776087
personalization_id: "v1_FBBNMaLDB1sdu2yWcCdHIQ=="
guest_id: v1%3A164301325110776087
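As a rough illustration of how that pattern could be applied, the sketch below runs it globally and case-insensitively over raw response headers and collects the name/value pairs. The function name parseSetCookies is hypothetical, and the cookie attributes in the sample (Max-Age, Path, Secure) are invented for the example.

```javascript
// Collect cookies from raw response headers using the pattern above.
function parseSetCookies(rawHeaders) {
  const cookies = {};
  for (const [, pair] of rawHeaders.matchAll(/set-cookie: ([^;]+);/gi)) {
    const idx = pair.indexOf("="); // split on the first "=" only
    if (idx === -1) continue;
    cookies[pair.slice(0, idx)] = pair.slice(idx + 1);
  }
  return cookies;
}

// Sample headers; the attributes after the first ";" are made up.
const rawHeaders = [
  "HTTP/2 200",
  "set-cookie: guest_id=v1%3A164301325110776087; Max-Age=63072000; Path=/; Secure",
  'set-cookie: personalization_id="v1_FBBNMaLDB1sdu2yWcCdHIQ=="; Max-Age=63072000; Path=/',
].join("\r\n");
console.log(parseSetCookies(rawHeaders));
```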

guest-token

  • Via
    curl 'https://twitter.com' --compressed

    The page returned contains the following lines, which assign the guest-token; it is the gt value
    <script nonce="MDRjZmJlNWItYWNmOC00MTdiLWIxYjUtYTFhZTUyYTc2ODg4">  document.cookie = decodeURIComponent("gt=1232704521454999999; Max-Age=10800;  Domain=.twitter.com; Path=/; Secure");</script>
  • curl 'https://api.twitter.com/1.1/guest/activate.json' \
      -X 'POST' \
      -H 'authorization: Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA' \
      --compressed

    This method also gets you the cookies listed above along the way
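For the first method, pulling gt out of the returned page could be sketched like this. The function name extractGuestToken is hypothetical; the token is kept as a string because it is larger than JavaScript's safe integer range.

```javascript
// Pull the guest-token out of the inline script that sets the gt cookie.
// Returns the token as a string, or null when the page does not contain it.
function extractGuestToken(html) {
  const m = html.match(/gt=(\d+);/);
  return m ? m[1] : null;
}

const page =
  '<script nonce="...">  document.cookie = decodeURIComponent("gt=1232704521454999999; Max-Age=10800;  Domain=.twitter.com; Path=/; Secure");</script>';
console.log(extractGuestToken(page));
```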

authorization

Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA

I have never seen this thing change.

User info

Request

  • Method: GET
  • URL:
    • by screen name https://twitter.com/i/api/graphql/7mjxD3-C6BxitPMVQ6w0-Q/UserByScreenName?variables={VARIABLES}
    • by user id https://twitter.com/i/api/graphql/I5nvpI91ljifos1Y3Lltyg/UserByRestId?variables={VARIABLES}
      • VARIABLES:
          {
            "screen_name": "USER_SCREEN_NAME",//by screen name
            "withSafetyModeUserFields": true,
            "withSuperFollowsUserFields": true
          }
          {
            "userId": "USER_ID",//by user id
            "withSafetyModeUserFields": true,
            "withSuperFollowsUserFields": true
          }
  • Headers:
    • Content-Type: application/json
    • x-guest-token: 1232704521454999999
    • authorization: Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA
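The headers above are the same for every request in this post, so they can be produced by one small helper. This is only a sketch: the function name guestHeaders is hypothetical, and the optional csrf handling simply follows the "ct0 in cookie, x-csrf-token in header" note from the csrf-token section.

```javascript
const BEARER =
  "Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA";

// Build the request headers for a guest (not logged in) GraphQL call.
// csrfToken is optional; when given it is sent both as header and as ct0 cookie.
function guestHeaders(guestToken, csrfToken) {
  const headers = {
    "Content-Type": "application/json",
    "x-guest-token": guestToken,
    authorization: BEARER,
  };
  if (csrfToken) {
    headers["x-csrf-token"] = csrfToken;
    headers["cookie"] = `ct0=${csrfToken}`;
  }
  return headers;
}

console.log(guestHeaders("1232704521454999999"));
```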

Response

  • Body
    • success
    {
      "data": {
        "user": {
          "result": {
            "__typename": "User",
            "id": "VXNlcjo3ODMyMTQ=",
            "rest_id": "783214",
            "affiliates_highlighted_label": {},
            "has_nft_avatar": true,// NFT avatars have a hexagonal border
            "legacy": {
              "blocked_by": false,
              "blocking": false,
              "can_dm": false,
              "can_media_tag": true,
              "created_at": "Tue Feb 20 14:35:54 +0000 2007",
              "default_profile": false,
              "default_profile_image": false,
              "description": "What's happening?!",
              "entities": {
                "description": { "urls": []},
                "url": {
                  "urls": [
                    {
                      "display_url": "about.twitter.com",
                      "expanded_url": "https://about.twitter.com/",
                      "url": "https://t.co/DAtOo6uuHk",
                      "indices": [0, 23]
                    }
                  ]
                }
              },
              "fast_followers_count": 0,
              "favourites_count": 6292,
              "follow_request_sent": false,
              "followed_by": false,
              "followers_count": 60784817,
              "following": false,
              "friends_count": 12,
              "has_custom_timelines": true,
              "is_translator": false,
              "listed_count": 87616,
              "location": "everywhere",
              "media_count": 2439,
              "muting": false,
              "name": "Twitter",
              "normal_followers_count": 60784817,
              "notifications": false,
              "pinned_tweet_ids_str": [],
              "profile_banner_extensions": {
                "mediaColor": {
                  "r": {
                    "ok": {
                      "palette": [
                        { "percentage": 65.52, "rgb": { "blue": 0, "green": 0, "red": 0 }},
                        { "percentage": 18.59, "rgb": { "blue": 221, "green": 144, "red": 6 }},
                        { "percentage": 10.43, "rgb": { "blue": 124, "green": 58, "red": 252 }},
                        { "percentage": 3.27, "rgb": { "blue": 105, "green": 69, "red": 1 }},
                        { "percentage": 0.69, "rgb": { "blue": 89, "green": 44, "red": 153 }}
                      ]
                    }
                  }
                }
              },
              "profile_banner_url": "https://pbs.twimg.com/profile_banners/783214/1642704439",
              "profile_image_extensions": {
                "mediaColor": {
                  "r": {
                    "ok": {
                      "palette": [
                        { "percentage": 71.78, "rgb": { "blue": 255, "green": 227, "red": 182 }},
                        { "percentage": 11.06, "rgb": { "blue": 255, "green": 192, "red": 90 }},
                        { "percentage": 7.59, "rgb": { "blue": 252, "green": 249, "red": 218 }},
                        { "percentage": 6.51, "rgb": { "blue": 25, "green": 23, "red": 16 }},
                        { "percentage": 0.35, "rgb": { "blue": 254, "green": 204, "red": 1 }}
                      ]
                    }
                  }
                }
              },
              "profile_image_url_https": "https://pbs.twimg.com/profile_images/1486805599367180290/Lp3amoqK_normal.jpg",
              "profile_interstitial_type": "",
              "protected": false,
              "screen_name": "Twitter",
              "statuses_count": 14967,
              "translator_type": "regular",
              "url": "https://t.co/DAtOo6uuHk",
              "verified": true,
              "want_retweets": false,
              "withheld_in_countries": []
            },
            "professional": {
              "rest_id": "1420110046596374541",
              "professional_type": "Business",
              "category": []
            },
            "smart_blocked_by": false,
            "smart_blocking": false,
            "super_follow_eligible": false,
            "super_followed_by": false,
            "super_following": false,
            "legacy_extended_profile": {
              "birthdate": { "day": 21, "month": 3, "visibility": "Public", "year_visibility": "Self" }
            },
            "is_profile_translatable": false
          }
        }
      }
    }
    • failure
      • the suspended @realDonaldTrump
        {
          "data": {
            "user": {
              "result": {
                "__typename": "UserUnavailable",
                "unavailable_message": {
                  "entities": [
                    {
                      "fromIndex": 28,
                      "toIndex": 32,
                      "ref": {
                        "type": "TimelineUrl",
                        "url": "https://help.twitter.com/rules-and-policies/twitter-rules",
                        "urlType": "ExternalUrl"
                      }
                    }
                  ],
                  "rtl": false,
                  "text": "Twitter 会冻结违反 Twitter 规则的账号。了解更多"
                },
                "reason": "Suspended"
              }
            }
          }
        }
      • a nonexistent account (I mashed the keyboard, so I won't say whose it is)
        { "data": {}}// nothing at all is returned for a nonexistent user
  • Compared with the old version almost nothing has changed; only two things need adjusting. Here is a before/after comparison:
    //rest api
    const userInfo = ...//fetch the info
    let id_str = userInfo.id_str
    let user_info = userInfo
    //GraphQL
    const userInfo = ...//fetch the info using the method above
    let id_str = userInfo.data.user.result.rest_id
    let user_info = userInfo.data.user.result.legacy
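Putting the three response shapes above together, a small normalizer could look like the following sketch. The function name parseUserResponse, the shape of the returned object and the "NotFound" label for the empty response are hypothetical choices for illustration.

```javascript
// Turn a UserByScreenName / UserByRestId response into one flat shape.
function parseUserResponse(json) {
  const result = json && json.data && json.data.user && json.data.user.result;
  // { "data": {} } : the account does not exist
  if (!result) return { ok: false, reason: "NotFound" };
  // e.g. __typename "UserUnavailable" with reason "Suspended"
  if (result.__typename !== "User") {
    return { ok: false, reason: result.reason || result.__typename };
  }
  return { ok: true, id_str: result.rest_id, user_info: result.legacy };
}

console.log(parseUserResponse({ data: {} }));
```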

Followers and following

Followers

  • Method: GET
  • URL: https://twitter.com/i/api/graphql/neVf0YKN1h09TFZr4D43MA/Followers?variables={VARIABLES}
    • VARIABLES:
      {
        "userId": "USER_ID",
        "count": 20,
        "includePromotedContent": false,
        "withSuperFollowsUserFields": true,
        "withDownvotePerspective": false,
        "withReactionsMetadata": false,
        "withReactionsPerspective": false,
        "withSuperFollowsTweetFields": true,
        "__fs_interactive_text": false,
        "__fs_responsive_web_uc_gql_enabled": false,
        "__fs_dont_mention_me_view_api_enabled": false
      }
  • Headers:
    • Content-Type: application/json
    • x-guest-token: 1232704521454999999
    • authorization: Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA

Tweet content

Timeline

  • Method: GET
  • URL: https://twitter.com/i/api/graphql/LNhjy8t3XpIrBYM-ms7sPQ/UserTweets?variables={VARIABLES}&features={FEATURES}
    • VARIABLES:
        {
          "userId": "USER_ID",
          "count": 20,// this value should not be too large or it causes a 503; Twitter Monitor's default maximum setting is 500
          "withHighlightedLabel": true,
          "withTweetQuoteCount": true,
          "includePromotedContent": true,
          "withTweetResult": false,
          "withReactions": false,
          "withUserResults": false,
          "withVoice": false,
          "withNonLegacyCard": true,
          "withBirdwatchPivots": false,
          "cursor": "CURSOR"
        }//TODO timeline_v2
    • FEATURES:
        {
          "dont_mention_me_view_api_enabled": true,
          "interactive_text_enabled": true,
          "responsive_web_uc_gql_enabled": false,
          "vibe_tweet_context_enabled": false,
          "responsive_web_edit_tweet_api_enabled": false,
          "standardized_nudges_misinfo": false,
          "responsive_web_enhance_cards_enabled": false
        }
  • Headers:
    • Content-Type: application/json
    • x-guest-token: 1232704521454999999
    • authorization: Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA

Response

  • Body
    • success
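    // too long, I don't feel like pasting it
    • failure
    {
      "data": {
        "user": {
          "result": {
            "__typename": "UserUnavailable"
          }
        }
      }
    }
  • count defaults to 20, with an upper limit of 100
  • cursor: how to obtain it is covered later. It is optional; when omitted, the most recent count tweets are returned
  • userId is the user's numeric UID, i.e. the rest_id above
  • The features parameter in the request has actually existed for a long time, but with the latest endpoint nothing is returned unless you include it...
  • The total number of tweets that can ultimately be retrieved is 850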
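The notes above translate fairly directly into a helper that prepares the variables object. This is a sketch only: the function name userTweetsVariables is hypothetical, the flag values are copied from the VARIABLES listing above, and clamping count to 100 is one possible way to respect the stated upper limit.

```javascript
// Prepare the variables object for a UserTweets request.
// count is clamped to 1..100; cursor is only included when provided.
function userTweetsVariables(userId, { count = 20, cursor } = {}) {
  const variables = {
    userId: String(userId),
    count: Math.min(Math.max(1, count), 100),
    withHighlightedLabel: true,
    withTweetQuoteCount: true,
    includePromotedContent: true,
    withTweetResult: false,
    withReactions: false,
    withUserResults: false,
    withVoice: false,
    withNonLegacyCard: true,
    withBirdwatchPivots: false,
  };
  if (cursor) variables.cursor = cursor;
  return variables;
}

console.log(userTweetsVariables(783214, { count: 500 }));
```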

Tweets

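There are a lot of structural changes here. Handling them the first time is tedious, but it is a one-time effort.

The sentence above turned out to be nonsense; in practice there have been even more silent changes.

The most obvious feature of this update is that globalObjects and timeline were merged.

In the new version all timeline information is in JSON.data.user.result.timeline.timeline.instructions[0].entries or JSON.data.user.result.timeline.timeline.instructions[1].entries, depending mainly on whether TimelineClearCache is present.

{
  "instructions": [
    { "type": "TimelineClearCache" },
    { "type": "TimelineAddEntries", "entries": [...] }
  ]
}
  • *TimelineClearCache is probably used to clear out nodes that are no longer needed, e.g. deleted tweets could be cleaned up through it. That is only my guess, since I haven't tried it
  • TimelineAddEntries: all information on the timeline is inside this node's entries node

Below, NODE refers to one node of JSON.data.user.result.timeline.timeline.instructions[1].entries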
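Since the index of the instruction holding the entries can be 0 or 1, one way to avoid hard-coding it is to look for the first instruction that actually carries an entries array. A minimal sketch, with the hypothetical function name timelineEntries:

```javascript
// Return the entries array of a UserTweets response, wherever it sits in
// the instructions list, or an empty array when there is none.
function timelineEntries(json) {
  const instructions =
    json?.data?.user?.result?.timeline?.timeline?.instructions ?? [];
  const withEntries = instructions.find((i) => Array.isArray(i.entries));
  return withEntries ? withEntries.entries : [];
}

console.log(timelineEntries({ data: {} }));
```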

cursor

The cursors used for refreshing upward and downward are still located in the last two NODE nodes.
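Reading them could be sketched as below. Note that this post does not show what a cursor node looks like, so the field names are assumptions on my part: the sketch assumes the value sits in content.value and that the node identifies itself through an entryId beginning with cursor-top or cursor-bottom. The function name extractCursors is hypothetical as well.

```javascript
// ASSUMPTION: cursor nodes carry their value in content.value and have an
// entryId starting with "cursor-top" / "cursor-bottom". Verify against a
// real response before relying on this.
function extractCursors(entries) {
  const cursors = { top: null, bottom: null };
  // Only the last two nodes are inspected, as described above.
  for (const node of entries.slice(-2)) {
    const id = node.entryId || "";
    const value = node.content ? node.content.value : undefined;
    if (id.startsWith("cursor-top")) cursors.top = value ?? null;
    if (id.startsWith("cursor-bottom")) cursors.bottom = value ?? null;
  }
  return cursors;
}

console.log(extractCursors([]));
```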

tweets

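Because everything was merged, a node is no longer necessarily a plain tweet. You need to check whether NODE.content.entryType equals TimelineTimelineItem; if it does not, the node may be some random user recommendation or an ad.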
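That check can be sketched as a simple filter. The function name isTimelineItem is hypothetical, and the second entryType in the sample is a made-up placeholder for "anything else".

```javascript
// True when the node is a timeline item (as opposed to recommendations,
// ads, cursors and so on).
function isTimelineItem(node) {
  return Boolean(
    node && node.content && node.content.entryType === "TimelineTimelineItem"
  );
}

// Usage: keep only the timeline items of an entries array.
const sampleNodes = [
  { entryId: "tweet-1", content: { entryType: "TimelineTimelineItem" } },
  { entryId: "other-1", content: { entryType: "SomethingElse" } }, // placeholder value
];
console.log(sampleNodes.filter(isTimelineItem).map((n) => n.entryId));
```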

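Below is where some commonly used components have moved:

  • Everything that used to be in a node of globalObjects.tweets has moved to NODE.content.itemContent.tweet_results.result.legacy, but tweet_id moved to NODE.content.itemContent.tweet_results.result.rest_id
  • User info, which used to be looked up in globalObjects.users, has moved to NODE.content.itemContent.tweet.core.user_results.result.legacy, and the user id moved to NODE.content.itemContent.tweet.core.user_results.result.rest_id
  • Retweets, which used to be treated as standalone tweets (in globalObjects.tweets), have moved to NODE.content.itemContent.tweet_results.result.legacy.retweeted_status_result.result
  • Quoted tweets moved from globalObjects.tweets to NODE.content.itemContent.tweet.quoted_status_result.result
  • The original tweet of a retweet moved to NODE.content.itemContent.tweet.legacy.retweeted_status.legacy. If you don't use the original tweet you lose all extended_entities content, and replacements of text such as hashtags and urls end up at the wrong positions (hold on with this one until I buy reading glasses and work out how it relates to the item above)
  • Media of a retweet moved to NODE.content.itemContent.tweet.legacy.retweeted_status.legacy.extended_entities.media (what a mess, let me sort it out)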
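As a rough sketch of how two of these paths (the tweet itself and the retweeted tweet, both under tweet_results.result) might be read, assuming the response matches the paths listed above. The function name extractTweet and the shape of the returned object are hypothetical.

```javascript
// Read the tweet (and, if present, the retweeted tweet) out of a NODE,
// using the tweet_results.result paths listed above.
function extractTweet(node) {
  const result = node?.content?.itemContent?.tweet_results?.result;
  if (!result || !result.legacy) return null; // not a tweet node
  const retweet = result.legacy.retweeted_status_result?.result ?? null;
  return {
    tweet_id: result.rest_id,
    tweet: result.legacy,
    retweeted: retweet
      ? { tweet_id: retweet.rest_id, tweet: retweet.legacy }
      : null,
  };
}

console.log(extractTweet({}));
```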

Cards

Cards moved to NODE.content.itemContent.tweet_results.result.card.legacy

I originally thought this would be complicated, but it doesn't actually need many changes. If you have handled this part before, you will notice that the card contents were moved into legacy, so binding_values can be converted back into the old key-value form:

// convert the Array back into an Object
$tmpBindingValueList = [];
foreach ($cardInfo["binding_values"] as $bindingValue) {
    $tmpBindingValueList[$bindingValue["key"]] = $bindingValue["value"];
}
$cardInfo["binding_values"] = $tmpBindingValueList;
// this is the code that converts to the graphql form
//$tmpList = [];
//foreach ($cardInfo["binding_values"] as $key => $value) {
//    $tmpList[] = ["key" => $key, "value" => $value];
//}
//$cardInfo["binding_values"] = $tmpList;

NSFW

Usually this only shows a warning on the image, but in some regions (Japan, for example) certain tweets are restricted in their entirety. According to Notices on Twitter and what they mean, tweets marked as adult content are restricted, but I don't yet understand why different regions have different standards. Here is an example for now; tweets like this generally cannot be fetched when not logged in.

{
  "entryId": "tombstone-1469626851568271362",
  "sortIndex": "1469626851568271362",
  "content": {
    "entryType": "TimelineTimelineItem",
    "itemContent": {
      "itemType": "TimelineTombstone",
      "tombstoneDisplayType": "Inline",
      "tombstoneInfo": {
        "text": "",
        "richText": {
          "rtl": false,
          "text": "年齢制限のある成人向けコンテンツです。このコンテンツは、18歳未満のユーザーには適切でない可能性があります。このメディアを表示するには、Twitterにログインしてください。詳細はこちら",
          "entities": [
            {
              "fromIndex": 76,
              "toIndex": 80,
              "ref": {"type": "TimelineUrl","url": "https://twitter.com","urlType": "ExternalUrl"}
            },
            {
              "fromIndex": 87,
              "toIndex": 93,
              "ref": {"type": "TimelineUrl","url": "https://help.twitter.com/rules-and-policies/notices-on-twitter","urlType": "ExternalUrl"}
            }
          ]
        }
      }
    }
  }
}
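Such nodes still pass the TimelineTimelineItem check, so they have to be recognized by their itemType. A small sketch based only on the fields visible in the example above, with the hypothetical function name tombstoneText:

```javascript
// Return the notice text of a tombstone node, or null when the node is
// not a tombstone. Prefers richText.text and falls back to text.
function tombstoneText(node) {
  const item = node?.content?.itemContent;
  if (!item || item.itemType !== "TimelineTombstone") return null;
  const info = item.tombstoneInfo || {};
  return (info.richText && info.richText.text) || info.text || null;
}

console.log(tombstoneText({ content: { itemContent: { itemType: "TimelineTombstone", tombstoneInfo: { text: "", richText: { text: "notice" } } } } }));
```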

Errors

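TODO: this section is pending an update

It used to be easy: you only had to check whether errors was present. Now you need to check whether data.user.result.timeline is missing, and the reason for the error appears in data.user.result.__typename

Twitter gets lazy, though; nowadays the error reason is basically always Something went wrong...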
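The check described in this section could be sketched as follows. The function name timelineError and the "Unknown" fallback label are hypothetical; only the two paths come from the text above.

```javascript
// Return null when the timeline is present, otherwise the value of
// data.user.result.__typename (or "Unknown" when even that is missing).
function timelineError(json) {
  const result = json?.data?.user?.result;
  if (result && result.timeline) return null; // no error
  return result && result.__typename ? result.__typename : "Unknown";
}

console.log(timelineError({ data: { user: { result: { __typename: "UserUnavailable" } } } }));
```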

Acknowledgements

  • Juicpt, for pointing out quite a few new changes
  • Everyone in the comments section
