Continued from: How to Scrape Twitter
Current status of commonly used Twitter endpoints
Name | RESTful | GraphQL | Notes |
---|---|---|---|
UserInfo | o | o | |
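Search | o | ? | I recall Search briefly used GraphQL, but I am not certain |
Timeline | x | o | The RESTful endpoint now returns 429 indefinitely |
Status | o | o | |
Twitter has been mixing its GraphQL API (hereafter "GraphQL") with the RESTful API (hereafter "RESTful" or "REST") for quite a while. When I first wrote this article only the timeline had been switched over, but Twitter has since started changing tweet details, user info, and even NFT avatar info, so I expect GraphQL to replace REST sooner or later. I recently re-crawled all the tweet data for Twitter Monitor and fixed a number of old bugs along the way. Around the same time the RESTful timeline started returning 429 indefinitely, and none of the issues I dug through had an answer, so I decided it was time to prepare for the migration.
So I started putting this article together.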
RATE LIMIT
Type | Requests | Notes |
---|---|---|
UserByRestId | 500 | |
UserByScreenName | 500 | |
UserTweets | 500 | |
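TweetDetail | 500 | i.e. conversation; this endpoint is needed to get poll results |
AudioSpaceById | 500 | |
BroadCast | 187 | An oddly specific number |
Search | 250 | * The search endpoint does not use GraphQL |
Recommendation | 60 | The "You might like" panel |
It looks like GraphQL endpoints are uniformly limited to 500 requests, and the window appears to have been cut from 3 hours to 15 minutes.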
UserInfo
The functions here already wrap multi_curl, so I kept the script simple.
<?php
require(__DIR__ . '/init.php');
$fetch = new Tmv2\Fetch\Fetch();
$token = $fetch->tw_get_token();
$count = 0;
$change_count = 0;
$tmp = array_fill(0, 50, 783214); // 50 copies of user ID 783214 (@Twitter)
$end = false;
for(;;) {
$users = $fetch->tw_get_userinfo($tmp, $token);
foreach ($users as $user) {
if ($user === NULL || isset($user["errors"])) {
$token = $fetch->tw_get_token();
$change_count++;
echo "change token $change_count: -->" . $token[1] . "<--\n";
break;
}
$count++;
$tmpInfo = path_to_array("user_info_legacy", $user);
echo "-->" . $count . ' '. $tmpInfo["name"] .' ('. $tmpInfo["screen_name"] .")<--\n";
}
}
//UserByRestId
//-->498 Twitter (Twitter)<--
//-->499 Twitter (Twitter)<--
//-->500 Twitter (Twitter)<--
//change token 1: -->1495801929439535111<--
//UserByScreenName
//-->998 Twitter (Twitter)<--
//-->999 Twitter (Twitter)<--
//-->1000 Twitter (Twitter)<--
//change token 1: -->1495802088294690816<--
TimeLine
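The GraphQL endpoint is rather slow (my guess is that whatever optimisation goes into generating the response cannot keep up with merging that much data), so a single-threaded loop takes a long time. I wrote a script that requests the latest 100 tweets with 10 concurrent requests to try to find the limit.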
<?php
require(__DIR__ . '/init.php'); // init.php from Twitter Monitor
$fetch = new Tmv2\Fetch\Fetch();
$token = $fetch->tw_get_token();
$count = 0;
$tweet_count = 0;
$graphqlObject = [
"userId" => 783214,
"count" => 100,
"withHighlightedLabel" => true,
"withTweetQuoteCount" => true,
"withQuickPromoteEligibilityTweetFields" => true,
"withSuperFollowsUserFields" => true,
"withSuperFollowsTweetFields" => true,
"withDownvotePerspective" => false,
"withReactionsMetadata" => false,
"includePromotedContent" => true,
"withReactionsPerspective" => false,
"withTweetResult" => false,
"withReactions" => false,
"withUserResults" => false,
"withVoice" => true,
"withNonLegacyCard" => true,
"withBirdwatchPivots" => false,
"withV2Timeline" => false
];
$tmp = array_fill(0, 10, "https://twitter.com/i/api/graphql/" . queryhqlQueryIdList["UserTweets"]["queryId"] . "/UserTweets?variables=" . urlencode(json_encode($graphqlObject))); // 10 identical URLs = 10 concurrent requests
$end = false;
for(;;) {
$tweets = $fetch->tw_fetch_multi($tmp, $token);
foreach ($tweets as $tweet) {
$generateTweetData = new Tmv2\Core\Core($tweet, true, [], false);
echo "-->" . $count .'-' . $tweet_count . ' '. $generateTweetData->cursor["top"] .' '. $generateTweetData->cursor["bottom"] ."<--\n";
if ($generateTweetData->errors[0] !== 0) {
$end = true;
break;
}
$tweet_count += 100;
$count++;
}
if ($end) {
break;
}
}
// Throwaway code: no point chasing performance or elegance, it only has to run
// Output
//...
//-->997-99700 HCaAgIDEm/+RuSkAAA== HBaQgLnJntzU4CUAAA==<--
//-->998-99800 HCaAgICkmv+RuSkAAA== HBaQgLnJntzU4CUAAA==<--
//-->999-99900 <--
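The cursor is last shown at 998 and is gone at 999. Since the counter starts at 0, I tentatively take the Timeline limit to be 999 requests; beyond that a new guest-token is needed. Swapping guest-tokens too often may itself trigger 429, at which point you need a proxy pool, multiple IPs, or a distributed setup.
Token pool
A guest-token is limited both in number of uses and in lifetime (10800 s), so a token pool is a workable approach. I am building one now and will fill in this section once it is done.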
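Until that section exists, here is a minimal sketch of what such a pool could look like. It is not the Twitter Monitor implementation: the class name `GuestTokenPool`, its methods, and the default budget of 500 uses per token are assumptions made for illustration (500 is borrowed from the rate limit table, 10800 s from the gt cookie lifetime). Fetching new tokens is left to the caller.

```javascript
// Hypothetical guest-token pool; names and limits are illustrative assumptions.
class GuestTokenPool {
  constructor({ maxUses = 500, ttlSeconds = 10800 } = {}) {
    this.maxUses = maxUses;         // assumed request budget per token
    this.ttlMs = ttlSeconds * 1000; // token lifetime in milliseconds
    this.tokens = [];
  }

  // Register a freshly obtained token together with its creation time.
  add(token, now = Date.now()) {
    this.tokens.push({ token, createdAt: now, uses: 0 });
  }

  // Return a token that is neither expired nor exhausted, or null if none is left.
  take(now = Date.now()) {
    // Drop tokens that have expired or used up their budget.
    this.tokens = this.tokens.filter(
      (t) => now - t.createdAt < this.ttlMs && t.uses < this.maxUses
    );
    const entry = this.tokens[0];
    if (!entry) return null;
    entry.uses += 1;
    return entry.token;
  }
}
```

When `take()` returns null, the caller would request a new guest-token and `add()` it before retrying.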
queryId
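These IDs still live in the main file.
The script that used to be shown here no longer works. For the current method see BANKA2017/twitter-monitor ~/apps/scripts/updateQueryIdList.mjs. If you port it to another language, keep the following in mind:
- A reasonable User-Agent must be set; requests sent with the defaults of curl or axios get an error response
- The script is fragile and may break again, so keep an eye on it
The list is long, so only the entries Twitter Monitor needs are shown here; the rest are left for you to explore.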
{
"UsersByRestIds": {
"queryId": "I5nvpI91ljifos1Y3Lltyg",
"operationName": "UserByRestId",
"operationType": "query"
},
"UserByScreenName": {
"queryId": "7mjxD3-C6BxitPMVQ6w0-Q",
"operationName": "UserByScreenName",
"operationType": "query"
},
"UserTweets": {
"queryId": "LNhjy8t3XpIrBYM-ms7sPQ",
"operationName": "UserTweets",
"operationType": "query"
},
"UserTweetsAndReplies": {
"queryId": "Vg5aF036K40ST3FWvnvRGA",
"operationName": "UserTweetsAndReplies",
"operationType": "query"
},
"TweetDetail": {
"queryId": "bRL1YYMraLIBpo1PGLeFcw",
"operationName": "TweetDetail",
"operationType": "query"
}
}
The URL is assembled as follows:
let url = `https://twitter.com/i/api/graphql/${queryId}/${operationName}?variables=` + encodeURIComponent(JSON.stringify(Variables))
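As a slightly fuller sketch of the same idea, the helper below takes one entry of the queryId list and also appends the optional features parameter, matching the endpoint URLs used elsewhere in this article. The function name `buildGraphqlUrl` is my own and purely illustrative.

```javascript
// Sketch: build a GraphQL GET URL from one entry of the queryId list.
function buildGraphqlUrl(entry, variables, features) {
  // entry looks like { queryId: "...", operationName: "..." }
  let url =
    `https://twitter.com/i/api/graphql/${entry.queryId}/${entry.operationName}` +
    `?variables=${encodeURIComponent(JSON.stringify(variables))}`;
  // features is optional; it is appended only when the caller supplies it
  if (features !== undefined) {
    url += `&features=${encodeURIComponent(JSON.stringify(features))}`;
  }
  return url;
}
```

For example, `buildGraphqlUrl({ queryId: "7mjxD3-C6BxitPMVQ6w0-Q", operationName: "UserByScreenName" }, { screen_name: "Twitter" })` yields the UserByScreenName URL with the variables JSON percent-encoded.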
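These queryIds may be updated or removed, but so far I have not seen any ill effects from using an old queryId.
Update 2022.09.06
These queryIds are tied to the features parameter of the request. Do not update them unless you have to, and when you do, add whatever features parameters the new query requires. If one is missing, the response looks like this: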
{
"errors": [
{
"message": "The following features cannot be null: responsive_web_enhance_cards_enabled",
"extensions": {
"name": "BadRequestError",
"source": "Client",
"code": 336,
"kind": "Validation",
"tracing": {
"trace_id": "eeeeeeeeeeeeeeee"
}
},
"code": 336,
"kind": "Validation",
"name": "BadRequestError",
"source": "Client",
"tracing": {
"trace_id": "eeeeeeeeeeeeeeee"
}
}
]
}
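When a queryId has been updated, it helps to know which features the server is asking for. The sketch below pulls the names out of an error body shaped like the one above. The function name `missingFeatures` is hypothetical, and the comma-separated handling of several names is an assumption, since the example only shows a single feature.

```javascript
// Sketch: extract the missing feature names from a GraphQL validation error.
function missingFeatures(body) {
  const prefix = "The following features cannot be null: ";
  const names = [];
  for (const err of body.errors ?? []) {
    if (typeof err.message === "string" && err.message.startsWith(prefix)) {
      // Assumption: several names, if present, are separated by commas.
      for (const name of err.message.slice(prefix.length).split(",")) {
        if (name.trim() !== "") names.push(name.trim());
      }
    }
  }
  return names;
}
```

Each returned name can then be added to the features object of the request, typically with the value false.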
Guest Token & Cookie
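Items marked with * are optional
* csrf-token
To begin with, I honestly do not know under what conditions this is enforced; probably only after logging in. It is not required, and it is generated locally.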
//ct0 in cookie
//x-csrf-token in header
const t = (() => {
const e = window.crypto || window.msCrypto;
if (!e) return;
const t = new Uint8Array(32);
e.getRandomValues(t);
let n = "";
for (let e = 0; e < t.length; e++) n +=
t[e].toString(16).substr(-1);
return n
})();
Judging by the output, it is just a 32-character random string, so I simply use
echo md5(time());
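For anyone who prefers a value that is not derived from the clock, the following is a minimal Node.js sketch that produces a string of the same shape (32 lowercase hex characters). The function name `makeCsrfToken` is hypothetical; whether Twitter ever checks anything beyond the ct0 cookie and the x-csrf-token header matching is not something I have verified.

```javascript
// Sketch: generate a 32-character hex string for ct0 / x-csrf-token.
// Uses the global Web Crypto object when present, else Node's built-in module.
const webcrypto = globalThis.crypto ?? require("node:crypto").webcrypto;

function makeCsrfToken() {
  const bytes = new Uint8Array(16);
  webcrypto.getRandomValues(bytes);
  // 16 random bytes -> 32 lowercase hex characters
  return Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join("");
}
```

The same value would go into both the ct0 cookie and the x-csrf-token header.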
* Cookies set on the first visit
Yes, the first visit sets some cookies, but none of them is required. For reference, here is a pattern: /set-cookie: ([^;]+);/
guest_id_marketing: v1%3A164301325110776087
guest_id_ads: v1%3A164301325110776087
personalization_id: "v1_FBBNMaLDB1sdu2yWcCdHIQ=="
guest_id: v1%3A164301325110776087
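Should you want to keep these cookies anyway, the pattern above can be applied to the raw response headers as in the sketch below. The function name `parseSetCookies` is hypothetical, and the input is assumed to be the header text as printed by something like curl -i.

```javascript
// Sketch: collect cookie name/value pairs from raw response headers.
function parseSetCookies(rawHeaders) {
  const cookies = {};
  // Same pattern as above, with g (all matches) and i (case-insensitive header name).
  const pattern = /set-cookie: ([^;]+);/gi;
  for (const match of rawHeaders.matchAll(pattern)) {
    const pair = match[1];
    const eq = pair.indexOf("="); // split on the first "=" only; values may contain "="
    if (eq > 0) cookies[pair.slice(0, eq)] = pair.slice(eq + 1);
  }
  return cookies;
}
```

Splitting on the first "=" matters for personalization_id, whose value ends in "==".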
guest-token
- Via
curl 'https://twitter.com' --compressed
The returned page contains the following lines, which assign the guest-token (the gt value):
<script nonce="MDRjZmJlNWItYWNmOC00MTdiLWIxYjUtYTFhZTUyYTc2ODg4"> document.cookie = decodeURIComponent("gt=1232704521454999999; Max-Age=10800; Domain=.twitter.com; Path=/; Secure"); </script>
- Or
curl 'https://api.twitter.com/1.1/guest/activate.json' \
  -X 'POST' \
  -H 'authorization: Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA' \
  --compressed
This method also returns the cookies listed above.
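For the first method, the gt value has to be dug out of the HTML. A minimal sketch, assuming the page still embeds the cookie assignment in the form shown above (the function name `extractGuestToken` is hypothetical):

```javascript
// Sketch: extract the guest token from the HTML of https://twitter.com.
function extractGuestToken(html) {
  // The page sets the cookie via decodeURIComponent("gt=<digits>; Max-Age=...").
  const match = html.match(/gt=(\d+);/);
  return match ? match[1] : null;
}
```

A null result means the page did not contain a gt assignment, in which case the second method is the fallback.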
authorization
Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA
I have never seen this value change.
User info
Request
- Method: GET
- URL:
- by screen name
https://twitter.com/i/api/graphql/7mjxD3-C6BxitPMVQ6w0-Q/UserByScreenName?variables={VARIABLES}
- by user id
https://twitter.com/i/api/graphql/I5nvpI91ljifos1Y3Lltyg/UserByRestId?variables={VARIABLES}
- VARIABLES:
- by screen name
{ "screen_name": "USER_SCREEN_NAME", "withSafetyModeUserFields": true, "withSuperFollowsUserFields": true }
- by user id
{ "userId": "USER_ID", "withSafetyModeUserFields": true, "withSuperFollowsUserFields": true }
- Headers:
- Content-Type: application/json
- x-guest-token: 1232704521454999999
- authorization: Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA
Response
- Body
- success (when has_nft_avatar is true, the avatar is shown with a hexagonal frame)
{ "data": { "user": { "result": { "__typename": "User", "id": "VXNlcjo3ODMyMTQ=", "rest_id": "783214", "affiliates_highlighted_label": {}, "has_nft_avatar": true, "legacy": { "blocked_by": false, "blocking": false, "can_dm": false, "can_media_tag": true, "created_at": "Tue Feb 20 14:35:54 +0000 2007", "default_profile": false, "default_profile_image": false, "description": "What's happening?!", "entities": { "description": { "urls": []}, "url": { "urls": [ { "display_url": "about.twitter.com", "expanded_url": "https://about.twitter.com/", "url": "https://t.co/DAtOo6uuHk", "indices": [0, 23] } ] } }, "fast_followers_count": 0, "favourites_count": 6292, "follow_request_sent": false, "followed_by": false, "followers_count": 60784817, "following": false, "friends_count": 12, "has_custom_timelines": true, "is_translator": false, "listed_count": 87616, "location": "everywhere", "media_count": 2439, "muting": false, "name": "Twitter", "normal_followers_count": 60784817, "notifications": false, "pinned_tweet_ids_str": [], "profile_banner_extensions": { "mediaColor": { "r": { "ok": { "palette": [ { "percentage": 65.52, "rgb": { "blue": 0, "green": 0, "red": 0 }}, { "percentage": 18.59, "rgb": { "blue": 221, "green": 144, "red": 6 }}, { "percentage": 10.43, "rgb": { "blue": 124, "green": 58, "red": 252 }}, { "percentage": 3.27, "rgb": { "blue": 105, "green": 69, "red": 1 }}, { "percentage": 0.69,"rgb": { "blue": 89, "green": 44, "red": 153}} ] } } } }, "profile_banner_url": "https://pbs.twimg.com/profile_banners/783214/1642704439", "profile_image_extensions": { "mediaColor": { "r": { "ok": { "palette": [ { "percentage": 71.78,"rgb": { "blue": 255, "green": 227, "red": 182}}, { "percentage": 11.06,"rgb": { "blue": 255, "green": 192, "red": 90}}, { "percentage": 7.59,"rgb": { "blue": 252, "green": 249, "red": 218}}, { "percentage": 6.51,"rgb": { "blue": 25, "green": 23, "red": 16}}, { "percentage": 0.35,"rgb": { "blue": 254, "green": 204, "red": 1}} ] } } } }, 
"profile_image_url_https": "https://pbs.twimg.com/profile_images/1486805599367180290/Lp3amoqK_normal.jpg", "profile_interstitial_type": "", "protected": false, "screen_name": "Twitter", "statuses_count": 14967, "translator_type": "regular", "url": "https://t.co/DAtOo6uuHk", "verified": true, "want_retweets": false, "withheld_in_countries": [] }, "professional": { "rest_id": "1420110046596374541", "professional_type": "Business", "category": [] }, "smart_blocked_by": false, "smart_blocking": false, "super_follow_eligible": false, "super_followed_by": false, "super_following": false, "legacy_extended_profile": { "birthdate": { "day": 21, "month": 3, "visibility": "Public", "year_visibility": "Self"} }, "is_profile_translatable": false } } } }
- failure
- Suspended account: @realDonaldTrump
{ "data": { "user": { "result": { "__typename": "UserUnavailable", "unavailable_message": { "entities": [ { "fromIndex": 28, "toIndex": 32, "ref": { "type": "TimelineUrl", "url": "https://help.twitter.com/rules-and-policies/twitter-rules", "urlType": "ExternalUrl" } } ], "rtl": false, "text": "Twitter 会冻结违反 Twitter 规则的账号。了解更多" }, "reason": "Suspended" } } } }
- Nonexistent account (a keyboard-mash handle, so I will not say whose)
{ "data": {}} // a nonexistent user returns nothing at all
- Compared with the REST version very little changes; only two accesses need adjusting. Before and after:
// REST API
const response = ... // fetch the user object
let id_str = response.id_str
let user_info = response
// GraphQL
const response = ... // fetch it as described above
let id_str = response.data.user.result.rest_id
let user_info = response.data.user.result.legacy
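The three response shapes above (a normal user, an unavailable user, and an empty data object) can be folded into one helper. This is only a sketch: the function name `parseUserResponse` and the status labels are mine, not part of any API.

```javascript
// Sketch: normalise the three user response shapes shown above.
function parseUserResponse(body) {
  const result = body?.data?.user?.result;
  if (!result) {
    // { "data": {} } -> the account does not exist
    return { status: "not_found" };
  }
  if (result.__typename === "UserUnavailable") {
    // e.g. reason "Suspended"
    return { status: "unavailable", reason: result.reason ?? null };
  }
  return {
    status: "ok",
    id_str: result.rest_id,   // was id_str at the top level of the REST response
    user_info: result.legacy, // was the whole REST response
  };
}
```

Code written against the REST response can keep using `id_str` and `user_info` unchanged once the status has been checked.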
Followers and following
Followers
- Method: GET
- URL:
https://twitter.com/i/api/graphql/neVf0YKN1h09TFZr4D43MA/Followers?variables={VARIABLES}
- VARIABLES:
{ "userId": "USER_ID", "count": 20, "includePromotedContent": false, "withSuperFollowsUserFields": true, "withDownvotePerspective": false, "withReactionsMetadata": false, "withReactionsPerspective": false, "withSuperFollowsTweetFields": true, "__fs_interactive_text": false, "__fs_responsive_web_uc_gql_enabled": false, "__fs_dont_mention_me_view_api_enabled": false }
- Headers:
- Content-Type: application/json
- x-guest-token: 1232704521454999999
- authorization: Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA
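Verified accounts
Verified originally referred to accounts with a small blue check after their display name (name), typically governments, companies, or public figures, and it required verification by Twitter.
After Musk acquired Twitter in 2022, Twitter Blue subscribers started receiving the blue check as well. Twitter calls this Blue Verified; the field used to check it is is_blue_verified, readable at JSON.data.user.result.is_blue_verified.
From that point on, an account gets the blue check if it is either Blue Verified or legacy Verified.
Twitter then added a gold badge: it is shown whenever the field ext_verified_type has the value Business. Before that, Twitter reused the slot that labels state-affiliated media in various countries to mark such accounts. I do not yet know what other values this field can take.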
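In addition, querying the UsersVerifiedAvatars endpoint with the new GraphQL queryId checks in bulk whether users are Blue Verified. This endpoint was originally used to check whether a user has an NFT avatar.
Someone has also written a browser extension for quickly checking what kind of account you are looking at.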
- Method: GET
- URL:
https://twitter.com/i/api/graphql/AkfLpq1RURPtDOcd56qyCg/UsersVerifiedAvatars?variables={VARIABLES}&features={FEATURES}
- VARIABLES:
{ "userIds": ["uid1", "uid2", "uid3"] } // and more...
- FEATURES:
{ "responsive_web_twitter_blue_verified_badge_is_enabled": true }
- Headers:
- Content-Type: application/json
- x-guest-token: 1232704521454999999
- authorization: Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA
Response
- Body
- success
{ "result": { "__typename": "User", "is_blue_verified": true, "has_nft_avatar": false, "rest_id": "1511811738076856322" } }
- failure
{ "code": 366, "message": "NumericString value expected. Received " }
Tweet content
Timeline
- Method: GET
- URL:
https://twitter.com/i/api/graphql/LNhjy8t3XpIrBYM-ms7sPQ/UserTweets?variables={VARIABLES}&features={FEATURES}
- VARIABLES:
{ "userId": "USER_ID", "count": 20, "withHighlightedLabel": true, "withTweetQuoteCount": true, "includePromotedContent": true, "withTweetResult": false, "withReactions": false, "withUserResults": false, "withVoice": false, "withNonLegacyCard": true, "withBirdwatchPivots": false, "cursor": "CURSOR" }
// count: keep this value modest, a large value causes 503; the default maximum configured in Twitter Monitor is 500
// TODO timeline_v2
- FEATURES:
{ "dont_mention_me_view_api_enabled": true, "interactive_text_enabled": true, "responsive_web_uc_gql_enabled": false, "vibe_tweet_context_enabled": false, "responsive_web_edit_tweet_api_enabled": false, "standardized_nudges_misinfo": false, "responsive_web_enhance_cards_enabled": false }
- Headers:
- Content-Type: application/json
- x-guest-token: 1232704521454999999
- authorization: Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA
Response
- Body
- success
// too long to include here
- failure
{ "data": { "user": { "result": { "__typename": "UserUnavailable" } } } }
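- count defaults to 20 and is capped at 100
- cursor: how to obtain it is covered below; it may be omitted, in which case the latest count tweets are returned
- userId is the user's numeric UID, i.e. the rest_id above
- The features parameter in the request has actually existed for a long time, but the latest endpoints return nothing unless it is included...
- At most 850 tweets can be retrieved in total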
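The count and cursor rules in the list above can be baked into a small builder for the UserTweets variables. This is a sketch under those rules only; the function name `userTweetsVariables` is hypothetical, and the remaining flags are copied from the VARIABLES example without any claim that all of them are still required.

```javascript
// Sketch: build the UserTweets variables with the count and cursor rules listed above.
function userTweetsVariables(userId, { count = 20, cursor } = {}) {
  const variables = {
    userId: String(userId),
    // keep count within 1..100; 100 is the upper limit noted above
    count: Math.min(Math.max(count, 1), 100),
    withHighlightedLabel: true,
    withTweetQuoteCount: true,
    includePromotedContent: true,
    withTweetResult: false,
    withReactions: false,
    withUserResults: false,
    withVoice: false,
    withNonLegacyCard: true,
    withBirdwatchPivots: false,
  };
  // Leave cursor out entirely to fetch the most recent tweets.
  if (cursor) variables.cursor = cursor;
  return variables;
}
```

The returned object is what gets JSON-encoded into the variables query parameter.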
Tweets
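A lot of structural changes show up here. Handling them the first time is tedious, but it only has to be done once.
Scratch that last sentence; in practice there have been even more silent changes since.
The most visible change in this update is that globalObjects and timeline have been merged.
In the new version all timeline data sits in JSON.data.user.result.timeline.timeline.instructions[0].entries or JSON.data.user.result.timeline.timeline.instructions[1].entries, depending on whether TimelineClearCache is present.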
{
"instructions": [
{ "type": "TimelineClearCache" },
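{ "type": "TimelineAddEntries", "entries": [...] }
]
}
- * TimelineClearCache is presumably used to clear nodes that are no longer needed, for example after a tweet is deleted. This is a guess; I have not tested it
- TimelineAddEntries: everything on the timeline is inside this node's entries array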
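Because the index of the instruction that holds the entries depends on whether TimelineClearCache is present, it is safer not to hard-code 0 or 1. A minimal sketch (the function name `timelineEntries` is hypothetical) that simply picks the first instruction carrying an entries array:

```javascript
// Sketch: locate the entries array without hard-coding the instruction index.
function timelineEntries(body) {
  const instructions =
    body?.data?.user?.result?.timeline?.timeline?.instructions ?? [];
  // Pick the first instruction that actually carries an entries array.
  const holder = instructions.find((ins) => Array.isArray(ins.entries));
  return holder ? holder.entries : [];
}
```

An empty array is returned when the timeline is missing, so callers can loop over the result without extra checks.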
Below, NODE stands for one element of JSON.data.user.result.timeline.timeline.instructions[1].entries
cursor
The cursors for paging upward and downward are still located in the last two NODE entries.
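Reading the two cursors could look like the sketch below. The exact shape of a cursor entry is not shown in this article, so the field names used here (entryType "TimelineTimelineCursor", cursorType "Top" or "Bottom", and value) are assumptions to verify against a real response; the function name `extractCursors` is hypothetical as well.

```javascript
// Sketch: read the top and bottom cursors from the last two entries.
// Assumption: cursor entries look like
//   { content: { entryType: "TimelineTimelineCursor", cursorType: "Top", value: "..." } }
function extractCursors(entries) {
  const cursors = { top: null, bottom: null };
  for (const node of entries.slice(-2)) {
    const content = node.content ?? {};
    if (content.entryType !== "TimelineTimelineCursor") continue;
    if (content.cursorType === "Top") cursors.top = content.value;
    if (content.cursorType === "Bottom") cursors.bottom = content.value;
  }
  return cursors;
}
```

The bottom value is the one to pass as cursor when requesting the next, older page.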
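tweets
Because everything has been merged, an entry is no longer guaranteed to be a tweet. Check whether NODE.content.entryType equals TimelineTimelineItem; if it does not, the entry is probably one of the assorted user recommendations or ads.
Here is where some commonly used pieces have moved: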
- Everything that used to live under globalObjects.tweets has moved to NODE.content.itemContent.tweet_results.result.legacy, except tweet_id, which has moved to NODE.content.itemContent.tweet_results.result.rest_id
- User info, which previously had to be looked up in globalObjects.users, has moved to NODE.content.itemContent.tweet.core.user_results.result.legacy, and the user ID has moved to NODE.content.itemContent.tweet.core.user_results.result.rest_id
- Retweeted tweets, previously separate entries in globalObjects.tweets, have moved to NODE.content.itemContent.tweet_results.result.legacy.retweeted_status_result.result
- Quoted tweets have moved from globalObjects.tweets to NODE.content.itemContent.tweet.quoted_status_result.result
- The original tweet of a retweet has moved to NODE.content.itemContent.tweet.legacy.retweeted_status.legacy. If the original tweet is not used, all extended_entities content is lost, and the replacement of hashtags, URLs, and similar text ends up at the wrong offsets (hold on, I need to get some reading glasses and work out how this relates to the item above)
- Retweet media has moved to (what a mess, let me sort this out) NODE.content.itemContent.tweet.legacy.retweeted_status.legacy.extended_entities.media
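The moves listed above can be sketched as one pass over the entries. Two things here are assumptions rather than facts from the list: the older paths that start with tweet. are treated as equivalent to tweet_results.result., and retweets are resolved through legacy.retweeted_status_result.result. The function name `extractTweets` and the output field names are hypothetical.

```javascript
// Sketch: turn timeline NODEs into flat tweet objects using the paths listed above.
function extractTweets(entries) {
  const tweets = [];
  for (const node of entries) {
    const content = node.content ?? {};
    // Skip ads, recommendations, cursors and other non-item entries.
    if (content.entryType !== "TimelineTimelineItem") continue;
    const result = content.itemContent?.tweet_results?.result;
    // Items without a tweet body (for example tombstones) are skipped.
    if (!result?.legacy) continue;
    // For a retweet, read text and media from the original tweet.
    const original = result.legacy.retweeted_status_result?.result;
    tweets.push({
      tweet_id: result.rest_id,
      user_id: result.core?.user_results?.result?.rest_id ?? null,
      is_retweet: Boolean(original),
      legacy: original?.legacy ?? result.legacy,
    });
  }
  return tweets;
}
```

Using the original tweet's legacy object for retweets keeps extended_entities and the entity offsets intact.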
Cards
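Cards have moved to NODE.content.itemContent.tweet_results.result.card.legacy
I expected this to be complicated, but it needs no major changes. If you have handled cards before, you will notice that the card content has simply moved under legacy, so binding_values can be converted back to the old key-value form:
// Convert the Array back into an Object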
$tmpBindingValueList = [];
foreach ($cardInfo["binding_values"] as $bindingValue) {
$tmpBindingValueList[$bindingValue["key"]] = $bindingValue["value"];
}
$cardInfo["binding_values"] = $tmpBindingValueList;
// The reverse direction: converting to the GraphQL form
//$tmpList = [];
//foreach ($cardInfo["binding_values"] as $key => $value) {
// $tmpList[] = ["key" => $key, "value" => $value];
//}
//$cardInfo["binding_values"] = $tmpList;
NSFW
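Usually this only shows up as a warning on images, but in some regions (Japan, for example) certain tweets are restricted in their entirety. According to Notices on Twitter and what they mean, tweets flagged as adult content are restricted, though I do not yet understand why the standard differs by region. Here is an example; tweets like this generally cannot be retrieved without logging in.
Update 2022-11-11
What these results have in common is that they were obtained with the new Bearer Token. For how the old and new Bearer Tokens differ, see my other article.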
{
"entryId": "tombstone-1469626851568271362",
"sortIndex": "1469626851568271362",
"content": {
"entryType": "TimelineTimelineItem",
"itemContent": {
"itemType": "TimelineTombstone",
"tombstoneDisplayType": "Inline",
"tombstoneInfo": {
"text": "",
"richText": {
"rtl": false,
"text": "年齢制限のある成人向けコンテンツです。このコンテンツは、18歳未満のユーザーには適切でない可能性があります。このメディアを表示するには、Twitterにログインしてください。詳細はこちら",
"entities": [
{
"fromIndex": 76,
"toIndex": 80,
"ref": {"type": "TimelineUrl","url": "https://twitter.com","urlType": "ExternalUrl"}
},
{
"fromIndex": 87,
"toIndex": 93,
"ref": {"type": "TimelineUrl","url": "https://help.twitter.com/rules-and-policies/notices-on-twitter","urlType": "ExternalUrl"}
}
]
}
}
}
}
}
NSFW flags on media, by contrast, are set by the uploader; the available categories are nudity, violence, and sensitive content.
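A restricted tweet like the example above can be recognised by its itemType. The sketch below returns the notice text for a tombstone entry and null for anything else; the function name `tombstoneText` is hypothetical, and only the fields visible in the example are used.

```javascript
// Sketch: return the notice text of a tombstone entry, or null for other entries.
function tombstoneText(node) {
  const item = node?.content?.itemContent;
  if (item?.itemType !== "TimelineTombstone") return null;
  // Prefer richText.text; the plain text field was empty in the example above.
  return item.tombstoneInfo?.richText?.text ?? item.tombstoneInfo?.text ?? "";
}
```

The tweet ID is not part of the tombstone body, but in the example it can be read from the entryId suffix.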
Errors
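TODO: this section is pending an update
Checking used to be easy: you only had to see whether errors was present. Now you have to check that data.user.result.timeline is absent, and the cause appears in data.user.result.__typename
Twitter cuts corners here: nowadays the error reason is almost always Something went wrong...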
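Pending that update, the check described above could be sketched as follows. It looks at the top-level errors array first and then at the missing timeline, returning a short reason string or null when the response looks usable. The function name `timelineError` and the fallback strings are hypothetical.

```javascript
// Sketch: return a reason string when a timeline response is unusable, else null.
function timelineError(body) {
  // Old-style check: a top-level errors array.
  if (Array.isArray(body?.errors) && body.errors.length > 0) {
    return body.errors[0].message ?? "unknown error";
  }
  const result = body?.data?.user?.result;
  if (!result) return "no user result";
  // New-style check: timeline is missing and __typename carries the cause.
  if (!result.timeline) return result.__typename ?? "unknown";
  return null;
}
```

For the failure body shown in the timeline section this returns "UserUnavailable".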
Acknowledgements
- Juicpt, for pointing out many of the newer changes
- Everyone in the comments
References
- https://help.twitter.com/en/rules-and-policies/notices-on-twitter Notices on Twitter and what they mean
- https://help.twitter.com/en/managing-your-account/about-twitter-verified-accounts How to get verified on Twitter
- GraphQL at Twitter