Bluesky API Issues/Tips

The purpose of this file is to document issues and notable details related to the Bluesky API that were found during the development of the bsky formatting and normalizing tools within Twitwi and of the minet bsky command within minet. We do not claim it is exhaustive: any new notable issue, or any clarification of an existing one, is very welcome.

Issues/Tips related to Bluesky API in general

Tip : App passwords and rate limit

When using the minet bsky command, you need to authenticate with an app password. An app password is rate limited to 3000 requests per 5 minutes (i.e. 600 requests per minute, or 10 requests per second). This limit is shared across all commands using an app password associated with the same Bluesky account, and is almost never reached when running one command at a time. However, because responses are slow, it is recommended to run several commands in parallel (e.g. in separate tmux sessions) if you plan to make a large number of requests. Note that it does not matter whether the commands use the same app password or not, since the rate limit is shared across all app passwords associated with the same Bluesky account.

Note that when you reach the rate limit, you cannot connect to your Bluesky account until the limit is reset (you then need to wait about 5 minutes).
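
To stay under this budget when running several commands, a client-side throttle can help. The sketch below is not part of minet: SlidingWindowLimiter and its parameters are hypothetical names, and it only assumes the budget of 3000 requests per 5 minutes stated above.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` calls per `window` seconds (hypothetical helper)."""

    def __init__(self, max_requests=3000, window=300.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window = window
        self.clock = clock
        self.calls = deque()  # timestamps of the calls still inside the window

    def wait_time(self):
        """Return how many seconds to wait before the next call is allowed."""
        now = self.clock()
        # Forget the calls that have left the window.
        while self.calls and now - self.calls[0] >= self.window:
            self.calls.popleft()
        if len(self.calls) < self.max_requests:
            return 0.0
        return self.window - (now - self.calls[0])

    def record(self):
        """Register that a request has just been sent."""
        self.calls.append(self.clock())
```

A caller would sleep for wait_time() seconds, send its request, then call record(). Since the limit is shared across the whole account, such a limiter only protects you if every process using the account goes through the same budget.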

Issue/Tip : Retrying on errors from the API

Sometimes, when making too many requests in a short time, or when using specific routes to retrieve large quantities of data, the Bluesky API raises errors. Here are the most common of the unpredictable ones:

  • ExpiredToken (HTTP status 400): for some reason, it happens after about 2 hours on a request that worked fine before. For now, it seems to be raised when using the app-bsky-feed-search-posts route on queries matching a very dense set of posts published within a short time range. Associated Minet exception: BlueskyExpiredToken.
  • HTTP status 502: the most common error raised by the API when making many requests in a short time.
  • UpstreamFailure: happens sometimes when making many requests in a short time. Associated Minet exception: BlueskyUpstreamFailureError.

Minet works around them by waiting and retrying automatically after a short period of time.
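
If you call the API yourself, a similar wait-and-retry strategy can be sketched as follows. This is not Minet's actual implementation: TransientAPIError is a hypothetical stand-in for errors such as HTTP 502 or UpstreamFailure, and the exponential delays are an assumption chosen for illustration.

```python
import time


class TransientAPIError(Exception):
    """Hypothetical stand-in for errors such as HTTP 502 or UpstreamFailure."""


def call_with_retries(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `func`, retrying on TransientAPIError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except TransientAPIError:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            # Wait base_delay, then twice as long, and so on, before retrying.
            sleep(base_delay * 2 ** (attempt - 1))
```

The sleep parameter is injectable so that the retry logic can be tested without actually waiting.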

Issues/Tips related to post payloads

Issue : Wrong lang field value

The lang field of a post payload is supposed to correspond to the language of the post content. However, this field is not always set correctly. For example, some posts written in French have their lang field set to en (English) instead of fr (French), and vice versa. Indeed, the language detection algorithm is only active when the text reaches a certain length (115 characters according to our experiments), and even then, the user can choose not to have it applied. When it is not applied or when the text is too short, the lang field is set to the user's preferred language, but the user can override it and set it to any value.

To illustrate this issue, we collected all the posts having the lang attribute set to fr (French) using the minet bsky search-posts "lang:fr" command. In this collection, we found that only about 72% of the 32,969,538 posts are actually written in French, according to the Lingua Rust language detection library applied to the post content.

As a workaround, it is recommended to use a language detection library on the post content to determine its language more accurately.

Tip : Unique identifier of a post

The unique identifier of a regular post on Bluesky is its uri field, which follows the pattern at://{user_did}/app.bsky.feed.post/{post_did}.
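
As an illustration, the following sketch splits a post uri into its two variable parts and rebuilds the matching bsky.app URL. The function names parse_post_uri and post_url are hypothetical, and the sketch assumes the uri follows exactly the pattern given above.

```python
def parse_post_uri(uri):
    """Split a post uri into (user_did, post_id); raise ValueError otherwise."""
    prefix = "at://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an AT URI: {uri!r}")
    parts = uri[len(prefix):].split("/")
    if len(parts) != 3 or parts[1] != "app.bsky.feed.post":
        raise ValueError(f"not a post URI: {uri!r}")
    user_did, _, post_id = parts
    return user_did, post_id


def post_url(uri):
    """Build the bsky.app URL of a post from its uri."""
    user_did, post_id = parse_post_uri(uri)
    return f"https://bsky.app/profile/{user_did}/post/{post_id}"
```

Raising ValueError on anything that is not a post uri keeps feeds, lists and starter packs from being silently treated as posts.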

Tip : Different types of publications on Bluesky

There are several types of publications on Bluesky, which might break your normalization process if you don't know them:

  • Regular posts:
    • their URL follows the pattern https://bsky.app/profile/{user_handle or user_did}/post/{post_did}
    • their URI follows the pattern at://{user_did}/app.bsky.feed.post/{post_did}
  • Feeds: specialized news feeds that let users choose the topics that interest them the most, rather than relying on a central algorithm that decides for them based on their activity
    • their URL follows the pattern https://bsky.app/profile/{user_handle or user_did}/feed/{feed_did}
    • their URI follows the pattern at://{user_did}/app.bsky.feed.generator/{feed_did}
  • Lists: user-created lists that can be added to their feed
    • their URL follows the pattern https://bsky.app/profile/{user_handle or user_did}/lists/{list_did}
    • their URI follows the pattern at://{user_did}/app.bsky.graph.list/{list_did}
  • Starter packs: predefined lists of users to follow (when creating a new account, for example)
    • their URL follows the pattern https://bsky.app/starter-pack/{user_handle or user_did}/{starter_pack_did}
    • their URI follows the pattern at://{user_did}/app.bsky.graph.starterpack/{starter_pack_did}

When normalizing Bluesky posts, it is recommended to check the type field of the post payload to determine the type of publication and handle it accordingly.
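
When only a URI is available, the URI patterns listed above can also be used to tell the publication types apart. The sketch below is a complementary check rather than a replacement for the type field; COLLECTION_TYPES and publication_type are hypothetical names introduced for illustration.

```python
# Middle segment of the URI patterns listed above, mapped to a readable label.
COLLECTION_TYPES = {
    "app.bsky.feed.post": "post",
    "app.bsky.feed.generator": "feed",
    "app.bsky.graph.list": "list",
    "app.bsky.graph.starterpack": "starter pack",
}


def publication_type(uri):
    """Return the publication type of a URI, or None if it is not recognized."""
    prefix = "at://"
    if not uri.startswith(prefix):
        return None
    parts = uri[len(prefix):].split("/")
    if len(parts) != 3:
        return None
    return COLLECTION_TYPES.get(parts[1])
```

Returning None for unknown patterns lets the normalization step decide whether to skip or log the publication instead of failing.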

Issue : Personalized links in post content

Users can publish content on Bluesky from a personal server, which allows them to customize the data of their posts, including link encoding and placement. Consequently, the links/mentions found in the text field of a post payload might be shifted when the alignment on the link/mention characters is not done properly, especially if the text contains emojis or special characters.

Usually, Twitwi replaces the links in the original text with their real link (replacing some text with the link when dealing with a personalized link, for example), but in undecidable cases where the alignment cannot be recovered, it doesn't replace anything at all.
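
Part of the difficulty comes from the fact that Bluesky rich text facets locate links and mentions with offsets (byteStart and byteEnd) counted in UTF-8 bytes, not in characters. The sketch below shows how such offsets can be applied and how misaligned ones can be detected. It is not Twitwi's implementation, and facet_text is a hypothetical helper.

```python
def facet_text(text, byte_start, byte_end):
    """Return the part of `text` covered by UTF-8 byte offsets, or None if misaligned."""
    encoded = text.encode("utf-8")
    if not 0 <= byte_start <= byte_end <= len(encoded):
        return None  # the offsets fall outside the text
    try:
        return encoded[byte_start:byte_end].decode("utf-8")
    except UnicodeDecodeError:
        return None  # the offsets cut through a multi-byte character
```

For instance, in the text "🙂 see example.com" the emoji takes 4 bytes but a single character, so the link spans bytes 9 to 20 while a character-based slice text[9:20] would return a shifted fragment.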

Issue : False posting dates

Users can publish content on Bluesky from a personal server, which allows them to customize the data of their posts, including the publication date of the post. Note that there are two distinct timestamps associated with a post: according to the documentation of the app-bsky-feed-search-posts route, the createdAt timestamp is "assumed to represent the original time of posting, but clients are allowed to insert any value", and the indexedAt timestamp "generally represents the time the record was "first seen" by an API server". However, we noticed during the development of minet that not only the createdAt timestamp can be modified by the user posting, but the indexedAt timestamp can be too, meaning that for some posts, it is (for now) impossible to know their real date of publication.

Moreover, if you plan to manipulate datasets of Bluesky posts, keep in mind that since users are allowed to enter any date they want when posting, you may find posts with aberrant dates, such as 1970-01-01 (timestamp 0), or even funnier ones such as 0001-10-14 or 0200-01-01... The Internet never fails to surprise us!
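
When cleaning a dataset, aberrant dates can at least be flagged. The sketch below is one possible heuristic, not something minet does: EARLIEST_PLAUSIBLE is an arbitrary assumption to adjust to your corpus, and parse_timestamp and is_plausible are hypothetical helpers.

```python
from datetime import datetime, timezone

# Assumption: posts dated before this are treated as suspicious; adjust as needed.
EARLIEST_PLAUSIBLE = datetime(2022, 1, 1, tzinfo=timezone.utc)


def parse_timestamp(value):
    """Parse a timestamp such as 2024-11-01T00:00:00.000Z; return None if invalid."""
    try:
        parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        return None
    if parsed.tzinfo is None:
        # Treat timestamps without an offset as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def is_plausible(value, now=None, earliest=EARLIEST_PLAUSIBLE):
    """Return True if the timestamp lies between `earliest` and `now`."""
    parsed = parse_timestamp(value)
    if parsed is None:
        return False
    now = now or datetime.now(timezone.utc)
    return earliest <= parsed <= now
```

A post flagged this way is not necessarily fake: the check only tells you that its date should not be trusted without further inspection.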

Issues/Tips related to user profile payloads

Tip : Different user profile payload structures

There are two different structures for user profile payloads on Bluesky:

Issues/Tips related to specific Bluesky API routes

minet bsky search-posts is the Minet command using the app-bsky-feed-search-posts route.

Tip : Precision of the datetime timestamps

The datetime timestamps returned by the Bluesky API are precise to the microsecond. However, when using the since and until args, the effective precision is only up to the millisecond. Indeed, when testing with timestamps differing only by microseconds, the results are the same as if the timestamps were rounded to the nearest millisecond.

Consequently, in this document, when referring to "datetime" timestamps, we will consider that their precision is up to the millisecond.

Issue : Sorting of retrieved posts

The command retrieves posts sorted in latest ranking order. The documentation of the app-bsky-feed-search-posts route specifies that when using the since and until args, the user is "expected to use 'sortAt' timestamp, which may not match 'createdAt'", meaning that the ranking order is based on this sortAt timestamp. Still according to the documentation, the sortAt timestamp is 'defined as the "earlier" of the createdAt and indexedAt timestamps' (where the createdAt timestamp is "assumed to represent the original time of posting, but clients are allowed to insert any value", and the indexedAt timestamp "generally represents the time the record was "first seen" by an API server").

As "clients are allowed to insert any value", i.e. users can enter any date they want when posting, some aberrant dates can be found! See the False posting dates section for more information.

Using the sortAt timestamp prevents posts with an artificial createdAt timestamp (e.g. set in the future) from always appearing at the start of the results of a search query. Although this is smart, our tests suggest that the posts retrieved by the route are not always perfectly sorted by the sortAt timestamp, whether since and/or until are used or not. Indeed, this timestamp depends on the createdAt one, and we observed that in some cases this value is artificial, which might be the source of the issue (cf. their indexing code).
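
To check or restore the ordering on your side, the sortAt definition quoted above can be recomputed locally. The sketch below assumes each post is a flat dict with createdAt and indexedAt keys, which is a simplification of the real payload; parse_utc, sort_at and latest_first are hypothetical helpers.

```python
from datetime import datetime


def parse_utc(value):
    """Parse a timestamp such as 2024-05-01T10:00:00.000Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def sort_at(post):
    """Earlier of createdAt and indexedAt, following the documented definition."""
    return min(parse_utc(post["createdAt"]), parse_utc(post["indexedAt"]))


def latest_first(posts):
    """Return the posts sorted from the most recent sortAt to the oldest."""
    return sorted(posts, key=sort_at, reverse=True)
```

With this ordering, a post whose createdAt is set in the future is ranked by its indexedAt instead, so it no longer jumps to the top of the results.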

Issue : Paging

The app-bsky-feed-search-posts route appears to limit the number of results to 10,000 posts per query. To work around this limitation, Minet uses datetime range paging when this limit is reached: it makes a new query with the until parameter set to the createdAt timestamp of the oldest post retrieved so far, which lets it collect posts published before those already collected. However, with this method, if more than 10,000 unique posts share the same datetime, we cannot retrieve them all. Moreover, when reaching that limit and paging by time, we noticed that the Bluesky API doesn't return exactly the same 10,000 posts again: some new posts are found, but most have already been seen. Most importantly, there seems to be no logic behind the order of these posts, meaning that we are currently unable to guarantee that the same command retrieves exactly the same posts when executed multiple times. This issue is being investigated.
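
The datetime range paging idea can be sketched as follows. This is a simplified illustration, not Minet's code: fetch_page is a hypothetical callable returning at most limit posts created at or before until, each post is assumed to be a dict with uri and createdAt keys, and timestamps are assumed to be uniformly formatted so that string comparison matches chronological order.

```python
def collect_with_time_paging(fetch_page, until=None, limit=10_000):
    """Collect posts by lowering `until` each time a query returns `limit` posts."""
    seen = set()
    collected = []
    while True:
        page = fetch_page(until)
        progress = False
        for post in page:
            # Deduplicate on uri, since consecutive time ranges overlap.
            if post["uri"] not in seen:
                seen.add(post["uri"])
                collected.append(post)
                progress = True
        if len(page) < limit or not progress:
            break  # last page reached, or the same posts keep coming back
        # Restart the query from the oldest timestamp of this page.
        until = min(post["createdAt"] for post in page)
    return collected
```

The "not progress" exit reproduces the limitation described above: when more than limit posts share the same datetime, lowering until cannot reach the remaining ones and the loop stops.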

Tip : since and until args overwrite the ones in the query

When the since and/or until args of the minet bsky search-posts command are used, they overwrite the corresponding since: and/or until: parts of the query string, if any. For example, the following three commands are equivalent:

minet bsky search-posts "lang:fr since:2024-12-01T00:00:00.000Z" --since 2024-11-01T00:00:00.000Z > bsky-fr.csv
minet bsky search-posts "lang:fr" --since 2024-11-01T00:00:00.000Z > bsky-fr.csv
minet bsky search-posts "lang:fr since:2024-11-01T00:00:00.000Z" > bsky-fr.csv
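
The overwrite rule can be illustrated with a small query rewriter. This is only a model of the behaviour described above, not the code minet runs; apply_time_args is a hypothetical function.

```python
import re


def apply_time_args(query, since=None, until=None):
    """Return `query` where since:/until: operators are replaced by the given args."""
    for name, value in (("since", since), ("until", until)):
        if value is None:
            continue  # no arg given: keep whatever the query already contains
        # Drop any existing operator of that name, then append the new one.
        query = re.sub(rf"\b{name}:\S+", "", query)
        query = " ".join(query.split() + [f"{name}:{value}"])
    return query
```

Applied to the first two commands above, this rewriter yields the query string of the third one, "lang:fr since:2024-11-01T00:00:00.000Z".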