A containerised, Python-based listener that ingests the Bluesky firehose and exports the processed data to a Postgres database, allowing easy collection of samples from Bluesky for research and wider purposes.
Bluesky is an up-and-coming social media site based on the AT Protocol, which may merit further study. However, the AT Protocol can be difficult to interpret and code around, so this application is designed to lower the barrier to entry for bulk collection and analysis of Bluesky data.
- Install Docker
- Navigate to the root folder of the repository
- Run `docker compose up`
- Connect via your preferred method to the PostgreSQL database hosted locally at port 5432
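
As a minimal sketch of the last step, the connection string for the local database could be assembled as below. The helper name `build_dsn` is hypothetical, and the `POSTGRES_USER`, `POSTGRES_PASSWORD` and `POSTGRES_DB` environment variables (the ones read by the official Postgres Docker image) and their fallback values are assumptions; check the compose file in this repository for the credentials it actually configures.

```python
import os
from urllib.parse import quote


def build_dsn(host="localhost", port=5432, user=None, password=None, dbname=None):
    """Build a PostgreSQL connection URI for the locally hosted database.

    Hypothetical helper: missing credentials fall back to environment
    variables, which are an assumption and may not match this project.
    """
    user = user or os.environ.get("POSTGRES_USER", "postgres")
    password = password or os.environ.get("POSTGRES_PASSWORD", "")
    dbname = dbname or os.environ.get("POSTGRES_DB", "postgres")

    # Percent-encode each part so characters such as '@' or '/' stay valid.
    auth = quote(user, safe="")
    if password:
        auth += ":" + quote(password, safe="")
    return f"postgresql://{auth}@{host}:{port}/{quote(dbname, safe='')}"


if __name__ == "__main__":
    # Placeholder credentials, for illustration only.
    print(build_dsn(user="researcher", password="s3cret", dbname="bluesky"))
```

The resulting URI can be handed to most PostgreSQL clients, for example `psql` or a Python driver such as psycopg.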
Please note that the bind-mounted volumes are difficult to delete, due to security features within Docker.
You will need to use a command such as `docker exec --privileged --user root <CONTAINER_ID> chown -R "$(id -u):$(id -g)" <TARGET_DIR>`; more details here.
Please note: Some instability has been observed when running WideSky for the first time. If the logs show connection errors between the Python and Postgres containers after building the project for the first time, please restart the application.
The schema for the Postgres database is as follows:
did | first_known_as | also_known_as |
---|---|---|
TEXT PRIMARY KEY | TEXT | TEXT |
cid | created_at | did | commit | text | langs | facets | has_embed | embed_type | embed_refs | external_uri | has_record | record_cid | record_uri | is_reply | reply_root_cid | reply_root_uri | reply_parent_cid | reply_parent_uri |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TEXT PRIMARY KEY | TIMESTAMP WITH TIME ZONE | TEXT | TEXT | TEXT | TEXT ARRAY | JSONB | BOOLEAN | TEXT | TEXT ARRAY | TEXT | BOOLEAN | TEXT | TEXT | BOOLEAN | TEXT | TEXT | TEXT | TEXT |
cid | created_at | did | commit | subject_cid | subject_url |
---|---|---|---|---|---|
TEXT PRIMARY KEY | TIMESTAMP WITH TIME ZONE | TEXT | TEXT | TEXT | TEXT |
cid | created_at | did | commit | subject_cid | subject_uri |
---|---|---|---|---|---|
TEXT PRIMARY KEY | TIMESTAMP WITH TIME ZONE | TEXT | TEXT | TEXT | TEXT |
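
To illustrate how the widest table above might look on the Python side, here is a rough sketch that mirrors its 19 columns as a dataclass. The names `PostRow`, `POST_COLUMNS` and `post_from_row` are hypothetical, and which columns may be NULL is an assumption, since the schema above only lists types. Drivers such as psycopg typically return `TIMESTAMP WITH TIME ZONE` as a timezone-aware `datetime`, `TEXT ARRAY` as a list of strings, and `JSONB` as already-parsed JSON.

```python
from dataclasses import dataclass, fields
from datetime import datetime
from typing import Any, Optional


@dataclass
class PostRow:
    """One row of the post table, in the column order shown above."""
    cid: str                          # TEXT PRIMARY KEY
    created_at: datetime              # TIMESTAMP WITH TIME ZONE
    did: str
    commit: str
    text: Optional[str]               # nullability is assumed, not documented
    langs: Optional[list]             # TEXT ARRAY
    facets: Any                       # JSONB
    has_embed: bool
    embed_type: Optional[str]
    embed_refs: Optional[list]        # TEXT ARRAY
    external_uri: Optional[str]
    has_record: bool
    record_cid: Optional[str]
    record_uri: Optional[str]
    is_reply: bool
    reply_root_cid: Optional[str]
    reply_root_uri: Optional[str]
    reply_parent_cid: Optional[str]
    reply_parent_uri: Optional[str]


# Column names in table order, handy for building an explicit SELECT list.
POST_COLUMNS = [f.name for f in fields(PostRow)]


def post_from_row(row):
    """Convert a tuple fetched with the columns in POST_COLUMNS order."""
    if len(row) != len(POST_COLUMNS):
        raise ValueError(f"expected {len(POST_COLUMNS)} values, got {len(row)}")
    return PostRow(*row)
```

Selecting the columns explicitly in `POST_COLUMNS` order, rather than with `SELECT *`, keeps the mapping stable if the table gains columns later.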
- Async functionality
- Exponential backoff for reconnections to the firehose and for retries against plc.directory
- Rotating logs, bind-mounted to a widesky/logs folder
- Async workers for processing and batching to Postgres
- Batched Postgres saving
- Implement graph.list post type
- Implement embed types
  - images#main
  - selectionQuote
  - secret
  - Others I have not seen?
- Improve error handling
- Add testing
- Capture PostgreSQL logs in logs/postgres
- Add webserver with metrics and ability to configure capture protocols
- Integrate with a crawler to reach back for full activity records of active users where not present already in data
- Add option to prevent HTTPX logging clogging up the logs
- Capture delete and other non-create events
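
Two of the items listed above, exponential backoff on reconnection and batched saving through async workers, can be sketched in a few lines of asyncio. This is an illustrative sketch under stated assumptions, not the code used in this repository: the names `backoff_delays`, `retry` and `batch_worker`, the default delays, the batch size, and the use of `None` as a stop signal are all hypothetical.

```python
import asyncio
import random


def backoff_delays(base=1.0, factor=2.0, cap=60.0, jitter=0.0):
    """Yield delays base, base*factor, ... capped at `cap`, plus optional jitter."""
    delay = base
    while True:
        yield min(delay, cap) + random.uniform(0, jitter)
        delay = min(delay * factor, cap)


async def retry(operation, max_attempts=5, sleep=asyncio.sleep, **backoff):
    """Await `operation()`, retrying on ConnectionError with growing pauses."""
    delays = backoff_delays(**backoff)
    for attempt in range(1, max_attempts + 1):
        try:
            return await operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            await sleep(next(delays))


async def batch_worker(queue, flush, batch_size=100):
    """Collect queue items into batches and pass each batch to `flush`.

    A `None` item is treated as a stop signal (an assumption of this sketch);
    any partial batch is flushed before the worker exits.
    """
    batch = []
    while True:
        item = await queue.get()
        if item is None:
            break
        batch.append(item)
        if len(batch) >= batch_size:
            await flush(batch)
            batch = []
    if batch:
        await flush(batch)
```

In a real listener, `operation` would open the firehose connection and `flush` would write a batch of rows to Postgres in a single statement. Batching trades a little latency for far fewer database round trips.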
Particular thanks to David Peck, who implemented a lovely decoding of the CBOR protocol and whose work I have included in the firehose_utils.py file. Please see his work here: https://gist.github.com/davepeck/8ada49d42d44a5632b540a4093225719 and https://github.com/davepeck.
A technical paper will be released soon. For now, please mention the GitHub repository, and in academic works please also mention my ORCID (https://orcid.org/0009-0000-1581-4021).
This work is licensed under the LGPL-3.0.
In case of questions please contact @jhculb.
For contributions and bug reports, open an issue here.