Update README.md

c0mmando 2025-03-01 17:44:19 +00:00 committed by GitHub
parent c931799f4e
commit 74da2c49e0
GPG Key ID: B5690EEEBB952194

117
README.md

@ -1,2 +1,115 @@
# discourse-to-markdown-archiver
This script archives posts and renders topics to Markdown from one or more Discourse instances
# Discourse to Markdown Archiver
This script archives posts and renders topics to Markdown from one or more [Discourse](https://www.discourse.org/) installations. It downloads posts from specified Discourse servers via their API, archives them as JSON files (avoiding duplicates), and renders topic threads into Markdown files. Each site is stored in its own subdirectory along with a separate metadata file tracking synchronization details.
## Discourse Forums to Archive
- https://forum.hackliberty.org
- https://forums.whonix.org
- https://forum.torproject.net
- https://discuss.privacyguides.net
## Features
- Archive new posts as JSON.
- Render topics to Markdown files.
- Support for multiple Discourse sites in a single run (sites are processed one at a time).
- Separate metadata tracking per site (last synchronization date and archived post IDs).
- Concurrent rendering of topics using a thread pool for improved performance.
- Exponential backoff for HTTP requests to handle rate limits or transient errors.
## Requirements
- Python 3.7+
- Standard library modules (argparse, concurrent.futures, functools, etc.)
- Optionally, the [rich](https://github.com/willmcgugan/rich) module for improved logging output.
Install it via pip:
```bash
pip install rich
```
## Usage
Run the script from the command line.
### Command-Line Arguments
- `--urls`: A comma-separated list of Discourse server URLs.
Example:
```bash
--urls "https://forum.hackliberty.org,https://forums.whonix.org"
```
If not provided, the script defaults to `https://forum.hackliberty.org`. You can also set the `DISCOURSE_URLS` environment variable.
- `--target-dir` or `-t`: The base directory where archives and rendered topics will be stored.
Default is `./archive`. You can also set the `TARGET_DIR` environment variable.
Each site gets its own subdirectory, named after the site's hostname.
- `--debug`: Run in debug mode.
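The options above could be parsed roughly as follows. This is a minimal sketch, not the script's actual code: the function name `parse_args`, the `env` parameter, and the precedence order (command line first, then environment variable, then default) are assumptions made for illustration.

```python
import argparse
import os

DEFAULT_URL = "https://forum.hackliberty.org"


def parse_args(argv=None, env=None):
    """Parse CLI options, falling back to environment variables, then defaults.

    The precedence order is an assumption; the README does not state it.
    """
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(
        description="Archive Discourse posts and render topics to Markdown."
    )
    parser.add_argument(
        "--urls",
        default=env.get("DISCOURSE_URLS", DEFAULT_URL),
        help="comma-separated list of Discourse server URLs",
    )
    parser.add_argument(
        "-t", "--target-dir",
        default=env.get("TARGET_DIR", "./archive"),
        help="base directory for archives and rendered topics",
    )
    parser.add_argument("--debug", action="store_true", help="run in debug mode")
    args = parser.parse_args(argv)
    # Split the comma-separated string, dropping blanks and trailing slashes.
    args.urls = [u.strip().rstrip("/") for u in args.urls.split(",") if u.strip()]
    return args


# Explicit argv and an empty env keep this example independent of the shell.
example = parse_args(
    ["--urls", "https://forum.hackliberty.org,https://forums.whonix.org"], env={}
)
print(example.urls, example.target_dir)
```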
### Example
To archive posts and render topics from two sites and store the data in the `./archive` directory:
```bash
./archive_and_render.py --urls "https://forum.hackliberty.org,https://forums.whonix.org" --target-dir ./archive
```
Alternatively, using environment variables:
```bash
export DISCOURSE_URLS="https://forum.hackliberty.org,https://forums.whonix.org"
export TARGET_DIR="./archive"
./archive_and_render.py
```
## Directory Structure
After executing the script, the base target directory will be structured as follows:
```
./archive/
  site1.example.com/
    posts/
      2023-09-September/
        0000000123-username-topic-slug.json
        ...
    rendered-topics/
      2023-09-September/
        2023-09-15-topic-slug-id123.md
        ...
    .metadata.json
  site2.example.com/
    posts/
      ...
    rendered-topics/
      ...
    .metadata.json
```
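As an illustration of how paths matching this layout might be built, here is a small sketch. The helper names (`site_dir`, `post_path`, `topic_path`) and their parameters are hypothetical; the real logic lives in the script's `save()` and `save_rendered()` methods and may differ.

```python
from datetime import datetime
from pathlib import Path
from urllib.parse import urlparse


def site_dir(base_dir, url):
    """Per-site subdirectory named after the site's hostname."""
    return Path(base_dir) / urlparse(url).hostname


def post_path(base_dir, url, post_id, username, topic_slug, created_at):
    """Path for one archived post, e.g. posts/2023-09-September/0000000123-...json."""
    # %B is the full month name; it is locale-dependent ("September" by default).
    month_dir = created_at.strftime("%Y-%m-%B")
    filename = f"{post_id:010d}-{username}-{topic_slug}.json"
    return site_dir(base_dir, url) / "posts" / month_dir / filename


def topic_path(base_dir, url, topic_id, topic_slug, created_at):
    """Path for one rendered topic, e.g. rendered-topics/2023-09-September/...md."""
    month_dir = created_at.strftime("%Y-%m-%B")
    filename = f"{created_at:%Y-%m-%d}-{topic_slug}-id{topic_id}.md"
    return site_dir(base_dir, url) / "rendered-topics" / month_dir / filename


created = datetime(2023, 9, 15)
print(post_path("./archive", "https://site1.example.com", 123, "username", "topic-slug", created))
print(topic_path("./archive", "https://site1.example.com", 123, "topic-slug", created))
```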
Each site's `.metadata.json` contains:
- `last_sync_date`: The ISO-formatted date of the last successful sync.
- `archived_post_ids`: A list of post IDs that have been archived, used to avoid duplicate downloads across invocations.
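A rough sketch of how such a metadata file could be read, updated, and used to skip already-archived posts is shown below. The function names are hypothetical and the script's own handling may differ; only the two field names come from the description above.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def load_metadata(path):
    """Read a site's .metadata.json, or return empty metadata on first run."""
    path = Path(path)
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"last_sync_date": None, "archived_post_ids": []}


def save_metadata(path, metadata):
    """Write metadata back to disk, creating the site directory if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")


def record_new_posts(metadata, post_ids, now=None):
    """Return the IDs not archived before and update metadata in place."""
    seen = set(metadata["archived_post_ids"])
    new_ids = []
    for post_id in post_ids:
        if post_id not in seen:
            seen.add(post_id)
            new_ids.append(post_id)
    metadata["archived_post_ids"] = sorted(seen)
    metadata["last_sync_date"] = (now or datetime.now(timezone.utc)).isoformat()
    return new_ids


meta = {"last_sync_date": None, "archived_post_ids": [1, 2]}
print(record_new_posts(meta, [2, 3, 3, 4]))  # only IDs 3 and 4 are new
```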
## Logging
The script uses the logging module for feedback during processing. If the optional `rich` module is installed, rich logging output is enabled.
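One common way to make `rich` optional is to try the import and fall back to a plain handler. The sketch below assumes this pattern; the function name `configure_logging`, the logger name, and the mapping of debug mode to the `DEBUG` level are illustrative assumptions, not taken from the script.

```python
import logging


def configure_logging(debug=False):
    """Set up a logger that uses rich's handler when the module is available."""
    try:
        from rich.logging import RichHandler  # optional dependency
        handler = RichHandler()
        fmt = "%(message)s"  # rich renders time and level itself
    except ImportError:
        handler = logging.StreamHandler()
        fmt = "%(asctime)s %(levelname)s %(message)s"
    handler.setFormatter(logging.Formatter(fmt))

    logger = logging.getLogger("discourse_archiver")  # hypothetical logger name
    # Assumption: debug mode simply lowers the threshold to DEBUG.
    logger.setLevel(logging.DEBUG if debug else logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    logger.addHandler(handler)
    logger.propagate = False
    return logger


log = configure_logging(debug=True)
log.debug("logging configured")
```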
## Troubleshooting
- **Network Issues / Rate Limits**: The script uses exponential backoff when requests fail (for example, because of rate limits). If requests fail repeatedly, check your network connectivity or, if you administer the server, adjust its rate limit settings.
- **JSON Decoding Errors**: The script will log a warning if it fails to decode JSON from the API. Ensure the target Discourse instance is accessible and responding correctly.
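For readers unfamiliar with the technique, exponential backoff with a JSON decoding check can be sketched as below. This is an illustrative version using only the standard library; the function names, the retry count, and the base delay are assumptions, and the script's actual values may differ.

```python
import json
import logging
import time
import urllib.request

log = logging.getLogger(__name__)


def http_get_json(url, timeout=15):
    """Fetch a URL and decode the response body as JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def fetch_with_backoff(url, fetch=http_get_json, retries=5, base_delay=1.0,
                       sleep=time.sleep):
    """Call fetch(url), doubling the wait after each failure (retries >= 1).

    OSError covers urllib's URLError/HTTPError and socket timeouts;
    JSONDecodeError covers responses that are not valid JSON.
    """
    for attempt in range(retries):
        try:
            return fetch(url)
        except (OSError, json.JSONDecodeError) as exc:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s, ...
            log.warning("Request to %s failed (%s); retrying in %.1fs", url, exc, delay)
            sleep(delay)
```

The `fetch` and `sleep` parameters are injectable so the retry logic can be exercised without network access or real delays.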
## Customization
- Adjust the number of threads in the `render_topics_concurrently()` function by modifying the `max_workers` parameter.
- Customize directories or filename formats in the `save()` and `save_rendered()` methods of the `Post` and `Topic` classes.
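To show where `max_workers` fits, here is a minimal thread-pool sketch. The function name matches the one mentioned above, but its signature, the `render` callback, and the default of 8 workers are assumptions for illustration only.

```python
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed

log = logging.getLogger(__name__)


def render_topics_concurrently(topics, render, max_workers=8):
    """Render each topic in a thread pool and return {topic_id: result}.

    `topics` maps topic IDs to topic objects; `render` is any callable
    that turns one topic into its rendered form (hypothetical signature).
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(render, topic): topic_id
            for topic_id, topic in topics.items()
        }
        for future in as_completed(futures):
            topic_id = futures[future]
            try:
                results[topic_id] = future.result()
            except Exception as exc:  # one failed topic should not stop the rest
                log.error("Rendering topic %s failed: %s", topic_id, exc)
    return results


rendered = render_topics_concurrently({1: "first topic", 2: "second topic"}, str.title)
print(rendered)
```

Rendering is mostly file I/O, which is why threads can help here despite Python's global interpreter lock; a higher `max_workers` mainly pays off when many topics are written per run.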
## License
This script is provided under the MIT license.
## Acknowledgements
This tool was inspired by community discussions and use cases for archiving and reporting data from Discourse installations. Shout out to https://github.com/jamesob/discourse-archive, which is where most of the code came from.
Happy archiving!