Update README.md

2025-05-31 16:21:43 +05:30 · 2025-03-01 17:44:19 +00:00
parent c931799f4e
commit 74da2c49e0
1 changed files with 115 additions and 2 deletions
--- a/README.md
+++ b/README.md
@@ -1,2 +1,115 @@
-# discourse-to-markdown-archiver
-This script archives posts and renders topics to Markdown from one or more Discourse instances
+# Discourse to Markdown Archiver
+
+This script archives posts and renders topics to Markdown from one or more [Discourse](https://www.discourse.org/) installations. It downloads posts from specified Discourse servers via their API, archives them as JSON files (avoiding duplicates), and renders topic threads into Markdown files. Each site is stored in its own subdirectory along with a separate metadata file tracking synchronization details.
+
+## Discourse Forums to Archive
+- https://forum.hackliberty.org
+- https://forums.whonix.org
+- https://forum.torproject.net
+- https://discuss.privacyguides.net
+
+## Features
+
+- Archive new posts as JSON.
+- Render topics to Markdown files.
+- Support for multiple Discourse sites in a single run (sites are processed one at a time).
+- Separate metadata tracking per site (last synchronization date and archived post IDs).
+- Concurrent rendering of topics using a thread pool for improved performance.
+- Exponential backoff for HTTP requests to handle rate limits or transient errors.
+
+## Requirements
+
+- Python 3.7+
+- Standard library modules (argparse, concurrent.futures, functools, etc.)
+- Optionally, the [rich](https://github.com/willmcgugan/rich) module for improved logging output.  
+  Install it via pip:
+  ```bash
+  pip install rich
+  ```
+
+## Usage
+
+Run the script from the command line.
+
+### Command-Line Arguments
+
+- `--urls`: A comma-separated list of Discourse server URLs.  
+  Example:  
+  ```bash
+  --urls "https://forum.hackliberty.org,https://forums.whonix.org"
+  ```
+  If not provided, the script defaults to `https://forum.hackliberty.org`. You can also set the `DISCOURSE_URLS` environment variable.
+
+- `--target-dir` or `-t`: The base directory where archives and rendered topics will be stored.  
+  Default is `./archive`.  
+  Each site will have its own subdirectory (using the site's hostname).
+
+- `--debug`: Run in debug mode.
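+
+The options above could be wired together roughly as follows. This is a minimal sketch, not the script's actual code: the `parse_args()` and `site_dirs()` helpers are hypothetical, and it assumes that command-line values take precedence over the `DISCOURSE_URLS` and `TARGET_DIR` environment variables.
+
+```python
+import argparse
+import os
+from pathlib import Path
+from urllib.parse import urlparse
+
+DEFAULT_URL = "https://forum.hackliberty.org"
+
+
+def parse_args(argv=None, environ=os.environ):
+    """Parse the documented options; env vars only supply defaults (assumption)."""
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--urls", default=environ.get("DISCOURSE_URLS", DEFAULT_URL))
+    parser.add_argument("-t", "--target-dir", default=environ.get("TARGET_DIR", "./archive"))
+    parser.add_argument("--debug", action="store_true")
+    return parser.parse_args(argv)
+
+
+def site_dirs(args):
+    """Map each URL in the comma-separated list to <target-dir>/<hostname>."""
+    urls = [u.strip() for u in args.urls.split(",") if u.strip()]
+    return {url: Path(args.target_dir) / urlparse(url).hostname for url in urls}
+```
+
+Calling `site_dirs(parse_args())` with no arguments and no environment variables set would yield a single entry pointing at `archive/forum.hackliberty.org`.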
+
+### Example
+
+To archive posts and render topics from two sites and store the data in the `./archive` directory:
+
+```bash
+./archive_and_render.py --urls "https://forum.hackliberty.org,https://forums.whonix.org" --target-dir ./archive
+```
+
+Alternatively, using environment variables:
+
+```bash
+export DISCOURSE_URLS="https://forum.hackliberty.org,https://forums.whonix.org"
+export TARGET_DIR="./archive"
+./archive_and_render.py
+```
+
+## Directory Structure
+
+After executing the script, the base target directory will be structured as follows:
+
+```
+./archive/
+    site1.example.com/
+        posts/
+            2023-09-September/
+                0000000123-username-topic-slug.json
+                ...
+        rendered-topics/
+            2023-09-September/
+                2023-09-15-topic-slug-id123.md
+                ...
+        .metadata.json
+    site2.example.com/
+        posts/
+            ...
+        rendered-topics/
+            ...
+        .metadata.json
+```
+
+Each site's `.metadata.json` contains:
+- `last_sync_date`: The ISO-formatted date of the last successful sync.
+- `archived_post_ids`: A list of post IDs that have been archived, used to avoid duplicate downloads across invocations.
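+
+To illustrate how these two fields can prevent duplicate downloads, here is a minimal sketch. The function names are hypothetical and the script's real implementation may differ; only the field names `last_sync_date` and `archived_post_ids` come from the description above.
+
+```python
+import json
+from datetime import datetime, timezone
+from pathlib import Path
+
+
+def load_metadata(path):
+    """Read the metadata file, or start with an empty state on the first run."""
+    path = Path(path)
+    if path.exists():
+        return json.loads(path.read_text())
+    return {"last_sync_date": None, "archived_post_ids": []}
+
+
+def record_posts(metadata, post_ids):
+    """Return only the IDs not archived before, and remember them."""
+    seen = set(metadata["archived_post_ids"])
+    fresh = []
+    for post_id in post_ids:
+        if post_id not in seen:
+            seen.add(post_id)
+            fresh.append(post_id)
+    metadata["archived_post_ids"].extend(fresh)
+    metadata["last_sync_date"] = datetime.now(timezone.utc).isoformat()
+    return fresh
+
+
+def save_metadata(path, metadata):
+    """Write the metadata back as JSON."""
+    Path(path).write_text(json.dumps(metadata, indent=2))
+```
+
+A second run that sees the same post IDs gets an empty list back from `record_posts()`, so nothing is downloaded twice.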
+
+## Logging
+
+The script uses Python's `logging` module for feedback during processing. If the optional `rich` module is installed, rich logging output is enabled.
+
+## Troubleshooting
+
+- **Network Issues / Rate Limits**: The script retries failed requests with exponential backoff (for example, when it hits a rate limit). If requests keep failing, check your network connectivity or adjust the server's rate limit settings.
+- **JSON Decoding Errors**: The script will log a warning if it fails to decode JSON from the API. Ensure the target Discourse instance is accessible and responding correctly.
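+
+For readers unfamiliar with exponential backoff, the idea can be sketched as below. This is an illustration under stated assumptions, not the script's own retry code: `backoff_delays()` and `fetch_with_backoff()` are hypothetical names, and the base delay, factor, and retry count are made-up defaults.
+
+```python
+import time
+import urllib.error
+import urllib.request
+
+
+def backoff_delays(base_delay=1.0, factor=2.0, max_retries=5):
+    """Return the wait before each retry: base, base*factor, base*factor**2, ..."""
+    return [base_delay * factor ** attempt for attempt in range(max_retries)]
+
+
+def fetch_with_backoff(url, opener=urllib.request.urlopen, sleep=time.sleep, **kwargs):
+    """Fetch url, sleeping for a growing delay after each failed attempt."""
+    last_error = None
+    for delay in backoff_delays(**kwargs):
+        try:
+            with opener(url) as response:
+                return response.read()
+        except urllib.error.URLError as exc:  # HTTPError (e.g. 429) is a subclass
+            last_error = exc
+            sleep(delay)
+    raise RuntimeError(f"giving up on {url}") from last_error
+```
+
+Doubling the delay after each failure gives a rate-limited server progressively more room to recover, while a request that succeeds on the first try pays no penalty.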
+
+## Customization
+
+- Adjust the number of threads in the `render_topics_concurrently()` function by modifying the `max_workers` parameter.
+- Customize directories or filename formats in the `save()` and `save_rendered()` methods of the `Post` and `Topic` classes.
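+
+As a rough picture of what `max_workers` controls, the sketch below renders a list of topics with a thread pool. The `render_all()` helper and its `render_one` callback are hypothetical stand-ins, not the body of `render_topics_concurrently()`.
+
+```python
+from concurrent.futures import ThreadPoolExecutor
+
+
+def render_all(topics, render_one, max_workers=4):
+    """Render topics in parallel; results come back in input order."""
+    with ThreadPoolExecutor(max_workers=max_workers) as pool:
+        return list(pool.map(render_one, topics))
+```
+
+A larger `max_workers` lets more topics render at once, at the cost of more simultaneous work on the machine running the script.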
+
+## License
+
+This script is provided under the MIT license.
+
+## Acknowledgements
+
+This tool was inspired by community discussions and use cases for archiving and reporting data from Discourse installations. Shout-out to https://github.com/jamesob/discourse-archive, where most of the code came from.
+
+Happy archiving!