From eeb66ad83ac96b42ac8ee2d5b46f514bd5bad179 Mon Sep 17 00:00:00 2001
From: Joe Thornber
Date: Thu, 14 Sep 2017 11:58:56 +0100
Subject: [PATCH] [doc] add some ramblings on thinp metadata v2

---
 doc/thinp-version-2/notes.md | 155 +++++++++++++++++++++++++++++++++++
 1 file changed, 155 insertions(+)
 create mode 100644 doc/thinp-version-2/notes.md

diff --git a/doc/thinp-version-2/notes.md b/doc/thinp-version-2/notes.md
new file mode 100644
index 0000000..ab6e634
--- /dev/null
+++ b/doc/thinp-version-2/notes.md
@@ -0,0 +1,155 @@

It's time for a major update to the thin provisioning target. This is a chance
to add new features and address deficiencies in the current version.

Features
========

Features that we should consider (some are more realistic than others):

- Performance enhancements for solid state storage, eg. streaming writes.
  Take the erase block size into consideration.

- Compression.

- Resilience in the face of damaged metadata. Measure potential data loss
  relative to the size of the damage.

- Represent zeroed data in the metadata to avoid storing zeroes on disk.

- Get away from the fixed block size, since any single block size is a
  compromise between provisioning performance and snapshot efficiency.

- Improve metadata performance.

  The space maps are too heavyweight.

- Improve performance on multicore systems.

- Reduce metadata size.

- Make efficient use of multiple devices.

  Currently thinp is totally unaware of how the data device is built up.

Anti-features
=============

Not considering these at all:

- Dedup.

Metadata
========

Problems with the existing metadata
-----------------------------------

- Btrees are fragile.

  Either use a different data structure, or add enough information that damaged
  trees can be inferred and rebuilt.

- Metadata is huge.

  Start using ranges.

- Space maps.

  Reference counting ranges will be more tedious, and finding free space now
  needs to find ranges quickly.

Ideas
-----

- What could we use instead of btrees? Skip lists? It's difficult to make
  these fit the persistent-data scheme; I think they're better as an in-core
  data structure (where spatial locality is less important).

- Drop reference counting from space maps completely.

  This would allow them to be implemented with a simpler data structure, like a
  radix tree or a trie. It would be impossible to ascertain which blocks were
  free without a complete walk of the metadata. This is possibly OK if the
  metadata shrinks drastically through the use of ranges.

- Space maps do not need to be 'within' the persistent-data structure system
  since we never snapshot them.


Blob abstraction
================

A storage abstraction, a bit different from a block device. It presents a
virtual address space:

    (read dev-id begin end data)
    (write dev-id begin end data)
    (erase dev-id begin end)
    (copy src-dev-id src-begin src-end dest-dev-id dest-begin)

How do we cope with a device being split across different blobs? We need a
data structure to hold this mapping information:

    (map dev-id begin end) -> [(blob begin end)]

Could we use bloom filters in some way? (I can't see how; we'd need to cope
with erasure and false positives.)

Write:

We always want to write into the highest priority blob (ie. SSD), so we need to
write to the new blob, commit, and then erase from the old blobs.

Read:

Look up the blobs, issue IOs and wait for all to complete.

Erase:

Look up the blobs, issue erases.
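To pin the interface down a little, here's a rough C sketch of the blob ops and
the write path just described. Every name in it (blob_ops, blob_fragment,
blob_map_lookup, fastest_blob, commit) is made up for illustration; this is one
way the abstraction might look, not a design decision.

    /*
     * Hypothetical sketch of the blob interface; none of these names
     * exist in the current code.
     */
    #include <stdint.h>

    typedef uint64_t dev_id_t;
    typedef uint64_t addr_t;    /* offset in a device's virtual address space */

    struct blob;

    /* The four operations from the sexp sketch above. */
    struct blob_ops {
        int (*read)(struct blob *b, dev_id_t dev, addr_t begin, addr_t end,
                    void *data);
        int (*write)(struct blob *b, dev_id_t dev, addr_t begin, addr_t end,
                     const void *data);
        int (*erase)(struct blob *b, dev_id_t dev, addr_t begin, addr_t end);
        int (*copy)(struct blob *src_b, dev_id_t src, addr_t src_begin,
                    addr_t src_end, struct blob *dest_b, dev_id_t dest,
                    addr_t dest_begin);
    };

    struct blob {
        const struct blob_ops *ops;
        unsigned priority;      /* higher means faster media, eg. SSD */
        void *context;
    };

    /*
     * A device region may be split across blobs, so a lookup returns a
     * set of fragments: (map dev-id begin end) -> [(blob begin end)]
     */
    struct blob_fragment {
        struct blob *b;
        addr_t begin, end;
    };

    /* Fills 'result' with at most 'max' fragments covering [begin, end). */
    unsigned blob_map_lookup(dev_id_t dev, addr_t begin, addr_t end,
                             struct blob_fragment *result, unsigned max);

    extern struct blob *fastest_blob(void);
    extern int commit(void);    /* makes the new mapping durable */

    /*
     * Write path: write into the highest priority blob, commit, then
     * erase the stale fragments from the old blobs.  The fixed-size
     * fragment array is just to keep the sketch short.
     */
    int blob_write(dev_id_t dev, addr_t begin, addr_t end, const void *data)
    {
        struct blob_fragment old[16];
        unsigned i, nr = blob_map_lookup(dev, begin, end, old, 16);
        struct blob *fast = fastest_blob();
        int r;

        r = fast->ops->write(fast, dev, begin, end, data);
        if (r)
            return r;

        r = commit();           /* the new copy must be durable first */
        if (r)
            return r;

        for (i = 0; i < nr; i++)
            if (old[i].b != fast)
                old[i].b->ops->erase(old[i].b, dev,
                                     old[i].begin, old[i].end);
        return 0;
    }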

Dealing with atomicity
----------------------

Blobs store their metadata in different ways; do they individually implement
transactions, or can we enforce transactionality from above? I think the
address space has to be managed for all blobs in one place. So each blob
presents a *physical* address space, and the core maps thin devices to physical
spaces.

Journal blob: records changes as a series; efficient for SSDs, but start up is
slow since we need to walk the journal to build an in-core map.

Transparent blob: no smarts; physical addresses are translated to the data
device with a linear mapping. This suggests we have to hold pretty much all of
the current thinp metadata in core.

Compression blob: adds an additional layer of remapping to provide compression.

Aging
-----

Data ages from one blob to another. Because the journal blob is held in
temporal order, it's trivial to work out what should be archived. But the
transparent one? Perhaps this should be another instance of the journal blob?


ALL blobs now mix metadata and data. The core metadata needs to go somewhere
(a special dev-id within the fast blob?).


Temp btrees
-----------

If we're journalling we can relax the way we use btrees. There are a couple of
options (see the sketch after this list):

 - Treat the btree as totally expendable and use no shadowing at all.

 - The commit period for the btree can be controlled by the journal, avoiding a
   commit whenever a REQ_FLUSH comes in.
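As a rough sketch of the second option, assuming hypothetical journal_append,
journal_sync and btree_checkpoint primitives: a REQ_FLUSH only has to make the
journal durable, while the btree is checkpointed on its own, much longer,
period.

    /*
     * Sketch: the journal provides durability, so REQ_FLUSH only syncs
     * the journal; the btree commits on its own period.  All names are
     * hypothetical.
     */
    #include <stdint.h>

    extern int journal_append(const void *entry, unsigned len);
    extern int journal_sync(void);      /* make appended entries durable */
    extern int btree_checkpoint(void);  /* expensive: shadowing + superblock */

    #define CHECKPOINT_PERIOD 8192      /* tunable: entries between checkpoints */

    static uint64_t entries_since_checkpoint;

    /* Called for every metadata update. */
    int metadata_update(const void *entry, unsigned len)
    {
        int r = journal_append(entry, len);
        if (r)
            return r;

        /*
         * Between checkpoints the btree is expendable: after a crash it
         * is rebuilt by replaying the journal from the last checkpoint,
         * so we can checkpoint infrequently.
         */
        if (++entries_since_checkpoint >= CHECKPOINT_PERIOD) {
            r = btree_checkpoint();
            if (r)
                return r;
            entries_since_checkpoint = 0;
        }

        return 0;
    }

    /* REQ_FLUSH: durability comes from the journal, not a btree commit. */
    int handle_flush(void)
    {
        return journal_sync();
    }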