thin-provisioning-tools/doc/thinp-version-2/notes.md

It's time for a major update to the thin provisioning target. This is a chance to add new features and address deficiencies in the current version.

Features

Features that we should consider (some are more realistic than others).

  • Performance enhancements for solid state storage, e.g. streaming writes. Take the erase block size into consideration.

  • Compression.

  • Resilience in the face of damaged metadata. Measure potential data loss compared to size of damage.

  • Support zeroed data in the metadata to avoid storing zeroes on disk.

  • Get away from the fixed block size.

    A fixed block size is always a compromise between provisioning performance and snapshot efficiency.

  • Performance improvement for metadata.

    Space maps are too heavy.

  • Performance improvement for multicore.

  • Reduce metadata size.

  • Efficient use of multiple devices.

    Currently thinp is totally unaware of how the data device is built up.

Anti-features

Not considering these at all:

  • Dedup.

Metadata

Problems with the existing metadata

  • Btrees are fragile. Either use a different data structure, or add enough info that trees can be inferred and rebuilt.

  • Metadata is huge. Start using ranges.

  • Space maps.

    Reference counting ranges will be more tedious. Finding free space now means finding free ranges quickly.
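To make the "find free ranges quickly" requirement concrete, here is a minimal sketch of a range-based free list. It is only an illustration: the `RangeAllocator` name, the first-fit policy and the half-open `[begin, end)` convention are assumptions, it ignores reference counting entirely, and the linear scan would need a better index to count as quick.

```python
from bisect import insort


class RangeAllocator:
    """Tracks free space as sorted, non-overlapping [begin, end) ranges.

    Hypothetical illustration only; no reference counts are kept.
    """

    def __init__(self, nr_blocks):
        self.free = [(0, nr_blocks)]  # sorted by begin

    def alloc(self, length):
        """First fit: return a free (begin, end) run of `length` blocks, or None."""
        for i, (b, e) in enumerate(self.free):
            if e - b >= length:
                if e - b == length:
                    del self.free[i]  # range used up exactly
                else:
                    self.free[i] = (b + length, e)  # shrink from the front
                return (b, b + length)
        return None

    def release(self, begin, end):
        """Return [begin, end) to the free list, merging with its neighbours."""
        insort(self.free, (begin, end))
        merged = []
        for b, e in self.free:
            if merged and merged[-1][1] >= b:
                merged[-1] = (merged[-1][0], max(merged[-1][1], e))
            else:
                merged.append((b, e))
        self.free = merged
```

Allocating a range is a single list update however long the range is, which is the attraction over per-block bitmaps; the tedious part noted above (shared ranges that need counts) is exactly what this sketch leaves out.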

Ideas

  • What could we use instead of btrees? Skip lists: it's difficult to make these fit the persistent-data scheme. I think they are better as an in-core data structure (where spatial locality is less important).

  • Drop reference counting from space maps completely.

    This would allow them to be implemented with a simpler data structure, like a radix tree or a trie. It would be impossible to ascertain which blocks were free without a complete walk of the metadata. This is possibly ok if the metadata shrinks drastically through the use of ranges.

  • Space maps do not need to be 'within' the persistent-data structure system since we never snapshot them.
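As a rough picture of what "a complete walk of the metadata" would have to compute if reference counts were dropped, the sketch below derives the free ranges from the set of mapped ranges. The function name and the `[begin, end)` tuples are assumptions made for illustration; the walk that produces `mapped` is not shown.

```python
def free_ranges(mapped, nr_blocks):
    """Return the [begin, end) ranges in [0, nr_blocks) not covered by `mapped`.

    `mapped` holds every physical range referenced by any device or snapshot.
    Ranges may overlap or repeat, since shared data is referenced more than once.
    """
    free = []
    cursor = 0  # first block not yet known to be in use
    for begin, end in sorted(mapped):
        if begin > cursor:
            free.append((cursor, begin))  # gap before this mapped range
        cursor = max(cursor, end)
    if cursor < nr_blocks:
        free.append((cursor, nr_blocks))  # tail of the device
    return free
```

The cost is a sort over all mapped ranges, so this is only plausible if ranges shrink the metadata as much as hoped.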

Blob abstraction

A storage abstraction, a bit different from a block device. Presents a virtual address space.

    (read dev-id begin end data)
    (write dev-id begin end data)
    (erase dev-id begin end)
    (copy src-dev-id src-begin src-end dest-dev-id dest-begin)
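A toy in-memory version of these four operations may help pin down their semantics. Everything here is an assumption made for illustration: the `MemoryBlob` name, byte-granularity addresses, `read` returning the data rather than filling a buffer, and unwritten or erased addresses reading back as zeroes.

```python
class MemoryBlob:
    """Toy blob: a sparse virtual address space per dev-id, held in a dict."""

    def __init__(self):
        self.cells = {}  # (dev_id, address) -> one byte value

    def write(self, dev_id, begin, end, data):
        assert len(data) == end - begin
        for offset, byte in enumerate(data):
            self.cells[(dev_id, begin + offset)] = byte

    def read(self, dev_id, begin, end):
        # Assumption: unwritten or erased addresses read back as zeroes.
        return bytes(self.cells.get((dev_id, a), 0) for a in range(begin, end))

    def erase(self, dev_id, begin, end):
        for a in range(begin, end):
            self.cells.pop((dev_id, a), None)

    def copy(self, src_dev_id, src_begin, src_end, dest_dev_id, dest_begin):
        # The destination end is implied by the source length.
        data = self.read(src_dev_id, src_begin, src_end)
        self.write(dest_dev_id, dest_begin, dest_begin + len(data), data)
```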

How do we cope with a device being split across different blobs? We need a data structure to hold this metadata information:

(map dev-id begin end) -> [(blob begin end)]
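One possible shape for this mapping is a sorted list of extents per device, searched by binary search. This is a sketch under stated assumptions, not a proposal for the on-disk format: the `RangeMap` class, the extent tuple layout and the rule that inserted extents never overlap are all hypothetical.

```python
from bisect import bisect_right


class RangeMap:
    """Per-device sorted list of (begin, end, blob, blob_begin) extents."""

    def __init__(self):
        self.extents = {}  # dev_id -> sorted, non-overlapping extents

    def insert(self, dev_id, begin, end, blob, blob_begin):
        # Assumes the new extent does not overlap an existing one.
        ext = self.extents.setdefault(dev_id, [])
        # A 1-tuple sorts before any longer tuple with the same first element.
        ext.insert(bisect_right(ext, (begin,)), (begin, end, blob, blob_begin))

    def lookup(self, dev_id, begin, end):
        """Return [(blob, blob_begin, blob_end)] for the mapped parts of [begin, end)."""
        ext = self.extents.get(dev_id, [])
        # Start from the last extent whose begin is <= `begin`, if there is one.
        i = max(bisect_right(ext, (begin, float("inf"))) - 1, 0)
        result = []
        while i < len(ext) and ext[i][0] < end:
            b, e, blob, blob_begin = ext[i]
            lo, hi = max(b, begin), min(e, end)
            if lo < hi:  # the extent really intersects the request
                result.append((blob, blob_begin + (lo - b), blob_begin + (hi - b)))
            i += 1
        return result
```

A request that spans two blobs comes back as two pieces, for example `lookup(1, 50, 120)` over extents `[0, 100)` on one blob and `[100, 150)` on another, and unmapped gaps are simply absent from the result.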

Could we use Bloom filters in some way? (I can't see how; we'd need to cope with erasure and false positives.)

Write:

We always want to write into the highest priority blob (i.e. SSD), so we need to write to the new blob, commit, and only then erase from the old blobs.
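The ordering matters more than the mechanics, so here is a small sketch that records the order of the three steps. The `ToyBlob` class, the per-address `mapping` dict and the `log` list are hypothetical stand-ins; a real commit would be a metadata transaction rather than a dict update.

```python
class ToyBlob:
    """Hypothetical stand-in for a blob: just a name and a dict of bytes."""

    def __init__(self, name):
        self.name = name
        self.data = {}  # (dev_id, address) -> byte value


def tiered_write(mapping, log, fast, dev_id, begin, data):
    """Write `data` at `begin` of `dev_id` into `fast`, then retire old copies."""
    # 1. Write the new data into the highest priority blob.
    for off, byte in enumerate(data):
        fast.data[(dev_id, begin + off)] = byte
    log.append("write " + fast.name)

    # 2. Commit: repoint the mapping, remembering which blobs held old copies.
    stale = []
    for off in range(len(data)):
        old = mapping.get((dev_id, begin + off))
        if old is not None and old is not fast:
            stale.append((old, begin + off))
        mapping[(dev_id, begin + off)] = fast
    log.append("commit")

    # 3. Only now is it safe to erase the superseded copies.
    for old, addr in stale:
        old.data.pop((dev_id, addr), None)
    if stale:
        log.append("erase old copies")
```

If a crash happens before step 2 the old copy is still mapped and intact; if it happens before step 3 the only cost is space that has not been reclaimed yet.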

Read:

Look up blobs, issue IOs and wait for all to complete.

Erase:

Look up blobs, issue erase.

Dealing with atomicity

Blobs store their metadata in different ways. Do they individually implement transactions, or can we enforce transactionality from above? I think the address space has to be managed for all blobs in one space. So each blob presents a physical address space, and the core maps thin devices to physical spaces.

Journal blob: records changes in a series; efficient for SSDs, but slow to start up since we need to walk the journal to build an in-core map.

Transparent blob: no smarts; physical addresses are translated to the data dev with a linear mapping. This suggests we have to have pretty much all of the current thinp metadata in the core.

Compression blob: Adds an additional layer of remapping to provide compression.
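The start-up cost of the journal blob comes from a replay like the one sketched below, which walks every record to rebuild the in-core map. The record format (`"write"` or `"erase"` plus a dev-id and a `[begin, end)` range) and the per-address map are assumptions chosen to keep the example short; a real map would hold ranges.

```python
def replay(journal):
    """Build the in-core map {(dev_id, address): journal position} from a journal.

    `journal` is a list of records in the order they were written:
    ("write", dev_id, begin, end) or ("erase", dev_id, begin, end).
    """
    core = {}
    for pos, (op, dev_id, begin, end) in enumerate(journal):
        for addr in range(begin, end):
            if op == "write":
                core[(dev_id, addr)] = pos  # later records supersede earlier ones
            elif op == "erase":
                core.pop((dev_id, addr), None)
    return core
```

The work is proportional to the length of the journal, not to the amount of live data, which is why start-up is slow for a long journal.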

Aging

Data ages from one blob to another. Because the journal blob is held in temporal order it's trivial to work out what should be archived. But the transparent one? Perhaps this should be another instance of the journal blob?
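For the journal case, choosing what to archive can be as simple as taking records from the old end until enough space would be reclaimed. This is a sketch only: the function name, the `(dev_id, begin, end)` record shape and the `want` threshold are hypothetical, and it assumes the journal passed in holds only live records.

```python
def oldest_to_archive(journal, want):
    """Return the shortest prefix of `journal` covering at least `want` addresses.

    `journal` lists live (dev_id, begin, end) records, oldest first.
    """
    picked, covered = [], 0
    for dev_id, begin, end in journal:
        if covered >= want:
            break
        picked.append((dev_id, begin, end))
        covered += end - begin
    return picked
```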

ALL blobs now mix metadata and data. Core metadata needs to go somewhere (a special dev-id on the fast blob?).

Temp btrees

If we're journalling we can relax the way we use btrees. There are a couple of options:

  • Treat the btree as totally expendable, use no shadowing at all.

  • Commit period for btree can be controlled by the journal, avoiding commits whenever a REQ_FLUSH comes in.