[Nix-dev] Please test Nix store auto-optimise

Ertugrul Söylemez ertesx at gmx.de
Sat Apr 25 17:31:32 CEST 2015


> IMO much of the time overhead is basically unavoidable, as you need to
> do lots of random probes to a huge table (order of million entries,
> e.g.  1.5M in my case). Putting a single Bloom filter in front would
> shave I/O of most *failed* probes, but otherwise we can't get much
> better, I believe (being a PhD student in data structures).

There is some overhead, but it doesn't have to be anywhere near what we
have right now.  The filesystem stores far more information than this
operation needs.  Making the on-disk structure smaller and more
domain-specific makes the operation noticeably faster, and I can't
think of a more compact structure for storing hash maps than Patricia
trees.  A simple hash-to-block mapping is all we need.
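
To make that concrete, here is a rough, untested sketch (Python rather
than Nix's C++, and all names are made up) of the kind of structure I
mean: a flat table of fixed-size records mapping a content hash to the
inode that already holds that content, instead of all the metadata the
filesystem keeps around:

    import hashlib, os, struct

    # Sketch only: one 40-byte record per store file, content hash -> inode.
    ENTRY = struct.Struct(">32sQ")

    def file_hash(path):
        # Hash the file contents in chunks to keep memory bounded.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 16), b""):
                h.update(chunk)
        return h.digest()

    def append_entry(db_path, path):
        # Append one record; a real tool would keep the table sorted (or
        # use a trie) so lookups don't become random probes all over the
        # disk.
        record = ENTRY.pack(file_hash(path), os.stat(path).st_ino)
        with open(db_path, "ab") as db:
            db.write(record)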

I'm not sure whether the Bloom filter would really help.  Since it is
impossible to predict the required size in advance, you would have to
use some form of growing Bloom filter, or proactively create a filter
so large that it would defeat its own purpose.
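
To put numbers on that (back-of-the-envelope only, from the standard
sizing formula, not measured): a filter sized for 1.5M entries at a 1%
false-positive rate is small, but its false-positive rate blows up as
soon as the store outgrows the size it was planned for:

    import math

    def bloom_bits(n, p):
        # Optimal filter size in bits for n entries at false-positive rate p.
        return int(math.ceil(-n * math.log(p) / math.log(2) ** 2))

    def fp_rate(m, n, k):
        # Approximate false-positive rate of an m-bit filter holding n
        # entries, probed with k hash functions.
        return (1 - math.exp(-k * n / m)) ** k

    n0 = 1_500_000                    # entries the filter was sized for
    m = bloom_bits(n0, 0.01)          # roughly 1.8 MB of bits
    k = round(m / n0 * math.log(2))   # optimal number of hash functions
    print(fp_rate(m, n0, k))          # ~0.01, as designed
    print(fp_rate(m, 2 * n0, k))      # ~0.16 once the store has doubled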


> Anyway, this deduplication work seems better delegated to some other
> SW/library. Surely there are some already. If we wanted to deduplicate
> *similar* (unequal) files, we would even need to get FS-dependent. In
> the typical ext4 case, we don't even have reflinks AFAIK...

That's what I'm saying.  A separate tool should handle this.  It would
depend on the FS (in the sense of being a function of the FS) without
being bound to any particular one.  On traditional FSes like ext4 it
would do hard-linking, but perhaps in a slightly smarter way.
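
Roughly what I have in mind, as an untested sketch (Python, assuming a
plain ext4-style FS with hard links; error handling is left out): hash
every regular file, keep one canonical copy per content hash, and
replace later duplicates with a hard link to it:

    import hashlib, os, sys

    def dedup(root):
        # Sketch only: reads whole files into memory and ignores
        # permissions, mtimes and concurrent modification.
        seen = {}                        # content hash -> canonical path
        for dirpath, _, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                if os.path.islink(path) or not os.path.isfile(path):
                    continue
                with open(path, "rb") as f:
                    digest = hashlib.sha256(f.read()).hexdigest()
                if digest not in seen:
                    seen[digest] = path          # first copy is canonical
                elif not os.path.samefile(seen[digest], path):
                    tmp = path + ".dedup-tmp"
                    os.link(seen[digest], tmp)   # hard link to canonical copy
                    os.replace(tmp, path)        # atomically swap it in

    if __name__ == "__main__":
        dedup(sys.argv[1])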


Greets,
Ertugrul