[Nix-dev] Please test Nix store auto-optimise

Ertugrul Söylemez ertesx at gmx.de
Sat Apr 25 15:56:32 CEST 2015


> Once I wondered if using reflinks instead of hardlinks might be better
> from some point of view, but it probably won't be a big difference.

Tl;dr:  The current linking strategy has a noticeable space and time
overhead that is large enough to be unacceptable for me, so I turned it
off.  The overhead can be removed in two ways:  use reflinking when
available and, more importantly, get rid of the `.links` directory.

When the filesystem supports reflinking, it should be used.  The
advantages are:

  * From the viewpoint of the FS user this is completely transparent.

  * Most modern FSes support block-level deduplication.

  * The FS ensures consistency.  No potentially dangerous assumptions
    about the atomicity of FS operations.

  * Can be done in the background even on systems without nix-daemon.

  * Can be used safely to copy files into the store when `src` points to
    the local filesystem.

  * The `.links` approach would no longer work, forcing us to do
    something more sensible, for example a database file.
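
As a rough illustration of how transparent this is for the caller,
something like the following Python helper (my own sketch, not existing
Nix code; the ioctl value is the Linux btrfs clone ioctl, hard-coded
here because Python doesn't export it by name) reflinks when the
filesystem allows it and quietly falls back to a plain copy otherwise:

    import fcntl
    import shutil

    # Value of BTRFS_IOC_CLONE (later called FICLONE) from the kernel
    # headers; assumed here for a Linux/btrfs setup.
    BTRFS_IOC_CLONE = 0x40049409

    def copy_into_store(src, dst):
        """Copy src to dst, sharing data blocks via a reflink where the
        filesystem supports it."""
        with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
            try:
                # Ask the kernel to clone the source file's extents.
                fcntl.ioctl(fdst.fileno(), BTRFS_IOC_CLONE, fsrc.fileno())
                return
            except OSError:
                pass  # no reflink support here: fall back to a byte copy
        shutil.copyfile(src, dst)

On a reflink-capable filesystem the clone is a metadata operation that
takes roughly constant time regardless of file size; elsewhere the only
cost is one failed ioctl.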

To deduplicate new builds, simply maintain a fast database of block
hashes and compare the newly written files against it.  If a match is
found, instruct the filesystem to deduplicate and that's it.  After GC,
check whether the collected blocks still exist and drop the ones that
don't from the database.
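
To make this concrete, here is a minimal sketch of such a database
using sqlite3 and a fixed block size.  The schema, the 128 KiB
granularity and the hand-off to a kernel dedup call such as btrfs's
BTRFS_IOC_FILE_EXTENT_SAME (what tools like duperemove use) are
assumptions of mine for illustration, not anything Nix does today:

    import hashlib
    import sqlite3

    BLOCK_SIZE = 128 * 1024  # assumed hashing granularity

    def open_db(path):
        db = sqlite3.connect(path)
        db.execute("""CREATE TABLE IF NOT EXISTS blocks (
                          hash   BLOB,
                          file   TEXT,
                          offset INTEGER,
                          PRIMARY KEY (hash, file, offset))""")
        return db

    def index_and_match(db, path):
        """Hash every block of a freshly written store path and report
        blocks that already exist elsewhere.  The caller would then ask
        the filesystem to share the matching ranges (the dedup ioctl
        itself is omitted here)."""
        matches = []
        with open(path, "rb") as f:
            offset = 0
            while True:
                block = f.read(BLOCK_SIZE)
                if not block:
                    break
                h = hashlib.sha256(block).digest()
                row = db.execute(
                    "SELECT file, offset FROM blocks WHERE hash = ? "
                    "LIMIT 1", (h,)).fetchone()
                if row is not None and row[0] != path:
                    matches.append((row[0], row[1], offset, len(block)))
                db.execute("INSERT OR IGNORE INTO blocks VALUES (?, ?, ?)",
                           (h, path, offset))
                offset += len(block)
        db.commit()
        return matches

The actual extent sharing stays with the filesystem, and the dedup
ioctl compares the block contents itself before merging anything, so a
stale or colliding entry in the table costs at most a wasted
comparison.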

Should it happen that the database becomes inconsistent with the store
(for example because the system crashed during deduplication), nothing
bad will happen except that a few blocks might be wasted.

It would also help developers.  For example it's not uncommon for me to
point `src` to the local filesystem.  And since I'm doing a lot with
machine learning nowadays, often enough there are some huge files in a
`data` or `examples` directory.  Each time I do nix-build those are
copied verbatim to the store, which not only takes an unnecessarily long
time, but also wastes space.  As a btrfs user I would definitely benefit
from reflinking.  It would still require reading the entire file for
hashing, but the space waste would be gone.

The requirement to have this huge `.links` directory would be gone,
making both the optimisation process and garbage collection faster.
Just listing its contents takes more than a minute for me, with most of
the time spent thrashing the disk.  I'm not even sure it's really
required at all, since a database file would be the better option for
book-keeping anyway.

I'd also like to note that store optimisation is probably better handled
by a separate program.


Greets,
Ertugrul