[Nix-dev] Raiding Debian repository and other adventures

phreedom at yandex.ru phreedom at yandex.ru
Tue Jul 2 05:14:26 CEST 2013


Hi Nixers

I have been trying to tackle package freshness and security
issues for the last several days.

I have some preliminary findings I want to share. They are
mostly in the form of a brain dump; a better writeup will follow,
but unfortunately it would take too much time right now.

The first naive attempt was to reuse an existing distro.
After all, they supposedly have security advisories linked to their
package names, follow upstream closely thanks to a huge number of
maintainers and, in the case of debian, they have watchfiles which they
supposedly use to automatically pick up new upstream releases.

So all we need is to map nixpkgs package names to $BIG_DISTRO package names,
right?

== How to match packages from a different distro

Of course the first idea is to match by name with some trivial automated cleanup
like adding/removing/replacing prefixes such as ruby, python, kde, xorg.
Surprisingly, this gives good results: about 3.4k out of 5.3k total nixpkgs
packages matched to debian, and 1k out of 1.6k packages from the arch core
repository could be mapped to nixpkgs. Most of the unmapped arch packages
in fact don't exist in nixpkgs.
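
A minimal sketch of what such name cleanup could look like, in Python. The
affix lists and the function names (normalize, match_by_name) are made up
for illustration; the real cleanup rules would need many more cases.

```python
# Hypothetical affix lists: the post only names ruby, python, kde, xorg
# as examples, the exact spellings here are assumptions.
PREFIXES = ("ruby-", "python-", "kde-", "xorg-")
SUFFIXES = ("-ruby", "-python")


def normalize(name):
    """Reduce a package name to a bare key for cross-distro comparison."""
    key = name.lower().replace("_", "-")
    for prefix in PREFIXES:
        if key.startswith(prefix) and len(key) > len(prefix):
            key = key[len(prefix):]
            break
    for suffix in SUFFIXES:
        if key.endswith(suffix) and len(key) > len(suffix):
            key = key[:-len(suffix)]
            break
    return key


def match_by_name(nix_names, distro_names):
    """Map each nixpkgs name to the distro names sharing its normalized key."""
    index = {}
    for distro_name in distro_names:
        index.setdefault(normalize(distro_name), []).append(distro_name)
    return {n: index[normalize(n)] for n in nix_names if normalize(n) in index}
```

A name that maps to more than one distro package would still need a human
(or a second signal such as the tarball url) to pick the right one.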

=== Other ideas:

After a little bit of brainstorming, I tried to match tarball names
and locations instead of package names. After all, even if tarballs
are copied elsewhere they are unlikely to be renamed. So if we strip
mirror-specific stuff like domain, we could end up with a very reliable
package identifier!
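
To make the idea concrete, here is a rough Python sketch of turning a url
into a mirror-independent identifier. The function name tarball_id, the
keep_dirs parameter and the example.org urls are all hypothetical; how many
directories are worth keeping is an open question.

```python
from urllib.parse import urlsplit


def tarball_id(url, keep_dirs=1):
    """Drop scheme, host and leading mirror directories.

    Keeps the last `keep_dirs` directories plus the file name, on the
    assumption that mirrors differ mostly in domain and root dir.
    """
    parts = [p for p in urlsplit(url).path.split("/") if p]
    return "/".join(parts[-(keep_dirs + 1):])


# Two made-up mirrors of the same made-up tarball end up with the same ID.
a = tarball_id("http://mirror-a.example.org/pub/foo/foo-1.2.tar.gz")
b = tarball_id("ftp://mirror-b.example.net/foo/foo-1.2.tar.gz")
print(a == b)
```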

I ran tests on arch packages matched by name, and tarball names turned
out to be a very reliable ID. The only difficulty was different versions
in nixpkgs and arch. I used Levenshtein distance to estimate just how similar
the urls are. To actually use this in production, the Levenshtein algo should be
modified with different costs: very expensive to touch '/' (to preserve dir
structure), very cheap to modify numbers (to alter versions), and letters
somewhere in between, because we want to handle a, b, c suffixes.
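
A sketch of such a weighted Levenshtein distance in Python. The concrete
cost values (10, 0.1, 1) are placeholders I picked for illustration, not
tuned numbers, and the function names are hypothetical.

```python
def char_cost(c):
    """Cost of inserting, deleting or replacing a single character."""
    if c == "/":
        return 10.0  # very expensive: preserve directory structure
    if c.isdigit():
        return 0.1   # very cheap: versions are expected to differ
    return 1.0       # letters and everything else sit in between


def sub_cost(a, b):
    """Substitution is free for equal chars, else the pricier of the two."""
    return 0.0 if a == b else max(char_cost(a), char_cost(b))


def weighted_levenshtein(s, t):
    """Classic two-row dynamic programming with per-character costs."""
    prev = [0.0]
    for c in t:
        prev.append(prev[-1] + char_cost(c))
    for a in s:
        cur = [prev[0] + char_cost(a)]
        for j, b in enumerate(t, start=1):
            cur.append(min(
                prev[j] + char_cost(a),         # delete a from s
                cur[j - 1] + char_cost(b),      # insert b into s
                prev[j - 1] + sub_cost(a, b),   # replace a with b
            ))
        prev = cur
    return prev[-1]
```

With these costs a version bump in the file name is far cheaper than a
changed letter, and anything that adds or removes a '/' is effectively
ruled out as a match.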

But having proven that the idea has some merit, I was after a much larger fish,
Debian. Debian doesn't have tarball urls, because they ship tarballs
themselves, but they seemed to have something even more valuable: watchfiles.
If only these could be imported into nixpkgs!

I have hacked their autoupdate tool to provide a list of all available 
upstream tarballs, not just the most recent one. This would in theory produce 
the same tarball names as in nixpkgs (not counting mirror domain and root dir 
but that's easy to handle) thus enabling almost bullet-proof matching.

== Some stats on debian packages

total packages:
find ftp.debian.org/ -iname '*.debian.tar.gz'|wc -l
23007

Packages may not be unique, that is several versions
of the same package may be in the repo.

total watchfiles(extracted using a simple script):
find watchfiles/|wc -l
20347

Watchfiles that returned some URLs:
find deb_urls/|wc -l
16400

Watchfiles tend to be present for more popular/important packages,
which also tend to have several versions in the repos at once.
Due to this skew, the sheer number of watchfiles covers fewer
unique packages than it seems at first.

== Reliability of matching deb packages to nix packages

A random sampling of nix files I did for this writeup:
find /etc/nixos/nixpkgs/pkgs/ -iname '*.nix'|sort --random-sort |head

/etc/nixos/nixpkgs/pkgs/tools/misc/refind/default.nix -- not found in debian
/etc/nixos/nixpkgs/pkgs/development/ocaml-modules/typeconv/default.nix -- not found in debian
/etc/nixos/nixpkgs/pkgs/development/compilers/gcc/4.4/default.nix -- debian has no watchfile for gcc. would probably match by name.
/etc/nixos/nixpkgs/pkgs/development/libraries/haskell/hscolour/default.nix -- exact url match.
/etc/nixos/nixpkgs/pkgs/development/libraries/haskell/MonadCatchIO-mtl/default.nix -- exact url match.
/etc/nixos/nixpkgs/pkgs/development/libraries/pgen/default.nix -- not found in debian
/etc/nixos/nixpkgs/pkgs/development/libraries/haskell/text/0.11.0.6.nix -- exact url match
/etc/nixos/nixpkgs/pkgs/tools/networking/axel/default.nix -- match by name. debian watchfile present, but broken.
/etc/nixos/nixpkgs/pkgs/desktops/gnome-2/platform/libbonoboui/default.nix -- match by name, no watchfile.
/etc/nixos/nixpkgs/pkgs/development/libraries/haskell/GLFW/default.nix -- match by url, watchfile present
/etc/nixos/nixpkgs/pkgs/development/libraries/haskell/appar/default.nix -- match by url, watchfile present
/etc/nixos/nixpkgs/pkgs/development/libraries/db4/db4-4.8.nix -- no match by name, no watchfile
/etc/nixos/nixpkgs/pkgs/development/libraries/irrlicht/default.nix -- match by name, watchfile present
/etc/nixos/nixpkgs/pkgs/development/python-modules/dbus/default.nix -- match by name, by url, watchfile present.
/etc/nixos/nixpkgs/pkgs/development/tools/misc/gengetopt/default.nix -- match by name, url, watchfile present
/etc/nixos/nixpkgs/pkgs/misc/emulators/mupen64plus/1.5.nix -- match by name, watchfile present but references broken googlecode.debian.net like many others. fixable
/etc/nixos/nixpkgs/pkgs/development/libraries/fox/default.nix -- no match by name (only old 1.6 available), watchfile present but only for 1.6.* series
/etc/nixos/nixpkgs/pkgs/development/libraries/Xaw3d/default.nix -- match by name, watchfile broken
/etc/nixos/nixpkgs/pkgs/applications/video/quvi/library.nix -- match by name, watchfile broken
/etc/nixos/nixpkgs/pkgs/tools/misc/xdaliclock/default.nix -- match by name, watchfile present
/etc/nixos/nixpkgs/pkgs/tools/archivers/zip/default.nix -- match by name, watchfile absent
/etc/nixos/nixpkgs/pkgs/os-specific/linux/apparmor/default.nix -- match by name, watchfile buggy
/etc/nixos/nixpkgs/pkgs/development/libraries/haskell/profunctor-extras/default.nix -- match by url, watchfile present
/etc/nixos/nixpkgs/pkgs/tools/bluetooth/bluedevil/default.nix -- match by name, url. watchfile present
/etc/nixos/nixpkgs/pkgs/applications/audio/qsynth/default.nix -- match by name, url. watchfile present

Of course I did a lot more tests and poking around. This is just a sample.

The sample shows a surprising number of nixpkgs-only packages (likely due to a
much lower barrier to contribution, unlike debian and similar to Arch AUR).
Apart from that, watchfiles don't seem to be very useful once name-based
mapping for hackage packages is implemented.

Watchfiles can be somewhat useful when they are present. Especially valuable 
would be the long tail, but debian has quite some bitrot in this area :(

== Generic upstream checkers for popular code repositories

Quick grepping indicates that about 10k(50%) of debian's watchfiles could 
potentially be replaced with a very simple generic script. There may be some 
quirks with version names though.

grep sf\.net -ir watchfiles/|wc -l
3067
grep kde -ir watchfiles/|wc -l
452
grep gnome -ir watchfiles/|wc -l
798
grep hacka -ir watchfiles/|wc -l
1272
grep cpan -ir watchfiles/|wc -l
3122
grep python\.org -ir watchfiles/|wc -l
624
grep googlecode -ir watchfiles/|wc -l
544
grep github -ir watchfiles/|wc -l
1064
grep freedesktop -ir watchfiles/|wc -l
195
grep ruby -ir watchfiles/|wc -l
652
grep gnu\. -ir watchfiles/|wc -l
580
grep savanna -ir watchfiles/|wc -l
166
grep launchpa -ir watchfiles/|wc -l
424
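
A hedged sketch of how the same classification could be done per watchfile
instead of per matching line (grep -r piped to wc -l counts lines, so one
file can be counted several times and under several hosters). The substrings
are the ones from the greps above; the labels and the function names are my
own invention, and substring matching has the same false positives grep has.

```python
from collections import Counter

# label -> substring, in the order of the greps above
PATTERNS = {
    "sourceforge": "sf.net",
    "kde": "kde",
    "gnome": "gnome",
    "hackage": "hacka",
    "cpan": "cpan",
    "python.org": "python.org",
    "googlecode": "googlecode",
    "github": "github",
    "freedesktop": "freedesktop",
    "ruby": "ruby",
    "gnu": "gnu.",
    "savannah": "savanna",
    "launchpad": "launchpa",
}


def classify(watch_text):
    """Return the first hoster label whose substring occurs, else None."""
    text = watch_text.lower()
    for label, needle in PATTERNS.items():
        if needle in text:
            return label
    return None


def count_hosters(watchfiles):
    """watchfiles: dict of package name -> watchfile contents."""
    return Counter(classify(text) or "other" for text in watchfiles.values())
```

Each label would then get one generic checker script, and only the "other"
bucket would need hand-written rules.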

== why debian watchfiles suck. lessons learned

* Maintainers are (lazy) people.
* Maintainers probably have other ways to watch for releases, such as RSS, MLs
  or personal contacts. debian is huge and probably can afford a nontechnical
  solution to this problem.
* Writing resilient and reliable watchfiles requires skill and understanding of
  what can break. It can be practically obtained only when you deal with a
  large sample of tarball names.
* Expecting hundreds of maintainers to acquire this knowledge independently is
  not reasonable.
* Upstream needs to actually be aware of the fact that the releases are
  watched by software and take care to not break it.
* Educating upstream is even less practical, thus watchfiles themselves are
  subject to bit rot, especially for long-tail packages.

== Screw Debian or Does Upstream Like Weird Tarball Names?

I decided to try and write a generic tarball name parser. Given a
tarball name, it tries to produce two things: a fixed prefix, which would
set the package's tarballs apart from tarballs of other packages
if they were to end up in the same location, and a version number.

How to test if it works? Why not try matching the generated version
number and the version number we have in nixpkgs...

A little bit of regexp magic, poking around messy logs, and I have about
350 packages which don't parse or don't match out of 5.3K in nixpkgs.
This includes packages which ship in unversioned tarballs, custom tarballs,
snapshots, weird choice of version format in nixpkgs, and of course some
tarballs with REALLY stupid names and even... several nix packages whose
version doesn't match the tarball they ship :)
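
For illustration, a much simplified Python sketch of such a parser and of the
check against the nixpkgs version. The regexps and function names are
assumptions for this example, not the ones from my tool, and they cover only
the plain name-version.ext shape.

```python
import re

# Archive extensions to strip before parsing (an incomplete, assumed list).
ARCHIVE_EXT = re.compile(r"\.(tar\.(gz|bz2|xz|lzma)|tgz|tbz2|zip)$", re.I)
# prefix, a '-' or '_' separator, optional 'v', dotted numbers, optional letter
NAME_VERSION = re.compile(
    r"^(?P<prefix>.+?)[-_]v?(?P<version>\d+(\.\d+)*[a-z]?)$"
)


def parse_tarball(filename):
    """Return (prefix, version), or None if the name doesn't parse."""
    stem, stripped = ARCHIVE_EXT.subn("", filename)
    if stripped == 0:
        return None  # not a known archive format
    m = NAME_VERSION.match(stem)
    if m is None:
        return None  # unversioned or weirdly named tarball
    return m.group("prefix"), m.group("version")


def version_matches(filename, nix_version):
    """Does the version parsed from the tarball equal the nixpkgs one?"""
    parsed = parse_tarball(filename)
    return parsed is not None and parsed[1] == nix_version
```

Everything that returns None or fails version_matches goes into the messy
log for a human to look at.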

== How likely is it that we can directly monitor large repositories like SF and automatically pick updates?

SF doesn't enforce a versioning approach the way cpan, hackage etc. do.
Still, SF projects are somewhat forgiving because each project ships at most
a handful of different packages.

I decided to try a more difficult task, gentoo distfiles. This is a flat dir
with source tarballs of pretty much all packages gentoo ships. If problems
with crosstalk between upstream package names exist, this is the place we're
sure to encounter them.

Unfortunately, I didn't have the time to implement all common versioning
"features" upstream likes to use, so I went for the most popular x.y.z for
the test. The tool I wrote is smart enough to handle archive format changes
and to complain if it finds new versioning "features" it doesn't support. I'm
attaching a sample output of the tool.
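
A rough Python sketch of the x.y.z check over a flat directory listing. This
is not the attached tool, just the core idea under simplifying assumptions:
nixpkgs versions are given as a name -> "x.y.z" dict, package names are
equal on both sides, and all names used here are hypothetical.

```python
import re

# name-x.y.z followed by any of a few archive extensions, so a switch from
# .tar.gz to .tar.xz between releases doesn't hide the new version
XYZ = re.compile(
    r"^(?P<name>.+)-(?P<version>\d+\.\d+\.\d+)\.(?:tar\.(?:gz|bz2|xz)|tgz|zip)$"
)


def as_tuple(version):
    """'1.10.0' -> (1, 10, 0), so comparison is numeric, not lexical."""
    return tuple(int(part) for part in version.split("."))


def newest_versions(distfiles):
    """Newest x.y.z version seen per package name in a flat file listing."""
    newest = {}
    for filename in distfiles:
        m = XYZ.match(filename)
        if m is None:
            continue  # unsupported versioning scheme, skipped in this sketch
        name, version = m.group("name"), as_tuple(m.group("version"))
        if name not in newest or version > newest[name]:
            newest[name] = version
    return newest


def find_updates(nix_versions, distfiles):
    """Packages whose newest distfile is ahead of the nixpkgs version."""
    newest = newest_versions(distfiles)
    return {
        name: ".".join(map(str, newest[name]))
        for name, current in nix_versions.items()
        if name in newest and newest[name] > as_tuple(current)
    }
```

Comparing tuples of ints rather than strings matters: as strings, "1.9.0"
would sort after "1.10.0".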

It has this bug: "zsnes/zsnes:1.51 has new version 151 according to
GentooDistUpdater" which I know how to fix, but I risk ending up using my
laptop as a pillow, so I'd rather do it tomorrow. Apart from this, random
sampling has caught crosstalk between json packages in ruby and perl. This,
again, can be handled automatically.

So, nixpkgs indeed has quite some problems with package freshness and probably
with security-related minor version bumps (or rather the lack of them),
especially since the tool right now does significantly less than is possible.
But not for much longer.

These results are very preliminary, you could even say I'm rushing. Yet, I'm
cautiously optimistic that it is indeed possible to keep packages fresh
without extra maintainer effort.

== Plan for this week

* Brush up the code.
* Add more tests.
* Add more sources of update notifications. In an ideal world, each package
  would be covered by 2-3 different sources, so even if one of them fails,
  the package doesn't start lagging behind important updates. A good combination
  would be gentoo (for breadth), arch (for freshness), direct monitoring of
  large collections. Could try raiding the numerous gentoo dev repos to get both
  breadth and freshness.
* Add CVE monitoring. I think this is a topic for another writeup.

Too bad I really want to sleep now :)

== More practical ideas for tackling the long tail

* write next version guessing code: increment a number, change alpha to beta
  etc.
* hack the upstream: test if directory listing is allowed. if so, grab next
  version candidate tarballs.
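
A sketch of the next version guessing idea in Python. The stage order
(alpha, beta, rc) and the accepted version shapes are assumptions made for
this example; real upstream schemes are messier.

```python
import re

STAGES = ["alpha", "beta", "rc"]  # assumed pre-release order
VERSION = re.compile(r"(\d+(?:\.\d+)*)(?:(-?)(alpha|beta|rc)(\d*))?")


def next_version_candidates(version):
    """Guess plausible successors of a version string, likeliest first."""
    m = VERSION.fullmatch(version)
    if m is None:
        return []  # shape not supported by this sketch
    base, sep, stage, num = m.groups()
    candidates = []
    if stage is not None:
        if num:
            # same stage, next number: beta1 -> beta2
            candidates.append(f"{base}{sep}{stage}{int(num) + 1}")
        i = STAGES.index(stage)
        if i + 1 < len(STAGES):
            # next stage: alpha -> beta, beta -> rc
            candidates.append(f"{base}{sep}{STAGES[i + 1]}{'1' if num else ''}")
        candidates.append(base)  # the final release
        return candidates
    nums = [int(n) for n in base.split(".")]
    # bump each component in turn, last one first, zeroing what follows
    for i in reversed(range(len(nums))):
        bumped = nums[:i] + [nums[i] + 1] + [0] * (len(nums) - i - 1)
        candidates.append(".".join(map(str, bumped)))
    return candidates
```

Each candidate would then be substituted into the current tarball url and
probed, which only makes sense for upstreams with predictable url layouts.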

== wild stuff to try. brain dump ahead!

* google for pages linking to the current tarball and see if anything 
suspiciously new appears.
* subscribe to all mailing lists and watch for release announcements
* use ohloh to find upstream vcs repositories and watch tagging/branching 
activity

ZZZzzzzz......
-------------- next part --------------
A non-text attachment was scrubbed...
Name: log.zip
Type: application/zip
Size: 18704 bytes
Desc: not available
Url : http://lists.science.uu.nl/pipermail/nix-dev/attachments/20130702/c94f1f1c/attachment-0001.zip 

