I noted in my previous post that I recently built a 12TB storage server; half a terabyte has already been reserved for Gentoo’s distfiles, both as a local mirror to update my boxes without having to re-download everything, and because the tinderboxes require a lot of distfiles by definition (since they build the whole tree).
The original way I used to download all the files was to simply pass the whole list of files to emerge -Of, running in parallel with the tinderbox process. Unfortunately this has proven to be of limited reliability. In particular, due to REQUIRED_USE there are situations where a newly-introduced requirement will cause the fetch to misbehave, which slows down tinderboxing of new packages while the new files are fetched. Plus, if the tinderbox masked all the versions of a particular package (which it can do when no version of said package builds in its environment), and I passed it to emerge -f, it wouldn’t fetch anything at all. You can’t really run a single emerge -f command either, as the command-line argument limit is hit much sooner, so xargs splits it into multiple, serial calls. And as a final straw, whenever the tinderbox has to fall back to an older version of a package, it’ll have to find that distfile as well, which might not be in its cache already.
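To make the splitting concrete, here is a minimal sketch of how a long argument list ends up as several separate command lines. It is not xargs’ actual algorithm; the character limit and the package atoms are made-up placeholders, chosen only so that the split is visible.

```python
def split_into_batches(prefix, args, max_chars):
    """Group args into command lines no longer than max_chars characters.

    Each word is counted as its length plus one separator, which is a
    simplification of what a real shell or xargs accounts for.
    """
    base = sum(len(word) + 1 for word in prefix)
    batches, current, length = [], [], base
    for arg in args:
        cost = len(arg) + 1
        # Start a new command line when adding this argument would overflow.
        if current and length + cost > max_chars:
            batches.append(prefix + current)
            current, length = [], base
        current.append(arg)
        length += cost
    if current:
        batches.append(prefix + current)
    return batches


# Placeholder atoms, not real packages.
packages = ["app-misc/alpha", "app-misc/beta", "app-misc/gamma"]
for command in split_into_batches(["emerge", "-Of"], packages, max_chars=40):
    print(" ".join(command))
```

Each printed line stands for one separate, serial emerge call, so a problem with one atom only affects the batch it landed in, but the whole list is never resolved in a single run.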
To solve all these issues and make good use of the new box that stores the data, Zac gave me the set of infra scripts that are used to manage distfiles; in particular the mirror-dist script, written by Brian a long time ago, is the one that takes care of fetching the packages from the upstream sources and adding them to the master mirror. Looking at its output I’m... honestly scared.
Let’s begin with the whole kernel.org issue: you probably already know that their master server was compromised and all the attached services have been disabled, including the network of mirrors for both kernel- and non-kernel-related software (among others, Linux-PAM is also hosted at kernel.org). Well, this means that all the upstream fetch URIs for those packages are unusable. Due to the nature of the mirror-dist script, it was obviously not going to fetch the packages from the Gentoo mirrors, so it was unable to fetch any package released on kernel.org until I asked Fabio to hack around it (I’m no good with Python) and make it try the Gentoo mirrors first. Lovely.
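The mirrors-first idea can be sketched in a few lines. This is not the actual change made to mirror-dist, which I have not reproduced here; the function names, the example hostnames, and the assumption that a mirror serves files under a distfiles/ directory are all illustrative.

```python
def fetch_candidates(filename, upstream_uris, mirrors):
    """Return the URLs to try, in order: every mirror first, then upstream.

    Assumes (for illustration) that each mirror exposes the file as
    <mirror>/distfiles/<filename>.
    """
    mirror_urls = [f"{m.rstrip('/')}/distfiles/{filename}" for m in mirrors]
    return mirror_urls + list(upstream_uris)


def fetch_with_fallback(candidates, fetch):
    """Call fetch(url) on each candidate; return the first URL that works.

    fetch is any callable that raises OSError on failure. Returns None
    when every candidate fails.
    """
    for url in candidates:
        try:
            fetch(url)
            return url
        except OSError:
            continue
    return None
```

With this ordering, a file whose only upstream URI points at an unreachable server is still found, as long as at least one mirror already carries it.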
There is a second condition outside of Gentoo’s control that is causing headaches for this setup, and probably for our mirror admins as well. It hasn’t gotten as much coverage as the whole kernel.org issue, but the FSF found themselves not in compliance with the GPL with respect to binutils, as some intermediate output was provided in the tarballs without the original sources used to generate it. So what did they decide to do? Revoke all the tarballs and replace them with a new release with new version numbers? No. Reissue them with an appended “a” noting it? No. They decided to simply rewrite all of them. Same filename, same URL, but different content. Congratulations on the headache you’re causing us!
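Why is a silently rewritten tarball such a problem? Gentoo’s Manifest files record the size and checksums of every distfile, so a file with the same name but different content no longer verifies. The sketch below illustrates that kind of check; it is not Portage’s implementation, and the function names are my own.

```python
import hashlib
import os


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large tarballs need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded(path, recorded_size, recorded_sha256):
    """True only if both the size and the SHA-256 match the recorded values."""
    return (
        os.path.getsize(path) == recorded_size
        and file_digest(path) == recorded_sha256
    )
```

A re-rolled tarball fails this check even if it happens to have the same size, which means every copy already sitting on a mirror or in a local cache is now in conflict with whatever upstream serves.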
But kernel.org projects and GNU packages are definitely not the only packages that have trouble with fetching; a number of upstream repositories no longer allow packages to be downloaded, and this causes major headaches if you don’t want to rely on the Gentoo mirrors’ network.
It has been proposed many times before to fix the SRC_URI variable for the packages that point to unfetchable sources. I even opened a wishlist bug asking for a check (with HTTP’s HEAD method) of whether the file is available upstream or not (and the same goes for the homepage). Unfortunately I lack the Python skills to implement this and nobody else seems to be interested in it. I would have suggested this for GSoC, but... let’s not go there, please.
But if you have the skills, and the time, having repoman check for the availability of the files before committing would be a godsend: it would have, for instance, prevented committing one system and two system-related packages in the past months without their respective patchsets. Well, if we were to also ban direct access to mirror://gentoo/ of course.
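For anyone tempted to pick this up, the HEAD-based availability check could start from something like the sketch below. It uses only Python’s standard urllib and is not repoman code; the function names, the timeout, and the injectable opener parameter (there to make the function testable without network access) are my own assumptions.

```python
import urllib.error
import urllib.request


def is_fetchable(url, timeout=10, opener=urllib.request.urlopen):
    """Return True if a HEAD request for url succeeds with a 2xx status."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with opener(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # HTTP errors, unreachable hosts, unknown schemes and malformed
        # URLs all count as "not fetchable" in this sketch.
        return False


def unfetchable(uris, check=is_fetchable):
    """Return the subset of uris for which the availability check fails."""
    return [uri for uri in uris if not check(uri)]
```

Since urllib does not know the mirror:// scheme, a file listed only as mirror://gentoo/... would be reported as unfetchable by this sketch, which is the behaviour you would want if direct use of the Gentoo mirrors were banned. HEAD is an HTTP method, so FTP sources would need a different kind of probe.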