Why the tinderbox is a non-distributed effort

In my previous post about a run control for the tinderbox, Pavel suggested that I use Gearman to handle the run control section; as I explained to him earlier this evening, this is probably too much right now. The basic run control I need is pretty simple (I could even keep using xargs, if Portage gave me hooks for merge succeeded/failed); the fancy stuff, the network interface, is sugar that Gearman wouldn’t help me with as far as I can see.

What I want to talk about, instead, is why I don’t think the tinderbox should be a distributed effort, as many people suggest from time to time to reduce the load on my machine. Unfortunately, to work well in a distributed fashion, the task has to be one that can feasibly be split up, a “divide et impera” kind of task.

The whole point of the tinderbox for me is to verify the interactions between packages; it’s meant to find which packages break when they are used together, among other things, and that kind of check needs all the packages to be present at the same time, which precludes the use of a distributed tinderbox.

If you don’t think this is helpful, I can tell you quite a few interesting things about automagic deps, but since I have already written about them from time to time I’ll skip over them for now.

The kind of effort that can work with the distributed approach is the one taken by Patrick, with cleaning-up tinderboxes: after each merge the dependencies get removed, and a library of binary packages is kept up to date to avoid building them multiple times a day. This obviously makes it possible to test multiple package chains at once on multiple systems, but it also adds some further overhead (as multiple boxes will have to rebuild the same binary packages if you don’t share them around).

On the other hand, I think I have found a use for Gearman (an ebuild for which, mostly contributed by Pavel, is in my overlay; we’re working on polishing it): I already mused some time ago about checking the packages’ sources, looking for at least those things that can be found easily via scripts (like the over-canonicalisation that I have already documented well). This is a task where divide-et-impera is very likely a working strategy. Extracting and analysing the sources is an I/O-bound task, not a CPU-bound one, so Yamato’s single-box approach is definitely a losing one there.
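To give an idea of what “found easily via scripts” can mean, here is a minimal sketch of a source check that only reads files and never builds anything. Everything in it is an assumption made for illustration: the `scan_tree` and `demo` names, the choice of looking only at `configure.ac` and `configure.in`, and the use of `AC_CANONICAL_TARGET` as the example pattern. A real check would need a much more careful rule than a single regular expression.

```python
import re
import tempfile
from pathlib import Path

# Assumption: one illustrative pattern; real checks would be more refined.
PATTERNS = {
    "canonical-target": re.compile(r"^\s*AC_CANONICAL_TARGET\b", re.MULTILINE),
}
# Assumption: only autoconf input files are worth reading for this check.
CANDIDATES = ("configure.ac", "configure.in")


def scan_tree(root):
    """Return (relative path, check name) pairs for every match under root."""
    root = Path(root)
    hits = []
    for path in sorted(root.rglob("*")):
        if path.name not in CANDIDATES or not path.is_file():
            continue
        # Source tarballs come in all encodings; do not die on odd bytes.
        text = path.read_text(errors="replace")
        for name, pattern in PATTERNS.items():
            if pattern.search(text):
                hits.append((path.relative_to(root).as_posix(), name))
    return hits


def demo():
    """Build two tiny fake source trees and scan them."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for pkg, macro in (("pkg-a", "AC_CANONICAL_TARGET"),
                           ("pkg-b", "AC_CANONICAL_HOST")):
            (root / pkg).mkdir()
            (root / pkg / "configure.ac").write_text(
                f"AC_INIT([{pkg}], [1])\n{macro}\n")
        return scan_tree(root)


if __name__ == "__main__":
    for path, check in demo():
        print(f"{check}: {path}")
```

The work per package is nothing more than unpacking and reading files, which is why the disks rather than the CPU end up being the limit.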

For a single box to have enough I/O speed to handle so many packages, you end up resorting to very high-end hardware (disks and controllers), which is very expensive. Way too expensive. On the other hand, with multiple boxes, even cheap or virtual ones (distributed among different real boxes, of course), working independently but sharing a single queue with proper coordination, you can probably beat that performance for less than half the price. Now, for this to work there are many prerequisites, a lot of which I’m afraid I won’t be able to tackle anytime soon.
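The “independent workers dividing one queue” idea can be sketched in a few lines. This is not Gearman: it uses threads from the Python standard library as stand-ins for separate boxes, and the `analyse`, `worker` and `run` names, the `boxN` labels and the package names are all made up for the example. It only shows the coordination pattern, where each worker takes the next job as soon as it is free.

```python
import queue
import threading


def analyse(package):
    """Hypothetical placeholder for unpacking and scanning one package."""
    return len(package)


def worker(name, jobs, results, lock):
    """Pull packages from the shared queue until it is empty."""
    while True:
        try:
            package = jobs.get_nowait()
        except queue.Empty:
            return
        outcome = analyse(package)
        with lock:
            # Record which worker handled the package and what it found.
            results[package] = (name, outcome)


def run(packages, workers=3):
    """Process every package exactly once across the given number of workers."""
    jobs = queue.Queue()
    for package in packages:
        jobs.put(package)
    results = {}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(f"box{i}", jobs, results, lock))
        for i in range(workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    # Hypothetical package names, for illustration only.
    done = run(["app-misc/foo", "dev-libs/bar", "sys-apps/baz"], workers=2)
    for package, (box, _) in sorted(done.items()):
        print(f"{package} handled by {box}")
```

Since no worker is assigned a fixed share in advance, a slow box simply ends up taking fewer packages, which is what makes cheap and uneven hardware usable.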

First of all, I need to understand well how Gearman works, since I have only skimmed through it up to now. Then I need to find the hardware: if I can turn my garage into a machine room and connect it to my network, that might be a good place to start (I can easily use low-power, old-style machines; I still have a few around that I haven’t found space for lately). I also remember some users offering chroots on their boxes before; this might turn out pretty useful: if they can make virtual machines, or containers, they can also work on the analysis, in a totally distributed fashion.

The third problem is somewhat the hardest but also the most interesting: finding more analyses to run on the sources without building anything. Thankfully, I have got the book (Secure Programming with Static Analysis) to help me cope with that task.

Wish me luck, and especially wish me to find time to work on this.

10 thoughts on “Why the tinderbox is a non-distributed effort”

  1. Hi Diego,

     I’m not sure that I am convinced by your argument for why this can’t be done in a distributed fashion. So here are two questions/comments on which it would be great to hear your thoughts (if you have time).

     1) Your underlying assumption in why Patrick’s approach is never going to work is that there is no way to reliably share binaries between systems. So the solution would simply be to find a way around this and share binaries, or am I missing something here? Basically, with the increase in hard disk space and internet connection speed I don’t see a fundamental problem, as many of your users are on high-spec machines with fast connections and tremendous amounts of HD space. Remember that your tinderbox is not that different from what I have standing here, for example.

     *) To counter one argument: the normal objection against sharing binaries is safety. Yet this should be less of an issue if the purpose is purely building inside a chroot/container…

     2) Your current approach is a completely blind search. In the old stage 1, when building things like GCC you had to build it twice and see if the checksums matched (I think). Currently you have a massive number of people building packages, but the outcomes of the builds are (I think) never compared. Wouldn’t an approach in which users automatically submit checksums for the packages they build (with given USE flags, GCC flags) make it possible to statistically determine which packages have problems, simply by monitoring whether or not the checksums differ, and hence allow a much more directed search, as it should not be too difficult to see the difference in the packages that a user has installed?

  2. I never said that I don’t think Patrick’s approach is going to work; what I have said (https://blog.flameeyes.eu/2…) is that we aim at different results. His method ensures that *missing* dependencies can be identified and thus taken care of. My method tries to identify automagic dependencies (although I admit it’s still a bit rough, since it doesn’t run any automated verification at link time).

     For my approach to work, instead, I need all the packages built in the same exact box with *all the rest installed*. This way I can check whether there are unexpected interactions between them.

     As for the checksums: even with the same USE flags and the same compiler flags, the result is going to be different most of the time anyway. Quite a few packages make use of “build dates”, which means that any rebuild will be different.

  3. So essentially you are doing the same, with the difference being that Patrick has the minimum amount of packages pre-installed, while you have as much as possible installed to find collisions between them.

     Basically, I have difficulty seeing the argument for why they need to be built on the same box, assuming that you can get the binaries to extract into the chroot for the same set of USE flags, compiler flags, etc. (and probably for the same snapshot of the tree as well). After extracting these into the chroot and building the thing you are interested in, you should still get collisions for the packages you are building on top of the preinstalled packages. My feeling is that you can save time in this fashion, as most of the building is distributed and you can target corner cases much more easily.

     *) With this, the I/O bottleneck does not appear so severe if all of the chroot can be kept in memory on the final machine running these checks, which seems plausible for a server board filled to the max with memory.

     **) The problem with checksums and “build dates”: isn’t that simple to fix, by just setting the date in the chroot used to build the binaries to a predefined date, for those willing to contribute?

  4. No, the problem is not *collision* of packages; that’s easy. You can also check that just by recording all the packages’ files and comparing them.

     The problem is with *interaction* of packages; and that assumes that a package is built in an environment *where all the rest are present*, and thus ./configure is executed with almost everything present for sure.

     And that’s not even the only difference between my approach and Patrick’s. There is no “wrong” approach there; they are different and they need to be kept different. And I don’t think that a distributed tinderbox would be going anywhere near what I’m doing right now.

  5. Diego,

     I understand what you are doing, but we probably just don’t fully understand each other. I agree that the final step of executing ./configure has to happen on a single (powerful) machine and is hence difficult to distribute.

     All I am saying is that I think the building of the packages that you need to extract into your chroot can still be distributed. Basically, you extract all except the one you are going to build into the chroot and then run ./configure on top of this to find the symbol collisions, etc. that you are looking for.

     Note that my underlying assumption is that large sets of packages are relatively well tested, as many people have them installed or build them regularly. So my feeling is that looking for the corner cases is probably the most important bit to do. But I might be wrong here.

  6. The point is that all the packages end up having to be built at least once in that situation, even those that are considered “well tested” before. Given that, it really doesn’t make sense for the others to be built outside, as they’d still be rebuilt within the tinderbox. And yes, sometimes I found issues even with “widely used packages”.

     It’s not just corner cases I’m looking at, unfortunately. I wish it was that simple.

  7. After this I’ll stop… The problem lies in the scale of the problem you are setting.

     As an analogy: assume you have packages A to Z, you build Z last, and it only fails if all packages A to Y are installed. The number of permutations you have to check in this case to hit the problem is massive, as it could be up to 26! (= 4E26) runs of your tinderbox if you are very unlucky…

     The bottom line is that it appears unlikely you will ever be able to check all these potential permutations with a single machine, seeing the time a single run of the tinderbox takes. And the problem clearly becomes worse the more packages you are trying to test…

     Hence my feeling is that it is important to consider how this could be distributed, to be able to address the underlying challenge. Simply because, by distributing the binaries, not all of the machines have to build all of the packages, reducing the time for these tests from N*(number of machines) to 2N, assuming that each of the machines could itself run one of the ./configure tests with all other packages pre-installed.

  8. The tinderbox processing problem is relatively unique. It’s not like trying to distribute processing for protein folding or SETI calculations. Nor is it even like attempting to use distcc. In the case of work like folding, SETI and distcc, each unit of processing is essentially self-contained.

     The tinderbox design goal isn’t really a test of each package but rather a test of how well each package ‘fits’ into a complete environment. Tinderbox is a stress test of a package’s external connections and relationships, i.e. the goal of tinderbox is to check for a worst case {maximum number of simultaneous packages} in a single image.

     Common distributed processing works hard to divide processing into independent, manageable chunks. The implication is that each chunk has no external relationships or connections. In reality, if you do much work with distcc, you’ll find that you get the best results when all the PCs (or containers) used are configured identically. Distcc does take into account some of these issues, but nowhere near enough of them to support a tinderbox-type load.

     I suggest there are two possible approaches.

     1) High Performance Cluster (HPC). Essentially, HPC ‘forces’ identical computational units under a master unit. You can look at this as a virtual single image with the workload spread across multiple computational units.

     2) Containers. Here, Diego would need to build a container image that can be loaded onto multiple PCs prior to each run and then use the usual tactics to distribute load. The key here would be to have each container ‘point’ to common network shares where final results are held, i.e. the directories where the resulting libraries and includes and the source configuration files reside. You want to be sure that all the containers look to the same controlling configuration and that all the results generate and use the same dependencies.

     Remember, the goal of tinderbox is to stress test a package’s _external_ relationships. On that score, everyone should keep in mind that Diego has already imposed important restraints on tinderbox in order to at least make believe the project is doable.

     The most important restraint is to simply run each package with as many non-conflicting USE flags turned on as possible, i.e. Diego is _not_ testing all possible combinations of USE flags! By definition, this means Diego will _not_ find all possible external relationship problems.

     The theory appears to be that by limiting build time to running with all possible USE flags on, Diego should be able to observe the most common ebuild failures as well as gain an understanding of the root cause of said failures. That knowledge can then be passed around laterally to the Gentoo community as well as to upstream where needed.

     When doing any kind of programming or project, you have to start somewhere. The limitation above is an excellent one for managing the size of this project and will most likely result in the largest immediate improvement for everyone’s benefit.

  9. I look at it like turning up the feedback. It can’t hurt to turn it on for more machine types. Are you imposing architectural limits on the tinderbox? i686 or such? I also like the idea of testing static source.

     A ‘shared’ approach might open up others. Surely x86_64 at the minimum. Users of any shared solution would want to be able to allocate power usage as well as storage and processing power, as well as set times to run.

  10. I never said that it shouldn’t be distributed in the sense of having multiple, independent tinderboxes working. I actually hope that will soon happen. Mark (Halcy0n) is working on setting one up for PPC64 on his box.

      I guess I should start keeping a git repository with all the scripts to track down the changes and stuff like that… maybe later on today.
