We need Free Software Co-operatives, but we probably won’t get any

The recent GitHub craze got a number of Free Software fundamentalists to hurry away from GitHub towards other hosting solutions.

Whether it was GitLab (a fairly natural choice given the nature of the two services), BitBucket, or SourceForge (which is trying to rebuild its reputation as a Free Software-friendly hosting company), there are a number of SaaS providers to choose from.

At the same time, a number of projects have been boasting (maybe a bit too smugly, in my opinion) that they self-host their own GitLab or similar software, and suggesting that other projects do the same to be “really free”.

A lot of the discourse appears to be missing nuance on the compromises involved in using SaaS hosting providers, self-hosting for communities, and self-hosting for single projects, so I thought I would gather my thoughts around this in one single post.

First of all, you probably remember my thoughts on self-hosting in general. Any solution that involves self-hosting will require a significant amount of ongoing work. You need to make sure your services keep working, and stay safe and secure. Particularly for FLOSS source code hosting, it’s of primary importance that the integrity and safety of the source code is maintained.

As I already said in the previous post, this style of hosting works well for projects that have a community, in which one or more dedicated people can look after the services. And in particular for bigger communities, such as KDE, GNOME, FreeDesktop, and so on, this is a very effective way to keep stewardship of code and community.

But for one-person projects, such as unpaper or glucometerutils, self-hosting would be quite bad. Even for xine, with a single person maintaining just the site and Bugzilla, it got fairly bad. I’m trying to convince the remaining active maintainers to migrate it to VideoLAN, which is now probably the biggest Free Software multimedia project and community.

This is not a new problem. Indeed, before people rushed to GitHub (or Gitorious), they rushed to other services that provided similar integrated environments. When I became a FLOSS developer, the biggest of them was SourceForge — which, as I noted earlier, was recently bought by a company trying to rebuild its reputation after a significant loss of trust. These environments don’t only include SCM services, but also issue (bug) trackers, contact email, and so on and so forth.

Using one of these services is always a compromise: not only do they require an account on each service to be able to interact with them, but they also have a level of lock-in, simply because of the nature of URLs. Indeed, as I wrote last year, going through my old blog posts to identify those referencing dead links reminded me of just how many project hosting services shut down, sometimes dragging along (Berlios) and sometimes abruptly (RubyForge).

This is a problem that does not only involve services provided by for-profit companies. Sunsite, RubyForge and Berlios didn’t really have companies behind them, and the last one is probably one of the closest things to a Free Software co-operative that I’ve seen outside of the FSF and friends.

There is of course Savannah, FSF’s own Forge-lookalike system. Unfortunately, for one reason or another, it has always lagged behind the featureset (particularly around security) of other project management SaaS. My personal guess is that this is due to the political nature of hosting any project on FSF’s infrastructure, even outside of the GNU project.

So what we need is a politically-neutral, project-agnostic hosting platform that is a co-operative effort. Unfortunately, I don’t see that happening any time soon. The main problem is that project hosting is expensive, whether you use dedicated servers or cloud providers. And it takes full-time people working as system administrators to keep it running smoothly and securely. You need professionals, too — or you may end up like lkml.org, which goes down when its one maintainer is on vacation and something happens.

While there are projects that receive enough donations to sustain these costs (see KDE, GNOME, VideoLAN), I’m skeptical that an unfocused co-operative would be able to take care of this. Particularly if it does not restrict the creation of new projects and repositories, as that requires particular attention to abuse, and good guidelines for which content is welcome and which isn’t.

If you think that’s an easy task, consider that even SourceForge, with a review process that used to take a significant amount of time, managed to let joke projects use its service and run on its credentials.

A few years ago, I would have said that the SFLC, SFC and SPI would be the right actors to set up something like this. Nowadays? Given their infighting, I don’t expect them to be of much use.

Two words about my personal policy on GitHub

I was not planning on posting on the blog until next week, trying to stick to a weekly schedule, but today’s announcement of Microsoft acquiring GitHub is forcing my hand a bit.

So, Microsoft is acquiring GitHub, and a number of Open Source developers are losing their minds, in all possible ways. A significant proportion of the comments I have seen on my social media sound like doomsday prophecies, as if this spells the end of GitHub, because Microsoft is going to ruin it all for them.

Myself, I think that if it spells the end of anything, it is the end of the one-stop shop for working on any project out there, not because of anything Microsoft did or is going to do, but because a number of developers are now leaving the platform in protest (protest of what? One company buying another?).

Most likely, it’ll be the fundamentalists who will move their projects away from GitHub. And depending on what they decide to do with their projects, it might not even show on anybody’s radar. A lot of people are pushing for GitLab, which is both an open-core self-hosted platform and a PaaS offering.

That is not bad in itself. Self-hosted GitLab instances already exist for VideoLAN and GNOME. Big, strong communities are, in my opinion, in the perfect position to dedicate people to supporting core infrastructure that makes open source software development easier, particularly because it’s easier for a community of dozens, if not hundreds, of people to find dedicated people to work on it. For one-person projects, that’s overhead, distracting, and destructive as well: fragmenting into micro-instances will cause pain when forking projects — and at the same time, allowing any user who just registered to fork the code on any instance is prone to abuse and a recipe for disaster…

But this is all going to be a topic for another time. Let me try to go back to my personal opinions on the matter (to be perfectly clear that these are not the opinions of my employer and yadda yadda).

As of today, what we know is that Microsoft acquired GitHub, and they are putting Nat Friedman of Xamarin fame (the company that stood behind the Mono project after Novell) in charge of it. This choice makes me particularly optimistic about the future, because Nat’s a good guy and I have the utmost respect for him.

This means I have no intention of moving any of my public repositories away from GitHub, unless doing so would bring a substantial advantage. For instance, if there were a strong community built around medical device software, I would consider moving glucometerutils. But this is not the case right now.

And because I still root most of my projects at my own domain, if I did move them, the canonical URLs would still be valid. This is a scheme I devised after getting tired of fixing up links to wherever unieject ended up.

Microsoft has not done anything wrong with GitHub yet. I will give them the benefit of the doubt, and not rush out of the door. It would and will be different if they were to change their policies.

Rob’s point is valid, and it would be a disgrace if various governments were to push Microsoft into a corner, requiring it to purge content that the smaller, independent GitHub would have left alone. But unless that happens, we’re debating hypotheticals on the same level as “If I were elected supreme leader of Italy”.

So, as of today, 2018-06-04, I have no intention of moving any of my repositories to other services. I’ll also reply with a link to this post, and no accompanying comment, to anyone who suggests I should do so without any benefit for my projects.

Glucometerutils News: Continuous Integration, Dependencies, and Better Code

You may remember glucometerutils, my project of an open source Python tool to download glucometer data from meters that do not provide Linux support (as in, any of them).

While the tool started a few years ago out of my personal need, this year there has been a bigger push than before, with more contributors trying the tool out, finding problems and fixing bugs. For my part, I managed a few fits of productivity on the tool, particularly this past week at 34C3, when I decided it was due time to make the package shine a bit more.

So let’s take a look at the tool’s more recent developments.

First of all, I decided to bump the Python version requirement to Python 3.4 (previously, it was Python 3.2). The reason is that this gives access to the mock module for testing, and the enum module to write actual semantically-defined constants. While both of these could be provided as dependencies to support the older versions, I can’t think of any good reason not to upgrade from 3.2 to 3.4, and thus no need to support those versions. I mean, even Debian Stable has Python 3.5.
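
As a sketch of what the enum module buys here (the names below are hypothetical, not the actual glucometerutils constants):

```python
import enum


# Hypothetical example: meal flags as a semantic enum instead of bare
# strings or magic ints scattered through the drivers.
class Meal(enum.Enum):
    NONE = 'none'
    BEFORE = 'before'
    AFTER = 'after'


reading_meal = Meal.BEFORE

# Enum members compare by identity, and never silently mix with the
# plain values they happen to wrap.
assert reading_meal is Meal.BEFORE
assert reading_meal != 'before'
assert Meal('after') is Meal.AFTER  # safe lookup from a wire value
```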

And while talking about Python versions, Hector pointed me at construct, which looked right away like an awesome library for dealing with binary data structures. It turned out to be a bit rougher around the edges than I expected from the docs, particularly because the docs do not contain enough information to actually use it with proper dynamic objects, but it does make a huge difference compared to dealing with bytestrings manually. I had already started using it in Leipzig to parse the basic frames of the FreeStyle protocol, and then proceeded to rewrite the other binary-based protocols, between the airports and home.

This may sound like a minor detail, but I actually found it made a huge difference, as the library already provides proper support for validating expected constants, as well as dealing with checksums — although in some cases it’s a bit heavier-handed than I expected. Also, the library supports defining bit structures too, which considerably simplified the OneTouch Ultra Easy driver, which was building its own poor developer’s version of the same idea. After rewriting the two binary LifeScan drivers I have (for the OneTouch Verio 2015 and Select Plus, and the Ultra Easy/Ultra Mini), the similarities between the two protocols are much easier to spot. Indeed, after porting the second driver, I also decided to refactor the first a bit, to make the two look more alike.

This is going to be useful soon again, because two people have asked for support for the OneTouch Verio IQ (which, despite the name, shares nothing with the normal Verio — this one uses an on-board cp210x USB-to-serial adapter), and I somehow expect that, while not compatible, the protocol is likely to be similar to the other two. I found one cheap on Amazon Germany and ordered it — it should be the easiest one in my backlog to reverse engineer, because it uses a driver I already know is easy to sniff (unlike other serial adapters that use strange framing; I’m looking at you, FT232RL!), and the protocol is likely not to stray too far from the other LifeScan protocols, even though it’s not directly compatible.

I have also spent some time on the tests that are currently present. Unfortunately they don’t currently cover much of anything besides some common internal libraries. I have, though, decided to improve the situation, if a bit slowly. First of all, I picked up a few of the recommendations I give my peers at work during Python reviews, and started using the parameterized module that comes from Abseil, which was recently released as open source by Google. This reduces tedious repetition when building similar tests to exercise different paths in the code. Then, I’m very thankful to Muhammad for setting up Travis for me, as the tests can now show breakage, where there is any coverage at all. I’ll try to write more tests this month to make sure to exercise more drivers.

I’ve also managed to make the project’s setup.py more useful. Indeed, it now correctly lists dependencies for most of the drivers as extras, and I may even be ready to make a first release on PyPI, now that I’ve tested most of the devices I have at home and they all work. Unfortunately this is currently partly blocked on Python SCSI not having a release on PyPI itself. I’ll get back to that, possibly next month at this point. For now you can install it from GitHub and it should all work fine.

As for the future, there are two in-progress pull requests/branches from contributors to add support for graphing the results, one using rrdtool and one using gnuplot. These are particularly interesting for users of the FreeStyle Libre (which is the only CGM I have a driver for), and someone expressed interest in adding a Prometheus export, because why not — this is not as silly as it may sound, as the output graphs of the Libre’s own software look more like my work monitoring graphs than the usual glucometer graphs. Myself, I am now toying with the idea of mimicking the HTML output that the Accu-Check Mobile generates on its own. This would be the easiest thing to just send by email to a doctor, and could probably be used as a basis to extend things further, integrating the other graph outputs.

So onwards and upwards, the tooling will continue being built on. And I’ll do my best to make sure that Linux users who have a need to download their meters’ readings have at least some tooling that they can use, and that does not require setting up unsafe MongoDB instances on cloud providers.

Diabetes control and its tech, take 4: glucometer utilities

This is one of the posts I lost due to the blog problems with draft autosaving. Please bear with the possibly missing pieces that I might be forgetting.

In the previous post on the subject I pointed out that, thanks to a post in a forum, I was able to find out how to talk to the OneTouch Ultra 2 glucometer I have (two of them, actually) — the documentation assumes you’re using HyperTerminal on Windows and thus does not work when using either picocom or PySerial.

Since I had the documentation from LifeScan for the protocol, starting to write a utility to access the device was the obvious next step. I’ve published what I have right now in a GitHub repository, and I’m going to write a bit more on it today, after a month of procrastination and other tasks.

While writing the tool, I found another issue with the documentation: every single line returned by the glucometer ends with a four-digit (hex) checksum, but the documentation does not describe how the checksum is calculated. By comparing some strings with checksums I knew, I originally guessed it might be what I found called “CRC16-Syck” — unfortunately, that also meant that the only library implementing it was a GPL-3 one, which clashed with my idea of a loose copyleft license for the tools.

But after introducing the checksum verification, I found out that the checksums did not really match. So after more looking around on Google and in forums, I was told that the checksum is a 16-bit variation of Fletcher’s checksum, calculated in 32-bit but dropping the upper half… and indeed it then matched; but when looking at the code, I found out that “32-bit Fletcher reduced to 16-bit” is actually “a modulo-16-bit sum of all the bytes”. It’s the most stupidly simple checksum.
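
The checksum described above is trivial to implement; a sketch (the function name is mine, not the tool's):

```python
def lifescan_checksum(data: bytes) -> int:
    """Sum all the bytes, keeping only the low 16 bits: the
    "32-bit Fletcher reduced to 16-bit" described above turns out
    to be nothing more than this modulo-16-bit sum."""
    return sum(data) & 0xFFFF


assert lifescan_checksum(b"") == 0
assert lifescan_checksum(b"\x01\x02\x03") == 6
```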

Interestingly enough, the newer glucometers from LifeScan use a completely different protocol: it’s binary-based and uses a standard CRC16 implementation.

I’ve been doing my best to design the utility so that there is a workable library as well as a utility (so that a graphical interface can be built on top of it), and at the same time I tried to make it possible to have multiple “drivers” that implement access to the glucometer commands. The idea is that this way, if somebody knows the protocol for other devices, they can implement support without rewriting, or worse duplicating, the tool. So if you own a glucometer and want to add support for it to my tool, feel free to fork the repository on GitHub and submit a merge request with the driver.
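
The driver interface could be sketched along these lines (a hypothetical outline, not the tool's actual API; class and method names are mine):

```python
import abc


class GlucometerDriver(abc.ABC):
    """Minimal sketch of a per-device driver: each driver translates
    its device's protocol into these common operations, so the
    command-line tool (or a future GUI) never sees the wire format."""

    def __init__(self, device_path):
        self.device_path = device_path

    @abc.abstractmethod
    def get_serial_number(self):
        """Return the device's serial number as a string."""

    @abc.abstractmethod
    def get_readings(self):
        """Yield (timestamp, value) tuples for each stored reading."""


class FakeMeter(GlucometerDriver):
    """Toy driver showing how support for a new device plugs in."""

    def get_serial_number(self):
        return "FAKE0001"

    def get_readings(self):
        yield ("2013-05-01 08:00", 95)


meter = FakeMeter("/dev/ttyUSB0")
assert meter.get_serial_number() == "FAKE0001"
```

A new device then only needs a new subclass; the shared tool code stays untouched.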

A final note I want to leave about possible Android support. I have been keeping in mind the option of writing an Android app to be able to dump the readings on the go. Hopefully it’s still possible to build Android apps for the market in Python, but I’m not sure about it. At the same time, there is a more important problem: even though I could connect my phone (Nexus 4) to the glucometer with a USB OTG cable and the cable LifeScan sent me, that cable has a PL2303 in it, and I doubt that most Android devices would support it anyway.

The other alternative I can think of is to find a userland implementation of the PL2303 protocol that lets me access it as a serial port without the need for a kernel driver. If somebody knows of any software already written to solve this problem, I’ll be happy to hear about it.

ModSecurity and my ruleset, a release

After the recent Typo update I had some trouble with Akismet not working properly to mark comments as spam, at least the very few spam comments that could get past my ModSecurity Ruleset — so I set off to deal with it a couple of days ago to find out why.

Well, to be honest, I didn’t really want to focus on the why at first. The first thing I found while looking at the way Typo uses Akismet is that it still used a bundled, hacked, ancient Akismet library. Given that the API key I got was valid, I jumped to the conclusion, right or wrong, that the code was simply using an ancient API that had been dismissed, and decided to look around for a newer Akismet version; lo and behold, a 1.0.0 gem was released not many months ago.

After fiddling with it a bit, the new Akismet library worked like a charm, and spam comments passing through ModSecurity were again marked as such. A pull request and its comments later, I got a perfectly working Typo which marks comments as spam as good as before, with one less library bundled within it (and I also got the gem into Portage so there is no problem there).

But this left me with the problem that some spam comments were still passing through my filters! Why did that happen? Well, if you remember, my idea behind it was validating the User-Agent header content… and it turns out that the latest Firefox versions have such a short header that almost every spammer seems to have been able to copy it just fine, so they weren’t killed off as intended. So, more digging in the requests.

Some work later, I was able to find two rules with which to validate Firefox and a bunch of other browsers; the first relies on checking the Connection: keep-alive header that is always sent by Firefox (tried in almost every possible combination), and the other relies on checking that the Content-Type of the POST request defines a charset: browsers will have it, but whatever the spammers are using nowadays doesn’t.
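
Sketched as ModSecurity rules, the two checks might look roughly like this (rule IDs, phases and exact variables are illustrative, not the actual rules from my ruleset):

```apache
# Sketch: a POST claiming to be Firefox must carry the
# Connection: keep-alive header that real Firefox always sends.
SecRule REQUEST_METHOD "@streq POST" \
    "id:900101,phase:1,deny,status:403,chain,msg:'Fake Firefox UA'"
    SecRule REQUEST_HEADERS:User-Agent "Firefox" "chain"
    SecRule &REQUEST_HEADERS:Connection "@eq 0"

# Sketch: POSTed comments from real browsers declare a charset in
# their Content-Type; the spammers' tools of the day did not.
SecRule REQUEST_METHOD "@streq POST" \
    "id:900102,phase:1,deny,status:403,chain,msg:'No charset in POST'"
    SecRule REQUEST_HEADERS:Content-Type "!@contains charset="
```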

Of course, the problem is that once I actually describe and upload the rules, spammers will just improve their tools to avoid these mistakes, but in the mean time I’ll have a calm, spamless blog. I still won’t give in to captchas!

At any rate, besides adding these validations, thanks to another round of testing I was able to fix things for Opera Turbo users (they can now comment just fine), and that led me to the choice of tagging the ruleset and… releasing it! Now you can download it from GitHub or, if you use Gentoo, just install it as www-apache/modsec-flameeyes — there’s also a live ebuild for the most brave.

I think I’ll keep away from Python still

Last night I ended up in Bizarro World, hacking at Jürgen’s gmaillabelpurge (which he actually wrote at my request; thanks once more, Jürgen!). Why? Well, the first reason was that I found out that it hadn’t been running for the past two and a half months because, for whatever reason, the default Python interpreter on the system where it was running was changed from 2.7 to 3.2.

So I first tried to get it to work with Python 3 while keeping it working with Python 2 at the same time; some of the syntax changed ever so slightly and was easy to fix, but the 2to3 script that comes with it is completely bogus. Among other things, it adds parentheses on all the print calls… which would be correct if it checked that said parentheses weren’t there already. In a script like the one aforementioned, the noise on the output is so high that there is really no signal worth reading.

You might be asking how come I didn’t notice this before. The answer is that I’m an idiot! I found out only yesterday that my firewall configuration was such that postfix was not reachable from the containers within Excelsior, which meant I never got the fcron notifications that the job was failing.

While I wasn’t able to fix the Python 3 compatibility, I was at least able to understand the code a little by reading it, and after remembering something about the IMAP4 specs I read a long time ago, I optimized its execution quite a bit, more than halving the runtime on big folders (like most of the ones I have here) by using batch operations and by peeking at, rather than “seeing”, the headers. In the end, I spent some three hours on the script, give or take.

But at the same time, I ended up having to work around limitations in Python’s imaplib (which is still nice to have by default), such as reporting fetched data as an array where each odd entry is a pair of strings (tag and unparsed headers) and each even entry is a string with a closed parenthesis (coming from the tag). Since I wasn’t able to sleep, at 3.30am I started rewriting the script in Perl (which at this point I know much better than I’ll ever know Python, even if I’m a newbie at it); by 5am I had all the features of the original, and I was supporting non-English locales for GMail — remember my old complaint about natural language interfaces? Well, it turns out that the solution is to use the Special-Use Extension for IMAP folders; I don’t remember seeing that explanation page when we first worked on the script.
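
To illustrate the imaplib quirk described above, here is the shape of the data it hands back and a small workaround (the sample response is constructed by hand to mirror the description; the helper name is mine):

```python
# imaplib's fetch() returns a flat list interleaving (tag, payload)
# tuples with bare b")" strings that close each response item.
sample_fetch_data = [
    (b"1 (BODY.PEEK[HEADER] {18}", b"Subject: first\r\n\r\n"),
    b")",
    (b"2 (BODY.PEEK[HEADER] {19}", b"Subject: second\r\n\r\n"),
    b")",
]


def fetch_pairs(data):
    """Keep only the (tag, payload) tuples, dropping the stray
    closing-parenthesis strings interleaved by imaplib."""
    return [item for item in data if isinstance(item, tuple)]


assert len(fetch_pairs(sample_fetch_data)) == 2
```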

But this entry is about Python, not the script per se (you can find the Perl version on my fork if you want). I have said before that I dislike Python, and my feeling is unchanged at this point. It is true that the script in Python required no extra dependencies, as the standard library already covered all the bases… but at the same time, that’s about it: it covers the basics; for anything more complex you still need new modules. Perl modules are generally easier to find, easier to install, and less error-prone — don’t try to argue this; I’ve got a tinderbox that reports Python test errors more often than even Ruby’s (which are lots), and most of the time for the same reasons, such as the damn Unicode errors “because LC_ALL=C is not supported”.

I also still hate the fact that Python forces me to indent code to have blocks. Yes, I agree that indented code is much better than non-indented code, but why on earth should the indentation mandate the blocks rather than the other way around? What I usually do in Emacs when I’m moving stuff in and out of loops (which is what I had to do a lot in the script, as I was replacing per-message operations with bulk operations) is basically add the curly brackets in a different place, select the region, and C-M-\ it — which means it’s re-indented following my brackets’ placement. If I see an indent I don’t expect, it means I made a mistake with the blocks, and I’m quick to fix it.

With Python, I end up having to manage the whitespace to get it to behave as I want, and it’s quite a bit more bothersome, even with the C-c < and C-c > shortcuts in Emacs. I find the whole thing obnoxious. The other problem is that, while Python does provide basic access to a lot more functionality than Perl, its documentation is… spotty at best. In the case of imaplib, for instance, the only real way to know what it’s going to give you is to print the returned value and check against the RFC — and it does not seem to have a half-decent way to return the UIDs without having to parse them. This is simply… wrong.

The obvious question from people in the know would be “why did you not write it in Ruby?” — well… recently I’ve started second-guessing my choice of Ruby, at least for simple one-off scripts. For instance, the deptree2dot tool that I wrote for OpenRC – available here – was originally written as a Ruby script… then I converted it to a Perl script half the size and twice the speed. Part of it, I’m sure, is just a matter of age (Perl has been optimized over a long time, much more than Ruby); part of it is their being different tools for different targets: Ruby is nowadays mostly a language for long-running software (due to webapps and so on), and much more object-oriented, while Perl is streamlined, top-down execution style…

I do expect to find the time to convert even my scan2pdf script to Perl (funnily enough, gscan2pdf, which inspired it, is written in Perl), although I have no idea when… in the mean time, I doubt I’ll write many more Ruby scripts for this kind of processing.

The future of Unpaper

You might have read it already, or maybe you haven’t, but it looks like Berlios is going to close down at the end of the year. I wouldn’t go as far as calling it the end of an era, but we’re pretty close. Even I have a few projects that are (still) hosted on Berlios and I need to migrate.

Interestingly, Berlios is also the original host of unpaper, which makes my forking it a couple of months ago a very good move. Especially since, if I had waited too long, the repository wouldn’t have been available… even if it’s true that the repository didn’t really contain anything useful, as it was just an import of the 0.3 sources.

At any rate, since Jens didn’t reply to my inquiries, I’ve decided to start working on a more proper takeover of the project. I have created an unpaper project page on my website, to which I switched the live ebuild’s HOMEPAGE and GitHub’s website, as well as created an ohloh project to track the development.

Oh, and I released version 0.4. Yeah I guess this was the first thing to write about, but I wanted to make it less obvious.

The new release is basically just the first cleanup I worked on: new build system, no changes in parameters, man page and so on. Right after releasing 0.4, I merged the changes from a new contributor, Felix Janda, who not only took the time to break the code up into multiple source files, but also improved the blur filter.

Now, the next release is likely going to be 2.0; why skip the 1.0 release? Well, it’s not really skipping. Before Berlios shuts down, I wanted to copy down the previous list of downloads so I can mirror those as well, and what I found is that… the original version numbering started with 1.0, so it’s mucked up badly; in Gentoo we have no problem, since 0.3 was the only one we had, but for the sake of restoring consistency, the next version is going to be 2.0.

What is going to happen with that release? Well, for sure I want to rewrite the command-line option parsing. Right now it’s very custom: long options are sometimes prefixed with one dash, sometimes with two, and in general it doesn’t fit the usual Unix command-line conventions; not counting the fact that the parsing is done as you go with a series of strcmp() calls, which is not what one usually expects. I intend to rewrite this with getopt_long(), but the problem is that it will break command-line compatibility with unpaper 0.3, which is not something I’m happy about. But we’ve got to do that sooner or later if we want a more coherent tool.

I also hope to hook into the code a different way to load and save images, using an already-present image decoding and encoding library, so that it can digest images in formats other than simple PNM. In particular, I’d like to be able to execute in a single pass the conversion from multiple files to a multi-page TIFF document, which requires quite a bit of work indeed. But I can dream, can’t I?

In the mean time, I hope to find some time this week to find a way to generate man pages on my server, so that I can publish more complete documentation for both Ruby-Elf and unpaper itself. This is likely going to be difficult, since I’m starting some new tasks next week, but… you never know.

Unpaper fork, part 2

Last month I posted a call to action hoping for help with cleaning up the unpaper code, as the original author has not been updating it since 2007, and it had a few issues. While I have seen some interest in said fork and cleanup, nobody stepped up with help, so it is proceeding, albeit slowly.

What is available now in my GitHub repository is mostly cleaned up, although still not much more optimised than the original — I actually removed one of the “optimisations” I added since the fork: the use of the sincosf() function. As Freddie pointed out in the other post’s comments, the compiler has a better chance of optimising this itself; indeed, both GCC and PathScale’s compiler optimise two sin and cos calls with the same argument into a single sincos call, which is good. And using two separate calls allows declaring the temporaries used to store the results as constant.

And indeed, today I started rewriting the functions so that temporaries are declared as constant as possible, and with the most limited scope applicable to them. This was important to me for one reason: I want to try making use of OpenMP to improve its performance on modern multicore systems. Since most of the processing is applied independently to each pixel, it should be possible for many iterative cycles to be executed in parallel.

It would also be a major win in my book if processing input pages in parallel were feasible as well: my current scan script has to process the scanned sheets in parallel itself, calling many copies of unpaper, just to process the sheets faster (sometimes I scan tens of sheets, such as bank contracts and similar). I just wonder whether it makes sense to simply start as many threads as possible, each one handling one sheet, or whether that would risk hogging the scheduler.

Finally there is the problem of testing. Freddie also pointed me at the software I remembered for checking the differences between two image files: pdiff — which is used by the ChromiumOS build process, by the way. Unfortunately I then remembered why I didn’t like it: it uses the FreeImage library, which bundles a number of other image format libraries, and whose upstream refuses to apply sane development practices to it.

What would be nice for this would be to either modify pdiff to use a different library – such as libav! – to access the image data, or to find or write something similar that does not require such stupidly-designed libraries.

Speaking of image formats, it would be interesting to get unpaper to support other image formats besides PNM; this way you wouldn’t have to keep converting from and to other formats when processing. One idea Luca gave me was to make use of libav itself to handle that part: it already supports PNM, PNG, JPEG and TIFF, so it would provide most of the features needed.

In the meantime, please let me know what you think of how this is going — and remember that this blog, the post and me are Flattr enabled!

Forking unpaper, call for action

You might or might not know the tool by the name of unpaper that has been in Gentoo’s main tree for a while. If you don’t know it and you scan a lot, please go look at it now, it is sweet.

But sweet or not, the tool itself had quite a few shortcomings; one of these was recently brought to my attention: an unsafe use of sprintf() that upstream fixed after the 0.3 release, but which never made it into a new release.

When looking at fixing that one issue, I ended up deciding on a slightly more drastic approach: I forked the code, imported it to GitHub and started hacking at it. This was both because the package lacked a build system, and because the tarball provided didn’t correspond to the sources in CVS (nor to those in SVN, for what it’s worth).

For those who wonder why I got involved in this, while it is obviously outside my usual interest area: I’m using unpaper almost daily in my paperless quest, which is actually bearing fruit (my accountant is pleasantly surprised by how little time it takes me to find the paperwork he needs). And if I can shave even fractions of a second from a single unpaper process, it can improve my workflow considerably.

What I have now in my repository is an almost identical version that has gone through some improvements: the build system is autotools (properly written), which works quite well even for a single-source package, as it can detect a couple of features that would otherwise be ignored. The code no longer has the allocation waste that it had before, as I’ve replaced a number of pointers to characters with preprocessor macros, and I started looking at a few strange things in the code.

For instance, it no longer opens the file, seeks to the end, then rewinds to the start to find the file’s size. This was especially unhelpful since the variable where the file’s size was saved was never read; but since the stdio calls have side effects, the compiler couldn’t drop them by itself.

And when sincosf() is available, it will be used rather than calling sin() and cos() separately.

I also stopped the code from copying a string from a prebuilt table and parsing it at runtime to get the represented float value… multiple times. This was mostly tied to the page size parsing, which I have basically rewritten, also avoiding looping twice over the two sizes with two separate loops. Duh!

I had also originally overlooked the fact that the repository had some pre-defined self-tests that were never packaged and thus couldn’t be used for custom builds before; this is also fixed now, and make check runs the tests just fine. Unfortunately what this does not do is compare the output with some known-good output — I need an image comparison tool for that; for now it only ensures that unpaper behaves as expected with the command line it is provided, which is better than nothing.

At any rate this is obviously only the beginning: there are bugs open on the BerliOS project page that I should probably look into fixing, and I have already started writing a list of TODO tasks that should be taken care of at some point or another. If you’re interested in helping out, please clone the repository and see what you can do. Testing is also very much appreciated.

I haven’t decided when to do a release; for now I’m hoping that Jens will join the fork and publish the releases on BerliOS based on the autotools build system. There’s a live ebuild in the main tree for testing (app-text/unpaper-9999), so I’d be grateful if you could try it on a few different systems. Please enable FEATURES=test for it so that if something breaks we’ll know soon enough. If you’re a maintainer packaging unpaper for another distribution, feel free to get in touch and tell me if you have other patches to provide (I should mail the Debian maintainer at this point, I guess).

Revisiting my opinion of GitHub — How do you branch a readonly repository?

I have expressed quite a bit of discontent with GitHub before, regarding the way they keep suggesting that people “fork” projects. I’d like for once to state that while I find “fork” the wrong word, the idea itself is not too far from ideal.

Fast, reliable and free distributed SCMs have defined a new landscape in the world of Free Software, as many have probably noted already; but without the help of GitHub, Gitorious and BitBucket I wouldn’t expect them to have made such an impact. Why? Because hosting a repository is generally not an easy task for non-sysadmins, and finding where the original code is supposed to be is often not that easy either.

Of course, you cannot “rely on the cloud” to be the sole host for your content, especially if you’re a complex project like Gentoo, but it doesn’t hurt to be able to tell users “You want to make changes? Simply branch it here and we’ll take care of the management”, which is what those services enabled.

Why am I writing about this now? Well, it has happened more than once before that I needed to publish a branch of a package whose original repository was hosted on SourceForge, or on the GNOME source repositories, where you cannot branch it to make any change. To solve that problem I set up a system on my server to clone and mirror the repositories over to Gitorious; the automirror user is now tracking twelve repositories and copying them over to Gitorious every six hours.

As it happens, last night I was hacking a bit at ALSA — mostly I was looking into applying what I wrote yesterday regarding hidden symbols to the ALSA plugins, as my script found duplicated symbols between the two Pulse plugins (one is the actual sound plugin, the other is the control one) — but I ended up doing general fixing of their build system as well, as it was slightly broken. I sent the patches to the mailing lists, but I wanted to have them available as a branch too.

Well, now you got most of the ALSA project available on Gitorious for all your branching and editing needs. Isn’t Git so lovely?

Before finishing, though, I’d like to point out that there is one thing I’m not going to change my opinion on: the idea of “forking” projects is, in my opinion, very bad, as I wrote in the article at the top of the page. I actually like Gitorious’s “clones” better, as that’s what they should be: clones with branches, not forks. Unfortunately, as I wrote some other time, Gitorious is not that great when it comes to licensing constraints, so for some projects of mine I’ve been using GitHub instead. Both services are, in my opinion, equally valuable for the Free Software community, and not only for it.

Update (2017-04-22): as you may know, Gitorious was acquired by GitLab in 2015, and the service was shut down. This means that all the references above are now useless. Sorry.