Success Story: Mergify, GitHub and Pre-Merge Checks

You may remember that when I complained about bubbles, one of the things I complained about was that I had no idea how to get continuous integration right. And this kept being a problem for me on the few projects where I actually do get contributions.

In particular, glucometerutils is a project that I don’t want to be “just mine” in the future. I am releasing it with a very permissive license, and I hope that others will continue contributing. But while I did manage to get Travis CI set up for it, I kept forgetting to run the tests myself before pushing, which is annoying.

One of the solutions that was proposed to me for that particular project was to use pre-commit, which is clearly a good starting point, but as the mypy integration shows, it’s not perfect: it requires duplicating quite a bit of information regarding dependencies. And honestly, the problem is not so much whether the code works on a per-commit basis, as whether it works on a per-push basis. Which, often, it hasn’t for me.

On the other hand, pull requests coming from other users have been much less likely to break stuff, because Travis CI would tell me if something was wrong. So I was basically looking for something that would put me through exactly the same level of checking, but at the same time would let me push (or merge in) my code as soon as integration passed.

While I was looking around for this, I found a blog post by Debian developer Julien Danjou about his company Mergify, which looked like pretty much exactly what I wanted: it allows me to say that if either I approved of a pull request, or made it myself, and the continuous integration reports no problems, the pull request should just be rebased into the master branch.

The next problem was how to make it less cumbersome for me to keep developing the project, but thankfully Julien came through for that as well by introducing me to git-pull-request. We needed a bit of work for that, though, both because I have had the same advanced settings in my git configuration for the past few years, and because I’m lazy and don’t always capitalize the F in Flameeyes when I type my username. Hopefully all of that will be upstreamed by the time you read this blog post.

The end result of this? I moved glucometerutils to be part of the same organization as the Protocols site (which is also using Mergify now), and instead of git push, I’m using git pull-request. If I didn’t break anything, the bot merges it. If someone sends me a pull request, I just need to approve it, and once again the bot takes care of the rest.

I’ll look for ways to keep doing this for repositories that are not part of any organization, but at the very least this solved the issue for the two main repositories for which I have active contributors. And it reduces the risk of me being the single point of failure for the projects.

Also, this is a perfect example of why Randall Munroe is Wrong, for once, or twice. Automating the merges will definitely not save me more time than I spent getting this to work in the first place. The time Julien and I spent figuring out why GitHub was throwing non-obvious validation errors will never be repaid by the time I save not clicking on the pull request link after git push. But saving time is not the only thing automation is about.

In particular, this time automation is about fairness, consistency, and resiliency: while I’m still special in the Mergify configuration, I now go through the same integration tests as everyone else to merge into the repository, and it’s a bot doing the rebase-merge, rather than me, so it’s less likely that mistakes will be made.

Anyway, thank you Julien, thank you Mergify, and thank you all who contribute. Hopefully the next few months will be a bit more active for me, between the forced work from home and the new job.

Publishing Documentation

I have been repeating for years that blogs are not documentation in and of themselves. While I have spent a lot of time over the years making sure that my blog’s links are not broken, I also know that many of my old blog posts are no longer relevant at all. The links out of the blog can be broken, and it’s not particularly easy to identify them. What might have been true in 2009 might not be true in 2020. The best option for implementing something has likely changed significantly, given how, ten years ago, Cloud Computing was barely a thing on the horizon, and LXC was considered an experiment.

This is the reason why Autotools Mythbuster is the way it is: it’s a “living book” — I can update and improve it, but at the same time it can be used as a stable reference of best practices: when they change it gets updated, but the link is still a pointer to the good practice.

At work, I pretty much got used to “Radically Simple Documentation” – thanks to Riona and her team – which pretty much means I only needed to care about the content of the documentation, rather than dealing with how it would render, either in terms of pipeline or style.

And just like other problems with the bubble, when I try to do the same outside of it, I get thoroughly lost. The Glucometer Protocols site has been hosted on GitHub Pages for a few years now — but I wanted to add some diagrams, as more modern protocols (as well as some older, but messier, ones) would be much simpler to explain with UML sequence diagrams to go along with the text.

The first problem was of course to find a way to generate sequence diagrams out of code that can be checked in and reviewed, rather than as binary blobs — and thankfully there are a few options. I settled on blockdiag because it’s the easiest to set up in a hurry. But it turned out that integrating it is not as easy as it would seem.

While GitHub Pages uses Jekyll, it uses such an old version that reproducing it on Netlify is pretty much impossible. Most of the themes available out there are dedicated to personal sites, or e-commerce, or blogs — and even when I found one that seemed suitable for this kind of reference, I couldn’t figure out how to get the whole thing to work. And it didn’t help that Jekyll appears to be very scant on debug logging.

I tried a number of different static site generators, including a few in JavaScript (which I find particularly annoying), but the end result was almost always that they seemed more geared towards “marketing” sites (in a very loose sense) than references. To this moment, I miss the simplicity of g3doc.

I ended up settling on Foliant, which appears to be geared more towards writing actual books than reference documentation, but it wraps around MkDocs, and it provides a plugin that integrates with Blockdiag (although I still have a pending pull request to support more diagram types). And with a bit of playing around with it, I managed to get Netlify to build this properly and serve it. Which is what you get now.

But of course, since MkDocs (and a number of other Python-based tools I found) appears to rely on the same Markdown library, it is not even completely compatible with the Markdown as written for Jekyll and GitHub Pages: the Python implementation is much stricter when it comes to indentation, and misses some of the features. Most of those appear to have been works in progress at some point, but there doesn’t seem to be much movement on the library itself.

Again, these are relatively simple features I came to expect for documentation. And I know that some of my (soon-to-be-former) colleagues have been working on improving the state of open source documentation frameworks, including Lisa working on Docsy, which looks awesome — but it relies on Hugo, which I still dislike, and which seems to have taken a direction further and further away from me (the latest, when I was trying to set this up, being that to use Hugo on Linux they now seem to require you to install Homebrew, because clearly having something easy for Linux packagers to work with is not worth it, sigh).

I might reconsider that, if Hugo finds a way to build images out of other tools, but I don’t have strong expectations that the needs of reference documentation will be considered in future updates to Hugo, given how it was previously socialized as a static blog engine, only to pivot to needs that would make it more “marketable”.

I even miss GuideXML, to a point. This was Gentoo’s documentation format back in the days before the Wiki. It was complex, and probably more complicated than it needed to be, but at least the pipeline to generate the documentation was well defined.

Anyhow, if anyone out there has experience in setting up reference documentation sites, and wants to make it easier to maintain a repository of information on glucometers, I’ll welcome help, suggestions, pull requests, and links to documentation and tools.

We need Free Software Co-operatives, but we probably won’t get any

The recent GitHub craze got a number of Free Software fundamentalists to hurry away from GitHub towards other hosting solutions.

Whether it was GitLab (a fairly natural choice given the nature of the two services), BitBucket, or SourceForge (which is trying to rebuild a reputation as a Free Software friendly hosting company), there are a number of SaaS providers to choose from.

At the same time, a number of projects have been boasting (and maybe a bit too smugly, in my opinion) that they self-host their own GitLab or similar software, and suggested that other projects do the same to be “really free”.

A lot of the discourse appears to be missing nuance on the compromises involved in using SaaS hosting providers, self-hosting for communities, and self-hosting for single projects, and so I thought I would gather my thoughts around this in one single post.

First of all, you probably remember my thoughts on self-hosting in general. Any solution that involves self-hosting will require a significant amount of ongoing work. You need to make sure your services keep working, and stay safe and secure. Particularly for FLOSS source code hosting, it’s of primary importance that the integrity and safety of the source code are maintained.

As I already said in the previous post, this style of hosting works well for projects that have a community, in which one or more dedicated people can look after the services. And in particular for bigger communities, such as KDE, GNOME, FreeDesktop, and so on, this is a very effective way to keep stewardship of code and community.

But for one-person projects, such as unpaper or glucometerutils, self-hosting would be quite bad. Even for xine, with a single person maintaining just the site and the Bugzilla, it got fairly bad. I’m trying to convince the remaining active maintainers to migrate this to VideoLAN, which is now probably the biggest Free Software multimedia project and community.

This is not a new problem. Indeed, before people rushed to GitHub (or Gitorious), they rushed to other services that provided similar integrated environments. When I became a FLOSS developer, the biggest of them was SourceForge — which, as I noted earlier, was recently bought by a company trying to rebuild its reputation after a significant loss of trust. These environments don’t only include SCM services, but also issue (bug) trackers, contact email, and so on and so forth.

Using one of these services is always a compromise: not only do they require an account on each service to be able to interact with them, but they also have a level of lock-in, simply because of the nature of URLs. Indeed, as I wrote last year, just going through my old blog posts to identify those referencing dead links reminded me of just how many project hosting services have shut down, sometimes dragging along (Berlios) and sometimes abruptly (RubyForge).

This is a problem that does not only involve services provided by for-profit companies. Sunsite, RubyForge and Berlios didn’t really have companies behind them, and that last one is probably one of the closest things to a Free Software co-operative that I’ve seen outside of the FSF and friends.

There is of course Savannah, FSF’s own Forge-lookalike system. Unfortunately for one reason or another it has always lagged behind the featureset (particularly around security) of other project management SaaS. My personal guess is that it is due to the political nature of hosting any project over on FSF’s infrastructure, even outside of the GNU project.

So what we would need is a politically-neutral, project-agnostic hosting platform that is a co-operative effort. Unfortunately, I don’t see that happening any time soon. The main problem is that project hosting is expensive, whether you use dedicated servers or cloud providers. And it takes full-time people working as system administrators to keep it running smoothly and securely. You need professionals, too — or you may end up like lkml.org, down when its one maintainer goes on vacation and something happens.

While there are projects that receive enough donations to be able to sustain these costs (see KDE, GNOME, VideoLAN), I’d be skeptical that an unfocused co-operative would be able to take care of this. Particularly if it does not restrict the creation of new projects and repositories, as that requires particular attention to abuse, and good guidelines for which content is welcome and which isn’t.

If you think that’s an easy task, consider that even SourceForge, with their review process that used to take a significant amount of time, managed to let joke projects use their service and run on their credentials.

A few years ago, I would have said that SFLC, SFC and SPI would be the right actors to set up something like this. Nowadays? Given their infighting, I don’t expect them to be of much use.

Two words about my personal policy on GitHub

I was not planning on posting on the blog until next week, trying to stick to a weekly schedule, but today’s announcement of Microsoft acquiring GitHub is forcing my hand a bit.

So, Microsoft is acquiring GitHub, and a number of Open Source developers are losing their minds, in all possible ways. A significant proportion of the comments I have seen on my social media sound like doomsday predictions, as if this spells the end of GitHub, because Microsoft is going to ruin it all for them.

Myself, I think that if it spells the end of anything, it is the end of the one-stop shop for working on any project out there, not because of anything Microsoft did or is going to do, but because a number of developers are now leaving the platform in protest (protest of what? One company buying another?)

Most likely, it’ll be the fundamentalists that move their projects away from GitHub. And depending on what they decide to do with their projects, it might not even show up on anybody’s radar. A lot of people are pushing for GitLab, which is both an open-core self-hosted platform, and a PaaS offering.

That is not bad. Self-hosted GitLab instances already exist for VideoLAN and GNOME. Big, strong communities are, in my opinion, in the perfect position to dedicate people to support core infrastructure that makes open source software development easier, in particular because it’s easier for a community of dozens, if not hundreds, of people to find dedicated people to work on it. For one-person projects, that’s overhead, distracting, and destructive as well, as fragmenting into micro-instances will make forking projects painful — and at the same time, allowing any user who just registered to fork the code on any instance is prone to abuse and a recipe for disaster…

But this is all going to be a topic for another time. Let me try to go back to my personal opinions on the matter (to be perfectly clear that these are not the opinions of my employer and yadda yadda).

As of today, what we know is that Microsoft acquired GitHub, and they are putting Nat Friedman of Xamarin fame (the company that stood behind the Mono project after Novell) in charge of it. This choice makes me particularly optimistic about the future, because Nat’s a good guy and I have the utmost respect for him.

This means I have no intention to move any of my public repositories away from GitHub, except if doing so would bring a substantial advantage. For instance, if there was a strong community built around medical devices software, I would consider moving glucometerutils. But this is not the case right now.

And because I still root most of my projects around my own domain, if I did move things, the canonical URLs would still be valid. This is a scheme I devised after getting tired of fixing up where unieject ended up.

Microsoft has not done anything wrong with GitHub yet. I will give them the benefit of the doubt, and not rush out of the door. It would and will be different if they were to change their policies.

Rob’s point is valid, and it would be a disgrace if various governments were to push Microsoft into a corner, requiring it to purge content that the smaller, independent GitHub would have left alone. But unless that happens, we’re debating hypotheticals at the same level as “If I was elected supreme leader of Italy”.

So, as of today, 2018-06-04, I have no intention of moving any of my repositories to other services. I’ll also use a link to this blog with no accompanying comment to anyone who will suggest I should do so without any benefit for my projects.

Glucometerutils News: Continuous Integration, Dependencies, and Better Code

You may remember glucometerutils, my project of an open source Python tool to download glucometer data from meters that do not provide Linux support (as in, any of them).

While the tool started a few years ago out of my personal need, this year there has been a bigger push than before, with more contributors trying the tool out, finding problems, and fixing bugs. For my part, I managed to have a few fits of productivity on the tool, particularly this past week at 34C3, when I decided it was due time to make the package shine a bit more.

So let’s see what the more recent developments for the tool are.

First of all, I decided to bring the Python version requirement up to Python 3.4 (previously, it was Python 3.2). The reason for this is that it gives access to the mock module for testing, and to the enum module for writing actual semantically-defined constants. While both of these could be provided as dependencies to support the older versions, I can’t think of any good reason not to upgrade from 3.2 to 3.4, and thus there’s no need to support those versions. I mean, even Debian Stable has Python 3.5.
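As a quick sketch of what the enum module buys here (the names below are illustrative, not necessarily the ones the tool actually uses): constants become real objects with a type, so a typo can’t silently slip through as a random string, and conversion code reads much more clearly.

```python
import enum


class Unit(enum.Enum):
    """Measurement units for glucose readings (illustrative names)."""
    MG_DL = 'mg/dL'
    MMOL_L = 'mmol/L'


def convert_glucose_unit(value, from_unit, to_unit):
    """Convert a reading between units; anything that is not a Unit member
    fails loudly instead of silently comparing unequal strings."""
    if not isinstance(from_unit, Unit) or not isinstance(to_unit, Unit):
        raise ValueError('units must be Unit members')
    if from_unit == to_unit:
        return value
    if from_unit == Unit.MG_DL:
        return round(value / 18.0, 2)   # mg/dL -> mmol/L (approximate factor)
    return round(value * 18.0, 1)       # mmol/L -> mg/dL


print(convert_glucose_unit(120, Unit.MG_DL, Unit.MMOL_L))  # 6.67
```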

And while talking about Python versions, Hector pointed me at construct, which right away looked like an awesome library for dealing with binary data structures. It turned out to be a bit rougher around the edges than I expected from the docs, particularly because the docs do not contain enough information to actually use it with proper dynamic objects, but it does make a huge difference compared to dealing with bytestrings manually. I started, while still in Leipzig, using it to parse the basic frames of the FreeStyle protocol, and then proceeded to rewrite the other binary-based protocols between the airports and home.

This may sound like a minor detail, but I actually found it made a huge difference, as the library already provides proper support for validating expected constants, as well as for dealing with checksums — although in some cases it’s a bit heavier-handed than I expected. Also, the library supports defining bit structures too, which considerably simplified the OneTouch Ultra Easy driver, which was building its own poor man’s version of the same idea. After rewriting the two binary LifeScan drivers I have (for the OneTouch Verio 2015 and Select Plus, and the Ultra Easy/Ultra Mini), the similarities between the two protocols are much easier to spot. Indeed, after porting the second driver, I also decided to refactor the first a bit, to make the two look more alike.
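To give an idea of what this looks like in practice, here is a minimal, made-up frame declaration — not the actual FreeStyle or LifeScan layout — using construct’s recent API: expected constants, length-prefixed payloads and bit-level flags are all declared rather than hand-parsed.

```python
from construct import (
    BitsInteger, BitStruct, Bytes, Const, Flag, Int8ul, Int16ul, Struct, this)

# A made-up frame: a fixed start byte, a length-prefixed payload, and a
# 16-bit checksum trailer (validated separately in this sketch).
FRAME = Struct(
    'start' / Const(b'\x02'),
    'length' / Int8ul,
    'payload' / Bytes(this.length),
    'checksum' / Int16ul,
)

# Bit-level structures are just as declarative (again, made-up fields).
FLAGS = BitStruct(
    'more_data' / Flag,
    'sequence' / BitsInteger(7),
)

packet = FRAME.parse(b'\x02\x03abc\x34\x12')
print(packet.payload, hex(packet.checksum))   # b'abc' 0x1234
flags = FLAGS.parse(b'\x81')
print(flags.more_data, flags.sequence)        # True 1
```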

This is going to be useful again soon, because two people have asked for support for the OneTouch Verio IQ (which, despite the name, shares nothing with the normal Verio — this one uses an on-board cp210x USB-to-serial adapter), and I somewhat expect that, while not compatible, the protocol is likely to be similar to the other two. I found one for cheap on Amazon Germany, and ended up ordering it — it should be the easiest one to reverse engineer from my backlog, because it uses a driver I already know is easy to sniff (unlike other serial adapters that use strange framing, I’m looking at you, FT232RL!), and the protocol is likely not to stray too far from the other LifeScan protocols, even though it’s not directly compatible.

I have also spent some time on the tests that are currently present. Unfortunately, they don’t currently cover much of anything beyond some common internal libraries. I have, though, decided to improve the situation, if a bit slowly. First of all, I picked up a few of the recommendations I give my peers at work during Python reviews, and started using the parameterized module that comes with Abseil, which was recently released as open source by Google. This reduces the tedious repetition of building similar tests to exercise different paths in the code. Then, I’m very thankful to Muhammad for setting up Travis for me, as that now allows the tests to show breakage, where there is any coverage at all. I’ll try to write more tests this month to make sure more drivers are exercised.
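For reference, this is roughly what a parameterized test looks like (the imported module and functions refer to the illustrative conversion sketch above, not to the project’s actual layout):

```python
from absl.testing import absltest, parameterized

# Hypothetical import: the Unit/convert_glucose_unit sketch from earlier.
from glucose_units import Unit, convert_glucose_unit


class ConvertGlucoseUnitTest(parameterized.TestCase):

    @parameterized.parameters(
        (100, 5.56),
        (126, 7.0),
        (200, 11.11),
    )
    def test_mgdl_to_mmoll(self, mgdl, expected):
        # One test method, three cases: no copy-and-paste test bodies.
        self.assertAlmostEqual(
            convert_glucose_unit(mgdl, Unit.MG_DL, Unit.MMOL_L),
            expected, places=2)


if __name__ == '__main__':
    absltest.main()
```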

I’ve also managed to make the setup.py in the project more useful. Indeed, it now correctly lists the dependencies of most of the drivers as extras, and I may even be ready to make a first release on PyPI, now that I have tested most of the devices I have at home and they all work. Unfortunately, this is currently partly blocked on Python SCSI not having a release on PyPI itself. I’ll get back to that, possibly next month at this point. For now you can install it from GitHub and it should all work fine.
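The extras mechanism itself is simple enough; here is a hedged sketch of the idea (the extra and package names are examples, not necessarily the ones the project ended up with):

```python
from setuptools import find_packages, setup

setup(
    name='glucometerutils',
    version='1',
    packages=find_packages(),
    extras_require={
        # Only users of serial-based meters need pyserial...
        'otultra2': ['pyserial'],
        # ...and only FreeStyle users need HID access.
        'freestyle': ['hidapi'],
    },
)
```

A user then only pulls in what their meter actually needs, e.g. pip install 'glucometerutils[freestyle]'.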

As for the future, there are two in-progress pull requests/branches from contributors to add support for graphing the results, one using rrdtool and one using gnuplot. These are particularly interesting for users of the FreeStyle Libre (which is the only CGM I have a driver for), and someone expressed interest in adding a Prometheus exporter, because why not — this is not as silly as it may sound, as the output graphs of the Libre’s own software look more like my work monitoring graphs than the usual glucometer graphs. Myself, I am now toying with the idea of mimicking the HTML output that the Accu-Chek Mobile generates on its own. This would be the easiest thing to just send by email to a doctor, and can probably be used as a basis to extend things further, integrating the other graph outputs.

So onwards and upwards, the tooling will continue being built on. And I’ll do my best to make sure that Linux users who have a need to download their meters’ readings have at least some tooling that they can use, and that does not require setting up unsafe MongoDB instances on cloud providers.

Diabetes control and its tech, take 4: glucometer utilities

This is one of the posts I lost due to the blog problems with draft autosaving. Please bear with the possibly missing pieces that I might be forgetting.

In the previous post on the subject I pointed out that, thanks to a post in a forum, I was able to find out how to talk to the OneTouch Ultra 2 glucometers I have (both of them) — the documentation assumes you’re using HyperTerminal on Windows and thus does not work when using either picocom or PySerial.

Since I had the documentation from LifeScan for the protocol, starting to write a utility to access the device was the obvious next step. I’ve published what I have right now in a GitHub repository, and I’m going to write a bit more about it today, after a month of procrastination and other tasks.

While writing the tool, I found another issue with the documentation: every single line returned by the glucometer ends with a four-digit (hex) checksum, but the documentation does not describe how the checksum is calculated. By comparing some strings with the checksums I knew, I originally guessed it might have been what I found called “CRC16-Syck” — unfortunately that also meant that the only library implementing it was a GPL-3 one, which clashed with my idea of a loose copyleft license for the tools.

But after introducing the checksum verification, I found out that the checksums did not really match. So, after more looking around with Google and in forums, I got told that the checksum is a 16-bit variation of Fletcher’s checksum, calculated in 32-bit but dropping the higher half… and indeed it would then match, but when I then looked at the code, I found out that “32-bit Fletcher reduced to 16-bit” is actually just “a sum of all the bytes, modulo 16 bits”. It’s the most stupid and simple checksum.
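In code, the whole discovery boils down to something like this (a minimal sketch of the checksum as described above, with a made-up sample line):

```python
def line_checksum(data):
    # The "16-bit Fletcher" that turned out to be a plain sum of all the
    # bytes, truncated to 16 bits.
    return sum(data) & 0xFFFF


line = b'P 003,"MG/DL "'   # made-up reply line, without its trailing checksum
print('%04X' % line_checksum(line))
```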

Interestingly enough, the newer glucometers from LifeScan use a completely different protocol: it’s binary-based and uses a standard CRC16 implementation.

I’ve been doing my best to design the utility in such a way that there is a workable library as well as a utility (so that a graphical interface can be built on top of it), and at the same time I tried making it possible to have multiple “drivers” that implement access to the glucometer commands. The idea is that this way, if somebody knows the protocol for other devices, they can implement support without rewriting, or worse duplicating, the tool. So if you own a glucometer and want to add support for it to my tool, feel free to fork the repository on GitHub and submit a merge request with the driver, as sketched below.
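The shape of the driver interface is roughly the following — a sketch of the idea rather than the exact API in the repository:

```python
class Reading:
    """A single glucose reading downloaded from a meter."""

    def __init__(self, timestamp, value, comment=''):
        self.timestamp = timestamp
        self.value = value
        self.comment = comment


class Device:
    """Base class every meter driver implements; the command-line tool (and
    a future GUI) only ever talks to this interface."""

    def __init__(self, device_path):
        self.device_path = device_path

    def connect(self):
        """Open the serial connection to the meter."""
        raise NotImplementedError

    def get_serial_number(self):
        """Return the meter's serial number as a string."""
        raise NotImplementedError

    def get_readings(self):
        """Yield Reading objects for every result stored on the meter."""
        raise NotImplementedError
```

A new meter then only needs a subclass of Device speaking its protocol; the rest of the tool stays untouched.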

A final note I want to leave is about possible Android support. I have been keeping in mind the option of writing an Android app to be able to dump the readings on the go. Hopefully it’s still possible to build Android apps for the market in Python, but I’m not sure about it. At the same time, there is a more important problem: even though I could connect my phone (Nexus 4) to the glucometer with a USB OTG cable and the cable LifeScan sent me, that USB cable has a PL2303 in it, and I doubt that most Android devices would support it anyway.

The other alternative I can think of is to find a userland implementation of PL2303 that lets me access it as a serial port without the need for a kernel driver. If somebody knows of any software already written to solve this problem, I’ll be happy to hear about it.

ModSecurity and my ruleset, a release

After the recent Typo update I had some trouble with Akismet not working properly to mark comments as spam — at least the very few spam comments that could get past my ModSecurity Ruleset — so I set off to deal with it a couple of days ago and find out why.

Well, to be honest, I didn’t really want to focus on the why at first. The first thing I found out while looking at the way Typo uses Akismet is that it still used a bundled, hacked, ancient Akismet library… given that the API key I got was valid, I jumped to the conclusion, right or wrong as it may have been, that the code was simply using an ancient API that had been dismissed, and decided to look around to see if there was a newer Akismet version; lo and behold, a 1.0.0 gem had been released not many months before.

After fiddling with it a bit, the new Akismet library worked like a charm, and spam comments passing through ModSecurity were again marked as such. A pull request and its comments later, I had a perfectly working Typo which marks comments as spam as well as before, with one less library bundled within it (and I also got the gem into Portage, so there is no problem there).

But this left me with the problem that some spam comments were still passing through my filters! Why did that happen? Well, if you remember, my idea behind it was validating the User-Agent header content… and it turns out that the latest Firefox versions have such a small header that almost every spammer seems to have been able to copy it just fine, so they weren’t killed off as intended. So, more digging in the requests.

Some work later, I was able to find two rules with which to validate Firefox, and a bunch of other browsers: the first relies on checking the Connection: keep-alive header that is always sent by Firefox (tried in almost every possible combination), and the other relies on checking the Content-Type on the POST request for a charset being defined: browsers will have it, but whatever the spammers are using nowadays doesn’t.

Of course, the problem is that once I actually describe and upload the rules, spammers will just improve their tools not to commit these mistakes, but in the meantime I’ll have a calm, spamless blog. I still won’t give in to captchas!

At any rate, besides adding these validations, thanks to another round of testing I was able to fix Opera Turbo users (now they can comment just fine), and that led me to the choice of tagging the ruleset and… releasing it! Now you can download it from GitHub or, if you use Gentoo, just install it as www-apache/modsec-flameeyes — there’s also a live ebuild for the most brave.

I think I’ll keep away from Python still

Last night I ended up in Bizarro World, hacking at Jürgen’s gmaillabelpurge (which he actually wrote at my request — thanks once more, Jürgen!). Why? Well, the first reason was that I found out that it hadn’t been running for the past two and a half months because, for whatever reason, the default Python interpreter on the system where it was running had been changed from 2.7 to 3.2.

So I first tried to get it to work with Python 3 while keeping it working with Python 2 at the same time; some of the syntax changed ever so slightly and was easy to fix, but the 2to3 script that Python comes with is completely bogus. Among other things, it adds parentheses to all the print calls… which would be correct if it checked that said parentheses weren’t there already. In a script like the aforementioned one, the noise in the output is so high that there is really no signal worth reading.
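To illustrate the kind of noise I mean (made-up strings and variables, same shape as the real calls):

```python
count, label = 3, 'newsletters'   # made-up values, just for the example

# What the script already had, valid in both Python 2 and 3:
print("deleting %d messages from %s" % (count, label))

# What the conversion produces: a second, redundant pair of parentheses on
# every print call. Harmless at runtime, but it buries the real changes.
print(("deleting %d messages from %s" % (count, label)))
```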

You might be asking how come I didn’t notice this before. The answer is that I’m an idiot! I found out only yesterday that my firewall configuration was such that postfix was not reachable from the containers within Excelsior, which meant I never got the fcron notifications that the job was failing.

While I wasn’t able to fix the Python 3 compatibility, I was able to at least understand the code a little by reading it, and after remembering something about the IMAP4 specs I read a long time ago, I was able to optimize its execution quite a bit, more than halving the runtime on big folders, like most of the ones I have here, by using batch operations, and by peeking at the headers instead of “seeing” them. In the end, I spent some three hours on the script, give or take.

But at the same time, I ended up having to work around limitations in Python’s imaplib (which is still nice to have by default), such as it reporting fetched data as a flat list, where each odd entry is a pair of strings (tag and unparsed headers) and each even entry is a string with a closing parenthesis (coming from the tag). Since I wasn’t able to sleep, at 3.30am I started rewriting the script in Perl (which at this point I know much better than I’ll ever know Python, even if I’m a newbie in it); by 5am I had all the features of the original, and I was supporting non-English locales for GMail — remember my old complaint about natural language interfaces? Well, it turns out that the solution is to use the Special-Use Extension for IMAP folders; I don’t remember this explanation page from when we first worked on that script.
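For the record, the two optimizations amount to roughly the following (hypothetical server and credentials, header fields picked just for the example):

```python
import imaplib

conn = imaplib.IMAP4_SSL('imap.example.com')
conn.login('user@example.com', 'app-password')
conn.select('INBOX', readonly=True)

typ, data = conn.search(None, 'ALL')
message_ids = data[0].split()

if message_ids:
    # One FETCH for the whole set of messages, instead of one per message;
    # BODY.PEEK[...] reads the headers without marking the messages \Seen.
    msg_set = b','.join(message_ids).decode('ascii')
    typ, response = conn.fetch(
        msg_set, '(BODY.PEEK[HEADER.FIELDS (MESSAGE-ID SUBJECT)])')

    # imaplib hands this back as a flat list: (tag, raw headers) tuples
    # interleaved with lone b')' strings closing each response item.
    for item in response:
        if isinstance(item, tuple):
            tag, raw_headers = item
            # ... parse raw_headers here ...

conn.logout()
```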

But this entry is about Python and not the script per se (you can find the Perl version on my fork if you want). I have said before that I dislike Python, and my feeling is still unchanged at this point. It is true that the script in Python required no extra dependencies, as the standard library already covered all the bases… but at the same time, that’s about it: it has the basics; for something more complex you still need new modules. Perl modules are generally easier to find, easier to install, and less error-prone — don’t try to argue this; I’ve got a tinderbox that reports Python test errors more often than even Ruby’s (which are a lot), and most of the time for the same reasons, such as the damn Unicode errors “because LC_ALL=C is not supported”.

I also still hate the fact that Python forces me to indent code to have blocks. Yes, I agree that indented code is much better than non-indented code, but why on earth should the indentation mandate the blocks rather than the other way around? What I usually do in Emacs when I’m moving stuff in and out of loops (which is what I had to do a lot in the script, as I was replacing per-message operations with bulk operations) is basically add the curly brackets in a different place, then select the region and C-M-\ it — which means it’s re-indented following my brackets’ placement. If I see an indent I don’t expect, it means I made a mistake with the blocks and I’m quick to fix it.

With Python, I end up having to manage the whitespace to have it behave as I want, and it’s quite a bit more bothersome, even with the C-c < and C-c > shortcuts in Emacs. I find the whole thing obnoxious. The other problem is that, while Python does provide basic access to a lot more functionality than Perl, its documentation is… spotty at best. In the case of imaplib, for instance, the only real way to know what it’s going to give you is to print the returned value and check against the RFC — and it does not seem to have a half-decent way to return the UIDs without having to parse them. This is simply… wrong.

The obvious question for people who know me would be “why did you not write it in Ruby?” — well… recently I’ve started second-guessing my choice of Ruby, at least for simple one-off scripts. For instance, the deptree2dot tool that I wrote for OpenRC – available here – was originally written as a Ruby script… then I converted it to a Perl script half the size and twice the speed. Part of it, I’m sure, is just a matter of age (Perl has been optimized over a long time, much more than Ruby), and part of it is due to them being different tools for different targets: Ruby is nowadays mostly a language for long-running software (due to webapps and so on), and it’s much more object oriented, while Perl is streamlined, top-down execution style…

I do expect to find the time to convert even my scan2pdf script to Perl (funnily enough, gscan2pdf, which inspired it, is written in Perl), although I have no idea yet when… in the meantime, though, I doubt I’ll write many more Ruby scripts for this kind of processing.

The future of Unpaper

You might have read it already, or maybe you haven’t, but it looks like Berlios is going to close down at the end of the year. I wouldn’t go as far as calling it the end of an era, but we’re pretty close. Even I have a few projects that are (still) hosted on Berlios and need to be migrated.

Interestingly, Berlios is also the original host of unpaper, which makes my forking it a couple of months ago a very good move. Especially since, had I waited too long, the repository wouldn’t have been available… even if it’s true that the repository didn’t really contain anything useful, as it was just an import of the 0.3 sources.

At any rate, since Jens didn’t reply to my inquiries, I’ve decided to start working on a more proper takeover of the project. I have created an unpaper project page on my website, to which I switched the live ebuild’s HOMEPAGE and GitHub’s website, as well as created an ohloh project to track the development.

Oh, and I released version 0.4. Yeah I guess this was the first thing to write about, but I wanted to make it less obvious.

The new release is basically just the first cleanup I worked on, so: new build system, no changes in parameters, man page and so on. Right after releasing 0.4 I merged the changes from a new contributor, Felix Janda, who not only took the time to break the code up into multiple source files, but also improved the blur filter.

Now, the next release is likely going to be 2.0; why skip the 1.0 release? Well, it’s not really skipping. Before Berlios shuts down I wanted to copy down the previous list of downloads so I can mirror those as well, and what I found is that… the original version number series started with 1.0, so it’s mucked up badly; in Gentoo we have no problem, since 0.3 was the only one we had, but for the sake of restoring consistency, the next version is going to be 2.0.

What is going to happen with that release? Well, for sure I want to rewrite the command-line option parsing. Right now it’s very custom: long options are sometimes prefixed with one dash, sometimes with two, and in general it doesn’t fit the usual Unix command lines; not counting the fact that the parsing is done as you go, with a series of strcmp() calls, which is not what one expects, usually. I intend to rewrite this with getopt_long(), but the problem with that is that it will break command-line compatibility with unpaper 0.3, which is not something I’m happy about. But we’ve got to do that sooner or later if we want a tool that blends in better.

I hope to also be able to hook into the code a different way to load and save images, using an already available image decoding and encoding library, so that it can digest images in formats other than simple PNM. In particular, I’d like to be able to execute, as a single pass, the conversion from multiple files to a multi-page TIFF document, which requires quite a bit of work indeed. But I can dream, can’t I?

In the meantime, I hope to find some time this week to find a way to generate man pages on my server so that I can publish more complete documentation for both Ruby-Elf and unpaper itself. This is likely going to be difficult, since I’m starting some new tasks next week, but… you never know.

Unpaper fork, part 2

Last month I posted a call to action hoping for help with cleaning up the unpaper code, as the original author has not updated it since 2007, and it had a few issues. While I have seen some interest in said fork and cleanup, nobody stepped up to help, so it is proceeding, albeit slowly.

What is available now in my GitHub repository is mostly cleaned up, although still not much more optimised than the original — I actually removed one of the “optimisations” I had added since the fork: the usage of the sincosf() function. As Freddie pointed out in the other post’s comments, the compiler has a better chance of optimising this itself; indeed both GCC and PathScale’s compiler optimise two sin and cos calls with the same argument into a single sincos call, which is good. And using two separate calls allows declaring the temporaries used to store the results as constant.

And indeed, today I started rewriting the functions so that temporaries are declared as constant as possible, and with the most limited scope applicable to them. This was important to me for one reason: I want to try making use of OpenMP to improve its performance on modern multicore systems. Since most of the processing is applied independently to each pixel, it should be possible for many iterative cycles to be executed in parallel.

It would also be a major win in my book if the processing of input pages were feasible in parallel as well: my current scan script has to process the scanned sheets in parallel itself, calling many copies of unpaper, just to process the sheets faster (sometimes I scan tens of sheets, such as bank contracts and similar). I just wonder if it makes sense to simply start as many threads as possible, each one handling one sheet, or if that would risk hogging the scheduler.

Finally, there is the problem of testing. Freddie also pointed me at the software I remembered for checking the differences between two image files: pdiff — which is used by the ChromiumOS build process, by the way. Unfortunately I then remembered why I didn’t like it: it uses the FreeImage library, which bundles a number of other image format libraries, and whose upstream refuses to apply sane development practices to it.

What would be nice would be to either modify pdiff to use a different library – such as libav! – to access the image data, or to find or write something similar that does not require such stupidly-designed libraries.

Speaking about image formats, it would be interesting to get unpaper to support other image formats besides PNM; this way you wouldn’t have to keep converting from and to other formats when processing. One idea that Luca gave me was to make use of libav itself to handle that part: it already supports PNM, PNG, JPEG and TIFF, so it would provide most of the features it’d need.

In the mean time, please let me know if you like how this is doing — and remember that this blog, the post and me are Flattr enabled!