After my challenge with the fused multiply-add instructions I managed to find some time to write a new test utility. It’s written ad hoc for unpaper but it can probably be used for other things too. It’s trivial and stupid but it got the job done.
What it does is simple: it loads both a golden and a result image files, compares the size and format, and then goes through all the bytes to identify how many differences are there between them. If less than 0.1% of the image surface changed, it consider the test a pass.
It’s not a particularly nice system, especially as it requires me to bundle some 180MB of golden files (they compress to just about 10 MB so it’s not a big deal), but it’s a strict improvement compared to what I had before, which is good.
This change actually allowed me to explore one change that I abandoned before because it resulted in non-pixel-perfect results. In particular, unpaper now uses single-precision floating points all over, rather than doubles. This is because the slight imperfection caused by this change are not relevant enough to warrant the ever-so-slight loss in performance due to the bigger variables.
But even up to here, there is very little gain in performance. Sure some calculation can be faster this way, but we’re still using the same set of AVX/FMA instructions. This is unfortunate, unless you start rewriting the algorithms used for searching for edges or rotations, there is no gain to be made by changing the size of the code. When I converted unpaper to use libavcodec, I decided to make the code simple and as stupid as I could make it, as that meant I could have a baseline to improve from, but I’m not sure what the best way to improve it is, now.
I still have a branch that uses OpenMP for the processing, but since most of the filters applied are dependent on each other it does not work very well. Per-row processing gets slightly better results but they are really minimal as well. I think the most interesting parallel processing low-hanging fruit would be to execute processing in parallel on the two pages after splitting them from a single sheet of paper. Unfortunately, the loops used to do that processing right now are so complicated that I’m not looking forward to touch them for a long while.
I tried some basic profile-guided optimization execution, just to figure out what needs to be improved, and compared with codiff a proper release and a PGO version trained after the tests. Unfortunately the results are a bit vague and it means I’ll probably have to profile it properly if I want to get data out of it. If you’re curious here is the output when using rbelf-size -D on the unpaper binary when built normally, with profile-guided optimisation, with link-time optimisation, and with both profile-guided and link-time optimisation:
% rbelf-size -D ../release/unpaper ../release-pgo/unpaper ../release-lto/unpaper ../release-lto-pgo/unpaper
    exec         data       rodata        relro          bss     overhead    allocated   filename
   34951         1396        22284            0        11072         3196        72899   ../release/unpaper
   +5648         +312         -192           +0         +160           -6        +5922   ../release-pgo/unpaper
    -272           +0        -1364           +0         +144          -55        -1547   ../release-lto/unpaper
   +7424         +448        -1596           +0         +304          -61        +6519   ../release-lto-pgo/unpaper
It’s unfortunate that GCC does not give you any diagnostic on what it’s trying to do achieve when doing LTO, it would be interesting to see if you could steer the compiler to produce better code without it as well.
Anyway, enough with the microptimisations for now. If you want to make unpaper faster, feel free to send me pull requests for it, I’ll be glad to take a look at them!
did you try -fdump-ipa-all (and perhaps -fdump-tree-all)?
I haven’t had a chance to really reply on this point until now, but the kosher way to compare floats is to verify that their difference is less than some value.They way you describe them, it sounds like you could have 0.1% of distortion concentrated in one area of the image, with the rest of the image pristine, and still have it pass. That doesn’t sound terribly good.Your unit tests ought to be able to be much stricter by doing per-pixel comparisons ( (abs(golden-new)<epsilon) ?=”” pass=”” :=”” fail=”” ),=”” and=”” then=”” checking=”” for=”” any=”” fail=”” pixels.=”” (sum=”” all=”” the=”” pixels=”” in=”” the=”” diff=”” image,=”” if=”” summed=”” value=”” is=”” greater=”” than=”” 0,=”” fail.=”” might=”” be=”” faster=”” than=”” a=”” memcmp,=”” too,=”” since=”” i=”” think=”” memcmp=”” has=”” to=”” use=”” more=”” conditionals.)=”” this=”” is=”” all=”” without=”” having=”” looked=”” at=”” the=”” code=”” (which=”” i=”” don’t=”” have=”” time=”” atm=”” to=”” do),=”” so=”” take=”” it=”” with=”” a=”” grain=”” of=”” salt.=””>
No I haven’t tried dumping the internal state, I guess I should do that and see what I can get.Michael, the problem is a bit more complex. There is no float in the output and the problem happens vastly on B/W images so there is no way to just say “this could be 0.1% darker or lighter”.Admittedly, I’d prefer a smarter algorithm that would detect a 1-pixel transpose to be negligible (right now it’s not). But for now it’ll have to do 🙂