Today, even though it’s New Year’s Eve, I’ve spent it working just like any other day, looking through the analysis log of my linking-collisions script to find some more crappy software in need of fixes. As it turns out, I found quite a bit of software, but I also confirmed to myself that I have crappy database skills.
The original output of the script, which already took quite a long time to produce, didn’t sort the symbols by name, only by collision count, so that the symbols with the most collisions showed up first and the ones related to just one or two files came later. It also didn’t sort the names of the objects where the symbols could be found, which was quite a problem: from time to time the sorting changed, so the list of objects wasn’t easy to compare between symbols.
Yesterday I added sorting on both fields so that I’d have a more pleasant log to read, but that slowed the script down tremendously. At that point I noticed that maybe, just maybe, PostgreSQL wasn’t optimising my tables, even though I had created views in the hope it would be smart enough to use them for optimisation. So I created two indexes, one on the names of the objects and one on the names of the symbols, with the default handler (btree).
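For reference, the two indexes amount to something like this; the table and column names here are placeholders of my own, not the actual schema:

```sql
-- Two plain btree indexes, one per name column (btree is the
-- default handler anyway); table and column names are placeholders,
-- not the real schema.
CREATE INDEX objects_name_idx ON objects USING btree (name);
CREATE INDEX symbols_name_idx ON symbols USING btree (name);
```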
The harvesting process slowed down by a good 50%: instead of taking less than 40 minutes, it took about an hour. But then when I launched the analysis script, it generated the whole 30MB log file in a matter of minutes rather than hours. I had never been able to let the analysis script complete its work before, and now it finished in minutes.
I have no problem admitting that my database skills suck, which is probably why I’m much more of a system developer than a webapp developer.
Now at least I won’t have many more doubts about adding a way to automatically expand “multimplementations”: with the speed it has now, I can easily get it to merge in the data from the third table without many issues. But still, seeing how pointless my SQL skills are, I’d like to ask for some help on how to deal with this.
Basically, I have a table of paths, each of which refers to a particular object, which I call a “multimplementation” (it groups together all the symbols related to a particular library, ignoring things like ABI versioning and different sub-versions). For each multimplementation I have to produce a descriptive name to report to users. When there is just one path linked to the object, that path should be used; when there are two paths, the name of the object plus the two paths should be used; for more than two paths, the object name and the path of the first object should be used, with an ellipsis to indicate that there are more.
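To make the rule concrete, here is a sketch of the query I have in mind; objects(id, name) and paths(object_id, path) are placeholder names rather than the actual schema, and I’m taking “first path” to mean the alphabetically first one:

```sql
-- Sketch of the display-name rule; table and column names are
-- placeholders, not the actual schema.
SELECT o.name,
       CASE COUNT(p.path)
         -- one path: the path itself is the descriptive name
         WHEN 1 THEN MIN(p.path)
         -- two paths: object name plus both paths
         -- (MIN and MAX cover both, since there are exactly two)
         WHEN 2 THEN o.name || ' (' || MIN(p.path) || ', ' || MAX(p.path) || ')'
         -- more than two: object name, the (alphabetically) first path,
         -- and an ellipsis to show there are more
         ELSE o.name || ' (' || MIN(p.path) || ', ...)'
       END AS display_name
  FROM objects o
  JOIN paths p ON p.object_id = o.id
 GROUP BY o.id, o.name;
```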
If you want to see the actual schema, you can find it in ruby-elf’s repository, in the tools directory.
There are more changes to the database that I should make to connect the paths (and thus the objects) to the package names more easily, but at least at the speed it runs now it seems feasible to run these checks on a more regular basis on the tinderbox. If only I could find an easy way to do incremental harvesting, I might be able to run it on my actual system too.