I’m glad I’m not a DBA

Today, even though it’s New Year’s Eve, I’ve spent the day working just like any other, looking through the analysis log of my linking collisions script to find some more crappy software in need of fixes. As it turns out, I found quite a bit of software, but I also confirmed to myself that I have crappy database skills.

The original output of the script, which already took quite a long time to process, didn’t sort the symbols by name but just by count, so as to show the symbols with the most collisions first and the ones related to only one or two files later. It also didn’t sort the names of the objects where the symbols could be found, which caused quite an issue: from time to time the order changed, so the list of objects wasn’t easy to compare between symbols.

Yesterday I added sorting on both fields so that I could have a more pleasant log to read, but it caused the script to slow down tremendously. At which point I noticed that maybe, just maybe, PostgreSQL didn’t optimise my tables, even though I had created views in the hope that it would be smart enough to use them as optimisation hints. So I created two indexes, one on the name of the objects and one on the name of the symbols, with the default method (btree).
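To make the indexing step concrete, here is a minimal sketch. It uses Python’s built-in sqlite3 module as a stand-in for PostgreSQL, and the table, column and index names are made up for the example (the real schema is in ruby-elf’s repository). On PostgreSQL the statement would look like `CREATE INDEX idx_symbols_symbol ON symbols USING btree (symbol)`, and `EXPLAIN` plays the role of the query plan check below.

```python
import sqlite3

# Hypothetical two-table layout, only for illustration; not the real schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE objects (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE symbols (object INTEGER REFERENCES objects(id),
                          symbol TEXT NOT NULL);
    -- One index per column that is used for sorting and lookups.
    CREATE INDEX idx_objects_name ON objects (name);
    CREATE INDEX idx_symbols_symbol ON symbols (symbol);
""")


def plan_for(query, params=()):
    """Return the planner's description of how the query would be executed."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + query, params).fetchall()
    # The last column of each row holds the human-readable detail.
    return " | ".join(row[-1] for row in rows)


lookup_plan = plan_for("SELECT object FROM symbols WHERE symbol = ?", ("foo",))
print(lookup_plan)
```

If the index is picked up, its name shows up in the printed plan; if the plan still reports a full table scan, the index isn’t helping that query.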

The harvesting process has now slowed down by a good 50%: instead of taking less than 40 minutes, it took about an hour. But when I launched the analysis script, it generated the whole 30MB log file in a matter of minutes rather than hours. I had never been able to let the analysis script complete its work before, and now it did so in minutes.

I have no problem saying that my database skills suck, which is probably why I’m much more of a system developer than a webapp developer.

Now at least I won’t have many more doubts about adding a way to automatically expand “multimplementations”: with the speed it has now, I can well get it to merge in the data from the third table without many issues. But still, seeing how poor my SQL skills are, I’d like to ask for some help on how to deal with this.

Basically, I have a table with paths, each of which refers to a particular object that I call a “multimplementation” (it groups together all the symbols related to a particular library, ignoring things like ABI versioning and different sub-versions). For each multimplementation I have to get a descriptive name to report to users. When there is just one path linked to that object, that path should be used; when there are two paths, the name of the object plus the two paths should be used; for more than two paths, the object name and the first path should be used, with an ellipsis to indicate that there are more.
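The naming rule itself is easy to pin down outside of SQL. The following is only a sketch of the rule as described above, done on the application side in Python rather than in the database; the function name, the exact output format and the choice of sorting the paths to decide which one is “first” are all assumptions of mine.

```python
def describe(object_name, paths):
    """Build the user-visible name of a multimplementation from its paths."""
    # Assumption: "first" means first in sorted order, so that the result
    # stays the same between runs.
    ordered = sorted(paths)
    if not ordered:
        raise ValueError("a multimplementation needs at least one path")
    if len(ordered) == 1:
        # A single path is descriptive enough on its own.
        return ordered[0]
    if len(ordered) == 2:
        return f"{object_name} ({ordered[0]}, {ordered[1]})"
    # More than two paths: show the first and hint that there are more.
    return f"{object_name} ({ordered[0]}, ...)"


print(describe("libfoo", ["/usr/lib/libfoo.so.1"]))
print(describe("libfoo", ["/usr/lib/libfoo.so.2", "/usr/lib/libfoo.so.1"]))
print(describe("libfoo", ["/opt/c", "/opt/a", "/opt/b"]))
```

Doing the same thing in a single query would still need the count of paths per object and a deterministic ordering, which is exactly the part I’d like help with.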

If you want to see the actual schema, you can find it on ruby-elf’s repository in the tools directory.

There are more changes that I should make to the database to make it much more feasible to connect the paths (and thus the objects) to the package names, but at least, with the speed it has now, it seems feasible to run these checks on a more regular basis on the tinderbox. If only I could find an easy way to have incremental harvesting, I might as well be able to run it on my actual system too.

Some tips for both students and mentors for SoC (but not limited to)

While I was writing my rant about last year’s SoC, I started to think of some advice for both students and mentors of Google Summer of Code.

I have to say first off that I didn’t participate actively as a mentor in 2006 (I was a backup), and I didn’t participate at all last year, so you have to take these as an outsider’s suggestions, but, I think, those of a quite experienced outsider by now.

  • Work in advance. Don’t wait till your application is accepted. Plan ahead: if you intend to participate, start working already! Find an interesting idea, and start fleshing out the details. How are you going to work on it? Can you already prepare a few use case diagrams? Can you design an interface already? This is an investment even if you don’t get accepted: the goal of SoC is to put you into a real-world environment, and doing this work is the first step. Think of it like trying to sell an idea to your superior, or to a new company you want to work for. Also, it should give you quite an edge, showing you care about the experience more than the money (which in turn should mean you might actually continue working on the thing afterward).
  • Ask around! An important skill for any developer, not limited to Free Software developers, is being able to ask the right questions. If you’re a free software developer, one day you will most likely have to ask people who worked on issues similar to your own, colleagues and the like, so that you don’t have to reinvent the wheel every time. Searching documentation is cool, but it’s not always going to cut it, as you might not find any reference to what you want to know. If you need information, you have to ask not only your mentor, but whoever has the information you need. Your mentor is supposed to know more than you about the project you’re working on, but if that is not the case you have to find someone who can give you the information. Trying to work without knowing the conventions and the like is not going to produce good results.
  • If you’re working on a testable project (a library, or a non-interactive tool), write testcases: doing so will let your mentor tell that your work is proceeding correctly, and will give you a way to make sure you don’t end up breaking what you wrote a week before. In addition, I’d suggest you write one test each time you find an error in the behaviour, even if that means you end up with hundreds of tests!
  • Profile your code. Profiling is an important task for making sure of the quality of the code; while in most universities you’re done once you’ve written a working solution, or a working and not over-complex solution, in the real world you have to write code that doesn’t suck. And that means it has to run in a timely fashion without abusing memory. Try looking around in my blog about memory usage and cowstats, for instance. You need to learn how to use tools like that, smem, valgrind, and so on. This is the best time!
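To illustrate the “one test per error found” suggestion from the list above, here is a small sketch using Python’s unittest module. The helper function and the bug it guards against are invented for the example; the point is only the shape: each misbehaviour you discover gets its own test, which stays in the suite from then on.

```python
import unittest


def strip_version(symbol):
    """Hypothetical helper: drop a version suffix such as '@GLIBC_2.2.5'."""
    # partition() cuts at the first '@', so both 'name@VER' and the
    # default-version form 'name@@VER' yield just 'name'.
    return symbol.partition("@")[0]


class StripVersionTests(unittest.TestCase):
    def test_plain_symbol(self):
        self.assertEqual(strip_version("memcpy"), "memcpy")

    def test_versioned_symbol(self):
        self.assertEqual(strip_version("memcpy@GLIBC_2.2.5"), "memcpy")

    def test_default_version_symbol(self):
        # Imagined regression test: added once the '@@' form was seen to
        # misbehave, and kept so the problem can't silently come back.
        self.assertEqual(strip_version("memcpy@@GLIBC_2.14"), "memcpy")


def run_suite():
    """Run the tests above and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(StripVersionTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_suite()
```

A mentor can run the same suite to check progress without having to read every commit.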

Mentors should look at the points above, and see what they can do to help their students. Be around to answer their questions, and point them to the right person to ask if you can’t answer them. Make sure you know how test suites work in your student’s language of choice; this way you can judge whether the tests are done right or are just testing behaviour that is already known to work. Also, try to figure out patterns that are not yet tested for, and ask the student to test those.

Up to now the suggestions refer to any organisation and project involved in Summer of Code, not just Gentoo. So feel free to link this post (or quote it, please still reference my blog though) to your students.

As it might be that not all the developers signing up as mentors will have time to do all of the above, I’m at least going to help by trying to stay around for the students. This means that if you have any questions, especially related to C or Linux programming (ELF files, memory usage, compiled code analysis, both static and dynamic), feel free to contact me. In the worst case I’ll have to answer “ask me again tomorrow, I’m busy” or “sorry, I don’t know this, it’s not my area of expertise”. It’s worth a try :)

(Joshua, Alec, whoever, feel free to link this blog in the SoC page).