Today, even though it’s New Year’s Eve, I’ve spent the day working just like any other, looking through the analysis log of my linking collisions script to find some more crappy software in need of fixes. As it turns out, I found quite a bit of software, but I also confirmed to myself that I have crappy database skills.
The original output of the script, which already took quite a long time to produce, didn’t sort the symbols by name but only by count, so as to show the symbols with the most collisions first and the ones related to just one or two files later. It also didn’t sort the names of the objects where the symbols could be found, which caused quite an issue: from time to time the ordering changed, so the list of elements wasn’t easy to compare between symbols.
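In SQL terms the old report did roughly the following; the table and column names here are made up for the sake of example, not the actual schema:

    -- ordered by collision count alone; the per-symbol object lists
    -- come out in whatever order the planner happens to produce them
    SELECT symbol, COUNT(*) AS occurrences
      FROM symbols
     GROUP BY symbol
     ORDER BY occurrences DESC;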
Yesterday I added sorting on both fields so that I’d have a more pleasant log to read, but it slowed the script down tremendously. At that point I realised that maybe, just maybe, PostgreSQL wasn’t optimising my tables by itself: I had created views in the hope of it being smart enough to use them as optimisation hints, but that clearly wasn’t enough. So I created two indexes, one on the name of the objects and one on the name of the symbols, with the default access method (btree).
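Continuing with the made-up names from above, the change amounts to something like this:

    -- sort by count first, then by name, so the output is stable
    SELECT symbol, COUNT(*) AS occurrences
      FROM symbols
     GROUP BY symbol
     ORDER BY occurrences DESC, symbol;

    -- CREATE INDEX uses btree by default
    CREATE INDEX objects_name_idx ON objects (name);
    CREATE INDEX symbols_name_idx ON symbols (symbol);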
The harvesting process slowed down by a good 50%: instead of taking less than 40 minutes, it took about an hour. But when I launched the analysis script, it generated the whole 30MB log file in a matter of minutes rather than hours; I had never been able to let the analysis script complete its work before, and now it did it in minutes.
I have no problem admitting that my database skills suck, which is probably why I’m much more of a systems developer than a webapp developer.
Now at least I won’t have many more doubts about adding a way to automatically expand “multimplementations”: at the speed it runs now, I can easily get it to merge in the data from the third table without many issues. But still, seeing how poor my SQL skills are, I’d like to ask for some help on how to deal with this.
Basically, I have a table with paths, each of which refers to a particular object, which I call a “multimplementation” (it groups together all the symbols related to a particular library, ignoring things like ABI versioning and different sub-versions). For each multimplementation I have to produce a descriptive name to report to users. When there is just one path linked to that object, that path should be used; when there are two paths, the name of the object plus the two paths should be used; for more than two paths, the object name and the first path should be used, with an ellipsis to indicate that there are more.
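Something like the following sketch is what I’m after; again, the table and column names (multimplementations, paths) are simplified stand-ins for the sake of the question, not the actual schema:

    -- one path: the path itself; two: object name plus both paths;
    -- more: object name, the first path and an ellipsis.
    -- MIN/MAX yield both paths when there are exactly two.
    SELECT m.id,
           CASE COUNT(p.path)
             WHEN 1 THEN MIN(p.path)
             WHEN 2 THEN m.name || ' (' || MIN(p.path) || ', ' || MAX(p.path) || ')'
             ELSE        m.name || ' (' || MIN(p.path) || ', ...)'
           END AS description
      FROM multimplementations m
      JOIN paths p ON p.multimplementation_id = m.id
     GROUP BY m.id, m.name;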
If you want to see the actual schema, you can find it in ruby-elf’s repository, in the tools directory.
There are more changes to the database that I should make so that connecting the paths (and thus the objects) to the package names becomes much more feasible, but at least at the speed it runs now it seems feasible to run these checks on a more regular basis on the tinderbox. If only I could find an easy way to do incremental harvesting, I might well be able to run it on my actual system too.
I’m rather good with DB schemas; I normally use MySQL, but I think PG is similar enough. If you want a hand we can talk about it; I’ll need the actual schema with some data inside.
Why not run the harvesting process before you create the indexes, and then create the indexes before you run the analysis script? I doubt it’ll take Postgres 20 minutes to add the index.
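Something along these lines, with the index and table names just invented as an example:

    DROP INDEX IF EXISTS objects_name_idx;
    DROP INDEX IF EXISTS symbols_name_idx;
    -- ... run the harvesting here ...
    CREATE INDEX objects_name_idx ON objects (name);
    CREATE INDEX symbols_name_idx ON symbols (symbol);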
Putting an index on the non-unique name is not of much use; you usually wouldn’t get much advantage from it. Since name+abi is unique, it would be better to use a unique index on the name and abi columns. Here’s what I’d do:

    db.exec("CREATE SEQUENCE objects_seq START 1 CACHE 100")
    db.exec("CREATE TABLE objects ( id INTEGER PRIMARY KEY DEFAULT NEXTVAL('objects_seq'), name VARCHAR(4096), abi VARCHAR(255) )")
    db.exec("CREATE UNIQUE INDEX objects_name_1 ON objects(id)")
    db.exec("CREATE UNIQUE INDEX objects_name_2 ON objects(name, abi)")
    db.exec("CREATE SEQUENCE obj_prop_seq START 1 CACHE 100")
    db.exec("CREATE TABLE object_attribs ( id INTEGER PRIMARY KEY DEFAULT NEXTVAL('obj_prop_seq'), object_id INTEGER REFERENCES objects(id) ON DELETE CASCADE, path VARCHAR(4096), symbol TEXT )")
    db.exec("CREATE UNIQUE INDEX symbols_symbol_1 ON object_attribs(object_id, symbol, path)")
    db.exec("CREATE UNIQUE INDEX symbols_symbol_2 ON object_attribs(object_id, symbol)")
    db.exec("CREATE UNIQUE INDEX symbols_symbol_3 ON object_attribs(id)")

A query on name that would use the index on objects would look like this:

    SELECT * FROM objects WHERE name = 'needed name' AND (abi LIKE '%' OR abi IS NULL)

This would use the unique index on name+abi and would at worst do a full index scan. If you were searching for the name and abi with a specific symbol, you’d do:

    SELECT * FROM objects, object_attribs
     WHERE objects.id LIKE '%'
       AND objects.id = object_attribs.object_id
       AND object_attribs.object_id LIKE '%'
       AND symbol = 'needed symbol'

This should first join the two tables on the two unique indexes and then filter on the object_attribs unique index. If you want to use the path as the search spec, you’d do:

    SELECT * FROM objects, object_attribs
     WHERE objects.id LIKE '%'
       AND objects.id = object_attribs.object_id
       AND object_attribs.object_id LIKE '%'
       AND (symbol LIKE '%' OR symbol IS NULL)
       AND path LIKE '/usr/%'

If you didn’t add the symbol specification you wouldn’t be using the index and would do a full table access, which would be quite a big cost for the db engine to filter. This way you should get a full index scan with a much smaller size to read and a reduced cost for the db.

Note that unique indexes also put constraints on the table columns: in the object_attribs table you won’t be permitted to have a duplicate row with the same object_id (which is a primary key on the objects table and thus unique), the same symbol and the same path, so you’ll have a unique path for the same object that might contain different symbols. Also, the attributes of an object usually go in one single table, not split around different tables, for better data handling.

You should also use the NOT NULL constraint where it applies to fields in the tables; this might help the performance of the indexes.

The two sequences will help you limit scripting on insertions and will maintain your primary keys as primary keys. Using a sequence cache lets you fetch nextvals faster than with non-cached ones (set its size based on the number of values you usually insert into the tables); do some tests on the speed and see which cache value is best.

If you still want to, you’d be able to manually insert data into the primary keys of the tables without issues, but you should be aware that the sequence won’t know about those values, so it might later generate the same value and you’d get a primary key violation.