Falsehoods in Tutorials: Database Schemas

It’s well possible that a number of people reading this post have already stumbled across a few of the “Falsehoods Programmers Believe…” documents. If not, there appears to be a collection of them, although I have honestly only read through the ones about names, addresses, and time. The short version of all of this, is that interfacing software with reality is complicated, and in many cases, programmers don’t know how complicated it is at all. And sometimes this turns into effectively institutional xenophobia.

I have already mused that tutorials and documentation are partially to blame, by spreading code memes and reality-hostile simplifications. But now I have some more evidence of this being the case, without me building an explicit strawman like I did last time, and that brings me to another interesting point, in regards to the raising importance of getting stuff right beforehand, as costs to correct these mistakes are raising.

You see, with lockdown giving us a lot of spare time, I spent some of it on artsy projects and electronics, while my wife spent it learning about programming, Python, and more recently databases. She found a set of tutorials on YouTube that explain the basis of what a database is, and how SQL works. And they were full of those falsehoods I just linked above.

The tutorials use what I guess is a fairly common example of using a database for employees, customers, and branches of a company. And it includes in the example the fields for first name and last name. Which frankly is a terrible mistake — with very few exception that include banks and airlines, there’s no need to distinguish between components of a name, and a simple full name field would work just as well, and don’t end up causing headaches to people from cultures that don’t split names the same way. The fact that I recently ranted about this on Twitter against VirusTotal is not totally coincidental.

It goes a bit beyond that though, by trying to explain ON DELETE triggers by attaching them to the deletion of an employee from the database. Now, I’m not a GDPR lawyer, but it’s my understanding that employee rosters are one of those things that you’re allowed to keep for essential business needs — and you most likely don’t want to ever delete employees, their commissions payment history, and tax records.

I do understand that a lot of tutorials need to be using simple examples, as setting up a proper HR-compatible database would probably take a lot more time, particularly with compartmentalizing information so that your random sales analyst don’t have access to the home phone numbers of their colleagues.

I have no experience with designing employee-related database schemas, so I don’t really want to dig myself into a hole I can’t come out of, by running with this example. I do have experience with designing database schemas for product inventory, though, so I will run with that example. I think it was a different tutorial that was talking about those, but I’ll admit I’m not sure, because I didn’t pay too much attention as I was getting annoyed at the quality.

So this other tutorial focused on products, orders and sales total — its schema was naïve and not the type of databases any real order history system would use — noticeably, it assumed that an order would just need to connect with the products, with the price attached to the product row. In truth, most databases like those would need to attach the price for which an item was sold to the order — because products change prices over time.

And at the same time, it’s fairly common to want to keep the history of price changes for an item, which include the ability to pre-approve time-limited discounts, so a table of products is fairly unlikely to have the price for each item as a column. Instead, I’ve commonly seen these database to have a prices table that references the items, and provides start and end dates for the price. This way, it’s possible to know at any time what is the “valid price” for an item. And as some of my former customers had to learn on their own, it’s also important to separate which VAT is used at which time.

Example ER diagram showing an example of a more realistic shop database.

There are five tables. * indicates the primary key.

Order (*ID, Customer_ID, Billing_Address, Shipping_Address)
Order_Products(*Order_ID, *Product_ID, Gross_Price, VAT_Rate)
Product(*ID, Name)
Product_VAT(*Product_ID, *Start_Date, End_Date, VAT_Rate)
Product_ID(*Product_ID, *Start_Date, End_Date, Gross_Price)

This is again fairly simplified. Most of the shopping systems you might encounter use what might appear redundant, particularly when you’re taught that SQL require normal form databases, but that’s just in theory — practice is different. Significantly so at times.

Among other things, if you have an online shop that caters to multiple countries within the European Union, then your table holding products’ VAT information might need to be extended to include the country for each one of them. Conversely, if you are limited to accounting for VAT in a single country you may be able to reduce this to VAT categories — but keep in mind that products can and do change VAT categories over time.

Some people might start wondering now why would you go through this much trouble for an online store, that only needs to know what the price is right now. That’s a good point, if you happen to have multiple hundreds’ megabytes of database to go through to query the current price of a product. In the example above you would probably need a query such as

SELECT Product.ID, Product.Name, Product_Price.Gross_Price, Product_VAT.VAT_Rate
FROM Product
  LEFT JOIN Product_Price ON Product_Price.Product_ID = Product.ID
  LEFT JOIN Product_VAT ON Product_VAT.Product_ID = Product.ID
WHERE
  Product.ID = '{whatever}' AND
  Product_Price.Start_Date <= TODAY() AND
  Product_Price.End_Date > TODAY() AND
  Product_VAT.Start_Date <= TODAY() AND
  Product_VAT.End_Date > TODAY();

It sounds like an expensive query, doesn’t it? And it seems silly to go and scan the price and VAT tables all the time throughout the same day. It also might be entirely incorrect, depending on its placement — I do not know the rules of billings, but it may very well be possible that an order be placed close to a VAT change boundary, in which case the customer could have to pay the gross price at the time of order, but the VAT at shipping time!

So what you do end up using in many places for online ordering is a different database. Which is not the canonical copy. Often the term used for this is ETL, which stands for Extract, Transform, Load. It basically means you can build new, read-only tables once a day, and select out of those in the web frontend. For instance the above schema could be ETL’d to include a new, disconnected WebProduct table:

The same ER diagram as before, but this time with an additional table:

WebProduct(*ID, *Date, Name, Gross_Price, VAT_Rate)

Now with this table, the query would be significantly shorter:

SELECT ID, Name, Gross_Price, VAT_Rate
FROM WebProduct
WHERE ID = '{whatever}' AND Date = TODAY();

The question that comes up with seeing this schema is “Why on Earth do you have a Date column as part of the primary key, and why do you need to query for today’s date?” I’m not suggesting that the new table is generated to include every single day in existence, but it might be useful to let an ETL pipeline generate more than just one day’s worth of data — because you almost always want to generate today’s and tomorrow’s, that way you don’t need to take down your website for maintenance around midnight. But also, if you don’t have any expectation that prices will fluctuate on a daily basis, it would be more resource-friendly to run the pipeline every few days instead of daily. It’s a compromise of course, but that’s what system designing is there for.

Note that in all of this I have ignored the issue of stock. That’s a harder problem, and one that might not actually be suited to be solved with a simple database schema — you need to come to terms with compromises around availability and the fact that you need a single source of truth for how many items you’re allowed to sell… consistency is hard.

Closing my personal rant on database design, there’s another problem I want to point a spotlight to. When I started working on Autotools Mythbuster, I explicitly wanted to be able to update the content, quickly. I have had multiple revisions of the book on the Kindle Store and Kobo, but even those lagged behind the website a few times. Indeed, I think the only reason why they are not lagging behind right now is that most of the changes on the website in the past year or two have only been cosmetics, and not applying to ePub.

Even for a project like that, which uses the same source of truth for the content, there’s a heavy difference in the time cost of updating the website rather than the “book”. When talking about real books, that’s an even bigger cost — and that’s without going into the print books realm. Producing content is hard, which is why I realised many years ago that I wouldn’t have the ability to carve out enough time to make a good author.

Even adding diagrams to this blog post has a slightly higher cost than just me ranting “on paper”. And that’s why sometimes I could add more diagrams with my ideas, but I don’t, because the cost of producing it, and keeping it current would be too high. The Glucometers Protocols site as a few rough diagrams, but they are generated with blockdiag so that they can be edited quickly.

When it comes to online tutorial, though, there’s an even bigger problem: the possibly vast majority of them are, nowadays, on YouTube, as videos shot with a person in frame, to be more like a teacher in a classroom, that can explain things. If something in the video is only minimally incorrect, it’s unlikely that those videos would be re-shot — it would be an immense cost in time. Also, you can’t just update a YouTube video like you do a Kindle book — you lose comments, likes, view-counts, and those things matter for monetization, which is what most of those tutorials out there are made for. So unless the mistakes in a video-tutorial are Earth-shattering, it’s hard to expect the creators to go and fix them.

Which is why I think that it’s incredibly important to get the small things right — Stop using first and last name fields in databases, objects, forms, and whatever else you are teaching people to make! Think a bit harder as for how a product inventory database would look like! Be explicit in pointing out that you’re simplifying to an extreme, rather than providing a real-world-capable design of a database! And maybe, just maybe, start using examples that are ridiculous enough that they don’t risk being used by a junior developer in the real world.

And let me be clear on this: you can’t blame junior developers for making mistakes such as using a naïve database schema, if that’s all they are taught! I have been saying this at previous dayjob for a while: you can’t complain about the quality of code of newbies unless you have provided them with the right information in the documentation — which is why I spent more time than average on example code, and tutorials, to fix up trimmings and make it easier to copy-paste the example code into a working change that follows best practices. In the words of a colleague wiser than me: «Example code should be exemplar.»

So save yourself some trouble in the future, by making sure the people that you’re training get the best experience, and can build your own next tool to the best of specs.

Are tutorials to blame for basic IT problems?

It’s now effectively impossible to spend a month following IT (and not just) new and not hear of breaches, “hacks”, or general security fiascos. Some of these are tracked down to very basic mistakes in configuration or coding of software, including the lack of hashing of passwords in database. Everyone in the industry, including me, have at some point expressed the importance of proper QA and testing, and budgeting for them in the development process. But what if the problem is much higher up the chain?

Falsehoods Programmers Believe About Names is now over seven years old, and yet my just barely complicated full name (first name with a space in it, surname with an accent) can’t be easily used by most of the services I routinely use. Ireland was particularly interesting, as most services would support characters in the “Latin extended” alphabet, due to the Irish language use of ó, but they wouldn’t accept my surname, which uses ò — this not a first, I had trouble getting invoices from French companies before because they only know about ó as a character.

In a related, but not directly connected topic, there are the problems an acquaintance of mine keeps stumbling across. They don’t want service providers to attach a title to their account, but it looks like most of the developers that implement account handling don’t actually think about this option at all, and make it hard to not set a honorific at all. In particular, it appears not only UIs tend to include a mandatory drop-down list of titles, but the database schema (or whichever other model is used to store the information) also provides the title as an enumeration within a list — that is apparent by the way my acquaintance has had their account reverted to a “default” value, likely the “zeroth” one in the enumeration.

And since most systems don’t end up using British Airways’s honorific list but are rather limited to the “usual” ones, that appears to be “Ms” more often than not, as it sorts (lexicographically) before the others. I have had that happen to me a couple of times too, as I don’t usually file the “title” field on paper forms (I never seen much of a point of it), and I guess somewhere in the pipeline a model really expects a person to have a title.

All of this has me wondering, oh-so-many times, why most systems appear to want to store a name in separate entries for first and last name (or variation thereof), and why they insist on having a honorific title that is “one of the list” rather than a freeform (which would accept the empty string as a valid value). My theory on this is that it’s the fault of the training, or of the documentation. Multiple tutorials I have read, and even followed, over the years defined a model for a “person” – whether it is an user, customer, or any other entity related to the service itself – and many of these use the most basic identifying information about a person as fields to show how the model works, which give you “name”, “surname”, and “title” fields. Bonus points to use an enumeration for the title rather than a freeform, or validation that the title is one of the “admissible” ones.

You could call this a straw man argument, but the truth is that it didn’t take me any time at all to find an example tutorial (See also Archive.is, as I hope the live version can be fixed!) that did exactly that.

Similarly, I have seen sample tutorial code explaining how to write authentication primitives that oversimplify the procedure by either ignoring the salt-and-hashing or using obviously broken hashing functions such as crypt() rather than anything solid. Given many of us know all too well how even important jobs that are not flashy enough for a “rockstar” can be pushed into the hands of junior developers or even interns, I would not be surprised if a good chunk of these weak authentication problems that are now causing us so much pain are caused by simple bad practices that are (still) taught to those who join our profession.

I am afraid I don’t have an answer of how to fix this situation. While musing, again on Twitter, the only suggestion for a good text on writing correct authentication code is the NIST recommendations, but these are, unsurprisingly, written in a tone that is not useful to teach how to do things. They are standards first and foremost, and they are good, but that makes them extremely unsuitable for newcomers to learn how to do things correctly. And while they do provide very solid ground for building formally correct implementations of common libraries to implement the authentication — I somehow doubt that most systems would care about the formal correctness of their login page, particularly given the stories we have seen up to now.

I have seen comments on social media (different people on different media) about what makes a good source of documentation changes depending on your expertise, which is quite correct. Giving a long list of things that you should or should not do is probably a bad way to introduce newcomers to development in general. But maybe we should make sure that examples, samples, and documentation are updated so that they show the current best practice rather than overly simplified, or artificially complicated (sometimes at the same time) examples.

If you’re writing documentation, or new libraries (because you’re writing documentation for new libraries you write, right?) you may want to make sure that the “minimal” example is actually the minimum you need to do, and not skip over things like error checks, or full initialisation. And please, take a look at the various “Falsehoods Programmers Believe About” lists — and see if your example implementation make those assumptions. And if so fix them, please. You’ll prevent many mistakes from happening in real world applications, simply because the next junior developer who gets hired to build a startup’s latest website will not be steered towards the wrong implementations.