You may remember I already had a go at tutorials, after listening in on one that my wife had been going through. Well, she’s now learning about C after hearing me moan about higher- and lower-level languages, and she did that by starting with Harvard’s CS50 class, which is free to “attend” on edX. I am famously not a big fan of academia, but I didn’t think it would make my blood boil as much as it did.
I know that it’s easy to rant and moan about something that I’m not doing myself. After all, you could say “Well, they are teaching at Harvard, you are just ranting on a C-list blog that is followed by fewer than a hundred people!” and you would be right. But at the same time, I have over a decade of experience in the industry, and my rants explicitly contrast what they say in the course with what “we” do, whether it is in open-source projects or in a bubble.
I think the first time I found myself boiling and got onto my soapbox was when the teacher said that the right “design” (they keep calling it design, although I would argue it’s style) for a single-source-file program is to have the includes, followed by the declarations of all the functions, followed by main(), followed by the definitions of all the functions. That is not something I’ve ever seen happen in my experience, because it doesn’t really make much sense: duplicating declarations and definitions in C is already an unfortunate chore because of headers, so why force even more of it in the same source file?
Indeed, one of my “pre-canned comments” in reviews at my previous employer was a long-form of “Define your convenience functions before calling them. I don’t want to have to jump around to see what your doodle_do() function does.” Now, it is true that in 2020 we have the technology (VSCode’s “show definition” curtain is one of the most magical tools I can think of), but if you’re anything like me, you may even sometimes print out the source code to read it, and having it flow in natural order helps.
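To make the point concrete, here’s a minimal sketch (with a made-up greet() helper) of the order I’d expect a single-file program to read in: helpers defined before they are used, main() at the bottom, and no duplicated declarations.

```c
#include <stdio.h>

/* Define the helper before its first use: no separate forward
 * declaration is needed, and the file reads top to bottom. */
static void greet(const char *name)
{
    printf("Hello, %s!\n", name);
}

int main(void)
{
    greet("world");
    return 0;
}
```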
But that was just the beginning. Some time later, as I dropped by to see how things were going, I saw a strange string type throughout the code. It turns out they have a special header, which they (later) describe as “training wheels”, that includes typedef char *string. Possibly understandable, given that it takes some time to get to arrays, pointers, and from there to character arrays, but… could it have been called something other than string, given the all-too-similarly named std::string of C++?
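For the sake of illustration, this is roughly what that training-wheels typedef boils down to, and why it rubs me the wrong way: the friendly name hides the fact that the thing is just a pointer to char.

```c
#include <stdio.h>
#include <string.h>

/* The "training wheels" typedef quoted above. */
typedef char *string;

int main(void)
{
    /* Reads nicely, but s is just a char pointer, with all the
     * lifetime and aliasing questions that come with one. */
    string s = "hello";            /* identical to: char *s = "hello"; */
    printf("%s is %zu bytes long\n", s, strlen(s));
    return 0;
}
```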
Then I made the mistake of listening in on more of that lesson, and that just had me blow a fuse. The lesson takes a detour to try to explain ASCII: the fact that characters are just numbers that are looked up in a table, and that the table is typically 8-bit, with no mention of Unicode. Yes, I understand Unicode is complicated, and UTF-8 and other variable-length encodings will definitely give a headache to a newcomer who has never seen a programming language before. But it’s also 2020, and it might be a good idea to at least mention that there’s such a thing as variable-length encoded text, and that no, 8-bit characters are not enough to represent people’s names! The fact that my own name has a special character might have something to do with this, of course.
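A two-line example would be enough to plant the seed. Here’s a hypothetical one (assuming the source file is saved as UTF-8, which any modern editor will do):

```c
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* "José" is four characters, but the é takes two bytes in UTF-8,
     * so strlen() reports five: bytes and characters are not the same
     * thing, and 8-bit "characters" do not cover people's names. */
    const char *name = "José";
    printf("strlen(\"%s\") = %zu\n", name, strlen(name));
    return 0;
}
```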
It got worse. The teacher decided to show some upper-case/lower-case trickery on strings, and explained how you add or subtract 32 to go from one case to the other. Which is limited not only by character set, but most importantly by locale: oops, I guess the teacher never heard of the Turkish Four Is, or maybe there’s some lack of cultural diversity in the writing room for these courses. I went on a rant on Twitter over this, but let me reiterate it here because it’s important: there’s no reason why a newcomer to any programming language should be taught to add or subtract 32 to 7-bit ASCII characters to change their case, because it is not something you want to do outside of very tiny corner cases. It’s not safe in some languages. It’s not safe with characters outside the 7-bit ASCII Latin alphabet. It is rarely the correct thing to do. The standard library of any programming language has locale-aware functions to uppercase or lowercase a string, and that’s what you need to know!
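Even staying within the C standard library, this is what I’d rather a newcomer saw: toupper() at least goes through the current locale and doesn’t bake ASCII arithmetic into the program (full Unicode case mapping is a different beast and needs a library such as ICU, but that’s a different lesson).

```c
#include <ctype.h>
#include <stdio.h>

int main(void)
{
    const char *word = "istanbul";

    for (const char *p = word; *p != '\0'; p++) {
        /* toupper() expects an unsigned char value and consults the
         * current locale; hard-coding "*p - 32" would do neither. */
        putchar(toupper((unsigned char)*p));
    }
    putchar('\n');
    return 0;
}
```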
Today (at the time of writing) she got to allocations, and I literally heard the teacher going for malloc(sizeof(int)*10). Why on Earth they even bother teaching malloc() first, instead of calloc(), even if only to start with a bad example and improve on it, is beyond my understanding. But what do I know, it’s not like I spent a whole lot of time fixing these mistakes in real software twelve years ago. I will avoid complaining too much about the teacher suggesting that the behaviour of malloc() was decided by the clang authors.
Since there might be newcomers reading this who are a bit lost as to why I’m complaining: calloc() is a (mostly) safer alternative for allocating an array of elements, as it takes two parameters, the size of a single element and the number of elements you want to allocate. Using this interface means it’s no longer possible to have an integer overflow when calculating the total size, which reduces security risks. In addition, it zeroes out the memory rather than leaving it uninitialized. There is a performance cost to that, but if you’re a newcomer to the language and just about learning it, you should err on the side of caution and use calloc() rather than malloc().
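Side by side, with the same ten-int allocation the course uses:

```c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t count = 10;

    /* What the course shows: if count came from untrusted input, the
     * multiplication could overflow; the memory is also uninitialised. */
    int *a = malloc(sizeof(int) * count);

    /* The safer default: calloc() checks the multiplication for
     * overflow and hands back zeroed memory. */
    int *b = calloc(count, sizeof(int));

    if (a == NULL || b == NULL) {
        perror("allocation failed");
        free(a);
        free(b);
        return 1;
    }

    printf("b[0] is guaranteed to be %d; a[0] is anyone's guess\n", b[0]);

    free(a);
    free(b);
    return 0;
}
```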
Next up there’s my facepalm on the explanation of memory layout. Be prepared, because this is the same teacher who in a previous lesson said that an integer variable’s address might vary, but for the sake of his explanation can be assumed to be 0x123, completely ignoring the whole concept of alignment. To explain “by value” function calls, they decide to digress again, this time explaining heap and stack, and they describe a linear memory layout where the code of the program is followed by the globals and then the heap, with the stack at the bottom growing up. Which might have been true in the ’80s, but hasn’t been true in a long while.
Memory layout is not simple. If you wanted to explain a realistic memory layout you would have to cover the differences between physical and virtual memory, memory pages and page tables, hugepages, page permissions, W^X, copy-on-write, ASLR, … So I get that the teacher might want to skip over these details and give a simplified view of the memory layout. But as a professional who has been in the industry this long, I would appreciate it if they were upfront about it: “By the way, this is an oversimplification, reality is very different.” Oh, and by the way, the stack grows down on x86/x86-64.
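You don’t even need a debugger to see part of this. A toy like the following (built without optimisation, so the call isn’t inlined away) prints stack addresses that decrease as you call deeper, and that move around between runs thanks to ASLR, which is already a long way from a fixed 0x123.

```c
#include <stdio.h>

static int a_global;   /* a global, not on the stack or the heap */

static void callee(void)
{
    int inner = 0;
    /* On x86-64 the callee's locals end up at lower addresses than
     * the caller's: the stack grows down, not up. */
    printf("callee's local: %p\n", (void *)&inner);
}

int main(void)
{
    int outer = 0;

    printf("a global:       %p\n", (void *)&a_global);
    printf("main's local:   %p\n", (void *)&outer);
    callee();

    /* Run it twice: with ASLR the addresses change on every run. */
    return 0;
}
```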
This brings me to another interesting… mess, in my opinion. The course comes with some very solid tools: a sandbox environment already primed for the course, an instance of the AWS Cloud9 IDE with the libraries already installed, a fairly recent version of clang… but then it decides to stick to a dubious old style of C, with strcpy() and strcmp() and no reference to more modern, safer options (nevermind that glibc still refuses to implement the C11 Annex K bounds-checked string functions).
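Since Annex K isn’t available on glibc anyway, the pattern I’d show a newcomer instead of a bare strcpy() is a bounded, always-terminated copy. snprintf() is portable, and its return value tells you whether the result was truncated; a quick sketch:

```c
#include <stdio.h>
#include <string.h>

int main(void)
{
    char dst[8];
    const char *src = "a string that is clearly too long";

    /* strcpy(dst, src) here would overflow dst. snprintf() always
     * NUL-terminates and never writes past the buffer; the return
     * value tells us whether the copy was truncated. */
    int needed = snprintf(dst, sizeof(dst), "%s", src);
    if (needed < 0 || (size_t)needed >= sizeof(dst))
        fprintf(stderr, "truncated: needed %d bytes\n", needed);

    printf("dst = \"%s\"\n", dst);
    return 0;
}
```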
But then they decide not only to briefly show the newcomers how to use Valgrind, of all things, but even how to use a custom post-processor for its report output, because it’s otherwise hard to read. And this in a course using clang, which could rely on tools such as ASan and MSan to report much of the same information in a more concise way.
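For leaks in particular, the sanitizers are already in the clang the course ships. Something like the following catches the leak at exit with a readable stack trace and no post-processing (flags as I remember them from clang on Linux, where leak detection is on by default with ASan):

```c
#include <stdlib.h>

/* Build and run with:
 *
 *     clang -g -fsanitize=address leak.c -o leak && ./leak
 *
 * On exit, LeakSanitizer prints the allocation stack of the block we
 * "forgot" to free, which is roughly the information the course digs
 * out of Valgrind's output. */
int main(void)
{
    int *leaked = malloc(sizeof(int) * 10);
    (void)leaked;               /* never freed, on purpose */
    return 0;
}
```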
I find this contrast particularly gruesome: the teacher appears to think that memory leaks are an important defect to avoid in software, so much so that they decide to hand a power tool such as Valgrind to a class of newcomers… but they don’t find Unicode, and correctness in representing people’s names (because of course they talk about names), to be just as important. I find these priorities totally inappropriate in 2020.
Don’t get me wrong: I understand that writing a good programming course is hard, and that professors and teachers have a hard job in front of them when it comes to explaining complex concepts to people who are more eager to “make” something than to learn how it works. But I do wonder whether sitting a dozen professionals through these lessons wouldn’t make for a better course overall.
«He who can, does; he who cannot, teaches» is a phrase attributed to George Bernard Shaw. I don’t really agree with it as it stands, because I have met awesome professors and teachers. I already mentioned my Systems teacher, who I’m told retired just a couple of months ago. But in this case I can tell you that I wouldn’t want to have to review the code (or documentation) written by that particular teacher, as I’d have a hard time keeping my comments constructive after so many facepalms.
It’s a disservice to newcomers that this is what they are taught. And it’s professionals like me who are causing it, by (clearly) not pushing back enough on academia to be more practical, nor building better courseware for teachers to rely on. But then again, I rant on a C-list blog; I don’t teach at Harvard.
This poses an interesting question. Why do we have comp-sci programs? Is it to produce developers for the industry? Or is it to produce CS academics? Or somewhere in between?
When I was a CS&E student, I ranted about universities choosing FancyLanguageOfTheDay as the language of discourse for introductory programming, because that’s the sort of thing I expect from a trade school, not from an institute of higher learning.
But I also expect institutes of higher learning to do things like cognitive studies of “best code organisation for review”, “defects avoided by code review” and the like, which would then presumably feed back into introductory programming courses.
Either way, I think C taught like it was 40 years ago is a waste of time, whether you’re trying to get people into theoretical CS or into the industry. For the industry, these courses are dangerous because they teach the wrong way to do things. For theoretical CS, because students never get to see the improvements that theory and industry have made in those 40 years. To compare it to physics: are they still teaching introductory courses using the cubical atom model?
I’m not expecting them to go and start teaching JavaScript just because that’s what the industry uses now. But I’d also like them to keep up with how modern computers work. Which they kind of did, by using clang and Cloud9 on an x86-64 VM, and then decided to throw all of that away by describing the old linear memory layout.
I think it would have been fairer to just use an invented language, or a subset of one, so that it doesn’t even suggest reusability in the industry. It’s kind of like when I used to use Blackfin assembly to explain generated code, because freaking Intel assembly is too complicated to understand, even for me.
Yep, I expect there to be a feedback loop. As new things get discovered, as new knowledge is gained, as standards change, …
To some extent, I am OK with introductory courses using slightly outdated standards, but plain K&R C is clearly not acceptable in a modern “introduction to C”. I’d say a 2-5 year lag from “standardised” to “used” is an acceptable lag.
I think what sent me off on the academia/industry tangent, and the “what is it for” question, was your comment that professionals are not pushing back enough. In pretty much all the cases I’ve seen of that happening, you end up with an “introduction to hype-language-of-the-year” instead of a decent exemplar language for the style of programming you’re introducing (imperative, OO, functional, declarative; I am probably missing a few paradigms).
I guess I have an expectation that professionals should have learnt to choose their target when writing docs (or courses).
Professionals can still point out “this is dangerous to teach” even without going for the latest trendy option.
Such as teaching that heap and stack can meet.
I think it’s fine that glibc doesn’t implement the “safe” functions. Like strlcpy and friends, they are deceptive in that they look like they should solve the problem, but they don’t; the failures just become more subtle.
I disagree; I find their failure modes more blunt rather than subtle. But at this point that’s a matter of taste.
String processing in C will never be safe.