Cleanup chores

These last few days in San Francisco I was able to at least do some of the work I set myself out to do in my spare time, mostly on my blog. The week turned out to be much more full of things to do than I planned originally, so it turned out that I did not go through with all my targets, but at least a few were accomplished.

Namely, all my blog archives consistently links to the HTTPS versions of the posts, as well as the HTTPS version of my website – which is slowly withering to leave more space to the log – and of Autotools Mythbuster on its new home domain. This sounds like an easy task but it turned out to be slightly more involved than I was expecting, among other things because at some point I used protocol-relative URLs. I even fixed all the links that pointed to the extremely old Planet Gentoo Blog, so that the cross-references are now working, even though probably nobody will read those posts ever again. I also made all the blog comments coming from me consistent by using the same email address (rather than three different ones) and the same website. Finally, I got the list of 404s as seen by GoogleBot and made sure that the links that were broken when posted out there pointed to the right posts.

But there have been a few more things that needed some housekeeping, and was related to account churn. For context, this past Friday was my birthday — and I received an interesting email from a very old games forum that I registered on when I was helping out with the NoX-Wizard emulator: a “happy birthday” message. I then remembered that most vBullettin/phpBB installs send their greetings to the registered user who opted in to provide their birthdate (and sometimes you were forced to due to COPPA). Then since there has been some rumors of a breach on an Italian provider which I used to use when I originally went online, I decided to go and change passwords – once again thanks LastPass ­– and found there two more similar messages for other forums, which I probably have not visited in almost ten years.

You could think that there is no reason to go and try to restore those accounts to life — and I would almost agree with you if it wasn’t that they pose a security risk the moment they get breached. And it should be obvious by now that breaching lots of small sites can be just as profitable as breaching a single big site, and much easier. Those forums most likely still had my original, absolutely insecure passwords, so I went and regenerated them.

I wonder how many more accounts I forgot about are out there — I know that for sure there are some that were attached to my BerliOS email address, which is now long gone. The other day using Baidu to look for myself I got remembered I had a Last.FM account which I now got access to again. At least using a password manager it’s more difficult to forget about accounts altogether, as they are stored there.

Anyway, for the moment this is enough cleanup, feel free to report if there are other things that I should probably work on, non-Gentoo related (Autotools Mythbuster is due an update but I have not had time to go through that yet); the slower Amazon ad on the blog will also be fixed, promised!

How to analyze a dump of usernames

There has been some noise around a leak of users/passwords pairs which somehow panicked people into thinking it was coming from a particular provider. Since it seems most people have not even tried looking at the account information available, I’d like to point out some ways that could have helped avoiding the panic, if only the reporters cared. It also fits nicely into my previous notes on accounts’ churn.

But before proceeding let me make one thing straight: this post contains no information that is not available to the public and bears no relation to my daily work for my employer. Just wanted to make that clear. Edit: for the official response, please see this blog post of Google’s Security blog.

To begin the analysis you need a copy of the list of usernames; Italian blogger Paolo Attivissimo linked to it in his post but I’m not going to do so. Especially since it’s likely to become obsolete soon, and might not be liked by many. The archive is a compressed list of usernames without passwords or password hashes. At first, it seems to contain almost exclusively gmail.com addresses — in truth there are more addresses but it probably does not hit the news as much to say that there are some 5 million addresses from some thousand domains.

Let’s first try to extract real email addresses from the file, which I’ll call rawsource.txt — yes it does not match the name of the actual source file out there but I would rather avoid the search requests finding this post from the filename.

$ fgrep @ rawsource.txt > source-addresses.txt

This removes some two thousands lines that were not valid addresses — turns out that the file actually contains some passwords, so let’s process it a little more to get a bigger sample of valid addresses:

$ sed -i -e 's:|.*::' source-addresses.txt

This should make the next command give us a better estimate of the content:

$ sed -n -e 's:.*@::p' source-addresses.txt | sort | uniq -c | sort -n
[snip]
  238 gmail.com.au
256 gmail.com.br
338 gmail.com.vn
608 gmail.com777
123215 yandex.ru
4800129 gmail.com

So as we were saying earlier there are more than just Google accounts in this. A good chunk of them are on Yandex, but if you look at the outlier in the list there are plenty of other domains including Yahoo. Let’s just filter away the four thousands addresses using either broken domains or outlier domains and instead focus on these three providers:

$ egrep '@(gmail.com|yahoo.com|yandex.ru)$' source-addresses.txt > good-addresses.txt

Now things get more interesting, because to proceed to the next step you have to know how email servers and services work. For these three providers, and many default setups for postfix and similar, the local part of the address (everything before the @ sign) can contain a + sign, when that is found, the local part is split into user and extension, so that mail to nospam+noreally would be sent to the user nospam. Servers generally ignore the extension altogether, but you can use it to either register multiple accounts on the same mailbox (like I do for PayPal, IKEA, Sony, …) or to filter the received mail on different folders. I know some people who think they can absolutely identify the source of spam this way — I’m a bit more skeptical, if I was a spammer I would be dropping the extension altogether. Only some very die-hard Unix fans would not allow inbound email without an extension. Especially since I know plenty of services that don’t accept email addresses with + in them.

Since this is not very well known, there are going to be very few email addresses using this feature, but that’s still good because it limits the amount of data to crawl through. Finding a pattern within 5M addresses is going to take a while, finding one in 4k is much easier:

$ egrep '.*+.*@.*' good-addresses.txt | sed -e '/.*@.*@.*/d' > experts-addresses.txt

The second command filters out some false positives due to two addresses being on the same line; the results from the source file I started with is 3964 addresses. Now we’re talking. Let’s extract the extensions from those good addresses:

$ sed -e 's:.*+(.*)@.*:1:' experts-addresses.txt | sort > extensions.txt

The first obvious thing you can do is figure out if there are duplicates. While the single extensions are probably interesting too, finding a pattern is easier if you have people using the same extension, especially since there aren’t that many. So let’s see which extensions are common:

$ sed -e 's:.*+(.*)@.*:1:' experts-addresses.txt | sort | uniq -c -d | sort -n > common-extensions.txt

An obvious quick look look of that shows that a good chunk of the extensions (the last line in the generated file) used were referencing xtube – which you may or may not know as a porn website – reminding me of the YouPorn-related leak two and a half years ago. Scouring through the list of extensions, it’s also easy to spot the words “porn” and “porno”, and even “nudeceleb” making the list probably quite up to date.

Just looking at the list of extensions shows a few patterns. Things like friendster, comicbookdb (and variants like comics, comicdb, …) and then daz (dazstudio), and mythtv. As RT points out it might very well be phishing attempts, but it is also well possible that some of those smaller sites such as comicbookdb were breached and people just used the same passwords for their GMail address as the services (I used to, too!), which is why I think mandatory registrations are evil.

The final automatic interesting discovery you can make involves checking for full domains in the extensions themselves:

fgrep . extensions.txt | sort -u

This will give you which extensions include a dot in the name, many of which are actually proper site domains: xtube figures again, and so does comicbookdb, friendster, mythtvtalk, dax3d, s2games, policeauctions, itickets and many others.

What does this all tell me? I think what happens is that this list was compiled with breaches of different small websites that wouldn’t make a headline (and that most likely did not report to their users!), plus some general phishing. Lots of the passwords that have been confirmed as valid most likely come from people not using different passwords across websites. This breach is fixed like every other before it: stop using the same password across different websites, start using something like LastPass, and use 2 Factor Authentication everywhere is possible.

Account churn

In my latest post I singled out ne of the worst security experiences I’ve had with a service, but that is by far not the only bad experience I had. Indeed, given that I’ve been actively hunting down my old accounts and tried to get my hands on them, I can tell you that I have plenty of material to fill books with bad examples.

First of all, there is the problem of migrating the email addresses. For the longest time I’ve been using my GMail address to register everywhere, but then I decided to migrate to use my own domain (especially since Gandi supports two factor authentication, which makes it much safer). Unfortunately that means that not only I have a bunch of accounts still on the old email address, but I also have duplicate accounts.

Duplicate accounts become even more tricky when you consider that I had my own company, which meant I had double accounts for things that did not allow me to ship sometimes with a VAT ID attached, and sometimes not. Sometimes I could close the accounts (once I dropped the VAT ID), and sometimes I couldn’t, so a good deal of them are still out there.

FInally, there are the services that are available in multiple countries but with country-specific accounts. Which, don’t be mistaken, does not mean that every country has its own account database! It simply means that a given account is assigned to a country and does not work on any other. In most cases you cannot even migrate your account across countries. This is the case, for instance, of OVH, and why I moved to Gandi but also of PayPal (in which the billing address is tied to the country of the account and can’t be changed), IKEA and -PSN- Sony Online Entertainment. The end result is that I have duplicated (or even triplicated) accounts to cover the fact I have lived in multiple countries by now.

Also, it turns out that I completely forgot how many services I registered to over the years. Yes I have the passwords as stored by Chrome, but that’s not a comprehensive list as some of the most critical passwords have never been saved there (such as my bank’s password), plus some websites I have never used in Chrome, and at some point I had it clean the history of passwords and start from scratch. Some of the passwords have been saved in sgeps so I could look them up there, but even those are not a complete list. I ended up looking in my old email content to figure out which accounts I forgot having. The results have been fun.

But what about the grievances? Some of the accounts I wanted to gain access to again ended up being blocked or deleted, I’m surprised by the amount of services that either were killed, or moved around. At least three ebook stores I used are now gone, two of which absorbed by Kobo, while Marks & Spencer refused to recognize my email as valid, I assume they at some point reset their user database or something. Some of the hotel loyalty programs I signed up before and used once or twice disappeared, or were renamed/merged into something else. Not a huge deal but it makes account management a fun problem.

Then there are the accounts that got their password invalidated in the mean time, so even if I have a copy of it, it’s useless. Given that some accounts I had not logged into for years, that’s fair to happen: between leaks, heartbleed, and the overdue changes in best practices for password storage, I am more bothered by the services that did not invalidate my password in the mean time. But then again, there are different ways to deal with it. Some services when trying to login with the previous password point out that it’s no long valid and proceed with the same Forgotten password request workflow. Others will send you the password by email in plain text.

One quite egregious case happened with an Italian electronics shop, one of the first online-only stores I know of in Italy. Somehow, I knew that the account was still active, mostly because I could see their newsletter in the spam folder of my GMail account. So I went and asked for the password back, to change the address and stop the newsletter too (given I don’t live in Italy any longer), they sent me back the userid and password in cleartext. They reset their passwords in the mean time, and the default password became my Italian tax ID. Not very safe, if I were to know the user id of anyone else, knowing their tax ID is easy, as it can be calculated based on a few personal, but not so secret, details (full name, sex, date and city of birth).

But there is something worse than unannounced password resets. The dance of generating a new password. I have now the impression that it’s a minority of services that actually allow you to use whichever password you want. Lots of the services I have changed password for between last night and today required me to disable the non-alphanumeric symbols, because either they don’t support any non-alphanumeric character, or they only support a subset that LastPass does not let you select.

But this is not as bothersome as the length limitation of passwords. Most sites will be happy to tell you that they require a minimum of 6 or 8 characters for your password — few will tell you upfront the maximum length of a password. And very few of those that won’t tell you right away will fix the mistake by telling you when the password is too long, how long it can be. I even found sites that simply error out on you when you try to use a password that is not valid, and decide to invalidate both your old and temporary passwords, while not accepting the new one. It’s a serious pain.

Finally, I’ve enabled 2FA for as many services as I could; some of it is particularly bothersome (LinkedIn, I’ll probably write more about that), but at least it’s an extra safety. Unfortunately, I still find it extremely bothersome neither Google Authenticator nor RedHat’s FreeOTP (last time I tried) supported backing up the private keys of the configured OTPs. Since I switched between three phones in the past three months, I could use some help when having to re-enroll my OTP generators — I upgraded my phone, then had to downgrade because I broke the screen of the new one.