This Time Self-Hosted

How to analyze a dump of usernames

There has been some noise around a leaked list of username/password pairs, which somehow panicked people into thinking it came from one particular provider. Since it seems most people have not even tried looking at the account information available, I’d like to point out some ways that could have helped avoid the panic, if only the reporters had cared. It also fits nicely with my previous notes on accounts’ churn.

But before proceeding, let me make one thing straight: this post contains no information that is not available to the public, and bears no relation to my daily work for my employer. Just wanted to make that clear. Edit: for the official response, please see this post on Google’s Security blog.

To begin the analysis you need a copy of the list of usernames; Italian blogger Paolo Attivissimo linked to it in his post, but I’m not going to do the same, especially since the link is likely to become obsolete soon, and many would not appreciate it being spread further. The archive is a compressed list of usernames without passwords or password hashes. At first glance it seems to contain almost exclusively gmail.com addresses; in truth there are more, but it would not have hit the news as hard to say that there are some 5 million addresses spread across a few thousand domains.

Let’s first try to extract real email addresses from the file, which I’ll call rawsource.txt (no, it does not match the name of the actual source file out there; I would rather avoid search queries for the filename landing on this post).

$ fgrep @ rawsource.txt > source-addresses.txt

This removes some two thousand lines that were not valid addresses. It turns out the file actually contains some passwords as well, so let’s process it a little more to get a cleaner sample of valid addresses:

$ sed -i -e 's:|.*::' source-addresses.txt
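
Judging by the sed expression above, the password sits after a pipe character, so a quick sanity check on a made-up line shows what the command does; both someuser and hunter2 are invented for illustration:

$ echo 'someuser@gmail.com|hunter2' | sed -e 's:|.*::'
someuser@gmail.com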

This should make the next command give us a better estimate of the content:

$ sed -n -e 's:.*@::p' source-addresses.txt | sort | uniq -c | sort -n
[snip]
238 gmail.com.au
256 gmail.com.br
338 gmail.com.vn
608 gmail.com777
123215 yandex.ru
4800129 gmail.com

So, as noted earlier, there is more than just Google accounts in this. A good chunk of them are on Yandex, and if you look further up the list there are plenty of other domains, including Yahoo. Let’s filter away the few thousand addresses on broken or outlier domains (such as gmail.com777 above) and focus on the three main providers:

$ egrep '@(gmail.com|yahoo.com|yandex.ru)$' source-addresses.txt > good-addresses.txt
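
To get a feel for how much data is left to work with after the filter, a quick count helps (I won’t quote a figure here, as it depends on the copy of the dump you start from):

$ wc -l < good-addresses.txt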

Now things get more interesting, because the next step requires knowing how email servers and services work. For these three providers, and for many default setups of postfix and similar software, the local part of the address (everything before the @ sign) can contain a + sign. When that is found, the local part is split into user and extension, so that mail to nospam+noreally is delivered to the user nospam. Servers generally ignore the extension altogether, but you can use it to register multiple accounts on the same mailbox (like I do for PayPal, IKEA, Sony, …) or to filter the received mail into different folders. I know some people who think they can positively identify the source of spam this way; I’m a bit more skeptical, since if I were a spammer I would drop the extension altogether. Only the most die-hard Unix fans would refuse inbound email without the extension, especially since I know plenty of services that don’t accept email addresses with a + in them at all.
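
As a concrete sketch of the normalization such a server applies, reusing the hypothetical nospam+noreally mailbox from above, this strips the extension the same way the provider would:

$ echo 'nospam+noreally@gmail.com' | sed -e 's:+[^@]*@:@:'
nospam@gmail.com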

Since this feature is not very well known, only a few of the addresses will use it, but that’s actually good for us, because it limits the amount of data to crawl through. Finding a pattern among 5M addresses would take a while; finding one among 4k is much easier:

$ egrep '.*\+.*@.*' good-addresses.txt | sed -e '/.*@.*@.*/d' > experts-addresses.txt

The sed command filters out some false positives caused by two addresses appearing on the same line; the result from the source file I started with is 3,964 addresses. Now we’re talking. Let’s extract the extensions from those addresses:
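
A quick count confirms the size of the filtered sample (the exact figure will of course vary with the copy of the file you start from):

$ wc -l < experts-addresses.txt
3964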

$ sed -e 's:.*+\(.*\)@.*:\1:' experts-addresses.txt | sort > extensions.txt

The first obvious thing you can do is figure out if there are duplicates. While the one-off extensions are probably interesting too, finding a pattern is easier when multiple people use the same extension, especially since there aren’t that many. So let’s see which extensions are common:

$ sed -e 's:.*+\(.*\)@.*:\1:' experts-addresses.txt | sort | uniq -c -d | sort -n > common-extensions.txt

A quick look at that file shows that a good chunk of the extensions used (the last line in the generated file, thanks to the sort -n) reference xtube, which you may or may not know as a porn website, reminding me of the YouPorn-related leak two and a half years ago. Scouring through the list of extensions, it’s also easy to spot the words “porn” and “porno”, and even “nudeceleb”, which suggests the list is fairly recent.
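
If you want to verify this on your own copy, a simple grep over the duplicated extensions will do; the pattern is just the words mentioned above:

$ egrep 'porno?|nudeceleb' common-extensions.txt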

Just looking at the list of extensions shows a few more patterns: things like friendster, comicbookdb (and variants like comics, comicdb, …), daz (dazstudio), and mythtv. As RT points out, these might very well be the result of phishing attempts, but it is also quite possible that some of those smaller sites, such as comicbookdb, were breached, and that people simply used the same password for their GMail account as for those services (I used to do that too!), which is why I think mandatory registrations are evil.

The final interesting discovery you can make automatically involves checking for full domain names in the extensions themselves:

$ fgrep . extensions.txt | sort -u

This lists the extensions that include a dot in the name, many of which are actually proper site domains: xtube figures again, and so do comicbookdb, friendster, mythtvtalk, dax3d, s2games, policeauctions, itickets, and many others.

What does all this tell me? I think this list was compiled from breaches of various small websites that wouldn’t make headlines (and that most likely never notified their users!), plus some general phishing. Most of the passwords that have been confirmed as valid likely come from people reusing the same password across websites. The fix for this breach is the same as for every one before it: stop reusing passwords across websites, start using a password manager, and enable 2-Factor Authentication everywhere it is possible.

Comments
  1. Please do not suggest using LastPass. You can’t know whether it’s secure (closed source). Just use KeePass, or a variant, instead.

  2. Yes, because it’s reasonable to suggest a very user-unfriendly piece of software that nobody will use when you’re trying to work around people’s laziness with passwords… Sorry, I’m out of sarcasm to make this comment longer.

  3. Or not, given that’s still less secure, especially if it’s an algorithm you can remember (or one generated by something such as SuperGenPass).
