How to analyze a dump of usernames

There has been some noise around a leak of username/password pairs which somehow panicked people into thinking it was coming from a particular provider. Since it seems most people have not even tried looking at the account information available, I’d like to point out some ways that could have helped avoid the panic, if only the reporters had cared. It also fits nicely with my previous notes on account churn.

But before proceeding let me make one thing straight: this post contains no information that is not available to the public, and it bears no relation to my daily work for my employer. Just wanted to make that clear. Edit: for the official response, please see this post on Google’s Security blog.

To begin the analysis you need a copy of the list of usernames; Italian blogger Paolo Attivissimo linked to it in his post, but I’m not going to do so, especially since it’s likely to become obsolete soon and might not be liked by many. The archive is a compressed list of usernames without passwords or password hashes. At first glance it seems to contain almost exclusively gmail.com addresses — in truth there are more, but it probably doesn’t hit the news as hard to say that there are some 5 million addresses spread across some thousand domains.

Let’s first try to extract real email addresses from the file, which I’ll call rawsource.txt — yes it does not match the name of the actual source file out there but I would rather avoid the search requests finding this post from the filename.

$ fgrep @ rawsource.txt > source-addresses.txt

This removes some two thousand lines that were not valid addresses — it turns out that the file actually contains some passwords too, so let’s process it a little more to get a bigger sample of valid addresses:

$ sed -i -e 's:|.*::' source-addresses.txt

This should make the next command give us a better estimate of the content:

$ sed -n -e 's:.*@::p' source-addresses.txt | sort | uniq -c | sort -n
[snip]
    238 gmail.com.au
    256 gmail.com.br
    338 gmail.com.vn
    608 gmail.com777
 123215 yandex.ru
4800129 gmail.com

So, as we were saying earlier, there is more than just Google accounts in this. A good chunk of them are on Yandex, but if you look at the outliers in the list there are plenty of other domains, including Yahoo. Let’s just filter away the four thousand addresses using either broken domains or outlier domains and instead focus on these three providers:

$ egrep '@(gmail.com|yahoo.com|yandex.ru)$' source-addresses.txt > good-addresses.txt

Now things get more interesting, because to proceed to the next step you have to know how email servers and services work. For these three providers, and for many default setups of postfix and similar, the local part of the address (everything before the @ sign) can contain a + sign; when that is found, the local part is split into user and extension, so that mail to nospam+noreally is delivered to the user nospam. Servers generally ignore the extension altogether, but you can use it to either register multiple accounts on the same mailbox (like I do for PayPal, IKEA, Sony, …) or to filter the received mail into different folders. I know some people who think they can positively identify the source of spam this way — I’m a bit more skeptical: if I were a spammer I would drop the extension altogether, and only some very die-hard Unix fans would refuse inbound email that lacks the extension. Especially since I know plenty of services that don’t accept email addresses with + in them at all.
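
Just to illustrate the mechanics, dropping the extension is a trivial sed substitution (the output file name here is purely illustrative):

$ sed -e 's:+[^@]*@:@:' source-addresses.txt > normalized-addresses.txt

Which is exactly why relying on the extension to trace the source of spam is shaky.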

Since this is not very well known, there are going to be very few email addresses using this feature, but that’s still good, because it limits the amount of data to crawl through. Finding a pattern within 5M addresses is going to take a while; finding one in 4k is much easier:

$ egrep '.*\+.*@.*' good-addresses.txt | sed -e '/.*@.*@.*/d' > experts-addresses.txt

The second command filters out some false positives due to two addresses being on the same line; the source file I started with yields 3964 addresses. Now we’re talking. Let’s extract the extensions from those good addresses:

$ sed -e 's:.*+\(.*\)@.*:\1:' experts-addresses.txt | sort > extensions.txt

The first obvious thing to do is figure out whether there are duplicates. While the individual extensions are probably interesting too, finding a pattern is easier if multiple people use the same extension, especially since there aren’t that many. So let’s see which extensions are common:

$ sed -e 's:.*+\(.*\)@.*:\1:' experts-addresses.txt | sort | uniq -c -d | sort -n > common-extensions.txt

A quick look at that shows that a good chunk of the extensions used (the last line in the generated file) were referencing xtube – which you may or may not know as a porn website – reminding me of the YouPorn-related leak two and a half years ago. Scouring through the list of extensions, it’s also easy to spot the words “porn” and “porno”, and even “nudeceleb”, which suggests the list is probably quite up to date.
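
If you want a rough feel for how common those references are (I’m not claiming any precise figure here), a case-insensitive count over the extracted extensions is enough:

$ egrep -ci 'xtube|porn' extensions.txt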

Just looking at the list of extensions shows a few patterns. Things like friendster, comicbookdb (and variants like comics, comicdb, …) and then daz (dazstudio), and mythtv. As RT points out these might very well be phishing attempts, but it is also quite possible that some of those smaller sites such as comicbookdb were breached and people just used the same passwords for their GMail addresses as for those services (I used to, too!), which is why I think mandatory registrations are evil.

The final interesting automated discovery you can make involves checking for full domains in the extensions themselves:

$ fgrep . extensions.txt | sort -u

This shows which extensions include a dot in the name, many of which are actually proper site domains: xtube figures again, and so do comicbookdb, friendster, mythtvtalk, dax3d, s2games, policeauctions, itickets and many others.

What does this all tell me? I think this list was compiled from breaches of different small websites that wouldn’t make a headline (and that most likely did not report the breach to their users!), plus some general phishing. Lots of the passwords that have been confirmed as valid most likely come from people reusing the same password across websites. This breach is fixed like every other before it: stop using the same password across different websites, start using something like LastPass, and use two-factor authentication everywhere it is possible.

Passive web log analysis: replacing AWStats?

You probably don’t know this, but for my blog I analyse the Apache logs with AWStats over and over again. This is especially useful at the start of the month to identify referrer spam and other similar issues, which in turn allows me to update my ModSecurity ruleset so that more spammers are caught and dealt with.
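
For context, the kind of question I keep asking the logs is often as simple as “which referrers show up the most?”; assuming Apache’s default combined log format and a log file called access_log, that is a one-liner (a rough sketch, not what AWStats itself does):

$ awk -F'"' '{ print $4 }' access_log | sort | uniq -c | sort -rn | head -n 20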

To do so I’ve been using AWStats for years at this point; it’s a Perl log analyzer, report generator and CGI application. It used to work nicely, but nowadays it’s definitely lacking. It doesn’t recognize referrers from search engines as well as it used to (it’s still important to filter out requests coming from Google, but newer search engines are not recognized), and most of the new “social bookmark” websites are not there at all — yes, it’s possible to keep adding them, but with upstream not moving, this is getting harder and harder.

Even more important, for my ruleset work, is the lack of identification of modern browsers. Things like Android versions and other fringe OSes would be extremely useful to me, but adding support for all of them is a pain, and I have enough things on my plate that this is not something I’m looking forward to tackling myself. It’s even more bothersome when you consider that there is no way to re-process the already analyzed data if a new URL is identified as a search engine, or a user agent as a bot.

One of the most obvious choices for this kind of work is Google Analytics — unfortunately, that only works if it’s not blocked on the user side, which rules out NoScript users and of course most of the spammers. So this is not a job it can do; it’s something that has to be done on the backend, on the logs side.

The obvious next step is to find something capable of importing the data out of the current AWStats data files I have, and of continuing to import data from the Apache log files. Hopefully this would be done by saving the data in a PostgreSQL database (which is what I usually use); native support for vhost data would be needed, with the ability to collapse it into a single view.
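
Just to make the requirement concrete, here is a minimal sketch of what such a schema could look like; the database, table and column names are hypothetical and not taken from any existing tool:

$ psql weblogs <<'SQL'
CREATE TABLE hits (
    vhost      text        NOT NULL,  -- vhost data kept natively
    ts         timestamptz NOT NULL,
    path       text        NOT NULL,
    status     integer     NOT NULL,
    referrer   text,
    user_agent text
);
-- collapsed view across all vhosts, e.g. hits per day
CREATE VIEW daily_hits AS
    SELECT date_trunc('day', ts) AS day, count(*) AS hits
    FROM hits
    GROUP BY 1;
SQL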

If somebody knows of a similar piece of software, I’d love to give it a try — hopefully something written in Ruby or Perl would be best for me (because I can hack on those), but I wouldn’t say no to Python or even Java (the latter if somebody helped me make sure the dependencies are all packaged up properly). This will bring you better ModSecurity rules, I’m sure!

I need help publishing tinderbox logs

I have a bit of time lately, since I decided not to keep one of the gigs I’ve been working on in the past year or so… this does not mean I’m around more than usual for now though, mostly because I’m so tired that I need to recharge before I can even start answering the email messages I’ve received in the past ten days or so.

While tinderboxing lately has been troublesome – the dev-lang/tendra package build causes Yamato to run out of memory, getting the whole system hosed; it looks like a fork bomb, actually – I’ve also had some ideas on how to make it easier for me to report bugs, and in general to get other people to help out with analysing the results.

My original hope for the log analysis was to make it possible to push out the raw log, find the issues, and report them one by one. I see now that this is pretty difficult, nearing infeasible, so I’d rather try a different approach. The final result I’d like to have now is a web-based equivalent of my current emacs+grep combination: a list of log names, highlighting those that hit any trouble at all, and then, within those logs, highlights on the rows showing trouble.

To get to this point, I’d like to start small and in a series of little steps… and since I honestly don’t have the time to work on it, I’m asking for the help of anybody who’s interested in helping Gentoo. The first step here would be to find a way to process a build log file and translate it into HTML. While I know this means increasing its size tremendously, it is the simplest way to read the log and to add data over it. What I’d be looking for is a line-numbered page akin to the various pastebins that you can find around. This does mean having one span per line to make sure that they are aligned, and this will be important in the following steps.
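
A minimal sketch of that first step could look like this (file names are illustrative, and a real converter would need to handle more, as noted below about escape sequences):

$ awk 'BEGIN { print "<html><body><pre>" }
       { gsub(/&/, "\\&amp;"); gsub(/</, "\\&lt;"); gsub(/>/, "\\&gt;")
         printf "<span id=\"l%d\">%s</span>\n", NR, $0 }
       END { print "</pre></body></html>" }' build.log > build.log.html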

The main issue with this is that there are build logs that include escape sequences, even though I disable Portage’s colours as well as the build systems’, which means that whatever converts the logs should also take care of stripping away said sequences. There are also logs that include output such as wget’s or curl’s, which use the carriage-return code to overwrite the progress line but create a mess when the log is viewed outside a terminal — I’m not sure why the heck they don’t check whether they are outputting to a tty. There are also some (usually Java) packages whose log appears to grep as a binary file, and that’s also something the conversion code has to deal with.
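
For the escape sequences and the carriage returns, something along these lines might be a starting point (GNU sed assumed; only the common CSI-style sequences are handled, and the second expression keeps only what a terminal would actually show after all the overwrites on a line):

$ sed -e 's/\x1b\[[0-9;]*[a-zA-Z]//g' -e 's/.*\r//' build.log > build-stripped.log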

As a forecast of what’s to come in the next few steps, I’ll need a way to match error messages in each line, and highlight those. Once they are highlighted, just using XPath expressions to count the number of matched lines should make it much easier to produce an index of relevant logs… then it’s a matter of deciding where to publish these. I think it might be possible to just upload everything to Amazon’s S3 and use that, but I might be optimistic for no good reason, so it has to be discussed and designed.
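
Assuming the highlighting step marks troublesome lines with, say, a class="error" attribute (just an assumption at this point), counting them is a one-liner:

$ xmllint --html --xpath 'count(//span[@class="error"])' build.log.html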

Are you up for helping me with this task? If so, join in the comments!

Why the tinderbox is a non-distributed effort

In my previous post about a run control for the tinderbox, Pavel suggested that I use Gearman to handle the run control section; as I explained to him earlier this evening, this is probably too much right now: the basic run control I need is pretty simple (I could even keep using xargs, if Portage gave me hooks for merge succeeded/failed), and the fancy stuff, the network interface, is sugar that Gearman wouldn’t help me with as far as I can see.

On the other hand, what I want to talk about is the reasoning why I don’t think the tinderbox should be a distributed effort, as many people suggest from time to time to reduce the load on my machine. Unfortunately, for a distributed approach to work well, the task has to feasibly become a “divide et impera” kind of task.

The whole point of the tinderbox for me is to verify the interactions between packages; it’s meant to find which packages break when they are used together, among other things, and that kind of thing needs all the packages to be present at the same time, which precludes the use of a distributed tinderbox.

If you don’t think this is helpful, I can tell you quite a few interesting things about automagic deps, but since I have already written about them from time to time I’ll skip over it for now.

The kind of effort that can work with a distributed approach is the one taken by Patrick with his cleaning-up tinderboxes: after each merge the dependencies get removed, and a library of binary packages is kept up to date to avoid building them multiple times a day. This obviously makes it possible to test multiple package chains at once on multiple systems, but it also adds some further overhead (as multiple boxes will have to rebuild the same binary packages if you don’t share them around).

On the other hand, I think I have found a use for Gearman (an ebuild for which, mostly contributed by Pavel, is in my overlay; we’re working on polishing it): I already mused some time ago about checking the packages’ sources, looking at least for those things that can be found easily via scripts (like the over-canonicalisation I have already documented). This is a task where divide et impera is very likely a working strategy. Extracting and analysing the sources is an I/O-bound task, not a CPU-bound one, so Yamato’s approach there is definitely a losing one.

For a single box to have enough I/O speed to handle so many packages, you end up resorting to very high-end hardware (disks and controllers), which is very expensive. Way too expensive. On the other hand, with multiple boxes, even cheap or virtual ones (distributed among different real boxes of course), working independently but splitting the queue among themselves, with proper coordination, you can probably beat that performance for less than half the price. Now, for this to work there are many prerequisites, a lot of which I’m afraid I won’t be able to tackle anytime soon.

First of all, I need to understand well how Gearman works, since I have only skimmed through it up to now. Then I need to find the hardware; if I can change my garage into a machine room, and connect it to my network, that might be a good place to start (I can easily use low-power old-style machines; I still have a few around that hadn’t found a place lately). I also remember some users offering chroots on their boxes before; this might turn out pretty useful: if they can provide virtual machines, or containers, they can also work on the analysis, in a totally distributed fashion.

The third problem is somewhat the hardest but the most interesting: finding more analyses to run on the sources, without building anything. Thankfully, I have got the book (Secure Programming with Static Analysis) to help me cope with that task.

Wish me luck, and especially wish me to find time to work on this.

Tinderbox and logs, some thoughts

While looking at today’s logs I started to wonder whether the current approach is actually sustainable. The main problem is that most of the tinderbox was just created piece by piece as needed to complete a specific task, and it really has never undergone a complete overhaul.

Right before moving to containers it consisted only of a chroot, which caused quite a few issues in the long run, though nothing extreme (still, some important problems remain, such as the fact that the libvirt test suite seems to be triggering something bad in my kernel, which in turn blocks evolution and gwibber).

Beside the change from a simple chroot to a container (which reminds me a lot of BSD jails, by the way), the rest of the tinderbox has been kept almost identical: there is a script, written by Zac, that produces a list of packages, following these few rules:

  • the package is the highest version available for a slot of a package (it lists all slots for all packages);
  • the package is not masked, nor are its dependencies (this is important because I do mask stuff locally when I know it fails, for instance libvirt above);
  • the ebuild in that particular version is not installed, or it wasn’t re-installed in the past 10 weeks (originally, six weeks, now ten because tests need to be executed as well).

This is the main core of the tinderbox; besides that, there is a mostly-simple bashrc file for Portage that takes care of a few more checks that Portage itself ignores for now (a rough sketch of a couple of them follows the list):

  • check for calls to AC_CANONICAL_TARGET, this is a personal beef against overcanonicalisation in autotools;
  • check for bundled common libraries (zlib, libpng, jpeg, ffmpeg, …); this makes it easier to identify possibly bundled libraries, although it has a quite high rate of false positives; having an actual database of the code present in the various packages would make this easier;
  • check for use of insecure functions; this is just a little extra check for functions like tmpnam() and similar, that ld already warns about;
  • single-pass find(1) checks for OSX forkfiles (unfortunately common in the Ruby packages!), for setuid and setgid binaries (to have a list of them) and for invalid directories (like /usr/X11R6) in the packages;
  • extra QA trick to identify packages calling make rather than emake (which turns out to be quite useful to spot packages that use plain make instead of emake -j1 to hide parallel-build bugs).
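
As a rough idea of what those bashrc checks look like, here is a minimal sketch of two of them written as Portage phase hooks; file names and exact patterns are illustrative, not the actual code the tinderbox runs:

# /etc/portage/bashrc (illustrative excerpt)

post_src_unpack() {
    # flag configure.ac/configure.in files calling AC_CANONICAL_TARGET
    grep -rl AC_CANONICAL_TARGET "${WORKDIR}" \
        --include=configure.ac --include=configure.in 2>/dev/null \
        >> "${T}"/canonical-target.list
}

post_src_install() {
    # single find(1) pass over the image, listing setuid/setgid files
    find "${D}" -perm -4000 -o -perm -2000 \
        > "${T}"/setxid-files.list 2>/dev/null
}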

Now, all this produces a few files in the temporary directory, which are then printed so that the actual build log retains them. Unfortunately this does not really work tremendously well, since it requires quite a bit of work to properly extract them afterwards.

So I was thinking: what if, instead of just running emerge, I ran a wrapper around emerge? This would actually help me gather important information, at the cost of spending more time for each merge. At that point, the wrapper could write up a “report card” after each emerge (a rough sketch of such a wrapper follows the list), containing in a suitable format (XML comes to mind) the following information:

  • status of the emerge (completed or not);
  • build log of the emerge (filename, gets copied with the report card itself);
  • emerge info for the current merge (likewise) — right now the emerge info I provide is the generic information of the tinderbox, which might differ from the specific instance used by the merge;
  • versions of the whole dependency tree — likewise, it’s something that might change between the time the package fails and the time I file the bug;
  • CVS revision of the current ebuild — very important to debug possibly-fixed and duplicate bugs; this is one of the crucial points to consider before moving to git, where such a revision is not available, as far as I know;
  • if the package failed within functions that we know leave a log with details, those files should be copied together with the report card, and listed — this works for epatch, the functions from autotools.eclass and econf;
  • gathering of all the important information as returned by the checks in bashrc and in Portage proper (prestripped files, setXid files, use of make and AC_CANONICAL_TARGET, …).
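
A minimal sketch of what such a wrapper could look like, just to fix the idea; the paths, the report-card layout and the wrapper itself are hypothetical, nothing like this exists today:

#!/bin/bash
# hypothetical wrapper around emerge producing a per-package "report card"
atom="$1"
logdir="/var/log/tinderbox/${atom//\//_}"
mkdir -p "${logdir}"

emerge --oneshot "${atom}"
status=$?

# snapshot the emerge --info of this specific run, not the generic one
emerge --info > "${logdir}/emerge-info.txt"

cat > "${logdir}/report-card.xml" <<EOF
<report-card atom="${atom}">
  <status>$( [ ${status} -eq 0 ] && echo completed || echo failed )</status>
  <emerge-info>emerge-info.txt</emerge-info>
</report-card>
EOF

exit ${status}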

Once this report card is generated, the information from the workdir is pretty unimportant and can mostly be cleared up; this would allow me to save space on disk, which isn’t exactly free.

More importantly, with this report it’s also possible to check for bugs, and report them almost automatically, which would reduce the time needed for handling the reports of the tinderbox, and thus the time I need to invest on the reporting side of the job, instead of spending it on the fixing side (which thankfully is currently handled very well by Samuli and Victor).

More on this idea in the next few days, hopefully, with some proof of concept as well.

More tinderboxing, more analysis, more disk space

Even though I had a cold, I’ve kept busy in the past few days, which was especially good because today was most certainly a Monday. For the sake of mental sanity, I decided a few months ago that the weekend is off work for me, and Monday is dedicated to summing up what I’m going to do during the rest of the week, sort of a planning day. Which usually turns out to mean a lot of reading and very little action and writing.

Since I cannot sleep right now (I’ll have to write a bit about that too), I decided to start with the writing, to make sure the plans I figured out will be enacted this week. Which is especially prudent considering I also had to spend some time labelling, as usual this time of the year. Yes, I’m still doing that, at least until I can get a decent stable job. It works and helps pay the bills, at least a bit.

So anyway, you might have read Serkan’s post regarding the java-dep-check package and the issues it found once run on the tinderbox packages. This is probably one of the most interesting uses of the tinderbox: large-scale testing of packages that would otherwise keep such a low profile that the issues would never come out. To push the point further, the tinderbox is now running with the JAVA_PKG_STRICT variable set, so that the Java packages get extra checks and are tested much more thoroughly on the tree.

I also wanted to add further checks for bashisms in configure scripts. This sprouted from the fact that, on FreeBSD 7.0, an autoconf-generated configure script no longer discards the /bin/sh shell. Previously, the FreeBSD implementation was discarded because of a bug, and the script thus re-executed itself using bash instead. This was bad (because bash, as we know quite well, is slow) but also good (because then all the scripts were executed with the same shell on both Linux and FreeBSD). Since the bug is now fixed, the original shell is used, which is faster (and thus good); the problem is that some projects (unieject included!) use bashisms that will fail. Javier spent some time trying to debug the issue.

To check for bashisms, I’ve used the script that Debian makes available. Unfortunately the script is far from perfect. First of all, it doesn’t really have an easy way to just scan a subtree for actual sh scripts (using egrep is not totally fine since autoconf m4 fragments often have the #!/bin/sh string in them). That forced me to write a stupid, long and quite faulty script to scan the configure files.
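
For reference, something along these lines is probably what most people would reach for, assuming the script in question is Debian’s checkbashisms from devscripts; this is a sketch, not my actual (and admittedly faulty) script:

$ find . -type f -name configure | while read -r script; do
      # only pass along real /bin/sh scripts, not m4 fragments or templates
      head -n 1 "${script}" | grep -q '^#! */bin/sh' && checkbashisms "${script}"
  done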

But even worse, the script is full of false positives: instead of actually parsing the shell’s semantics, it only scans for substrings. For instance it identified the strange help output in gnumeric as a bash-specific brace expansion, when it was inside a HEREDOC string. Instead of this method, I’d rather have a special parameter in bash that tells the interpreter to output warnings about bash-specific features; maybe I should write it myself.

But I think that there are some things that should be addressed in a much different way than with the tinderbox itself. Like I have written before, there are many tests that should actually be executed on source code, like static analysis of the source code, and analysis of configure scripts, so as to fix issues like canonical targets when they are not needed, or misaligned ./configure --help output, and so on and so forth. This kind of scan should not be applied only to released code, but more importantly to the code still in the repositories, so that the issues can be killed before the code is released.

I had this idea when I went to look for different conditions in Lennart’s repositories (which are, as usual, available in my own repositories with changes, fixes and improvements to the build system – a huge thanks to Lennart for allowing me to be his autotools-meister). By build-checking his repositories before he makes a release, I can ensure the released code works for Gentoo just fine, instead of having to patch it afterwards and queue the patch for the following release. It’s the step beyond upstreaming the patches.

Unfortunately this kind of work is not only difficult because it’s hard to write static analysis software that gets good results; the US DHS-funded Coverity Scan, although lauded by people like Andrew Morton, had tremendously bad results in my opinion with the xine-lib analysis: lots of issues were never reported, the ones reported were often enough either false positives or inside the FFmpeg code (which xine-lib used to import), and the scanned code was almost never updated. If it simply hadn’t picked up the change to the Mercurial repository, that would have been understandable; I don’t expect them to follow the repository moves of all the projects they analyse, but the problem was there since way before the move. It also reported the same exact problems each and every day, repeated over and over; for a while I tried to keep track of them and mark the ones we had already dealt with, or which were false positives, or were parts of FFmpeg (and which may even have been fixed already).

So one thing to address is having an easy way to keep track of various repositories and their branches, which is not so easy since all SCM programs have different ways to access the data. Ohloh Open Hub probably has lots of experience with that, so I guess that might be a start; it has to be considered, though, that Open Hub only supports the three “major” SCM products (git, Subversion and the good old CVS), which means that extending it to any repository at all is going to take a lot more work, and it has had quite a few problems accessing Gentoo repositories, which means it’s certainly not fault-proof. And even if I were able to hook up a similar interface on my system, it would probably require much more disk space than I’m able to have right now.

For sure, the first step now is to actually write the analysis script that first checks the build logs (since that alone would already provide some results, once hooked up with the tinderbox), and then find a way to identify, from static analysis of the source code, some of the problems we most care about in Gentoo. Not an easy task, nor something that can be done in spare time, so if you have something to contribute, please do; it would be really nice to get the pieces of the puzzle in place.