2014-09-10

How to analyze a dump of usernames

There has been some noise around a leak of username/password pairs, which somehow panicked people into thinking it came from a particular provider. Since it seems most people have not even tried looking at the account information available, I’d like to point out some ways of looking at it that could have helped avoid the panic, if only the reporters cared. It also fits nicely into my previous notes on accounts’ churn.

But before proceeding let me make one thing straight: this post contains no information that is not available to the public and bears no relation to my daily work for my employer. Just wanted to make that clear. Edit: for the official response, please see this blog post of Google’s Security blog.

To begin the analysis you need a copy of the list of usernames; Italian blogger Paolo Attivissimo linked to it in his post, but I’m not going to do so, especially since it’s likely to become obsolete soon and linking it might not be liked by many. The archive is a compressed list of usernames without passwords or password hashes. At first it seems to contain almost exclusively gmail.com addresses. In truth there are others, but it probably does not make the news as much to say that there are some 5 million addresses from some thousand domains.

Let’s first try to extract real email addresses from the file, which I’ll call rawsource.txt — yes it does not match the name of the actual source file out there but I would rather avoid the search requests finding this post from the filename.

$ fgrep @ rawsource.txt > source-addresses.txt





This removes some two thousand lines that were not valid addresses. It turns out that the file actually contains some passwords, so let’s process it a little more, dropping everything from the first | onwards on each line, to get a bigger sample of valid addresses:



$ sed -i -e 's:|.*::' source-addresses.txt




This should make the next command give us a better estimate of the content:



$ sed -n -e 's:.*@::p' source-addresses.txt | sort | uniq -c | sort -n
[snip]
238 gmail.com.au
256 gmail.com.br
338 gmail.com.vn
608 gmail.com777
123215 yandex.ru
4800129 gmail.com
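
For a sense of proportion, the same tally can also be expressed as a share of the total. This is only a sketch: the function name domain_shares and the example.* addresses are made up for illustration, and it assumes one address per line on standard input.

```shell
# Hypothetical helper: print "count share domain" for each domain,
# largest first. Reads one address per line on stdin.
domain_shares() {
    sed -n -e 's:.*@::p' |
    LC_ALL=C awk '{ count[$0]++; total++ }
         END { for (d in count)
                   printf "%d %.1f%% %s\n", count[d], 100 * count[d] / total, d }' |
    sort -rn
}

# Made-up sample; with the real data you would feed it source-addresses.txt.
printf '%s\n' a@example.com b@example.com c@example.com d@example.org |
    domain_shares
```

On the real file this would be run as `domain_shares < source-addresses.txt`.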




So as we were saying earlier, there are more than just Google accounts in this. A good chunk of them are on Yandex, but if you look at the outliers in the list there are plenty of other domains, including Yahoo. Let’s just filter away the four thousand addresses using either broken domains or outlier domains and instead focus on these three providers:



$ egrep '@(gmail\.com|yahoo\.com|yandex\.ru)$' source-addresses.txt > good-addresses.txt




Now things get more interesting, because to proceed to the next step you have to know how email servers and services work. For these three providers, and for many default setups of postfix and similar, the local part of the address (everything before the @ sign) can contain a + sign. When one is found, the local part is split into user and extension, so that mail to nospam+noreally is delivered to the user nospam. Servers generally ignore the extension altogether, but you can use it either to register multiple accounts on the same mailbox (like I do for PayPal, IKEA, Sony, …) or to filter the received mail into different folders. I know some people who think they can absolutely identify the source of spam this way. I’m a bit more skeptical: if I were a spammer I would drop the extension altogether, since only some very die-hard Unix fans would refuse inbound email without an extension, and I know plenty of services that don’t accept email addresses with + in them anyway.
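
To make the split concrete, here is a minimal sketch using plain POSIX parameter expansion. The function name split_address and the sample address are made up for illustration, and it assumes the extension starts at the first + sign of the local part.

```shell
# Hypothetical helper: split an address into user, extension and domain.
# Assumes the extension starts at the first "+" of the local part.
split_address() {
    addr=$1
    local_part=${addr%@*}     # everything before the last "@"
    domain=${addr##*@}        # everything after the last "@"
    case $local_part in
        *+*) user=${local_part%%+*}; extension=${local_part#*+} ;;
        *)   user=$local_part;       extension= ;;
    esac
    # Print "-" in place of a missing extension
    printf '%s %s %s\n' "$user" "${extension:--}" "$domain"
}

split_address 'nospam+noreally@example.com'
```

Note that this takes the first + as the separator, while a greedy .*+ in a sed expression keeps only what follows the last +; the two differ only for addresses with more than one + sign.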



Since this is not very well known, there are going to be very few email addresses using this feature, but that’s still good because it limits the amount of data to crawl through. Finding a pattern within 5M addresses is going to take a while, finding one in 4k is much easier:



$ egrep '.*\+.*@.*' good-addresses.txt | sed -e '/.*@.*@.*/d' > experts-addresses.txt




The second command filters out some false positives due to two addresses being on the same line; the result from the source file I started with is 3964 addresses. Now we’re talking. Let’s extract the extensions from those good addresses:



$ sed -e 's:.*+\(.*\)@.*:\1:' experts-addresses.txt | sort > extensions.txt




The first obvious thing you can do is figure out if there are duplicates. While the single extensions are probably interesting too, finding a pattern is easier if you have people using the same extension, especially since there aren’t that many. So let’s see which extensions are common:



$ sed -e 's:.*+\(.*\)@.*:\1:' experts-addresses.txt | sort | uniq -c -d | sort -n > common-extensions.txt




A quick look at that file shows that a good chunk of the extensions used (the last line in the generated file) reference xtube, which you may or may not know as a porn website, reminding me of the YouPorn-related leak two and a half years ago. Scouring through the list of extensions, it’s also easy to spot the words “porn” and “porno”, and even “nudeceleb”, which suggests the list is quite up to date.



Just looking at the list of extensions shows a few patterns. Things like friendster, comicbookdb (and variants like comics, comicdb, …) and then daz (dazstudio), and mythtv. As RT points out it might very well be phishing attempts, but it is also quite possible that some of those smaller sites such as comicbookdb were breached and people just used the same password for their Gmail address as for those services (I used to, too!), which is why I think mandatory registrations are evil.
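
To put rough numbers on such patterns, something along these lines could count how many extensions mention each keyword. It is a sketch, not what I ran at the time: the function name count_keywords and the sample data are made up, and the keyword list is just an illustration taken from the names above.

```shell
# Hypothetical helper: count how many lines of a file mention each
# keyword (case-insensitive), most frequent first.
count_keywords() {
    file=$1; shift
    for keyword in "$@"; do
        # grep -c prints the number of matching lines, 0 if none
        printf '%s %s\n' "$(grep -ci -e "$keyword" "$file")" "$keyword"
    done | sort -rn
}

# Demo on a made-up sample; with real data, pass extensions.txt instead.
sample=$(mktemp)
printf '%s\n' xtube xtube.com comics comicbookdb mythtv > "$sample"
count_keywords "$sample" xtube comic mythtv
rm -f "$sample"
```

Matching on a substring such as comic groups the variants (comics, comicdb, comicbookdb) under a single count, which is the point of the exercise.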



The final interesting discovery you can make automatically involves checking for full domains in the extensions themselves:



$ fgrep . extensions.txt | sort -u




This lists the extensions that include a dot in the name, many of which are actually proper site domains: xtube figures again, and so do comicbookdb, friendster, mythtvtalk, dax3d, s2games, policeauctions, itickets and many others.
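
Not every extension with a dot is a domain, so a slightly stricter filter can help. The sketch below is an assumption of mine rather than part of the original analysis: the function name domain_like_extensions is made up, and "looks like a domain" is loosely defined as dot-separated labels ending in a suffix of two or more letters.

```shell
# Hypothetical helper: keep only extensions that look like a site domain,
# lowercase them, and print "count domain", most frequent first.
# Reads one extension per line on stdin.
domain_like_extensions() {
    tr '[:upper:]' '[:lower:]' |
    grep -E '^([a-z0-9-]+\.)+[a-z]{2,}$' |
    sort | uniq -c | sort -rn |
    awk '{ print $1, $2 }'    # strip the padding added by uniq -c
}

# Made-up sample; with the real data you would feed it extensions.txt.
printf '%s\n' site.example Site.Example john.2014 comics |
    domain_like_extensions
```

On the real file this would be run as `domain_like_extensions < extensions.txt`.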



What does this all tell me? I think what happened is that this list was compiled from breaches of different small websites that wouldn’t make a headline (and that most likely did not report to their users!), plus some general phishing. Lots of the passwords that have been confirmed as valid most likely come from people not using different passwords across websites. This breach is fixed like every other before it: stop using the same password across different websites, start using a password manager, and use 2-Factor Authentication wherever possible.
						Flameeyes					

Comments 4

Timothy says:
2014-09-10 at 18:55

Please do not suggest using Lastpass. You can’t know if it’s secure (closed source). Just use KeePass, or a variant, instead.

Flameeyes says:
2014-09-10 at 19:02

Yes, because it’s reasonable to suggest a very user-unfriendly piece of software that nobody will use when you’re trying to work around people’s laziness with passwords… Sorry, I’m out of sarcasm to make this comment longer.

Quijote70 says:
2014-09-11 at 19:42

Hi,
If you want some security:
select a password and, based on an algorithm on the site name, add 3 more letters
and use name+site@gmail.com

Flameeyes says:
2014-09-11 at 19:55

Or not, given that’s still more insecure, especially if it’s an algorithm you can remember (or that is generated by something such as SuperGenPass).