So I use phpMyAdmin most of the time. I am working on a project (http://www.kubisec.com / http://kubisec.mywire.org) that requires I import millions of records into MySQL. The CSV import is pretty sensitive and on a lot of errors, it just gives up. I needed a sure way to make sure CSVs would import correctly. Fixing millions of lines manually was not an option.
I’m running BackBox Linux. It is basically Ubuntu with features from Kali and Tails. It has pen-testing tools and “anonymous” mode which sends all traffic through Tor. So this should work on most flavors and especially Ubuntu.
Here’s an example from the MySpace data dump:
358274297:em@il:no1believer:0x3C820C6E5C0A45D922BAC3A216F393BAC2646CC3:”
358274301:em@il:358274301:0x4F972B0E0284DD6A478A83E7CC3CB04DC7A3132C:”
358274303:em@il:358274303:0x85023BC72A13542002BDA3F31355C0BAD2767B38:”
It isn’t exactly CSV, but uses a “:”. That’s fine, this method will work with whatever separator you want. We want the email address, username, and the password hash (sha1). We are gonna use awk to pull out those fields and reprint what we want. Reprinting it ensures it is all formatted the same. It also ensures that we know the column count. Here’s the command:
awk ‘BEGIN{FS=”:”} {print $2″:”$3″:”$4}’ MySpaceDump/MySpace.part2.ab.csv > MySpace.formatted.part2.ab.csv
Here’s a breakdown of the command:
- awk is a tool/language for processing text
- BEGIN{FS=”:”} says this file is separated with “:”
- {print $2″:”$3″:”$4} says we want to print columns 2, 3, and 4
- {print $2″:”$3″:”$4} also says to print “:” between columns
- MySpaceDump/MySpace.part2.ab.csv is the input file
- > MySpace.formatted.part2.ab.csv says where to output
Our output file looks like this:
em@il:358274314:0x9233CD2EC86BC1F998285BE69CC997ACCF84565B
em@il:358274315:0xB56B1F4A6EBF893091F6BC0A91A498F18B796AA4
em@il:358274316:0x15524616E0D8C2C83FCD0F2056A6CD140A050F6C
em@il:randooman:0x429AAB47B254D09CF5CBA9653466C3904DA5DD61
This will get rid of problems from lines with more or less columns than the rest of the file, blank lines, spaces in the file, lines formatted wrong, and other stuff that will break a CSV import.
Lot’s of files are going to have little, different issues when importing into MySQL. This command will hopefully catch all of them. That said, you should write down commands as you go. That way, you can make a script for next time.