I am trying to create an optimal workflow for importing CSV files into a database; the data must be transformed along the way so it ends up in a better database layout.
So far I have created some SQL tables that describe how to restructure the data, and with those I can transform my CSV files into a SQL file; let's call them routines. At first I tried direct INSERT statements, but that took forever, so generating a SQL file first seemed like an optimization.
However, the process is still fairly slow with CSV files of about 200,000 lines.
I guess another optimization could be to load my routines into variables (memory), so I run a query once per file instead of once per line?
Also, I have written it in PHP so far, which is probably not the best-suited language for this kind of file handling. Should I prefer Python or some other language?
Any other suggestions?
Lastly, it's a relative question, but how fast should I expect processing to be if I have five files of 150,000 lines each?
Depending on your database, consider bulk loading.
MySQL has LOAD DATA INFILE.
SQL Server has BULK INSERT.
They're very particular -- the data has to be in exactly the right format for them to work -- but they are typically orders of magnitude faster than running the rows through any external processor.
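As a sketch only -- the table name, file paths, and column list below are hypothetical, and you'd adjust the terminators to match your files -- the MySQL and SQL Server versions look like:

```sql
-- MySQL: load the whole file server-side in one statement.
LOAD DATA INFILE '/tmp/people.csv'
INTO TABLE test
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(name, address1, address2, city, province, zip);

-- SQL Server equivalent:
BULK INSERT test
FROM 'C:\tmp\people.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2);
```

Note that the file path is resolved on the database server, not the client, which is part of why these are so particular.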
200K rows isn't TOO large and shouldn't take that long, though it depends on the size of the rows. Converting the CSV to a SQL file and then running it should actually take as much time as -- possibly MORE than -- just iterating it one line at a time, at least if you are properly leveraging prepared queries through PDO with emulated prepares turned off.
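For reference, turning emulated prepares off is a connection option. A minimal sketch -- the DSN and credentials here are placeholders for your own:

```php
<?php
// Hypothetical MySQL connection -- swap in your own DSN and credentials.
$db = new PDO('mysql:host=localhost;dbname=test', 'user', 'password', [
    // Throw exceptions instead of failing silently.
    PDO::ATTR_ERRMODE          => PDO::ERRMODE_EXCEPTION,
    // Real server-side prepares: the statement is parsed once,
    // and only the bound values travel on each execute().
    PDO::ATTR_EMULATE_PREPARES => false,
]);
```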
Did you try using the POEM method? POEM -- Prepare Once, Execute Mostly -- not only protects you from injection, it can be WAY faster than slopping variables into your query strings.
Let's say $db is our PDO connection, $csvFile is a file handle from fopen(), and there are six fields per data row: a name, two lines of address, city, province, zip.

```php
// prepare ONCE, before the loop -- positional placeholders line up
// with the numeric array fgetcsv() hands back
$stmt = $db->prepare('
    INSERT INTO test (
        name, address1, address2, city, province, zip
    ) VALUES (
        ?, ?, ?, ?, ?, ?
    )
');

// execute MOSTLY -- read a line, run the query, that's it
while ($row = fgetcsv($csvFile, 2048)) {
    $stmt->execute($row);
}
```

No, seriously, that's the whole loop -- ONE LINE! The only work done per row is reading the line and executing the query; passing the array straight to execute() binds every value for that execution. That is a prepared statement's true power unleashed.
You might also be able to squeeze more speed out of it using transactions, but depending on the data size you may want to commit every so many records, so a single transaction doesn't go full hog on memory.
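A sketch of that chunked-transaction idea, using an in-memory SQLite database and generated rows so it runs anywhere -- swap in your own connection and the fgetcsv() loop; the table and batch size are illustrative:

```php
<?php
// In-memory SQLite stands in for the real database.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE test (name TEXT, city TEXT)');

// Prepare once...
$stmt = $db->prepare('INSERT INTO test (name, city) VALUES (?, ?)');

$batchSize = 500;   // commit every 500 rows to cap transaction size
$count     = 0;

$db->beginTransaction();
foreach (range(1, 2000) as $i) {          // stand-in for the CSV loop
    $stmt->execute(["name$i", "city$i"]); // ...execute mostly
    if (++$count % $batchSize === 0) {
        $db->commit();                    // flush this batch to disk
        $db->beginTransaction();          // start the next one
    }
}
$db->commit();                            // flush the final partial batch

echo $db->query('SELECT COUNT(*) FROM test')->fetchColumn(), "\n"; // 2000
```

Batching like this keeps each transaction's undo/redo footprint bounded while still avoiding the per-row commit overhead of autocommit mode.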