I am trying to create an optimal workflow for importing CSV files into a database; the data must be transformed along the way so it ends up in a better database layout.
So far I have created some SQL tables that describe how to restructure the data, and with those I can transform my CSV files into a SQL file; let's call them routines. At first I tried direct INSERT statements, but that took forever, so generating a SQL file first seemed like an optimization.
However, the process is still fairly slow with CSV files of about 200,000 lines.
I guess another optimization could be to load my routines into variables (memory), so I run a query once per file instead of once per line?
Also, I have written it in PHP so far, which is probably not the best-suited language for this kind of file handling. Should I prefer Python or some other language?
Any other suggestions?
Lastly, it's a relative question, but how fast should I expect processing to be if I have five files of 150,000 lines each?
Depending on your database, consider bulk loading.
MySQL has LOAD DATA INFILE.
SQL Server has BULK INSERT.
They're very particular -- the data has to be in exactly the right format for them to work -- but they are typically orders of magnitude faster than running the rows through any external processor.
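As a sketch only -- the table name, file paths, and column list below are hypothetical, and you'd adjust the terminators to match your files -- the MySQL and SQL Server versions look like:

```sql
-- MySQL: load the whole file server-side in one statement.
LOAD DATA INFILE '/tmp/people.csv'
INTO TABLE test
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(name, address1, address2, city, province, zip);

-- SQL Server equivalent:
BULK INSERT test
FROM 'C:\tmp\people.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2);
```

Note that the file path is resolved on the database server, not the client, which is part of why these are so particular.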
200K rows isn't TOO large and shouldn't take that long, though it depends on the size of the rows. Converting the CSV to a SQL file and then running it should actually take as much time as -- possibly MORE than -- just iterating it one line at a time, at least if you are properly leveraging prepared queries through PDO with emulated prepares turned off.
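For reference, turning emulated prepares off is a connection option. A minimal sketch -- the DSN and credentials here are placeholders for your own:

```php
<?php
// Hypothetical MySQL connection -- swap in your own DSN and credentials.
$db = new PDO('mysql:host=localhost;dbname=test', 'user', 'password', [
    // Throw exceptions instead of failing silently.
    PDO::ATTR_ERRMODE          => PDO::ERRMODE_EXCEPTION,
    // Real server-side prepares: the statement is parsed once,
    // and only the bound values travel on each execute().
    PDO::ATTR_EMULATE_PREPARES => false,
]);
```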
Did you try using the POEM method? POEM -- Prepare Once, Execute Mostly -- not only protects you from injection, it can be WAY faster than slopping variables into your query strings.
Let's say $db is our PDO connection, $csvFile is a file handle from fopen(), and there are six fields per data row: a name, two lines of address, city, province, zip.

```php
// prepare ONCE, before the loop -- positional placeholders line up
// with the numeric array fgetcsv() hands back
$stmt = $db->prepare('
    INSERT INTO test (
        name, address1, address2, city, province, zip
    ) VALUES (
        ?, ?, ?, ?, ?, ?
    )
');

// execute MOSTLY -- read a line, run the query, that's it
while ($row = fgetcsv($csvFile, 2048)) {
    $stmt->execute($row);
}
```

No, seriously, that's the whole loop -- ONE LINE! The only work done per row is reading the line and executing the query; passing the array straight to execute() binds every value for that execution. That is a prepared statement's true power unleashed.
You might also be able to squeeze more speed out of it using transactions, but depending on the data size you may want to commit every so many records, so a single transaction doesn't go full hog on memory.
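A sketch of that chunked-transaction idea, using an in-memory SQLite database and generated rows so it runs anywhere -- swap in your own connection and the fgetcsv() loop; the table and batch size are illustrative:

```php
<?php
// In-memory SQLite stands in for the real database.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE test (name TEXT, city TEXT)');

// Prepare once...
$stmt = $db->prepare('INSERT INTO test (name, city) VALUES (?, ?)');

$batchSize = 500;   // commit every 500 rows to cap transaction size
$count     = 0;

$db->beginTransaction();
foreach (range(1, 2000) as $i) {          // stand-in for the CSV loop
    $stmt->execute(["name$i", "city$i"]); // ...execute mostly
    if (++$count % $batchSize === 0) {
        $db->commit();                    // flush this batch to disk
        $db->beginTransaction();          // start the next one
    }
}
$db->commit();                            // flush the final partial batch

echo $db->query('SELECT COUNT(*) FROM test')->fetchColumn(), "\n"; // 2000
```

Batching like this keeps each transaction's undo/redo footprint bounded while still avoiding the per-row commit overhead of autocommit mode.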