Tobias Müller DuckDB's read_csv does not fix the column order, right? So if agency.txt listed its columns as agency_name, agency_id this would fail, and likewise when the source GTFS file contains an extra column that is not specified in the table.
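A minimal sketch of the order-independence problem, using only Python's stdlib csv module (the file content, column list, and helper name are hypothetical, not from the badger repo): reading records by header name instead of by position tolerates both a swapped column order and an extra column.

```python
import csv
import io

# Hypothetical agency.txt with columns in a non-canonical order
# (agency_name before agency_id) plus an extra column not in our table.
agency_txt = (
    "agency_name,agency_id,agency_url,agency_timezone,agency_custom\n"
    "Example Transit,EX,https://example.org,Europe/Amsterdam,foo\n"
)

# Columns the target table expects, in its fixed order.
TABLE_COLUMNS = ["agency_id", "agency_name", "agency_url", "agency_timezone"]

def rows_in_table_order(fileobj, table_columns):
    """Yield rows re-ordered to match the table schema, keyed by header name.

    Extra source columns are silently dropped; a missing column raises
    KeyError so the mismatch is caught early instead of shifting values.
    """
    reader = csv.DictReader(fileobj)
    for record in reader:
        yield [record[col] for col in table_columns]

rows = list(rows_in_table_order(io.StringIO(agency_txt), TABLE_COLUMNS))
print(rows[0])  # ['EX', 'Example Transit', 'https://example.org', 'Europe/Amsterdam']
```

A purely positional loader would have put "Example Transit" into agency_id here; matching on header names sidesteps that entirely.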
For us GTFS is one of the file formats we handle (import and export), but since NeTEx is the standard for data exchange between the national access points within Europe, it makes sense to be able to process NeTEx as well. I have done so in different forms within DuckDB: creating a relational database based on the XML Schema, and using DuckDB as an 'advanced' key-value store with extra attributes per key, in various incarnations. For now my conclusion is that DuckDB has a set of significant and known issues when going beyond main memory. github.com/duckdb/duckdb/issues
Stefan de Konink
Support each other, copy Dutch wares!
The problem with doing it this way is that the order of fields is, or may be, different between datasets. We have created some Python glue that takes care of that. You can find our GitHub repo at MMTIS/badger; the file is gtfs_import_to_db.py.
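A sketch of what such glue might look like, assuming a canonical per-file column order (the schema dict, function name, and sample data are illustrative, not taken from gtfs_import_to_db.py): rewrite each GTFS CSV so its columns follow a fixed order before handing it to a positional loader.

```python
import csv
import io

# Hypothetical canonical schema per GTFS file; the real badger code may differ.
CANONICAL = {
    "agency.txt": ["agency_id", "agency_name", "agency_url", "agency_timezone"],
}

def normalize(filename, src, dst):
    """Rewrite a GTFS CSV so its columns follow the canonical order.

    Missing optional columns are emitted as empty strings; columns not in
    the canonical schema are dropped.
    """
    columns = CANONICAL[filename]
    reader = csv.DictReader(src)
    writer = csv.writer(dst)
    writer.writerow(columns)
    for record in reader:
        writer.writerow([record.get(col, "") for col in columns])

# Source file has the columns in a different order than the canonical one.
src = io.StringIO(
    "agency_name,agency_id,agency_url,agency_timezone\n"
    "Example Transit,EX,https://example.org,Europe/Amsterdam\n"
)
dst = io.StringIO()
normalize("agency.txt", src, dst)
print(dst.getvalue().splitlines()[0])  # agency_id,agency_name,agency_url,agency_timezone
```

After this pass every dataset presents the same column order, so the downstream import no longer depends on how the producer happened to arrange the file.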
While we initially did all our processing in DuckDB, the issues we found and reported when processing huge non-CSV files made us look at alternatives. For CSV processing, DuckDB still makes sense.