Discussion on "Column-oriented processing."

Taras Tsugrii · 2021-04-22T20:59:38.805Z

It's tempting to believe that there is a single unifying theory of everything and while we haven't found it yet, there are numerous principles shared between different disciplines responsible for significant performance wins. With this post I'd like ...

Thanks, Taras, for the article.

I have a few questions:

1. How does indexing work in columnar databases? What if I need to query 'Hire_date' info for Bob and Jim in a huge table? In an RDBMS, I could use a column index (B-tree, etc.) to fetch the data very quickly.
2. You mentioned "array of structs vs. a struct of arrays" for column-based data. IIUC, the customer has to do the proper mapping between columns on the client side. If you use incorrect indexes to map columns for a specific row, it will introduce hidden bugs that are very difficult to debug. I wonder what the best practice is for fetching column-based data.
3. How are join queries performed? What would be the joining criteria?

Thanks!

Great questions, Agshin!

1. Most columnar DBMSs do not use indexing, since they are focused on OLAP workloads, where it's common to scan massive amounts of data anyway. At the same time, it's entirely possible to use either a sort-based or a hash-based indexing scheme, depending on data cardinality and the specific use case. I actually plan to write about hash-based indexing in one of my future posts.
2. The only advice I have is to build hard-to-misuse APIs. For the most part, it's possible to hide the difference between the two layouts completely.
3. Joins are performed in a similar way to row-based DBMSs. Why would the criteria be any different? :)
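
To make the point about hard-to-misuse APIs a bit more concrete, here is a minimal Python sketch of a struct-of-arrays table that never exposes column offsets to the caller. The names `Employee` and `EmployeeTable` and the sample dates are hypothetical, made up for illustration; this is one possible shape for such an API, not how any particular DBMS does it.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass(frozen=True)
class Employee:
    """Row-shaped view handed to callers (hypothetical schema)."""
    name: str
    hire_date: str


class EmployeeTable:
    """Struct-of-arrays storage: one list per column, kept in lockstep."""

    def __init__(self) -> None:
        self._names: List[str] = []
        self._hire_dates: List[str] = []

    def append(self, row: Employee) -> None:
        # The only way to add data writes every column at once,
        # so the columns cannot drift out of alignment.
        self._names.append(row.name)
        self._hire_dates.append(row.hire_date)

    def __len__(self) -> int:
        return len(self._names)

    def row(self, i: int) -> Employee:
        # A single position is applied to all columns in one place,
        # instead of each caller mapping columns by hand.
        return Employee(self._names[i], self._hire_dates[i])

    def __iter__(self) -> Iterator[Employee]:
        return (self.row(i) for i in range(len(self)))

    def hire_dates(self) -> List[str]:
        # Column scans still read a single column; a copy is returned
        # so callers cannot mutate the underlying storage.
        return list(self._hire_dates)


# Illustrative sample data, not taken from the article.
table = EmployeeTable()
table.append(Employee("Bob", "2019-03-01"))
table.append(Employee("Jim", "2020-07-15"))
```

Callers work with `table.row(i)` or iterate over rows, while analytical code can still ask for a whole column via `table.hire_dates()`, so the layout stays an implementation detail.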

db.csail.mit.edu/projects/cstore/abadi-sigmod08.p… has more details on indexing and join strategies used in popular solutions.
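
As a rough illustration of the indexing and join answers, the sketch below builds a hash index over one column and reuses it for a plain equi-join between two columnar tables. All table contents and the helper names `build_hash_index` and `hash_join` are hypothetical, and real systems use far more elaborate strategies than this.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical columnar tables: every column is a separate list,
# and position i across the lists of one table forms row i.
emp_name = ["Bob", "Ann", "Jim", "Bob"]
emp_hire_date = ["2019-03-01", "2018-11-20", "2020-07-15", "2021-01-04"]
emp_dept_id = [10, 20, 10, 30]

dept_id = [10, 20]
dept_title = ["Storage", "Query"]


def build_hash_index(column: List) -> Dict[object, List[int]]:
    """Map each distinct value to the row positions where it occurs."""
    index: Dict[object, List[int]] = defaultdict(list)
    for pos, value in enumerate(column):
        index[value].append(pos)
    return index


# Point lookup: find the positions via the index on one column,
# then read only the needed positions from another column.
name_index = build_hash_index(emp_name)
hire_dates = {
    name: [emp_hire_date[pos] for pos in name_index.get(name, [])]
    for name in ("Bob", "Jim")
}


def hash_join(left_keys: List, right_keys: List) -> List[Tuple[int, int]]:
    """Equi-join on key equality; returns (left_pos, right_pos) pairs."""
    build = build_hash_index(right_keys)
    return [
        (left_pos, right_pos)
        for left_pos, key in enumerate(left_keys)
        for right_pos in build.get(key, [])
    ]


# The join criterion is the same equality predicate a row store would use.
pairs = hash_join(emp_dept_id, dept_id)
joined = [(emp_name[l], dept_title[r]) for l, r in pairs]
```

The join produces row positions rather than materialized rows, so other columns can be fetched afterwards only if the query actually needs them.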
