Hi,
For those you have experience with databases for big data (RDBMS), I have a question what is on the back of my head for a long time.
If you have a table called user with a status field, do you prefer for it to be (or the best in terms of performance) to be defined:
`status` TINYINT(1) UNSIGNED NOT NULL DEFAULT 0 COMMENT '0=blocked,1=active,2=dead'
or:
status VARCHAR(20) NOT NULL DEFAULT 'blocked' COMMENT 'blocked,active,dead',
Of course you can still have a relation to another table where you have those status defined. But in this case is not important.
Personally I've been always a big fan of the first one.
Between these two what is better in terms of performance? The db that I'm using is MySQL 5.7, but this is more like a generic RDBMS question.
Or actually use ENUM() ?
Thanks
I like the idea of
ENUMs because they're explicit, whereas depending on a comment is pretty unfair to the user. Internally,ENUMis stored as 1 or 2 bytes, depending on the number of possible values (TINYINTis 1 byte). So lookups are going to be only slightly slower thanTINYINT(an additional lookup to the information schema to determine the mapping to the right byte value to look for).However, there are lots of caveats about
ENUMimplementations, and those often push people to other arrangements: In MySQL,ALTER TABLEon anENUMcan be expensive: the table locks, everything might get re-keyed... kinda bad. In PostgresSQL, you just can't alter anENUM. You create a new column with the new set of options, copy values over, drop original column. Goodness help you if you've got indexes involved. (Although I hear that's changing? Or maybe it already has? Turns out as of 9.2 you can modify 'em! So probably about the same penalty as MySQL.)So a lot of it will come down to your use case. If you're pretty confident the value set is complete or if you have the flexibility to manage an
ALTER TABLEand its implications, considerENUMfor clarity.But the third option, which you kindof mention, is the best: create a values table. Something like
user_status_typeswith key column and label column (labels "blocked", "active", etc), and make it a foreign key reference of thestatuscolumn in whatever other table(s). You can make the key aTINYINT, so space is still efficient. You still have a table lookup similar to theENUMinformation schema lookup, but you have the flexibility to mess with the table without penalty, and it's still inherently clear to the user.(
VARCHARis ... an option. Document databases enforce the schema at the application layer, as you'd be doing withVARCHAR, so it's a thing I guess, just not very RDMBS-like. The above blows that out of the water for performance, memory, clarity in the data layer, etc)