komlenic.com

making noise since 1977

8 Reasons Why MySQL's ENUM Data Type Is Evil

« | Wed March 2, 2011 | comments and reactions | permanent link | »

MySQL's ENUM data-type is a hot spot that sometimes generates strong opinions among developers. At first glance it seems like an efficient solution for declaring a set of permitted values from which only one can be chosen for each record. A good example is a having a table of countries with a 'continent' column: every country belongs to a continent and the seven continents aren't likely to change anytime soon. Yes, one day North America may collide with Asia to form Noramersia, but if your database somehow survives that long at least you won't be around for the meeting on how to restructure your tables. That will be some new guy's problem.

Now, If using ENUM were our only option for representing what continent a country belonged to, we could all move on to debating the merits of NoSQL or whether Git was better than SVN, or why your favorite framework sucks. But there is a superior option to ENUM in most cases: the lookup/joined/reference table. (We'll call them reference tables.) It's pretty simple:

Comparison of ENUM vs Reference Table

Wikipedia describes a reference table as:

...a table into which an enumerated set of possible values of a certain field data type is divested. For example, in a relational database model of a warehouse the entity 'Item' may have a field called 'status' with a predefined set of values such as 'sold', 'reserved', 'out of stock'. In a purely designed database these values would be divested into an extra entity or Reference Table called 'status' in order to achieve database normalization.

So, with reference tables available as an option to represent an enumeration, let's look at why the ENUM data-type is evil:

1. Data isn't being treated like data.

Male/Female, Mr/Mrs/Ms, Africa/Asia/etc: these bits of text that people use ENUM columns for are data. When you use an ENUM column, you're technically moving data from where it belongs (in actual database fields), to somewhere it doesn't (into the database metadata, specifically a column definition). This is different than putting constraints on the data, which is what we are doing when we say that a numeric column can only hold an integer, or that a date column can't be null - that's fine and quite necessary. With an ENUM we're actually storing pieces of data in a place that was only intended to hold crucial information about the model. In short, an ENUM column violates the rules of normalization. This may seem academic or pedantic, but it is actually why a lot of the other reasons on this list can be problems!

2. Changing the member list of ENUM columns is very expensive.

Invariably, what happens is this: you create an ENUM column and say "no way NEVER will this list change or need added to". But humans are really poor at estimating the entire scope of something, and even worse at predicting the future. R&D dreams up a whole new product type. Your company adds another shipping method. North America crashes into Asia.

The problem is that changing the member list for an ENUM column restructures the entire table with ALTER TABLE, which can be very expensive on resources and time. If you have ENUM('red', 'blue', 'black') but need to change it to ENUM('red', 'blue', 'white'), MySQL needs to rebuild your table and look through every record to check for the now-invalid value 'black'. MySQL is literally dumb and will even do this when all you did was add a new value to the end of the member list! (It is rumored that appending an ENUM member list will be handled better in the future, but I doubt that this is a high priority feature.)

A full-table rebuild may not cause much pain on a small table, but on a large one it is possible to peg your resources for a long time. If you use a reference table instead of ENUM, changing the list is as simple as INSERT, UPDATE, or DELETE, laughably-cheap operations by comparison. It's also important to note that when altering an ENUM member list, MySQL converts any existing record values that are not included in the new ENUM definition to ' ' (an empty string). With a reference table, you have greater flexibility when renaming or eliminating list choices (more on this below).

3. It's impossible to add additional attributes or related info.

Adding related info to a reference tableThere just isn't any sane way to add related information to an ENUM column, which is a common scenario that often comes up. In our country/continent example, what happens when we need to store something like land area for each continent? We didn't envision needing that attribute, but now we do. With a reference table, we can simply extend the continent table to include a 'land_area' column and query this new data any way we would like. With an ENUM? Forget it.

Other awesome flexibilities exist due to the ability to easily extend a reference table. One common scenario is adding a column to set a flag to denote whether a choice in the reference table is discontinued. So, when your company stops selling black widgets, you can add an 'is_discontinued' column to the reference table and flag the old 'black' row. You can still query a list of currently offered colors, and maintain info about all your old orders of black widgets! Try that with an ENUM column.

4. Getting a list of distinct ENUM members is a pain.

A very common need is to populate a select-box or drop down list with possible values from the database. Like this:

Select color:

If these values are stored in a reference table named 'colors', all you need is: SELECT * FROM colors ...which can then be parsed out to dynamically generate the drop down list. You can add or change the colors in the reference table, and your sexy order forms will automatically be updated. Awesome.

Now consider the evil ENUM: how do you extract the member list? You could query the ENUM column in your table for DISTINCT values but that will only return values that are actually used and present in the table, not necessarily all possible values. You can query INFORMATION_SCHEMA and parse them out of the query result with a scripting language, but that's unnecessarily complicated. In fact, I don't know of any elegant, purely SQL way to extract the member list of an ENUM column.

5. ENUM columns may only offer limited or negligible effects on optimization.

The usual justifications for using ENUM, are centered around optimization, in the conventional sense of performance gains, and sometimes in the sense of simplifying a complicated model to be more comprehensible.

Let's look at performance. You can do a surprising number of un-optimized things with databases, but most won't affect performance until a certain scale is reached, and often our applications are never asked to scale up that far. This is important to remember because DB devs should aspire to design fully-normalized and only de-normalize when a performance problem becomes real. If you're concerned that a reference table is going to slow things down, benchmark it out both ways in your unique application on an actual dataset (or a realistic high-estimate fake dataset) and see. Just don't automatically assume a join or a reference table is going to be a bottleneck, because it probably isn't. (There is also evidence to support that ENUM isn't always appreciably faster than alternatives.)

The second optimization argument for ENUM is that it reduces the number of tables and foreign keys in your database. This is a valid argument, in the sense that it's one more little box joined to another box with some lines, and in large systems the effect of normalization can already tax the limits of human comprehension and complicate queries. This is however, why we make models, and why those models employ abstraction so we can understand them. Go ahead and draw up a new representation of your model or ER diagram that leaves out some of the little details and reference tables. Sometimes it may just seem easier to use an ENUM, but the fact that you think another reference table makes things too complicated isn't a good reason by itself.

6. You can't reuse the member-list of an ENUM column in other tables.

When you create a list of possible members in an ENUM column, there's no easy and consistent way to re-use that list in other tables. With a reference table, the same set of data can be related to as many other tables as required. Changing the list in the lone reference table, will change the available options in every other table that it is linked or joined to.

A reference table can easily be linked to multiple tables

With separate ENUM columns, you would have identical duplicate member lists on two different columns in at least two different tables (that would all require consistent updating).

7. ENUM columns have noteable gotchas.

Suppose you have ENUM('blue', 'black', 'red') and you attempt to insert 'purple': MySQL actually truncates the illegal value to  ' ' (an empty string).  This is correct, but if we had used a reference table with a foreign key, we would have more robust data integrity enforcement. 

Also, MySQL stores enum values internally as integer keys to reference ENUM members. It's easy to end up referencing the index instead of the value and vice-versa.  Consider:

CREATE TABLE test (foobar ENUM('0', '1', '2'));

mysql> INSERT INTO test VALUES ('1'), (1);
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> SELECT * FROM test;
+--------+
| foobar |
+--------+
| 1      |
| 0      |
+--------+
2 rows in set (0.00 sec)

We inserted '1' (a string), and accidentally also inserted 1 (as a number, without quotes).  MySQL correctly (but confusingly) uses our number input as an internal reference to the first item in the member list (which is actually the value '0').

8. ENUM has limited portability to other DBMS.

The ENUM data type isn't standard SQL, and beyond MySQL not many other DBMS's have native support for it. PostgreSQL, MariaDB, and Drizzle (the latter two are forks of MySQL anyway), are the only three that I know of. Should you or someone else want to move your database to another system, someone is going to have to add more steps to the migration procedure to deal with all of your clever ENUMs. If it's you, you'll undoubtedly feel less clever than you once did - and if it's someone else, they may not like you. Generally, migrating to a different database system is something that just doesn't happen that often and everybody assumes will bring out demons anyway, which is why this just squeaks in at number 8 on the list.

Criteria for when it might be okay to use enum:

1. When you're storing distinct, unchanging value sets...

A fairly good example that meets this criteria is our list of the continents. These are well-defined. Other commonly-given examples are salutations: Mr/Mrs/Ms, or playing card suits Spades/Hearts/Diamonds/Clubs. However, consider that even these examples have scenarios where you may need to extend the member list (such as when someone demands that you now need a 'Dr.' salutation, or when your card game app needs to accommodate a non-suited card like the Joker).

AND 2. You will never need to store additional related info...

Consider again Spades/Hearts/Diamonds/Clubs. There are popular card games that rely on the fact that clubs/spades are black and hearts/diamonds are red (Euchre, for example.) What happens when we need to store additional info related to suit, such as its color? If we had used a reference table, it would be a trivial matter to add this color data to the reference table in an additional column. If we use an ENUM to represent suit, it becomes much more difficult to represent the color/suit model accurately, and we're going to have to enforce it on the application level.

AND 3. The member list will contain more than 2 and less than 20 items.

If you're using an ENUM for only two values, you can always replace the ENUM with a very efficient TINYINT(1) or the even-better BIT(1) available since MySQL 5.0.3. For example: gender ENUM('male', 'female') can be changed to: is_male BIT(1). When you only have two choices, they can always be expressed as a Boolean true/false by prepending "is" to one of the member strings and renaming the column. As for less than 20: Yes, ENUM can store up to 65,535 values. No, you shouldn't try it. More than 20 becomes unwieldy and certainly more than 50 is just insane to manage and work with.

If you really still want to use ENUM:

1. Never use numbers as enum member values.

There's a reason ENUM is a string data type. Not only should you be using a numeric data type to store numbers, but ENUM has some well-documented gotchas related to the fact that  MySQL references ENUM members internally using a numerical index.  (See #7 above.)  Just don't ever store numbers in an ENUM data type, ok?

2. Consider using strict mode.

Strict mode will at least throw an error when you try to insert an invalid value into an ENUM column. Otherwise only a warning is thrown and the value is simply set to an empty string ' ' (referenced internally as 0). Note: Errors can still be suppressed in strict mode if you use IGNORE.

Conclusion

Do what makes sense from a development/maintenance perspective, and optimize only once a performance problem becomes real - in most cases that is a strong argument for using reference tables over MySQL's ENUM datatype.

There is no doubt that the grail of efficiency leads to abuse. Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.

Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified.  - Donald Knuth

blog comments powered by Disqus