Thursday, March 04. 2010
This is a rebuttal to depesz's charx, varcharx, varchar, and text and David Fetter's varchar(n) considered harmful. I respect both depesz and David and in fact enjoy reading their blogs. We just have deferring opinions on the topic.
For starters, I am pretty tired of the following sentiments from some PostgreSQL people:
I don't like these sentiments because, it gives the perception of PostgreSQL users as a bunch of prissy people who expect everyone to step up to their level of misguided sophistication. We shouldn't cater to the lowest common denominator always, but when the lowest common denominator approach is easier to write, easier to read, and achieves the same purpose, why should people be forced to do it any other way. Domains and triggers are great things, but they are overkill in most cases and are harder to inspect.
I think depesz nails it on the head when he points out the main reason why varchar(x) is worse than text in PostgreSQL. It is because if you have a big table, changing the size of a varchar field takes eons in PostgreSQL. This is not so much the case with other databases we have worked with. To me this is not a problem with everyone else, but a problem with PostgreSQL, and something that should be improved on. In fact you would think this would be simpler in PostgreSQL than any other database since varchar and text are implemented essentially the same. The main difference it seems is that a text field always results in creation of an accompanying toast table where as varchar doesn't always.
Instead newcomers are patted on the head and told You young'in, just use text and throw a domain around it if you want to limit size.
I am not going to write a domain or trigger everytime I have to limit the size of a field, for several reasons and listed in priority.
So in short I expect my relational database, not only to provide me good performance, but to provide me structure and ease of use as well. I like my crappy tools. My tools do what they are designed to do and I expect PostgreSQL to do the same without me having to jump thru hoops. I want easy walls I can put up and tear down. I choose my foreign key constraints and my field sizes carefully -- because they do more than just perform well. They are self-documenting objects that can draw themselves on a screen and are always consistent with the simple rules I set up. If I set a cascade update/delete, it means I expect the child data to go away when the parent does. If I set a foreign key with restrict delete, it means if the child has data, then the parent is way too important to be allowed to be deleted lightly. If I use a varchar(x) -- it means I want a max of x characters in that field. If I use a char(n), it means I expect the data to be exactly (n) characters long - which admittedly is rare.
Display comments as (Linear | Threaded)
Very valid points Leo, thanks for taking the time to write this. I came to PostgreSQL from MySQL - The "switch" was daunting enough, it took several months for me to realize how to start doing things the Postgres way. If the features weren't somewhat similar, and the data-types somewhat transferable the switch would have been much, much more difficult.
I will agree that it *should* be faster to alter the length of a varchar, near instant in fact. We should fix that.
I've spent too much time having to futz with field sizes because of changing requirements or bad guesses as to the maximum size, especially when it often doesn't really matter as far as the db is concerned.
Therefore I now much prefer to simply use text and not constrain the length at all in the schema. I instead enforce that at the application layer.
I can already hear the "but, but...", but max length is just not something I worry about. Anyone who would want to enter 100 names into my name field will not be having direct access to my db. Use the nice app I've written, thank you.
Aside from that I am constraint crazy. But max length just changes too much, and seems pointless if you're going to sidestep that by setting it way too high anyway.
Do it at the app layer and be done with it. And screw all that work in making domains and such. In the 4 places it's that important, use a CHECK().
On a side note, I got a nice giggle out of the "tired of the following sentiments..." bit, as it definitely sounds like many PG people I know ;-)
That would be all well and good in some cases, except that in many of our cases, we have the same database feeding 4 or 5 applications in different locations.
Its easier to centralize that in the database for us rather than coding it 5 times. If you have to feed to government systems with their silly regulations of size, its nice to know that your structure already meets the criteria and you don't need to check it when you pass it off.
First of all, a bit offtopic - could you s/hubert/depesz/ig ?
Second - I absolutely agree that it should be fixed in Pg. I actually feel ashamed when I have to explain this to someone - especially someone who knows that the binary format is the same.
As for tools - I don't have this particular itch, as my tool of choice is psql, but I do feel your pain, and I think it's a valid point when considering your datatypes and schema.
And lastly - all my advices, and posts are generally targeted at new users of Pg - experiences dbas, know that there is no rule without exceptions. And I still believe that newbies should stick with text - as long as it will take our fine developers to fix size altering - just to avoid the problem when they will need to make this space for names a little bit longer.
Fixed well for this article anyway :). I don't agree newbies are better with text particularly those new to databases, because it teaches new PostgreSQL users to focus on implementation details of a product. For me, if I'm using PostgreSQL as a tool in teaching relational databases, which I really hope a lot of university's will do, because it is a great example of a good relational database with sound standard ANSI support, I don't want them to focus on teaching PostgreSQL specific work-arounds for problems. When the fine PostgreSQL developers improve on it, that lesson will be obsolete anyway.
I think its one of those things like you said you have to weigh, and in most cases, its rare I have to increase the size of a field, so it only irks me when that rare moment happens. But I think as you said putting it out there as a caution is good.
I think the real problem stems from a conflict between relational theory and the sql standard.
Your claim is that the max size of a field is the most important characteristic, but I don't believe that is true, it only seems that way. People have long needed to add max boundary limits based on performance / storage concerns, as an implementation detail of the underlying systems.
This means that the tools around max length have become very pervasive and easy to use; so even "crappy tools" and "lowest common denominator systems" have pretty standard meta-data oriented support for max length. This makes it easy to declare this piece of metadata, and so the cycle continues.
But's lets pretend that max size was the most important characteristic. Even then, what is the right size for a first_name field? 10 characters? 20? 100? It's pretty arbitrary really, and setting it to _anything_ makes the application worse (by introducing arbitrary limits). You really do want an unbounded type, but the sql standard doesn't give you varchar() or varchar(max). If you want to add varchar, you are stuck making it "hopefully long enough". Poor data modeling indeed.
So, I'd love to see varchar() adopted by sql standard, and then have all of the tools support it to mean unbounded varchar data; but until that happens varchar is no more correct than text.
Many of us don't care about being absolutely correct, but being fuzzy correct.
By that I mean if you dragged a 1000 people in a room and asked them what should be the max length of a first name,
after about 150 - no one would be raising their hands. So in short somewhere between 20 and 100 is your optimal fuzzy correct answer.
What is the max length of a tweet. Well it has a maxlength right, its not infinite. The max length seems somewhat arbitrary but the fact it exists
and is under n characters has benefits. It forces people to just state the most critical parts, it prevents information overload, and from a
storage/band width perspective, it means you can easily estimate the size of a page with 1000 tweets.
That brings up the performance benefit of limiting text. I can better predict the bandwidth performance of my queries
which is particularly important for web paging or if you have to transfer data to a third party source.
It also means unsuitably large text is not going to mess up the look of my page by taking up two lines instead of 1.
The problem is that while max-length is a sometimes-useful restriction on a field, it is not sufficient. For example, you say:
"It also means unsuitably large text is not going to mess up the look of my page by taking up two lines instead of 1."
But, it actually does not mean anything of the sort. If you've set varchar(15), I can easily put 5 linebreak characters ... or 4 ... in there, messing up your formatting. So max-length doesn't really help you *at all* there. You need a regex mask.
What VARCHAR(#) does is give the database designer the *illusion* of control over input without actually giving him/her any real control.
That part of the standard, like numeric primary keys, dates to the storage limitations of the databases of the 1980's, when counting characters was very important since you only had 16MB of disk space. It's quite outdated now.
All that being said, the awkward handling of subscript data types (like VARCHAR(#)) in ALTER TABLE in PostgreSQL is and has been a TODO. And it would be really good to fix it, it just requires a substantial refactoring of some very messy code.
Its good to hear that this is on the TODO. Getting back to your point about *illusions*.
First the issue you describe about people stuffing new lines in a size constrained input field rarely happens for 2 reasons. 1) The input fields don't allow new lines unless they are textarea.
2) Even when you html escape text (or at least my crappy special html escape function doesn't), it keeps new line as is which promptly gets ignore by the html standard
unless you PRE them.
As a side benefit -- Those people trying to post their own data with their own forms
will be trapped by the database when they try to stuff in a 1000 character name in a field that only accepts 20 characters. So to me -- its not an illusion, its
an extra security blanket. Its not a perfect blanket, but its a blanket.
On a side note, I can't tell you the countless number of times I have been saved when someone gives a me a data feed and they happened to
innocently switch the order of the first name field and a long unconstrained description text field. The import throws an error and sends me an email. I don't need to put in any extra filters and all that
to get this benefit. I don't have to wonder why suddenly my page is randomly slower because 100 of my records have big chunks of text in the first name field that I had assumed to be short
and safe to display in a list.
I will reiterate, I am not looking for absolute control of input, but what I consider good enough control without lifting too many fingers.
Unconstrained text just doesn't do that for me and it doesn't do it for a lot of people I have talked to. You can argue about limitations
of the 1980s, but some limitations are good. Storage -- particularly bandwidth is still a concern. Its still a concern on my crappy phone. Its still concern for my
limited mind who doesn't want to decipher the meaning of a 1000 character first name field :).
A relational database is a 40 year old limiting technology, but I still choose it in most cases over less limiting
NoSQL databases. I want limits even if you consider they are self-imposed ones.
I agree, schema should be meaningful.
If you want to use TEXT to capture a field that can never contain anything more than a 4 character alpha code, then likewise you should have no problem using NUMERIC to capture anything that contained nothing but digits. You realize, this becomes a regressive pattern: why not just use text for each field? Even primary keys using BIGSERIALs can be modified to use DEFAULT NEXTVAL('silly_id_seq')::TEXT!
Or you could just create a single field of TEXT and dump your content in as XML or JSON structures...
Ah heck, lets just dump it all in a flat file and write our own data parser and search tools....
No - not for me.
Arguments like "I underestimated the width" are not valid. The speak more to the lack of domain knowledge than to the limitation of the database. Having to put up with waiting for a column resize to complete is a natural consequence of not having done the proper research ahead of time.
You will be a better engineer having learned the lesson.
I agree with all you are saying. In fact its rare that I have to change the column size and when I do, its because the business meaning has changed. Like if a client code field needs to be increased, its because we bought out another company or something and need to prefix a division in there.
What irritates me most is not the slowness of the resizing, but this attitude that some people have that just because some of us see the value of limiting sizes of fields, some time machine from the cave ages must have accidentally dropped us off in their century and we need to be redumacated :)
This is exact one of my two reasons for which I use varchar(n) instead of text:
"The size gives auto form generators a sense of the width to make a field and limits to set on what can be put in it."
Another reason is, as already mentioned by other folks, the ODBC-like interface component I use converts varchar and text to different user interface properties.
Problems with the implementation.
1. char() not preserving space - yes docmented but catches people out who ue multiple vendors
2. expressions which come back as text and often in ODBC drivers as a long varchar which to a dynamic sql engine is essentially a lob. eg. select 'A' || C1 || 'Z' from T and so on
Some 'applicance' vendors who have postres in their 'roots' seem to have changed their engines to not carry this idiom forward but a few still have it.
Similar comments if you want to compare ANSI/ISO interval and timestamp issues re TZ which can catch folks out.
Syndicate This Blog
Show tagged entries
Remote RSS/OPML-Blogroll Feed
No RSS/OPML feed selected