For starters, I am pretty tired of the following sentiments from some PostgreSQL people:
99% of the people who choose varchar(x) over text in PostgreSQL in most cases
are just ignorant folk and don't realize that text is just as fast if not faster than varchar in PostgreSQL.
stuff your most despised database here compatibility is not high on my priority list.
It is unfortunate you have to work with the crappy tools you work with that can't see the beauty in PostgreSQL text implementation.
Just get something better that treats PostgreSQL as the superior creature it is.
I don't like these sentiments because, it gives the perception of PostgreSQL users as a bunch of prissy
people who expect everyone to step up to their level of misguided sophistication.
We shouldn't cater to the lowest common denominator always, but when the lowest common denominator
approach is easier to write, easier to read, and achieves the same purpose, why should people be forced to do it any other way.
Domains and triggers are great things, but they are overkill in most cases and are harder to inspect.
I think depesz nails it on the head when he points out the main reason why varchar(x) is worse than text in PostgreSQL.
It is because if you have a big table, changing the size of a varchar field takes eons in PostgreSQL. This is not so much the case with other databases we have worked with.
To me this is not a problem with everyone else, but a problem with PostgreSQL, and something that should be improved on. In fact you would think this would be simpler in
PostgreSQL than any other database since varchar and text are implemented essentially the same. The main difference it seems is that a text field always results in creation of an accompanying toast table
where as varchar doesn't always.
Instead newcomers are patted on the head and told
You young'in, just use text and throw a domain around it if you want to limit size.
I am not going to write a domain or trigger everytime I have to limit the size of a field, for several reasons
and listed in priority.
Easily to access and self-enforcing meta data is important. In many cases, its more important than performance.
When I look at a CREATE TABLE statement or open up a table in a designer, the size there gives a sense of what goes in there. One can
query practically any databases system tables (and the ANSI compliant ones have information_schema.columns) and get the size of a field
with a single query. This simplicity in portability is a beautiful thing. This simplicity is important for people developing cross database tools.
gives a sense of what is safe to select and what is not. The size gives auto form generators a sense of the width to make a field and limits to set
on what can be put in it. It allows the tool to quickly determine
whether to put in a regular text box or a text area.
varchar(...) forces everyone to state this exactly the same across all databases. I don't need to figure
out how to make sense of the logic embedded in a domain object or trigger or the fact that people can state the same thing differently with domains.
If you don't care about the ease with which tool developers can extend their tools to work with your DBMS or making
the common place needs easy, then you don't
care about growing your community.
Using domains when you are talking about 20 fields is way too much typing if all the fields need to be different sizes.
varchar(x) has a semantic meaning that is easy to define in 2 characters or less. Domains do not.
And to David -- "In the real world things have min sizes as well as max sizes"
Yes, but maxlength in most cases is the most important thing. So important that it dwarfs the relevancy of other minutia.
Most cases we don't bother with all the other nuances because it clutters the most important thing MAX LENGTH. Less is better than more.
I don't want people stuffing
the kitch sink in a first name field. Many people like me have t100s of first names they got in a naming ceremony.
If they want to call themselves ! or @ or some 4 letter word, I really don't care,
but if you are known by a 100 different names, I really only want to know about one or two of them, not all of them :)
I am a lazy control freak who loves self-documenting structures. I like order. I like things done consistently and I don't expect to work hard to get that. Being
a lazy control freak means I have more time to trick others and machines into doing my work for me so I can lounge around sipping long island iced teas :).
So in short I expect my relational database, not only to provide me good performance, but to provide me structure and ease of use as well.
I like my crappy tools. My tools
do what they are designed to do and I expect PostgreSQL to do the same without me having to jump thru hoops.
I want easy walls I can put up and tear down. I choose my foreign key constraints and my field sizes carefully
-- because they do more than just perform well. They are self-documenting objects that can draw themselves on a screen and are always
consistent with the simple rules I set up.
If I set a cascade update/delete, it means I expect the child data to go away when the parent does. If I set a foreign
key with restrict delete, it means if the child has data, then the parent is way too important to be allowed to be deleted lightly.
If I use a varchar(x) -- it means I want a max of x characters in that field.
If I use a char(n), it means I expect the data to be exactly (n) characters long - which admittedly is rare.
Very valid points Leo, thanks for taking the time to write this. I came to PostgreSQL from MySQL - The "switch" was daunting enough, it took several months for me to realize how to start doing things the Postgres way. If the features weren't somewhat similar, and the data-types somewhat transferable the switch would have been much, much more difficult.
I will agree that it *should* be faster to alter the length of a varchar, near instant in fact. We should fix that.
I've spent too much time having to futz with field sizes because of changing requirements or bad guesses as to the maximum size, especially when it often doesn't really matter as far as the db is concerned.
Therefore I now much prefer to simply use text and not constrain the length at all in the schema. I instead enforce that at the application layer.
I can already hear the "but, but...", but max length is just not something I worry about. Anyone who would want to enter 100 names into my name field will not be having direct access to my db. Use the nice app I've written, thank you.
Aside from that I am constraint crazy. But max length just changes too much, and seems pointless if you're going to sidestep that by setting it way too high anyway.
Do it at the app layer and be done with it. And screw all that work in making domains and such. In the 4 places it's that important, use a CHECK().
On a side note, I got a nice giggle out of the "tired of the following sentiments..." bit, as it definitely sounds like many PG people I know ;-)
Jon Erdman (aka StuckMojo)
That would be all well and good in some cases, except that in many of our cases, we have the same database feeding 4 or 5 applications in different locations.
Its easier to centralize that in the database for us rather than coding it 5 times. If you have to feed to government systems with their silly regulations of size, its nice to know that your structure already meets the criteria and you don't need to check it when you pass it off.
First of all, a bit offtopic - could you s/hubert/depesz/ig ?
Second - I absolutely agree that it should be fixed in Pg. I actually feel ashamed when I have to explain this to someone - especially someone who knows that the binary format is the same.
As for tools - I don't have this particular itch, as my tool of choice is psql, but I do feel your pain, and I think it's a valid point when considering your datatypes and schema.
And lastly - all my advices, and posts are generally targeted at new users of Pg - experiences dbas, know that there is no rule without exceptions. And I still believe that newbies should stick with text - as long as it will take our fine developers to fix size altering - just to avoid the problem when they will need to make this space for names a little bit longer.
Fixed well for this article anyway :). I don't agree newbies are better with text particularly those new to databases, because it teaches new PostgreSQL users to focus on implementation details of a product. For me, if I'm using PostgreSQL as a tool in teaching relational databases, which I really hope a lot of university's will do, because it is a great example of a good relational database with sound standard ANSI support, I don't want them to focus on teaching PostgreSQL specific work-arounds for problems. When the fine PostgreSQL developers improve on it, that lesson will be obsolete anyway.
I think its one of those things like you said you have to weigh, and in most cases, its rare I have to increase the size of a field, so it only irks me when that rare moment happens. But I think as you said putting it out there as a caution is good.
I think the real problem stems from a conflict between relational theory and the sql standard.
Your claim is that the max size of a field is the most important characteristic, but I don't believe that is true, it only seems that way. People have long needed to add max boundary limits based on performance / storage concerns, as an implementation detail of the underlying systems.
This means that the tools around max length have become very pervasive and easy to use; so even "crappy tools" and "lowest common denominator systems" have pretty standard meta-data oriented support for max length. This makes it easy to declare this piece of metadata, and so the cycle continues.
But's lets pretend that max size was the most important characteristic. Even then, what is the right size for a first_name field? 10 characters? 20? 100? It's pretty arbitrary really, and setting it to _anything_ makes the application worse (by introducing arbitrary limits). You really do want an unbounded type, but the sql standard doesn't give you varchar() or varchar(max). If you want to add varchar, you are stuck making it "hopefully long enough". Poor data modeling indeed.
So, I'd love to see varchar() adopted by sql standard, and then have all of the tools support it to mean unbounded varchar data; but until that happens varchar is no more correct than text.
Many of us don't care about being absolutely correct, but being fuzzy correct.
By that I mean if you dragged a 1000 people in a room and asked them what should be the max length of a first name,
after about 150 - no one would be raising their hands. So in short somewhere between 20 and 100 is your optimal fuzzy correct answer.
What is the max length of a tweet. Well it has a maxlength right, its not infinite. The max length seems somewhat arbitrary but the fact it exists
and is under n characters has benefits. It forces people to just state the most critical parts, it prevents information overload, and from a
storage/band width perspective, it means you can easily estimate the size of a page with 1000 tweets.
That brings up the performance benefit of limiting text. I can better predict the bandwidth performance of my queries
which is particularly important for web paging or if you have to transfer data to a third party source.
It also means unsuitably large text is not going to mess up the look of my page by taking up two lines instead of 1.
The problem is that while max-length is a sometimes-useful restriction on a field, it is not sufficient. For example, you say:
"It also means unsuitably large text is not going to mess up the look of my page by taking up two lines instead of 1."
But, it actually does not mean anything of the sort. If you've set varchar(15), I can easily put 5 linebreak characters ... or 4 ... in there, messing up your formatting. So max-length doesn't really help you *at all* there. You need a regex mask.
What VARCHAR(#) does is give the database designer the *illusion* of control over input without actually giving him/her any real control.
That part of the standard, like numeric primary keys, dates to the storage limitations of the databases of the 1980's, when counting characters was very important since you only had 16MB of disk space. It's quite outdated now.
All that being said, the awkward handling of subscript data types (like VARCHAR(#)) in ALTER TABLE in PostgreSQL is and has been a TODO. And it would be really good to fix it, it just requires a substantial refactoring of some very messy code.
Its good to hear that this is on the TODO. Getting back to your point about *illusions*.
First the issue you describe about people stuffing new lines in a size constrained input field rarely happens for 2 reasons. 1) The input fields don't allow new lines unless they are textarea.
2) Even when you html escape text (or at least my crappy special html escape function doesn't), it keeps new line as is which promptly gets ignore by the html standard
unless you PRE them.
As a side benefit -- Those people trying to post their own data with their own forms
will be trapped by the database when they try to stuff in a 1000 character name in a field that only accepts 20 characters. So to me -- its not an illusion, its
an extra security blanket. Its not a perfect blanket, but its a blanket.
On a side note, I can't tell you the countless number of times I have been saved when someone gives a me a data feed and they happened to
innocently switch the order of the first name field and a long unconstrained description text field. The import throws an error and sends me an email. I don't need to put in any extra filters and all that
to get this benefit. I don't have to wonder why suddenly my page is randomly slower because 100 of my records have big chunks of text in the first name field that I had assumed to be short
and safe to display in a list.
I will reiterate, I am not looking for absolute control of input, but what I consider good enough control without lifting too many fingers.
Unconstrained text just doesn't do that for me and it doesn't do it for a lot of people I have talked to. You can argue about limitations
of the 1980s, but some limitations are good. Storage -- particularly bandwidth is still a concern. Its still a concern on my crappy phone. Its still concern for my
limited mind who doesn't want to decipher the meaning of a 1000 character first name field :).
A relational database is a 40 year old limiting technology, but I still choose it in most cases over less limiting
NoSQL databases. I want limits even if you consider they are self-imposed ones.
If you want to use TEXT to capture a field that can never contain anything more than a 4 character alpha code, then likewise you should have no problem using NUMERIC to capture anything that contained nothing but digits. You realize, this becomes a regressive pattern: why not just use text for each field? Even primary keys using BIGSERIALs can be modified to use DEFAULT NEXTVAL('silly_id_seq')::TEXT!
Or you could just create a single field of TEXT and dump your content in as XML or JSON structures...
Ah heck, lets just dump it all in a flat file and write our own data parser and search tools....
No - not for me.
Arguments like "I underestimated the width" are not valid. The speak more to the lack of domain knowledge than to the limitation of the database. Having to put up with waiting for a column resize to complete is a natural consequence of not having done the proper research ahead of time.
You will be a better engineer having learned the lesson.
I agree with all you are saying. In fact its rare that I have to change the column size and when I do, its because the business meaning has changed. Like if a client code field needs to be increased, its because we bought out another company or something and need to prefix a division in there.
What irritates me most is not the slowness of the resizing, but this attitude that some people have that just because some of us see the value of limiting sizes of fields, some time machine from the cave ages must have accidentally dropped us off in their century and we need to be redumacated :)