This article is about collation in library, information, and computer science. Collation is the assembly of written information into a standard order. Many systems of collation are based on numerical order or alphabetical order, or on combinations of the two.
Collation differs from classification in that classification is concerned with arranging information into logical categories, while collation is concerned with the ordering of items of information, usually based on the form of their identifiers. A collation algorithm such as the Unicode collation algorithm defines an order through the process of comparing two given character strings and deciding which should come before the other. The main advantage of collation is that it makes it fast and easy for a user to find an element in the list, or to confirm that it is absent from the list. Strings representing numbers may be sorted based on the values of the numbers that they represent. A similar approach may be taken with strings representing dates or other items that can be ordered chronologically or in some other natural fashion. Alphabetical order is the basis for many systems of collation where items of information are identified by strings consisting principally of letters from an alphabet. The ordering of the strings relies on the existence of a standard ordering for the letters of the alphabet in question.
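The difference between sorting number or date strings as plain text and sorting them by the values they represent can be sketched in Python. The function names and the day/month/year date format below are assumptions made for illustration only.

```python
from datetime import datetime

def sort_as_text(items):
    """Order strings character by character, as plain text."""
    return sorted(items)

def sort_as_numbers(items):
    """Order numeric strings by the integer value each one represents."""
    return sorted(items, key=int)

def sort_as_dates(items, fmt="%d/%m/%Y"):
    """Order date strings chronologically (assumed format: day/month/year)."""
    return sorted(items, key=lambda s: datetime.strptime(s, fmt))

labels = ["10", "9", "100", "1"]
print(sort_as_text(labels))     # ['1', '10', '100', '9']
print(sort_as_numbers(labels))  # ['1', '9', '10', '100']

dates = ["15/03/2021", "31/12/1999", "02/01/2021"]
print(sort_as_text(dates))      # ['02/01/2021', '15/03/2021', '31/12/1999']
print(sort_as_dates(dates))     # ['31/12/1999', '02/01/2021', '15/03/2021']
```

In both cases the strings themselves are left unchanged; only the key used to compare them differs.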
To decide which of two strings comes first in alphabetical order, their first letters are compared. The string whose first letter appears earlier in the alphabet comes first in alphabetical order. If the first letters are the same, then the second letters are compared, and so on, until the order is decided. If one string runs out of letters before a difference is found, the shorter string comes first; for example, “cart” comes before “carthorse”. Capital letters are typically treated as equivalent to their corresponding lowercase letters. For alternative treatments in computerized systems, see Automated collation, below. When strings contain spaces or other word dividers, a decision must be made whether to ignore these dividers or to treat them as symbols preceding all other letters of the alphabet. Abbreviations may be treated as if they were spelt out in full.
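The letter-by-letter comparison described above can be sketched as a small Python function. This is a minimal illustration, not a full collation algorithm: the function name and the `ignore_spaces` parameter are hypothetical, the sketch assumes unaccented English letters (whose code point order matches alphabetical order), and it assumes that a string which runs out of letters first is placed first.

```python
from functools import cmp_to_key

def compare_alphabetical(a, b, ignore_spaces=False):
    """Return -1 if a sorts before b, 1 if after, 0 if they are equivalent."""
    def prepare(s):
        s = s.casefold()              # capitals count the same as lowercase
        if ignore_spaces:
            s = s.replace(" ", "")    # convention 1: ignore word dividers
        return s                      # convention 2: a space precedes every letter

    x, y = prepare(a), prepare(b)
    for cx, cy in zip(x, y):
        if cx != cy:                  # first differing position decides the order
            return -1 if cx < cy else 1
    # No difference found: the string that ran out of letters comes first.
    return (len(x) > len(y)) - (len(x) < len(y))

places = ["Newark", "New York", "Newton"]

# Spaces treated as preceding all letters.
print(sorted(places, key=cmp_to_key(compare_alphabetical)))
# ['New York', 'Newark', 'Newton']

# Spaces ignored.
ignoring = cmp_to_key(lambda a, b: compare_alphabetical(a, b, ignore_spaces=True))
print(sorted(places, key=ignoring))
# ['Newark', 'Newton', 'New York']
```

The space convention works without special handling here only because the space character has a lower code point than any letter; a system with a different character encoding would need an explicit rule.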
There is also a traditional convention in English that surnames beginning Mc and M’ are listed as if those prefixes were written Mac. Strings that represent personal names will often be listed by alphabetical order of surname, even if the given name comes first. For example, Juan Hernandes and Brian O’Leary should be sorted as “Hernandes, Juan” and “O’Leary, Brian” even if they are not written this way. Very common initial words, such as The in English, are often ignored for sorting purposes. So The Shining would be sorted as just “Shining” or “Shining, The”.
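These conventions are often implemented as sort keys, that is, functions that turn each string into the form under which it should be filed. The sketch below is illustrative only. It assumes that the surname is the last space-separated word of a name, which does not hold for every name, and the set `LEADING_WORDS` is a hypothetical list (the text above names only “The”).

```python
import re

# Hypothetical list of initial words to skip; the text names only "The".
LEADING_WORDS = {"the", "a", "an"}

def surname_key(full_name):
    """Key for 'Given Surname' strings: surname first, then given name(s)."""
    given, _, surname = full_name.strip().rpartition(" ")
    surname = surname.casefold().replace("’", "'")
    # File the prefixes Mc and M' as if they were written Mac.
    surname = re.sub(r"^(?:mc|m')", "mac", surname)
    return (surname, given.casefold())

def title_key(title):
    """Key for titles: drop a very common initial word such as 'The'."""
    words = title.split()
    if len(words) > 1 and words[0].casefold() in LEADING_WORDS:
        words = words[1:]
    return " ".join(words).casefold()

names = ["Brian O’Leary", "Tom Maddox", "Ann McKay", "Juan Hernandes"]
print(sorted(names, key=surname_key))
# ['Juan Hernandes', 'Ann McKay', 'Tom Maddox', 'Brian O’Leary']

titles = ["The Shining", "Rebecca", "Ulysses"]
print(sorted(titles, key=title_key))
# ['Rebecca', 'The Shining', 'Ulysses']
```

Note that “McKay” is filed as “mackay” and therefore lands before “Maddox”, which a plain letter-by-letter comparison of the written forms would not do.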
Characters that are not letters, such as numerals, are sometimes treated as if they came before or after all the letters of the alphabet. Languages have different conventions for treating modified letters and certain letter combinations. In several languages the rules have changed over time, and so older dictionaries may use a different order than modern ones. Furthermore, collation may depend on use.
For example, German dictionaries and telephone directories use different approaches.
The radical-and-stroke system used to order logographic characters is cumbersome compared to an alphabetical system in which there are a few characters, all unambiguous. The choice of which components of a logograph constitute separate radicals and which radical is primary is not clear-cut. As a result, logographic languages often supplement radical-and-stroke ordering with alphabetic sorting of a phonetic conversion of the logographs.
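The two orderings of logographs mentioned above can be contrasted in a short sketch. It assumes the usual form of radical-and-stroke ordering, in which characters are grouped by radical and then ordered by the number of additional strokes. The table `ENTRIES` is a tiny hand-entered sample (Kangxi radical number, additional strokes, toneless pinyin) chosen for illustration; a real dictionary would need a complete table and a policy for ambiguous radicals.

```python
# character: (Kangxi radical number, strokes beyond the radical, toneless pinyin)
# Hand-entered sample data for illustration only.
ENTRIES = {
    "休": (9, 4, "xiu"),
    "人": (9, 0, "ren"),
    "森": (75, 8, "sen"),
    "木": (75, 0, "mu"),
    "林": (75, 4, "lin"),
    "江": (85, 3, "jiang"),
    "水": (85, 0, "shui"),
}

def radical_stroke_key(char):
    """Group by radical, then order by number of additional strokes."""
    radical, extra_strokes, _ = ENTRIES[char]
    return (radical, extra_strokes)

def phonetic_key(char):
    """Order alphabetically by a phonetic conversion (here, toneless pinyin)."""
    return ENTRIES[char][2]

chars = list(ENTRIES)
print(sorted(chars, key=radical_stroke_key))
# ['人', '休', '木', '林', '森', '水', '江']
print(sorted(chars, key=phonetic_key))
# ['江', '林', '木', '人', '森', '水', '休']
```

The phonetic key is simpler to compute and to search, but it only helps a reader who already knows how the character is pronounced.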
In addition, in Greater China, surname stroke ordering is a convention in some official documents where people’s names are listed without hierarchy. The radical-and-stroke system, or some similar pattern-matching and stroke-counting method, was traditionally the only practical method for constructing dictionaries that someone could use to look up a logograph whose pronunciation was unknown. With the advent of computers, dictionary programs are now available that allow one to handwrite a character using a mouse or stylus.
When information is stored in digital systems, collation may become an automated process. It is then necessary to implement an appropriate collation algorithm that allows the information to be sorted in a satisfactory manner for the application in question. Often the aim will be to achieve an alphabetical or numerical ordering that follows the standard criteria as described in the preceding sections.
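As a rough illustration of why the algorithm has to suit the application, the sketch below contrasts a default code point sort with two hand-written sort keys modelled on the German dictionary and telephone directory conventions mentioned earlier. The mappings are an assumption based on how those conventions are commonly described (dictionaries file ä, ö, ü as a, o, u; directories file them as ae, oe, ue), and the function names are hypothetical. It is not an implementation of the Unicode collation algorithm.

```python
import unicodedata

# Assumed mappings for the two German conventions (illustrative only).
_DICTIONARY = str.maketrans({"ä": "a", "ö": "o", "ü": "u"})
_DIRECTORY = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue"})

def _prepare(word):
    # NFC yields precomposed letters; casefold also turns "ß" into "ss".
    return unicodedata.normalize("NFC", word).casefold()

def dictionary_key(word):
    """Umlauted vowels are filed as the plain vowel."""
    return _prepare(word).translate(_DICTIONARY)

def directory_key(word):
    """Umlauted vowels are filed as vowel + 'e'."""
    return _prepare(word).translate(_DIRECTORY)

words = ["Muster", "Müller", "Muffel"]

print(sorted(words))                      # code point order
# ['Muffel', 'Muster', 'Müller']
print(sorted(words, key=dictionary_key))  # dictionary-style
# ['Muffel', 'Müller', 'Muster']
print(sorted(words, key=directory_key))   # directory-style
# ['Müller', 'Muffel', 'Muster']
```

The same three strings come out in three different orders, and none of them is wrong in itself; which one is satisfactory depends on what the list is for.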