Database Character Set and Collation
If your Craft project uses the default MySQL database driver (or MariaDB), it’s worth taking a moment to consider how the selected character set (charset) and collation will affect things like sorting or searching.
Definitions #
- The character set is the set of symbols and encodings (an “alphabet”) used by the database to store text.
- The collation is the set of rules used to compare and order those characters when querying.
Craft 5.x defaults to the utf8mb4
character set and utf8mb4_0900_ai_ci
collation, when using MySQL. This is also the default combination for MySQL 8.x; it’s language-nonspecific, accent-insensitive (ai
), case-insensitive (ci
), and is based on Unicode Collation Algorithm version 9.0.0 weight keys (0900
).
Languages #
If your project has a primary language with accents or other character sequences that require special handling in element queries (as constraints, search keywords, or when sorting), you may want to use the corresponding language-specific collation. Keep in mind that all of your content will use a single charset and collation.
Craft provides the db/convert-charset
command to simplify the process of switching character sets and collations. You can either…
- …run the command with the new charset and collation…
craft db/convert-charset [charset] [collation]
…and then add the same values to your .env under
CRAFT_DB_CHARSET
andCRAFT_DB_COLLATION
. - …set the
CRAFT_DB_CHARSET
andCRAFT_DB_COLLATION
environment variables, then run db/convert-charset without any arguments (accepting the suggested values when prompted) to apply those settings.
If this command sounds familiar, you may have run it during the Craft 5.x upgrade!
Collations and Querying #
Suppose you have two entries, one with the title of Jagerstrasse
and the other with the title of Jägerstraße
. Let’s compare the following query for an entry by slug, across two collations:
{% set entries = craft.entries()
.section('locations')
.title('Jagerstrasse')
.all() %}
{% for entry in entries %}
<p>{{ entry.title }}: {{ entry.id }}</p>
{% endfor %}
With the default charset and collation, both entries are returned, as matching is accent-insensitive. ß
and ss
are treated as equivalent.
However, if you change your collation to utf8mb4_german2_ci
and run the query again, only the entry with the title of Jagerstrasse
will be returned, as the accent (or lack thereof) is significant. Again, ß
and ss
are considered equivalent.
Slugs and URIs #
This can also affect how Craft matches element URIs. For example, if the URI of an entry is /jägerstraße
but a client tries to access /jagerstrasse
(without the accented a
), the database’s collation determines whether the entry is matched.
The same is true when Craft validates the uniqueness of a slug or URI: when looking for collisions with existing elements, comparisons are done in the database, meaning some collations might allow two similar slugs (like jägerstraße
and jagerstrasse
) to coexist while others won’t.
Collation can also affect less obvious characters, like whether a hyphen (-
) is ordered before or after open parenthesis ((
).
Mixing character sets and collations across tables is strongly discouraged, and is a frequent cause of performance issues, especially when selections or constraints involve indexes.
Related Features #
The limitAutoSlugsToAscii
setting is off by default. Authors may always manually define a slug with characters outside this space, but some rules and normalizations are always applied:
- The following characters are not allowed:
'"‘’“”ʻ[](){}:
- Slugs are down-cased unless the
allowUppercaseInSlug
setting is enabled - Whitespace and empty strings are collapsed into a single
slugWordSeparator
Handles (typically used as human-readable identifiers for system components like sites and entry types) have an even narrower set of allowed characters, and are validated with this expression: [a-zA-Z][a-zA-Z0-9_]
.
Queries against custom field values (stored in Craft’s JSON content column) are case-sensitive, by default (regardless of your database’s collation). You can force queries against some fields to be case-insensitive by passing a hash:
{% set location = craft.entries()
.section('locations')
.streetName({
value: 'jagerstrasse',
caseInsensitive: true,
})
.all() %}
Further Reading #
- Language-specific collations (MySQL documentation)
- Everything you never wanted to know about MySQL Charsets & Collations (Vince Salvino, CodeRed)