Database Character Set and Collation

If your Craft project uses the default MySQL database driver (or MariaDB), it’s worth taking a moment to consider how the selected character set (charset) and collation will affect things like sorting or searching.

Definitions #

  • The character set is the set of symbols and encodings (an “alphabet”) used by the database to store text.
  • The collation is the set of rules used to compare and order those characters when querying.

Craft 5.x defaults to the utf8mb4 character set and utf8mb4_0900_ai_ci collation, when using MySQL. This is also the default combination for MySQL 8.x; it’s language-nonspecific, accent-insensitive (ai), case-insensitive (ci), and is based on Unicode Collation Algorithm version 9.0.0 weight keys (0900).

Languages #

If your project has a primary language with accents or other character sequences that require special handling in element queries (as constraints, search keywords, or when sorting), you may want to use the corresponding language-specific collation. Keep in mind that all of your content will use a single charset and collation.

Craft provides the db/convert-charset command to simplify the process of switching character sets and collations. You can either…

  • …run the command with the new charset and collation…
    craft db/convert-charset [charset] [collation]
    

    …and then add the same values to your .env under CRAFT_DB_CHARSET and CRAFT_DB_COLLATION.

  • …set the CRAFT_DB_CHARSET and CRAFT_DB_COLLATION environment variables, then run db/convert-charset without any arguments (accepting the suggested values when prompted) to apply those settings.

If this command sounds familiar, you may have run it during the Craft 5.x upgrade!

Collations and Querying #

Suppose you have two entries, one with the title of Jagerstrasse and the other with the title of Jägerstraße. Let’s compare the following query for an entry by slug, across two collations:

{% set entries = craft.entries()
  .section('locations')
  .title('Jagerstrasse')
  .all() %}

{% for entry in entries %}
    <p>{{ entry.title }}: {{ entry.id }}</p>
{% endfor %}

With the default charset and collation, both entries are returned, as matching is accent-insensitive. ß and ss are treated as equivalent.

However, if you change your collation to utf8mb4_german2_ci and run the query again, only the entry with the title of Jagerstrasse will be returned, as the accent (or lack thereof) is significant. Again, ß and ss are considered equivalent.

Slugs and URIs #

This can also affect how Craft matches element URIs. For example, if the URI of an entry is /jägerstraße but a client tries to access /jagerstrasse (without the accented a), the database’s collation determines whether the entry is matched.

The same is true when Craft validates the uniqueness of a slug or URI: when looking for collisions with existing elements, comparisons are done in the database, meaning some collations might allow two similar slugs (like jägerstraße and jagerstrasse) to coexist while others won’t.

Collation can also affect less obvious characters, like whether a hyphen (-) is ordered before or after open parenthesis (().

Mixing character sets and collations across tables is strongly discouraged, and is a frequent cause of performance issues, especially when selections or constraints involve indexes.

Related Features #

The limitAutoSlugsToAscii setting is off by default. Authors may always manually define a slug with characters outside this space, but some rules and normalizations are always applied:

  • The following characters are not allowed: '"‘’“”ʻ[](){}:
  • Slugs are down-cased unless the allowUppercaseInSlug setting is enabled
  • Whitespace and empty strings are collapsed into a single slugWordSeparator

Handles (typically used as human-readable identifiers for system components like sites and entry types) have an even narrower set of allowed characters, and are validated with this expression: [a-zA-Z][a-zA-Z0-9_].

Queries against custom field values (stored in Craft’s JSON content column) are case-sensitive, by default (regardless of your database’s collation). You can force queries against some fields to be case-insensitive by passing a hash:

{% set location = craft.entries()
  .section('locations')
  .streetName({
    value: 'jagerstrasse',
    caseInsensitive: true,
  })
  .all() %}

Further Reading #