Karaktersets & collation (MySQL)
Ik gebruik deze parameters
- Tekencodering:
utf8mb4
- andersutf8
- Collation:
utf8mb4_general_ci
- andersgeneral_ci
.
Inleiding
Karaktersets?
Verschillende systemen (Amazon, Bol.com, WordPress, Google Shopping, Drupal, etc.) gebruiken verschillende karaktersets, oftewel systemen om karakters te coderen in bits and bytes. Amazon gebruikt bij import van productgegevens op sommige plekken zelfs karaktersets door elkaar heen. MySQL-databases kan hier rekening mee houden. Het is dus belangrijk om thuis te zijn in karaktersets, als je precies bent, en niet van verrassingen houdt.
Collation?
Collation heeft betrekking op de sorteervolgorde van tekst in verschillende talen: In verschillende talen worden woorden enigszins verschillend gesorteerd. Dat heeft te maken met hele interessante zaken, die ik echter nooit precies uit elkaar kan houden: glyphs, ligaturen, diacritics, graphemes → digraphs & trigraphs.
Voor mij is collation een stuk minder relevant dan karaktersets, maar toch doe ik het graag goed.
MySQL behandelt karaktersets en collation vaak tezamen. Daarom ook in dit artikel.
Multi-level markeringen
In MySQL wordt de gebruikte karakterset op diverse plekken gespecificeerd:
- Op database-niveau
- Op tabelniveau
- Per kolom.
Het is zelfs mogelijk om in een tabel verschillende kolommen in verschillende tekencoderingen te hebben. Sterker nog: Dat overkomt mij voortdurend. En dan krijg je collation-gerelateerde foutmeldingen, bv. bij select-queries.
Beschikbare karaktersets
mysql> show character set; +----------+---------------------------------+---------------------+--------+ | Charset | Description | Default collation | Maxlen | +----------+---------------------------------+---------------------+--------+ | big5 | Big5 Traditional Chinese | big5_chinese_ci | 2 | | dec8 | DEC West European | dec8_swedish_ci | 1 | | cp850 | DOS West European | cp850_general_ci | 1 | | hp8 | HP West European | hp8_english_ci | 1 | | koi8r | KOI8-R Relcom Russian | koi8r_general_ci | 1 | | latin1 | cp1252 West European | latin1_swedish_ci | 1 | | latin2 | ISO 8859-2 Central European | latin2_general_ci | 1 | | swe7 | 7bit Swedish | swe7_swedish_ci | 1 | | ascii | US ASCII | ascii_general_ci | 1 | | ujis | EUC-JP Japanese | ujis_japanese_ci | 3 | | sjis | Shift-JIS Japanese | sjis_japanese_ci | 2 | | hebrew | ISO 8859-8 Hebrew | hebrew_general_ci | 1 | | tis620 | TIS620 Thai | tis620_thai_ci | 1 | | euckr | EUC-KR Korean | euckr_korean_ci | 2 | | koi8u | KOI8-U Ukrainian | koi8u_general_ci | 1 | | gb2312 | GB2312 Simplified Chinese | gb2312_chinese_ci | 2 | | greek | ISO 8859-7 Greek | greek_general_ci | 1 | | cp1250 | Windows Central European | cp1250_general_ci | 1 | | gbk | GBK Simplified Chinese | gbk_chinese_ci | 2 | | latin5 | ISO 8859-9 Turkish | latin5_turkish_ci | 1 | | armscii8 | ARMSCII-8 Armenian | armscii8_general_ci | 1 | | utf8 | UTF-8 Unicode | utf8_general_ci | 3 | | ucs2 | UCS-2 Unicode | ucs2_general_ci | 2 | | cp866 | DOS Russian | cp866_general_ci | 1 | | keybcs2 | DOS Kamenicky Czech-Slovak | keybcs2_general_ci | 1 | | macce | Mac Central European | macce_general_ci | 1 | | macroman | Mac West European | macroman_general_ci | 1 | | cp852 | DOS Central European | cp852_general_ci | 1 | | latin7 | ISO 8859-13 Baltic | latin7_general_ci | 1 | | utf8mb4 | UTF-8 Unicode | utf8mb4_general_ci | 4 | | cp1251 | Windows Cyrillic | cp1251_general_ci | 1 | | utf16 | UTF-16 Unicode | utf16_general_ci | 4 | | utf16le | UTF-16LE Unicode | utf16le_general_ci | 4 | | cp1256 | Windows Arabic | cp1256_general_ci | 1 | | cp1257 | Windows Baltic | cp1257_general_ci | 1 | | utf32 | UTF-32 Unicode | utf32_general_ci | 4 | | binary | Binary pseudo charset | binary | 1 | | geostd8 | GEOSTD8 Georgian | geostd8_general_ci | 1 | | cp932 | SJIS for Windows Japanese | cp932_japanese_ci | 2 | | eucjpms | UJIS for Windows Japanese | eucjpms_japanese_ci | 3 | | gb18030 | China National Standard GB18030 | gb18030_chinese_ci | 4 | +----------+---------------------------------+---------------------+--------+ 41 rows in set (0,00 sec)
Beschikbare collations
mysql> show collation; +--------------------------+----------+-----+---------+----------+---------+ | Collation | Charset | Id | Default | Compiled | Sortlen | +--------------------------+----------+-----+---------+----------+---------+ | big5_chinese_ci | big5 | 1 | Yes | Yes | 1 | | big5_bin | big5 | 84 | | Yes | 1 | | dec8_swedish_ci | dec8 | 3 | Yes | Yes | 1 | | dec8_bin | dec8 | 69 | | Yes | 1 | | cp850_general_ci | cp850 | 4 | Yes | Yes | 1 | | cp850_bin | cp850 | 80 | | Yes | 1 | | hp8_english_ci | hp8 | 6 | Yes | Yes | 1 | | hp8_bin | hp8 | 72 | | Yes | 1 | | koi8r_general_ci | koi8r | 7 | Yes | Yes | 1 | | koi8r_bin | koi8r | 74 | | Yes | 1 | | latin1_german1_ci | latin1 | 5 | | Yes | 1 | | latin1_swedish_ci | latin1 | 8 | Yes | Yes | 1 | | latin1_danish_ci | latin1 | 15 | | Yes | 1 | | latin1_german2_ci | latin1 | 31 | | Yes | 2 | | latin1_bin | latin1 | 47 | | Yes | 1 | | latin1_general_ci | latin1 | 48 | | Yes | 1 | | latin1_general_cs | latin1 | 49 | | Yes | 1 | | latin1_spanish_ci | latin1 | 94 | | Yes | 1 | | latin2_czech_cs | latin2 | 2 | | Yes | 4 | | latin2_general_ci | latin2 | 9 | Yes | Yes | 1 | | latin2_hungarian_ci | latin2 | 21 | | Yes | 1 | | latin2_croatian_ci | latin2 | 27 | | Yes | 1 | | latin2_bin | latin2 | 77 | | Yes | 1 | | swe7_swedish_ci | swe7 | 10 | Yes | Yes | 1 | | swe7_bin | swe7 | 82 | | Yes | 1 | | ascii_general_ci | ascii | 11 | Yes | Yes | 1 | | ascii_bin | ascii | 65 | | Yes | 1 | | ujis_japanese_ci | ujis | 12 | Yes | Yes | 1 | | ujis_bin | ujis | 91 | | Yes | 1 | | sjis_japanese_ci | sjis | 13 | Yes | Yes | 1 | | sjis_bin | sjis | 88 | | Yes | 1 | | hebrew_general_ci | hebrew | 16 | Yes | Yes | 1 | | hebrew_bin | hebrew | 71 | | Yes | 1 | | tis620_thai_ci | tis620 | 18 | Yes | Yes | 4 | | tis620_bin | tis620 | 89 | | Yes | 1 | | euckr_korean_ci | euckr | 19 | Yes | Yes | 1 | | euckr_bin | euckr | 85 | | Yes | 1 | | koi8u_general_ci | koi8u | 22 | Yes | Yes | 1 | | koi8u_bin | koi8u | 75 | | Yes | 1 | | gb2312_chinese_ci | gb2312 | 24 | Yes | Yes | 1 | | gb2312_bin | gb2312 | 86 | | Yes | 1 | | greek_general_ci | greek | 25 | Yes | Yes | 1 | | greek_bin | greek | 70 | | Yes | 1 | | cp1250_general_ci | cp1250 | 26 | Yes | Yes | 1 | | cp1250_czech_cs | cp1250 | 34 | | Yes | 2 | | cp1250_croatian_ci | cp1250 | 44 | | Yes | 1 | | cp1250_bin | cp1250 | 66 | | Yes | 1 | | cp1250_polish_ci | cp1250 | 99 | | Yes | 1 | | gbk_chinese_ci | gbk | 28 | Yes | Yes | 1 | | gbk_bin | gbk | 87 | | Yes | 1 | | latin5_turkish_ci | latin5 | 30 | Yes | Yes | 1 | | latin5_bin | latin5 | 78 | | Yes | 1 | | armscii8_general_ci | armscii8 | 32 | Yes | Yes | 1 | | armscii8_bin | armscii8 | 64 | | Yes | 1 | | utf8_general_ci | utf8 | 33 | Yes | Yes | 1 | | utf8_bin | utf8 | 83 | | Yes | 1 | | utf8_unicode_ci | utf8 | 192 | | Yes | 8 | | utf8_icelandic_ci | utf8 | 193 | | Yes | 8 | | utf8_latvian_ci | utf8 | 194 | | Yes | 8 | | utf8_romanian_ci | utf8 | 195 | | Yes | 8 | | utf8_slovenian_ci | utf8 | 196 | | Yes | 8 | | utf8_polish_ci | utf8 | 197 | | Yes | 8 | | utf8_estonian_ci | utf8 | 198 | | Yes | 8 | | utf8_spanish_ci | utf8 | 199 | | Yes | 8 | | utf8_swedish_ci | utf8 | 200 | | Yes | 8 | | utf8_turkish_ci | utf8 | 201 | | Yes | 8 | | utf8_czech_ci | utf8 | 202 | | Yes | 8 | | utf8_danish_ci | utf8 | 203 | | Yes | 8 | | utf8_lithuanian_ci | utf8 | 204 | | Yes | 8 | | utf8_slovak_ci | utf8 | 205 | | Yes | 8 | | utf8_spanish2_ci | utf8 | 206 | | Yes | 8 | | utf8_roman_ci | utf8 | 207 | | Yes | 8 | | utf8_persian_ci | utf8 | 208 | | Yes | 8 | | utf8_esperanto_ci | utf8 | 209 | | Yes | 8 | | utf8_hungarian_ci | utf8 | 210 | | Yes | 8 | | utf8_sinhala_ci | utf8 | 211 | | Yes | 8 | | utf8_german2_ci | utf8 | 212 | | Yes | 8 | | utf8_croatian_ci | utf8 | 213 | | Yes | 8 | | utf8_unicode_520_ci | utf8 | 214 | | Yes | 8 | | utf8_vietnamese_ci | utf8 | 215 | | Yes | 8 | | utf8_general_mysql500_ci | utf8 | 223 | | Yes | 1 | | ucs2_general_ci | ucs2 | 35 | Yes | Yes | 1 | | ucs2_bin | ucs2 | 90 | | Yes | 1 | | ucs2_unicode_ci | ucs2 | 128 | | Yes | 8 | | ucs2_icelandic_ci | ucs2 | 129 | | Yes | 8 | | ucs2_latvian_ci | ucs2 | 130 | | Yes | 8 | | ucs2_romanian_ci | ucs2 | 131 | | Yes | 8 | | ucs2_slovenian_ci | ucs2 | 132 | | Yes | 8 | | ucs2_polish_ci | ucs2 | 133 | | Yes | 8 | | ucs2_estonian_ci | ucs2 | 134 | | Yes | 8 | | ucs2_spanish_ci | ucs2 | 135 | | Yes | 8 | | ucs2_swedish_ci | ucs2 | 136 | | Yes | 8 | | ucs2_turkish_ci | ucs2 | 137 | | Yes | 8 | | ucs2_czech_ci | ucs2 | 138 | | Yes | 8 | | ucs2_danish_ci | ucs2 | 139 | | Yes | 8 | | ucs2_lithuanian_ci | ucs2 | 140 | | Yes | 8 | | ucs2_slovak_ci | ucs2 | 141 | | Yes | 8 | | ucs2_spanish2_ci | ucs2 | 142 | | Yes | 8 | | ucs2_roman_ci | ucs2 | 143 | | Yes | 8 | | ucs2_persian_ci | ucs2 | 144 | | Yes | 8 | | ucs2_esperanto_ci | ucs2 | 145 | | Yes | 8 | | ucs2_hungarian_ci | ucs2 | 146 | | Yes | 8 | | ucs2_sinhala_ci | ucs2 | 147 | | Yes | 8 | | ucs2_german2_ci | ucs2 | 148 | | Yes | 8 | | ucs2_croatian_ci | ucs2 | 149 | | Yes | 8 | | ucs2_unicode_520_ci | ucs2 | 150 | | Yes | 8 | | ucs2_vietnamese_ci | ucs2 | 151 | | Yes | 8 | | ucs2_general_mysql500_ci | ucs2 | 159 | | Yes | 1 | | cp866_general_ci | cp866 | 36 | Yes | Yes | 1 | | cp866_bin | cp866 | 68 | | Yes | 1 | | keybcs2_general_ci | keybcs2 | 37 | Yes | Yes | 1 | | keybcs2_bin | keybcs2 | 73 | | Yes | 1 | | macce_general_ci | macce | 38 | Yes | Yes | 1 | | macce_bin | macce | 43 | | Yes | 1 | | macroman_general_ci | macroman | 39 | Yes | Yes | 1 | | macroman_bin | macroman | 53 | | Yes | 1 | | cp852_general_ci | cp852 | 40 | Yes | Yes | 1 | | cp852_bin | cp852 | 81 | | Yes | 1 | | latin7_estonian_cs | latin7 | 20 | | Yes | 1 | | latin7_general_ci | latin7 | 41 | Yes | Yes | 1 | | latin7_general_cs | latin7 | 42 | | Yes | 1 | | latin7_bin | latin7 | 79 | | Yes | 1 | | utf8mb4_general_ci | utf8mb4 | 45 | Yes | Yes | 1 | | utf8mb4_bin | utf8mb4 | 46 | | Yes | 1 | | utf8mb4_unicode_ci | utf8mb4 | 224 | | Yes | 8 | | utf8mb4_icelandic_ci | utf8mb4 | 225 | | Yes | 8 | | utf8mb4_latvian_ci | utf8mb4 | 226 | | Yes | 8 | | utf8mb4_romanian_ci | utf8mb4 | 227 | | Yes | 8 | | utf8mb4_slovenian_ci | utf8mb4 | 228 | | Yes | 8 | | utf8mb4_polish_ci | utf8mb4 | 229 | | Yes | 8 | | utf8mb4_estonian_ci | utf8mb4 | 230 | | Yes | 8 | | utf8mb4_spanish_ci | utf8mb4 | 231 | | Yes | 8 | | utf8mb4_swedish_ci | utf8mb4 | 232 | | Yes | 8 | | utf8mb4_turkish_ci | utf8mb4 | 233 | | Yes | 8 | | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 | | utf8mb4_danish_ci | utf8mb4 | 235 | | Yes | 8 | | utf8mb4_lithuanian_ci | utf8mb4 | 236 | | Yes | 8 | | utf8mb4_slovak_ci | utf8mb4 | 237 | | Yes | 8 | | utf8mb4_spanish2_ci | utf8mb4 | 238 | | Yes | 8 | | utf8mb4_roman_ci | utf8mb4 | 239 | | Yes | 8 | | utf8mb4_persian_ci | utf8mb4 | 240 | | Yes | 8 | | utf8mb4_esperanto_ci | utf8mb4 | 241 | | Yes | 8 | | utf8mb4_hungarian_ci | utf8mb4 | 242 | | Yes | 8 | | utf8mb4_sinhala_ci | utf8mb4 | 243 | | Yes | 8 | | utf8mb4_german2_ci | utf8mb4 | 244 | | Yes | 8 | | utf8mb4_croatian_ci | utf8mb4 | 245 | | Yes | 8 | | utf8mb4_unicode_520_ci | utf8mb4 | 246 | | Yes | 8 | | utf8mb4_vietnamese_ci | utf8mb4 | 247 | | Yes | 8 | | cp1251_bulgarian_ci | cp1251 | 14 | | Yes | 1 | | cp1251_ukrainian_ci | cp1251 | 23 | | Yes | 1 | | cp1251_bin | cp1251 | 50 | | Yes | 1 | | cp1251_general_ci | cp1251 | 51 | Yes | Yes | 1 | | cp1251_general_cs | cp1251 | 52 | | Yes | 1 | | utf16_general_ci | utf16 | 54 | Yes | Yes | 1 | | utf16_bin | utf16 | 55 | | Yes | 1 | | utf16_unicode_ci | utf16 | 101 | | Yes | 8 | | utf16_icelandic_ci | utf16 | 102 | | Yes | 8 | | utf16_latvian_ci | utf16 | 103 | | Yes | 8 | | utf16_romanian_ci | utf16 | 104 | | Yes | 8 | | utf16_slovenian_ci | utf16 | 105 | | Yes | 8 | | utf16_polish_ci | utf16 | 106 | | Yes | 8 | | utf16_estonian_ci | utf16 | 107 | | Yes | 8 | | utf16_spanish_ci | utf16 | 108 | | Yes | 8 | | utf16_swedish_ci | utf16 | 109 | | Yes | 8 | | utf16_turkish_ci | utf16 | 110 | | Yes | 8 | | utf16_czech_ci | utf16 | 111 | | Yes | 8 | | utf16_danish_ci | utf16 | 112 | | Yes | 8 | | utf16_lithuanian_ci | utf16 | 113 | | Yes | 8 | | utf16_slovak_ci | utf16 | 114 | | Yes | 8 | | utf16_spanish2_ci | utf16 | 115 | | Yes | 8 | | utf16_roman_ci | utf16 | 116 | | Yes | 8 | | utf16_persian_ci | utf16 | 117 | | Yes | 8 | | utf16_esperanto_ci | utf16 | 118 | | Yes | 8 | | utf16_hungarian_ci | utf16 | 119 | | Yes | 8 | | utf16_sinhala_ci | utf16 | 120 | | Yes | 8 | | utf16_german2_ci | utf16 | 121 | | Yes | 8 | | utf16_croatian_ci | utf16 | 122 | | Yes | 8 | | utf16_unicode_520_ci | utf16 | 123 | | Yes | 8 | | utf16_vietnamese_ci | utf16 | 124 | | Yes | 8 | | utf16le_general_ci | utf16le | 56 | Yes | Yes | 1 | | utf16le_bin | utf16le | 62 | | Yes | 1 | | cp1256_general_ci | cp1256 | 57 | Yes | Yes | 1 | | cp1256_bin | cp1256 | 67 | | Yes | 1 | | cp1257_lithuanian_ci | cp1257 | 29 | | Yes | 1 | | cp1257_bin | cp1257 | 58 | | Yes | 1 | | cp1257_general_ci | cp1257 | 59 | Yes | Yes | 1 | | utf32_general_ci | utf32 | 60 | Yes | Yes | 1 | | utf32_bin | utf32 | 61 | | Yes | 1 | | utf32_unicode_ci | utf32 | 160 | | Yes | 8 | | utf32_icelandic_ci | utf32 | 161 | | Yes | 8 | | utf32_latvian_ci | utf32 | 162 | | Yes | 8 | | utf32_romanian_ci | utf32 | 163 | | Yes | 8 | | utf32_slovenian_ci | utf32 | 164 | | Yes | 8 | | utf32_polish_ci | utf32 | 165 | | Yes | 8 | | utf32_estonian_ci | utf32 | 166 | | Yes | 8 | | utf32_spanish_ci | utf32 | 167 | | Yes | 8 | | utf32_swedish_ci | utf32 | 168 | | Yes | 8 | | utf32_turkish_ci | utf32 | 169 | | Yes | 8 | | utf32_czech_ci | utf32 | 170 | | Yes | 8 | | utf32_danish_ci | utf32 | 171 | | Yes | 8 | | utf32_lithuanian_ci | utf32 | 172 | | Yes | 8 | | utf32_slovak_ci | utf32 | 173 | | Yes | 8 | | utf32_spanish2_ci | utf32 | 174 | | Yes | 8 | | utf32_roman_ci | utf32 | 175 | | Yes | 8 | | utf32_persian_ci | utf32 | 176 | | Yes | 8 | | utf32_esperanto_ci | utf32 | 177 | | Yes | 8 | | utf32_hungarian_ci | utf32 | 178 | | Yes | 8 | | utf32_sinhala_ci | utf32 | 179 | | Yes | 8 | | utf32_german2_ci | utf32 | 180 | | Yes | 8 | | utf32_croatian_ci | utf32 | 181 | | Yes | 8 | | utf32_unicode_520_ci | utf32 | 182 | | Yes | 8 | | utf32_vietnamese_ci | utf32 | 183 | | Yes | 8 | | binary | binary | 63 | Yes | Yes | 1 | | geostd8_general_ci | geostd8 | 92 | Yes | Yes | 1 | | geostd8_bin | geostd8 | 93 | | Yes | 1 | | cp932_japanese_ci | cp932 | 95 | Yes | Yes | 1 | | cp932_bin | cp932 | 96 | | Yes | 1 | | eucjpms_japanese_ci | eucjpms | 97 | Yes | Yes | 1 | | eucjpms_bin | eucjpms | 98 | | Yes | 1 | | gb18030_chinese_ci | gb18030 | 248 | Yes | Yes | 2 | | gb18030_bin | gb18030 | 249 | | Yes | 1 | | gb18030_unicode_520_ci | gb18030 | 250 | | Yes | 8 | +--------------------------+----------+-----+---------+----------+---------+ 222 rows in set (0,00 sec)
Wat ik gebruik
In MySQL werk ik met deze instellingen:
- Tekencodering:
utf8mb4
- andersutf8
- Collation:
utf8mb4_general_ci
- andersgeneral_ci
.
SQL:
create schema blub default character set utf8mb4 collate utf8mb4_0900_ai_ci;
Tekencodering exportbestanden
Bij exports, kun je bepalen met welke codering het exportbestand moet worden aangemaakt. En anders kan ik de betreffende tabel omzetten naar bv. latin1. Voorbeeld:
alter table blub_tmp character set = utf8, collate = utf8_general_ci;
Codering database uitlezen
SELECT default_character_set_name FROM information_schema.SCHEMATA WHERE schema_name = "schemaname";
Wellicht eenvoudiger:
show variables like "character_set_database";
Codering tabellen uitlezen
Alle tabellen (geloof ik):
SELECT CCSA.character_set_name FROM information_schema.`TABLES` T, information_schema.`COLLATION_CHARACTER_SET_APPLICABILITY` CCSA WHERE CCSA.collation_name = T.table_collation AND T.table_schema = "schemaname" AND T.table_name = "tablename";
Voorbeeld mbt. een specifieke tabel:
select table_collation from information_schema.tables where table_schema="webwinkels" and table_name="superquery_export";
De waardes die ik op m'n ontwikkelmachine met zo'n 30 databases tegenkom:
binary latin1_swedish_ci utf_bin utf8_general_ci utf8mb4_general_ci utf8mb4_unicode_ci
Codering kolommen uitlezen
SELECT character_set_name FROM information_schema.`COLUMNS` WHERE table_schema = "schemaname" AND table_name = "tablename" AND column_name = "columnname";
Bv.:
SELECT collation_name FROM information_schema.COLUMNS where table_schema="webwinkels" and TABLE_NAME="superquery_export";
Kolom-codering aanpassen
Je kunt niet zomaar schrijven naar information_schema
, maar dat hoeft ook niet:
ALTER TABLE mytable CONVERT TO CHARACTER SET latin1
Waarschuwing: UTF8MB3-alias (zomer 2020)
Het probleem
Waarschuwing bij het aanmaken van een tabel:
1 warning(s): 3719 'utf8' is currently an alias for the character set UTF8MB3, but will be an alias for UTF8MB4 in a future release. Please consider using UTF8MB4 in order to be unambiguous.
Aanvullende gegevens
utf8mb3
:
- Identiek aan codering
ucs2
- Ondersteund enkel BMP (Basic Multilingual Plan; Plane 0; 0000-ffff (0-65.536); Eerste pagina van 2^16 tekens) [1]
- Ondersteund geen additionele planes (Supplementary planes)
- Multibyte characters: Max. 3 bytes/character.
De essentie
utf8mb3
is deprecated. Gebruik voortaan alleen nogutf8mb4
[2].- Specificeer expliciet
utf8mb4
- Collation:
collate utf8mb4_0900_ai_ci
- Uitleg volgt nog.
Bronnen
- https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8mb3.html
- https://en.wikipedia.org/wiki/Plane_(Unicode)
Casus: Collation-foutmelding (april 2021)
Foutmelding
De mooiste dingen in het leven beginnen met een foutmelding:
Error Code: 1267. Illegal mix of collations (utf8mb4_unicode_520_ci,IMPLICIT) and (utf8mb4_0900_ai_ci,IMPLICIT) for operation '='
Collation op database-niveau
De tabellen in deze query, lijken allemaal de standaard-instellingen van de bijbehorende databases te volgen. Dit betreffen queries uit twee verschillende databases. Waarschijnlijk hebben die databases verschillende instellingen voor collation (collatie, sorteervolgorde?)
select schema_name, default_character_set_name, default_collation_name from information_schema.schemata;
Verrassend: Alle eigen databases lijken dezelfde collation te hebben:
schema_name default_character_set_name default_collation_name ------------------ -------------------------- ---------------------- mysql utf8mb4 utf8mb4_0900_ai_ci information_schema utf8 utf8_general_ci performance_schema utf8mb4 utf8mb4_0900_ai_ci sys utf8mb4 utf8mb4_0900_ai_ci example_cm utf8mb4 utf8mb4_0900_ai_ci ... utf8mb4 utf8mb4_0900_ai_ci
Collation op tabel-niveau
Volgende stap: Wat is de collision voor de verschillende tabellen? Query
select table_schema, table_name, engine, table_collation from information_schema.tables where ( table_schema = "knl_s1" or table_schema = "dwh_org" ) and ( table_name = "wp_posts" or table_name = "wp_postmeta" or table_name = "oem_bal" );
Levert op:
table_schema table_name engine table_collation ------------ ---------- ------ --------------- dwh_org oem_bal InnoDB utf8mb4_0900_ai_ci knl_s1 wp_postmeta InnoDB utf8mb4_unicode_520_ci knl_s1 wp_posts InnoDB utf8mb4_unicode_520_ci
Oplossing:
alter table knl_s1.wp_postmeta collate = utf8mb4_0900_ai_ci; alter table knl_s1.wp_posts collate = utf8mb4_0900_ai_ci;
Collation op kolom-niveau
Argh, de foutmelding is er nog steeds. Misschien dat dit op veld-niveau gefixet moet worden. Nieuw voor mij: Dit kun je gemakkelijk achterhalen in MySQL Workbench door met de rechtermuistoets op een entiteit (db, tabel) te klikken en te kiezen ... inspector:
knl_s1.wp_postmeta.meta_value utf8mb4_unicode_520_ci knl_s1.wp_posts.meta_value utf8mb4_unicode_520_ci
Oplossing:
alter table knl_s1.wp_postmeta change column meta_key meta_key VARCHAR(255) CHARACTER SET 'utf8mb4' COLLATE 'utf8mb4_0900_ai_ci' NULL DEFAULT NULL; alter table knl_s1.wp_postmeta change column meta_value meta_value LONGTEXT CHARACTER SET 'utf8mb4' COLLATE 'utf8mb4_0900_ai_ci' NULL DEFAULT NULL;
P.s.
Ongetwijfeld bestaat er een manier om in één keer de collation van alle velden in een database aan te passen. Lijkt me een prima vraagstuk, volgende keer als ik deze foutmelding krijg.
Zie ook
- Create database (MySQL)
- Strings, bytes, karakters & codering (MySQL)
- Tabel aanmaken (MySQL)
- Tekencodering achterhalen
- Tekencodering (MySQL)
Bronnen
- http://mysql.rjweb.org/doc.php/charcoll → Leuk achtergrondartikel + practisch, maar beetje oud
- https://docs.python.org/2/howto/unicode.html → Leuk achtergrondartikel
- http://www.whitesmith.co/blog/latin1-to-utf8/ → Leuk geschreven en mooie vormgeving
- http://www.garethsprice.com/blog/2011/fix-mysql-latin1-utf-character-encoding/
- http://blog.codekills.net/2012/03/20/in-mysql-latin1-isnt-actually-latin1/ → Oei, moeilijk!
- http://dev.mysql.com/doc/refman/5.0/en/charset-connection.html
- http://stackoverflow.com/questions/1049728/how-do-i-see-what-character-set-a-mysql-database-table-column-is