Jump to content

KDb/Database Collation Issues: Difference between revisions

From KDE Community Wiki
< KDb
Jstaniek (talk | contribs)
Created page with "."
 
Jstaniek (talk | contribs)
No edit summary
 
Line 1: Line 1:
.
== Introduction ==
 
=== Types ===
Collation type is a rule used when comparing (or sorting) text strings in a database. Lets assume we have three database records in a ''name'' column of a table:
  Joe
  Adam
  joan
 
* NOCASE collation means that a comparison like ''name > "adam"'' will return:
 
  Joe
  Adam
  joan
 
: In other words, case-insensitive comparison is performed.
 
* BINARY collation means that a comparison like ''name > "adam"'' will return:
 
  joan
 
: Case-sensitive comparison is performed here, so neither "Joe" nor "Adam" are ''greater'' than "adam".
 
=== Affected SQL Operators ===
Following SQL comparison operators are affected:
  < > <= >= <> != = == LIKE.
 
=== Implications For Data Integrity ===
Example for MySQL:
  CREATE TABLE dictionary (
  word varchar(255) COLLATE latin2_bin NOT NULL,
  );
  ALTER TABLE dictionary ADD UNIQUE INDEX (word);
Having BINARY collation ''latin2_bin'' defined now it's possible to insert
  Bag
  bag
while, with NOCASE collation, the unique index would not allow this.
 
=== Problem with Non-latin1 Character Sets ===
Database backend needs to offer built-in support for Non-latin1 character collations. Kexi could not implement a workaround at the client side if a query has to be executed at the server side.
 
== Collation in SQLite Backend ==
SQLite supports BINARY collation by default.
 
'''Non-latin1 Character Sets:''' we can patch SQLite to add such a support. See [https://www.sqlite.org/datatype3.html#collation User-defined Collation Sequences].
 
== Collation in MySQL Backend ==
''By default, MySQL searches are not case sensitive.''
 
See [https://dev.mysql.com/doc/refman/8.4/en/case-sensitivity.html] and [https://dev.mysql.com/doc/refman/8.4/en/charset-mysql.html]
 
To force BINARY collation, we will add a COLLATE clause, e.g.:
  SELECT * FROM table WHERE name>='abc' '''COLLATE latin1_bin'''
 
Notes: '''latin1_bin''' collation name always works. Collation and character sets can be properly selected on db creation or connecting only for MySQL >= 4.1. See also [https://dev.mysql.com/doc/refman/8.4/en/charset-table.html Table Character Set and Collation] (mysql manual). So for newer MySQL versions use specific collations, not just latin1_bin.
 
A good default for collation/character set seems to be ''utf8_general_ci/utf8''.
 
TODO
 
== Collation in PostgreSQL Backend ==
TODO
 
== Implications for Query Designer ==
A good default is a #1 priority here. In most cases users expect NOCASE collation to be the default, so we're:
 
* adding ''NOCASE'' flag to appropriate KDbBinaryExpr objects when SQL statement is generated/anaysed
* supporting ''BINARY'' keyword in the Design View of the Query Designer
 
== Competition ==
* [[Kexi/Migration/MDB#MS_Access_Annoyances|MS Access does not support case-sensitive queries]].

Latest revision as of 22:40, 5 October 2024

Introduction

Types

Collation type is a rule used when comparing (or sorting) text strings in a database. Lets assume we have three database records in a name column of a table:

 Joe
 Adam
 joan
  • NOCASE collation means that a comparison like name > "adam" will return:
 Joe
 Adam
 joan
In other words, case-insensitive comparison is performed.
  • BINARY collation means that a comparison like name > "adam" will return:
 joan
Case-sensitive comparison is performed here, so neither "Joe" nor "Adam" are greater than "adam".

Affected SQL Operators

Following SQL comparison operators are affected:

 < > <= >= <> != = == LIKE.

Implications For Data Integrity

Example for MySQL:

 CREATE TABLE dictionary (
  word varchar(255) COLLATE latin2_bin NOT NULL,
 );
 ALTER TABLE dictionary ADD UNIQUE INDEX (word);

Having BINARY collation latin2_bin defined now it's possible to insert

 Bag
 bag

while, with NOCASE collation, the unique index would not allow this.

Problem with Non-latin1 Character Sets

Database backend needs to offer built-in support for Non-latin1 character collations. Kexi could not implement a workaround at the client side if a query has to be executed at the server side.

Collation in SQLite Backend

SQLite supports BINARY collation by default.

Non-latin1 Character Sets: we can patch SQLite to add such a support. See User-defined Collation Sequences.

Collation in MySQL Backend

By default, MySQL searches are not case sensitive.

See [1] and [2]

To force BINARY collation, we will add a COLLATE clause, e.g.:

 SELECT * FROM table WHERE name>='abc' COLLATE latin1_bin

Notes: latin1_bin collation name always works. Collation and character sets can be properly selected on db creation or connecting only for MySQL >= 4.1. See also Table Character Set and Collation (mysql manual). So for newer MySQL versions use specific collations, not just latin1_bin.

A good default for collation/character set seems to be utf8_general_ci/utf8.

TODO

Collation in PostgreSQL Backend

TODO

Implications for Query Designer

A good default is a #1 priority here. In most cases users expect NOCASE collation to be the default, so we're:

  • adding NOCASE flag to appropriate KDbBinaryExpr objects when SQL statement is generated/anaysed
  • supporting BINARY keyword in the Design View of the Query Designer

Competition