Amarok/Archives/Proposals/DatabaseProject

From KDE Community Wiki

This page describes the proposed changes needed to make Amarok database lightning fast, even when the database is running with low memory/CPU resources.

To summarize, Amarok needs PRIMARY KEYs on INTEGERs with SEQUENCEs. 
Using PRIMARY KEYs on INTEGERS is also recommended by SQLite, like explained below.

The proposed changes are very minimal and mostly apply to the database schema. Amarok source code will need very little changes, except for creating the database and running SQL queries like left joins.

Goals

  • Amarok should start very fast.

(muesli: the startup time is not only bound to the database. even with an empty database, loading takes longer than that already)

(pégésql: you are right, but this can help.)

  • Drag&drop, even of all files, should not take more than 1 second (the best being 0.1 second).

(muesli: already works, go to smart-playlists, select all collection and drag it. if your version works slower and adds the files in blocks of 50 each, you need to tweak your source. we have this in releases, since sqlite works slightly diffrent. this will be easier to change when we get database-plugins. means: 2.0)

(pégésql: drag&drop of hundreds of file is THE needed feature of Amarok. If you are willing to discover your music, you need to browse it very fast. For example, I ripped a 4-CD set for my kids with 200 songs. It takes a long time to drag&drop the songs. We need more speed. The result should be immediate. Only one query is needed.)

  • The database should be usable for any third-party development, like radio-broadcast software or web interface. As described below, it means offering reliable primary keys and a set of views.

Primary keys on INTEGERS with SEQUENCE

In database theory, a primary key is more than a unique index.

Using MySQL, some developers may believe that primary keys and unique indexes are the same. This is a false belief. A primary key has three goals:

A primary key is a reliable unique index

Reliable means that a primary key is supposed to keep the same value over time. It is usually accepted in database theory that only INTEGER using a SEQUENCE should be used for primary keys.

Adding integer primary keys on each table is not a complicated work. It will not break Amarok source code. People and developers willing to use the IDs will be able to use them. It is their choice.

A primary key is needed by query planners

The query planner is the complex engine inside the database system. Modern database systems like PostgreSQL collect statistics about queries, analyse and execute them faster over time. To find the best solutions, a query planner needs primary keys on all tables.

Without primary keys, a database is mostly blind. On fast computers with a lot of shared memory, you may not see it, as the database may execute entirely in memory. But on low-end computers, this can be a real problem.

A primary key should reduce memory usage and offer fast disc access

SQLite recommends adding primary keys on integers with sequences on this page : http://www.sqlite.org/datatypes.html

2.0   The INTEGER PRIMARY KEY
One exception to the typelessness of SQLite is a column whose type is INTEGER PRIMARY KEY. 
(And you must use "INTEGER" not "INT". A column of type INT PRIMARY KEY is typeless just like any other.) 
INTEGER PRIMARY KEY columns must contain a 32-bit signed integer. 
Any attempt to insert non-integer data will result in an error. 
INTEGER PRIMARY KEY columns can be used to implement the equivalent of AUTOINCREMENT. 

If you try to insert a NULL into an INTEGER PRIMARY KEY column, 
the column will actually be filled with a integer that is one greater than the largest key already in the table. 

Or if the largest key is 2147483647, then the column will be filled with a random integer. 
Either way, the INTEGER PRIMARY KEY column will be assigned a unique integer.

What does this paragraph mean?

Probably that SQLite internals are optimized when a primary key is set on an integer value. Database systems use integers internally to access data row. Simple and fast. Whenever we set a primary key on a text field, data access is slowed.

Inside eavery database (PostgreSQL, MySQL or even SQLite), records are referenced using integers, not text values. Database systems have been optimized to access data very fast using INTEGER primary key. Using a large text unique index is not an optimized solution for fast disc access.

Left Joins

To be writen. We all agree that INNER JOINS should be replaced by LEFT JOINs. But this can be done at any later time.

Views

Views are very interesting, because they store query statistics. It allows the planner to prepare queries and execute them faster over time.

Therefore, it is recommended to migrate complex queries using JOINS to VIEWS. Usually, not much work is involved, as VIEWs are used for SELECTs, not INSERTs.

Implementation steps

IMHO, the database should be changed in the following steps:

  • Firstly add PRIMARY KEYS on INTEGERs using SEQUENCES on each table. This will not break Amarok source code. This must be done before beta3 is shipped. I propose that everyone tries to implement the needed changes now.

There is no need to work on issues like playlist or colour if the database is not compliant with standards.

  • After Beta3, LEFT JOINs can be implemented.
  • Then, when needed, we can also add VIEWs.

The design of database pug-ins is an Amarok client-side issue. Database plug-ins will not make PostgreSQL, MySQL or SQLite run faster. Plug-ins are simply a different project.

Recommended development platform

PostgreSQL 8.x is a reference SQL99 database.

A reference platform means that it offers all features of modern database systems, but also that it is a community driven project, like Xorg is a reference for X and GNU/Linux kernel a reference for Unix.

For features and speed, PostgreSQL can only compare to Oracle or DB2. Except these three databases, other systems offer "would-be" features, which means that some features are not implemented yet, but marketing department says "Yes, we support this ... in the future ... this is planned for version y.x".

If your code runs under PostgreSQL it is likely to run everywhere on all SQL99 compliant platforms.

Does SQL99 compliance have any relevance? Generally amaroK SQL queries have been developed with MySQL or sqlite in mind, and then they have to be fixed to work on the other and postgresql. I was always under the understanding that postgresql provided more features then mysql or sqlite. --Eean 17:01, 1 April 2006 (EST)

For writing and debugging SQL queries, I recommend using PostgreSQL 8.1 and pgAdmin3 graphical interface.

pgAdmin3 gives graphical access to all PostgreSQL features:

pgadmin3_linux.png

  • In PostgreSQL, turn on logging all queries. It will allow you to copy and paste queries in pgAdmin3 to analyse them. Queries should execute mostly in memory, looping on disc only in the last step of execution. pgAdmin3 query interface can display query plans graphicaly.

pgadmin3_explain.png