This page describes what will be implemented for milestone#1 of LevelDB Storage Engine. For development after MS1, see LevelDB Storage Engine Development
We will use one LevelDB instance for mysqld process. LevelDB keys will be prefixed with 'dbname.table_name.PRIMARY' for the table itself, and 'dbname.table_name.index_name' for the secondary indexes. This allows to store arbitrary number of tables/indexes in one LevelDB instance.
We will rely on LevelDB's compression to make the storage compact. Data that goes into LevelDB's key will be stored in KeyTupleFormat (which allows mysql's lookup/index ordering functions to work).
Data that goes into LevelDB's value will be stored in table->record[0] format, except blobs. (Blobs will require special storage convention because they store a char* pointer in table->record[0]).
We will need to support blobs because table nodetable has a mediumtext field.
Non-unique secondary indexes will be supported.
LevelDB stores {KEY->VALUE} mappings. Non-unique index will need to have some unique values for KEY. This can be arranged by using this mapping:
KEY = {index_columns, primary_key_columns}.
VALUE = {nothing}
Using primary key as suffix will make DB::Get() useless. Instead, we will have to do lookups with:
get(secondary_index_key_val)
{
open cursor for (secondary_index_key_val)
read the first record
if (record > secondary_index_key_val)
return NOT_FOUND;
else
return FOUND;
}
TODO: we will not support unique secondary indexes in MS1. ALTER/CREATE TABLE statements attempting to add a unique index will fail. Is this ok?
We will use what was discussed in the "Pessimistic locking proposal".
Basic idea is: LevelDB's operations do blind overwrites. In order to make sure we're not overwriting anything, we will use this pattern:
acqure record lock; read; modify as necessary; write; release record lock;
record locks can be obtained for {table_name, primary_key}. Locks are exclusive, for a given record, only one lock can obtained at a time. A lock can be obtained when its record doesn't exist (see INSERT below)
UPDATE will use the above scheme. It will get row locks for the keys it is reading in order to prevent concurrent updates.
INSERT will use a row lock to make sure the record of interest does not exist.
If a DELETE statement has a form of
DELETE FROM tbl WHERE tbl.primary_key=const
then it theoretically can be translated into a DB::Delete() call, that is, into a write-without-read. In other cases, we will need to do reads and put locks on the rows that we want to delete.
MS1 will only implement the variant with locking DELETE.
SELECT statements will use a read snapshot. They will not put (or check) whether there are any locks for records they are reading. This is similar to the definition of read-committed isolation level.
We will also support SELECT FOR UPDATE. In this mode, the read records will be locked with a write lock until the end of the transaction.
As specified in previous sections, we will be employing locks on the values of {dbname, tablename, primary_key_value}. Locks will be exclusive: only one transaction can hold a lock at a time.
Locks are acquired one-by-one, which allows for deadlocks. There will be no deadlock detection or deadlock prevention systems. Instead, lock waits will time out after @@leveldb_lock_wait_timeout milliseconds. When @@leveldb_lock_wait_timeout==0, lock system will not wait at all, attempt to acquire a lock that's occupied will result in an immediate transaction abort.
Locks will be stored in a highly-concurrent hashtable. Current candidate for it is mysys/lf_hash.
The changes made by transaction will be accumulated as a LevelDB batch operation, and applied at transaction commit. This has a consequence:
the transaction is unable to see its own changes until it commits
We'll call the above CANT-SEE-OWN-CHANGES property. The property is contrary to the SQL's semantics. In SQL, one expects to see the changes they've made in preceding statements. However, the set of transactions we're targeting can live with CANT-SEE-OWN-CHANGES, so we'll live with the property for MS1.
After MS1, LevelDB SE will make sure that CANT-SEE-OWN changes is not observed. It will use the following approach:
MS1 will support:
Note: DB::GetApproximateSizes() returns amount of disk space occupied by the data. The number cannot be directly translated to #rows, because
Because of this, optimizer will have very imprecise input. It is expected to be still sufficient for MS1.
We will need to do fast bulk data loads. During the bulk load, writes-without-reads are ok: the user knows he's not overwriting data, he doesn't care about @@rows_affected.
These will be achieved as follows:
Bulk-loading mode means:
© 2023 MariaDB
Licensed under the Creative Commons Attribution 3.0 Unported License and the GNU Free Documentation License.
https://mariadb.com/kb/en/leveldb-storage-engine-ms1/