Data mapping from HBase to SQL
The Jira task for this is MDEV-122
Nobody is working on this feature at the moment. See the Cassandra Storage Engine for a related development that has reached the release stage.
This page describes a feature that's under development. The feature has not been released (even in beta), its interface and function may change, etc.
The HBase API defines two ways to read data: a Get, which retrieves a single row by its row id, and a Scan, which iterates over a range of rows.
Both kinds of reads allow one to specify which column families and column names to retrieve.
The default behavior for versioned columns is to return only the most recent version. The HBase API also allows one to ask for multiple versions of a cell (up to a given number of versions, or versions within a timestamp range).
One can see two ways to map HBase tables to SQL tables: a per-cell mapping and a per-row mapping; both are described below.
HBase shell has 'scan' command, here's an example of its output:
hbase(main):007:0> scan 'testtable'
ROW       COLUMN+CELL
 myrow-1  column=colfam1:q1, timestamp=1297345476469, value=value-1
 myrow-2  column=colfam1:q2, timestamp=1297345495663, value=value-2
 myrow-2  column=colfam1:q3, timestamp=1297345508999, value=value-3
Here, one HBase row produces multiple rows in the query output. Each output row represents one (row_id, column) combination, so rows with multiple columns (and multiple revisions of column data) can be easily represented.
Mapping could be defined as follows:
CREATE TABLE hbase_tbl_cells (
  row_id        binary(MAX_HBASE_ROWID_LEN),
  column_family binary(MAX_HBASE_COLFAM_LEN),
  column_name   binary(MAX_HBASE_NAME_LEN),
  timestamp     TIMESTAMP,
  value         BLOB,
  PRIMARY KEY (row_id, column_family, column_name, timestamp)
) ENGINE=hbase_cell;
There is no need for dynamic columns in this mapping.
# Point-select:
SELECT value FROM hbase_cell
WHERE row_id='hbase_row_id' AND
      column_family='hbase_column_family' AND
      column_name='hbase_column' ...
# Range select:
# (the example uses BETWEEN but we will support arbitrary predicates)
SELECT value FROM hbase_cell
WHERE row_id BETWEEN 'hbase_row_id1' AND 'hbase_row_id2' AND
      column_family='hbase_column_family' AND
      column_name='hbase_column'
# Update a value for {row, column}
UPDATE hbase_cell SET value='value'
WHERE row_id='hbase_row' AND
column_family='col_family' AND column_name='col_name'
# Add a column into row
INSERT INTO hbase_cell VALUES ('hbase_row', 'col_family', 'col_name', 'value');
Note that the table is defined as having a
PRIMARY KEY (row_id, column_family, column_name, timestamp)
which makes it possible to use the range optimizer to get ranges over these key parts.
If a range specifies one row, we can read it with HTable.get(), otherwise we'll have to use HTable.getScanner() and make use of the obtained scanner.
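The get-vs-scan decision above can be sketched as follows. This is a toy illustration (not the actual engine code), assuming the range optimizer hands us the endpoints of the row_id range:

```python
# Toy sketch: decide between a point HTable.get() and a range
# HTable.getScanner() from the row_id range the optimizer produced.
def plan_access(min_rowid, max_rowid, min_inclusive=True, max_inclusive=True):
    """Return 'get' when the range pins down exactly one row, else 'scan'."""
    if min_rowid == max_rowid and min_inclusive and max_inclusive:
        return 'get'   # single row -> HTable.get()
    return 'scan'      # otherwise  -> HTable.getScanner()

print(plan_access(b'row1', b'row1'))   # point select
print(plan_access(b'row1', b'row9'))   # range select
```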
The HBase API allows one to scan a range of rows while retrieving only certain column names or certain column families. In our SQL mapping, this can be written as:
SELECT value FROM hbase_cell
WHERE row_id BETWEEN 'hbase_row_id1' AND 'hbase_row_id2' AND
      column_family='hbase_column_family'                      (*)
If we feed this into the range optimizer, it will produce a range:
('hbase_row_id1', 'hbase_column_family') <= (row_id, column_family) <=
('hbase_row_id2', 'hbase_column_family')
which includes all column families for records which satisfy
'hbase_row_id1' < rowid < 'hbase_row_id2'
This will cause extra data to be read.
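The effect can be seen with plain lexicographic tuple comparison; the cell layout below is an illustrative assumption:

```python
# With a composite range over (row_id, column_family), lexicographic
# comparison admits cells from *other* column families for rows inside
# the range, even though we only asked for cf1.
cells = [
    (b'row1', b'cf1'), (b'row1', b'cf2'),
    (b'row5', b'cf1'), (b'row5', b'cf2'),
    (b'row9', b'cf1'), (b'row9', b'cf2'),
]
lo = (b'row1', b'cf1')
hi = (b'row9', b'cf1')
in_range = [c for c in cells if lo <= c <= hi]
# (b'row5', b'cf2') is included although we only wanted cf1:
print(in_range)
```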
Possible solutions:
INSERT will be translated into an HTable.checkAndPut(..., value=NULL) call. That way, an attempt to insert a {rowid, column} combination that already exists will fail.
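A toy in-memory model of these semantics (the real call is HTable.checkAndPut() in the HBase Java client; the dict-based store here is just an illustration):

```python
# Toy model of checkAndPut(..., value=NULL): the put is applied only if
# the cell does not exist yet, so inserting an existing {rowid, column}
# combination fails.
store = {}  # {(rowid, column): value}

def check_and_put(rowid, column, expected, value):
    """Apply the put only if the current cell value equals `expected`
    (None meaning 'cell must not exist'). Return True on success."""
    if store.get((rowid, column)) != expected:
        return False
    store[(rowid, column)] = value
    return True

print(check_and_put(b'row1', b'cf:q1', None, b'v1'))  # first insert succeeds
print(check_and_put(b'row1', b'cf:q1', None, b'v2'))  # duplicate insert fails
```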
MySQL/MariaDB's storage engine API handles DELETEs like this:
ha_hbase_cell can remember {rowid, column_name} of the record, and then use HBase.checkAndDelete() call, so that we're sure we're deleting what we've read.
If we get a statement in form of
DELETE FROM hbase_cell WHERE rowid='hbase_row_id' AND column_family='...' AND column_name='...';
then reading the record is redundant (we could just make one HBase.checkAndDelete() call). This will require some form of query pushdown, though.
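The reason for using checkAndDelete() rather than a plain delete can be illustrated with a toy model (the dict store and values are assumptions):

```python
# Toy model of the read-then-HBase.checkAndDelete() protocol: the delete
# succeeds only if the cell still holds the value we read, so a concurrent
# writer cannot make us delete data we never saw.
store = {(b'row1', b'cf:q1'): b'v1'}

def check_and_delete(rowid, column, expected):
    if store.get((rowid, column)) != expected:
        return False          # someone changed the cell since we read it
    del store[(rowid, column)]
    return True

seen = store[(b'row1', b'cf:q1')]       # the value our scan returned
store[(b'row1', b'cf:q1')] = b'v2'      # a concurrent update sneaks in
print(check_and_delete(b'row1', b'cf:q1', seen))  # fails: value changed
```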
UPDATEs are similar to deletes as long as row_id, column_family, and column_name fields are not changed (that is, only column_value changes). Like with DELETEs:
If an UPDATE statement changes the row_id, column_family, or column_name field, it becomes a totally different operation. HBase doesn't allow changing the rowid of a record; we can only remove the record with the old rowid and insert a record with the new rowid. HBase doesn't support multi-row transactions, so we'll want to insert the new variant of the record before we delete the old one (I assume that data duplication is better than data loss).
For the first milestone, we could disallow UPDATEs that change row_id, column_family, or column_name.
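The insert-before-delete ordering for a rowid-changing UPDATE can be sketched like this (toy model; the simulated crash shows why duplication, not loss, is the failure mode):

```python
# Without multi-row transactions, write the new row first and delete the
# old one second: a crash between the two steps duplicates the data
# instead of losing it.
store = {b'old_row': b'data'}

def move_row(old_id, new_id, crash_between=False):
    store[new_id] = store[old_id]   # step 1: insert new variant
    if crash_between:
        return                      # simulated crash: duplicated, not lost
    del store[old_id]               # step 2: delete old variant

move_row(b'old_row', b'new_row', crash_between=True)
print(sorted(store))   # both rows present -> no data loss
```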
Let each row in the HBase table be mapped into a row from the SQL point of view:
SELECT * FROM hbase_table;

row-id  column1  column2  column3  column4 ...
------  -------  -------  -------  -------
row1    data1    data2
row2                      data3
row3    data4                      data5
The problem is that the set of columns in an HBase table is not fixed and is potentially very large. The solution is to put all columns into one blob column and use Dynamic Columns (http://kb.askmonty.org/en/dynamic-columns) functions to pack/extract values of individual columns:
row-id dyn_columns
------ ------------------------------
row1 {column1=data1,column2=data2}
row2 {column3=data3}
row3 {column1=data4,column4=data5}
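Conceptually, the dynamic-column blob is a serialized map from column name to value. A toy Python model (the real MariaDB format is a compact binary encoding, not a dict; function names mirror the SQL functions for illustration only):

```python
# Toy model of the dynamic-column blob as a name -> value map.
def column_create(**cols):
    return dict(cols)             # "pack" columns into one value

def column_get(blob, name):
    return blob.get(name)         # extract one column's value

def column_exists(blob, name):
    return name in blob

row1 = column_create(column1='data1', column2='data2')
print(column_get(row1, 'column2'))      # data2
print(column_exists(row1, 'column3'))   # False
```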
Table DDL could look like this:
CREATE TABLE hbase_tbl_rows (
  row_id  BINARY(MAX_HBASE_ROWID_LEN),
  columns BLOB,  -- All columns/values packed in dynamic column format
  PRIMARY KEY (row_id)
) ENGINE=hbase_row;
(TODO: Does HBase have a MAX_HBASE_ROWID_LEN limit? What is it? We can ignore this: let the user define the row_id column with whatever limit they desire, and don't do operations on rows whose row_id is longer than that limit.)
Functions for reading data:
COLUMN_GET(dynamic_column, column_nr AS type)
COLUMN_EXISTS(dynamic_column, column_nr)
COLUMN_LIST(dynamic_column)
Functions for data modification:
COLUMN_ADD(dynamic_column, column_nr, value [AS type], ...)
COLUMN_DELETE(dynamic_column, column_nr, column_nr, ...)
Dynamic column functions cannot be used as-is: they address columns by number (column_nr), while HBase columns are identified by name. Support for named dynamic columns is covered in MDEV-377.
# Point-select, get value of one column
SELECT COLUMN_GET(hbase_tbl.columns, 'column_name' AS INTEGER)
FROM hbase_tbl
WHERE row_id='hbase_row_id';
# Range select:
# (the example uses BETWEEN but we will support arbitrary predicates)
SELECT COLUMN_GET(hbase_tbl.columns, 'column_name' AS INTEGER)
FROM hbase_tbl
WHERE row_id BETWEEN 'hbase_row_id1' AND 'hbase_row_id2';
# Update or add a column for a row
UPDATE hbase_tbl
SET columns=COLUMN_ADD(columns, 'column_name', 'value')
WHERE row_id='hbase_row_id1';
Use of COLUMN_ADD as above performs no check of whether column_name=X already existed for that row. If it did, the value is silently overwritten.
ATTENTION: There seems to be no easy way to do something like SQL's INSERT statement, i.e. an operation that fails if the data being stored already exists. One can write a convoluted IF(..., ...) expression that performs the store-if-not-exists operation, but it's bad when basic operations require convoluted expressions.
ATTENTION: One could also question whether a statement with the semantics of "store this data irrespective of what was there before" has any value for a "remote" storage engine, where you're not the only one modifying the data.
# Set all columns at once, overwriting the content that was there
UPDATE hbase_tbl SET columns=... WHERE row_id='hbase_row_id1';
UPDATE hbase_tbl SET columns=COLUMN_CREATE('column1', 'foo') WHERE row_id='row1';
Note that the last statement will cause all columns except 'column1' to be deleted for row 'row1'. This seems logical from the SQL standpoint, but there is no such operation in HBase.
# Insert a new row with column(s)
INSERT INTO hbase_tbl (row_id, columns) VALUES
('hbase_row_id', COLUMN_CREATE('column_name', 'column-value'));
Q: It's not clear how to access versioned data. Can we go without versioned data for the first milestone (and then use @@hbase_timestamp for the second milestone)?
Q: It's not clear how to select "all columns from column family X".
The table declares:
row_id BINARY(MAX_HBASE_ROWID_LEN), ... PRIMARY KEY (row_id)
which allows the range/ref optimizer to extract ranges over the row_id column.
One can also imagine a realistic query which uses conditions on hbase column names:
SELECT column_get(columns, 'some_data')
FROM hbase_tbl
WHERE
  row_id BETWEEN 'first_interesting_row' AND 'last_interesting_row' AND
  column_get(columns, 'attribute' as string)='eligible';
The range optimizer is unable to catch conditions of the form
column_get(columns, 'attribute' as string)='eligible'
We'll need to either extend it, or create another condition analyzer.
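Such an analyzer could, as a sketch, walk the parsed condition and collect the columns that COLUMN_GET(...)=const predicates constrain. The tuple-based condition representation below is an assumption made purely for illustration:

```python
# Toy condition analyzer: collect {hbase_column: constant} pairs from
# COLUMN_GET(...) = const predicates in an AND-ed condition tree.
def collect_column_conditions(cond):
    """cond is ('and', [subconds]), ('eq', ('column_get', name), const),
    or any other predicate tuple (ignored)."""
    found = {}
    if cond[0] == 'and':
        for sub in cond[1]:
            found.update(collect_column_conditions(sub))
    elif cond[0] == 'eq' and cond[1][0] == 'column_get':
        found[cond[1][1]] = cond[2]
    return found

cond = ('and', [
    ('between', 'row_id', 'first_interesting_row', 'last_interesting_row'),
    ('eq', ('column_get', 'attribute'), 'eligible'),
])
print(collect_column_conditions(cond))
```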
Currently, MariaDB works with Dynamic Columns according to this scenario:
If we use this approach with HBase, we will cause a lot of overhead with reading/writing of unneeded data.
This scheme ensures there are no redundant data reads, at the cost of making extra mysqld<->HBase roundtrips (which are likely to be expensive).
This may cause redundant data reads, for example for
SELECT COLUMN_GET(columns, 'column1' AS INTEGER)
FROM hbase_tbl
WHERE
  row_id BETWEEN 'hbase_row_id1' AND 'hbase_row_id2' AND
  COLUMN_GET(columns, 'column2' AS INTEGER)=1
column1 will be read even for rows that have column2!=1. This still seems better than making extra roundtrips.
There is a question of what should be done when the query has references like
COLUMN_GET(columns, {non-const-item} AS ...)
where it is not possible to tell in advance which columns must be read. Possible approaches are
See the above sections: we'll be able to analyze the condition on row_id and determine the list of columns we need to read. That gives sufficient info to either make an HTable.get() call, or call HTable.getScanner() and use the obtained scanner.
INSERT should make sure it actually creates the row; it should not overwrite existing rows. This is not trivial in HBase: the closest we can get is to make a number of HTable.checkAndPut() calls, with checks that we're not overwriting data.
This will cause INSERT ('row_id', COLUMN_CREATE('column1', 'data')) to succeed even if the table already had a row with ('row_id', COLUMN_CREATE('column2', 'data')).
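A toy model of why per-column checkAndPut() calls fail to give row-level INSERT semantics (the store layout is an assumption):

```python
# Each checkAndPut only checks its own {rowid, column} cell, so it cannot
# detect that the *row* already exists via a different column.
store = {(b'row_id', b'column2'): b'data'}   # row exists, via column2

def check_and_put(rowid, column, value):
    if (rowid, column) in store:
        return False                         # this column already exists
    store[(rowid, column)] = value
    return True

# INSERT ('row_id', COLUMN_CREATE('column1', 'data')) succeeds although
# the row is already there, because column1 itself did not exist yet:
print(check_and_put(b'row_id', b'column1', b'data'))
```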
Another possible problem is that INSERT can fail mid-way (we will insert only some columns of the record).
DELETE seems to be OK: we can delete all {rowid, column_name} combinations for the given row_id. I'm not sure; perhaps this will require multiple HBase calls.
Just like with per-cell mapping, UPDATEs that change the row_id are actually deletions followed by inserts. We can disallow them in the first milestone.
The most frequent form of UPDATE is expected to be one that changes the value of a column:
UPDATE hbase_tbl
SET columns=COLUMN_ADD(columns, 'column_name', 'value')
WHERE row_id='hbase_row_id1' AND
      COLUMN_GET(columns, 'column_name')='foo';
For that one, we need modified Dynamic Column Functions that will represent *changes* in the set of columns (and not *state*), so that we can avoid reading columns and writing them back.
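A sketch of what such change-set ("delta") functions could look like. None of these names exist in MariaDB today; they are assumptions for illustration:

```python
# Hypothetical "delta" dynamic-column functions: they describe *changes*
# to the set of columns, which the engine could translate into HBase
# puts/deletes without reading the row's blob first.
def column_add_delta(name, value):
    return ('put', name, value)

def column_delete_delta(name):
    return ('delete', name)

def apply_deltas(row, deltas):
    """What the storage engine would send to HBase for one updated row."""
    for op, name, *rest in deltas:
        if op == 'put':
            row[name] = rest[0]
        else:
            row.pop(name, None)
    return row

row = {'column_name': 'foo', 'other': 'x'}
print(apply_deltas(row, [column_add_delta('column_name', 'value')]))
```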
This is a simplification of the per-row mapping. Suppose the user is only interested in particular columns, with names column1 and column2. They create a table with this definition:
CREATE TABLE hbase_tbl_cells (
  row_id  binary(MAX_HBASE_ROWID_LEN),
  column1 TYPE,
  column2 TYPE,
  PRIMARY KEY (row_id),
  KEY(column1),
  KEY(column2)
) ENGINE=hbase_columns;
and then access it. Access is done like in per-row mapping, but without use of dynamic columns.
This mapping imposes lots of restrictions: it is only possible to select a fixed set of columns, there is no way to specify version of the data, etc.
If we select two columns from a certain row, per-cell mapping produces "vertical" result, while per-row mapping produces "horizontal" result.
# Per-cell:
SELECT column_name, value
FROM hbase_cell
WHERE
row_id='hbase_row_id1' AND
column_family='col_fam' AND column_name IN ('column1','column2')
+-------------+-------+
| column_name | value |
+-------------+-------+
| column1 | val1 |
| column2 | val2 |
+-------------+-------+
# Per-row:
SELECT
  COLUMN_GET(columns, 'col_fam:column1') AS column1,
  COLUMN_GET(columns, 'col_fam:column2') AS column2
FROM hbase_row
WHERE row_id='hbase_row_id1'

+---------+---------+
| column1 | column2 |
+---------+---------+
| val1    | val2    |
+---------+---------+
Per-cell mapping:
Per-row (or select-columns) mapping is easier when:
HBase is written in Java, and its native client API is a Java library. We need to interface with it from the C++ storage engine code. Possible options are:
This requires the HBase installation to run a Thrift server.
HBaseRpc.java for "Unofficial Hadoop / HBase RPC protocol documentation"

Q: Will we need joins, i.e. do I need to implement Multi-Range Read and support Batched Key Access right away?
© 2023 MariaDB
Licensed under the Creative Commons Attribution 3.0 Unported License and the GNU Free Documentation License.
https://mariadb.com/kb/en/hbase-storage-engine/