Shard-Query is a high performance MPP (massively parallel processing) query engine for MariaDB and MySQL which offers increased parallelism compared to stand-alone servers. This increased parallelism is achieved by taking advantage of MariaDB/MySQL partitioning, sharding, common query clauses like BETWEEN and IN, or some combination of the above.
Shard-Query is implemented with PHP and Gearman.
Shard-Query is targeted mainly at big data problems, and OLAP queries in general. The primary goal of Shard-Query is to enable low-latency query access to extremely large volumes of data utilizing commodity hardware and open source database software. Shard-Query is a federated query engine which is designed to perform as much work in parallel as possible over a sharded dataset, that is one that is split over multiple servers (shards) or partitioned tables.
Shard-Query behaves like READ-COMMITTED
in terms of transaction isolation for a single query. Shard-Query works best with a star schema data set, and is therefore targeted towards large data marts. Shard-Query also works well with extremely large tables, such as machine generated logs or sensor data.
The swanhart-tools repository, on GitHub, contains tools for MariaDB and MySQL from Justin Swanhart. Shard-Query is located in the shard-query
directory. The whole project can be downloaded via the GitHub's Download button. It is possible to report bugs using the Issues tab.
A project on Google Code also exists, but it is obsolete. All bugs from Google Code and all downloads from Google Code are deprecated and will be removed.
A Facebook page is also available.
No web server is required. However, Apache 2 has been tested and a configuration file is included in the project.
The proxy
directory contains a MySQL/MariaDB proxy. It can be used to transmit data between Shard-Query and applications, so that client programs don't need to know Shard-Query. The proxy is a Lua script which uses the JSON for Lua library.
You can run just about all SQL queries over your dataset.
For SELECT
queries:
SUM
, COUNT
, MIN
, MAX
and AVG
are the fastest aggregate operations SUM
/COUNT(DISTINCT ..)
are supported, but are slower STD
/VAR
/etc are supported but aggregation is not pushed down at all (slowest) PERCENTILE(expr, N)
- take a percentile, for example percentile(score,90) JOIN
s are supported (no self joins, or joins of tables sharded by different keys)
ORDER BY
, GROUP BY
, HAVING
, WITH ROLLUP
, and LIMIT
are supported UNION
s are fully supported.
Also supports:
INSERT
, UPDATE
, DELETE
, LOAD DATA INFILE CREATE TABLE
, ALTER TABLE
and DROP TABLE
Details about how these statements are performed are explained before.
Shard-Query is designed to work with a sharded data set. Shard-Query is designed to make a group of shards behave as one virtual database server, even for complex queries. Shard-Query includes two shard "mappers", described below.
Hash based sharding features very fast row to shard lookups (this is called mapping) because no network round trip is required to calculate the lookup. The downsides are significant though, because the number of shards is fixed.
Directory based sharding uses a database table on a directory database server to map rows to shards. This is very scalable, because rows can be migrated between shards, and the number of shards is not fixed (it can be easily grown). Directory mapping can be a problem if the directory must be very large (too large for a single directory server.) Directory mapping is slightly slower than hash mapping because a round trip and database lookups must be performed.
Shard-Query uses one of the above mappers in order to determine to which shards a query should be sent based on the WHERE
clause used in the query. It may be possible to reduce the number of shards to which to send a query when the WHERE
clause includes a condition on the shard column.
Shard-Query automatically rewrites queries so that they return proper results when the query is executed over multiple shards. For example, when the COUNT
aggregate function is used, it is executed as the SUM
of the COUNT
from each of the queried shards. Shard-Query elects a 'coordinator node' for each SELECT
query, which is used for aggregating the results.
When loading data or inserting rows, Shard-Query examines the data being inserted and sends the row to the appropriate shard. The loader is massively parallel and will load delimited files in chunks in parallel.
All SQL except SELECT
and INSERT
is sent to all shards.
Shard-Query supports both sharded and un-sharded tables. Un-sharded tables are tables which do not contain the shard column. Insertions into un-sharded tables go to all shards. If a query contains no sharded tables, then the query is sent to only a single shard to prevent duplication of data.
Shard-Query includes
bin/run_query
- tool to run queries. use php ./run_query --help
. This is the tool you should look at to see how to integrate Shard-Query in your own app. loader
- tool to load (or pre-split) flat files for sharded loading. The REST UI is installed in /usr/share/shard-query/ui
by default.
You can reach the UI by visiting the url: http://your_server/shard_query
The default username is: user and the password is mpp.
© 2019 MariaDB
Licensed under the Creative Commons Attribution 3.0 Unported License and the GNU Free Documentation License.
https://mariadb.com/kb/en/shard-query/