This document discusses techniques for improving performance for data-warehouse-like tables in MariaDB and MySQL.
Details on summary tables are covered in the companion document: Summary Tables.
The terms used here mirror standard "Data Warehouse" terminology.
Several techniques should be applied to the huge Fact table.
Normally, reports should be answered from the Summary tables, but there are exceptions where the Fact table must be accessed to retrieve multiple rows. Even then, minimize the number of INDEXes on the Fact table, because they are likely to be costly on INSERT.
Once you have built the Summary table(s), there is not much need for the Fact table. One option that you should seriously consider is to not have a Fact table. Or, at least, you could purge old data from it sooner than you purge the Summary tables. Maybe even keep the Summary tables forever.
Case 1: You need to find the raw data involved in some event. But how will you find those row(s)? This is where a secondary index may be required.
If a secondary index is bigger than can be cached in RAM, and if the column(s) being indexed are random, then each inserted row may cause a disk hit to update the index. This limits insert speed to something like 100 rows per second (on ordinary disks). Multiple random indexes slow down insertion further. RAID striping and/or SSDs speed up insertion. Write caching helps, but only for bursts.
Case 2: You need to find some event, but you did not plan ahead with the optimal INDEX. If the data is PARTITIONed on date, then as long as you have some idea of when the event occurred, "partition pruning" will keep the query from being too terribly slow.
Case 3: Over time, the application is likely to need new 'reports', which may lead to a new Summary table. At this point, it would be handy to scan through the old data to fill up the new table.
Case 4: You find a flaw in the summarization, and need to rebuild an existing Summary table.
Cases 3 and 4 both need the "raw" data, but that data does not necessarily need to be sitting in a database table. It could be in the pre-database format (such as log files). So, consider not building the Fact table at all, and simply keep the raw data, compressed, on some file system.
When talking about billions of rows in the Fact table, it is essentially mandatory that you "batch" the inserts. There are two main ways: multi-row INSERT statements (INSERT INTO ... VALUES (...), (...), ...) and LOAD DATA.
A third way is to INSERT or LOAD into a Staging table, then periodically batch-copy the rows from there into the Fact table.
Chunk size should usually be 100-1000 rows.
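As a hedged sketch (the Fact columns dt, foo_id, blah and the file name are made up for illustration), the first two approaches look like this:

-- Multi-row INSERT: one statement carries a whole chunk (100-1000 rows).
INSERT INTO Fact (dt, foo_id, blah) VALUES
    ('2024-03-05 10:00:00', 17, 1.2),
    ('2024-03-05 10:00:01', 99, 3.4),
    ('2024-03-05 10:00:01', 17, 5.6);

-- LOAD DATA: bulk-load a prepared file in one statement.
LOAD DATA LOCAL INFILE '/tmp/fact_chunk.tsv'
    INTO TABLE Fact
    FIELDS TERMINATED BY '\t'
    (dt, foo_id, blah);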
If your data is coming in continually, and you are adding a batching layer, let's do some math. Compute your ingestion rate -- R rows per second.
If batching seems viable, then design the batching layer to gather rows for S seconds (the longest delay you can tolerate before newly arrived data is inserted) or until it has 100-1000 rows, whichever comes first.
(Note: Similar math applies to rapid UPDATEs of a table.)
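For example (the numbers are hypothetical): at R = 50 rows per second with a target chunk of 500 rows, a chunk fills every 500 / 50 = 10 seconds. If the longest acceptable delay is S = 30 seconds, the size limit triggers first, so you issue one 500-row INSERT every 10 seconds -- roughly 500 times fewer statements than inserting row by row.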
Normalization is important in Data Warehouse applications because it significantly cuts down on the disk footprint and improves performance. There are other reasons for normalizing, but space is the important one for DW.
Here is a typical pattern for a Dimension table:
CREATE TABLE Emails (
    email_id MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT,  -- don't make bigger than needed
    email VARCHAR(...) NOT NULL,
    PRIMARY KEY (email),   -- for looking up one way
    INDEX(email_id)        -- for looking up the other way (UNIQUE is not needed)
) ENGINE = InnoDB;         -- to get clustering
Notes: the PRIMARY KEY on the string makes the common lookup (email -> email_id) a single probe into the clustered index, while the secondary INDEX(email_id) handles the reverse lookup and does not need to be UNIQUE because AUTO_INCREMENT already guarantees distinct values. Keep the id as small as practical; MEDIUMINT UNSIGNED allows about 16 million distinct values.
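For illustration, these are the two lookups the pattern is built for (the literal values are made up):

-- Normalization: string -> id (uses the clustered PRIMARY KEY)
SELECT email_id FROM Emails WHERE email = 'user@example.com';

-- For reports: id -> string (uses INDEX(email_id))
SELECT email FROM Emails WHERE email_id = 12345;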
I bring this up as a separate topic because of some of the subtle issues that can happen.
You may be tempted to do
INSERT IGNORE INTO Foos (foo) SELECT DISTINCT foo FROM Staging; -- not wise
It has the problem of "burning" AUTO_INCREMENT ids. MariaDB allocates the new ids before getting to "IGNORE", so rows that turn out to be duplicates still consume ids. That can rapidly push the AUTO_INCREMENT counter far beyond what you expected.
Better is this...
INSERT IGNORE INTO Foos (foo)
    SELECT DISTINCT Staging.foo
        FROM Staging
        LEFT JOIN Foos ON Foos.foo = Staging.foo
        WHERE Foos.foo_id IS NULL;
Notes: the LEFT JOIN .. IS NULL construct selects only the foo values that are not already in Foos, so AUTO_INCREMENT ids are allocated only for genuinely new values. IGNORE is still useful in case another process inserts the same value between the SELECT and the INSERT.
Once that INSERT is done, this will find all the foo_ids it needs:
INSERT INTO Fact (..., foo_id, ...)
    SELECT ..., Foos.foo_id, ...
        FROM Staging
        JOIN Foos ON Foos.foo = Staging.foo;
An advantage of "Batched Normalization" is that you can summarize directly from the Staging table. Some cases:
Case 1: PRIMARY KEY (dy, foo) and summarization is in lock step with, say, changes in `dy`.
INSERT INTO Summary (dy, foo, ct, blah_total)
    SELECT DATE(dt) AS dy, foo,
           COUNT(*) AS ct, SUM(blah) AS blah_total
        FROM Staging
        GROUP BY 1, 2;
Case 2: (dy, foo) is a non-UNIQUE INDEX. The same INSERT .. SELECT as Case 1 works at any time, but the Summary table can accumulate multiple rows per (dy, foo); reports must then SUM() across them.
Case 3: PRIMARY KEY (dy, foo) and summarization can happen anytime.
INSERT INTO Summary (dy, foo, ct, blah_total)
    SELECT DATE(dt) AS dy, foo,
           COUNT(*) AS ct, SUM(blah) AS blah_total
        FROM Staging
        GROUP BY 1, 2
    ON DUPLICATE KEY UPDATE
        ct = ct + VALUES(ct),
        blah_total = blah_total + VALUES(blah_total);
This document lists a number of ways to do things. Your situation may lead to one approach being more or less acceptable. But if you are thinking "Just tell me what to do!", then here is the straightforward plan (sketched below): load each incoming batch into a Staging table; normalize any new Dimension values from Staging; summarize from Staging into the Summary table(s) with INSERT .. ON DUPLICATE KEY UPDATE; then copy the batch from Staging into the PARTITIONed Fact table (or skip the Fact table and keep only the compressed raw data).
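To make that concrete, here is a minimal sketch of one batch cycle, pulling together the statements from the earlier sections; the Fact columns (dt, foo_id, blah) are assumptions carried over from the illustrative examples above:

-- 1. The batching layer has just filled Staging (multi-row INSERT or LOAD DATA).

-- 2. Normalize: add any brand-new dimension values without burning ids.
INSERT IGNORE INTO Foos (foo)
    SELECT DISTINCT Staging.foo
        FROM Staging
        LEFT JOIN Foos ON Foos.foo = Staging.foo
        WHERE Foos.foo_id IS NULL;

-- 3. Summarize directly from Staging.
INSERT INTO Summary (dy, foo, ct, blah_total)
    SELECT DATE(dt), foo, COUNT(*), SUM(blah)
        FROM Staging
        GROUP BY 1, 2
    ON DUPLICATE KEY UPDATE
        ct = ct + VALUES(ct),
        blah_total = blah_total + VALUES(blah_total);

-- 4. Move the batch into the Fact table (if you keep one).
INSERT INTO Fact (dt, foo_id, blah)
    SELECT Staging.dt, Foos.foo_id, Staging.blah
        FROM Staging
        JOIN Foos ON Foos.foo = Staging.foo;

-- 5. Empty the Staging table, ready for the next batch.
TRUNCATE TABLE Staging;

If rows keep arriving while these steps run, either pause the batching layer briefly or alternate between two Staging tables, so that one is being filled while the other is being processed.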
Those techniques should perform well and scale well in most cases. As your situation evolves, you may discover why I described the alternative approaches.
Typically the Fact table is PARTITIONed BY RANGE (10-60 ranges of days/weeks/etc.) and needs periodic purging (DROP PARTITION). A safe, clean way to design the partitioning and perform the DROPs is described in Purging PARTITIONs.
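For illustration only (the column names, partition names, and dates are assumptions; follow Purging PARTITIONs for the real maintenance scheme), a day-partitioned Fact table and its periodic maintenance might look like this:

CREATE TABLE Fact (
    id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    dt DATETIME NOT NULL,
    foo_id MEDIUMINT UNSIGNED NOT NULL,
    blah FLOAT NOT NULL,
    PRIMARY KEY (id, dt)           -- the partition column must be in every unique key
) ENGINE = InnoDB
PARTITION BY RANGE (TO_DAYS(dt)) (
    PARTITION p20240305 VALUES LESS THAN (TO_DAYS('2024-03-06')),
    PARTITION p20240306 VALUES LESS THAN (TO_DAYS('2024-03-07')),
    PARTITION pfuture   VALUES LESS THAN MAXVALUE
);

-- Periodically: carve the next day out of pfuture, then purge the oldest day.
-- Queries with a WHERE on dt also benefit from "partition pruning" (see Case 2 above).
ALTER TABLE Fact REORGANIZE PARTITION pfuture INTO (
    PARTITION p20240307 VALUES LESS THAN (TO_DAYS('2024-03-08')),
    PARTITION pfuture   VALUES LESS THAN MAXVALUE
);
ALTER TABLE Fact DROP PARTITION p20240305;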
For "read scaling", backup, and failover, use master-slave replication or something fancier. Do ingestion only on a single active master; it replicate to the slave(s). Generate reports on the slave(s).
Galera (including Percona XtraDB Cluster) provides a way to do "write scaling", but it may add more complexity than benefit. The "Extreme Design" is probably the outline to follow for ingesting on multiple Galera nodes. Special care needs to be taken over when to use (or not use) transactions, and over how to handle a 'deadlock' on COMMIT (which is key to Galera's performance).
"Sharding" is the splitting of data across multiple servers. (In contrast, replication and Galera have the same data on all servers, requiring all data to be written to all servers.)
With the non-sharding techniques described here, terabyte(s) of data can be handled by a single machine. Tens of terabytes probably requires sharding.
Sharding is beyond the scope of this document.
With the techniques described here, you may be able to achieve the performance numbers discussed below. I say "may" because every data warehouse situation is different, and you may require performance-hurting deviations from what I describe here. I give multiple options for some aspects; these may cover some of your deviations.
One big performance killer is UUID/GUID keys. Since they are very 'random', updates to them (at scale) are limited to 1 row = 1 disk hit. Plain disks can handle only about 100 hits/second. RAID and/or SSD can increase that to something like 1000 hits/sec. Huge amounts of RAM (to cache the random index) are a costly solution. It is possible to turn type-1 UUIDs into roughly chronological keys, thereby mitigating the performance problems, provided the UUIDs are written/read with some chronological clustering; see the UUID discussion.
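As a sketch of that idea on MySQL 8.0 (MariaDB has no UUID_TO_BIN; from 10.7 its native UUID type stores values in a similarly rearranged, index-friendly byte order), the swap flag moves the time fields of a type-1 UUID to the front so the stored keys are roughly chronological. The Events table here is purely hypothetical:

CREATE TABLE Events (
    id BINARY(16) NOT NULL PRIMARY KEY,   -- time-swapped type-1 UUID
    dt DATETIME NOT NULL
) ENGINE = InnoDB;

-- UUID() generates a type-1 UUID; the '1' flag swaps the time components
-- so consecutive inserts land near the same end of the index.
INSERT INTO Events (id, dt) VALUES (UUID_TO_BIN(UUID(), 1), NOW());

-- Convert back to the familiar text form for display.
SELECT BIN_TO_UUID(id, 1) AS uuid, dt
    FROM Events
    ORDER BY id DESC
    LIMIT 10;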
Hardware, etc: how many IOPs you can count on depends on whether you have ordinary disks, RAID striping, SSDs, and/or write caching -- see the notes on counting disk hits below.
"Count the disk hits" -- back-of-envelope performance analysis
More on Count the Disk Hits
Look at your data; compute raw rows per second (or hour or day or year). There are about 30M seconds in a year; 86,400 seconds per day. Inserting 30 rows per second becomes a billion rows per year.
10 rows per second is about all you can expect from an ordinary machine (after allowing for various overheads). If you have less than that, you don't have many worries, but still you should probably create Summary tables. If more than 10/sec, then batching, etc, becomes vital. Even on spiffy hardware, 100/sec is about all you can expect without utilizing the techniques here.
Let's say your insert rate is only one-tenth of your disk IOPs (eg, 10 rows/sec vs 100 IOPs). Also, let's say your data is not "bursty"; that is, it arrives fairly smoothly throughout the day.
Note that 10 rows/sec (300M/year) implies maybe 30GB for data + indexes + normalization tables + summary tables for 1 year. I would call this "not so big".
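(That figure assumes something on the order of 100 bytes per row once indexes, normalization savings, and summary tables are averaged in: 300M rows x ~100 bytes ≈ 30GB. Your rows may be fatter or thinner.)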
Still, the normalization and summarization are important. Normalization keeps the data from being, say, twice as big. Summarization speeds up the reports by orders of magnitude.
Let's design and analyse a "simple ingestion scheme" for 10 rows/second, without 'batching'.
# Normalize:
$foo_id = SELECT foo_id FROM Foos WHERE foo = $foo;
if no $foo_id, then
    INSERT IGNORE INTO Foos ...

# Inserts:
BEGIN;
INSERT INTO Fact ...;
INSERT INTO Summary ... ON DUPLICATE KEY UPDATE ...;
COMMIT;
# (plus code to deal with errors on INSERTs or COMMIT)
Depending on the number and randomness of your indexes, etc, 10 Fact rows may (or may not) take less than 100 IOPs.
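A hedged back-of-envelope count, assuming a single random secondary index on the Fact table and reasonably warm caches: each Fact row costs up to one disk hit for that index (per the earlier discussion), the Foos lookups and the current Summary rows are usually cached, and the append-style writes to the clustered PRIMARY KEY are absorbed by InnoDB's buffering. That puts 10 rows/second at perhaps 10-30 hits/second against the 100-IOPs budget -- tight but workable; a second random index, or a cache that has gone cold, would blow the budget, and that is when batching becomes necessary.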
Also, note that as the data grows over time, random indexes will become less and less likely to be cached. That is, even if the scheme runs fine with 1 year's worth of data, it may be in trouble with 2 years' worth.
For those reasons, I started this discussion with a wide margin (10 rows versus 100 IOPs).
Rick James graciously allowed us to use this article in the Knowledge Base.
Rick James' site has other useful tips, how-tos, optimizations, and debugging tips.
Original source: http://mysql.rjweb.org/doc.php/datawarehouse