This document discusses the creation and maintenance of "Summary Tables". It is a companion to the document on Data Warehousing Techniques.
The basic terminology ("Fact Table", "Normalization", etc) is covered in that document.
Summary tables are a performance necessity for large tables. MariaDB and MySQL do not provide any automated way to create such, so I am providing techniques here.
(Other vendors provide something similar with "materialized views".)
When you have millions or billions of rows, it takes a long time to summarize the data to present counts, totals, averages, etc, in a size that is readily digestible by humans. By computing and saving subtotals as the data comes in, one can make "reports" run much faster. (I have seen 10x to 1000x speedups.) The subtotals go into a "summary table". This document guides you on efficiency in both creating and using such tables.
A summary table includes two sets of columns:
The "date" might be a DATE (a 3-byte native datatype), or an hour, or some other time interval. A 3-byte MEDIUMINT UNSIGNED 'hour' can be derived from a DATETIME or TIMESTAMP via
FLOOR(UNIX_TIMESTAMP(dt) / 3600) FROM_UNIXTIME(hour * 3600)
The "dimensions" (a DW term) are some of the columns of the "Fact" table. Examples: Country, Make, Product, Category, Host Non-dimension examples: Sales, Quantity, TimeSpent
There would be one or more indexes, usually starting with some dimensions and ending with the date field. By ending with the date, one can efficiently get a range of days/weeks/etc. even when each row summarizes only one day.
There will typically be a "few" summary tables. Often one summary table can serve multiple purposes sufficiently efficiently.
As a rule of thumb, a summary table will have one-tenth the number of rows as the Fact table. (This number is very loose.)
Let's talk about a large chain of car dealerships. The Fact table has all the sales with columns such as datetime, salesman_id, city, price, customer_id, make, model, model_year. One Summary table might focus on sales:
PRIMARY KEY(city, datetime), Aggregations: ct, sum_price # Core of INSERT..SELECT: DATE(datetime) AS date, city, COUNT(*) AS ct, SUM(price) AS sum_price # Reporting average price for last month, broken down by city: SELECT city, SUM(sum_price) / SUM(ct) AS 'AveragePrice' FROM SalesSummary WHERE datetime BETWEEN ... GROUP BY city; # Monthly sales, nationwide, from same summary table: SELECT MONTH(datetime) AS 'Month', SUM(ct) AS 'TotalSalesCount' SUM(sum_price) AS 'TotalDollars' FROM SalesSummary WHERE datetime BETWEEN ... GROUP BY MONTH(datetime); # This might benefit from a secondary INDEX(datetime)
"Augment" in this section means to add new rows into the summary table or increment the counts in existing rows.
Plan A: "While inserting" rows into the Fact table, augment the summary table(s). This is simple, and workable for a smaller DW database (under 10 Fact table rows per second). For larger DW databases, Plan A likely to be too costly to be practical.
Plan B: "Periodically", via cron or an EVENT.
Plan C: "As needed". That is, when someone asks for a report, the code first updates the summary tables that will be needed.
Plan D: "Hybrid" of B and C. C, by itself, can led to long delays for the report. By also doing B, those delays can be kept low.
Plan E: (This is not advised.) "Rebuild" the entire summary table from the entire Fact table. The cost of this is prohibitive for large tables. However, Plan E may be needed when you decide to change the columns of a Summary Table, or discover a flaw in the computations. To lessen the impact of an entire build, adapt the chunking techniques in Deleting in chunks .
Plan F: "Staging table". This is primarily for very high speed ingestion. It is mentioned briefly in this blog, and discussed more thoroughly in the companion blog: High Speed Ingestion
INSERT INTO Fact ...; INSERT INTO Summary (..., ct, foo, ...) VALUES (..., 1, foo, ...) ON DUPLICATE KEY UPDATE ct = ct+1, sum_foo = sum_foo + VALUES(foo), ...;
IODKU (Insert On Duplicate Key Update) will update an existing row or create a new row. It knows which to do based on the Summary table's PRIMARY KEY.
Caution: This approach is costly, and will not scale to an ingestion rate of over, say, 10 rows per second (Or maybe 50/second on SSDs). More discussion later.
If your reports need to be up-to-the-second, you need "as needed" or "hybrid". If your reports have less urgency (eg, weekly reports that don't include 'today'), then "periodically" might be best.
For a daily summaries, augmenting the summary tables could be done right after midnight. But, beware of data coming "late".
For both "periodic" and "as needed", you need a definitive way of keeping track of where you "left off".
Case 1: You insert into the Fact table first and it has an AUTO_INCREMENT id: Grab MAX(id) as the upper bound for summarizing and put it either into some other secure place (an extra table), or put it into the row(s) in the Summary table as you insert them. (Caveat: AUTO_INCREMENT ids do not work well in multi-master, including Galera, setups.)
Case 2: If you are using a 'staging' table, there is no issue. (More on staging tables later.)
This applies to multi-row (batch) INSERT and LOAD DATA.
The Fact table needs an AUTO_INCREMENT id, and you need to be able to find the exact range of ids inserted. (This may be impractical in any multi-master setup.)
Then perform bulk summarization using
FROM Fact WHERE id BETWEEN min_id and max_id
Load the data (via INSERTs or LOAD DATA) en masse into a "staging table". Then perform batch summarization from the Staging table. And batch copy from the Staging table to the Fact table. Note that the Staging table is handy for batching "normalization" during ingestion. See also [[data-warehousing-high-speed-ingestion|High Speed Ingestion
Let's say your summary table has a DATE, `dy`, and a dimension, `foo`. The question is: Should (foo, dy) be the PRIMARY KEY? Or a non-UNIQUE index?
Case 1: PRIMARY KEY (foo, dy) and summarization is in lock step with, say, changes in `dy`.
This case is clean and simple -- until you get to endcases. How will you handle the case of data arriving 'late'? Maybe you will need to recalculate some chunks of data? If so, how?
Case 2: (foo, dy) is a non-UNIQUE INDEX.
This case is clean and simple, but it can clutter the summary table because multiple rows can occur for a given (foo, dy) pair. The report will always have to SUM() up values because it cannot assume there is only one row, even when it is reporting on a single `foo` for a single `dy`. This forced-SUM is not really bad -- you should do it anyway; that way all your reports are written with one pattern.
Case 3: PRIMARY KEY (foo, dy) and summarization can happen anytime.
Since you should be using InnoDB, there needs to be an explicit PRIMARY KEY. One approach when you do not have a 'natural' PK is this:
id INT UNSIGNED AUTO_INCREMENT NOT NULL, ... PRIMARY KEY(foo, dy, id), -- `id` added to make unique INDEX(id) -- sufficient to keep AUTO_INCREMENT happy
This case pushes the complexity onto the summarization by doing a IODKU.
Advice? Avoid Case 1; too messy. Case 2 is ok if the extra rows are not too common. Case 3 may be the closest to "once size fits all".
When summarizing, include
COUNT(*) AS ct and SUM(foo) AS sum_foo. When reporting, the "average" is computed as SUM(sum_foo) / SUM(ct). That is mathematically correct.
Exception... Let's say you are looking at weather temperatures. And you monitoring station gets the temp periodically, but unreliably. That is, the number of readings for a day varies. Further, you decide that the easiest way to compensate for the inconsistency is to do something like: Compute the avg temp for each day, then average those across the month (or other timeframe).
Formula for Standard Deviation:
SQRT( SUM(sum_foo2)/SUM(ct) - POWER(SUM(sum_foo)/SUM(ct), 2) )
Where sum_foo2 is SUM(foo * foo) from the summary table. sum_foo and sum_foo2 should be FLOAT. FLOAT gives you about 7 significant digits, which is more than enough for things like average and standard deviation. FLOAT occupies 4 bytes. DOUBLE would give you more precision, but occupies 8 bytes. INT and BIGINT are not practical because they may lead to complaints about overflow.
The idea here is to first load a set of Fact records into a "staging table", with the following characteristics (at least):
If you have bulk inserts (Batch INSERT or LOAD DATA) then consider doing the normalization and summarization immediately after each bulk insert.
More details: High Speed Ingestion
Here is a more complex way to design the system, with the goal of even more scaling.
Explanation and comments:
Stats on the system that this 'extreme design' came from: Fact Table: 450GB, 100M rows/day (batch of 4M/hour), 60 day retention (60+24 partitions), 75B/row, 7 summary tables, under 10 minutes to ingest and summarize the hourly batch. The INSERT..SELECT handled over 20K rows/sec going into the Fact table. Spinning drives (not SSD) with RAID-10.
One technique involves summarizing some of the data, then recording where you "left off", so that next time, you can start there. There are some subtle issues with "left off" that you should be cautious of.
If you use a DATETIME or TIMESTAMP as "left off", beware of multiple rows with the same value.
If you use an AUTO_INCREMENT as "left off" beware of:
So, nothing works, at least not in a multi-threaded environment?
If you can live with an occasional hiccup (skipped record), then maybe this is 'not a problem' for you.
The "Flip-Flop Staging" is a safe alternative, optionally combined with the "Extreme Design".
If you have many threads simultaneously INSERTing into one staging table, then here is an efficient way to handle a large load: Have a process that flips that staging table with another, identical, staging table, and performs bulk normalization, Fact insertion, and bulk summarization.
The flipping step uses a fast, atomic, RENAME.
Here is a sketch of the code:
# Prep for flip: CREATE TABLE new LIKE Staging; # Swap (flip) Staging tables: RENAME TABLE Staging TO old, new TO Staging; # Normalize new `foo`s: # (autocommit = 1) INSERT IGNORE INTO Foos SELECT fpp FROM old LEFT JOIN Foos ... # Prep for possible deadlocks, etc while... START TRANSACTION; # Add to Fact: INSERT INTO Fact ... FROM old JOIN Foos ... # Summarize: INSERT INTO Summary ... FROM old ... GROUP BY ... COMMIT; end-while # Cleanup: DROP TABLE old;
Meanwhile, ingestion can continue writing to `Staging`. The ingestion INSERTs will conflict with the RENAME, but will be resolved gracefully and silently and quickly.
How fast should you flip-flop? Probably the best scheme is to
If Staging is 'big', an iteration will take longer, but run more efficiently. Hence, it is self-regulating.
In a Galera (or InnoDB Cluster?) environment, each node could be receiving input. If can afford to loose a few rows, have `Staging` be a non-replicated MEMORY table. Otherwise, have one `Staging` per node and be InnoDB; it will be more secure, but slower and not without problems. In particular, if a node dies completely, you somehow need to process its `Staging` table.
To look at what a report needs, look at the WHERE clause that would provide the data. Some examples, assuming data about service records for automobiles: The GROUP BY to gives a clue of what the report might be about.
1. WHERE make = ? AND model_year = ? GROUP BY service_date, service_type 2. WHERE make = ? AND model = ? GROUP BY service_date, service_type 3. WHERE service_type = ? GROUP BY make, model, service_date 4. WHERE service_date between ? and ? GROUP BY make, model, model_year
You need to allow for 'ad hoc' queries? Well, look at all the ad hoc queries -- they all have a date range, plus nail down one or two other things. (I rarely see something as ugly as '%CL%' for nailing down another dimension.) So, start by thinking of date plus one or two other dimensions as the 'key' into a new summary table. Then comes the question of what data might be desired -- counts, sums, etc. Eventually you have a small set of summary tables. Then build a front end to allow them to pick only from those possibilities. It should encourage use of the existing summary tables, not not be truly 'open ended'.
Later, another 'requirement' may surface. So, build another summary table. Of course, it may take a day to initially populate it.
Does one ever need to summarize a summary table? Yes, but only in extreme situations. Usually a 'weekly' report can be derived from a 'daily' summary table; building a separate weekly summary table not being worth the effort.
Would one ever PARTITION a Summary Table? Yes, in extreme situations, such as the table being large, and
Rick James graciously allowed us to use this article in the Knowledge Base.
Rick James' site has other useful tips, how-tos, optimizations, and debugging tips.
Original source: http://mysql.rjweb.org/doc.php/summarytables
© 2019 MariaDB
Licensed under the Creative Commons Attribution 3.0 Unported License and the GNU Free Documentation License.