One would like to do "SELECT ... ORDER BY RAND() LIMIT 10" to get 10 rows at random. But this is slow: the optimizer cannot do anything clever with RAND(), so it reads every row, sorts the entire table by the random values, and only then delivers the 10 rows.
All the algorithms given below are "fast", but most introduce flaws.
"Fast" means avoiding reading all the rows. There are many techniques that require a full table scan, or at least an index scan. They are not acceptable for this list. There is even a technique that averages half a scan; it is relegated to a footnote.
Here's a way to measure performance without having a big table.
FLUSH STATUS;
SELECT ...;
SHOW SESSION STATUS LIKE 'Handler%';
If some of the "Handler" numbers look like the number of rows in the table, then there was a table scan.
None of the queries presented here need a full table (or index) scan. Each has a time proportional to the number of rows returned.
Virtually all published algorithms involve a table scan. The previously published version of this blog had, embarrassingly, several algorithms that had table scans.
Sometimes the scan can be avoided via a subquery. For example, the first of these will do a table scan; the second will not.
SELECT * FROM RandTest AS a
    WHERE id = FLOOR(@min + (@max - @min + 1) * RAND());    -- BAD: table scan

SELECT * FROM RandTest AS a
    JOIN ( SELECT FLOOR(@min + (@max - @min + 1) * RAND()) AS id    -- Good; single eval.
         ) b  USING (id);
SELECT r.*
    FROM ( SELECT FLOOR(mm.min_id + (mm.max_id - mm.min_id + 1) * RAND()) AS id
             FROM ( SELECT MIN(id) AS min_id,
                           MAX(id) AS max_id
                      FROM RandTest
                  ) AS mm
         ) AS init
    JOIN RandTest AS r  ON r.id = init.id;
(Of course, you might be able to simplify this. For example, min_id is likely to be 1. Or precalculate limits into @min and @max.)
-- First select is one-time:
SELECT @min := MIN(id),
       @max := MAX(id)
    FROM RandTest;

SELECT DISTINCT *
    FROM RandTest AS a
    JOIN ( SELECT FLOOR(@min + (@max - @min + 1) * RAND()) AS id
             FROM RandTest
             LIMIT 11            -- more than 10 (to compensate for dups)
         ) b  USING (id)
    LIMIT 10;                    -- the desired number of rows
The FLOOR expression could lead to duplicates, hence the inflated inner LIMIT. There could (rarely) be so many duplicates that the inflated LIMIT still yields fewer than the desired 10 distinct rows. One approach to that flaw is to rerun the query if it delivers too few rows.
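The rerun can be automated server-side with a stored procedure. This is only a sketch: the procedure name pick_random_rows and the temporary table picked are invented for illustration, and it assumes @min and @max have already been populated as above (if they are NULL, the loop would never terminate).

```sql
DELIMITER //
CREATE PROCEDURE pick_random_rows()    -- hypothetical name
BEGIN
    DECLARE n INT DEFAULT 0;
    REPEAT
        DROP TEMPORARY TABLE IF EXISTS picked;
        CREATE TEMPORARY TABLE picked
            SELECT DISTINCT a.*
                FROM RandTest AS a
                JOIN ( SELECT FLOOR(@min + (@max - @min + 1) * RAND()) AS id
                         FROM RandTest
                         LIMIT 11
                     ) b  USING (id)
                LIMIT 10;
        SELECT COUNT(*) INTO n FROM picked;    -- how many distinct rows did we get?
    UNTIL n >= 10 END REPEAT;
    SELECT * FROM picked;
END //
DELIMITER ;
```

Since duplicates are rare, the loop almost always finishes on the first pass.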
A variant:
SELECT r.*
    FROM ( SELECT FLOOR(mm.min_id + (mm.max_id - mm.min_id + 1) * RAND()) AS id
             FROM ( SELECT MIN(id) AS min_id,
                           MAX(id) AS max_id
                      FROM RandTest
                  ) AS mm
             JOIN ( SELECT id AS dummy FROM RandTest LIMIT 11 ) z
         ) AS init
    JOIN RandTest AS r  ON r.id = init.id
    LIMIT 10;
Again, ugly but fast, regardless of table size.
This gets 50 "consecutive" ids (possibly with gaps), then delivers a random 10 of them.
-- First select is one-time:
SELECT @min := MIN(id),
       @max := MAX(id)
    FROM RandTest;

SELECT a.*
    FROM RandTest a
    JOIN ( SELECT id
             FROM ( SELECT id
                      FROM ( SELECT @min + (@max - @min + 1 - 50) * RAND() AS start
                               FROM DUAL
                           ) AS init
                      JOIN RandTest y
                      WHERE y.id > init.start
                      ORDER BY y.id
                      LIMIT 50          -- Inflated to deal with gaps
                  ) z
             ORDER BY RAND()
             LIMIT 10                   -- number of rows desired (change to 1 if looking for a single row)
         ) r  ON a.id = r.id;
Yes, it is complex, but yes, it is fast, regardless of the table size.
(Unfinished: need to check these.)
Assuming `rnd` is a FLOAT (or DOUBLE) populated with RAND() and INDEXed:
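Such a column might be set up as follows. This is a sketch under the article's assumptions: the column name `rnd` matches the queries below, and the one-time UPDATE necessarily makes a full pass over the table, but only once.

```sql
ALTER TABLE RandTest
    ADD COLUMN rnd FLOAT NOT NULL,
    ADD INDEX (rnd);

-- One-time population (a full table scan, but done only once):
UPDATE RandTest SET rnd = RAND();
```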
SELECT r.*
    FROM ( SELECT RAND() AS start FROM DUAL ) init
    JOIN RandTest r
    WHERE r.rnd >= init.start
    ORDER BY r.rnd
    LIMIT 10;
-- Scale `start` down so that at least 10 rows remain above it:
SELECT r.*
    FROM ( SELECT RAND() * ( SELECT rnd
                               FROM RandTest
                               ORDER BY rnd DESC
                               LIMIT 10,1      -- the 11th-largest rnd
                           ) AS start
         ) AS init
    JOIN RandTest r
    WHERE r.rnd > init.start
    ORDER BY r.rnd
    LIMIT 10;

-- Another variant: a random start plus a cutoff wide enough, on average,
-- to contain somewhat more than 10 rows:
SELECT @start  := RAND(),
       @cutoff := CAST(1.1 * 10 + 5 AS DECIMAL(20,8)) / TABLE_ROWS
    FROM information_schema.TABLES
    WHERE TABLE_SCHEMA = 'dbname'
      AND TABLE_NAME = 'RandTest';    -- @cutoff will be small, e.g. 0.0030

SELECT d.*
    FROM ( SELECT a.id
             FROM RandTest a
             WHERE rnd BETWEEN @start AND @start + @cutoff
         ) sample
    JOIN RandTest d  USING (id)
    ORDER BY RAND()
    LIMIT 10;
RIGHT( HEX( (1<<24) * (1+RAND()) ), 6 )
can be used as a `start` when adapting the gapped-AUTO_INCREMENT technique to a 6-character hex key. If the field is BINARY instead of hex, then
UNHEX( RIGHT( HEX( (1<<24) * (1+RAND()) ), 6 ) )
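Put together, a query using such a `start` might look like the following sketch. The table HexTest and its indexed 6-character hex key hexid are invented for illustration; the pattern mirrors the indexed-`rnd` query above.

```sql
SELECT t.*
    FROM ( SELECT RIGHT( HEX( (1<<24) * (1+RAND()) ), 6 ) AS start ) init
    JOIN HexTest t
    WHERE t.hexid >= init.start
    ORDER BY t.hexid
    LIMIT 10;
```

The `(1+RAND())` trick guarantees a value in [1,2), so the HEX() result always has seven digits and RIGHT(..., 6) always yields exactly six, uniformly distributed, hex characters.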
Rick James graciously allowed us to use this article in the Knowledge Base.
Rick James' site has other useful tips, how-tos, optimizations, and debugging tips.
Original source: http://mysql.rjweb.org/doc.php/random
© 2019 MariaDB
Licensed under the Creative Commons Attribution 3.0 Unported License and the GNU Free Documentation License.