Certain kinds of IN-subqueries cannot be flattened into semi-joins. These subqueries can be both correlated or non-correlated. In order to provide consistent performance in all cases, MariaDB provides several alternative strategies for these types of subqueries. Whenever several strategies are possible, the optimizer chooses the optimal one based on cost estimates.
The two primary non-semi-join strategies are materialization (also called outside-in materialization), and in-to-exists transformation. Materialization is applicable only for non-correlated subqueries, while in-to-exist can be used both for correlated and non-correlated subqueries.
An IN subquery cannot be flattened into a semi-join in the following cases. The examples below use the World database from the MariaDB regression test suite.
The subquery is located directly or indirectly under an OR operation in the WHERE clause of the outer query.
Query pattern:
SELECT ... FROM ... WHERE (expr1, ..., exprN) [NOT] IN (SELECT ... ) OR expr;
Example:
SELECT Name FROM Country WHERE (Code IN (select Country from City where City.Population > 100000) OR Name LIKE 'L%') AND surfacearea > 1000000;
The subquery predicate itself is negated.
Query pattern:
SELECT ... FROM ... WHERE ... (expr1, ..., exprN) NOT IN (SELECT ... ) ...;
Example:
SELECT Country.Name FROM Country, CountryLanguage WHERE Code NOT IN (SELECT Country FROM CountryLanguage WHERE Language = 'English') AND CountryLanguage.Language = 'French' AND Code = Country;
The subquery is located in the SELECT or HAVING clauses of the outer query.
Query pattern:
SELECT field1, ..., (SELECT ...) WHERE ...; SELECT ... WHERE ... HAVING (SELECT ...);
Example:
select Name, City.id in (select capital from Country where capital is not null) as is_capital from City where City.population > 10000000;
The subquery itself is a UNION, while the IN predicate may be anywhere in the query where IN is allowed.
Query pattern:
... [NOT] IN (SELECT ... UNION SELECT ...)
Example:
SELECT * from City where (Name, 91) IN (SELECT Name, round(Population/1000) FROM City WHERE Country = "IND" AND Population > 2500000 UNION SELECT Name, round(Population/1000) FROM City WHERE Country = "IND" AND Population < 100000);
The basic idea of subquery materialization is to execute the subquery and store its result in an internal temporary table indexed on all its columns. Naturally, this is possible only when the subquery is non-correlated. The IN predicate tests whether its left operand is present in the subquery result. Therefore it is not necessary to store duplicate subquery result rows in the temporary table. Storing only unique subquery rows provides two benefits - the size of the temporary table is smaller, and the index on all its columns can be unique.
If the size of the temporary table is less than the tmp_table_size system variable, the table is a hash-indexed in-memory HEAP table. In the rare cases when the subquery result exceeds this limit, the temporary table is stored on disk in an ARIA or MyISAM B-tree indexed table (ARIA is the default).
Subquery materialization happens on demand during the first execution of the IN predicate. Once the subquery is materialized, the IN predicate is evaluated very efficiently by index lookups of the outer expression into the unique index of the materialized temporary table. If there is a match, IN is TRUE, otherwise IN is FALSE.
An IN predicate may produce a NULL result if there is a NULL value in either of its arguments. Depending on its location in a query, a NULL predicate value is equivalent to FALSE. These are the cases when substituting NULL with FALSE would reject exactly the same result rows. A NULL result of IN is indistinguishable from a FALSE if the IN predicate is:
In all these cases the evaluation of IN is performed as described in the previous paragraph via index lookups into the materialized subquery. In all remaining cases when NULL cannot be substituted with FALSE, it is not possible to use index lookups. This is not a limitation in the server, but a consequence of the NULL semantics in the ANSI SQL standard.
Suppose an IN predicate is evaluated as
NULL IN (select not_null_col from t1)
, that is, the left operand of IN is a NULL value, and there are no NULLs in the subquery. In this case the value of IN is neither FALSE, nor TRUE. Instead it is NULL. If we were to perform an index lookup with the NULL as a key, such a value would not be found in not_null_col, and the IN predicate would incorrectly produce a FALSE.
In general, an NULL value on either side of an IN acts as a "wildcard" that matches any value, and if a match exists, the result of IN is NULL. Consider the following example:
If the left argument of IN is the row: (7, NULL, 9)
, and the result of the right subquery operand of IN is the table:
(7, 8, 10) (6, NULL, NULL) (7, 11, 9)
The the IN predicate matches the row (7, 11, 9)
, and the result of IN is NULL. Matches where the differing values on either side of the IN arguments are matched by a NULL in the other IN argument, are called partial matches.
In order to efficiently compute the result of an IN predicate in the presence of NULLs, MariaDB implements two special algorithms for partial matching, described here in detail.
In principle the subquery materialization strategy is universal, however, due to some technical limitations in the MariaDB server, there are few cases when the server cannot apply this optimization.
In the above cases, the server reverts to the IN-TO-EXISTS transformation.
This optimization is the only subquery execution strategy that existed in older versions of MariaDB and MySQL prior to MariaDB 5.3. We have made various changes and fixed a number of bugs in this code as well, but in essence it remains the same.
For the time being we refer the reader to the MySQL documentation of this optimization.
Depending on the query and data, either of the two strategies described here may result in orders of magnitude better/worse plan than the other strategy.
Older versions of MariaDB and any current MySQL version (including MySQL 5.5, and MySQL 5.6 DMR as of July 2011) implement only the IN-TO-EXISTS transformation. As illustrated below, this strategy is inferior in many common cases to subquery materialization.
Consider the following query over the data of the DBT3 benchmark scale 10. Find customers with top balance in their nations:
SELECT * FROM part WHERE p_partkey IN (SELECT l_partkey FROM lineitem WHERE l_shipdate between '1997-01-01' and '1997-02-01') ORDER BY p_retailprice DESC LIMIT 10;
The times to run this query is as follows:
+--+------------------+--------+--------------+-------------------+----+------+---------------------------+ |id|select_type |table |type |key |ref |rows |Extra | +--+------------------+--------+--------------+-------------------+----+------+---------------------------+ | 1|PRIMARY |part |ALL |NULL |NULL|199755|Using where; Using filesort| | 2|DEPENDENT SUBQUERY|lineitem|index_subquery|i_l_suppkey_partkey|func| 14|Using where | +--+------------------+--------+--------------+-------------------+----+------+---------------------------+
+--+------------+-----------+------+------------------+----+------+-------------------------------+ |id|select_type |table |type |key |ref |rows |Extra | +--+------------+-----------+------+------------------+----+------+-------------------------------+ | 1|PRIMARY |part |ALL |NULL |NULL|199755|Using temporary; Using filesort| | 1|PRIMARY |<subquery2>|eq_ref|distinct_key |func| 1| | | 2|MATERIALIZED|lineitem |range |l_shipdate_partkey|NULL|160060|Using where; Using index | +--+------------+-----------+------+------------------+----+------+-------------------------------+
The speedup here is practically infinite, because both MySQL and older MariaDB versions cannot complete the query in any reasonable time.
In order to show the benefits of partial matching we extended the customer table from the DBT3 benchmark with two extra columns:
Both columns are prefixed with the percent NULL values in the column, that is, c_pref_nationkey_05 contains 5% NULL values.
Consider the query "Find all customers that didn't buy from a preferred country, and from a preferred brand withing some date ranges":
SELECT count(*) FROM customer WHERE (c_custkey, c_pref_nationkey_05, c_pref_brand_05) NOT IN (SELECT o_custkey, s_nationkey, p_brand FROM orders, supplier, part, lineitem WHERE l_orderkey = o_orderkey and l_suppkey = s_suppkey and l_partkey = p_partkey and p_retailprice < 1200 and l_shipdate >= '1996-04-01' and l_shipdate < '1996-04-05' and o_orderdate >= '1996-04-01' and o_orderdate < '1996-04-05');
The speedup for this query is 20 times.
TODO
In certain cases it may be necessary to override the choice of the optimizer. Typically this is needed for benchmarking or testing purposes, or to mimic the behavior of an older version of the server, or if the optimizer made a poor choice.
All the above strategies can be controlled via the following switches in optimizer_switch system variable.
The two main optimizer switches - materialization and in_to_exists cannot be simultaneously off. If both are set to off, the server will issue an error.
© 2019 MariaDB
Licensed under the Creative Commons Attribution 3.0 Unported License and the GNU Free Documentation License.
https://mariadb.com/kb/en/non-semi-join-subquery-optimizations/