MySQL 8.0 Reference Manual (Reading Notes, Section 51: Optimizing Subqueries, Derived Tables, View References, and Common Table Expressions (1))
The MySQL query optimizer has different strategies available to evaluate subqueries:
• For a subquery used with an IN, = ANY, or EXISTS predicate, the optimizer has these choices:
    • Semijoin
    • Materialization
    • EXISTS strategy
• For a subquery used with a NOT IN, <> ALL or NOT EXISTS predicate, the optimizer has these choices:
    • Materialization
    • EXISTS strategy
For a derived table, the optimizer has these choices (which also apply to view references and common table expressions):
• Merge the derived table into the outer query block
• Materialize the derived table to an internal temporary table
【A limitation on UPDATE and DELETE statements that use a subquery to modify a single table is that the optimizer does not use semijoin or materialization subquery optimizations. As a workaround, try rewriting them as multiple-table UPDATE and DELETE statements that use a join rather than a subquery.】
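The workaround in the note above can be sketched as follows. This is only an illustration: the tables demo_orders and demo_blocked and their columns are hypothetical names, and the statements are meant for a scratch schema.

```sql
-- Hypothetical demo tables (names and columns are assumptions for illustration).
DROP TABLE IF EXISTS demo_orders, demo_blocked;
CREATE TABLE demo_orders (
  id          INT PRIMARY KEY,
  customer_id INT NOT NULL,
  flagged     TINYINT NOT NULL DEFAULT 0
);
CREATE TABLE demo_blocked (customer_id INT PRIMARY KEY);

-- Single-table forms that use a subquery (shown for comparison only):
--   UPDATE demo_orders SET flagged = 1
--    WHERE customer_id IN (SELECT customer_id FROM demo_blocked);
--   DELETE FROM demo_orders
--    WHERE customer_id IN (SELECT customer_id FROM demo_blocked);

-- Multiple-table rewrites that use a join instead of a subquery:
UPDATE demo_orders AS o
  JOIN demo_blocked AS b ON b.customer_id = o.customer_id
   SET o.flagged = 1;

DELETE o
  FROM demo_orders AS o
  JOIN demo_blocked AS b ON b.customer_id = o.customer_id;
```

The join form touches the same rows as the subquery form here because demo_blocked.customer_id is unique; with a non-unique join column the set of affected rows is still the same, but the join may visit each target row more than once.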
1 Optimizing IN and EXISTS Subquery Predicates with Semijoin Transformations
A semijoin is a preparation-time transformation that enables multiple execution strategies such as table pullout, duplicate weedout, first match, loose scan, and materialization. The optimizer uses semijoin strategies to improve subquery execution, as described in this section.
For an inner join between two tables, the join returns a row from one table as many times as there are matches in the other table. But for some questions, the only information that matters is whether there is a match, not the number of matches. Suppose that there are tables named class and roster that list classes in a course curriculum and class rosters (students enrolled in each class), respectively. To list the classes that actually have students enrolled, you could use this join:
SELECT class.class_num, class.class_name FROM class INNER JOIN roster WHERE class.class_num = roster.class_num;
However, the result lists each class once for each enrolled student. For the question being asked, this is unnecessary duplication of information.
Assuming that class_num is a primary key in the class table, duplicate suppression is possible by using SELECT DISTINCT, but it is inefficient to generate all matching rows first only to eliminate duplicates later.
The same duplicate-free result can be obtained by using a subquery:
SELECT class_num, class_name FROM class WHERE class_num IN (SELECT class_num FROM roster);
Here, the optimizer can recognize that the IN clause requires the subquery to return only one instance of each class number from the roster table. In this case, the query can use a semijoin; that is, an operation that returns only one instance of each row in class that is matched by rows in roster.
The following statement, which contains an EXISTS subquery predicate, is equivalent to the previous statement containing an IN subquery predicate:
SELECT class_num, class_name FROM class WHERE EXISTS (SELECT * FROM roster WHERE class.class_num = roster.class_num);
In MySQL 8.0.16 and later, any statement with an EXISTS subquery predicate is subject to the same semijoin transforms as a statement with an equivalent IN subquery predicate.
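To try the three statements above side by side, a small data set is enough. The following is a sketch for a scratch schema: the student column and the sample rows are assumptions added for illustration, and the join condition is written with ON for readability.

```sql
-- Demo schema; the student column and sample data are assumptions.
DROP TABLE IF EXISTS roster, class;
CREATE TABLE class (
  class_num  INT PRIMARY KEY,
  class_name VARCHAR(40) NOT NULL
);
CREATE TABLE roster (
  class_num INT NOT NULL,
  student   VARCHAR(40) NOT NULL,
  KEY (class_num)
);
INSERT INTO class  VALUES (1, 'Algebra'), (2, 'Biology'), (3, 'Chemistry');
INSERT INTO roster VALUES (1, 'Ann'), (1, 'Bob'), (2, 'Cy');

-- Inner join: one row per enrolled student (Algebra twice, Biology once).
SELECT class.class_num, class.class_name
  FROM class INNER JOIN roster ON class.class_num = roster.class_num;

-- IN subquery: each class with at least one student appears once.
SELECT class_num, class_name
  FROM class
 WHERE class_num IN (SELECT class_num FROM roster);

-- EXISTS subquery: same result as the IN form.
SELECT class_num, class_name
  FROM class
 WHERE EXISTS (SELECT * FROM roster WHERE class.class_num = roster.class_num);
```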
Beginning with MySQL 8.0.17, the following subqueries are transformed into antijoins:
• NOT IN (SELECT ... FROM ...)
• NOT EXISTS (SELECT ... FROM ...).
• IN (SELECT ... FROM ...) IS NOT TRUE
• EXISTS (SELECT ... FROM ...) IS NOT TRUE.
• IN (SELECT ... FROM ...) IS FALSE
• EXISTS (SELECT ... FROM ...) IS FALSE.
In short, any negation of a subquery of the form IN (SELECT ... FROM ...) or EXISTS (SELECT ... FROM ...) is transformed into an antijoin.
An antijoin is an operation that returns only rows for which there is no match. Consider the query shown here:
SELECT class_num, class_name FROM class WHERE class_num NOT IN (SELECT class_num FROM roster);
This query is rewritten internally as the antijoin SELECT class_num, class_name FROM class ANTIJOIN roster ON class_num, which returns one instance of each row in class that is not matched by any rows in roster. This means that, for each row in class, as soon as a match is found in roster, the row in class can be discarded.
Antijoin transformations cannot in most cases be applied if the expressions being compared are nullable. An exception to this rule is that (... NOT IN (SELECT ...)) IS NOT FALSE and its equivalent (... IN (SELECT ...)) IS NOT TRUE can be transformed into antijoins.
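The reason nullability matters can be seen with constant subqueries, which need no tables. This is a minimal sketch; the column aliases are arbitrary names chosen for illustration.

```sql
SELECT
  -- 3 is not in the list, but the NULL makes the result unknown (NULL),
  -- so a plain NOT IN cannot simply mean "no matching row".
  3 NOT IN (SELECT 1 UNION ALL SELECT NULL)                AS plain_not_in,
  -- IS NOT FALSE treats NULL like TRUE, so the result is 1.
  (3 NOT IN (SELECT 1 UNION ALL SELECT NULL)) IS NOT FALSE AS not_in_is_not_false,
  -- Equivalent form: IN yields NULL, and NULL IS NOT TRUE is 1.
  (3 IN (SELECT 1 UNION ALL SELECT NULL)) IS NOT TRUE      AS in_is_not_true;
```

In the last two forms a NULL result and a "no match" result lead to the same outcome, which is why they remain antijoin candidates even when the compared expressions are nullable.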
Outer join and inner join syntax is permitted in the outer query specification, and table references may be base tables, derived tables, view references, or common table expressions.
In MySQL, a subquery must satisfy these criteria to be handled as a semijoin (or, in MySQL 8.0.17 and later, an antijoin if NOT modifies the subquery):
• It must be part of an IN, = ANY, or EXISTS predicate that appears at the top level of the WHERE or ON clause, possibly as a term in an AND expression. For example:
SELECT ... FROM ot1, ... WHERE (oe1, ...) IN (SELECT ie1, ... FROM it1, ... WHERE ...);
Here, ot_i and it_i represent tables in the outer and inner parts of the query, and oe_i and ie_i represent expressions that refer to columns in the outer and inner tables.
In MySQL 8.0.17 and later, the subquery can also be the argument to an expression modified by NOT, IS [NOT] TRUE, or IS [NOT] FALSE.
• It must be a single SELECT without UNION constructs.
• It must not contain a HAVING clause.
• It must not contain any aggregate functions (whether it is explicitly or implicitly grouped).
• It must not have a LIMIT clause.
• The statement must not use the STRAIGHT_JOIN join type in the outer query.
• The STRAIGHT_JOIN modifier must not be present.
• The number of outer and inner tables together must be less than the maximum number of tables permitted in a join.
• The subquery may be correlated or uncorrelated. In MySQL 8.0.16 and later, decorrelation looks at trivially correlated predicates in the WHERE clause of a subquery used as the argument to EXISTS, and makes it possible to optimize it as if it were used within IN (SELECT b FROM ...). The term trivially correlated means that the predicate is an equality predicate, that it is the sole predicate in the WHERE clause (or is combined with AND), and that one operand is from a table referenced in the subquery and the other operand is from the outer query block.
• The DISTINCT keyword is permitted but ignored. Semijoin strategies automatically handle duplicate removal.
• A GROUP BY clause is permitted but ignored, unless the subquery also contains one or more aggregate functions.
• An ORDER BY clause is permitted but ignored, since ordering is irrelevant to the evaluation of semijoin strategies.
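The criteria above can be explored with EXPLAIN. The sketch below uses the placeholder table names ot1 and it1 from the example; the column definitions are assumptions, and which plan the optimizer actually picks depends on the server version and the data, so no output is shown.

```sql
-- Scratch tables named after the placeholders above (columns are assumptions).
DROP TABLE IF EXISTS ot1, it1;
CREATE TABLE ot1 (a INT NOT NULL);
CREATE TABLE it1 (b INT NOT NULL, c INT NOT NULL);

-- Meets the criteria: DISTINCT and ORDER BY are permitted but ignored.
EXPLAIN SELECT * FROM ot1
 WHERE a IN (SELECT DISTINCT b FROM it1 ORDER BY b);

-- Does not meet the criteria: the subquery contains an aggregate function.
EXPLAIN SELECT * FROM ot1
 WHERE a IN (SELECT MAX(b) FROM it1 GROUP BY c);

-- Does not meet the criteria: the subquery uses UNION.
EXPLAIN SELECT * FROM ot1
 WHERE a IN (SELECT b FROM it1 UNION SELECT c FROM it1);

-- Does not meet the criteria: the predicate is a term of an OR, not of an AND.
EXPLAIN SELECT * FROM ot1
 WHERE a = 0 OR a IN (SELECT b FROM it1);
```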
If a subquery meets the preceding criteria, MySQL converts it to a semijoin (or, in MySQL 8.0.17 or later, an antijoin if applicable) and makes a cost-based choice from these strategies:
• Convert the subquery to a join, or use table pullout and run the query as an inner join between subquery tables and outer tables. Table pullout pulls a table out from the subquery to the outer query.
• Duplicate Weedout: Run the semijoin as if it was a join and remove duplicate records using a temporary table.
• FirstMatch: When scanning the inner tables for row combinations and there are multiple instances of a given value group, choose one rather than returning them all. This "shortcuts" scanning and eliminates production of unnecessary rows.
• LooseScan: Scan a subquery table using an index that enables a single value to be chosen from each subquery's value group.
• Materialize the subquery into an indexed temporary table that is used to perform a join, where the index is used to remove duplicates. The index might also be used later for lookups when joining the temporary table with the outer tables; if not, the table is scanned.
Each of these strategies can be enabled or disabled using the following optimizer_switch system variable flags:
• The semijoin flag controls whether semijoins are used. Starting with MySQL 8.0.17, this also applies to antijoins.
• If semijoin is enabled, the firstmatch, loosescan, duplicateweedout, and materialization flags enable finer control over the permitted semijoin strategies.
• If the duplicateweedout semijoin strategy is disabled, it is not used unless all other applicable strategies are also disabled.
• If duplicateweedout is disabled, on occasion the optimizer may generate a query plan that is far from optimal. This occurs due to heuristic pruning during greedy search, which can be avoided by setting optimizer_prune_level=0.
These flags are enabled by default.
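As a sketch of how these flags are typically toggled for one session while experimenting (the particular combination of flags shown is arbitrary):

```sql
-- Inspect the current flag values.
SELECT @@optimizer_switch;

-- Disable two individual semijoin strategies for this session only.
SET SESSION optimizer_switch = 'firstmatch=off,loosescan=off';

-- Disable semijoin (and, from 8.0.17, antijoin) transformations entirely.
SET SESSION optimizer_switch = 'semijoin=off';

-- Turn off heuristic pruning during greedy search, as mentioned above.
SET SESSION optimizer_prune_level = 0;

-- Restore every optimizer_switch flag and the prune level to their defaults.
SET SESSION optimizer_switch = 'default';
SET SESSION optimizer_prune_level = DEFAULT;
```

Session scope keeps the experiment from affecting other connections.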
The optimizer minimizes differences in handling of views and derived tables. This affects queries that use the STRAIGHT_JOIN modifier and a view with an IN subquery that can be converted to a semijoin. The following query illustrates this because the change in processing causes a change in transformation, and thus a different execution strategy:
CREATE VIEW v AS SELECT * FROM t1 WHERE a IN (SELECT b FROM t2); SELECT STRAIGHT_JOIN * FROM t3 JOIN v ON t3.x = v.a;
The optimizer first looks at the view and converts the IN subquery to a semijoin, then checks whether it is possible to merge the view into the outer query. Because the STRAIGHT_JOIN modifier in the outer query prevents semijoin, the optimizer refuses the merge, causing derived table evaluation using a materialized table.
EXPLAIN output indicates the use of semijoin strategies as follows:
• For extended EXPLAIN output, the text displayed by a following SHOW WARNINGS shows the rewritten query, which displays the semijoin structure. From this you can get an idea about which tables were pulled out of the semijoin. If a subquery was converted to a semijoin, you should see that the subquery predicate is gone and its tables and WHERE clause were merged into the outer query join list and WHERE clause.
• Temporary table use for Duplicate Weedout is indicated by Start temporary and End temporary in the Extra column. Tables that were not pulled out and are in the range of EXPLAIN output rows covered by Start temporary and End temporary have their rowid in the temporary table.
• FirstMatch(tbl_name) in the Extra column indicates join shortcutting.
• LooseScan(m..n) in the Extra column indicates use of the LooseScan strategy. m and n are key part numbers.
• Temporary table use for materialization is indicated by rows with a select_type value of MATERIALIZED and rows with a table value of <subqueryN>.
In MySQL 8.0.21 and later, a semijoin transformation can also be applied to a single-table UPDATE or DELETE statement that uses a [NOT] IN or [NOT] EXISTS subquery predicate, provided that the statement does not use ORDER BY or LIMIT, and that semijoin transformations are allowed by an optimizer hint or by the optimizer_switch setting.
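The indicators listed above can be looked for with statements like the following. This is a sketch: the student column is an assumption, the query block name subq is arbitrary, and the strategy that appears depends on the data and server version.

```sql
-- Demo tables, created only if they do not already exist.
CREATE TABLE IF NOT EXISTS class (
  class_num  INT PRIMARY KEY,
  class_name VARCHAR(40) NOT NULL
);
CREATE TABLE IF NOT EXISTS roster (
  class_num INT NOT NULL,
  student   VARCHAR(40) NOT NULL,
  KEY (class_num)
);

-- Traditional EXPLAIN: check the Extra, select_type, and table columns.
EXPLAIN SELECT class_num, class_name
  FROM class
 WHERE class_num IN (SELECT class_num FROM roster);

-- The rewritten query, including any semijoin structure.
SHOW WARNINGS;

-- Tree format (MySQL 8.0.16 and later) names the join operations directly.
EXPLAIN FORMAT=TREE SELECT class_num, class_name
  FROM class
 WHERE class_num IN (SELECT class_num FROM roster);

-- Ask for one specific strategy with an optimizer hint, then compare plans.
EXPLAIN SELECT /*+ SEMIJOIN(@subq MATERIALIZATION) */ class_num, class_name
  FROM class
 WHERE class_num IN (SELECT /*+ QB_NAME(subq) */ class_num FROM roster);
```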
2 Optimizing Subqueries with Materialization
The optimizer uses materialization to enable more efficient subquery processing. Materialization speeds up query execution by generating a subquery result as a temporary table, normally in memory. The first time MySQL needs the subquery result, it materializes that result into a temporary table. Any subsequent time the result is needed, MySQL refers again to the temporary table. The optimizer may index the table with a hash index to make lookups fast and inexpensive. The index contains unique values to eliminate duplicates and make the table smaller.
Subquery materialization uses an in-memory temporary table when possible, falling back to on-disk storage if the table becomes too large.
If materialization is not used, the optimizer sometimes rewrites a noncorrelated subquery as a correlated subquery. For example, the following IN subquery is noncorrelated (where_condition involves only columns from t2 and not t1):
SELECT * FROM t1 WHERE t1.a IN (SELECT t2.b FROM t2 WHERE where_condition);
The optimizer might rewrite this as an EXISTS correlated subquery:
SELECT * FROM t1 WHERE EXISTS (SELECT t2.b FROM t2 WHERE where_condition AND t1.a=t2.b);
Subquery materialization using a temporary table avoids such rewrites and makes it possible to execute the subquery only once rather than once per row of the outer query.
For subquery materialization to be used in MySQL, the optimizer_switch system variable materialization flag must be enabled. With the materialization flag enabled, materialization applies to subquery predicates that appear anywhere (in the select list, WHERE, ON, GROUP BY, HAVING, or ORDER BY), for predicates that fall into any of these use cases:
• The predicate has this form, when no outer expression oe_i or inner expression ie_i is nullable. N is 1 or larger.
(oe_1, oe_2, ..., oe_N) [NOT] IN (SELECT ie_1, ie_2, ..., ie_N ...)
• The predicate has this form, when there is a single outer expression oe and inner expression ie. The expressions can be nullable.
oe [NOT] IN (SELECT ie ...)
• The predicate is IN or NOT IN and a result of UNKNOWN (NULL) has the same meaning as a result of FALSE.
The following examples illustrate how the requirement for equivalence of UNKNOWN and FALSE predicate evaluation affects whether subquery materialization can be used. Assume that where_condition involves columns only from t2 and not t1 so that the subquery is noncorrelated.
This query is subject to materialization:
SELECT * FROM t1 WHERE t1.a IN (SELECT t2.b FROM t2 WHERE where_condition);
Here, it does not matter whether the IN predicate returns UNKNOWN or FALSE. Either way, the row from t1 is not included in the query result.
An example where subquery materialization is not used is the following query, where t2.b is a nullable column:
SELECT * FROM t1 WHERE (t1.a,t1.b) NOT IN (SELECT t2.a,t2.b FROM t2 WHERE where_condition);
The following restrictions apply to the use of subquery materialization:
• The types of the inner and outer expressions must match. For example, the optimizer might be able to use materialization if both expressions are integer or both are decimal, but cannot if one expression is integer and the other is decimal.
• The inner expression cannot be a BLOB.
Use of EXPLAIN with a query provides some indication of whether the optimizer uses subquery materialization:
• Compared to query execution that does not use materialization, select_type may change from DEPENDENT SUBQUERY to SUBQUERY. This indicates that, for a subquery that would be executed once per outer row, materialization enables the subquery to be executed just once.
• For extended EXPLAIN output, the text displayed by a following SHOW WARNINGS includes materialize and materialized-subquery.
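One way to observe the difference is to request each strategy with the SUBQUERY optimizer hint and compare the EXPLAIN output. The sketch below assumes scratch tables t1 and t2 with hypothetical column definitions; the subquery is placed in the select list because SUBQUERY hints are ignored for subqueries that are handled as semijoins.

```sql
-- Scratch tables (column definitions are assumptions for illustration).
DROP TABLE IF EXISTS t1, t2;
CREATE TABLE t1 (a INT, b INT);
CREATE TABLE t2 (a INT, b INT, KEY (b));

-- Request subquery materialization.
EXPLAIN SELECT t1.a,
       t1.a IN (SELECT /*+ SUBQUERY(MATERIALIZATION) */ t2.b FROM t2) AS is_match
  FROM t1;
SHOW WARNINGS;

-- Request the IN-to-EXISTS transformation instead.
EXPLAIN SELECT t1.a,
       t1.a IN (SELECT /*+ SUBQUERY(INTOEXISTS) */ t2.b FROM t2) AS is_match
  FROM t1;
SHOW WARNINGS;
```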
In MySQL 8.0.21 and later, MySQL can also apply subquery materialization to a single-table UPDATE or DELETE statement that uses a [NOT] IN or [NOT] EXISTS subquery predicate, provided that the statement does not use ORDER BY or LIMIT, and that subquery materialization is allowed by an optimizer hint or by the optimizer_switch setting.
3 Optimizing Subqueries with the EXISTS Strategy
Certain optimizations are applicable to comparisons that use the IN (or =ANY) operator to test subquery results. This section discusses these optimizations, particularly with regard to the challenges that NULL values present. The last part of the discussion suggests how you can help the optimizer.
Consider the following subquery comparison:
outer_expr IN (SELECT inner_expr FROM ... WHERE subquery_where)
MySQL evaluates queries “from outside to inside.” That is, it first obtains the value of the outer expression outer_expr, and then runs the subquery and captures the rows that it produces.
A very useful optimization is to “inform” the subquery that the only rows of interest are those where the inner expression inner_expr is equal to outer_expr. This is done by pushing down an appropriate equality into the subquery's WHERE clause to make it more restrictive. The converted comparison looks like this:
EXISTS (SELECT 1 FROM ... WHERE subquery_where AND outer_expr=inner_expr)
After the conversion, MySQL can use the pushed-down equality to limit the number of rows it must examine to evaluate the subquery.
More generally, a comparison of N values to a subquery that returns N-value rows is subject to the same conversion. If oe_i and ie_i represent corresponding outer and inner expression values, this subquery comparison:
(oe_1, ..., oe_N) IN (SELECT ie_1, ..., ie_N FROM ... WHERE subquery_where)
Becomes:
EXISTS (SELECT 1 FROM ... WHERE subquery_where AND oe_1 = ie_1 AND ... AND oe_N = ie_N)
For simplicity, the following discussion assumes a single pair of outer and inner expression values.
The “pushdown” strategy just described works if either of these conditions is true:
• outer_expr and inner_expr cannot be NULL.
• You need not distinguish NULL from FALSE subquery results. If the subquery is a part of an OR or AND expression in the WHERE clause, MySQL assumes that you do not care. Another instance where the optimizer notices that NULL and FALSE subquery results need not be distinguished is this construct:
... WHERE outer_expr IN (subquery)
In this case, the WHERE clause rejects the row whether IN (subquery) returns NULL or FALSE.
Suppose that outer_expr is known to be a non-NULL value but the subquery does not produce a row such that outer_expr = inner_expr. Then outer_expr IN (SELECT ...) evaluates as follows:
• NULL, if the SELECT produces any row where inner_expr is NULL
• FALSE, if the SELECT produces only non-NULL values or produces nothing
In this situation, the approach of looking for rows with outer_expr = inner_expr is no longer valid. It is necessary to look for such rows, but if none are found, also look for rows where inner_expr is NULL. Roughly speaking, the subquery can be converted to something like this:
EXISTS (SELECT 1 FROM ... WHERE subquery_where AND (outer_expr=inner_expr OR inner_expr IS NULL))
The need to evaluate the extra IS NULL condition is why MySQL has the ref_or_null access method:
mysql> EXPLAIN SELECT outer_expr IN (SELECT t2.maybe_null_key FROM t2, t3 WHERE ...) FROM t1;
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: t1
...
*************************** 2. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: t2
         type: ref_or_null
possible_keys: maybe_null_key
          key: maybe_null_key
      key_len: 5
          ref: func
         rows: 2
        Extra: Using where; Using index
...
The unique_subquery and index_subquery subquery-specific access methods also have “or NULL” variants.
The additional OR ... IS NULL condition makes query execution slightly more complicated (and some optimizations within the subquery become inapplicable), but generally this is tolerable.
The situation is much worse when outer_expr can be NULL. According to the SQL interpretation of NULL as “unknown value,” NULL IN (SELECT inner_expr ...) should evaluate to:
• NULL, if the SELECT produces any rows
• FALSE, if the SELECT produces no rows
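The NULL rules stated so far, for a non-NULL and for a NULL outer expression, can be checked with constant subqueries. This is a minimal sketch; the aliases are arbitrary, and the comments give the values the rules above lead one to expect.

```sql
SELECT
  -- Non-NULL outer value, no match, subquery produces a NULL: expected NULL.
  3 IN (SELECT 1 UNION ALL SELECT NULL)    AS no_match_with_null,
  -- Non-NULL outer value, no match, only non-NULL values: expected 0 (FALSE).
  3 IN (SELECT 1 UNION ALL SELECT 2)       AS no_match_without_null,
  -- Non-NULL outer value, subquery produces nothing: expected 0 (FALSE).
  3 IN (SELECT 1 FROM DUAL WHERE FALSE)    AS no_match_empty,
  -- NULL outer value, subquery produces rows: expected NULL.
  NULL IN (SELECT 1)                       AS null_outer_some_rows,
  -- NULL outer value, subquery produces no rows: expected 0 (FALSE).
  NULL IN (SELECT 1 FROM DUAL WHERE FALSE) AS null_outer_no_rows;
```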
For proper evaluation, it is necessary to be able to check whether the SELECT has produced any rows at all, so outer_expr = inner_expr cannot be pushed down into the subquery. This is a problem because many real world subqueries become very slow unless the equality can be pushed down.
Essentially, there must be different ways to execute the subquery depending on the value of outer_expr.
The optimizer chooses SQL compliance over speed, so it accounts for the possibility that outer_expr might be NULL:
• If outer_expr is NULL, to evaluate the following expression, it is necessary to execute the SELECT to determine whether it produces any rows:
NULL IN (SELECT inner_expr FROM ... WHERE subquery_where)
It is necessary to execute the original SELECT here, without any pushed-down equalities of the kind mentioned previously.
• On the other hand, when outer_expr is not NULL, it is absolutely essential that this comparison:
outer_expr IN (SELECT inner_expr FROM ... WHERE subquery_where)
Be converted to this expression that uses a pushed-down condition:
EXISTS (SELECT 1 FROM ... WHERE subquery_where AND outer_expr=inner_expr)
Without this conversion, subqueries are slow.
To solve the dilemma of whether or not to push down conditions into the subquery, the conditions are wrapped within “trigger” functions. Thus, an expression of the following form:
outer_expr IN (SELECT inner_expr FROM ... WHERE subquery_where)
Is converted into:
EXISTS (SELECT 1 FROM ... WHERE subquery_where AND trigcond(outer_expr=inner_expr))
More generally, if the subquery comparison is based on several pairs of outer and inner expressions, the conversion takes this comparison:
(oe_1, ..., oe_N) IN (SELECT ie_1, ..., ie_N FROM ... WHERE subquery_where)
And converts it to this expression:
EXISTS (SELECT 1 FROM ... WHERE subquery_where AND trigcond(oe_1=ie_1) AND ... AND trigcond(oe_N=ie_N) )
Each trigcond(X) is a special function that evaluates to the following values:
• X when the “linked” outer expression oe_i is not NULL
• TRUE when the “linked” outer expression oe_i is NULL
【Trigger functions are not triggers of the kind that you create with CREATE TRIGGER.】
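trigcond() is internal to the optimizer and cannot be called from SQL, but its two rules can be emulated with IF() to see what each case yields. The sketch below is only an emulation under that assumption; the derived table pairs and its columns oe and ie are illustrative names.

```sql
-- Emulates trigcond(oe = ie): TRUE when oe is NULL, otherwise the value of oe = ie.
SELECT oe,
       ie,
       IF(oe IS NULL, TRUE, oe = ie) AS trigcond_emulated
  FROM (SELECT 1 AS oe, 1 AS ie       -- condition on, equality holds: 1
        UNION ALL SELECT 1, 2         -- condition on, equality fails: 0
        UNION ALL SELECT 1, NULL      -- condition on, equality unknown: NULL
        UNION ALL SELECT NULL, 2      -- condition off: 1, the row is not filtered
       ) AS pairs;
```

When the outer value is NULL the condition is switched off, so every subquery row passes and the server can tell whether the subquery produced any rows at all.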
Equalities that are wrapped within trigcond() functions are not first class predicates for the query optimizer. Most optimizations cannot deal with predicates that may be turned on and off at query execution time, so they assume any trigcond(X) to be an unknown function and ignore it. Triggered equalities can be used by these optimizations:
• Reference optimizations: trigcond(X=Y [OR Y IS NULL]) can be used to construct ref, eq_ref, or ref_or_null table accesses.
• Index lookup-based subquery execution engines: trigcond(X=Y) can be used to construct unique_subquery or index_subquery accesses.
• Table-condition generator: If the subquery is a join of several tables, the triggered condition is checked as soon as possible.
When the optimizer uses a triggered condition to create some kind of index lookup-based access (as for the first two items of the preceding list), it must have a fallback strategy for the case when the condition is turned off. This fallback strategy is always the same: Do a full table scan. In EXPLAIN output, the fallback shows up as Full scan on NULL key in the Extra column:
mysql> EXPLAIN SELECT t1.col1, t1.col1 IN (SELECT t2.key1 FROM t2 WHERE t2.col2=t1.col2) FROM t1\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: t1
...
*************************** 2. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: t2
         type: index_subquery
possible_keys: key1
          key: key1
      key_len: 5
          ref: func
         rows: 2
        Extra: Using where; Full scan on NULL key
If you run EXPLAIN followed by SHOW WARNINGS, you can see the triggered condition:
*************************** 1. row ***************************
  Level: Note
   Code: 1003
Message: select `test`.`t1`.`col1` AS `col1`,
         <in_optimizer>(`test`.`t1`.`col1`,
         <exists>(<index_lookup>(<cache>(`test`.`t1`.`col1`) in t2
         on key1 checking NULL
         where (`test`.`t2`.`col2` = `test`.`t1`.`col2`) having
         trigcond(<is_not_null_test>(`test`.`t2`.`key1`))))) AS
         `t1.col1 IN (select t2.key1 from t2 where t2.col2=t1.col2)`
         from `test`.`t1`
The use of triggered conditions has some performance implications. A NULL IN (SELECT ...) expression now may cause a full table scan (which is slow) when it previously did not. This is the price paid for correct results (the goal of the trigger-condition strategy is to improve compliance, not speed).
For multiple-table subqueries, execution of NULL IN (SELECT ...) is particularly slow because the join optimizer does not optimize for the case where the outer expression is NULL. It assumes that subquery evaluations with NULL on the left side are very rare, even if there are statistics that indicate otherwise. On the other hand, if the outer expression might be NULL but never actually is, there is no performance penalty.
To help the query optimizer better execute your queries, use these suggestions:
• Declare a column as NOT NULL if it really is. This also helps other aspects of the optimizer by simplifying condition testing for the column.
• If you need not distinguish a NULL from FALSE subquery result, you can easily avoid the slow execution path. Replace a comparison that looks like this:
outer_expr [NOT] IN (SELECT inner_expr FROM ...)
with this expression:
(outer_expr IS NOT NULL) AND (outer_expr [NOT] IN (SELECT inner_expr FROM ...))
Then NULL IN (SELECT ...) is never evaluated because MySQL stops evaluating AND parts as soon as the expression result is clear.
Another possible rewrite:
[NOT] EXISTS (SELECT inner_expr FROM ... WHERE inner_expr=outer_expr)
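Applied to concrete tables, the suggestions above might look like this. It is a sketch only: t1, t2, and their columns are hypothetical, and the ALTER TABLE is appropriate only when the column truly never holds NULL.

```sql
-- Scratch tables (definitions are assumptions for illustration).
DROP TABLE IF EXISTS t1, t2;
CREATE TABLE t1 (a INT NULL);
CREATE TABLE t2 (b INT NULL, KEY (b));

-- Suggestion 1: declare the column NOT NULL if it really is.
ALTER TABLE t2 MODIFY b INT NOT NULL;

-- Suggestion 2: guard the comparison so NULL IN (SELECT ...) is never evaluated.
SELECT *
  FROM t1
 WHERE t1.a IS NOT NULL
   AND t1.a IN (SELECT t2.b FROM t2);

-- Alternative rewrite as a correlated EXISTS.
SELECT *
  FROM t1
 WHERE EXISTS (SELECT 1 FROM t2 WHERE t2.b = t1.a);
```

The rewrites return the same rows as the original WHERE ... IN query because a WHERE clause rejects NULL and FALSE alike. For the NOT forms, NOT EXISTS and NOT IN can differ when NULL values are involved, so the rewrite fits only when that distinction does not matter.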
The subquery_materialization_cost_based flag of the optimizer_switch system variable enables control over the choice between subquery materialization and IN-to-EXISTS subquery transformation.
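A sketch of toggling this flag for the current session:

```sql
-- Check the current setting among the optimizer_switch flags.
SELECT @@optimizer_switch;

-- Off: the optimizer prefers subquery materialization over IN-to-EXISTS.
SET SESSION optimizer_switch = 'subquery_materialization_cost_based=off';

-- On (the default): the choice between the two is cost-based.
SET SESSION optimizer_switch = 'subquery_materialization_cost_based=on';
```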