Starting with MySQL 5.7 we introduced the Query Rewrite Plugin. That tool is really useful for changing queries on the server side. Of course the best place to modify a query is the source code of the application, but that is not always possible: either the application is not under your control, or the queries are generated by a framework like Hibernate and it is hard to change the query generation.
If you are interested in details about the Query Rewrite Plugin, I recommend this blog post from the MySQL Engineering team: http://mysqlserverteam.com/the-query-rewrite-plugins/
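For illustration, this is how a rewrite rule for the Rewriter plugin looks (the plugin is installed with the install_rewriter.sql script that ships with the server; the pattern syntax is described in the documentation linked above):
mysql> INSERT INTO query_rewrite.rewrite_rules (pattern, replacement)
VALUES ('SELECT ?', 'SELECT ? + 1');
mysql> CALL query_rewrite.flush_rewrite_rules();
mysql> SELECT 1; -- is rewritten to SELECT 1 + 1 and returns 2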
Recently I was asked how this works in replication environments: which query goes into the binlog?
If you are using the Rewriter plugin that comes with MySQL 5.7, the answer is easy: this plugin only supports rewriting SELECT statements, and SELECT statements do not get into the binlog at all. Simple.
But you might write your own preparse or postparse plugin. In that case you can define the behavior with the server option --log-raw. See the documentation here: https://dev.mysql.com/doc/refman/5.7/en/server-options.html#option_mysqld_log-raw
You can bring either the original query or the rewritten query to the binlog, so you have all the flexibility you need. However, be aware that --log-raw also affects the logging of passwords in the general log file: with --log-raw, passwords are written in plain text to the log files. Consider this side effect when switching --log-raw on or off.
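As a sketch, enabling the option in the server configuration file could look like this (remember the password side effect just mentioned):
[mysqld]
general-log = ON
log-raw = ON # passwords now appear in plain text in the general log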
Wednesday, April 13, 2016
Monday, April 4, 2016
MySQL 5.7: Optimizer finds best index by expression
The optimizer in MySQL 5.7 leverages generated columns. Generated columns physically store data in two cases: either the column is defined as STORED, or you create an index on a virtual column. The optimizer leverages such an index automatically if it encounters the matching expression in a statement. Let's see an example:
mysql> DESC squares;
+-------+------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+------------------+------+-----+---------+-------+
| dx | int(10) unsigned | YES | | NULL | |
| dy | int(10) unsigned | YES | | NULL | |
+-------+------------------+------+-----+---------+-------+
2 rows in set (0.00 sec)
mysql> SELECT COUNT(*) FROM squares;
+----------+
| COUNT(*) |
+----------+
| 2097152 |
+----------+
1 row in set (0.77 sec)
We have a large table with 2 million rows. Selecting rows by the surface area of squares can hardly leverage an index on dx or dy:
mysql> EXPLAIN SELECT * FROM squares WHERE dx*dy=221\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: squares
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 2092860
filtered: 100.00
Extra: Using where
1 row in set, 1 warning (0.00 sec)
Now let's add an index over a generated, virtual column that defines the area:
mysql> ALTER TABLE squares ADD COLUMN (area INT AS (dx*dy));
Query OK, 0 rows affected (0.02 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql> ALTER TABLE squares ADD INDEX (area);
Query OK, 0 rows affected (5.24 sec)
Records: 0 Duplicates: 0 Warnings: 0
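As a side note, the column and the index could also have been added in a single statement, for example:
mysql> ALTER TABLE squares ADD COLUMN area INT AS (dx*dy), ADD INDEX (area);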
Now we can run the query again:
mysql> EXPLAIN SELECT * FROM squares WHERE dx*dy=221\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: squares
partitions: NULL
type: ref
possible_keys: area
key: area
key_len: 5
ref: const
rows: 18682
filtered: 100.00
Extra: NULL
1 row in set, 1 warning (0.00 sec)
I did not change the query! The WHERE condition is still dx*dy. Nevertheless the optimizer finds the generated column, sees the index, and decides to use it.
So you can add indexes on complex expressions and benefit from them without changing the application code. That makes life much easier.
One limitation though: it seems the optimizer recognizes expressions only in the WHERE clause. It does not use the generated column and its index for an expression in the SELECT list:
mysql> EXPLAIN SELECT SUM(dx*dy) FROM squares\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: squares
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 2092860
filtered: 100.00
Extra: NULL
1 row in set, 1 warning (0.00 sec)
mysql> EXPLAIN SELECT SUM(area) FROM squares\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: squares
partitions: NULL
type: index
possible_keys: NULL
key: area
key_len: 5
ref: NULL
rows: 2092860
filtered: 100.00
Extra: Using index
1 row in set, 1 warning (0.00 sec)
CHECK constraint for MySQL - NOT NULL on generated columns
Update: Starting with MySQL 8.0.16 we do have CHECK constraints implemented in SQL! See here.
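For example, in 8.0.16 and later the checker table from below can be written with a real CHECK constraint (a sketch, reusing the same column and constraint names):
mysql> CREATE TABLE checker (
i int,
CONSTRAINT i_must_be_between_7_and_12 CHECK (i BETWEEN 7 AND 12));
mysql> INSERT INTO checker (i) VALUES (13); -- rejected with a check constraint violation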
During our recent TechTour event the idea came up to implement JSON document validation not via foreign keys (as I have shown here) but by defining a generated column as NOT NULL. The generation expression must be written so that it returns NULL for invalid data.
DISCLAIMER: This has already been explored by yoku0825 in his blog post. He deserves all credit!
Let's do a short test:
mysql> CREATE TABLE checker (
i int,
i_must_be_between_7_and_12 BOOLEAN
AS (IF(i BETWEEN 7 AND 12, true, NULL))
VIRTUAL NOT NULL);
Query OK, 0 rows affected (0.04 sec)
mysql> INSERT INTO checker (i) VALUES (11);
Query OK, 1 row affected (0.01 sec)
mysql> INSERT INTO checker (i) VALUES (12);
Query OK, 1 row affected (0.01 sec)
mysql> INSERT INTO checker (i) VALUES (13);
ERROR 1048 (23000): Column 'i_must_be_between_7_and_12' cannot be null
As you can see, I used the column name to create a meaningful error message when inserting invalid data. It is perfectly possible to add a generated validation column for each data column, so that you effectively run several check constraints.
Or you can even check a combination of columns:
mysql> CREATE TABLE squares (
dx DOUBLE,
dy DOUBLE,
area_must_be_larger_than_10 BOOLEAN
AS (IF(dx*dy>10.0,true,NULL)) NOT NULL);
Query OK, 0 rows affected (0.05 sec)
mysql> INSERT INTO squares (dx,dy) VALUES (7,4);
Query OK, 1 row affected (0.01 sec)
mysql> INSERT INTO squares (dx,dy) VALUES (2,4);
ERROR 1048 (23000): Column 'area_must_be_larger_than_10' cannot be null
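Applied to the JSON validation idea from the introduction, a sketch could look like this (table and path names are made up for illustration; JSON_EXTRACT returns NULL if the path does not exist):
mysql> CREATE TABLE people (
doc JSON,
doc_must_contain_name BOOLEAN
AS (IF(JSON_EXTRACT(doc, '$.name') IS NOT NULL, true, NULL))
VIRTUAL NOT NULL);
mysql> INSERT INTO people (doc) VALUES ('{"name": "Alice"}'); -- accepted
mysql> INSERT INTO people (doc) VALUES ('{"age": 42}'); -- ERROR 1048: Column 'doc_must_contain_name' cannot be null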
As generated columns are VIRTUAL by default, this costs no extra storage; the data volume stays the same. The expression is evaluated when inserting or updating data.
If you add a validation column to an already existing table and want to verify all existing rows, you can define the validation column as STORED (instead of the default VIRTUAL): the ALTER TABLE will then fail if there are any invalid rows in your existing data set. In normal operation, however, a virtual column seems more appropriate for performance reasons. So I recommend always using VIRTUAL validation columns and checking pre-existing data separately with a small procedure.
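Such a separate check can be as simple as selecting all rows that violate the condition, for the checker table above for example:
mysql> SELECT * FROM checker WHERE i NOT BETWEEN 7 AND 12 OR i IS NULL;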
Tuesday, February 2, 2016
Looking for the smallest possible MySQL Footprint
UPDATE: Starting with MySQL 8.0.16 we have introduced the new minimal tar ball distribution. Take a look here.
MySQL is known and famous for its simplicity and small size, especially compared to other RDBMSs. But what if you want to deploy on tiny hardware? I mean something even smaller than a Raspberry Pi?
I tested three steps to make the MySQL footprint as small as possible. All my tests were compiled for Oracle Linux 7 on the x64 platform; I did not test any ARM cross-compilation. These are the steps:
- Compile my own binary
- Remove all unnecessary tools/files
- Strip symbol information from binary
Let's take a closer look at the three steps.
Compile my own binary
MySQL is available as a source release. Using that you can configure the make process, which is documented pretty well in the Reference Manual. By switching off some options I was able to reduce the binary size from 240MB to 216MB. I switched off some performance_schema features, removed some storage engines that are irrelevant in most environments anyway (like ARCHIVE, NDB, EXAMPLE, ...) and removed all options for profiling. The final CMAKE statement is at the bottom of this post.
Remove unnecessary tools
I removed scripts and binaries from the distribution; Ted has written an interesting blog post about this. The remaining share directory contains some SQL scripts for installing additional tools. You need these at most once during setup and never again, so let's remove them. If you are happy to live without textual error messages, you can remove the errmsg-utf8.txt file as well, plus all translations in the language-specific subdirectories. And if you can live with reduced character set support, you can even remove the rest of the share directory. You then run essentially only the mysqld binary.
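A sketch of such a cleanup on an unpacked 5.7 distribution might look like this (the install path is illustrative; check your own layout and requirements before deleting anything):
cd /opt/mysql-5.7 # illustrative install path
ls bin | grep -v '^mysqld$' | xargs -I{} rm -f "bin/{}" # keep only the server binary
rm -f share/*.sql # setup-time SQL scripts, needed at most once
rm -f share/errmsg-utf8.txt # textual error messages
rm -rf share/german share/japanese # ... and the other language subdirectories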
Strip symbol information from binary
All compilations are done with extended diagnostic information in the binary. This symbol data helps if you want to analyze a core dump, for example. Symbols are included by default in the MySQL binaries and take a surprisingly large amount of space. You can remove these symbols from the binary with the tool strip(1). After stripping, the binary size came down to 24MB, which is only 10% of the initial size.
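For example (a sketch; keep an unstripped copy if you may still need the symbols for analyzing core dumps):
ls -lh bin/mysqld # roughly 240MB with symbols
strip bin/mysqld
ls -lh bin/mysqld # roughly 24MB afterwards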
More ideas
There are some more options to use either system libraries or the libraries that come with the source code. Using existing libraries from the system might help to save a few more bytes.
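For instance, the build can be pointed at system libraries with CMake options like the following (a sketch; the exact set of available options depends on the MySQL version and platform):
cmake . -DWITH_ZLIB=system \
-DWITH_SSL=system \
-DWITH_EDITLINE=system \
-DWITH_LIBEVENT=system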
Summary
It is possible to make MySQL very lean for your (embedded) system. Despite all the functionality that we added to MySQL in the releases since MySQL 5.1, you get a full featured RDBMS in only a handful of MB. Here are my final results:
- MySQL 5.6, minimal features: 79MB, stripped 13MB
- MySQL 5.7, default features: 240MB, stripped 24MB
- MySQL 5.7, minimal features: 216MB, stripped 24MB (removing features brings minimal savings only)
Addendum
This is the CMAKE statement I used to compile MySQL 5.7 on Oracle Linux 7:
cmake . -DCMAKE_INSTALL_PREFIX=/home/testy/TQ/dist-mysql-5.7.10/ \
-DDOWNLOAD_BOOST=1 \
-DWITH_BOOST=/home/testy/TQ/boost/ \
-DDISABLE_PSI_COND=1 \
-DDISABLE_PSI_FILE=1 \
-DDISABLE_PSI_IDLE=1 \
-DDISABLE_PSI_MEMORY=1 \
-DDISABLE_PSI_METADATA=1 \
-DDISABLE_PSI_MUTEX=1 \
-DDISABLE_PSI_RWLOCK=1 \
-DDISABLE_PSI_SOCKET=1 \
-DDISABLE_PSI_SP=1 \
-DDISABLE_PSI_STAGE=1 \
-DDISABLE_PSI_STATEMENT=1 \
-DDISABLE_PSI_STATEMENT_DIGEST=1 \
-DDISABLE_PSI_TABLE=1 \
-DWITH_ARCHIVE_STORAGE_ENGINE=0 \
-DWITH_BLACKHOLE_STORAGE_ENGINE=0 \
-DWITH_EXAMPLE_STORAGE_ENGINE=0 \
-DWITH_FEDERATED_STORAGE_ENGINE=0 \
-DWITH_PARTITION_STORAGE_ENGINE=0 \
-DWITH_PERFSCHEMA_STORAGE_ENGINE=0 \
-DENABLED_PROFILING=0 \
-DENABLE_DEBUG_SYNC=0 \
-DENABLE_DTRACE=0 \
-DENABLE_GCOV=0 \
-DENABLE_GPROF=0 \
-DOPTIMIZER_TRACE=0 \
-DWITH_CLIENT_PROTOCOL_TRACING=0 \
-DWITH_DEBUG=0 \
-DWITH_INNODB_EXTRA_DEBUG=0