Tuesday, February 2, 2016

Looking for the smallest possible MySQL Footprint

MySQL is known and famous for it’s simplicity and small size, especially compared to other RDBMSs. But what if you want to deploy on tiny hardware? I mean something even smaller than RaspberryPi?
I tested three steps to make the MySQL footprint as small as possible. All my tests were compiled for Oracle Linux 7 on x64 platform. I did not test any ARM cross compile. And these are the steps:
  1. Compile my own binary
  2. Remove all unnecessary tools/files
  3. Strip symbol information from binary

Let’s take a closer look at the tree steps.

Compile my own binary

MySQL is available as a source release. Using that you can configure the make process. That is documented pretty well in the Reference Manual. By switching off some options I was able to reduce the binary size from 240MB to 216MB. I switched off some performance_schema features, removed some storage engines that are irrelevant in most environments anyway (like ARCHIVE, NDB, EXAMPLE, …) and I removed all options for profiling. The final CMAKE statement is at the bottom of this post.

Remove unnecessary tools

I removed scripts and binaries from the distribution. Ted has written an interesting blog post about this. The remaining share directory contains some SQL scripts for installing additional tools. You need these at most once during setup and never again. So let’s remove these. If you are happy to live without textual error messages you can also remove the errmsg-utf8.txt file as well and all translations in the country specific subdirs. And if you can live with reduced charset support, you can even remove the rest of the share directory. You are running essentially only with a mysqld binary.

Strip symbol information from binary

All compilations are done with extended diagnosis information in the binary. These symbol data helps if you want to analyze a core dump for example. Symbols are included by default in the MySQL binaries. These take a surprisingly large amount of space. You can remove these symbols from the binary with the tool “strip(1). After stripping the binary size came down to 24MB, which is only 10% of the initial size.

More ideas

There are some more options to use either system libraries or the libraries that come with the source code. Using existing libraries from the system might help save a few bytes.


It is possible to make MySQL very lean for your (embedded) system. Despite all the functionality that we added to MySQL in the releases since MySQL 5.1 you get a full featured RDBMS with only a handful of MB. Here are my final results:

  • MySQL 5.6, minimal features: 79MB, stripped 13MB
  • MySQL 5.7, default features: 240MB, stripped 24MB
  • MySQL5.7, minimal features: 216MB, stripped 24MB (removing features brings minimal savings only)


This is the CMAKE statement I used to compile MySQL 5.7 on Oracle Linux 7:
cmake . -DCMAKE_INSTALL_PREFIX=/home/testy/TQ/dist-mysql-5.7.10/        \
        -DDOWNLOAD_BOOST=1                                              \
        -DWITH_BOOST=/home/testy/TQ/boost/                              \
        -DDISABLE_PSI_COND=1   \
        -DDISABLE_PSI_FILE=1   \
        -DDISABLE_PSI_IDLE=1   \
        -DDISABLE_PSI_SP=1     \
        -DDISABLE_PSI_STAGE=1  \

Thursday, November 26, 2015

JSON memory consumption

I got some more questions on the new JSON data type and functions during our TechTours. And I like to summarize the answers in this blogpost.

Memory consumption 

The binary format of the JSON data type should consume more memory. But how much? I did a little test by comparing a freshly loaded 25,000 row dataset stored as JSON and stored as TEXT. Seven top level attributes per JSON document. Average JSON_DEPTH is 5.9 . Let's see:
mysql> DESC data_as_text;
| Field | Type    | Null | Key | Default | Extra |
| id    | int(11) | NO   | PRI | NULL    |       |
| doc   | text    | YES  |     | NULL    |       |
2 rows in set (0.00 sec)

mysql> SELECT COUNT(*),AVG(JSON_LENGTH(doc)) FROM data_as_text;
| COUNT(*) | AVG(JSON_LENGTH(doc)) |
|    25359 |                7.0000 |
1 row in set (0.81 sec)

mysql> DESC data_as_json;
| Field | Type    | Null | Key | Default | Extra          |
| id    | int(11) | NO   | PRI | NULL    | auto_increment |
| doc   | json    | NO   |     | NULL    |                |
2 rows in set (0.00 sec)

mysql> SELECT COUNT(*),AVG(JSON_LENGTH(doc)) FROM data_as_json;
| COUNT(*) | AVG(JSON_LENGTH(doc)) |
|    25359 |                7.0000 |
1 row in set (0.08 sec)

mysql> select name,allocated_size/1024/1024 AS "size in MB" from information_schema.innodb_sys_tablespaces where name like "%temp%";
| name              | size in MB  |
| temp/data_as_json | 23.00390625 |
| temp/data_as_text | 22.00390625 |
2 rows in set (0.00 sec)
The increased memory consumption is 1/22 in this case, which is roughly 4,5%. At the same time you see the benefit: The full table scan with some JSON operation has a 90% reduction in runtime when using JSON datatype.
Don't take this number for real. Of course it depends on the number of JSON attributes, character set and others. Just a rough indication. If you want all the details look at the JSON architecture in WL#8132.

Monday, November 23, 2015

Document validation of JSON columns in MySQL

Starting with the new release MySQL 5.7 there is support to store JSON documents in a column. During our recent Tech Tour events we got questions about document validation, so ensuring that a JSON document has a certain structure. (Funny. It all started with the idea to be schema-free. Now people seem to need schema enforcement.)
I have two ideas how to implement a schema validation for JSON columns. The first one is by leveraging generated columns together with a foreign key. The second idea is by implementing a trigger. Today I want to focus on the generated columns and foreign keys.
When defining foreign keys with generated columns there are two limitations we need to be aware of:
  • Foreign keys require indexes. JSON columns cannot be indexed. We need to leverage other types.
  • Only STORED generated columns are supported for foreign keys.
So here is an example of an address table that leverages JSON to define an arbitrary number of phone number entries per row. In fact I use a mixed model of relational (e.g. to enforce a strict model for name NOT NULL) and document so that phone numbers are more free to define.

 CREATE TABLE `people` (  
 `name` varchar(30) NOT NULL,  
 `firstname` varchar(30) DEFAULT NULL,  
 `birthdate` date DEFAULT NULL,  
 `phones` json DEFAULT NULL,  
 `phonekeys` varchar(30) GENERATED ALWAYS AS (json_keys(phones)) STORED,  
 KEY `phonekeys` (`phonekeys`));  

The generated column phonekeys is a string that includes the types of phone numbers for each row. Some sample data:

 mysql> INSERT INTO people (name,firstname,birthdate,phones)  
 VALUES ("Plumber", "Joe, the", "1972-05-05",'{"work": "+1(555)24680"}');  
 Query OK, 1 row affected (0.00 sec)  
 ...some more inserts...  
 mysql> SELECT * FROM people;  
 | name | firstname | birthdate | phones | phonekeys |  
 | Doe | John | 1995-04-17 | {"mobile": "+491715555555", "private": "+49305555555"} | ["mobile", "private"] |  
 | Dian | Mary | 1963-12-12 | {"work": "+43987654321"} | ["work"] |  
 | Plumber | Joe, the | 1972-05-05 | {"work": "+1(555)24680"} | ["work"] |  
 3 rows in set (0.00 sec)  

The column phonekeys gets populated automatically.
To check that we use "correct" attributes in our JSON object we can now create a table that contains the valid JSON keys:

  CREATE TABLE `valid_keys` (  
  `keylist` varchar(30) NOT NULL,  
  PRIMARY KEY (`keylist`)  
 1 row in set (0.01 sec)  
 ... after some inserts...  
 mysql> SELECT * FROM valid_keys;  
 | keylist            |  
 | ["mobile", "private", "work"] |  
 | ["mobile", "private"]     |  
 | ["work"]           |  
 3 rows in set (0.00 sec)  

Now we can define a foreign key with the people table as a child table:
mysql> alter table people add foreign key (phonekeys) references valid_keys (keylist);

That should enforce that inserted JSON documents in the people table must have a list of attributes that matches any entry in the valid_keys table. Let's try:

mysql> INSERT INTO people (name,phones) VALUES ("me", JSON_OBJECT("work","12243"));
Query OK, 1 row affected (0.01 sec)

mysql> INSERT INTO people (name,phones) VALUES ("my friend", JSON_OBJECT("home","12243"));
ERROR 1452 (23000): Cannot add or update a child row: a foreign key constraint fails (`mario`.`people`, CONSTRAINT `people_ibfk_1` FOREIGN KEY (`phonekeys`) REFERENCES `valid_keys` (`keylist`))

Works fine. "home" is not an allowed attribute. I can leverage the foreign keys to make sure my phone numbers match a certain attribute list. However it is not perfectly simple to use. With five different allowed attributes in an arbitrary order you would have to add all permutations to the valid_keys table. With five attributes you end up with 6! permutations ("not defining an attribute" is also an option, hence six), which results in 720 rows for valid_keys. But it is a first start. For more complex explamples the ideas with triggers might be more favorable.

Thursday, April 9, 2015

Secondary Indexes on XML BLOBs in MySQL 5.7

When storing XML documents in a BLOB or TEXT column there was no way to create indexes on individual XML elements or attributes. With the new auto generated columns in MySQL 5.7 (1st Release Candidate available now!) this has changed! Let me give you an example. Let's work on the following table:
 mysql> SELECT * FROM country\G  
 *************************** 1. row ***************************  
 docid: 1  
  doc: <country>  
     <city name="Berlin"><population></population></city>  
     <city name="Frankfurt"><population>643821</population></city>  
     <city name="Hamburg"><population>1704735</population></city>  
 *************************** 2. row ***************************  
 docid: 2  
  doc: <country>  
     <city name="Paris"><population>445452</population></city>  
     <city name="Lyon"></city>  
     <city name="Brest"></city>  
 *************************** 3. row ***************************  
 docid: 3  
  doc: <country>  
     <city name="Brest"><population></population></city>  
 *************************** 4. row ***************************  
 docid: 4  
  doc: <country>  
 4 rows in set (0,00 sec)  

The table has only two columns: docid and doc. Since MySQL 5.1 it is possible to extract the population value thanks to the XML functions like ExtractValue(...). But sorting the documents by the population of a country was impossible because population is not a dedicated column in the table. Starting with MySQL 5.7.6 DMR we can add an auto generated column that contains only the population. Let’s create that column:

 mysql> ALTER TABLE country ADD COLUMN population INT UNSIGNED AS (CAST(ExtractValue(doc,"/country/population") AS UNSIGNED INTEGER)) STORED;
  Query OK, 4 rows affected (0,21 sec)   
  Records: 4 Duplicates: 0 Warnings: 0   
  mysql> ALTER TABLE country ADD INDEX (population);   
  Query OK, 0 rows affected (0,22 sec)   
  Records: 0 Duplicates: 0 Warnings: 0   
  mysql> SELECT docid FROM country ORDER BY population ASC; 
  | docid |   
  |     4 |   
  |     3 |   
  |     2 |   
  |     1 |   
  4 rows in set (0,00 sec)  

The population value is extracted automatically from each document, stored in a dedicated column and the index is maintained. Really simple now. Note that the population value of the cities is NOT extracted.

What happens if we want to look for city names? Each document may contain several city names. First let’s extract the city names with the XML function and store it in an auto generated column again:

 mysql> ALTER TABLE country ADD COLUMN cities TEXT AS (ExtractValue(doc,"/country/city/@name")) STORED;  
 Query OK, 4 rows affected (0,62 sec)  
 Records: 4 Duplicates: 0 Warnings: 0  
 mysql> SELECT docid,cities FROM country;  
 | docid | cities                   |  
 |     1 | Berlin Frankfurt Hamburg |  
 |     2 | Paris Lyon Brest         |  
 |     3 | Brest                    |  
 |     4 |                          |  
 4 rows in set (0,01 sec)  

The XML function ExtractValue extracts the name attribute of all cities and concatenates these with whitespace. That makes it easy for us to leverage the FULLTEXT index in InnoDB:

 mysql> ALTER TABLE country ADD FULLTEXT (cities);  
 mysql> SELECT docid FROM country WHERE MATCH(cities) AGAINST ("Brest");  
 | docid |  
 |     2 |  
 |     3 |  
 2 rows in set (0,01 sec)  

All XML calculations are done automatically when storing data. Let’s add another XML document and query again:

 mysql> INSERT INTO country (doc) VALUES ('<country><name>USA</name><city name="New York"/><population>278357000</population></country>');  
 Query OK, 1 row affected (0,00 sec)  
 mysql> SELECT * FROM country WHERE MATCH(cities) AGAINST ("New York");  
 | docid | doc                                                                                          | population | cities   |  
 |     5 | <country><name>USA</name><city name="New York"/><population>278357000</population></country> |  278357000 | New York |  
 1 row in set (0,00 sec)  

Does this also work with JSON documents? There are JSON functions available in a labs release. These functions are currently implemented as user defined functions (UDF) in MySQL. UDFs are not supported in auto generated columns. So we have to wait until JSON functions are built-in to MySQL.
UPDATE: See this blogpost. There is a first labs release to use JSON functional indexes.

What did we learn? tl;dr

With MySQL 5.7.6 it is possible to automatically create columns from XML elements or attributes and maintain indexes on this data. Search is optimized, MySQL is doing all the work for you. And Brest is not only in France but also a city in Belarus.

Tuesday, March 24, 2015

Profiling Stored Procedures in MySQL 5.7

With the changes to performance_schema in MySQL 5.7 Development Milestone Release it is now possible to analyze and profile the execution of stored programs. This is highly useful if you develop more complex stored procedures and try to find the bottlenecks. The "old" performance_schema up to MySQL 5.6 only reported a CALL statement with a runtime, but no information on statements that were executed WITHIN the stored procedure. Now let's try this in the latest MySQL 5.7.6 DMR release. After creating some test table and a test stored procedure we need to activate the events_statements_history_long consumer, which is OFF by default:

mysql> UPDATE setup_consumers SET ENABLED="YES" 
           WHERE NAME = "events_statements_history_long"; 

Then let's call the stored procedure that we want to inspect:

mysql> CALL test.massinsert(400,405); 

To avoid that we overwrite data from the events_statements_history_long table with the following queries, let's deactivate that consumer ASAP. If you have some concurrent load running on your system, it may be wise to leverage the filter options in performance_schema like setup_actors and/or setup_objects.

mysql> UPDATE setup_consumers SET ENABLED="NO" 
          WHERE NAME = "events_statements_history_long"; 

Next step is to find our CALL statement in the events_statements_history_long table:

mysql> SELECT event_id,sql_text, 
              CONCAT(TIMER_WAIT/1000000000,"ms") AS time 
                 FROM events_statements_history_long 
       WHERE event_name="statement/sql/call_procedure"; +----------+-------------------------------+-----------+ 
| event_id | sql_text                      | time      | +----------+-------------------------------+-----------+ 
| 144      | call massinsert(100,105)      | 0.2090ms  |
| 150      | call massinsert(100,105)      | 79.9659ms | 
| 421      | CALL test.massinsert(400,405) | 74.2078ms | +----------+-------------------------------+-----------+ 
3 rows in set (0,03 sec) 

You see: I tried this stored procedure three times. The one I want to inspect in detail is event_id 421. Let's look at all nested statement events that came from 421:

        CONCAT(TIMER_WAIT/1000000000,"ms") AS time   
     FROM events_statements_history_long   
     WHERE nesting_event_id=421 ORDER BY event_id;   
 | EVENT_NAME               | SQL_TEXT                          | time      |  
 | statement/sp/stmt        | SET @i = first                    | 0.0253ms  |   
 | statement/sp/stmt        | SET @i = @i + 1                   | 0.0155ms  |   
 | statement/sp/stmt        | INSERT INTO a VALUES (@i,MD5(@i)) | 45.6425ms |   
 | statement/sp/jump_if_not | NULL                              | 0.0311ms  |   
 | statement/sp/stmt        | SET @i = @i + 1                   | 0.0297ms  |   
 | statement/sp/stmt        | INSERT INTO a VALUES (@i,MD5(@i)) | 4.9695ms  |   
 | statement/sp/jump_if_not | NULL                              | 0.0726ms  |   
 | statement/sp/stmt        | SET @i = @i + 1                   | 0.0365ms  |   
 | statement/sp/stmt        | INSERT INTO a VALUES (@i,MD5(@i)) | 6.8518ms  |   
 | statement/sp/jump_if_not | NULL                              | 0.0343ms  |   
 | statement/sp/stmt        | SET @i = @i + 1                   | 0.0316ms  |   
 | statement/sp/stmt        | INSERT INTO a VALUES (@i,MD5(@i)) | 9.9633ms  |   
 | statement/sp/jump_if_not | NULL                              | 0.0309ms  |   
 | statement/sp/stmt        | SET @i = @i + 1                   | 0.0274ms  |   
 | statement/sp/stmt        | INSERT INTO a VALUES (@i,MD5(@i)) | 5.6235ms  |   
 | statement/sp/jump_if_not | NULL                              | 0.0308ms  |
 16 rows in set (0,06 sec)   
Now we have the statements that were executed in the stored procedure "massinsert(400,405)" with their individual execution times and in order of execution. We have all other information available as well, not only execution time. We can access number of rows affected, sql error text, used algorithms, ... All information that performance_schema offers for statement events. This is a great way to analyze your stored procedures. find the most costly statements and improve performance of your stored programs. That is really a great enhancement to performance_schema.

Friday, March 13, 2015

Auto Generated Columns in MySQL 5.7: Two Indexes on one Column made easy

One of my customers wants to search for names in a table. But sometimes the search is case insensitive, next time search should be done case sensitive. The index on that column always is created with the collation of the column. And if you search with a different collation in mind, you end up with a full table scan. Here is an example:

The problem

 mysql> SHOW CREATE TABLE City\G  
 *************************** 1. row ***************************  
 Table: City  
 Create Table: CREATE TABLE `City` (  
 `Name` char(35) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL,  
 `CountryCode` char(3) NOT NULL DEFAULT '',  
 `District` char(20) NOT NULL DEFAULT '',  
 `Population` int(11) NOT NULL DEFAULT '0',  
 KEY `CountryCode` (`CountryCode`),  
 KEY `Name` (`Name`),  
 1 row in set (0,00 sec)  

The collation of the column `Name` is utf8_bin, so case sensitive. Let's search for a City:
 mysql> SELECT Name,Population FROM City WHERE Name='berlin';  
 Empty set (0,00 sec)  
 mysql> EXPLAIN SELECT Name,Population FROM City WHERE Name='berlin';  
 | id | select_type | table | partitions | type | possible_keys | key  | key_len | ref   | rows | filtered | Extra |  
 | 1  | SIMPLE      | City  | NULL       | ref  | Name          | Name | 106     | const |  1   |  100.00  | NULL  |  
 1 row in set, 1 warning (0,00 sec)  

Very efficient statement, using the index. But unfortunately it did not find the row as the search is based on the case sensitive collation.
Now let's change the collation for the WHERE clause:
 mysql> SELECT Name,Population FROM City WHERE Name='berlin' COLLATE utf8_general_ci;  
 | Name   | Population |  
 | Berlin |    3386667 |  
 1 row in set (0,00 sec)  
 mysql> EXPLAIN SELECT Name,Population FROM City WHERE Name='berlin' COLLATE utf8_general_ci;  
 | id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra       |  
 | 1  | SIMPLE      | City  | NULL       | ALL  | Name          | NULL | NULL    | NULL | 4108 |  10.00   | Using where |  
 1 row in set, 3 warnings (0,00 sec)  

The result is what we wanted but the query creates a full table scan. Not good. BTW: The warnings point you to the fact that the index could not be used.

The solution

Now let's see how auto generated columns in the new MySQL 5.7 Development Milestone Release can help us. First let's create a copy of the Name column but with a different collation:
 mysql> ALTER TABLE City ADD COLUMN Name_ci char(35) CHARACTER SET utf8 AS (Name) STORED;  
 Query OK, 4079 rows affected (0,50 sec)  
 Records: 4079 Duplicates: 0 Warnings: 0  

"AS (Name) STORED" is the new stuff: In the brackets is the expression to calculate the column value. Here it is a simple copy of the Name column. The keyword STORED means that the data is physically stored and not calculated on the fly. This is necessary to create the index now:
 mysql> ALTER TABLE City ADD INDEX (Name_ci);  
 Query OK, 0 rows affected (0,13 sec)  
 Records: 0 Duplicates: 0 Warnings: 0  

As utf8_general_ci is the default collation with utf8, there is no need to specify this with the new column. Now let's see how to search:

mysql> SELECT Name, Population FROM City WHERE Name_ci='berlin';
| Name   | Population |
| Berlin |    3386667 |
1 row in set (0,00 sec)
mysql> EXPLAIN SELECT Name, Population FROM City WHERE Name_ci='berlin';nbsp;
| id | select_type | table | partitions | type | possible_keys | key     | key_len | ref   | rows | filtered | Extra |
|  1 | SIMPLE      | City  | NULL       | ref  | Name_ci       | Name_ci | 106     | const |    1 |   100.00 | NULL  |
1 row in set, 1 warning (0,00 sec)  

Now we can search case sensitive (...WHERE Name=...) and case insensitive (WHERE Name_ci=...) and leverage indexes in both cases.


Use auto generated columns in MySQL 5.7 to create an additional index with a different collation. Now you can search based on different indexes.

Wednesday, March 19, 2014

MySQL Cluster on Raspberry Pi - Sub-second failover

MySQL Cluster claims to achieve sub-second failover without any data loss for commited transactions. And I always wanted to show this in a demo. Now we created that demo finally. See Mark's blog and Keith's blog for setting up MySQL Cluster on RaspberryPi.
The nice thing about the RPis is that you can easily pull the plug to test failover. Ok, that is only one possible failure scenario but for sure the most obvious and more impressive than "kill -9".

That demo application is constantly using the database for storing new lines, removing old lines and reading all line data for the graphical view. There is no caching. It uses JDBC directly.
To document the setup here is the config.ini file for MySQL Cluster:

[ndbd default]


I made the cluster diskless so it will not write any logs and table spaces to disk. The SD card performance was not great and it does not affect failover behavior.
I also reduced the HeatbeatIntervallDbDb so that nodes detect immediately (well, 10ms) if a heartbeat is missed. After a few missed heartbeats cluster reconfigures and the remaining node takes responsibility and service continues.
BTW: Pulling the plug is nice but every now and then I had to manually fsck the root-fs during reboot.