Thursday, April 9, 2015

Secondary Indexes on XML BLOBs in MySQL 5.7

When storing XML documents in a BLOB or TEXT column there was no way to create indexes on individual XML elements or attributes. With the new auto generated columns in MySQL 5.7 (1st Release Candidate available now!) this has changed! Let me give you an example. Let's work on the following table:
 mysql> SELECT * FROM country\G  
 *************************** 1. row ***************************  
 docid: 1  
  doc: <country>  
     <name>Germany</name>  
     <population>82164700</population>  
     <surface>357022.00</surface>  
     <city name="Berlin"><population></population></city>  
     <city name="Frankfurt"><population>643821</population></city>  
     <city name="Hamburg"><population>1704735</population></city>  
 </country>  
 *************************** 2. row ***************************  
 docid: 2  
  doc: <country>  
     <name>France</name>  
     <surface></surface>  
     <city name="Paris"><population>445452</population></city>  
     <city name="Lyon"></city>  
     <city name="Brest"></city>  
     <population>59225700</population>  
 </country>  
 *************************** 3. row ***************************  
 docid: 3  
  doc: <country>  
     <population>10236000</population>  
     <name>Belarus</name>  
     <city name="Brest"><population></population></city>  
 </country>  
 *************************** 4. row ***************************  
 docid: 4  
  doc: <country>  
     <name>Pitcairn</name>  
     <population>52</population>  
 </country>  
 4 rows in set (0,00 sec)  

The table has only two columns: docid and doc. Since MySQL 5.1 it is possible to extract the population value thanks to the XML functions like ExtractValue(...). But sorting the documents by the population of a country was impossible because population is not a dedicated column in the table. Starting with MySQL 5.7.6 DMR we can add an auto generated column that contains only the population. Let’s create that column:

 mysql> ALTER TABLE country ADD COLUMN population INT UNSIGNED AS (CAST(ExtractValue(doc,"/country/population") AS UNSIGNED INTEGER)) STORED;
  Query OK, 4 rows affected (0,21 sec)   
  Records: 4 Duplicates: 0 Warnings: 0   
  mysql> ALTER TABLE country ADD INDEX (population);   
  Query OK, 0 rows affected (0,22 sec)   
  Records: 0 Duplicates: 0 Warnings: 0   
  mysql> SELECT docid FROM country ORDER BY population ASC; 
  +-------+   
  | docid |   
  +-------+   
  |     4 |   
  |     3 |   
  |     2 |   
  |     1 |   
  +-------+   
  4 rows in set (0,00 sec)  

The population value is extracted automatically from each document, stored in a dedicated column and the index is maintained. Really simple now. Note that the population value of the cities is NOT extracted.

What happens if we want to look for city names? Each document may contain several city names. First let’s extract the city names with the XML function and store it in an auto generated column again:

 mysql> ALTER TABLE country ADD COLUMN cities TEXT AS (ExtractValue(doc,"/country/city/@name")) STORED;  
 Query OK, 4 rows affected (0,62 sec)  
 Records: 4 Duplicates: 0 Warnings: 0  
 mysql> SELECT docid,cities FROM country;  
 +-------+--------------------------+  
 | docid | cities                   |  
 +-------+--------------------------+  
 |     1 | Berlin Frankfurt Hamburg |  
 |     2 | Paris Lyon Brest         |  
 |     3 | Brest                    |  
 |     4 |                          |  
 +-------+--------------------------+  
 4 rows in set (0,01 sec)  

The XML function ExtractValue extracts the name attribute of all cities and concatenates these with whitespace. That makes it easy for us to leverage the FULLTEXT index in InnoDB:

 mysql> ALTER TABLE country ADD FULLTEXT (cities);  
 mysql> SELECT docid FROM country WHERE MATCH(cities) AGAINST ("Brest");  
 +-------+  
 | docid |  
 +-------+  
 |     2 |  
 |     3 |  
 +-------+  
 2 rows in set (0,01 sec)  

All XML calculations are done automatically when storing data. Let’s add another XML document and query again:

 mysql> INSERT INTO country (doc) VALUES ('<country><name>USA</name><city name="New York"/><population>278357000</population></country>');  
 Query OK, 1 row affected (0,00 sec)  
 mysql> SELECT * FROM country WHERE MATCH(cities) AGAINST ("New York");  
 +-------+----------------------------------------------------------------------------------------------+------------+----------+  
 | docid | doc                                                                                          | population | cities   |  
 +-------+----------------------------------------------------------------------------------------------+------------+----------+  
 |     5 | <country><name>USA</name><city name="New York"/><population>278357000</population></country> |  278357000 | New York |  
 +-------+----------------------------------------------------------------------------------------------+------------+----------+  
 1 row in set (0,00 sec)  

Does this also work with JSON documents? There are JSON functions available in a labs release. These functions are currently implemented as user defined functions (UDF) in MySQL. UDFs are not supported in auto generated columns. So we have to wait until JSON functions are built-in to MySQL.
UPDATE: See this blogpost. There is a first labs release to use JSON functional indexes.

What did we learn? tl;dr

With MySQL 5.7.6 it is possible to automatically create columns from XML elements or attributes and maintain indexes on this data. Search is optimized, MySQL is doing all the work for you. And Brest is not only in France but also a city in Belarus.