Monday, March 12, 2012

Why should I consider memcached plugin?

My last post explained what to expect from memcached plugin in MySQL 5.6 (labs release). But I want to take a step back and think about "why" first. Why is memcached plugin of interest at all? What can I gain from using this instead of plain SQL?

First: I don't see this as a replacement for memcached. If you want memory caching with memcached then use memcached.

But the memcached plugin to MySQL is a replacement or addition to the SQL interface to MySQL. So instead of using SQL queries in your application to persist or retrieve data from MySQL you can use the memcached interface. And what are the benefits?
  • Much higher performance
  • Easier scalability via sharding
  • Simpler application coding

1. Performance

Performance is always good. But there are two different aspects of performance: Latency (or runtime) of a specific query and throughput. And the relation of these two measures is not easy and simple to explain. So we need to discuss both:

1.1 Latency

Latency is the time it takes to execute an individual query. If using SQL the server needs to parse that SQL query (which is not so easy because SQL is quite a complex language), then optimize the execution plan of a query and finally execute it. Memcached interface is much simpler so parsing is close to nothing and accessing a single row by an index does not require any optimization. Only the real query execution is the same. I did a test in a VM with one virtual disk. (Not the best benchmark environment ;-) I inserted single rows in multiple threads via C-API (using SQL) and via libmemcachd (using the new plugin). The mean latency for a single query was 20%-30% higher for SQL queries. Considering that my IO configuration was bad, in a better environment the share of query execution would be smaller so the benefit of memcached plugin (no parsing, no optimization) would be even bigger.

1.2 Throughput

Usually lower latency helps improving the throughput. If each query is finished earlier, the server can already start working on the next query. This is true only if the bottleneck is really the CPU or RAM. If latency is mainly due to disk IO, it is not so helpful. During my test I achieved up to 50% higher throughput with memcached plugin. Again, a suitable IO config can probably improve this advantage even further.

2. Scalability

If a single server is no longer enough one solution may be sharding. With SQL you usually have two options: Switch to MySQL cluster, which offers auto-sharding. The other alternative is to implement sharding on your own in your application, which requires some effort.
With memcached plugin it is easier: For connecting to the database you would usually use some library like libmemcached. And many of these libraries offer ... a sharding feature. This is a natural feature for memcached libraries as the goal of memcached is to offer a distributed cache. Depending on the key value (or a hash of it) the library determines which instance to talk to. There is no need for you to extend your application code and manage different database connections. In libmemcached you can simply define your connectstring as "SERVER=donald:11211 SERVER=dagobert:11211 SERVER=daisy:11211". And that's it. Libmemcached will automatically shard your data to all three database servers.
Well, one thing must be mentioned: If you want to query your data via SQL and do aggregation, you have to aggregate in your application. The SQL interface is still not auto-sharded. If you need this kind of functionality, you should consider MySQL Cluster. Cluster will shard your data automatically, wether coming via memcached API or SQL. See more in Andrew's blog. And for MySQL Cluster the memcache interface is already recommended for production use in MySQL Cluster 7.2.

3. Ease of use

The typical architecture is very often application - memcache layer - MySQL database. And for read access the application talks to memcache first, and if this results in a "miss", the application will re-do the query but with a different protocol to the MySQL server. You must implement both protocols: memcache API and SQL. If you use the memcached plugin, you can now access the database directly with the same API functions that you already use for memcache. That makes development much easier.
The sharding feature in the memcached library will make it easier to scale the database. You do not have to implement it on your own.
"Easy of use" only applies if your application is happy with doing only low level data access to a key value store. Once you also need higher level data access like JOINs or aggregation queries, you will again use memcache API and SQL in the same application. So nothing will change to before.

4. Availability?

Not a huge argument. By using the memcached plugin you essentially work with MySQL and you will leverage all possible HA options that are available for MySQL to make your data highly available. It is worth mentioning that memcached plugin supports replication. You have to define daemon_memcached_enable_binlog = 1. That's it.

Code examples

Just for your reference here is the shortcut version of the code that I used to test SQL and memcache API in connecting to MySQL server.

C-API

#include
...
MYSQL *mysql;
char stmt[256];
int key = 0;
mysql = mysql_init(NULL);
if (mysql_real_connect(mysql,"localhost","user","password",
"schema",3306,NULL,0L) == NULL) {
goodbye();
}
for (;;) {
sprintf (stmt, "INSERT INTO kvstore VALUES('key%d','%s',0,0,0)",
key++,value);
if (mysql_query (mysql, stmt) != 0) {goodbye();}
}

Libmemcached

#include
...
char *config_string = "--SERVER=localhost:11211";
memcached_st *memc;
memcached_return_t rc;
memc = memcached(config_string, strlen(config_string));
if (memc == NULL) {goodbye();}
for (;;) {
rc= memcached_set(memc, key, strlen(key), value, strlen(value),
(time_t)0, (uint32_t)0);
if (rc != MEMCACHED_SUCCESS) goodbye();
}

2 comments:

  1. I think a huge benefit you're discussing, but not naming separately, is consistently. With memcached, you have two copies of the data, and the man with two watches never knows what time it is. With a memcached interface to MySQL, you have only one copy -- and it is consistent. This is a huge win.

    ReplyDelete
  2. Hi Mario

    You're right that your IO setup doesn't really reflect a realistic setup. From use of HandlerSocket and Galera, we find that the SQL parsing overhead is about half of query processing. In other words even with a write-only workload (ie one where there is always still some I/O) you could expect to get 2x benefit from this kind of NoSQL interface.

    On a read-only and in-memory workload (no I/O at all) the benefit from HandlerSocket was 7x more throughput.

    The interesting feature with HandlerSocket was that they ended up being faster than Memcached itself. This is probably also true for MySQL Memcache API. You should try it one day.

    ReplyDelete