Magento Config Cache Stampeding Race Condition

Published: October 30, 2017

Tags:

It started with an alert.

CPU usage had climbed to above 80% on the web servers. Additionally, average response time had spiked.

New Relic response time spike as a response of cache stampeding

In New Relic APM Pro, we could see in some slow transaction traces that the server was busy loading XML configurations…which should’ve been coming from cache.

After a few hours of investigation, we found that the issue was due to a core bug which can result in cache stampeding due to a race condition when the site is under high load.

Let’s dive into the issue.

How It Happens

The opportunity for cache stampeding is present in Mage_Core_Model_Config::saveCache(). Here’s the full method as of 1.9.3.6…

public function saveCache($tags=array())
{
    if (!Mage::app()->useCache('config')) {
        return $this;
    }
    if (!in_array(self::CACHE_TAG, $tags)) {
        $tags[] = self::CACHE_TAG;
    }
    $cacheLockId = $this->_getCacheLockId();
    if ($this->_loadCache($cacheLockId)) {
        return $this;
    }
    if (!empty($this->_cacheSections)) {
        $xml = clone $this->_xml;
        foreach ($this->_cacheSections as $sectionName => $level) {
            $this->_saveSectionCache($this->getCacheId(), $sectionName, $xml, $level, $tags);
            unset($xml->$sectionName);
        }
        $this->_cachePartsForSave[$this->getCacheId()] = $xml->asNiceXml('', false);
    } else {
        return parent::saveCache($tags);
    }
    $this->_saveCache(time(), $cacheLockId, array(), 60);
    $this->removeCache();
    foreach ($this->_cachePartsForSave as $cacheId => $cacheData) {
        $this->_saveCache($cacheData, $cacheId, $tags, $this->getCacheLifetime());
    }
    unset($this->_cachePartsForSave);
    $this->_removeCache($cacheLockId);
    return $this;
}

The method uses a lock, which should ideally prevent stampeding…

$cacheLockId = $this->_getCacheLockId();
if ($this->_loadCache($cacheLockId)) {
    return $this;
}

However, not only is this not atomic, but it also executes a bunch of code before it saves the lock…

$cacheLockId = $this->_getCacheLockId();
if ($this->_loadCache($cacheLockId)) {
    return $this;
}

// Save cache sections...

$this->_saveCache(time(), $cacheLockId, array(), 60);

Due to these faulty locking mechanics, the following race condition is possible…

Under high throughput, the same pattern repeats itself ad nauseam with Request C, Request D, Request E and so on.

While the race condition is not ideal, the real problem is the cache stampeding it causes. This happens because Mage_Core_Model_Config::saveCache() clears the config cache before it saves…

$this->removeCache();

As a result, under high throughput a backlog of requests will repeatedly clear the cache creating a stampeding effect preventing the config cache from working at all.

Redis DEL throughput spike due to cache stampeding race condition

What Can I Do?

In the heat of the moment, to resolve the issue while it was in progress we temporarily commented out the removeCache call in Mage_Core_Model_Config::saveCache() to allow the cache to save, and then reverted our changes. Response time immediately stabilized.

New Relic response time stabilizing after resolving cache stampeding

Upon further research, however, I found that Colin Mollenhour, author of several significant parts of Magento core code including Cm_Cache_Backend_Redis and Cm_RedisSession published a patched version of Mage_Core_Model_Config that fixes this issue.

His patch was packaged up as a composer installable component by Aligent, which was subsequently forked by magehost. The magehost fork includes compatibility with Magento 1.9.2.3 and SUPEE-7405 and is what you should apply at the time of publishing this.

I’m not clear sure at this point if Magento has an official SUPEE patch for this issue. As such, I’d recommend applying Colin’s patch to your code base, especially if your expecting high traffic this holiday season.

Max ChadwickHi, I'm Max!

I'm a software developer who mainly works in PHP, but also dabbles in Ruby and Go. Technical topics that interest me are monitoring, security and performance.

During the day I solve challenging technical problems at Something Digital where I mainly work with the Magento platform. I also blog about tech, work on open source and hunt for bugs.

If you'd like to get in touch with me the best way is on Twitter.