There’s been a lot of discussion about PSR-6, the php-fig caching interfaces, so I thought it was time to step in and describe what this system is all about. Be prepared to read far more about caching interfaces than you probably thought possible.
Why is PSR-6 needed?
This subject is covered nicely in the PSR-6 Meta Docs, which summarize the issue as such:
Caching is a common way to improve the performance of any project, making caching libraries one of the most common features of many frameworks and libraries. This has led to a situation where many libraries roll their own caching libraries, with various levels of functionality. These differences are causing developers to have to learn multiple systems which may or may not provide the functionality they need. In addition, the developers of caching libraries themselves face a choice between only supporting a limited number of frameworks or creating a large number of adapter classes.
This is a burden on several major groups:
- Frameworks have to build adaptors for just about every library that uses a cache.
- Library/Component developers are stuck reinventing the wheel in order to give their library caching capabilities.
- App developers (the consumers of Frameworks and Libraries) need to configure and manage multiple systems or just forgo caching in some components altogether.
- Operations engineers have very little insight or ability to configure and tune the resulting conglomeration of caching systems.
We don’t think we can solve all of those problems directly (at least not without creating a lot more around it). If we can give all of the library and component developers a common caching interface, however, the other problems start to solve themselves. Frameworks won’t need to build adaptors and can focus their development resources elsewhere, and the app and library developers can incorporate caching into their projects without caching becoming a project. With that problem solved developers can focus on other ones, such as expanding their caching library’s configuration and tuning system for the operations people.
Alternative Proposals
Before diving into the current standard I think it’s important to look at the alternatives that currently exist.
- Evert Pot is actually the one who started the caching discussion in the group, and his original proposal is a good example of a driver model cache.
- The Doctrine representative proposed a far more complex option, which we’ve been referring to as the Caching Framework model.
- Rasmus Schultz recently proposed an extended driver model that tried to take on some of the issues solved by the Pool/Item model.
Starting Point: The Driver Model
The simplest method of caching is what we've been referring to as the Driver Model. It is a pretty bare-bones, down-to-the-metal approach based on the interfaces often exposed by caching systems (such as APC).
    interface CacheInterface
    {
        public function exists($key);
        public function get($key);
        public function set($key, $value, $ttl = null);
        public function multiGet(array $keys);
        public function multiSet(array $keys, array $values, $ttl = null);
        public function clear($key);
    }
At its core this seems okay, and you can see examples of it in the wild where people have slapped together a caching system as part of their project.
Over time though this presents some issues.
How do you handle storing nulls or other values that may look "false"?
Sometimes developers want to store falsy values such as false, null or 0. This is perfectly acceptable behavior and a lot more common than people think.
The naive approach is to say that you look for one value (say, null) in most situations, but in the situation where you want null as an accepted output you look for something else (like false). This ignores times where both results are acceptable, but that may be rare enough to not worry about. However, it does make for sloppy API design, as developers will be expecting different return values in different places to represent the same thing. That being said, PHP has never gotten flak from people over inconsistent API expectations, so there's no point in worrying about that now, right?
What’s this about race conditions?
A solution to the "null" problem is to stop relying on the return value itself to tell whether the cache request was a hit or a miss. I took the liberty of adding it into this example already as the "exists" function. There's one problem, though: between when "exists" and "get" are called (regardless of the order) the cached value's status could change. The value could be retrieved by "get" when it's a miss (and thus null), but then another process could insert it, resulting in "exists" returning true and an improper value being used. So that solution isn't much of a solution.
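To make the ambiguity concrete, here is a minimal sketch using a hypothetical in-memory driver. The class name `ArrayDriver` and the cache keys are made up for illustration and are not part of any proposal. It shows that a stored null and a miss look identical when only the return value is inspected, and why falling back to a separate "exists" call costs a second lookup.

```php
<?php
// Hypothetical in-memory driver, written only to illustrate the problem.
final class ArrayDriver
{
    private array $data = [];

    public function exists(string $key): bool
    {
        return array_key_exists($key, $this->data);
    }

    public function get(string $key)
    {
        // A miss and a stored null both come back as null.
        return $this->data[$key] ?? null;
    }

    public function set(string $key, $value): void
    {
        $this->data[$key] = $value;
    }
}

$cache = new ArrayDriver();

// "This user has no nickname" is a perfectly valid answer to cache.
$cache->set('user.42.nickname', null);

$stored  = $cache->get('user.42.nickname'); // null, but this was a hit
$missing = $cache->get('user.43.nickname'); // null, and this was a miss

// The return values alone cannot tell the two cases apart.
$ambiguous = ($stored === $missing);

// exists() does tell them apart, but it is a second lookup. With a shared
// backend another process may change the entry between the two calls.
$hitStates = [
    $cache->exists('user.42.nickname'),
    $cache->exists('user.43.nickname'),
];
```

In a single process nothing can sneak in between the two calls, so the race itself does not show up here; it only appears once the backend is shared between processes.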
Can we address the limited extensibility?
One of the biggest, and unfortunately most subtle, issues comes around extending the caching systems themselves. Part of the reason why this is hard to grasp is that it is, in many ways, invisible to the users of the libraries, but it's vitally important to the people developing them and the frameworks that use them. I'll expand on a few examples below, such as dealing with stampede protection or the multi set/get functions, but on a more general basis this is about caching libraries' ability to experiment.
Can we make multi functions more scalable?
Many caching systems have a way to send and retrieve more than one value at a time. When dealing with batches of data this can reduce quite a bit of overhead, and regardless of which API design you go with, it's trivial to emulate on systems that don't support it. The Driver Model supports it straight out of the box by taking in an array of items to be set as a parameter.
This design forces a major decision on developers that quite simply does not belong with them: how many items should be in this array? More values buffered to be saved means more memory being used, but also less overhead for the caching transaction itself. What that trade-off means, though, is completely different based not just on the library but on the circumstances that library is running in. Someone testing an application on an AWS micro instance with on-machine caching is going to have a completely different answer than someone running software on a high-memory dedicated box which talks to a memcached server.
These values need to be set by someone familiar with the systems they're running, which means those settings had better be exposed for configuration. With the driver model that onus is going to be on each and every library that's using the multi functions, and the frameworks are going to need to expose that through their configuration systems.
How are misses handled with multiGet?
When doing multiGets you can't just assume all of these values will come back, and unfortunately the Driver Model does not have a clear way to handle that. You could conceivably just leave them out (relying on finding missing entries from the array of keys), sort them, assign them null or false values when they're being filled, or pick one of many other options.
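As a rough illustration of why this matters, the sketch below implements two of those conventions side by side. The function names and the sample data are hypothetical. Calling code has to know which convention a given driver picked, and with the second one a miss is again indistinguishable from a cached null.

```php
<?php
// Sample backend contents: 'b' was never cached, 'c' holds a cached null.
$store = ['a' => 1, 'c' => null];

// Convention 1: leave misses out of the result entirely.
function multiGetOmitMisses(array $store, array $keys): array
{
    return array_intersect_key($store, array_flip($keys));
}

// Convention 2: return every requested key, filling misses with null.
function multiGetFillNull(array $store, array $keys): array
{
    $result = [];
    foreach ($keys as $key) {
        $result[$key] = $store[$key] ?? null;
    }
    return $result;
}

$keys = ['a', 'b', 'c'];

$omitted = multiGetOmitMisses($store, $keys);
$filled  = multiGetFillNull($store, $keys);

// With convention 1 the caller has to diff the keys to find the misses.
$missesFromOmitted = array_values(array_diff($keys, array_keys($omitted)));

// With convention 2 the miss 'b' and the cached null 'c' look identical.
$looksTheSame = ($filled['b'] === $filled['c']);
```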
Evolving to the Pool Model
Those of you who have already read the standard know where we're going, but it's important to understand how we got there. The above issues have a number of solutions but we've ultimately looked to solve it in ways that are provable, simple, and flexible enough to fit the needs of future systems. We didn't leap from one model to another overnight, but took a series of steps that I've outlined below.
Introducing a Value Wrapper
Over time many users of caching systems end up doing something interesting: they start wrapping their values with other objects. Oftentimes it's a pretty simple wrapper, but this wrapper allows developers to easily distinguish between any of the values that merely look like a miss (false, null, 0) and an actual miss without any bothersome race conditions.
It turns out this method has some other benefits as well, but before getting to those let's take a look at what this would actually look like as interfaces.
    interface CachePoolInterface
    {
        public function getItem($key);
        public function getItems(array $keys);
        public function saveItem($item);
        public function saveItems(array $items);
        public function clear(array $keys);
    }
    interface CacheItemInterface
    {
        public function isHit();
        public function get();
        public function set($value, $ttl = null);
    }
Compared to the Driver Model:
- get and multiGet turn into getItem and getItems respectively.
- set and multiSet turn into saveItem and saveItems respectively.
- exists turns into isHit and is placed on the new Item interface.
- Two new "get" and "set" functions exist as value wrappers in the new Item interface.
Any value of any type can be stored and retrieved, there will be no ambiguous code (with its potentially hard to trace bugs), and we're not introducing new bugs to fix that. This one incredibly minor change suddenly removes problems and provides for a much cleaner API, all without substantially changing the workflow of the average developer.
We refer to this as the Pool model due to its division of the cache into a repository (the pool) and item responsibilities.
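Here is a minimal sketch of what an implementation of this split could look like, using a plain array as the backend. The class names `ArrayPool` and `ArrayItem` are hypothetical, TTL handling is left out entirely, `set()` takes only the value to store, and the `getKey()` method is an assumption added so the pool knows where to save the item.

```php
<?php
// Hypothetical in-memory Pool/Item pair; names and details are illustrative only.
final class ArrayItem
{
    private string $key;
    private $value;
    private bool $hit;

    public function __construct(string $key, $value, bool $hit)
    {
        $this->key = $key;
        $this->value = $value;
        $this->hit = $hit;
    }

    public function getKey(): string
    {
        return $this->key;
    }

    public function isHit(): bool
    {
        return $this->hit;
    }

    public function get()
    {
        return $this->value;
    }

    public function set($value): void
    {
        $this->value = $value;
    }
}

final class ArrayPool
{
    private array $data = [];

    public function getItem(string $key): ArrayItem
    {
        // Hit or miss is decided once, at lookup time, and travels with the item.
        $hit = array_key_exists($key, $this->data);
        return new ArrayItem($key, $hit ? $this->data[$key] : null, $hit);
    }

    public function saveItem(ArrayItem $item): void
    {
        $this->data[$item->getKey()] = $item->get();
    }
}

$pool = new ArrayPool();

$item = $pool->getItem('user.42.nickname');
$firstLookupWasHit = $item->isHit();   // nothing stored yet

$item->set(null);                      // cache "no nickname" as a real value
$pool->saveItem($item);

$again = $pool->getItem('user.42.nickname');
$secondLookupWasHit = $again->isHit(); // the stored null is reported as a hit
$secondValue = $again->get();
```

Because the hit flag and the value come from the same lookup, there is no window between two calls in which another process could change the answer.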
Scaling Multi Support
Unfortunately our work is not done, as we are still putting a lot of work on the libraries using multi functions. Ideally that work would move into the caching library instead.
It turns out that with our current Item Model this is remarkably easy to do. In fact, we don't have to change how we make multiGet calls at all! Since getItems returns an array, it's trivial for caching libraries to return an ArrayObject instead, one that runs the "multiGet" call in the background as needed to fill its buffer.
Setting multiple items at once is a little more complicated. With the previous method we were stuck forcing calling libraries to handle the buffer size themselves, but that responsibility can easily be passed to the caching library. To do this we toss our saveItems function out the window and replace it with saveItemDeferred.
    interface CachePoolInterface
    {
        public function getItem($key);
        public function getItems(array $keys);
        public function saveItem($item);
        public function saveItemDeferred($item);
        public function commit();
        public function clear(array $keys);
    }
This new function saves the passed Item in its own buffer, and then clears that buffer at a rate that can be configured. Calling libraries can flush the buffer to cache with the new commit function. For caching systems that don't support multi functions at all a buffer of "one" can be set and applications won't be forced to use extra memory for no advantage.
This takes more work away from the calling libraries and gives the frameworks and higher level applications more control over how caching should function.
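The buffering idea can be sketched roughly as follows. Everything here is an assumption made for illustration: the `BufferedPool` class, its constructor argument, and the `batchWrites` counter are hypothetical, and to keep the sketch short `saveItemDeferred` takes a key and value directly rather than an Item object.

```php
<?php
// Hypothetical pool that buffers deferred saves; the backend is just an array.
final class BufferedPool
{
    private array $backend = [];
    private array $deferred = [];
    private int $bufferSize;

    // Counts simulated multi-set calls, purely for demonstration.
    public int $batchWrites = 0;

    public function __construct(int $bufferSize)
    {
        // A buffer of one means every deferred save is written immediately.
        $this->bufferSize = max(1, $bufferSize);
    }

    public function saveItemDeferred(string $key, $value): void
    {
        $this->deferred[$key] = $value;
        if (count($this->deferred) >= $this->bufferSize) {
            $this->commit();
        }
    }

    public function commit(): void
    {
        if ($this->deferred === []) {
            return;
        }
        // A real backend would issue one multi-set call here.
        $this->backend = array_replace($this->backend, $this->deferred);
        $this->deferred = [];
        $this->batchWrites++;
    }

    public function has(string $key): bool
    {
        return array_key_exists($key, $this->backend);
    }
}

// The buffer size is chosen by whoever configures the application,
// not by the library that is saving items.
$pool = new BufferedPool(3);

foreach (['a' => 1, 'b' => 2, 'c' => 3, 'd' => 4] as $key => $value) {
    $pool->saveItemDeferred($key, $value);
}

$writesBeforeCommit = $pool->batchWrites;  // one automatic flush after three items
$dPendingBeforeCommit = !$pool->has('d');  // 'd' is still sitting in the buffer

$pool->commit();                           // the calling library flushes the rest
$writesAfterCommit = $pool->batchWrites;
```

The calling code never mentions a batch size; it only defers saves and commits at the end, which is the whole point of moving the buffer into the caching library.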
Misses with MultiGet
The final issue to resolve with Multi functions is what to do in the event of a cache miss on an Item, and it turns out that's the easiest answer of all: nothing! The only difference in the Pool model between retrieving an Item and a set of Items is that in one case you have an array. The isHit and get functions do exactly what you need from them.
There is one enhancement we can make to really ease development, and that's to add a function to the Item to allow developers to retrieve the key directly from it.
    interface CacheItemInterface
    {
        public function isHit();
        public function getKey();
        public function get();
        public function set($value, $ttl = null);
    }
Putting this all together makes for a pretty straightforward way to deal with groups of values.
    $items = $pool->getItems($keyList);
    foreach ($items as $item) {
        if (!$item->isHit()) {
            // Regenerate the value on a miss and write it back to the pool.
            $item->set(expensiveFunction($item->getKey()));
            $pool->saveItem($item);
        }
        $data = $item->get();
        // Do thing you need to do
    }
A Matter of State
One of the major invisible differences between the two proposals is that of state.
The driver model approaches each action as a standalone action. In fact, if it wasn’t for configuration needs you could make the driver model a set of static functions or even just procedural code without noticing any real difference. When actions occur for an item (it’s added or removed via the driver, for example), the only way to save any state about that is to do it in the driver itself, where it will then have to be managed alongside the state of every other item.
Added item state has powerful implications. It makes solving the null/exists problem ridiculously simple, as has already been shown. With the ability to store item state, caching libraries can also attack other problems that are invisible to the outside developer.
When implementing stampede protection, for instance, there is often a lock or flag that is set to let other processes know that one is working on the refresh so they don’t also do it and overload the system. In cases where there’s an error and the flag isn’t reset (such as when an exception is thrown by the refresh code) it can have serious repercussions for the system as all of those processes wait for that refreshed value. With the Pool/Item model the developer can put something to clear the lock right in the Item’s destructor so it gets cleared right when the item is out of scope.
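A rough sketch of the destructor idea, assuming a hypothetical `LockRegistry` that stands in for whatever flag the backend actually uses. None of these class names come from the standard or from any particular library. The Item takes the lock when it is created and gives it back when it is destroyed, so an exception in the refresh code cannot leave the lock behind.

```php
<?php
// Hypothetical lock registry standing in for the backend's lock or flag.
final class LockRegistry
{
    private array $locks = [];

    public function acquire(string $key): void
    {
        $this->locks[$key] = true;
    }

    public function release(string $key): void
    {
        unset($this->locks[$key]);
    }

    public function isLocked(string $key): bool
    {
        return isset($this->locks[$key]);
    }
}

// Hypothetical item that holds a regeneration lock for as long as it is alive.
final class LockingItem
{
    private LockRegistry $locks;
    private string $key;

    public function __construct(LockRegistry $locks, string $key)
    {
        $this->locks = $locks;
        $this->key = $key;
        $this->locks->acquire($key);
    }

    public function __destruct()
    {
        // Runs when the last reference goes away, including when the
        // enclosing function is unwound by an exception.
        $this->locks->release($this->key);
    }
}

function refresh(LockRegistry $locks, string $key): void
{
    $item = new LockingItem($locks, $key);
    // Simulate the refresh code blowing up before it can clean up after itself.
    throw new RuntimeException('regeneration failed');
}

$locks = new LockRegistry();

// While an item is alive, the lock is held.
$probe = new LockingItem($locks, 'probe');
$heldWhileAlive = $locks->isLocked('probe');
unset($probe);
$releasedAfterUnset = !$locks->isLocked('probe');

// When the refresh code throws, the item goes out of scope and the lock clears.
try {
    refresh($locks, 'report.daily');
} catch (RuntimeException $e) {
    // Swallowed for the sake of the demonstration.
}
$lockedAfterFailure = $locks->isLocked('report.daily');
```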
There are also benefits to performance monitoring. By having an Item record the time between when its status is checked and when it’s given a new value it’s possible to calculate the length of time spent on regenerating that value. That info can be stored to show how effective the caching system is.
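One way an Item could keep that timing is sketched below. The `TimedItem` class and its `getRegenerationSeconds()` method are hypothetical and only meant to show where the state would live; the sleep is a stand-in for real work.

```php
<?php
// Hypothetical item that times the gap between reporting a miss and set().
final class TimedItem
{
    private bool $hit;
    private $value;
    private $missSeenAt = null;          // hrtime() reading taken on the first miss
    private ?float $regenerationSeconds = null;

    public function __construct(bool $hit, $value = null)
    {
        $this->hit = $hit;
        $this->value = $value;
    }

    public function isHit(): bool
    {
        if (!$this->hit && $this->missSeenAt === null) {
            // Start the clock the first time a miss is reported to the caller.
            $this->missSeenAt = hrtime(true);
        }
        return $this->hit;
    }

    public function set($value): void
    {
        if ($this->missSeenAt !== null) {
            // hrtime(true) is in nanoseconds; convert the difference to seconds.
            $this->regenerationSeconds = (hrtime(true) - $this->missSeenAt) / 1e9;
        }
        $this->value = $value;
    }

    public function get()
    {
        return $this->value;
    }

    public function getRegenerationSeconds(): ?float
    {
        return $this->regenerationSeconds;
    }
}

$item = new TimedItem(false);

if (!$item->isHit()) {
    usleep(20000); // stand-in for an expensive computation
    $item->set('fresh value');
}

$elapsed = $item->getRegenerationSeconds();
```

The calling code is the same miss-then-set pattern as before; the measurement happens entirely inside the Item.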
All of those ideas can be emulated with the driver model, but it presents complications. To solve the locking issue the driver would have to maintain a record of all of its locks and clear them on destruction, or on a scheduled basis for long running processes, or else it’s no better than relying on the TTL and you get the performance hit. The performance monitoring can be solved by another index. In both cases it’s extremely important to monitor the size of those indexes, as they won’t automatically clear when the cache value is no longer in use like they will if the Item is cleared when using the Pool model. This gets particularly dangerous when using multi functions.
These are just a couple of quick examples, and they are not even things that are required by the standard. The point, though, is that the standard shouldn’t get in the way of a library that wants to implement things like this, and getting in the way is exactly what the driver model does.
Fear, Uncertainty and Doubt
Myth: Exists doesn’t save memory and therefore shouldn’t exist.
This one was already mentioned above, but to just drive the point home: this was never meant to be a memory preservation device, and any claims otherwise are red herrings. This function is here to keep a clean consistent API with regards to checking for cache misses.
Myth: This is all just too complex.
This is by far one of the things I hear most about this approach, and I have to admit it puts me at a loss. To boil this one down, there are people out there who feel that by splitting the caching library from one class into two we are increasing the complexity of the standard to a level that is unacceptable.
On one hand you have a driver, and on the other you have a repository and items from the repository. In both cases you’re looking at the same functions as before, but with the addition of a few more that ease development. In fact the driver model can be converted to the pool model with amazing ease.
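To give a feel for how small that conversion is, here is a sketch of an adapter that puts a pool-style front on a driver-style cache. All three classes (`LegacyDriver`, `AdaptedItem`, `DriverPoolAdapter`) are hypothetical, TTLs are ignored, and wrapping the value in a one-element array is just one possible way to keep a cached false or null distinct from the driver's miss value.

```php
<?php
// Hypothetical driver-style cache, standing in for an existing library.
final class LegacyDriver
{
    private array $data = [];

    public function get(string $key)
    {
        return $this->data[$key] ?? null;
    }

    public function set(string $key, $value): void
    {
        $this->data[$key] = $value;
    }
}

final class AdaptedItem
{
    private string $key;
    private $value;
    private bool $hit;

    public function __construct(string $key, $value, bool $hit)
    {
        $this->key = $key;
        $this->value = $value;
        $this->hit = $hit;
    }

    public function getKey(): string { return $this->key; }
    public function isHit(): bool { return $this->hit; }
    public function get() { return $this->value; }
    public function set($value): void { $this->value = $value; }
}

final class DriverPoolAdapter
{
    private LegacyDriver $driver;

    public function __construct(LegacyDriver $driver)
    {
        $this->driver = $driver;
    }

    public function getItem(string $key): AdaptedItem
    {
        // One driver call decides both the hit state and the value.
        $envelope = $this->driver->get($key);
        $hit = is_array($envelope) && array_key_exists('v', $envelope);
        return new AdaptedItem($key, $hit ? $envelope['v'] : null, $hit);
    }

    public function saveItem(AdaptedItem $item): void
    {
        // The envelope keeps a cached false or null from looking like a miss.
        $this->driver->set($item->getKey(), ['v' => $item->get()]);
    }
}

$pool = new DriverPoolAdapter(new LegacyDriver());

$item = $pool->getItem('feature.enabled');
$missFirst = !$item->isHit();

$item->set(false);
$pool->saveItem($item);

$again = $pool->getItem('feature.enabled');
```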
Myth: The standard doesn’t follow OOP guidelines or take advantage of composite classes.
Another of my favorite red herrings is that this standard ignores OOP and doesn’t use composite classes. This is typically followed with examples of extending the driver model to do things like add namespaces, with the argument that other features inside of this standard are easily added this way.
The reason I consider this a red herring is that none of the examples listed are unavailable to either model. This proposal was specifically designed to be a minimal required spec, so things like Tags, Namespaces and Stacks were ripped out of it to be placed in a future PSR specifically designed for optional features (don’t judge that proposal too harshly, it has been ignored while work on PSR-6 goes forward). There is no difference between the way the driver or pool model would handle this when it comes to OOP modeling: in both cases the classes that can support the features implement interfaces announcing that they do so.
The driver model, while great as a direct interface to a caching backend, fails at some of the basic principles of object-oriented design. The driver is responsible for all things: connecting with the service, setting and retrieving individual items, item invalidation, clearing or flushing the caching pool. The driver is responsible for the caching system, the items returned from it, and every idea that the caching library developers have to improve performance. By applying the Single Responsibility Principle and creating one class for the caching system as a whole and another for the individual entries in it, we get some really solid API design.
Could this be split up more? Probably, but when asked the group decided that their minimum standards should all be included in the standard. The reason for this is pretty simple- if they’re minimal standards that have to be supported then adding interface checks to make sure those specific features are present (even though they’re already required so we know they are) would be unnecessary boilerplate. That being said any pull requests to improve that would be looked at seriously.
Myth: The Pool/Item model isn’t used anywhere.
Stash has been using this model since its inception, and as good ideas form in the PHP FIG group about this model they’ve been incorporated back in. With just shy of 100k installs on Packagist, that is a decent amount of usage. Stash itself has seen contributions from numerous large sites (including major contributions from one of the top three adult sites in the world, just so you know it scales).
Beyond crap I’ve written you can see this model in numerous places as a value wrapper around cached items. Many of the projects in FIG, such as Drupal and eZ Publish, support this model.
Outside of the PHP world this is even more common. The two most popular Java caching systems, Ehcache and OSCache, are based on this model (Ehcache uses a Cache and Element while OSCache uses Cache and CacheEntry). The Spring Framework utilizes a Cache ValueWrapper object. This model is pretty much all over the Java world. Anyone familiar with Microsoft programming will also immediately recognize this, as the .NET framework uses this model as well.
Simply put, the only way someone could claim that this model was made up by FIG is if they’ve literally never looked at caching libraries in other languages. You cannot spend any time at all building caching libraries or services without running into this, especially if you’ve ever worked on enterprise level software.
Myth: This is just standardizing Stash.
This is another favorite of mine, as it has absolutely no technical reasoning or even criticism of the exact proposal. This is like claiming PSR-0 and PSR-4 were for Composer, that PSR-3 simply standardized Monolog, or that PSR-7 is going to standardize Guzzle. On one hand you have people screaming that proposals don’t match the real world, but then if they’re too close to the real world the other hand tries to slap you for it.
Stash and PSR-6 are very similar, but if you look at their history over time you’ll see that they’ve both grown, and that PSR-6 has in fact influenced Stash quite a bit. Although I do love Stash I know it’s not perfect, and in the process of making PSR-6 lots of people brought up great ideas and viewpoints that got incorporated into the standard but were different from what Stash did. Rather than wait for the standard to be finished, those ideas were incorporated into Stash where appropriate. Some ideas are just too good to be ignored.
In Closing
This was a broad overview of PSR-6 that should get people up to speed on the concepts behind it. It should be noted that there are differences between the full proposal and what I discussed here, as there were smaller changes and ideas that PHP-FIG members wanted incorporated.
This standard has a lot of potential, and the backing of many major caching libraries and projects. If anyone has questions please feel free to ask!