User:Terrye/ACM Strategies

= ACM Strategies =

Caveat

 * This page is very much a brainstorm dump of some thoughts for discussion, mainly for the dev team, so I am assuming a detailed knowledge of the ACM and caching design. Once we've had this discussion, I'll write up a general overview as part of the tuning tutorial set for phpBB administrators and then archive or delete this page.

Introduction
I've already laid out some of my groundwork in my other sandbox Use Case Discussion  and I/O Performance Impacts papers. This is based on analysis of the SQL and Apache logs from the OOo and VBox live systems, plus a couple of VM test instances which run full copies of these live sites, plus lots of little Perl scripts and spreadsheets. In terms of design points for the caching, I think that the main points are:
 * There are two main installation use cases. The first is the shared webhosting service (SWS), and with this you are basically in the lap of the Gods: you might actually experience reasonable performance, but in general the service provider makes no SLA commitments and such services run at a high contention ratio on their hosting servers. They can therefore have poor cache hit ratios; this in turn results in relatively high physical I/O rates on physical HDDs, and again the high contention can lead to device queuing. The user has little or no flexibility to optimise file, MySQL or PHP caching, so the main infrastructure performance tuning is outside their control and the only performance boost can be achieved at the application level, by application caching and by avoidance of logical I/O that could result in physical I/O. With this sort of BB installation the typical transaction rate will be one webpage per minute or less, and the typical number of online users will be one to a few. The only ACM option available is the acm_files module.
 * Comment: I have such an entry-level service for a wiki that I run, and looking at the top, iostat and MySQL stats, the actual contention levels are very reasonable and I could get reasonable phpBB performance off this. Hit and miss.


 * The second case is the Private Virtual Server (PVS), and here you do have control over the Apache, PHP and MySQL tuning and can therefore materially affect the level of file, MySQL or PHP caching. Such services typically have an SLA guaranteeing RAM resources and contention levels. The Xcache, APC and memcache options are available and will result in major performance dividends, so in this case it makes no sense to use the default acm_files module. (A dedicated host service (DHS) is just an extreme case of this where the contention is zero and the guaranteed resources are the entire host hardware.) This sort of PVS can run at 60+ webpages per minute, and a DHS more than this. Typical numbers of online users will be tens to hundreds. This type of service is also more likely to be regularly indexed by the main bots.


 * It therefore makes sense to optimise the ACM files module for the SWS / one-to-few concurrent users case and the ACM xcache, apc and memcache modules for the PVS 10/100+ concurrent users case. They are very different animals.


 * The majority of webpage transactions are viewtopic (~75-80%), viewforum (7-10%), with logons, board index, search and posting all in the 2-5% range. The rest of the functions are crumbs in hits terms (e.g. UCP and ACP); on my servers they barely register in the volumetrics.


 * If the service is being indexed by the major bots, then it is quite possible that bot access will form the major webpage load on the service.

I now want to discuss the design of the various caching implementations, first in general terms from my analysis of live logs and static code analysis, and then specifically for the two scenarios: the file-based cache against the SWS profile and the memory-accelerator caches against the PVS one.

General Observations

 * The general concept of caching the expensive SQL queries and the current implementation is broadly a sound one, so most of my suggestions are as a result of refactoring based on volumetrics or similar tweaks.
 * I am taking as a given some of the changes that Chris already has in SVN, such as the introduction of the ACM xcache, apc and memcache modules and the change of the files module to write non-PHP files, ...
 * The current architecture divides SQL queries into five TTL domains based on their volatility:
 * 1 year TTL. There are a bunch of board configuration parameters that are essentially piecewise static, because they are flushed as required by the ACP following configuration changes. They are held as instance properties of the ACM cache object: global, _acl_options, _bots, _cfg_imageset_, _cfg_template_, _cfg_theme_, _disallowed_usernames, _extensions, _hooks, _icons, _modules_acp, _modules_mcp, _modules_ucp, _ranks, _role_cache, _word_censors (more accurately, as elements of the associative array $this->vars).
 * 1 hr TTL. Other reference data from the forum relating to the forum hierarchy and its configuration, the style structure, the bbcodes and the moderator cache.
 * 5-10 min TTL. ACP views of the forum structure, smileys and another style view.
 * 1 min TTL. Currently only one query is in this band, and that is a count of online guests (all or by forum), to be displayed optionally as an "ONLINE" footer to the board and viewforum webpages.
 * Not Cached. The remaining queries are not cached by the phpBB application. Note that these queries are still cached in the MySQL Qcache, which actually uses a similar algorithm to phpBB for deleting cached entries: all entries depending on table X are deleted when X is updated. This does mean that most of the queries on low-volatility tables would be cached in MySQL anyway.
 * Functionality Note: I have analysed the cases where multiple logged-on sessions have the same IP, and these are more commonly different users behind the same corporate or ISP proxy than the same user logged on through two browsers. This being the case, we have no reason to assume anything different for guests, and these should be counted as separate guests. I would therefore recommend COUNT(s.session_id) as a better estimator of num_guests than COUNT(DISTINCT s.session_ip).
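To make this concrete, here is a hedged sketch of the changed estimator. The table and column names mirror the phpBB 3.0 schema, but the constants and the five-minute online window are assumptions defined locally so the snippet stands alone:

```php
<?php
// Illustrative sketch only: count guest sessions directly instead of
// distinct IPs. SESSIONS_TABLE and ANONYMOUS mirror phpBB 3.0 constants,
// defined here so the snippet is self-contained.
define('SESSIONS_TABLE', 'phpbb_sessions');
define('ANONYMOUS', 1);

function num_guests_sql($online_window = 300)
{
    // COUNT(s.session_id) treats each session as a separate guest, which
    // matches the observation that same-IP sessions are usually different
    // people behind one proxy. The old estimator was
    // COUNT(DISTINCT s.session_ip).
    return 'SELECT COUNT(s.session_id) AS num_guests
        FROM ' . SESSIONS_TABLE . ' s
        WHERE s.session_user_id = ' . ANONYMOUS . '
            AND s.session_time >= ' . (time() - $online_window);
}
```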


 * In aggregate, caching of queries can save seconds of response time for each page view on busy systems.
 * The rationale for the 1-year / 1-hour / 5-10-min split of cache lifetimes seems very unclear to me. I suspect that it's as much an artefact of different areas being coded by different developers who chose different TTL values. Anyway, most phpBB admins probably do the same as me: because of funny phpBB gremlins, I've learnt to clear the cache through the ACP main page whenever I change the bbcodes, the forum structure or styles (say once per month). Given that the cost of an odd extra query every hour is something like 0.001% of the SQL processing load, why not just stick with one standard TTL for this quasi-static data, say 1 hr, unless there are clear business reasons for deviating from it? The 1-minute case is a separate one which I discuss below.
 * The current algorithm for tagging queries by table is quite complex in both the files and accelerator cached versions, yet flush activities are actually very rare, so there is no practical advantage in implementing the per-table flush. (Incidentally, the current algorithm for cleaning up queries by table also fails in the case of LEFT JOIN queries, though this is a case of two wrongs making a right, in that the only cached LEFT JOIN query only needs to be flushed on change of the first table.)
 * Functionality Note: Switch from a per-table purge of the SQL cache to a purge of the entire variable cache. This simplifies the main-path coding in the accelerator cache version, as the sql_ variables no longer need to be maintained, at the cost of a few seconds' processing perhaps once per day, spread over 3-4 webpage queries.
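A minimal sketch of what this simplification might look like. The method names (put/get/purge/destroy) are loosely modelled on the ACM interface but are assumptions, not the actual signatures:

```php
<?php
// Illustrative sketch: with the per-table bookkeeping dropped, destroy()
// on the SQL cache degenerates to a full purge. Not the real ACM class.
class acm_sketch
{
    private $store = array();

    public function put($key, $value, $ttl)
    {
        $this->store[$key] = array('value' => $value, 'expires' => time() + $ttl);
    }

    public function get($key)
    {
        if (!isset($this->store[$key]) || $this->store[$key]['expires'] < time()) {
            return false; // miss or expired
        }
        return $this->store[$key]['value'];
    }

    public function purge()
    {
        // One-shot flush: cheap because flushes are rare (roughly daily)
        // and the rebuild cost is amortised over the next few page requests.
        $this->store = array();
    }

    public function destroy($var_name, $table = '')
    {
        // Old behaviour walked sql_* bookkeeping variables to find queries
        // touching $table; now any table change just drops the whole cache.
        $this->purge();
    }
}
```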

Design Points on the ACM Files Cache

 * The current globbing of the $this->vars variables into the single file object global_data makes a lot of sense in performance terms, as this consolidates what would otherwise be a dozen or so separate file reads (or writes) with the extra I/Os they need. It also improves the "hotness" of this file, making it more likely to be cached, and the CPU overhead is minimal. So a definite "+1" on this.
 * The change from bulk read/write to fopen + fgets + fclose saves absolutely nothing in I/O terms. The main path reads the entire file content anyway, and even if it didn't, the files are so small that we wouldn't save any physical I/Os. It is far better and easier to use file_get_contents and file_put_contents, as these are the optimum methods of loading and saving entire files when you want to process the whole glob in one go. These functions are also atomic with respect to other PHP processes, so there is no chance of getting inconsistent results.
 * The one thing that you do need to guard against (which the code currently doesn't) is that these file operations, certainly on Windows, are not blocking; that is, if there is a collision where a sister process has the file open, the open for read or write will fail with an ERROR_SHARING_VIOLATION (and AFAIK this is also the case on Linux). This is a soft error, in that the chances of it are quite small, but on a busy system you could be rolling this dice a million times a day if you use the ACM files cache. It will happen from time to time, especially if your system has gone into disk queue overload. (We had an account where a system developed by one of our subcontractors was failing because of this bug, and it ended up costing us a fortune — like take my annual salary and start adding noughts at the end — so this one is burned into my subconscious :lol:) The easiest thing to do here is to wrap file_get_contents and file_put_contents with a retry, say two times, with a small random delay between attempts. Alternatively, we look at the end-user use case for when this happens and adopt the strategy of requesting the browser to retry the URI.
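A hedged sketch of such a wrapper, assuming a small fixed retry count and a random back-off; the function names are hypothetical, not the existing _read/_write:

```php
<?php
// Illustrative sketch: ride out transient sharing violations (mainly seen
// on Windows when a sister process has the cache file open) by retrying
// the whole-file read or write a couple of times.
function cache_read_with_retry($file, $tries = 3)
{
    for ($i = 0; $i < $tries; $i++) {
        // Suppress the warning; a false return signals the collision.
        $data = @file_get_contents($file);
        if ($data !== false) {
            return $data;
        }
        // Small random back-off (1-50 ms) so colliding processes desynchronise.
        usleep(mt_rand(1000, 50000));
    }
    return false; // treat as a cache miss after repeated failures
}

function cache_write_with_retry($file, $data, $tries = 3)
{
    for ($i = 0; $i < $tries; $i++) {
        if (@file_put_contents($file, $data) !== false) {
            return true;
        }
        usleep(mt_rand(1000, 50000));
    }
    return false; // caller falls back to an uncached query
}
```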
 * I also question the added length check. As I said, file_(get|put)_contents are atomic. Also, the algorithm for unserializing the datastream would in this case fall into the code path that deletes the cached copy anyway. I suspect that you are doing this because you seem to get the occasional error writing to the cache directory on busy systems; see my previous bullet.
 * If you do this then you should be able to reduce the size of _read and _write by about 3x.
 * If you implement my suggested "one purge fits all" strategy then the _destroy table option is a lot simpler — you just call purge.
 * I can't understand the rationale for doing a phpbb_chmod($file, CHMOD_READ | CHMOD_WRITE) after writing a file to the cache. I can't think of a scenario where you could get here and need this unless you had a umask of 577 or the like, and that would be just bizarre. More to the point, this function definitely doesn't do what its name implies: if I comment this line out then my files have the chmod mask 644, which is what I expect, but after this function they end up with a mask 620. Bizarre. What is the point?
 * Caching the count of guests (on its own) seems an odd thing to do. For low-volume SWS systems this will rarely save the query, as the percentage of viewtopics which repeat-hit any given forum within 1 min won't be high. It's also a very cheap SQL query when the session table is small (as in this case), so there seems to be no valid case for caching this one. My recommendation is not to cache it, and thus remove all the consequential writes to the cache, which almost certainly have a higher system overhead than caching this query saves.
 * Probably the simplest method of conditionally caching would be to introduce another static configuration variable, say query_minimum_TTL, which would default to, say, 120s. Any TTL values less than this minimum would be zeroed, and the query therefore not cached.
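The gate could be sketched as follows; query_minimum_TTL is the variable name suggested above, while the sql_save_sketch signature is purely illustrative:

```php
<?php
// Illustrative sketch of the suggested query_minimum_TTL gate. TTLs below
// the threshold are zeroed, so cheap, rarely-repeated queries such as the
// guest count are simply never written to the cache.
function effective_ttl($requested_ttl, $query_minimum_ttl = 120)
{
    return ($requested_ttl < $query_minimum_ttl) ? 0 : $requested_ttl;
}

function sql_save_sketch($query, $result, $ttl)
{
    $ttl = effective_ttl($ttl);
    if ($ttl === 0) {
        return false; // below the minimum: skip the cache write entirely
    }
    // ... serialise $result into the backing store with lifetime $ttl ...
    return true;
}
```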

Design Points on the ACM Memory Accelerator Caches

 * I don't have any major comments on the ACM memory-accelerator-based variants as regards existing functionality, apart from the tweaks and changes that I've discussed either above or previously (especially dropping the per-table purge so that we no longer need to maintain the sql_ variables).
 * I do think that we could do with a mysqli cache, because this would really help to fill in the gap between the current SWS and HVS sweet spots for higher-performing SWS where the files cache is proving a limitation. I am just upgrading my OOo forums to 3.0.4 and I'll be updating mine here.
 * One area of functional extension that I do think merits discussion is the use of this "1 minute TTL" category for high-hit-rate forums, typically those hosted on HVS or DHS offerings. But here the drill-down merits separate discussion for the three main hitters: viewtopic, viewforum and index, reviewing all queries to see which ones might benefit from being cached at this sort of lifetime. For example, the entire "WHO IS ONLINE" summary could be on a 60s cache, since this is quite an expensive adjunct and true coherency is not an application issue here.
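As an illustration of the idea (not the actual phpBB code), the rendered block could be fetched through the cache with a 60 s lifetime; the get/put calls mirror the ACM interface, and the builder callback is hypothetical:

```php
<?php
// Illustrative sketch: cache the fully rendered "WHO IS ONLINE" block for
// 60 s instead of re-running its session queries on every page view.
function get_who_is_online($cache, $build_callback)
{
    $html = $cache->get('_who_is_online');
    if ($html === false) {
        $html = $build_callback(); // expensive session-table queries + render
        // Strict coherency doesn't matter for this display-only block,
        // so a 60 s lifetime is acceptable.
        $cache->put('_who_is_online', $html, 60);
    }
    return $html;
}
```

On a board serving 60+ pages per minute, this turns roughly 60 executions of the online-list queries per minute into one.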