Coping with heavy server loads
By Flib
2009-05-19
Category: Architecture
The Problem
Not every site has been dugg or slashdotted, but when it happens the server is frequently unable to cope with the sheer number of requests fired at it. However, all is not lost: there are many little tricks that together can mean the difference between weathering the storm and being trampled by it.
The Solution
Caching PHP Output
The average PHP script on most non-trivial sites may use 5 or more queries to build the page, so if many users hit your server at once the load on the database server will increase very quickly.
For many pages, the content changes only infrequently once it is published, and even when it does change it probably doesn't matter if a user receives a version that is 5 or 10 minutes old (and if it does matter, we can stop that happening if we are careful). If this is the case, then PHP output caching can probably work for you.
Caching doesn't need to be done on a whole-page basis. If you have a block of code that retrieves a list of the most recent articles on your site, it probably only needs to be updated once an hour, whereas other content on the page may need to be fresh on every request.
Here is an example:
function tpl_rss($feedid=FALSE){
    //start of cache code
    require_once "Cache/Lite.php";
    $options = array(
        'cacheDir' => 'cache/tpl_rss/',
        'lifeTime' => 300 //seconds
    );
    $Cache_Lite = new Cache_Lite($options);
    $id='tpl_rss';
    //one cache entry per feed; get() returns false on a miss or an expired entry
    if ($data=$Cache_Lite->get($id.$feedid)) {
        return '<!-- Cache Hit -->'.$data;
    }
    //end of cache code
    if (!$feedid) {
        $query="select * from feed,feedentry where feed.feedid=feedentry.feedid and display=1 order by feed.feedid,feed.priority, feedentry.entrydate DESC";
    } else {
        //cast to an integer so the value is safe to place in the query
        $query="select * from feed,feedentry where feed.feedid=feedentry.feedid and feed.feedid=".(int)$feedid." order by feedentry.entrydate DESC limit 6";
    }
    $result=mysql_query($query) or die('error in query');
    $oldid=0;
    $rss='<div class="rss">';
    while ($line=mysql_fetch_assoc($result)) {
        //print the feed heading once, when the feed changes
        if ($line['feedid']!=$oldid) {
            $rss.='<h3><a href="'.htmlentities($line['feedlink']).'">'.htmlentities($line['feedname']).'</a></h3>';
            $oldid=$line['feedid'];
        }
        $rss.='<p><a href="'.htmlentities($line['link']).'">'.htmlentities($line['entrytitle']).'</a><br /><span class="rssdate">'.$line['entrydate'].'</span></p>';
    }
    $rss.='</div>';
    //save a copy to the cache
    $Cache_Lite->save($rss,$id.$feedid);
    return '<!-- Cache Miss -->'.$rss;
}
The code above uses the PEAR Cache_Lite module to cache the output of the function for 300 seconds (5 minutes). If the cache entry doesn't exist or has expired, the function is run normally.
The same concept can be used with Memcache.
function tpl_rss($feedid=FALSE){
    //start of cache code
    $memcache = new Memcache;
    $memcache->connect('localhost', 11211) or die ("Could not connect");
    $id='tpl_rss';
    $cacheexpire=300; //seconds
    //look up the same key that is used when saving below
    if ($data = $memcache->get($id.$feedid)) {
        return '<!-- Cache Hit -->'.$data;
    }
    //end of cache code
    if (!$feedid) {
        $query="select * from feed,feedentry where feed.feedid=feedentry.feedid and display=1 order by feed.feedid,feed.priority, feedentry.entrydate DESC";
    } else {
        //cast to an integer so the value is safe to place in the query
        $query="select * from feed,feedentry where feed.feedid=feedentry.feedid and feed.feedid=".(int)$feedid." order by feedentry.entrydate DESC limit 6";
    }
    $result=mysql_query($query) or die('error in query');
    $oldid=0;
    $rss='<div class="rss">';
    while ($line=mysql_fetch_assoc($result)) {
        //print the feed heading once, when the feed changes
        if ($line['feedid']!=$oldid) {
            $rss.='<h3><a href="'.htmlentities($line['feedlink']).'">'.htmlentities($line['feedname']).'</a></h3>';
            $oldid=$line['feedid'];
        }
        $rss.='<p><a href="'.htmlentities($line['link']).'">'.htmlentities($line['entrytitle']).'</a><br /><span class="rssdate">'.$line['entrydate'].'</span></p>';
    }
    $rss.='</div>';
    //save a copy to the cache; the third argument is the flags value (0 = no compression)
    $memcache->set($id.$feedid, $rss, 0, $cacheexpire) or die ("Failed to save data at the server");
    return '<!-- Cache Miss -->'.$rss;
}
The only real functional difference between the two is that with Memcache the expiry is set when the cached data is saved, whereas Cache_Lite checks it when the cached data is retrieved.
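The earlier remark that a stale copy can be avoided "if we are careful" comes down to deleting the cached entry whenever the underlying content changes. The following is a minimal sketch of that idea using plain files rather than Cache_Lite or Memcache. The function names (cache_get, cache_set, cache_invalidate) and the directory layout are assumptions made for illustration, not part of either library.

```php
<?php
// Hypothetical helpers for illustration; names and file layout are assumptions.

// Build the file name used to store a given cache key.
function cache_path($dir, $key) {
    return rtrim($dir, '/') . '/' . md5($key) . '.cache';
}

// Return the cached string, or false if it is missing or older than $lifetime seconds.
function cache_get($dir, $key, $lifetime) {
    $file = cache_path($dir, $key);
    clearstatcache(); // make sure is_file()/filemtime() see the current state
    if (!is_file($file) || (time() - filemtime($file)) > $lifetime) {
        return false;
    }
    return file_get_contents($file);
}

// Store a string under $key, creating the cache directory if needed.
function cache_set($dir, $key, $data) {
    if (!is_dir($dir)) {
        mkdir($dir, 0777, true);
    }
    // LOCK_EX stops two requests from interleaving their writes.
    return file_put_contents(cache_path($dir, $key), $data, LOCK_EX) !== false;
}

// Remove an entry. Call this from the code that publishes or edits content,
// so the next request rebuilds the fragment instead of serving a stale copy.
function cache_invalidate($dir, $key) {
    $file = cache_path($dir, $key);
    clearstatcache();
    return is_file($file) ? unlink($file) : true;
}
```

With the libraries used above, the equivalent step would be a call to Cache_Lite's remove() or Memcache's delete() with the same key that tpl_rss() uses, made from whatever code adds or edits a feed entry.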
Make output cachable
At its simplest, adding a couple of extra lines to your script can reduce your load by a massive amount.
For example:
//set HTTP/1.0 expires for 1 day
header('Expires: '.gmdate('D, d M Y H:i:s', time()+3600*24).' GMT');
//set HTTP/1.1 expiry for 1 day
header('Cache-Control: max-age='.(3600*24));
This is an unconditional caching directive, since we don't tell the client or proxy that it must ask the server whether the page has changed. It may result in the same content being seen by the user until the cache expires or they force a reload in their browser. For this reason it is best used for output that may change, but where it's no big deal if the user gets an old version.
Some browsers will ignore an Expires or Cache-Control header when a Last-Modified or ETag header is sent. In addition, if you use PHP sessions, by default the session.cache_limiter directive is set such that a Pragma: no-cache header is sent when the session is started. This is a sensible default, but not useful if you are trying to make your site data cachable.
We can request that the client or proxy ask the server if the content has changed using the following changes to the code.
//set HTTP/1.0 expires for 1 day
header('Expires: '.gmdate('D, d M Y H:i:s', time()+3600*24).' GMT');
//set HTTP/1.1 expiry for 1 day
header('Cache-Control: max-age='.(3600*24).', must-revalidate');
This requires a little more work on the server side though. You need to tell the client and any proxies what the Last-Modified date of the file is, or provide an ETag. In addition to this, you need to be able to process If-Modified-Since headers (for Last-Modified) or If-None-Match headers (for ETag) and, if there has been no change, tell the client (and proxies) using:
header('HTTP/1.1 304 Not Modified');
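The server-side check described above can be sketched roughly as follows. This is a simplified illustration, not a complete implementation: the function names is_not_modified and send_validators are assumptions, weak ETags (the W/ prefix) are not handled, and how you obtain the modification time and ETag for a page is left to your application.

```php
<?php
// Hypothetical helpers for illustration; the names are assumptions.

// Decide whether a 304 response can be sent.
// $server is normally $_SERVER, $lastModified is a Unix timestamp and
// $etag is the full header value including its double quotes.
function is_not_modified(array $server, $lastModified, $etag) {
    if (isset($server['HTTP_IF_NONE_MATCH'])) {
        // When If-None-Match is present it is used instead of If-Modified-Since.
        $candidates = array_map('trim', explode(',', $server['HTTP_IF_NONE_MATCH']));
        return in_array($etag, $candidates, true) || in_array('*', $candidates, true);
    }
    if (isset($server['HTTP_IF_MODIFIED_SINCE'])) {
        $since = strtotime($server['HTTP_IF_MODIFIED_SINCE']);
        return $since !== false && $lastModified <= $since;
    }
    return false;
}

// Send the validators that the client will echo back on its next request.
function send_validators($lastModified, $etag) {
    header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $lastModified) . ' GMT');
    header('ETag: ' . $etag);
}

// Typical use at the top of a page, before any output:
//   send_validators($lastModified, $etag);
//   if (is_not_modified($_SERVER, $lastModified, $etag)) {
//       header('HTTP/1.1 304 Not Modified');
//       exit;
//   }
```

The saving comes from exiting before the page is built: a 304 response carries no body, so the expensive queries and rendering are skipped entirely.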
Reduce connections
Connections are a problem in a few different ways.
- A client will generally limit the maximum simultaneous connections it makes to any particular hostname
- A server has a maximum number of connections it will service in parallel before it starts to queue (or even reject) additional connections
- Each connection takes time and server resources to set up, so reducing connections may result in faster apparent speed.
Which remedy is appropriate depends on which of these problems you are attempting to deal with.
If the problem is the maximum number of client connections, then simply using additional hostnames can result in more parallel requests. For example, images.example.com and www.example.com will generally be able to have more combined connections than www.example.com alone.
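One way to spread assets across several hostnames is sketched below. The function name asset_url and the hostnames images1.example.com and images2.example.com are assumptions for illustration; each hostname would need to serve the same files. The host is chosen from a hash of the path, so a given file always maps to the same hostname and the browser cache is not defeated by the same image appearing under two different URLs.

```php
<?php
// Hypothetical helper for illustration; the name and hostnames are assumptions.

// Map an asset path to one of several hostnames. The same path always
// yields the same host because the choice is derived from a hash of the path.
function asset_url($path, array $hosts) {
    // The first 6 hex digits of the md5 give a non-negative integer below 2^24.
    $index = hexdec(substr(md5($path), 0, 6)) % count($hosts);
    return 'http://' . $hosts[$index] . '/' . ltrim($path, '/');
}

// Example:
//   $hosts = array('images1.example.com', 'images2.example.com');
//   echo '<img src="' . asset_url('/img/logo.png', $hosts) . '" />';
```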
If the maximum number of connections on the server is the problem, then you can add additional servers, raise the limits or attempt to stop any single client from monopolising a connection. (See the Apache documentation for details on the KeepAlive and MaxKeepAliveRequests directives and related directives.)
One option that people rarely consider is to combine files. For example, it's fairly easy to append CSS and JavaScript files to each other and produce one larger file. This can have drawbacks if the larger file isn't cachable or if only a small amount of the file is actually used on each page, but overall it can reduce the number of requests by a large amount if done well.
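Combining files can be done once, as a build step, rather than on every request. The sketch below is one possible way to do it; the function name combine_files and the file names in the usage comment are assumptions for illustration.

```php
<?php
// Hypothetical helper for illustration; the name is an assumption.

// Concatenate several CSS (or JavaScript) files into one target file.
// Returns the number of bytes written, or false if any file could not be read
// or the target could not be written.
function combine_files(array $sources, $target) {
    $combined = '';
    foreach ($sources as $source) {
        $contents = file_get_contents($source);
        if ($contents === false) {
            return false;
        }
        // A comment marks where each file starts; the trailing newline guards
        // against a source file that lacks a final line break.
        $combined .= '/* ' . basename($source) . " */\n" . $contents . "\n";
    }
    return file_put_contents($target, $combined, LOCK_EX);
}

// Example, run whenever the stylesheets change:
//   combine_files(array('css/reset.css', 'css/layout.css'), 'css/all.css');
```

The /* */ comment style is valid in both CSS and JavaScript. For JavaScript, it is worth checking that each source file ends with a semicolon, since otherwise the last statement of one file can run into the first statement of the next.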
Compress your output
Often the disk and network on a webserver are fairly heavily loaded, but the CPU is running almost idle. This is increasingly common as servers get faster and faster. One option for utilising this spare capacity is to compress your output.
//$data contains the page content, maybe from output buffering.
//only compress when the client has said it accepts gzip
if (isset($_SERVER['HTTP_ACCEPT_ENCODING']) && strpos($_SERVER['HTTP_ACCEPT_ENCODING'], 'gzip') !== false) {
    header('Content-Encoding: gzip');
    //tell proxies that the response differs depending on Accept-Encoding
    header('Vary: Accept-Encoding');
    echo gzencode($data);
} else {
    echo $data;
}
Leave expensive processing to cronjobs
If you have user tracking, you want to record every page hit. Many people do this by inserting a row into a database table on every page load, and then generating each report on the fly whenever it is requested. This can be inefficient. Often it is more efficient to record the details in the lightest way possible, perhaps in a simple log file on disk, and then every few minutes collate the data, insert the aggregate into the database and create any necessary reports from it.
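A rough sketch of this split between the page load and the cron job follows. The function names record_hit and collate_hits and the tab-separated log format are assumptions for illustration. The database insert is left out, since it depends on your schema.

```php
<?php
// Hypothetical helpers for illustration; names and log format are assumptions.

// Called on every page load: append one line (timestamp, tab, path) to a log file.
function record_hit($logFile, $path) {
    $line = time() . "\t" . $path . "\n";
    return file_put_contents($logFile, $line, FILE_APPEND | LOCK_EX) !== false;
}

// Called from cron every few minutes: move the log aside so new hits go to a
// fresh file, then count the hits per path in the file that was moved.
function collate_hits($logFile) {
    $counts = array();
    if (!is_file($logFile)) {
        return $counts;
    }
    $work = $logFile . '.' . getmypid();
    if (!rename($logFile, $work)) {
        return $counts;
    }
    foreach (file($work, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        $parts = explode("\t", $line, 2);
        if (count($parts) !== 2) {
            continue; // skip malformed lines
        }
        $path = $parts[1];
        $counts[$path] = isset($counts[$path]) ? $counts[$path] + 1 : 1;
    }
    unlink($work);
    return $counts;
}
```

The cron script would then write one row per path from the returned array, so the database sees a handful of inserts every few minutes instead of one insert per page view.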
Optimise your queries
It's often the case that fewer than 20% of your queries occupy far more than 80% of your server resources. By optimising (or omitting) these queries you could potentially do up to five times more work on the same server. (For more information see Optimising MySQL.)
Install a PHP op-code Cache (APC/Zend Accelerator)
Every time the PHP interpreter loads a script it needs to compile it into code it can run. Most of the time on a production server, the script will not change from one execution to the next.
For this reason the op-code cache was developed. The basic concept is fairly simple: the work the interpreter does in compiling the script to a usable form is saved by caching the compiled output and reusing it on later executions.
Add noatime to your fstab
Without noatime as a mount option, Linux will record the access time every time a file is used. This can slow down file access to a large extent. Where access time is unimportant, you can turn off this feature and gain a speed boost for disk use.
/dev/VolGroup01/data /mnt/data ext3 defaults,noatime 1 1
/dev/md0 /boot ext3 defaults 1 2
#rest of file
Turn off non-essential logging
Where disk access is the limiting factor on a server, saving logs can reduce the resources available to your scripts. Often, especially if you have a front-end proxy or use external analytics services, these logs aren't needed.
With Apache, logging can be turned off at the vhost level, so it's not an all-or-nothing option.
To do this you can use something like:
<VirtualHost *:80>
ServerAdmin webmaster@example.com
DocumentRoot /www/docs/www.example.com
ServerName example.com
ServerAlias www.example.com
ErrorLog /dev/null
CustomLog /dev/null common
</VirtualHost>