Distributing PHP processing with Gearman
- Introduction to Gearman
Gearman is a generic framework for distributing processing jobs to separate processes, either on the same machine or on other machines in a cluster.
http://gearman.org/
It allows your application to perform tasks in parallel, balance the processing load, and even invoke code written in other languages.
The "Gearman" word is an anagram of "manager". Its purpose is solely to dispatch jobs that need to be executed, but Gearman just by itself does not do anything useful.
- Gearman architecture
A Gearman-based architecture has three types of elements:
* Job servers
Job servers are instances of the manager program. They behave as proxies, forwarding job requests from clients to workers that are currently available.
* Clients
The clients talk to a job server in order to request the execution of job functions.
* Workers
Workers are processes that handle job requests by executing code functions. Each worker makes a set of functions available by registering them with a job server.
This architecture can be made fault tolerant by running more than one job server instance, each managing many workers.
Another interesting feature of Gearman is that worker functions may be written in many different programming languages, such as C, PHP and other well known scripting languages. This way you can choose the right programming language for each task.
For instance, you can write a program in C or C++ that executes a heavy task and have it register itself as a worker daemon. This way you can call C/C++ libraries from PHP without having to turn them into PHP extensions, thus avoiding thread safety issues.
* Synchronous tasks
There is a kind of task that some Web applications need to execute which is relatively expensive in terms of RAM and CPU usage, but takes a relatively short period of time to finish. That is the case, for instance, of processing images or downloading pages or files from the Web.
If these tasks are executed directly inside the Web server process, they consume memory on the Web server machine. If the Web server memory is exhausted, the machine may halt and stop serving further Web requests.
Alternatively, if you delegate the execution of heavy tasks to workers running on a different machine, the Web server machine resources are saved and it can handle more simultaneous Web requests.
While the workers handle these heavy job requests synchronously, the clients wait until the jobs are finished without consuming any CPU.
Additionally, a worker process can handle many synchronous requests, one after another. A subsequent execution of a worker job may reuse resources computed during a previous run of the same job. This may save precious time when running repeated requests that need the same resources.
This possibility can be used to create database connection pools. The first run of a worker process may open a database connection and execute some queries.
A second run of the same worker job does not need to deal with the database connection opening overhead, as it is already opened since the first run.
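As a minimal sketch of this idea, a worker like the following could keep a database connection in a static variable, so it is opened only on the first job execution. The database credentials and the users table are just placeholders for this example:
<?php
$worker = new GearmanWorker();
$worker->addServer('localhost');
$worker->addFunction('user_count', 'user_count_fn');
while ($worker->work());

function user_count_fn($job)
{
    /* a static variable persists between job executions inside the same
       worker process, so the connection is opened only once */
    static $db = null;
    if ($db === null)
        $db = new PDO('mysql:host=localhost;dbname=some_db', 'user', 'password');
    $result = $db->query('SELECT COUNT(*) FROM users');
    return (string)$result->fetchColumn();
}
?>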
* Asynchronous tasks
Gearman also supports dispatching worker jobs asynchronously. This means that the client process does not wait for the worker job to finish.
Gearman keeps an internal job queue and assigns pending jobs to workers as they become free.
This may be useful for long duration tasks, such as sending e-mail notifications, posting Twitter messages, etc.
It can also be useful to implement a map/reduce approach, splitting large tasks into many small sub-tasks. A good example of this approach is a decentralized log analyzer.
* Distributing jobs to multiple servers
You can deploy multiple job servers on different machines to make the architecture more fault tolerant, thus avoiding a single point of failure in the Gearman architecture.
A client application can use a list of job servers. The Gearman client API picks one of them. If the picked job server becomes unavailable when it is called, the client automatically picks another job server from the list, as in the snippet below.
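For instance, the client API provides the addServers() function, which takes a comma separated list of host:port pairs. The addresses below are just placeholders:
<?php
$client = new GearmanClient();
/* 4730 is the default Gearman port; if one server is unavailable,
   the client falls back to another one from the list */
$client->addServers('192.168.1.1:4730,192.168.1.2:4730');
$return = $client->do('reverse', 'Hello World!');
?>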
* Persistent job queues
By default, Gearman keeps all job queues stored in memory. This means that if a job server crashes or restarts without finishing all queued jobs, those jobs will never be executed.
Alternatively, Gearman can use persistent queues for managing asynchronous job requests to solve this problem.
This is important because an asynchronous job is not attached to any client, so it would not be possible for the client that started the job to detect a failure and request the job again.
The persistent queue module of Gearman saves the job request details in a persistent storage, usually a database.
The job server waits for free workers that can execute the requested jobs in the queue. When the job is finished, it is removed from the queue.
If a job server crashes, when it restarts the persistent queue is used to reload the internal queue with the jobs pending to be run.
- Using Gearman with its PHP extension
There is already an extension for using Gearman directly from PHP scripts. It is available in the PECL PHP extension repository.
When using a regular PHP installation, you can use the "pecl" command to install this extension like this:
pecl install gearman
If you do not have the pecl command available in your PHP installation, you can install the Gearman PHP extension manually. First you download the latest version from here:
http://pecl.php.net/package/gearman
Then you need to extract the archive using the following Linux/Unix shell commands. The make install command may need to be executed by the root user.
tar xfvz gearman*
cd gearman*
phpize
./configure --with-gearman
make
make install
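After the installation, you may also need to enable the extension by adding a line like this to your php.ini file (the exact location of the file depends on your installation):
extension=gearman.so
You can confirm that the extension is loaded by running php -m and checking that gearman appears in the list of modules.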
There are also some pure PHP Gearman client implementations. However, I recommend that you use the PECL extension instead, not only because it is faster, as it is written in C, but also because it is a direct wrapper of the libgearman library provided by the Gearman developers.
- Starting a Gearman job server with a persistent job queue
Starting a Gearman job server is easy, but if you need it to use a persistent job queue, you have to take some extra steps.
First you need to set up the queue storage. For instance, if you want to use MySQL as the persistent queue storage, you need to create the queue table like this:
CREATE TABLE gearman_queue (
    `unique_key` VARCHAR(64) PRIMARY KEY,
    `function_name` VARCHAR(255),
    `priority` INT,
    `data` LONGBLOB
);
Then you need to run the Gearman manager program, passing several parameters that configure the connection to the MySQL server with the database that contains the queue table:
./gearmand -q libdrizzle --libdrizzle-host=10.1.1.1 --libdrizzle-user=gearman --libdrizzle-password=secret --libdrizzle-db=some_db --libdrizzle-table=gearman_queue --libdrizzle-mysql
Gearman also supports storing persistent queues in Memcached servers and SQLite databases.
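For instance, an SQLite based queue can be configured with a command like the following. The option names may vary between gearmand versions, so check the output of gearmand --help for your version:
./gearmand -q libsqlite3 --libsqlite3-db=/var/lib/gearman/queue.db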
- Distributing PHP applications by setting up PHP worker and client processes
The first thing you should do is create the PHP functions that will be called to execute the worker jobs. Then you need to register those job handling functions with a Gearman job server. Finally, you need to start accepting job requests.
Here follows a simple example of a PHP worker process script:
<?php
$worker = new GearmanWorker();
$worker->addServer('localhost');
$worker->addServer('10.1.1.1');
$worker->addFunction("reverse", "reverse_fn");
while ($worker->work())
{
    if ($worker->returnCode() != GEARMAN_SUCCESS)
    {
        syslog(LOG_ERR, "return_code: " . $worker->returnCode());
        break;
    }
}

function reverse_fn($job)
{
    /* the workload is the parameter string sent by the client */
    $param = $job->workload();
    return strrev($param);
}
?>
A job client script can look like this:
<?php
$client = new GearmanClient();
$client->addServer('localhost');
$client->addServer('10.1.1.1');
/* do() blocks until the worker returns the result; newer versions
   of the PECL extension call this method doNormal() */
$return = $client->do('reverse', 'Hello World!');
var_dump($return);
?>
As you may see, it is easy to deploy workers and clients with Gearman.
Of course, the above example is trivial. Now let's take a look at a more interesting example: a simple crawler job that executes in the background and provides status reports.
<?php
$worker = new GearmanWorker();
$worker->addServer('localhost');
$worker->addServer('10.1.1.1');
$worker->addFunction("crawler", "crawler_fn");
while ($worker->work())
{
    if ($worker->returnCode() != GEARMAN_SUCCESS)
    {
        syslog(LOG_ERR, "return_code: " . $worker->returnCode());
        break;
    }
}

function crawler_fn($job)
{
    /* the client sends the URL list as a serialized array */
    $param = $job->workload();
    $urls = unserialize($param);
    $count = count($urls);
    for ($i = 0; $i < $count; $i++)
    {
        $url = $urls[$i];
        $content = file_get_contents($url);
        /* this function might save the retrieved content into a database */
        save_content($url, $content);
        /* report progress so clients can query the job status */
        $job->sendStatus($i + 1, $count);
        sleep(1); /* pause a little while */
    }
}
?>
As you may notice, the worker function does not return any value. That is because it is meant to run in the background, and so there is no client waiting for a return value.
The client script may look like this:
<?php
session_start();
$client = new GearmanClient();
$client->addServer('localhost');
$client->addServer('10.1.1.1');
$urls = array(
    "http://www.google.com/",
    "http://www.phpclasses.org/",
    "http://crodas.org/"
);
/* the function name must match the one registered by the worker */
$jobid = $client->doBackground('crawler', serialize($urls));
$_SESSION['jobid'] = $jobid;
$_SESSION['urls'] = $urls;
print "Your pages were queued for downloading\n";
?>
The client does not wait for the job to finish because it runs in the background. The job identifier is stored in a session variable, so it can be retrieved later when it is necessary to check the status of the job, for instance by sending an AJAX request to a PHP script, which may look like this:
<?php
session_start();
if (!isset($_SESSION['jobid']))
    exit;
$client = new GearmanClient();
$client->addServer('localhost');
$client->addServer('10.1.1.1');
/* jobStatus() returns an array: whether the job is known, whether it is
   running, and the numerator and denominator of the completion status */
$status = $client->jobStatus($_SESSION['jobid']);
if ($status[0] && !$status[1])
{
    $response = array('status' => 'The job is still on the queue.');
}
elseif ($status[0] && $status[1])
{
    /* guard against a zero denominator before the first status update */
    $percent = $status[3] ? round($status[2] / $status[3] * 100) : 0;
    $response = array('status' => "Running: $percent%");
}
else
{
    $response = array('status' => 'Done');
    foreach ($_SESSION['urls'] as $url)
    {
        /* Do something with the crawled pages */
    }
}
echo json_encode($response);
?>
That's it, very simple! One Web page can start asynchronous jobs, and another page can show the status of the jobs at any time.
Not long ago, we would do something like this using crontab to schedule jobs and other less elegant tricks.
- Application ideas
Gearman can be really useful in real world applications. Here are a few ideas for possible applications:
* Newsletter delivery
Sending newsletters to many subscribers is a complicated task. It takes a considerable amount of time that is proportional to the number of users subscribed to the newsletter.
Therefore, the delivery must be done in the background once you start getting many newsletter subscribers.
As demonstrated above, Gearman serves well for this purpose, as it provides background processing, and it is possible to keep track of the delivery progress.
First it is necessary to create and register a worker job script that will send the newsletter messages to a list of subscribers.
We could create a main process, which could be a background task created by Gearman as well, that creates a set of tasks, assigning each one a set of e-mail messages to submit.
It can detect failed delivery attempts and resend the messages only to the failed addresses, as a message should not be sent more than once to each subscriber.
This can be achieved by sending a list of subscriber identifiers to the worker process. The worker fetches the subscriber information from a database and updates each subscriber's delivered status when the message is sent successfully, as in the sketch below.
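A minimal sketch of such a worker job could look like this. The fetch_subscribers(), send_message() and mark_delivered() functions are hypothetical helpers that you would implement on top of your own database and mail sending code:
<?php
$worker = new GearmanWorker();
$worker->addServer('localhost');
$worker->addFunction('newsletter', 'newsletter_fn');
while ($worker->work());

function newsletter_fn($job)
{
    /* the client sends a serialized list of subscriber identifiers */
    $ids = unserialize($job->workload());
    $count = count($ids);
    $i = 0;
    /* fetch_subscribers(), send_message() and mark_delivered() are
       hypothetical helpers around your database and mail library */
    foreach (fetch_subscribers($ids) as $subscriber)
    {
        if (send_message($subscriber['email']))
            mark_delivered($subscriber['id']);
        $i++;
        /* report progress so a monitoring client can track the delivery */
        $job->sendStatus($i, $count);
    }
}
?>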
* Application server
Gearman itself is very fast, as it is written in C and is multi-threaded. It is also as safe as Apache, since the workers run in separate processes, like Apache in pre-fork mode. I think Gearman could eventually be used as a replacement for Apache, as it can also act as an HTTP server.
I started working on something similar, for now just for fun. It is basically an Nginx module that speaks directly to Gearman. I want to deploy it as a Web service server, and if it performs well, why not replace Apache some day?
I am also writing a module for the Kannel open source SMS gateway for a private company that I work for. This module submits every incoming SMS to a Gearman worker.
http://www.kannel.org/
* Map/Reduce
Another interesting use for Gearman is to write your own implementation of the Map/Reduce algorithm. It can be used to distribute the processing of large data sets over a cluster of servers.
It might be interesting and even fast, since it could use Gearman with a local database instead of a distributed file system.
To do this, we could set up a master worker process that distributes a list of data values, together with the source code of the Map and Reduce functions to be applied to every data item.
It is pretty straightforward to do since PHP is a dynamic scripting language and provides the eval() function to execute dynamically passed code, as in the sketch below. ;-)
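As a rough sketch under these assumptions, a map worker could receive the data values together with the body of the function to apply, and build that function with create_function(), which is based on eval(). Keep in mind that this executes arbitrary code sent by the client, so it is only acceptable when the clients are fully trusted:
<?php
$worker = new GearmanWorker();
$worker->addServer('localhost');
$worker->addFunction('map', 'map_fn');
while ($worker->work());

function map_fn($job)
{
    /* the client sends something like:
       array('code' => 'return $value * 2;', 'values' => array(1, 2, 3)) */
    $request = unserialize($job->workload());
    /* WARNING: this evaluates code sent by the client */
    $map = create_function('$value', $request['code']);
    return serialize(array_map($map, $request['values']));
}
?>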
- Other ideas and comments
Gearman has great potential, and above you may find just a few ideas. Feel free to share more ideas or ask questions about this article by posting a comment here.