Improving performance by avoiding LIMIT in large data sets

This is a repost of http://www.king-foo.com/2015/09/improving-performance-by-avoiding-limit-in-large-data-sets/.

Whether you are using MySQL, MongoDB or another data store, you will have some kind of paging mechanism to scroll through your data set.

This mostly boils down to:

  • Order the data set in some way;
  • Skip the first X records (the offset);
  • Select the next Y records (the limit).

SELECT * FROM users LIMIT 14500,500;

You get the impression that only a slice of the data set is used/returned, but on the server side a lot of work has been done: all the rows before the offset still have to be read, only to be thrown away.

This works quite well for the beginning of the data set, but performance will quickly drop once you start to skip 100,000 rows and select the next 100 records. For paginating search results this is acceptable, because most users won’t get to page 6, but for iterating over a complete data set, this is disastrous.

The solution

Instead of using paging, you can implement some kind of scrolling or cursoring.

If you look at Twitter, they use a max_id to enable this and Elasticsearch has a scan and scroll mechanism for this.

Now you have two options to implement this for your local database:

  • Completely do the selection with a WHERE clause;
  • Or combine a WHERE clause with a LIMIT;

WHERE clause

The easiest way to scroll through your data set is to fetch MAX(id) once, and then select fixed id ranges until you reach that max id.

SELECT MAX(id) FROM users;
SELECT * FROM users WHERE id >= 0 AND id < 500;
...
SELECT * FROM users WHERE id >= 14500 AND id < 15000;
...

Using this approach, you can only stop fetching once you reach the max id, because some of the pages can be empty.

This approach does not need an ORDER BY clause, because you explicitly set the id range you want back.

WHERE clause combined with LIMIT

This approach can be used to display pagination on a website, because you cannot have empty pages.

SELECT COUNT(*) FROM users; -- Optional, but needed if you want to show how many pages there are
SELECT * FROM users ORDER BY id LIMIT 500; -- The last id here is 543
SELECT * FROM users WHERE id > 543 ORDER BY id LIMIT 500; -- The last id here is 2003
SELECT * FROM users WHERE id > 2003 ORDER BY id LIMIT 500;
...

The drawback here is that you have to sort the data set, but this is still much quicker than the OFFSET,LIMIT approach.
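In PHP, iterating over the whole table with this approach could look roughly like this (the $pdo connection and the page size of 500 are assumptions):

<?php
// Keyset pagination: remember the last seen id instead of using OFFSET
$lastId = 0;

do {
    $statement = $pdo->prepare(
        'SELECT * FROM users WHERE id > :lastId ORDER BY id LIMIT 500'
    );
    $statement->execute(['lastId' => $lastId]);
    $rows = $statement->fetchAll(PDO::FETCH_ASSOC);

    foreach ($rows as $row) {
        // ... process $row ...
        $lastId = $row['id'];
    }
} while (count($rows) === 500);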

PHP TestFest 2017


It’s Wednesday 7/6/2017 and Ben Ramsey emails the UG-ADMIN mailing list with the subject “PHP TestFest 2017”. I immediately think back to the 2009 edition, which I attended (at Combell, organized by PHPBelgium).

I had a great time during that edition (I even managed to get some test code into the PHP source) and was wondering if we could join this year’s edition with a couple of local user groups.

I co-organize PHP-WVL and currently work for Combell (based in Ghent), so after some chatting with the folks of Ghent PHP, we agreed to join forces and organize this year’s edition back at Combell.

As the date we’d picked came closer, I had to figure out how tests are written and merged into the PHP source these days, so I could “mentor” the visitors. I still remembered how to write phpt files, but had to figure out the rest of the flow.
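For reference, a minimal phpt file looks roughly like this (the function under test is just an arbitrary example):

--TEST--
strrev() basic behaviour
--FILE--
<?php
var_dump(strrev('abc'));
?>
--EXPECT--
string(3) "cba"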

I started by listening to PHP Roundtable episode 65 in the car, to get up to speed. Next up: the PHP TestFest website and the excellent video series of Sammy.

For specific questions, the mailing list is perfect, which helped me with some questions about code coverage.

I figured out there is a fork of the PHP source where all pull requests with tests should be created: https://github.com/phpcommunity/phptestfest-php-src/pulls.

Figuring out what to test

To find code that needs testing, you have a couple of options. One of them is looking for uncovered code on the gcov site, but there are some things you need to take into account:

  • It is possible that the gcov server does not have all PHP modules loaded (making it appear as if lots of code fragments are uncovered), see the --SKIPIF-- section of the phpt format;
  • Some tests use “exec” or “popen”, making them “untrackable” for the coverage tool (the cli_server tests appear to be uncovered for this reason);
  • Not all testfest tests have been merged yet.

To solve the last problem, I created a fork of the phptestfest fork, where I merged all the approved pull requests in a branch. This way, I can render code coverage for a specific path, including all the phptestfest tests from other users.

Our contribution

After 2 weeks, we managed to submit 13 pull requests, most of which were approved!

#phptestfest

Resources

git-flow, DTAP & semver

Using git-flow in a DTAP environment.

I’m a big fan of Git and I tend to use git-flow in most of my projects. A lot of people don’t like it, because it can be very tedious, but even if you only use develop and master branches, you’re already benefiting from it.

If you have multiple environments to deploy to, you can leverage git-flow logic to automatically deploy to these environments (in our case DTAP: development, testing, acceptance and production). Using a CI/CD server (like GitLab CI), we can auto deploy to specific environments when code gets pushed to the server.

Definitions

Let’s have a look at some definitions first.

git-flow

The git-flow branching model (image: http://nvie.com/img/git-model@2x.png):

  • develop
    • The main development branch
    • Ideally this branch only contains merge commits
  • master
    • This branch represents code in production
    • This branch must only contain merge commits
  • feature/*
    • New development is done in a feature branch
    • This branch is started from develop and will be merged back into develop when it’s done
    • By using a feature branch, you can leverage merge requests
  • release/*
    • This branch sits between develop and master (getting ready for a new production release)
    • This branch is short-lived and totally optional
  • hotfix/*
    • When something is wrong in production, a fix can be produced via a hotfix branch
    • This branch is started from master and will be merged into both master and develop
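
As a quick refresher, the day-to-day commands for the branches above boil down to something like this (branch names are just examples):

# A feature branches off develop and is merged back into develop
git checkout -b feature/invoice-export develop
# ...commit work...
git checkout develop
git merge --no-ff feature/invoice-export

# A hotfix branches off master and is merged into both master and develop
git checkout -b hotfix/broken-login master
# ...commit the fix...
git checkout master
git merge --no-ff hotfix/broken-login
git checkout develop
git merge --no-ff hotfix/broken-login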

DTAP

  • Development: local development
  • Testing: beta server
  • Acceptance: alpha server
  • Production: live

Semantic versioning

Semantic versioning is a way of applying versions to your software so that it is clear what impact a release may have. As described on the homepage (semver.org):

Given a version number MAJOR.MINOR.PATCH, increment the:

  • MAJOR version when you make incompatible API changes,
  • MINOR version when you add functionality in a backwards-compatible manner, and
  • PATCH version when you make backwards-compatible bug fixes.

Additional labels for pre-release and build metadata are available as extensions to the MAJOR.MINOR.PATCH format.

When applied to git-flow, versions are added in the form of git tags, and (for now) only the master branch receives tags.

When you merge develop or a release branch into master, you add a MAJOR.MINOR tag to it (only increment the MAJOR number if the change is big enough). When you merge a hotfix branch into master, you increase the PATCH version.
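
Concretely, that means tagging master right after the merge (the version numbers below are just examples):

# A release merge bumps MAJOR or MINOR
git checkout master
git merge --no-ff release/1.3.0
git tag 1.3.0

# A hotfix merge bumps PATCH
git merge --no-ff hotfix/broken-login
git tag 1.3.1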

In practice

Now we can configure our CI/CD system to start when we push code to a specific branch. In the GitLab CI case, it is possible to limit jobs to branches with “only” (see the sketch after the list below).

  • develop: auto deploy to Testing
  • release/*: auto deploy to Acceptance
  • master: auto deploy to Production
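
A minimal .gitlab-ci.yml sketch for this mapping (the deploy.sh script and job names are placeholders, not the actual setup):

stages:
  - deploy

deploy:testing:
  stage: deploy
  script:
    - ./deploy.sh testing
  only:
    - develop

deploy:acceptance:
  stage: deploy
  script:
    - ./deploy.sh acceptance
  only:
    - /^release\/.*$/

deploy:production:
  stage: deploy
  script:
    - ./deploy.sh production
  only:
    - master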

Tag based CI/CD

You could even limit job execution by using tags. Instead of starting jobs when you push code to a specific branch, you could let developers push to specific branches and kickstart a job by tagging it accordingly (which can be used to throttle builds and/or limit target environments).

Semantic versioning allows additional labels for pre-release and build metadata which we can leverage (1.0.0-alpha < 1.0.0-alpha.1 < 1.0.0-alpha.beta < 1.0.0-beta < 1.0.0-beta.2 < 1.0.0-beta.11 < 1.0.0-rc.1 < 1.0.0).

If you have extra environments (e.g. Education, Backup, Staging, etc…) you can now use even more pre-release tags.

Bonus points

Since we’re using semantic versioning and a build system, we can auto append build metadata to our version numbers. In the case of GitLab CI, you can write a file containing a version number in the form of “1.0.0-rc.2+15” by using the git tag and the CI_JOB_ID variable, for example:
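
A job script could write such a file like this (VERSION is just a placeholder file name; CI_JOB_ID is set by GitLab CI):

# Combine the most recent git tag with the CI job id, e.g. "1.0.0-rc.2+15"
echo "$(git describe --tags --abbrev=0)+${CI_JOB_ID}" > VERSION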

Sort Composer packages for your git merging pleasure

A lot of git merge conflicts occur when multiple developers add lines to the end of a file/list. This also happens in composer.json, AppKernel.php, translation files, config/application.config.php, CSS files, CHANGELOG, etc…

diff --cc composer.json
index 62e875e,0c526d8..0000000
--- a/composer.json
+++ b/composer.json
@@@ -10,6 -10,6 +10,10 @@@
      "require": {
          "ramsey/uuid": "^3.0",
          "roave/security-advisories": "dev-master",
++<<<<<<< HEAD
 +        "beberlei/assert": "^2.3@dev"
++=======
+         "miljar/php-exif": "dev-master"
++>>>>>>> exif
      }
  }

One workaround is to have every developer add lines or blocks at random places in a list or file, but a better solution is to sort these lines (especially for lists like composer.json, translation files, etc…). Sorting also makes you insert new lines in “random” places, while keeping it clear to everyone where things have to be added.

diff --git a/changelog.rst b/changelog.rst
index d9db648..886b168 100644
--- a/changelog.rst
+++ b/changelog.rst
@@ -19,7 +19,9 @@ June, 2015
 New Documentation
 ~~~~~~~~~~~~~~~~~

+* `#5423 <https://github.com/symfony/symfony-docs/pull/5423>`_ [Security] add & update doc entries on AbstractVoter implementation (Inoryy, javiereguiluz)
 * `#5401 <https://github.com/symfony/symfony-docs/pull/5401>`_ Added some more docs about the remember me feature (WouterJ)
+* `#5384 <https://github.com/symfony/symfony-docs/pull/5384>`_ Added information about the new date handling in the comparison constraints and Range (webmozart, javiereguiluz)
 * `#5382 <https://github.com/symfony/symfony-docs/pull/5382>`_ Added support for standard Forwarded header (tony-co, javiereguiluz)
 * `#5361 <https://github.com/symfony/symfony-docs/pull/5361>`_ Document security.switch_user event (Rvanlaak)

Full diff on Github

Some tools and hacks that assist you with this:
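
One example (an addition of mine, not necessarily one of the links originally listed here) is Composer’s own sort-packages config option, which keeps the require blocks sorted whenever you add packages via composer require:

{
    "config": {
        "sort-packages": true
    }
}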

Something that does not work for composer.json, but something I highly recommend, is to add trailing commas in PHP arrays and to not align code (see the blog post by Made With Love about “How clean are your diffs?”). This makes merging easier too!

Happy merging!

Using reStructuredText to generate an ERD

Recently I had to create an ERD for a project, and I didn’t want to use a binary (and mostly proprietary) format. The reasons for this are that I don’t want to bind myself to a specific program, that I want to be able to use the data in different places/formats, and most importantly that I want to be able to version control it (and diff it as text).

Next to Visio, Google Draw, Omnigraffle, etc… there are new players like Skipper, which can be used to easily generate Doctrine models, but these are still a bit too limited for me.

Because I like to be verbose about the decisions I make when designing models and how they fit together, I usually start with a basic text file listing entities and their properties. Those text files evolved into Markdown files, and recently I started using reStructuredText (.rst) files, because it is a very powerful format (and it has a lot of parsers, which will help me achieve the goals I stated at the beginning of this blog post).

A simple file could start like this:

Entity-Relationship Diagram
===========================

This file is used to describe all entities, their properties and how they relate.

User
----

A user entity is used to...

Fields
++++++

:username: The username a user will use to login to his/her account
:email: ...

Group
-----

A group of users can have specific roles in the application.

Fields
++++++

:name: The group name.
       This should be...
:roles: A list of roles

Using this format, we can add a lot of background information and annotations, while keeping a structured format we can use later.

A lot of viewers (GitHub included) can already present the reStructuredText quite well, including references in the document. If your document gets quite large, these links can be very handy:

...

Group
-----

A group of User_ entities can have specific roles in the application.

...

Now that we use references (the User_ format), we can take it one step further and start listing the relations of the entities:

...

User
----

A user entity is used to...

Fields
++++++

:username: The username a user will use to login to his/her account
:email: ...

Relations
+++++++++

:`Group`_: A user can be part of one group.

...

Because we are using the reference notation again, we can already click through to the different relations.

Now that we have a base document, we can try to generate diagrams of it.

Docutils

To easily read the information from the RST files, I recommend using Docutils to generate an XML version of the source file.

rst2xml.py source.rst > ERD.xml

Graphviz DOT files

The Graphviz DOT language is a powerful format to generate graphs from a text (dot) file.

digraph graphname {
    a -> b -> c;
    b -> d;
}

The code above will generate the following graph:

A directed graph (source: wikipedia.org)
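
To render such a file yourself (assuming Graphviz is installed and the snippet above is saved as graph.dot):

dot -Tpng graph.dot -o graph.png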

Putting it all together

Now we can write some code to parse the XML file and generate a dot file out of it.
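
A rough sketch of such a script (assuming every top-level section in ERD.xml is an entity and every reference inside it points to a related entity; the element names may need tweaking for your own document):

<?php
// generate-dot.php: turn the docutils XML into a Graphviz dot file
$xml = simplexml_load_file('ERD.xml');

echo "digraph erd {\n";
foreach ($xml->xpath('/document/section') as $section) {
    $entity = (string) $section->title;
    echo sprintf("    \"%s\";\n", $entity);

    // every reference inside this section becomes an edge to another entity
    foreach ($section->xpath('.//reference') as $reference) {
        echo sprintf("    \"%s\" -> \"%s\";\n", $entity, (string) $reference);
    }
}
echo "}\n";

Piping the output straight into Graphviz then becomes:

php generate-dot.php | dot -Tpng -o erd.png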

When we pipe the output to the Graphviz library, we have our final result!

Win!

This image can get quite large, but Graphviz will find a way to position all the entities.

Some notes/thoughts/improvements:

  • Use `Field Lists`_ for fields.
  • Always include “one” or “multiple” in the relation descriptions.
  • Use {timestamp}, {string} notations in the field descriptions to indicate the type.

Controlling a receipt printer with ZeroMQ and PHP

When you create a POS system as a web app, you have the problem that not all the devices are controllable from your server.

In my case I have the following hardware connected to the workstation:

The input hardware is easy, because a barcode scanner is seen as a keyboard, and both devices can communicate to the server through the web browser.

The tricky part is when a sale is completed, and the customer wants a receipt.

Since the printer is connected to the (dummy) workstation and not to the (remote) server, we cannot print the receipt directly.

The solution I used for this problem is ZeroMQ.

On the server I have a PHP process running which binds to 2 ZeroMQ sockets. One (ZMQ::SOCKET_PULL) waits for incoming print requests from the web app, the other (ZMQ::SOCKET_PUB) publishes each print request to all subscribing workstations.
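
A minimal sketch of that dispatcher process (the port numbers match the other snippets; the rest is a rough assumption):

<?php
/**
 * Pull print requests from the web app and publish them
 * to all subscribing workstations.
 */
$context = new ZMQContext();

$pull = $context->getSocket(ZMQ::SOCKET_PULL);
$pull->bind('tcp://*:5567');

$pub = $context->getSocket(ZMQ::SOCKET_PUB);
$pub->bind('tcp://*:5566');

while (true) {
    // forward every incoming print request to the subscribers
    $pub->send($pull->recv());
}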

On the workstation (in my case a Windows laptop with an HP receipt printer connected to it) I have another PHP process running which is waiting for print jobs (on a ZMQ::SOCKET_SUB socket).

<?php
/**
 * Receive a print request and print.
 */
$context = new ZMQContext();
$sub = $context->getSocket(ZMQ::SOCKET_SUB);
$sub->setSockOpt(ZMQ::SOCKOPT_SUBSCRIBE, '');
$sub->connect('tcp://server:5566');

while (true) {
    // The print job is the URL of the receipt page (see below)
    $printJob = $sub->recv();

    $html = file_get_contents($printJob);
    $file = sys_get_temp_dir() . '/printJob_' . sha1($html) . '.html';
    file_put_contents($file, $html);

    // http://windowsitpro.com/systems-management/how-can-i-print-usb-printer-command-prompt
    exec('print /d:LPT2: ' . $file);
}

I can add the final piece of the puzzle to the web app, by opening a ZMQ::SOCKET_PUSH socket and sending the URL of the receipt page.

<?php
/**
 * Send a print request to the print-dispatcher.
 */
$context = new ZMQContext();
$push = $context->getSocket(ZMQ::SOCKET_PUSH);
$push->connect('tcp://server:5567');

$push->send('http://server/sale/receipt/' . $receiptId);

Now I have created a lightweight system for controlling the receipt printer, instead of using cronjobs (or scheduled tasks) to check for print jobs.

Some remarks:

  • I could remove the print-dispatcher part, and let the web app connect to the print receiver directly, but I prefer to have a stable part (the binding sockets) on the known server (so both connecting sockets know the host to connect to).
  • The HTML page could be transported over ZeroMQ, but I like the extra request so the web app is sure the receipt is printed.

Finding memory leaks in PHP objects

I recently spent quite some time figuring out why a (cli) php script was eating all the memory. In PHP, memory leaks mostly show up in long running scripts. In my case, it was doing calculations on (a lot of) database records.

The script crashed all the time because it was running into the memory_limit. After trying to figure out what was going wrong (using Xdebug’s function traces, PHP’s garbage collection, the unset trick of Paul M. Jones, etc…), I turned to a simple but effective way to inspect the (problem) object.

Write your object to a file from time to time, and diff it.

Dumping your object:

<?php
file_put_contents(
    '/tmp/myobject_' . time() . '.txt',
    print_r($object, true)
);

// do other logic

file_put_contents(
    '/tmp/myobject_' . time() . '.txt',
    print_r($object, true)
);

Now you can see how big your object is getting (just by watching the filesize):

$ ls -al /tmp/myobject_*

And now you can pick two files to actually see what is causing the problem:

$ diff /tmp/myobject_1357140320.txt /tmp/myobject_1357140450.txt
1236a1237,1247
>             [931277dc8ecbbb394b7f5454f64c5d0c] => Array
>                 (
>                     [hash] => 4143484432
>                     [mtime] => 1357137505
>                     [expire] => 1357141105
>                     [tags] => Array
>                         (
>                         )
>
>                 )

In my case Zend_Db_Profiler was enabled and was keeping track of all the queries in memory.

Using multiple databases in phpunit/dbunit with composer

I’m using multiple databases in most of my projects, so having access to multiple databases in my test suite is a must.

phpunit/dbunit is excellent, but you are stuck with one database. The guys at Etsy created very good extensions to fix this problem (MultipleDatabase), but it took me a while to figure out how to use them.

Because PHPUnit is now available via Composer, you can fetch all dependencies with a single command.

The composer.json file to include all dependencies looks like this:

{
    "repositories": [
        {
            "type": "package",
            "package": {
                "name": "hamcrest/hamcrest",
                "version": "1.1.0",
                "dist": {
                    "type": "zip",
                    "url": "https://hamcrest.googlecode.com/files/hamcrest-php-1.1.0.zip"
                },
                "include-path": ["Hamcrest-1.1.0/"],
                "autoload": {
                    "psr-0": { "Hamcrest_": "Hamcrest-1.1.0/" },
                    "files": ["Hamcrest-1.1.0/Hamcrest/Hamcrest.php"]
                }
            }
        },
        {
            "type": "package",
            "package": {
                "name": "etsy/phpunit-extensions",
                "version": "0.6.0",
                "dist": {
                    "type": "zip",
                    "url": "https://github.com/etsy/phpunit-extensions/archive/v0.6.0.zip"
                },
                "autoload": {
                    "psr-0": { "PHPUnit_": "" }
                }
            }
        }
    ],
    "require": {
        "php": ">=5.3.2"
    },
    "require-dev": {
        "hamcrest/hamcrest": "1.1.0",
        "mockery/mockery": ">=0.7.2",
        "etsy/phpunit-extensions": "0.6.0",
        "phpunit/phpunit": "3.7.*",
        "phpunit/dbunit": "1.2.*"
    },
    "include-path": ["vendor/mockery/mockery/library"]
}

Install the dependencies by (installing and) running composer:

cd $PROJECT_ROOT
curl -s https://getcomposer.org/installer | php
php composer.phar install --dev
vendor/bin/phpunit --version

Now you can create a “parent” class to register the databases (mine is called DatabaseTest). Make sure you create a getDatabaseConfigs method (it is required and should return an array of PHPUnit_Extensions_MultipleDatabase_Database instances). For the fixtures, I use XML datasets.
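
A (structured) XML dataset describes tables, columns and rows; the table and column names below are only an illustration:

<?xml version="1.0" encoding="UTF-8" ?>
<dataset>
    <table name="user">
        <column>id</column>
        <column>username</column>
        <row>
            <value>1</value>
            <value>alice</value>
        </row>
        <row>
            <value>2</value>
            <value>bob</value>
        </row>
    </table>
</dataset>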

I’ve added a getConnection method, so I can use the same assertions as the normal dbunit testcase (see Database Assertions API):

$this->assertEquals(0, $this->getConnection('dbname')->getRowCount('user'), "Pre-Condition");
$user = new User();
$user->save();
$this->assertEquals(1, $this->getConnection('dbname')->getRowCount('user'), "Inserting failed");

The magic that happens is quite cool. PHPUnit will read these database configs and use them to make sure all databases and tables are in a known state before every test, and does this in the following order:

  1. Connect to all databases
  2. TRUNCATE all tables supplied in the database fixture file
  3. Insert all rows supplied in the fixture file
  4. Execute the test
  5. TRUNCATE all tables supplied in the database fixture file

Now you’ll be able to run tests for code making changes in your database without affecting other tests.

Some improvements:

  • Add composer support to Hamcrest (once Hamcrest PHP has moved to GitHub).
  • Add composer support to the Etsy extensions (I’ve submitted a PR, but it won’t work until Hamcrest PHP is on GitHub).
  • See if the getConnection() method can be done in a better way.

Installing PEAR

This post is just a quick notice/warning for everyone wanting to install PEAR (especially on Windows): always download the latest go-pear.phar from http://pear.php.net/go-pear.phar.

Before running the famous php -d phar.require_hash=0 go-pear.phar command, make sure the timestamp of the phar file is somewhere this year. For some reason PHP (and in my case Zend Server CE) always ships with the phar file from 2008…

It will save you a lot of time.

make test every PHP version and send feedback

Since David Coallier’s talk during PHPBenelux, I realized the importance of running make test on all PHP releases and sending feedback to PHP.

It is so easy that I will run it from now on, every time a new release is announced.

If you’re running on a Mac, you might want to install Xcode, so you can run “make” on the command line. If you’re on Linux, you’re all set to go.

How to do it:

  • Open a shell
  • Create a directory (e.g. mkdir ~/src/php)
  • Download the latest version into the directory (e.g. wget http://downloads.php.net/stas/php-5.4.0RC6.tar.gz)
  • Untar the file (e.g. tar -xzf php-5.4.0RC6.tar.gz)
  • Go into the directory and run ./configure; you should get output like:
    checking for grep that handles long lines and -e... /usr/bin/grep
    checking for egrep... /usr/bin/grep -E
    checking for a sed that does not truncate output... /usr/bin/sed
    checking build system type... i386-apple-darwin11.2.0
    checking host system type... i386-apple-darwin11.2.0
    checking target system type... i386-apple-darwin11.2.0
    ...
    Thank you for using PHP. 
  • After that, you can run make and should get a lot of output ending in:
    Build complete.
    Don't forget to run 'make test'.
  • After that, you should do as the previous command suggests and run make test. You should see all tests passing by.
  • After all tests are run, you will get a summary. In my case, I get the message: “You may have found a problem in PHP.”
    Bug #55509 (segfault on x86_64 using more than 2G memory) [Zend/tests/bug55509.phpt]
    Sort with SORT_LOCALE_STRING [ext/standard/tests/array/locale_sort.phpt]
  • Whether you get an error or not, you should always send the report back to PHP. You can do that by just answering Y to the question: “Do you want to send this report now?”
  • Enter your email address and PHP will thank you.

How hard was that?