
Introduction
------------

Should you use it?
..................

Django-cachalot is the perfect speedup tool for most Django projects.
It will speed up a website with 100 000 visits per month without any problem.
In fact, **the more visitors you have, the faster the website becomes**.
That’s because every possible SQL query on the project ends up being cached.
Django-cachalot is especially efficient in the Django administration website,
since it’s unfortunately badly optimised (use foreign keys in ``list_editable``
if you need to be convinced).

However, it’s not suited for projects with **a high number
of modifications per minute** on each table, like a social network with
more than 30 messages per minute. Django-cachalot may still give a small
speedup in such cases, but it may also slow things down a bit
(in the worst-case scenario, a 20% slowdown,
according to :ref:`the benchmark <Benchmark>`).
If you have a website like that, optimising your SQL database and queries
is the number one thing you have to do.

There is also an obvious case where you don’t need django-cachalot:
when the project is already fast enough (all pages load in less than 300 ms).
Like any other dependency, django-cachalot is a potential source of problems
(even though it’s currently bug-free).
Don’t use dependencies you can avoid; a “future you” may thank you for that.
Features
........

- **Saves in cache the results of any SQL query** generated by the Django ORM
  that reads data. These saved results are then returned instead
  of executing the same SQL query, which is faster.
- The first time a query is executed, it is about 10% slower; the following
  times are way faster (7× faster being the average).
- Automatically invalidates saved results,
  so that **you never get stale results**.
- **Invalidates per table, not per object**: if you change an object,
  all the queries done on other objects of the same model are also invalidated.
  It is unfortunately technically impossible to make a reliable
  per-object cache. Don’t be fooled by packages pretending to have
  that per-object feature: they are unreliable and dangerous for your data.
- **Handles everything in the ORM.** You can use the most advanced features
  of the ORM without a single issue; django-cachalot is extremely robust.
- Easy control thanks to :ref:`settings` and :ref:`a simple API <API>`.
  But that’s only required if you have a complex infrastructure. Most people
  will never use the settings or the API.
- A few bonus features, like
  :ref:`a signal triggered at each database change <Signal>`
  (including bulk changes) and
  :ref:`a template tag for better template fragment caching <Template tag>`.
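
To give a concrete idea of what per-table invalidation means, here is a
minimal plain-Python sketch (an illustration only, not django-cachalot’s
actual implementation — real cache keys, thread safety and the ORM
integration are all omitted):

```python
# Simplified illustration of per-table SQL result caching.
# NOT django-cachalot's actual implementation.

class PerTableCache:
    def __init__(self):
        self.results = {}   # sql -> cached result rows
        self.tables = {}    # sql -> tables the query reads

    def fetch(self, sql, tables, execute):
        """Return cached rows for `sql`, executing it only on a miss."""
        if sql not in self.results:
            self.results[sql] = execute(sql)
            self.tables[sql] = set(tables)
        return self.results[sql]

    def invalidate(self, table):
        """A write to `table` drops every cached query reading it."""
        stale = [sql for sql, t in self.tables.items() if table in t]
        for sql in stale:
            del self.results[sql]
            del self.tables[sql]

cache = PerTableCache()
rows = cache.fetch("SELECT * FROM auth_user", ["auth_user"],
                   lambda sql: [("alice",), ("bob",)])
cache.invalidate("auth_user")   # any write to auth_user flushes the query
```

Changing *any* row of ``auth_user`` flushes *every* query reading that
table — coarse, but impossible to get wrong, which is the whole point.
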
Comparison with similar tools
.............................

This comparison was done in October 2015. It compares django-cachalot
to the other popular automatic ORM caches at the moment:
`django-cache-machine <https://github.com/django-cache-machine/django-cache-machine>`_
& `django-cacheops <https://github.com/Suor/django-cacheops>`_.
Features
~~~~~~~~

======================================================== ========= ============= =========
Feature                                                  cachalot  cache-machine cacheops
======================================================== ========= ============= =========
Easy to install                                          ✔         ✘             quite
Cache agnostic                                           ✔         ✔             ✘
Type of invalidation                                     per table per object    per table
CPU & memory performance                                 optimal   bad           terrible
Reliable                                                 ✔         ✘             quite
Handles ``QuerySet.count``                               ✔         ✘             ✔
Handles empty queries                                    ✔         ✘             ✔
Handles multi-table inheritance                          ✔         probably not  ✘
Handles proxy models                                     ✔         ✘             ✔
Handles many-to-many fields                              ✔         ✘             ✔
Handles transactions                                     ✔         probably not  ✘
Handles ``QuerySet.aggregate``/``annotate``              ✔         probably not  ✘
Handles ``QuerySet.bulk_create``/``update``/``delete``   ✔         probably not  ✘
Handles ``QuerySet.select_related``/``prefetch_related`` ✔         partially     ✘
Handles ``cursor.execute``                               ✔         ✘             ✘
Handles GeoDjango                                        ✔         maybe         ✔
Handles django.contrib.postgres                          ✔         maybe         partially
======================================================== ========= ============= =========

To find out whether a package supports a feature, I searched the documentation,
the issues, the tests and the code.
I really tried to avoid writing “maybe”, “probably not”, etc.
Unfortunately, the absence of tests for such cases, and sometimes the confusion
of the authors themselves about these features, makes it difficult to know
whether they support a feature or not.
Explanations
~~~~~~~~~~~~

Of course, I can’t just throw out a table with such
“Reliable” and “CPU & memory performance” lines without explanation.
My goal is not to start another stupid open-source conflict, nor
to be pretentious about my work. I’m just trying to inform users here, so they
can fully grasp the consequences of using one tool or another.

I actually used django-cache-machine in production for a week
and django-cacheops for a month. With both solutions, I faced a lot
of invalidation issues, and the bigger the cache became,
the worse the performance was.
I now know the reason for these issues: in short, they are due to
their invalidation systems. Read the following paragraphs for more detail.
django-cache-machine
''''''''''''''''''''

django-cache-machine uses “flush lists” to remember which SQL queries are
linked to which objects. This is the approach I chose when I created
a prototype of django-cachalot, except that mine invalidated per table,
not per object as django-cache-machine does. Unfortunately, several
important issues with this approach led me to drop it.

The smaller issue is that each time you execute a new SQL query,
django-cache-machine needs to fetch the “flush list” from the cache,
update it and add it back to the cache. This means we have to make two
cache calls in addition to the cache call that stores the SQL query results.
It may seem tiny, but as your cache size increases,
the “flush lists” start becoming huge (a list of hundreds of cache keys
for each database object), leading to an exponentially growing cache size
and an ever longer time to fetch the always-growing “flush lists”.
So **bad memory and CPU usage when reading data**.
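
A toy model of that read path makes the overhead visible (this illustrates
the bookkeeping pattern described above, not django-cache-machine’s real
code; the key names are made up):

```python
# Toy model of "flush list" bookkeeping on the read path.
# Storing one query result costs THREE cache operations instead of one.

cache = {}          # stand-in for memcached/Redis
cache_ops = []      # log of (operation, key) pairs

def cache_get(key):
    cache_ops.append(("get", key))
    return cache.get(key)

def cache_set(key, value):
    cache_ops.append(("set", key))
    cache[key] = value

def store_query_result(query_key, rows, object_ids):
    cache_set(query_key, rows)                     # 1: the result itself
    for obj in object_ids:
        flush = cache_get("flush:%s" % obj) or []  # 2: fetch the flush list
        flush.append(query_key)
        cache_set("flush:%s" % obj, flush)         # 3: write it back

store_query_result("q1", [("alice",)], ["user:1"])
# Three cache operations for a single-object query -- and the flush
# lists keep growing as more queries touch the same objects.
```

With N objects in the result set, every read triggers 2N extra cache
round trips, which is where the memory and CPU cost comes from.
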
The second issue is linked to the per-object invalidation only.
When django-cache-machine invalidates an object, it also needs to invalidate
the queries of the related objects, otherwise they may contain stale data.
Django-cache-machine invalidates foreign keys only, not many-to-many
or generic foreign keys (because… I don’t know). **This degrades the
performance of each write operation to the database**, because it needs
to fetch the related objects, fetch their “flush lists” and delete these
cache keys. And of course it can’t invalidate basic queries such as counts
or empty queries (probably aggregations too, but I’m not sure).
Last but not least: a critical issue. It simply proves that the
django-cache-machine team **doesn’t know how caches work**.
Caches are fast because they are stupid: when your cache is full and
needs room, it randomly fetches a few keys, selects the oldest ones if
possible, then deletes them. This means that **a cache key with a 1-year
timeout can be deleted before a cache key with a 1-minute timeout**.
But django-cache-machine assumes its “flush lists” will always stay longer
in the cache than the saved query results, because they have the same timeout
and the “flush lists” are saved a few milliseconds after the query results.
Until the cache is full, this is kind of true, because no cache key is deleted.
But once it is full, a “flush list” can be removed at any moment,
so the other cache keys will never be invalidated until they are themselves
deleted.
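
This hazard can be simulated with a toy cache that evicts an arbitrary key
when full (an assumption of the sketch; real caches under memory pressure
behave similarly, with no ordering guarantee between keys):

```python
import random

# Toy cache that evicts a random key when full, like a real cache
# under memory pressure -- timeouts give no eviction-order guarantee.
class EvictingCache(dict):
    def __init__(self, capacity):
        super().__init__()
        self.capacity = capacity

    def set(self, key, value):
        while len(self) >= self.capacity:
            self.pop(random.choice(list(self)))  # may evict ANY key...
        self[key] = value

cache = EvictingCache(capacity=2)
cache.set("q1", [("alice",)])       # saved query result
cache.set("flush:user:1", ["q1"])   # flush list, saved just after
cache.set("q2", [("bob",)])         # cache full: something was evicted

# If "flush:user:1" was the evicted key, "q1" can no longer be
# invalidated and will be served stale until it is itself evicted.
if "flush:user:1" not in cache and "q1" in cache:
    print("stale hazard: q1 survived its own flush list")
```
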

**To sum up, django-cache-machine has bad memory and CPU performance
and is absolutely not reliable.**
django-cacheops
'''''''''''''''

django-cacheops uses
`a debug feature from Redis, KEYS, <http://redis.io/commands/KEYS>`_
to invalidate cache keys (that’s why it only supports Redis).
It’s a feature that becomes linearly slower as your cache size grows.
I measured it: one single call of this command by django-cacheops
slows down any database save by 50 ms to 3.5 seconds,
depending on your database and cache sizes.

The problem is also that django-cacheops runs this command several times
on each save. Suppose you have a model with 3 many-to-many fields and you
save an object with 3 related objects per many-to-many field. django-cacheops
will therefore run the Redis ``KEYS`` command at least 10 times! If you have
a large cache and database, it means **you can wait 30 seconds
while this object is saved!**
Another bad consequence of this use of the ``KEYS`` command is that Redis
jumps to 100% CPU usage while the command is running, degrading performance
for other users or even blocking them until the command is finished.
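
The asymptotic difference is easy to illustrate in plain Python: a
``KEYS``-style invalidation must scan the whole keyspace on every write,
while a per-table index only touches the keys actually being invalidated
(a sketch of the two strategies, not Redis’ or django-cacheops’ code):

```python
# KEYS-style invalidation scans the WHOLE keyspace on every write...
def invalidate_with_keys(cache, prefix):
    scanned = 0
    stale = []
    for key in list(cache):          # linear in total cache size
        scanned += 1
        if key.startswith(prefix):
            stale.append(key)
    for key in stale:
        del cache[key]
    return scanned

# ...while a per-table index only touches the keys being invalidated.
def invalidate_with_index(cache, index, table):
    stale = index.pop(table, set())
    for key in stale:
        cache.pop(key, None)
    return len(stale)

cache = {"user:q%d" % i: i for i in range(1000)}
cache.update({"post:q%d" % i: i for i in range(5)})
index = {"post": {"post:q%d" % i for i in range(5)}}

assert invalidate_with_keys(dict(cache), "post:") == 1005  # scans everything
assert invalidate_with_index(cache, index, "post") == 5    # touches 5 keys
```

The ``KEYS`` cost grows with the size of the whole cache, which is exactly
the linear slowdown measured above.
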
In a general way, the workflow of django-cacheops is totally unoptimised.
When an object is modified, an ``invalidate_obj`` function is called,
which calls an ``invalidate_dict`` function, which calls the
``manage.py invalidate`` command with a serialised version of the object (!?),
which calls an ``invalidate_model`` function that runs the Redis ``KEYS``
command to get all the cache keys of that model, then deletes them.
And as I said above, it executes all of that N times,
N being the number of objects related to the current object,
even though multiple objects share the same model and we therefore
don’t need to invalidate the model multiple times.
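
That redundancy is simple to avoid: collect the models of all related
objects into a set first, then invalidate each model once. A hypothetical
sketch (``invalidate_model`` here merely stands in for the expensive
per-model flush; none of these names come from django-cacheops):

```python
# Hypothetical sketch: deduplicate models before running the expensive
# per-model invalidation, instead of invalidating once per related object.
invalidation_calls = []

def invalidate_model(model):          # stand-in for the KEYS-based flush
    invalidation_calls.append(model)

def invalidate_for_save(obj_model, related):
    # related: (model, object_id) pairs touched by the save
    models = {obj_model} | {model for model, _ in related}
    for model in sorted(models):      # each model invalidated exactly once
        invalidate_model(model)

# 3 many-to-many fields with 3 related objects each: 9 related objects.
related = [(m, i) for m in ("Tag", "Category", "Author") for i in (1, 2, 3)]
invalidate_for_save("Post", related)
# 4 invalidations instead of the 10 a per-object scheme would run.
```
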

**To sum up, django-cacheops has terrible performance when modifying data,
but is reliable for what it handles.**
However, you probably need features it doesn’t handle, such as
transactions (used by the Django admin),
multi-table inheritance, or
``cursor.execute`` (the three features being used by Wagtail and django CMS)…
Number of lines of code
~~~~~~~~~~~~~~~~~~~~~~~

Django-cachalot tries to be as minimalist as possible while handling most
use cases. Being minimalist is essential to create maintainable projects,
and having a large test suite is essential to achieve excellent quality.
The statistics below speak for themselves…

============ ======== ============= ========
Project part cachalot cache-machine cacheops
============ ======== ============= ========
Application  743      843           1662
Tests        3023     659           1491
============ ======== ============= ========