Thursday, May 19, 2016

Performance of Redis sorted set operations

I was recently working on a feature which involved several Redis "remove from sorted set" operations. In the target production environment, the sorted sets are expected to be small, and most of these calls would be trying to remove items that do not exist in the sets. Although the ZREM operation has documented LOG(N) time complexity, ZSCORE has documented constant time complexity. This led me to believe that Redis might have constant time complexity for ZREM calls when the values to be removed do not exist in the set (if ZSCORE "knows" whether an item is NOT in the set in constant time, ZREM can do the same, right?).

Not in this case, apparently. ZSCORE's documented constant time complexity is actually misleading (as is much asymptotic time complexity documentation when applied to small data sets). Redis stores small sorted sets (up to 128 items by default) as "ziplists", which are essentially linked lists optimized for memory consumption. Browsing through the Redis source code confirms the worst: there is no additional data structure hidden somewhere that would make a ZREM or ZSCORE operation on a ziplist constant time or even LOG(N) - finding a value in the set involves walking the entire linked list! This (walking the entire list) is exactly what happens when you try to delete a non-existing item from a small sorted set.

The following node.js script demonstrates this with a small performance measurement (it was also, incidentally, a good example of "callback hell" :) ). The script fills a sorted set with a certain number of values (5, 120, 1000, 10000) and then calls ZSCORE on this set in a loop. It may be surprising to see that calling ZSCORE on a set of 1000 items is faster than on a set of 5 items, and much faster than on a set of 120 items. On the other hand, the script confirms that ZSCORE really does become a constant time operation on sets of more than 128 items.
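(The original callback-style script isn't preserved here; the sketch below reproduces the measurement using the promise API of the modern "redis" NPM package (v4+), so the set sizes match the original experiment, but the style and the exact numbers do not.)

    const { createClient } = require('redis');

    // Measures repeated ZSCORE calls for a member that is NOT in the set.
    async function measure(client, size, iterations) {
        await client.del('zset:perf');
        for (let i = 0; i < size; i++) {
            await client.zAdd('zset:perf', { score: i, value: 'member' + i });
        }
        const start = process.hrtime.bigint();
        for (let i = 0; i < iterations; i++) {
            await client.zScore('zset:perf', 'missing-member');
        }
        const ms = Number(process.hrtime.bigint() - start) / 1e6;
        console.log(`set size ${size}: ${ms.toFixed(0)} ms for ${iterations} ZSCORE calls`);
    }

    (async () => {
        const client = createClient();
        await client.connect();
        for (const size of [5, 120, 1000, 10000]) {
            await measure(client, size, 10000);
        }
        await client.quit();
    })();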

Of course, real life is much more complex, and on average ziplists should make better use of CPU memory caches. Still, this example shows that one should take asymptotic time complexity documentation with a pinch of salt.

Tuesday, February 16, 2016

Finding random subsets

Suppose you have a large set of records and you need to process them in random batches over a longer period of time. By "random batches" I mean subsets containing random elements from the full set. The solution that has worked well for us is based on the following steps:
  1. Load unprocessed record ids into memory;
  2. Periodically extract a random batch of ids;
  3. Process extracted records and persist them as processed.
The tricky part of the process is step 2: how do you efficiently find a random subset of a big set? It turns out there is a ready-made algorithm for this. The following JavaScript implementation proved to be useful in the context of the aforementioned task:
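(The original snippet isn't preserved here; one classic ready-made algorithm for this problem is Robert Floyd's sampling algorithm, sketched below - the function and variable names are mine.)

    // Returns k distinct random elements from `items` (Robert Floyd's algorithm).
    // Runs in O(k) time, regardless of the size of `items`.
    function randomSubset(items, k) {
        const picked = new Set();
        const result = [];
        for (let i = items.length - k; i < items.length; i++) {
            // Pick a random index j in [0, i]
            const j = Math.floor(Math.random() * (i + 1));
            if (picked.has(j)) {
                // items[j] was already taken on a previous iteration;
                // take items[i] instead - it is guaranteed to be free.
                picked.add(i);
                result.push(items[i]);
            } else {
                picked.add(j);
                result.push(items[j]);
            }
        }
        return result;
    }

    // Example: a random batch of 3 ids out of 10
    console.log(randomSubset([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 3));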

Wednesday, January 6, 2016

Finding random geographical point around given coordinates

In my current project, we are using a MongoDB 2dsphere index to sort records by distance from a certain geographical point. It is very useful for sorting large data sets by distance with good performance, but unfortunately it cannot be used if you need to extend your sorting rules (that is, to sort by other fields as well).
What we needed was to randomize search results in a certain way, while retaining the "sort by distance first" rule. Below is a function (in TypeScript) which proved to be very useful for this task. It returns a random geographical point anywhere around the given coordinates, with a range limit in km. It was very interesting to find out that longitude and latitude are just sphere vector angles, so a bit of trigonometry is expected.
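(The original TypeScript function isn't preserved here; the plain JavaScript sketch below illustrates the idea using a simple equirectangular approximation, which is adequate for small ranges away from the poles.)

    const EARTH_RADIUS_KM = 6371;

    // Returns a random point within maxDistanceKm of the given coordinates.
    function randomPointAround(lng, lat, maxDistanceKm) {
        // sqrt() makes the points uniform over the disk area,
        // instead of clustering around the center.
        const distanceKm = Math.sqrt(Math.random()) * maxDistanceKm;
        const bearing = Math.random() * 2 * Math.PI;

        // Angular distance (radians) is arc length over sphere radius.
        const angular = distanceKm / EARTH_RADIUS_KM;
        const dLat = angular * Math.cos(bearing);
        // A degree of longitude "shrinks" by cos(latitude) away from the equator.
        const dLng = angular * Math.sin(bearing) / Math.cos(lat * Math.PI / 180);

        return {
            lng: lng + dLng * 180 / Math.PI,
            lat: lat + dLat * 180 / Math.PI
        };
    }

    // Example: a random point within 5 km of [24.105, 56.950]
    console.log(randomPointAround(24.105, 56.950, 5));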

Thursday, December 3, 2015

Nginx secure link with node.js

Serving static files is a natural task for web servers, and servers with an asynchronous architecture (like Nginx) are especially good at it.
However, there is usually additional security logic that should restrict access to the files you've published. IIS, for example, offers deep integration with the application layer, which allows injecting custom .NET "middleware" logic into the request pipeline.
Node.js applications are very often published behind Nginx for various reasons, and, with the help of the Nginx "Secure Link" module, it is possible to offload static file serving from node.js to Nginx, even if the files are not public. This module uses a "shared secret" string (known to both Nginx and the application) and expects a hash based on this secret to be present in the request, in order to decide whether to proceed or return an error.

The Secure Link module can work in two alternative modes:
  1. A simpler mode, based on the "secure_link_secret" directive. The hash value in the request is based on a concatenation of the file link and the secret. In this mode, we can only check whether the client was given a hash for a particular link.
  2. A more complex mode, based on two directives ("secure_link" and "secure_link_md5"). In this mode, we can restrict the validity time of the hash and check other client-specific parameters known to Nginx (like the IP address or a header value).
This article focuses on the second mode, since it is more powerful and therefore more useful. Following is an example of an Nginx configuration using this mode:
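(The original configuration listing isn't preserved here; the following is a minimal reconstruction that matches the numbered commentary below. The secret string comes from the post; the root path is a placeholder.)

    location /img/ {
        secure_link $arg_h,$arg_e;
        secure_link_md5 "$secure_link_expires$uri 6q3R9jhzG5";

        if ($secure_link = "") {
            return 404;
        }

        if ($secure_link = "0") {
            return 404;
        }

        root /var/www/static;  # placeholder path
    }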

It expects requests like /img/file.doc?h=[hash]&e=[expire time]. The lines in the configuration should be read as follows:
(1) When accessing /img/... links, Nginx will first perform the following checks.
(2) The "secure_link" directive describes how Nginx should extract the generated hash value and (optionally) the link expiry time from the request. In this example, we specify that the hash value and the expiry time should be extracted from the query string arguments "h" and "e" respectively.
(3) The "secure_link_md5" directive describes what hash value Nginx expects to see in the request. The "$secure_link_expires" variable contains the expiry time extracted from the request (the value of the "e" argument in our case). So, to generate the correct hash value, we should hash the concatenation of the link expiry time (seconds since the Unix epoch), the request URI, a space, and the "shared secret" ("6q3R9jhzG5" in our case).
(5-7) If the hash comparison fails, return a 404 error status.
(9-11) If the hash comparison succeeds, but the expiry time is less than the server time, return a 404 error status.
(13) Otherwise, continue with the request.

Before a browser (or another user agent) can request a file from Nginx, the application server needs to generate a hash to be passed to the web server (along with the expiry time used in the hash generation). Following is a node.js function which generates such a hash:
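(The original function isn't preserved here; this sketch uses only node's built-in "crypto" module, and the function name is mine.)

    const crypto = require('crypto');

    // Generates the hash Nginx expects: md5(expires + uri + ' ' + secret),
    // encoded as base64url. `expires` is in seconds since the Unix epoch.
    function generateSecureLinkHash(uri, expires, secret) {
        const hash = crypto.createHash('md5')
            .update(expires + uri + ' ' + secret)
            .digest('base64');
        // Convert base64 to base64url: '+' -> '-', '/' -> '_', strip '='.
        return hash.replace(/\+/g, '-').replace(/\//g, '_').replace(/=/g, '');
    }

    // Example: a link to /img/file.doc valid for one hour
    const expires = Math.floor(Date.now() / 1000) + 3600;
    const h = generateSecureLinkHash('/img/file.doc', expires, '6q3R9jhzG5');
    console.log(`/img/file.doc?h=${h}&e=${expires}`);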

It is important to note that the generated hash is expected to be in base64url format, which is ensured by the last line of the function.


The Nginx secure link module may not have the best documentation, but once you understand it, it is very easy to use. I have found it useful because it allowed me to significantly decrease the number of requests hitting the node.js server and the database.

The MD5 algorithm has a bad reputation in security circles and, of course, it is a bad idea to use it to store passwords. However, if you have a long enough secret (which should be longer than in this example), it is absolutely impractical to recover the secret from the hash value using primitive techniques (like brute force or rainbow tables). A collision attack is not relevant either, since the user agent does not present a secret to be hashed, but the hash value itself. That means that using MD5 is a reasonably secure solution for secure links. Its performance, moreover, is very important for this module to be useful.

Thursday, September 24, 2015

Profiling JavaScript memory in the browser and on the server


Recently we carried out a memory usage analysis in my current (node.js) project. This article summarizes some of our findings, which may be useful to other projects. It does not describe the very basics of the mark-and-sweep garbage collection algorithm. Instead, it focuses on the details of the particular tools we used.

Chrome developer tools 

Chrome developer tools have a great heap snapshot section, which can be used to analyze heaps of both browser and server (node.js) applications. It can be used to record JavaScript heap snapshots of a running process, or to load existing snapshots recorded by other tools. It has been available for several years and is extensively described on the Internet. Still, some parts of this tool lack a proper description and, overall, it may be a bit overwhelming in the beginning. So here is a very simple HTML document example, which may help in understanding Chrome's heap snapshot views.
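(The original example page isn't preserved here; below is a minimal reconstruction consistent with the description that follows - the names "MyClass" and "myInst" are the ones that appear in the snapshots discussed below.)

    <!DOCTYPE html>
    <html>
    <body>
        <button id="action">Action</button>
        <button id="setAndClear">Set and clear</button>
        <script>
            // Instances of this constructor are grouped under "MyClass"
            // in the heap snapshot views.
            function MyClass() {
                this.data = 'some payload';
            }
            document.getElementById('action').onclick = function () {
                // Assign to global scope, so the instance is retained by "Window"
                window.myInst = new MyClass();
            };
            document.getElementById('setAndClear').onclick = function () {
                window.myInst = new MyClass();
                window.myInst = null; // immediately make the instance garbage
            };
        </script>
    </body>
    </html>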

A script in this document simply creates an instance of an object using a constructor function and assigns it to the global scope. If you click the "Action" and then the "Set and clear" button to force the browser to do its initial compilation work, then record the first heap snapshot, then click the "Action" button and record a second heap snapshot, you should see something similar to the picture below. Just select the second snapshot in the left menu, then select the "Comparison" view, and then select the first snapshot as the comparison target in the dropdown next to the class filter input box.

Chrome developer tools / Profiles tab

The upper table shows the difference between the two snapshots. All objects (either new or deleted in the second snapshot) are grouped by constructor function name (just as in the Summary view).
The first column ("Constructor") is a tree view. On the first level, you can see the constructor function name (the parameter by which object instances are grouped). On the second level, you can see the instances in the selected group (each instance has a unique id with an "@" prefix). On the third level, you can see the properties of those instances; then, on the next level, the properties of the corresponding objects, and so on.
The remaining columns in the upper table describe the difference in the number of instances and the total size of each group (on the first level). On the second level, you can see whether a particular instance is new or has been deleted between the compared snapshots.

The lower ("Retainers") table is also a tree view. On the first level, you can see the list of paths from the selected instance to a root object; these paths are what prevent the instance from being marked as garbage by the garbage collector. The tree under each top-level node describes the path to that root. In the picture above, you can see that the instance of "MyClass" with ID "@115975" is "retained" through the property "myInst" on the root object "Window". Distance "1" means that this instance is directly referenced by a root object. It is interesting to see that there is also another, system retaining path for this instance (the second top-level record in the "Retainers" table). It is not directly visible to JavaScript, but it is shown in this view nonetheless.

Analyzing node.js heaps 

What is great about node.js is that not only can you build your server side using the same language (JavaScript, CoffeeScript, TypeScript, etc.) and libraries as you use for the "browser side", but you can also use the same tools to debug and analyze it.
There are a lot of articles describing how you can attach a debugger to a new or running node.js process and then either debug it or record heap snapshots. What is slightly less known is that you can record heap snapshots programmatically, without the need to attach a debugger. That is very easy to do using the "heapdump" NPM module.
For example, you can simply create a new route, which can be called at any time and will create a new heap snapshot file on disk. It can even be used in production (provided you close it off from unauthorized access, of course):
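(A sketch assuming an Express application; the route path and the authorization middleware are placeholders.)

    const express = require('express');
    const heapdump = require('heapdump');

    const app = express();

    // Placeholder: substitute your real authorization logic here
    function requireAdmin(req, res, next) {
        next();
    }

    app.get('/admin/heapdump', requireAdmin, (req, res) => {
        const filename = `/tmp/heap-${Date.now()}.heapsnapshot`;
        // heapdump writes a V8 snapshot file, which Chrome developer
        // tools can load later via the Profiles tab.
        heapdump.writeSnapshot(filename, (err, file) => {
            if (err) return res.status(500).send(err.message);
            res.send(`Heap snapshot written to ${file}`);
        });
    });

    app.listen(3000);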

Using this heap dumping technique, we very easily found that many of our processes were keeping big, unused "geoip" data files in RAM, because the corresponding library loads them into memory non-lazily. Simply removing these files greatly reduced the memory consumption of our application.

Thursday, July 16, 2015

Using Redis Lua scripting for complex queues

We've just published a short summary of how we are using Redis Lua scripts to build high-performance queues that ensure data consistency. It's here, on the IPR blog:

Thursday, July 2, 2015

Returning cached value you haven't cached yet

In my current project, we use JavaScript promises all over the place. We use the built-in angular.js $q service on the client and the great bluebird promise library on the node.js server. In fact, we don't even have any callbacks; all the "standard" node.js APIs are always "promisified".

The virtues of promises have been extolled by many people. I personally like this article a lot: I especially like the part about how promises allow "programming with values" in asynchronous contexts.

Recently I found a very interesting use case for this "programming with values" concept. Imagine you have a service which returns a value (a promise of a value, to be more precise) by making an HTTP request. Then you decide you want to cache this value and make sure that many concurrent calls to this method don't make extra HTTP requests before the original HTTP request has returned the value. That is, you need to return a cached value which you haven't cached yet.
Doing this with promises is simple and elegant: you just cache the original HTTP promise and return it. You don't even need to check whether the value has been returned (the promise resolved)!

This is an excerpt of an actual angular.js service (in TypeScript):
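(The original excerpt isn't preserved here; the plain JavaScript sketch below shows the pattern, with made-up service and endpoint names.)

    angular.module('app').service('configService', function ($http) {
        var configPromise = null;

        // Returns a promise of the config. The promise itself is cached,
        // so concurrent callers share a single in-flight HTTP request.
        this.getConfig = function () {
            if (!configPromise) {
                configPromise = $http.get('/api/config')
                    .then(function (response) { return response.data; });
            }
            return configPromise;
        };
    });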