The kmos3 Implementation

How the kmos3 KMC Algorithm Works

Kmos3 asks you to describe your model to the processor in seemingly arcane ways. It can save model descriptions in XML but they are basically unreadable and a pain to edit. The API has some glitches and is probably incomplete: so why learn it?

Because it is fast (in two ways).

The code it produces is commonly faster than naive implementations of the KMC method. Most straightforward implementations of KMC take a time proportional to 2*N per KMC step, where N is the number of sites in the system. However, the code that kmos3 produces is O(1) until the RAM of your system is exceeded. As benchmarks have shown this may happen when 100,000 or more sites are required. However, tests have also shown that kmos3 can be faster than O(N) implementations from around 60-100 sites. If you have different experiences please let us know but we think this gives some rule of thumb.

Why is it faster? Straightforward implementations of KMC scan the entire system twice per KMC step. First, to determine the total rate, then, to determine the next process to be executed. The present implementation does not. Kmos3 keeps a database of available processes, which allows to quickly pick the next process. It also updates the database of available processes, which costs additional overhead. However, this overhead is independent of the system’s size and only scales with the degree of interaction between sites, which is hard to define in general terms.

The second reason why it is fast is because you can formulate processes in an intuitive fashion and let kmos3 figure how to make fast running code out of it. So we save in human time and CPU time, which is essentially human time as well. Yay!

To illustrate just how fast the algorithm is, the graph below shows the CPU time needed to simulate 1 million KMC steps on a simple cubic lattice in 2 dimensions with two reacting species and without lateral interactions. You can see that the CPU time spent per KMC step is nearly constant for up to about $10^5$ sites.

../../_images/benchmark.png — Benchmark for a simple surface reaction model. All simulations have been performed on a single CPU of an Intel I7-2600K with 3.40 GHz clock speed.

The kmos3 O(1) Solver

../../_images/data_structures.svg — The data model underlying the kmos3 solver. The central component is the avail_sites array, which stores for each elementary step the sites for which it is executable. Secondly, it stores the location in memory, where the availability of the site is stored for direct access. The array of rate constants holds the numeric rate constant and only changes, when a physical parameter is changed. The nr of sites array holds the total number of sites for each process and needs to be updated whenever a process becomes available or unavailable. The accum. rates has to be updated once per KMC step and holds the accumulated rate constant for each processes. That is, the last field of accum. rates holds $k_{\mathrm{tot}}$ –the total rate of the system.

The data model underlying the kmos3 solver. The central component is the avail_sites array, which stores for each elementary step the sites for which it is executable. Secondly, it stores the location in memory, where the availability of the site is stored for direct access. The array of rate constants holds the numeric rate constant and only changes, when a physical parameter is changed. The nr of sites array holds the total number of sites for each process and needs to be updated whenever a process becomes available or unavailable. The accum. rates has to be updated once per KMC step and holds the accumulated rate constant for each processes. That is, the last field of accum. rates holds $k_{\mathrm{tot}}$ –the total rate of the system.

So what makes the KMC solver so furiously fast? The underlying data structure is shown in the picture above. The most important part is that the solver never scans the entire system for available processes except at program initialization.

Please have a look at the sketch of data structures above. Given that all arrays are initialized and populated, in each KMC step the following things happen:

In the first step we need to identify the next process and site. To do so, we draw a random number $R_{1} \in [0, 1]$ . This number has to be scaled to $k_{\mathrm{tot}}$ , so we multiply it with the last field in accum. rates. Next, we simply perform a binary search for the corresponding process in accum. rates. Having determined the process, we pick a site using a second random number $R_{2}$ , which is constant in time since avail sites is filled up with the available sites for each process from the left.

Totally independent of this we calculate the duration of the current step with another random number $R_3$ using

$\Delta t = \frac{-\log(R_{3})}{k_{\mathrm{tot}}}.$

So, while the determination of process and site is extremely straightforward, the CPU intensive part just starts now. The proclist module is written in such a way that for each elementary step it updates the avail sites array only in the local neighborhood of the site, where the process is executed. It is furthermore heuristically optimized in order to require only a minimal number of if-statements to figure out which database updates are necessary. This will be explained in more detail in the next subsection.

For the current description it is sufficient to know that for all database updates by the proclist module

the nr of sites array is updated as well and

adding or deleting an available site only takes constant time, since the number of available sites as well as the memory addresses are always updated. Thus, new sites are simply added at the end of the list of available sites. When a site has to be deleted, the last site in the array is moved to the memory slot available now.

Thus, once all local updates are finished the accum. rates array is simply updated once and we are ready for the next KMC step.

Todo

Describe translation algorithm.