RollingCache.ccc performance debugging and tuning … How?

2025.01.02 - Living on the Edge of the speed of light and at the mercy of the CDN Origin

As a goose I have a very simplistic view about this entire computer thing. I believe it is nothing else but:

  • copy Data from Source to Target (load-store)
  • if true then goto address (conditional jump)

The rest of the magic must be an endless permutation of the above, at gigantic scale and speed.
In these times of the AI-blockchain cult(ur) you might call my beliefs almost Turing-ian … but then … so be it.

FS2024 is not Tetris. So I think the FS2024 team was, and is, right to stress that:

  • “Data streaming” is the only future for the Flight Simulator.

With this assumption the logical next step, and here I will try to explain in more detail “why” I hold that position, is:

  • FS2024 needs the most advanced local data caching solution that is possible.

As a slowly flying goose I know: distance matters … even to light and electric signals. So the “copy Data” part works best when less data travels over a shorter distance. Really short would be like: “L1 cache to CPU register” short. But we know that for FS2024 the origin of our data is in some large geographically distributed Azure data storage service that contains terabytes of 3D object model data and petabytes of planet landscape data.

FS2024 is not Tetris. At its core FS2024 is a very long data copy pipeline.

Every picture is wrong, but some are useful … so they say. So allow me to expand and morph the “copy Data” part (A) a little until it fits the FS2024 setup somewhat better:

A) Source -> Target
B) Origin -> Edge
C) Origin -> Cache     -> Edge
D) Origin -> Cache     -> Cache    -> Cache     -> Cache     -> Edge
E) Origin -> CDN Cache -> RC.ccc   -> RAM Cache -> GPU Cache -> Edge GPU
F) Source -> Copy -> … -> Copy     -> Copy      -> Copy      -> Target

(B) simply introduces the terminology which I will show in a picture below. “Edge” is a more trendy word for client computer.

(C) wants to highlight that in real data life there is always some “cache in the middle(ware)” between the Origin and the Edge client. And (D) tries to push that point even more, as there are usually multiple caches.

(E) tries to name the important FS2024 parts. The content delivery networks (CDN) are basically just big rolling caches on someone else’s computers. We have the RollingCache.ccc file and a corresponding cache in the RAM used by FS2024. “Finally” the GPU caches textures, shaders and geometry in VRAM before they get pushed over the Edge, to the GPU cores which paint the magic into pixels (which are stored in a frame buffer, which basically is another form of cache).

(F) is just a return to the terms of the initial oversimplified case (A). At a technical level the amount of necessary copy operations and involved registers, buffers, queues, caches, file systems, etc. is mind blowing … from my perspective as a simple old goose.

If you want to be really fast, try to do nothing … so they say. In practical terms “distance always matters” and Asobo needs to try to keep the “copy distances” as short as possible. We all want Ultra quality at ultra FPS.

G)                                                 GPU Cache -> Edge GPU
H)                                    RAM Cache -> GPU Cache -> Edge GPU
I)                        RC.ccc   -> RAM Cache -> GPU Cache -> Edge GPU
J)           CDN Cache -> RC.ccc   -> RAM Cache -> GPU Cache -> Edge GPU
K) Origin -> CDN Cache -> RC.ccc   -> RAM Cache -> GPU Cache -> Edge GPU

(G) becomes possible once all necessary 3D data is inside the VRAM (caches). No copy is the fastest copy. However, VRAM is very limited and very expensive.

(H) is possible if the sim has already cached the data inside its RAM, but it still needs to be copied over a DDR and PCI bus into GPU VRAM. Pretty short and pretty fast … but not as good as “doing nothing”. However, RAM is less limited and less expensive.

(I) becomes necessary if the RAM cache does not have the necessary data but it can be found on a local storage device. The data will need to get copied e.g. over a “NAND” and PCI and DDR bus and then step (H) will follow. SSD storage is cheaper than RAM. Sadly “cheaper” always comes with “slower” too. But, as I have shown in a previous test, on my system in theory that is still at a 160 MB/s (1280 Mbps) level.

I designated the RollingCache.ccc as the source of path (I) because this thread is about the RC. However, I consider the RC as just another file on the filesystem. From my point of view all files are a form of data cache. However most of them are immutable, while the RC is designed for constant mutation.

(J) allows access over the TCP/IP network to data in cases where local file caches cannot answer the call. The copy distance becomes extremely long and thereby again slower … at least one order of magnitude slower (50 to 100 Mbps). However, on the upside, a CDN provider has bigger SSD arrays for its rolling cache than any local computer.
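To put the two bandwidth figures from paths (I) and (J) side by side, here is a small arithmetic sketch. The 160 MB/s and the 50 to 100 Mbps values come from the text above; the 100 MB payload is a hypothetical size picked only to make the numbers easy to read, not a real FS2024 tile size.

```python
def mbytes_to_mbits(mb_per_s):
    """Convert megabytes per second to megabits per second (1 byte = 8 bits)."""
    return mb_per_s * 8


def transfer_seconds(size_mb, rate_mbps):
    """Seconds needed to copy size_mb megabytes at rate_mbps megabits per second."""
    return size_mb * 8 / rate_mbps


ssd_mbps = mbytes_to_mbits(160)  # 1280 Mbps, the local SSD figure quoted above
size_mb = 100                    # hypothetical payload, for illustration only

for label, rate in [("local SSD", ssd_mbps), ("network (fast)", 100), ("network (slow)", 50)]:
    seconds = transfer_seconds(size_mb, rate)
    print(f"{label:15s} {rate:5d} Mbps -> {seconds:.3f} s")
```

With these numbers the local copy is between 12.8 and 25.6 times faster than the network copy, which is where the “at least one order of magnitude” comes from.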

(K) is the worst case copy path. Basically every cache layer failed to “hit” the data but rather produced a “miss”. The data now needs to get copied all the distance from the “Origin” to the “Edge” to satisfy its needs (desires?).

The cache hit-miss ratio matters … a lot … at every step of the data copy path.
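The fall-through from (G) to (K) can be sketched as a chain of lookups. This is only a toy model under loud assumptions: the layer names are taken from path (K), the keys are made up, the caches are unbounded sets without eviction, and nothing here claims to reflect how FS2024 actually implements its caches.

```python
# Layers ordered from nearest to farthest, as in path (K) read right to left.
LAYERS = ["GPU Cache", "RAM Cache", "RC.ccc", "CDN Cache", "Origin"]
PATH_LABEL = dict(zip(LAYERS, "GHIJK"))  # which path above serves a hit at each layer


def lookup(key, caches):
    """Find the nearest layer holding key, copy it into all nearer layers,
    and return (path label, layer name)."""
    for depth, name in enumerate(LAYERS):
        if key in caches[name]:
            for nearer in LAYERS[:depth]:
                caches[nearer].add(key)  # fill the caches on the way back
            return PATH_LABEL[name], name
    raise KeyError(key)


caches = {name: set() for name in LAYERS}
caches["Origin"].update({"tile-a", "tile-b"})  # hypothetical keys
caches["RC.ccc"].add("tile-a")

print(lookup("tile-a", caches))  # ('I', 'RC.ccc')
print(lookup("tile-a", caches))  # ('G', 'GPU Cache')
print(lookup("tile-b", caches))  # ('K', 'Origin')
```

The second request for the same key is served by path (G), because the first request filled every nearer layer on the way back. Real caches are bounded, so an eviction policy decides how long that luck lasts.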

I will do a more detailed analysis of the cache hit-miss ratio of the RC soon. But after the silly conceptual rant from above, I think I should turn to a more concrete and practical example: the CDN servers.

As an old goose I can remember that back in 1999 the company Akamai, and the idea of a CDN, was the “new hot tech”.

FS2024 is heavily based on Microsoft Azure for data storage and data delivery. Azure has had a partnership with Akamai since 2019.

During the “mea culpa” developer stream after the launch of FS2024 the CDN topic was front and center … basically the blame was on the CDN. While the following slide was presented, the discussion was on the HTTP status codes and the (small) number of failures.

Now the dominant HTTP code is 206 (Partial Content), and that is interesting, because it indicates that the client requests byte ranges and Azure delivers the binary data in multiple chunks. However, far more interesting to me than the HTTP status codes are the “Edge to Origin” ratios.
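Status 206 is the “Partial Content” reply a server sends to an HTTP range request. As a minimal sketch of what such chunking looks like on the request side, the helper below (a hypothetical name, not anything from FS2024) splits an object into `Range` header values. The sizes are made up; how the sim really sizes its requests is not known to me.

```python
def byte_ranges(total_size, chunk_size):
    """Yield HTTP Range header values that cover total_size bytes
    in pieces of at most chunk_size bytes."""
    for start in range(0, total_size, chunk_size):
        end = min(start + chunk_size, total_size) - 1  # range ends are inclusive
        yield f"bytes={start}-{end}"


# A 10 byte object fetched in 4 byte chunks (toy sizes for readability).
print(list(byte_ranges(10, 4)))  # ['bytes=0-3', 'bytes=4-7', 'bytes=8-9']
```

Each of those header values would produce one 206 response, so a single large object shows up as many 206 entries in the server statistics.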

The entire point of a CDN is to reduce the stress on the Origin. During a major media streaming or software release event the “Edge to Origin” ratio can be way above 1 Million to 1. HTTP video streaming protocols even get designed in such a way to support as much CDN caching as possible.

In the FS2024 slide above the highest “Edge to Origin” ratio is around 33 to 1 (hit to miss).
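To make that ratio more tangible, here is the arithmetic under my reading of it as “hits per miss”, which is an assumption, since the slide itself was not explained. The function name is hypothetical.

```python
def miss_fraction(hits_per_miss):
    """Fraction of requests that fall through to the Origin,
    given a hit-to-miss ratio of hits_per_miss to 1."""
    return 1 / (hits_per_miss + 1)


for ratio in (33, 1_000_000):
    share = miss_fraction(ratio)
    print(f"{ratio}:1 -> {share:.4%} of requests reach the Origin")
```

At 33 to 1 roughly 2.9% of all requests still end up at the Origin. At 1 Million to 1 it is about 0.0001%. So per request, the Origin in the FS2024 slide carries tens of thousands of times more load than in the media streaming case.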

Now, Asobo has not told us what this slide is really showing. So as usual I am just guessing and I might be completely wrong.

But to me that dashboard screenshot indicates that even the CDN has a hard time taking stress away from the Origin. And I want to make it as clear as possible: I am not blaming the FS2024 team for this. It is simply the nature of the problem of “flying everywhere on the planet”. It is a problem that does not really fit the CDN approach.

With so many different ideas about where to fly and how to fly on this large planet, only a few users will request the exact same data at roughly the same time. CDN caching in FS2024 helps, but it cannot do magic … most likely also because the cache size at the CDN is only a tiny tiny fraction of the Origin data.
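A toy simulation can illustrate that point. Everything in it is an assumption: I use an LRU cache as a stand-in (the real eviction policies of the CDN are not known to me), the request counts and catalogue sizes are invented, and requests are drawn uniformly at random with a fixed seed.

```python
import random
from collections import OrderedDict


def lru_hit_ratio(requests, capacity):
    """Replay requests through an LRU cache of the given capacity
    and return hits divided by total requests."""
    cache = OrderedDict()
    hits = 0
    total = 0
    for key in requests:
        total += 1
        if key in cache:
            hits += 1
            cache.move_to_end(key)         # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used key
    return hits / total


rng = random.Random(42)
N_REQUESTS, CAPACITY = 20_000, 100

# "Release day": everyone asks for the same 200 objects.
popular = [rng.randrange(200) for _ in range(N_REQUESTS)]
# "Fly anywhere": requests spread over 100,000 different tiles.
spread = [rng.randrange(100_000) for _ in range(N_REQUESTS)]

print(f"popular catalogue: {lru_hit_ratio(popular, CAPACITY):.3f}")
print(f"spread catalogue:  {lru_hit_ratio(spread, CAPACITY):.3f}")
```

With the same cache size, the small popular catalogue gets a hit on a large share of requests, while the widely spread catalogue almost never does. The cache did not get worse; the requests simply stopped overlapping.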

All this leads me to the conclusion that local “on Edge” data caching is extremely important. And the RC needs to play a major role here, especially for the pilots who have to live with very limited local storage (like on Xbox).

To summarize my claims from above:

  • Distance matters. So the longer the distance our data has to travel, the less “Ultra” the results will be, as bandwidth becomes ever more limited.
    • Local caching is the key to streaming solutions.
  • Someone else’s computers do not come for free.
    • We as FS2024 users have to pay for CDN and Origin servers in the rainy cloud in one way or the other.
  • FS2024 needs to work as hard as possible to reduce the stress on the “Origin” servers,
    • … which can otherwise (and today do) get overwhelmed by 10,000 to 100,000 plus simultaneous users.
  • FS2024 needs the most advanced local data caching solution that is possible.
    • It is not there yet.