Good morning, @SPowell42
First of all: your scenario of “flying around (spawning at different locations, with different characteristics (photogrammetry, high geometry count, high count of AI traffic, …)) and detecting performance issues (and/or memory leaks)” is of course doable. Automated tests are not “limited” in that respect (but let’s not dive into Turing completeness and the like ;))
And yes, I was using the term “integration test”, fully aware that I was opening a can of worms here: there are indeed multiple levels at which automated tests can be applied, starting from unit tests, integration tests, service tests, UI tests, … and depending on whom you ask the same tests go by different names - but the following “test pyramid” is a generally accepted “test model”: The Practical Test Pyramid
But again, this is a flight simulator forum and I did not want to dive too deep into automated testing. You are completely correct, though: I should have used the term “automated test” - let’s call it the “family name” for all the different kinds of tests that exist.
My actual point is: while your test scenario seems absolutely reasonable, in practice it is not very helpful. Why? Let’s focus for a moment on “when” you are going to run this test. Being “automated”, it should run “as often as possible”, ideally before a given code change is integrated into the “main branch” (or any “stable branch”).
Now here is problem number one: if your automated test runs for hours, it is completely impractical to delay every code change for the same amount of time. What if the test fails? The programmer needs to detect the problem, fix it, check in again… and wait another couple of hours. But more often than not you also get “false rejects”, because some test simply crashed (yes, welcome to reality™ - tests themselves are also “just code” and have bugs etc.), the integration server was rebooted or whatever Murphy’s Law dictates just before the “go release”…
So the “solution” to problem number one is to run the automated tests (the long running ones) “overnight”, aka “asynchronously”. But what if tests have failed the next morning? Which of the dozens of changes that were checked in the day before (or even the day before that, in case the test did not run last night) is to blame? And the finger pointing game starts… sure: with only a dozen or so changes it is reasonable to expect that you can filter out the responsible change (not all changes modify “shader code” or “geometry construction”).
But what if a designer decided that “8K resolution textures simply look better”? And now, due to limited VRAM, the game engine is constantly swapping texture data between RAM and VRAM, creating a high load on the memory bus of the GPU? My point is: sometimes a seemingly innocent change made “far, far away from the problem hot spot” is responsible.
And now you still need to go over all those dozens of changes from yesterday because a given performance test failed (and again, what if the performance test simply failed because Windows 10 decided to run an automated test (*) in the background? Sure, you might disable automatic updates on test machines, but that is just one example of the dozens of operating system processes that you may not be able to control 100%…).
(*) UPDATE: Hi hi, I was so into testing… of course I meant “update” here. But I leave my typo uncorrected here, for the laugh of it (and who knows what Windows 10 is really doing in the background, after all ;))
Which leads us to the bigger problem number two: let’s assume you made a “test flight” between some well-defined locations. And now the test claims that “the average frames per second is 20% lower than it should be”. Now what? Is it consistently lower? Only at one given “hot spot”? And most importantly: why? Is it the brand new 8K livery texture the designer team integrated? The new LOD algorithm which has a problem with certain geometry data? The new shader code which renders depth of field effects? The new autopilot code? The new… you get the idea!
Such a “fly for one hour” test would tell you almost nothing about the root cause! And whenever you try a fix for a certain aspect (“maybe it was the 8K texture after all. Let’s try…”) you would have to run the same test for hours again! Just to figure out “no, it wasn’t the 8K texture. Let’s try disabling the autopilot this time…”.
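Just to make that concrete, here is a rough sketch of what such a coarse check boils down to (the helper function and the frame time data are made up for illustration, this is not any real simulator’s test code): the only output is a single pass/fail on one aggregate number, which points at no subsystem at all.

```cpp
// Coarse "performance regression" check, as described above (illustrative only).
// The frame times would come from a scripted, hours-long test flight; here they
// are just a placeholder vector.
#include <cassert>
#include <numeric>
#include <vector>

bool averageFpsWithinBudget(const std::vector<double>& frameTimesSeconds,
                            double baselineFps,
                            double allowedRegression = 0.10) {
    // Average FPS over the whole flight = number of frames / total time.
    const double totalTime = std::accumulate(frameTimesSeconds.begin(),
                                             frameTimesSeconds.end(), 0.0);
    const double averageFps = frameTimesSeconds.size() / totalTime;
    return averageFps >= baselineFps * (1.0 - allowedRegression);
}

int main() {
    // Placeholder data standing in for a recorded test flight: ~45 FPS throughout.
    std::vector<double> frameTimes(1000, 1.0 / 45.0);
    // "Lower than it should be" - but the test cannot tell you *why*.
    assert(averageFpsWithinBudget(frameTimes, 60.0) == false);
    return 0;
}
```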
In other words, an automated test should:
- be quickly reproducible (ideally in the order of milliseconds for unit tests, seconds for integration tests…)
- test a well-defined aspect (= functionality, module, algorithm, …) where…
- … the test input is as limited as possible and…
- … the expected outcome is well defined
Which leads us again to the aforementioned test pyramid: unit tests test a very specific functionality, with a granularity as low as “per method (function)”. They usually run very quickly, and so you want to have a lot of unit tests. E.g. you would test the function which returns “triangles, based on surface points and their distance to the camera” (the “LOD algorithm”).
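A hedged sketch of what such a unit test could look like (the function name and the exact LOD rule are invented here purely for illustration): tiny, fully deterministic input, one exact expected output, and it runs in microseconds.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical LOD rule: halve the triangle budget for every 1000 m of
// camera distance, but never go below a minimum of 16 triangles.
std::size_t lodTriangleCount(std::size_t fullDetailTriangles, double cameraDistanceMeters) {
    std::size_t triangles = fullDetailTriangles;
    for (double d = cameraDistanceMeters; d >= 1000.0 && triangles > 16; d -= 1000.0) {
        triangles /= 2;
    }
    return triangles < 16 ? 16 : triangles;
}

int main() {
    // Three tiny, fully deterministic cases - if one of them fails, you know
    // it is the LOD rule that broke and nothing else (certainly not an 8K livery).
    assert(lodTriangleCount(1024, 0.0) == 1024);    // close up: full detail
    assert(lodTriangleCount(1024, 2500.0) == 256);  // two halvings
    assert(lodTriangleCount(1024, 50000.0) == 16);  // far away: clamped minimum
    return 0;
}
```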
An “integration test” would perhaps simply test whether a given glass display is rendered correctly etc.
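The basic idea behind such a rendering test could look roughly like this (again, everything here is a stand-in, not a real engine API): render the display into an offscreen buffer and compare it against a stored “golden” reference image, with a small per-pixel tolerance.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

struct Image {
    int width;
    int height;
    std::vector<std::uint8_t> pixels; // grayscale for simplicity
};

bool matchesReference(const Image& rendered, const Image& reference, int tolerance = 2) {
    if (rendered.width != reference.width || rendered.height != reference.height)
        return false;
    for (std::size_t i = 0; i < rendered.pixels.size(); ++i) {
        // A pixel deviating by more than the tolerance fails the whole test.
        if (std::abs(rendered.pixels[i] - reference.pixels[i]) > tolerance)
            return false;
    }
    return true;
}

int main() {
    // Placeholder images standing in for "freshly rendered" vs. "golden reference".
    Image rendered{2, 1, {100, 200}};
    Image reference{2, 1, {101, 199}};
    return matchesReference(rendered, reference) ? 0 : 1; // within tolerance -> passes
}
```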
But in any case: every automated test needs to have a very specific test data input and a well-defined output, so that one can quickly reproduce the test and immediately know “which system is affected”. E.g. when the “LOD algorithm” test fails you know that it can’t be because of the 8K textures, because the livery textures are not an input of the LOD algorithm (and the LOD algorithm can’t even access the livery textures etc.).
That’s what I meant by saying that “flying around for hours and testing performance regressions is not very practical”.
BUT: There is one more test category, which is called “smoke test” (that term really comes from plumbing): there you do not test a specific functionality, but rather “the entire system” (or “the overall behaviour” etc.). So I agree that your test scenario could certainly fall into that category…
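…and a smoke test really is as coarse as it sounds. A very rough sketch (the executable name and command line flags are purely hypothetical, not any real simulator’s CLI): start the whole system with a scripted flight and only check that it comes back alive.

```cpp
#include <cstdlib>
#include <iostream>

int main() {
    // Hypothetical "--script" option: fly a predefined route, then quit.
    const int exitCode = std::system("./flightsim --headless --script smoke_route.json");
    if (exitCode != 0) {
        std::cerr << "Smoke test failed: the simulator did not exit cleanly\n";
        return 1;
    }
    std::cout << "Smoke test passed: the system as a whole starts, flies and shuts down\n";
    return 0;
}
```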