Improve call counting mechanism by kouvel · Pull Request #1457 · dotnet/runtime
@noahfalk: How does this change our memory usage relative to before?
The latest commit removed the `Completed` stage and instead deletes the call counting infos, freeing that memory sooner and bringing it a bit closer to getting a fair comparison between deleting and not deleting stubs.
When not deleting stubs, I also made a local change to not create forwarder stubs. That change is not included here, since I didn't intend for the config var to change too much of what happens underneath.
Here's some data on memory usage before the change, and after the change with and without deleting stubs where applicable. Committed memory was averaged over 8 runs after removing upper outliers and following a forced GC.
Call 10,000 empty nonvirtual methods, repeat 50 times in total with 100 ms gaps in-between
|                | Before | No delete | Diff | Delete | Diff |
| -------------- | ------ | --------- | ---- | ------ | ---- |
| Committed (KB) | 21128  | 21337     | 209  | 21119  | -10  |
~210 KB higher when not deleting stubs, no change when deleting stubs. Stubs are deleted once.
Stub counts:
|                  | No delete | Delete |
| ---------------- | --------- | ------ |
| Count total      | 10382     | 10418  |
| Completed total  | 10052     | 10052  |
| Count at end     | 10382     | 87     |
| Completed at end | 10052     | 4      |
| Active at end    | 330       | 83     |
| Forwarders       | 0         | 36     |
CscRoslynSource, repeat 16 times in the same process
|                | Before | No delete | Diff | Delete | Diff |
| -------------- | ------ | --------- | ---- | ------ | ---- |
| Committed (KB) | 68503  | 69169     | 666  | 68561  | 58   |
~660 KB higher when not deleting stubs, ~60 KB higher when deleting stubs. Stubs are deleted once.
Stub counts:
|                  | No delete | Delete |
| ---------------- | --------- | ------ |
| Count total      | 17279     | 19882  |
| Completed total  | 12380     | 13493  |
| Count at end     | 17279     | 2654   |
| Completed at end | 12380     | 862    |
| Active at end    | 4902      | 1792   |
| Forwarders       | 0         | 5643   |
The slight memory overhead when deleting stubs would probably linger, as the threshold for deleting stubs again would likely not be reached for a while at least. The heuristics for deletion can be improved later if desired, by using a timer to coalesce deletions and by decreasing the threshold.
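To illustrate what that possible improvement could look like, here's a minimal C++ sketch of a coalescing timer. It's a hypothetical model, not code from this change; the threshold, quiet period, and all names are made up for illustration.

```cpp
// Hypothetical model of the "timer for coalescing" idea mentioned above; it is
// not part of this change. Completed stubs are only deleted once enough have
// accumulated AND no new completions have arrived for a quiet period, so
// bursts of completions coalesce into one deletion pass.
#include <chrono>
#include <cstddef>
#include <iostream>

struct CoalescingDeleterModel
{
    using Clock = std::chrono::steady_clock;

    static constexpr std::size_t Threshold = 1024;        // lower than today's 4 K
    static constexpr std::chrono::seconds QuietPeriod{5}; // illustrative value

    std::size_t completedStubCount = 0;
    Clock::time_point lastCompletion = Clock::now();

    void OnStubCompletedCounting()
    {
        ++completedStubCount;
        lastCompletion = Clock::now();
    }

    // Polled periodically, e.g. by a tiering background worker.
    bool ShouldDeleteNow() const
    {
        return completedStubCount >= Threshold &&
               Clock::now() - lastCompletion >= QuietPeriod;
    }
};

int main()
{
    CoalescingDeleterModel model;
    for (int i = 0; i < 2000; ++i)
        model.OnStubCompletedCounting();
    // Still inside the quiet period, so deletion is deferred and coalesced.
    std::cout << std::boolalpha << model.ShouldDeleteNow() << '\n'; // prints false
}
```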
Msix-catalog shortly following startup
This is a bit interesting because the point shortly after the app starts is also the point when call counting begins, so not much has had a chance to complete call counting by then, and stubs are not deleted immediately.
|                | Before | After | Diff |
| -------------- | ------ | ----- | ---- |
| Committed (KB) | 81922  | 82587 | 664  |
~660 KB higher after the change (with or without deleting stubs).
Stub counts:
|                 | After |
| --------------- | ----- |
| Count total     | 13081 |
| Completed total | 3     |
| Forwarders      | 2845  |
Using the app a little causes most of the stubs to be deleted fairly quickly. After that, the total stub count ranges between 2000 and 6000. Improving the deletion heuristics as described above might keep the memory overhead lower if desired. However, the committed memory was not stable after using the app a bit, so I didn't include numbers for that. The overhead would probably be measurable with many samples and some analysis, scaling with the total stub count, but it seems to fall within the error range.
@davidwrighton: Could you explain the logic around `DeleteAllCallCountingStubs`? Why do we do this, and what data drove doing that?
Added some data above. It was done to make a reasonable attempt to keep memory utilization similar to before and not leak the usually temporary memory from call counting stubs, as the memory overhead from stubs can be noticeable. In a larger app with ~100,000 methods being called, the overhead would reach into the several-MB range. The main and intended tradeoff is that methods that have not completed counting but are still being called (just less frequently) will have to go through the prestub again to reinstall the call counter.
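To make that tradeoff concrete, here's a minimal standalone C++ model (not CoreCLR code; the threshold and all names are illustrative) of a method whose counting stub is deleted mid-count and whose next call therefore takes the prestub path again:

```cpp
// Minimal model (not CoreCLR itself) of the tradeoff described above: if a
// method's call counting stub is deleted before counting completes, the next
// call falls back to the "prestub" path, which reinstalls the stub before
// counting can continue.
#include <iostream>
#include <memory>

struct MethodModel
{
    int remainingCalls = 30;           // illustrative promotion threshold
    std::unique_ptr<int> countingStub; // stands in for a call counting stub

    void Call()
    {
        if (!countingStub)
        {
            // Prestub path: pays the cost of recreating the counting stub.
            std::cout << "prestub: reinstalling call counting stub\n";
            countingStub = std::make_unique<int>(0);
        }
        if (--remainingCalls == 0)
            std::cout << "counting complete, schedule tier-1 jitting\n";
    }

    void DeleteStub() { countingStub.reset(); } // the heuristic deleted the stub
};

int main()
{
    MethodModel method;
    method.Call();       // first call installs the stub via the prestub
    method.DeleteStub(); // deletion fires before counting completed
    method.Call();       // this call pays the prestub cost again
}
```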
Under what situations?
Every time the background JIT queue empties, if the count of call counting stubs whose methods have completed counting is >= 4 K, the call counting stubs are deleted. This occurs fairly sparsely, and only once (if at all) in the larger benchmarks that I have run.
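For clarity, here's a rough sketch of that trigger as a simplified model. It is not the actual runtime code; only the 4 K threshold and the "queue drained" trigger come from the description above, and the type and member names are made up.

```cpp
// Illustrative model of the deletion heuristic described above: when the
// background tiering work queue drains, completed call counting stubs are
// deleted only once their count reaches a threshold.
#include <cstddef>
#include <iostream>

struct CallCountingManagerModel
{
    static constexpr std::size_t DeletionThreshold = 4096; // the 4 K from the text

    std::size_t completedStubCount = 0; // stubs whose methods finished counting

    // Called when the background JIT queue becomes empty.
    void OnBackgroundWorkQueueDrained()
    {
        if (completedStubCount >= DeletionThreshold)
        {
            DeleteCompletedStubs();
        }
    }

    void DeleteCompletedStubs()
    {
        std::cout << "deleting " << completedStubCount << " completed stubs\n";
        completedStubCount = 0;
    }
};

int main()
{
    CallCountingManagerModel model;
    model.completedStubCount = 5000;
    model.OnBackgroundWorkQueueDrained(); // threshold reached -> stubs deleted
}
```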
What stress modes need to be written to ensure that deleting stubs won't put us in a bad state?
I think the larger perf tests like CscRoslynSource might be good candidates for stressing, running multiple iterations in a process and multiple processes over a period of time, with thresholds decreased to delete stubs more frequently and to transition tiers more quickly. I'll kick that off for a sanity test.
For instance, I don't have complete confidence that we have tracking for all cases where a virtual function pointer is captured somewhere, and we could potentially be in a situation where we end up deleting code underneath ourselves. How do I force this to happen for a set of methods and test that the deleting didn't cause a problem?
The main issues I was concerned with were:
- The vtable slot value may be fetched already in a GC-safe point (possible on arm64/arm32, may be possible on x64/x86 depending on code gen)
  - This is prevented by using forwarder stubs (precodes) for vtable methods. The vtable slot never points to a call counting stub; it points to a forwarder stub instead. Since the runtime cannot be suspended inside a forwarder stub or a call counting stub, suspension anywhere in the caller or callee would allow deleting stubs (see the sketch below).
- The processor may have cached precode targets
  - This is prevented by issuing the necessary barriers after targets are updated and before resuming the runtime. Threads in preemptive GC mode use a lock for synchronizing the relevant data.
For any case that requires a multi-callable entry point to a vtable method, a FuncPtrStub is used for tiering purposes. For single-callable cases, the call goes through the vtable slot.
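To make the indirection concrete, here's a standalone C++ sketch of the idea using standard atomics in place of the real stub/precode machinery. The names and memory-order choices are illustrative and do not reflect the runtime's actual synchronization (which suspends threads and issues barriers).

```cpp
// Standalone sketch (standard C++ atomics, not the runtime's stub/precode
// code) of the forwarder indirection described above. The "vtable slot" only
// ever holds the forwarder; the forwarder's target is what gets repointed.
#include <atomic>
#include <iostream>

using Code = void (*)();

void CallCountingStub() { std::cout << "counting stub: count, then forward to the code\n"; }
void Tier1Code()        { std::cout << "tier 1 code\n"; }

struct ForwarderStub
{
    std::atomic<Code> target{&CallCountingStub};
    void Invoke() const { target.load(std::memory_order_acquire)(); }
};

int main()
{
    ForwarderStub forwarder;
    ForwarderStub *vtableSlot = &forwarder; // the slot never points at a counting stub

    // A caller that fetched the slot value earlier still enters through the
    // forwarder, so it cannot land inside a stub that is about to be deleted.
    vtableSlot->Invoke();

    // When counting completes (or stubs are deleted), only the forwarder's
    // target is updated; the release store stands in for the barriers issued
    // before threads are resumed.
    forwarder.target.store(&Tier1Code, std::memory_order_release);
    vtableSlot->Invoke();
}
```

The point being modeled: callers only ever hold the forwarder's address, so retargeting the forwarder and then deleting the counting stub cannot leave a caller headed into freed code.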
@noahfalk: How does this change the startup perf of an app running on a single core?
MusicStore affinitized to 1 processor
Similar to when not affinitized, except call counting happens a bit later (the delay is increased) and with much larger spikes. The y-axis is on a log10 scale.
Although the graph only shows two samples, there is no change to startup perf in this mode over multiple runs.
The call counting overhead is decreased, but otherwise it's not much better or worse than before. There are a couple of improvements I have in mind for these types of cases for the future.
Some normal startup numbers:
|              | Before (ms) | After (ms) | Diff % |
| ------------ | ----------- | ---------- | ------ |
| MusicStore   | 963.250     | 964.250    | 0.1%   |
| CscRoslyn    | 1675.733    | 1674.090   | -0.1%  |
| Msix-catalog | 1076.580    | 1076.011   | -0.1%  |
@noahfalk: You and @brianrob probably want to agree if any additional performance regression testing is necessary.
@brianrob, thoughts?
@noahfalk, if you haven't already, I would appreciate it if you could review the last 3 commits.
Hopefully I have answered all of the questions above. Please let me know if I missed anything. Thanks!