Guards all StepRegistry mutations to make them thread-safe.#287
Guards all StepRegistry mutations to make them thread-safe.#287jensakejohansson wants to merge 5 commits into
Conversation
| @@ -37,7 +37,10 @@ public Task<StepValidateResponse> Process(int stream, StepValidateRequest reques | |||
| { | |||
| isValid = false; | |||
| errorType = StepValidateResponse.Types.ErrorType.DuplicateStepImplementation; | |||
| errorMessage = string.Format("Multiple step implementations found for : {0}", stepToValidate); | |||
| var implementations = _stepRegistry.MethodsFor(stepToValidate); | |||
| var locations = string.Join("\n", implementations.Select(m => | |||
| $" {m.ClassName}.{m.Name} in {m.FileName}:{m.Span.StartLinePosition.Line + 1}")); | |||
| errorMessage = $"Step: {stepToValidate}\n{locations}"; | |||
There was a problem hiding this comment.
Seems there is a check-then-act possible race condition here since I don't think there is any mutex and I believe this validation could be happening while there are changes being made to to the steps in parallel? is that right?
If so, we might want to consider locking on the stepRegistry mutex directly, or change this to an atomic operation such as StepRegistry.GetMultipleImplementations(stepToValidate) per my other comment so we can lock here too?
| // Guards all StepRegistry mutations. gRPC dispatches CacheFileRequest messages concurrently on separate threads, | ||
| // all sharing this singleton. Without synchronization, concurrent ReloadSteps calls can interleave their | ||
| // RemoveSteps/AddStep operations — one thread's dictionary rebuild overwrites another's, leaving stale entries | ||
| // that cause false "duplicate step implementation" warnings in the IDE. | ||
| private readonly object _registryLock = new object(); |
There was a problem hiding this comment.
Would it be sufficient to lock on the registry object itself rather than using a discrete lock; so we can also lock from other places such as the StepValidationProcessor?
| { | ||
| _stepRegistry.RemoveSteps(filepath); | ||
| LoadStepsFromText(content, filepath); | ||
| } |
There was a problem hiding this comment.
I would suggest that we would be better to split this into two steps, so we can lock as little as possible, load all of the files in as atomicish/quick an operation as possible, and then push the responsibility of synchronization for the bulk operation to the StepRegistry itself rather than using external locking.
i.e (pseudocode)
var stepValues = LoadStepsFromText(content, filepath);
_stepRegistry.ReplaceSteps(filepath, stepValues);And then
public void ReplaceSteps(string filepath filepath, Dictionary<string, GaugeMethod> stepValues)
{
lock (_registry)
{
RemoveSteps(filepath);
foreach (var (stepValue, gaugeMethod) in _stepValues)
{
AddStep(stepValue, GaugeMethod);
}
}
}Then we can move all of the locks into the StepRegistry so it is self-contained, and we can potentially move it to use a ConcurrentDictionary internally where we can further reduce the locking necessary, improve iteration safety and fix the check-then-act issues the code currently has?
We may still have some issues around concurrent file changes while it is processing that are impossible to synchronise on due to external file system changes; but if possible I think it might be easier to reason about without the big external lock, and we can hope for eventual consistency to prevail.
The challenge with 'big locks' is that where you can't actually prevent something changing with the lock (e.g the external file content), in some cases it might make it worse when you can't interleave things, or if there is a huge number of changes made, and lots of events get queued up in quasi-indeterminate orders. This is just theoretical intuition from many years of dealing with concurrency like this though.
Am I missing something? WDYT?
|
Thanks for your insightful feedback @chadlwilson! I agree that locking on StepRegistry would be a better approach and less of a 'quick-fix'. I've updated the PR to the best of my understanding. It became a rather big change. Main things that I tried to address here:
I'm personally not too worried about events getting queued up in this context. As far as I understand this is all very quick in-memory operations and the number of files/steps to process is most likely not that large. What do you think? |
|
@chadlwilson Unclear to me why the FT-ubuntu tests failed. Can you please trigger the workflow again when you have time to re-check? I suspect some intermittent glitch unrelated to this change. Works on Windows both in workflow and locally. |
|
I reran and they are OK. I think the failure here is a regular one which I have not been able to track down. The functional tests are trying to find "Successfully generated html-report" after checking other things, but in some cases it's not there in the console logs - which either indicates a race condition in the design, or some kind of console flushing problem. I really wish I could fix it, as it drives me nuts doing re-runs across the projects. The following is orthogonal to this PR, so ignore it unless you feel like exploring :-) For posterity the raw Can ignore the stuff like this, it's just caused by a best-efforts attempt to take screenshot on the (intentionally) failing spec: Despite its terrible naming and lack of context, the error here is intentional, the test is testing handling of errors thrown in hooks during parallel execution: The actual problem seems to be here We should see the outcome of these two lines from logger.Infof("Successfully generated html-report to => %s\n", filepath.Join(reportDir, "index.html"))
...
logger.Debugf("Done generating HTML report using theme from %s", t)So I guess Gauge is either
|
|
I have no real insight into the test implementations or any idea were to start searching, so I'll have to add this on the todo-list for another time 😅 although unreliable regression is a serious headache. There is already the slow FT tests on Windows and I also noted that if a log-line is over a certain length Gauge fails to handle it properly and logs incorrectly; it logs the line as a raw json string using incorrect log-level - some sort of fallback occurs. Just to name a couple of things. The more I dig, the more I find... but I'll try to contribute the best I can. It's really a shame ThoughtWorks stopped to actively fund this project since the fundamental idea/design principles in my opinion were absolutely brilliant compared to the major 'competitors'. Anyhow, what would be the next step here? Do we feel comfortable enough to merge this? @sriv - you are usually knowledgeable and helpful too - if you have time, do you see any obvious 'red-flags' with this? |
|
I haven't had time to take another proper look, but will get to it. |
I think this flakiness comes from occassional failure to install html-report. I suspect there is some rate limit to the downloads from GH when there's too many concurrent requests. html-report plugin is not mandatory for gauge functionality, but is included as a default plugin, and I believe this message was included to listen on test execution completion. A hack, I admit! Will think of. away to make this better. I think we subsequently added support for hooks which we should now consider using. |
This could be an issue, but i can see html report starting here (just not finishing per the logs) so don't think it's this in the case ive looked at? |
|
Just a friendly reminder, if you've had time to look at this? I've been running this version in my dev. env. since the PR was created without noticing any issues and have not seen any false duplicate step warnings since. Worth noting is that I don't run any parallel executions. |
|
I haven't had the time to go through this in depth. But considering that you've been running this without problems, I will merge this off. |
|
The branch needs to be rebased @jensakejohansson . Please, could you sync this up and we can merge this |
|
Yeah, sorry, my PR review queue across projects got a bit out of control. Since you mentioned you don't run with parallel, I wanted to think through a bit more carefully about whether deadlock was possible, with all the locks - especially as it got a bit more involved than I expected. |
Signed-off-by: jensakejohansson <jens.johansson@systemverification.com>
Signed-off-by: jensakejohansson <jens.johansson@systemverification.com>
…o an single atomic snapshot. Signed-off-by: jensakejohansson <jens.johansson@systemverification.com>
….log Signed-off-by: jensakejohansson <jens.johansson@systemverification.com>
Signed-off-by: jensakejohansson <jens.johansson@systemverification.com>
ae951d9 to
f488037
Compare
Done! |
|
My thoughts while doing this was that StepRegistry has no nested locks. Locking methods does not rely on external callbacks / awaits anything, but should be quick in-memory operations. Also, most locks occurs when editing files - that's when the original problem of not locking was evident - not while executing. Regarding parallelism: From my understanding the procedure of "loading of steps" happens during runner startup and is not affected by parallelism. During actual execution it's the LookupStep-method that will be called intensively - once per step per parallel stream - and may theoretically cause performance problems since it's a hot-path that will be serialized. However, it's a quick memory look-up and a small allocation so for real world scenarios I'm not worried. It would require heavy parallelism and steps that are executed very quickly to cause any noticeable 'delay' and in those situations I guess it still would not be noticeable in relation to other things of greater concern in the gauge/runner design. |
In VSC Gauge shows false positives about 'Duplicate step implementation'.
I've have seen the issue for years across different projects & environments (macOS/Windows etc). It was a nightmare to debug - I added logging in many places - finally it occurred to me it's likely a race condition.
My understanding is that VSC sends gRPC messages for opened files and the these gets dispatched on separate threads. All requests share the same singleton StaticLoader and its StepRegistry - which has no synchronization.
ReloadSteps (called for Opened/Changed events) performs a two-step compound operation:
These two steps were not atomic. When two CacheFileRequest messages are processed concurrently for different files unpredictable results may happen resulting in VSC reporting false "duplicate step implementation" errors for steps that only have a single implementation.
Fix: Lock in StaticLoader: Added a lock (_registryLock) around all public methods that mutate the registry (ReloadSteps, RemoveSteps, LoadStepsFromText). This makes the compound remove-then-add operation in ReloadSteps atomic — no other thread can read or modify the registry between the removal and re-addition of a file's steps.
It's hard to write a deterministic unit test due to the nature of the issue (at least I don't know how to do it). I've used the fix locally and it seems promising.
For the sake of user-friendless I also enriched the error message so VSC can show to the user where the duplicates are actually found. At the moment this message is not used by Gauge(!) - it's part of the proto contract, but ignored and not relayed to VSC. I will do a Gauge PR for that (getgauge/gauge#2837).