Performance: Common Subexpression Eliminator Speedup#16741
Conversation
|
First of all, thanks! Looks very neat, I hope we can merge this within a week or so, just like your other PR. Scheduled the run on our benchmarking setup, https://github.com/argotorg/solc-bench/ I'll post the compare table generated by it, here, soon. |
|
Looks good from a perf perspective :) Only regression is minor, on peak RSS usage, which is still low, and quite minimal. A clear win. |
| auto& candidates = m_replacementCandidates[*_value]; | ||
| if (std::find(candidates.begin(), candidates.end(), _variable) == candidates.end()) | ||
| candidates.emplace_back(_variable); |
There was a problem hiding this comment.
This will affect the order in which candidates are iterated in, so could/will affect the bytecode in the sense that it may now pick a different (although still valid) (sub)expression to be substituted. There are no guarantees on our side as to what the bytecode will actually look like as long as it's semantically correct, so I'm pretty sure this isn't a problem. @cameel can you double check please - this is the only change where I'm slightly uncertain?
There was a problem hiding this comment.
candidates here is an order preserving vector, so the iteration and subsequent emplace_back is deterministic.
m_replacementCandidates is the std:unordered_set with a key type of expression and value type of vector. We just use the unordered_set to find the vector of previously found candidates. Then, this code is just appending to that ordered vector when a new candidate is found. So this particular code should be safe.
Even if the candidates vector were unordered, this check is just checking for presence, which is deterministic regardless of order.
There was a problem hiding this comment.
We're still iterating over the values in the vector (previously set), and the two have different iteration orders (set sort vs insertion), which makes it possible for different (but equivalent) expressions/values to be used as replacements. I'm not saying this is wrong or undesired behaviour, just that it could result in different bytecode between the two approaches, which is why I asked @cameel to double check. From my point of view, this is perfectly fine (but I have been wrong before).
There was a problem hiding this comment.
In pseudo javascript:
m_replacementCandidates = {}
function insert(value, expression){
candidates = m_replacementCandidates[value] // plus magic to insert array if not present
if(!candidates.find(expression)){
candidates.push(expression)
}
}
There was a problem hiding this comment.
Ah - now I understand! The set might have been sorted by something different than its insertion order, even if both are deterministic.
There was a problem hiding this comment.
I think the important bit is not backwards compatible bytecode, but consistent bytecode output for this compiler version. In that case, a different sorting order would be fine, as long as it's deterministic. And this is definitely that.
That said, no tests have needed to change as a result of this, and my local benchmarking on a set of real world contracts has identical bytecode out with this change, as without it.
|
|
||
| Compiler Features: | ||
| * General: Speed up SHA-256 hashing (`picosha2`). | ||
| * General: Speed up compilation times. |
There was a problem hiding this comment.
Sorry, should probably mention that this particular improvement is due to optimizations in the CSE and dataflow analysis. In any case, let's wait until @cameel takes a look, he'll probably suggest a final version of the changelog entry phrasing.
Description
This PR speeds up large file compilation with
--via-irby 5% in my testing (100 runs for each the benchmark and this change). Output bytecode has been identical in my tests.The common subexpression eliminator is the most time consuming of all yul optimizers in my testing, it in turn spends most of its time in dataflow analysis.
The analysis visits each AST node, records changed values, and on each node visits looks through relevant previously recorded values to see if there is a match. Because this runs against every AST node, it generates a lot of reads and writes.
Currently it uses
std:setfor the hot path container that stores what has been seen and looks it up.std:set, in spite of the name, is an ordered tree structure, meaning that each "have I seen this" at that happens at each node involves pointer walking a tree. This is much slower when compared using a hash based lookups withstd: unordered_set. This same speed up of hash vs order preserving tree happens on writes as well.Secondly, when we do find a match, it is overwhelmingly likely to be a single match, not a gigantic list of matches. By returning return a simple vector of results, we remove the overhead of dealing with a tree structure for every result return.
Lastly when clearing values, the previous code first created a new combined ordered and deduplicated set, before removing entries. This involved the overhead of creating a new ordered set, and the writes involved in combining data. Now that clearing is cheap, it's faster to just loop each input and remove if needed. The cost of duplicate ignored erases is less than the cost to build a new deduplicated tree in the first place.
Checklist
AI Disclosure
Codex 5.5