A merge of a bunch of different changes for pytorch.#14
Conversation
stellaraccident
left a comment
There was a problem hiding this comment.
I don't think we need to carry the compiler support forward (likely you just got this from the merge but it was deleted in this branch).
As discussed, since you are flushing your perma-branches, I am ok with this going in as-is. But this needs unit and/or CTS tests, especially for things like the ELF/CCOB manipulation. Please file an issue so we don't lose track and can get this test coverage in. Not blocking this one so we can get out of branch debt.
| @@ -0,0 +1,69 @@ | |||
| // Copyright 2026 The HRX Authors | |||
There was a problem hiding this comment.
I think we dropped this entirely (and the cc file).
This is a roll-up of a significant number of changes that are required for getting pytorch running on top of hrx.
03ca5f8 to
6ff95b5
Compare
zjgarvey
left a comment
There was a problem hiding this comment.
I tried to read the bits which seemed important for me to understand better: e.g., kernel launching, and command buffer/stream interplay handling.
I have a few other questions, but could probably wait.
|
|
||
| while (remaining > 0) { | ||
| size_t this_chunk = remaining < chunk_size ? remaining : chunk_size; | ||
| iree_status_t iree_status = iree_hal_device_transfer_h2d( |
There was a problem hiding this comment.
Isn't this API fully synchronous? Why is this section labelled Async Stream Transfers? I guess the same below for d2h. I wanted to understand how async transfers worked between host and device and I'm kind of confused here. Are the async transfers just not implemented yet?
There was a problem hiding this comment.
Maybe this is a terminology question more than anything: does "host" literally mean "host thread" so it must inherently block on the host thread? Or can some separate thread on the CPU do the transfer and signal to the host thread when it's done? Not sure how important this is unless we are paging data for a training job?
There was a problem hiding this comment.
So I will answer both of the questions in ordedr
- Are the async transfers just not implemented yet? << Yes this is the case. This was originally used by the streaming layer, but then we ended up writing it ourselves in the streaming layer. (There is more work to be done to move more of binding/common to use the libhrx functions instead of iree functions directly.
- "host" here means "memory residing in system RAM not GPU ram"
| stream->submitted_value = signal_value; | ||
| } | ||
| } else { | ||
| status = iree_hal_command_buffer_dispatch( |
There was a problem hiding this comment.
This writes the kernel launch into the stream->command_buffer, correct?
I'm trying to understand what the lock stream->mutex is being used for (It's only used in stream_begin and stream_flush), and since this isn't under lock, I'm wondering if it possible that this could be interleaved with a stream_flush from a different thread to result in a UAF/nullptr deref or something?
There was a problem hiding this comment.
So it is legal to send commands to a hip stream from multiple threads at once. So, this is SOME work in the API to handle this, but it is incomplete.
I have filed #26 to add tests and fix the implementation here.
This is a roll-up of a significant number of changes that are required for getting pytorch running on top of hrx.