Cudastf #794

Status: Draft. Wants to merge 41 commits into base: main.

Changes shown from 1 commit.

Commits (41):
d6dc01d  Update build config to pull CUDASTF (sidelnik, Nov 4, 2024)
245b20f  remove const expr (sidelnik, Nov 4, 2024)
9b35ec8  Updates to get basic cudastf functionality working with matx (sidelnik, Nov 5, 2024)
7d298d4  move to void_interface (sidelnik, Nov 5, 2024)
154b3f9  add stf executor (sidelnik, Nov 5, 2024)
c8ef988  support for cgsolve operator and a few examples (sidelnik, Nov 5, 2024)
52b18c9  make the sync() that is part of stfexecutor call ctx.task_fence() (sidelnik, Dec 3, 2024)
d726b10  fix typo (sidelnik, Dec 3, 2024)
5e7576c  Added test case (sidelnik, Dec 17, 2024)
1373699  Fixes to the sync (sidelnik, Dec 17, 2024)
92e7204  add support for cgsolve (sidelnik, Dec 17, 2024)
a608f3f  update to the simple radar code (sidelnik, Dec 17, 2024)
b062577  minor typo fix (sidelnik, Dec 17, 2024)
bbf9abc  update version of stf (sidelnik, Dec 19, 2024)
3e831ea  cleanup constexpr case for stfexecutor (sidelnik, Dec 19, 2024)
702fe79  cleanup constexpr case for stfexecutor (sidelnik, Dec 19, 2024)
5bfe21e  add conditional support for cudagraph to the stf executor (sidelnik, Dec 19, 2024)
f407256  update to latest cudastf (sidelnik, Jan 9, 2025)
221599c  switch to use logical token (sidelnik, Jan 9, 2025)
7a5bb6c  update parameters for radar code (sidelnik, Jan 9, 2025)
0c2432f  update to radar code to work with command line args (sidelnik, Jan 9, 2025)
3ae267b  cleanup to support different executor (sidelnik, Jan 9, 2025)
6a75794  cleanup radar code to emit stf and cuda versions (sidelnik, Jan 24, 2025)
f1facca  test script that runs simple radar with different input sizes. output… (sidelnik, Jan 24, 2025)
0199e75  enable cuda graphs as a command line argument enableGraphs (sidelnik, Jan 24, 2025)
39b16f4  add support for the random/randomOp generator (sidelnik, Jan 27, 2025)
9b7c4b0  get the basic spectrogram code working with stf (sidelnik, Jan 27, 2025)
f9e09f1  get spectrogram cudagraph code working with stf (sidelnik, Jan 27, 2025)
6c9a791  add assert in the case stream capture is turned on if creating a plan (sidelnik, Feb 10, 2025)
a1efd1c  Merge branch 'cudastf' into cudastf_latest (sidelnik, Mar 19, 2025)
6437eab  Merge pull request #2 from sidelnik/cudastf_latest (sidelnik, Mar 19, 2025)
bbb9aae  Apps using matx with stf should get these flags (caugonnet, Mar 24, 2025)
e13c9b6  fix constructor (caugonnet, Mar 24, 2025)
7244399  fix typo/bug (sidelnik, Apr 21, 2025)
66f6850  update to example code to fix compile error (sidelnik, Apr 21, 2025)
89e2a43  update to example code to fix compile error (sidelnik, Apr 21, 2025)
973886b  update test script for radar code (sidelnik, Apr 21, 2025)
92885e7  temp fix to the allocator dtor (sidelnik, Apr 21, 2025)
8607840  remove warning to work with latest stf (sidelnik, Apr 21, 2025)
14e0985  replace logical token with token (sidelnik, Apr 21, 2025)
92e04d5  update version to use cccl from main (sidelnik, Apr 21, 2025)
Commit under review: c8ef988d1e1987f320c4be256dae36621b7c02f2, "support for cgsolve operator and a few examples" (committed by sidelnik on Nov 5, 2024).
9 changes: 9 additions & 0 deletions examples/cgsolve.cu
@@ -54,7 +54,12 @@ int main([[maybe_unused]] int argc, [[maybe_unused]] char **argv)
auto norm = make_tensor<TypeParam, 1>({BATCH});
auto maxn = make_tensor<TypeParam>({});

#if 0
cudaExecutor exec{};
#else
stfExecutor exec{};
auto ctx = exec.getCtx();
#endif

// Simple Poisson matrix
for(int b = 0; b < BATCH; b++) {
@@ -83,6 +88,10 @@ int main([[maybe_unused]] int argc, [[maybe_unused]] char **argv)
(maxn = matx::max(sqrt(norm))).run(exec);

exec.sync();
#if 1
ctx.finalize();
Review thread on ctx.finalize() vs. exec.sync():

Collaborator: What is finalize used for vs. sync? Could you hide the context in the executor so the user doesn't need it, and have exec.sync() call finalize()?

Reply: finalize terminates everything in the STF context: it waits for asynchronous tasks, deletes internal resources, and so on, and you can only do it once. sync is closer to a ctx.task_fence(), which is a non-blocking fence (it returns a CUDA stream, and waiting on that stream means everything was done). I'd like to move finalize to the dtor of the executor, but there are some caveats if you define the executor as a static variable. Is that allowed? The caveat might be some inappropriate unload ordering of the CUDA and STF libraries, as usual.

Collaborator: Sounds good. I think the destructor is the right place, but does sync() work as expected?

Reply: @sidelnik is it doing a task fence with a stream sync?

Author: @caugonnet, sync() should be calling ctx.task_fence() now. I agree; I think we should place the ctx.finalize() inside the stf executor dtor.

#endif

// example-end sync-test-1
printf ("max l2 norm: %f\n", (float)sqrt(maxn()));

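A minimal sketch of the idea discussed in the thread above (ctx.finalize() in the executor destructor, exec.sync() built on ctx.task_fence()), assuming the cudax STF header path and the public stream_ctx API. The class name is illustrative; this sketches the proposal, not the code in this commit.

#include <cuda/experimental/stf.cuh>
#include <cuda_runtime.h>

class stfExecutorSketch {
public:
  stfExecutorSketch() = default;   // ctx_ is a default-constructed stream_ctx

  // sync(): issue a task fence and block on the stream it returns.
  // task_fence() itself is non-blocking; waiting on the returned stream
  // means every previously submitted STF task has completed.
  void sync() {
    cudaStream_t fence = ctx_.task_fence();
    cudaStreamSynchronize(fence);
  }

  // finalize() drains outstanding work and frees the context's internal
  // resources; it may only be called once, so the destructor is a natural
  // home. Static-lifetime executors would still need care because of the
  // CUDA/STF library unload ordering mentioned in the thread.
  ~stfExecutorSketch() { ctx_.finalize(); }

private:
  cuda::experimental::stf::stream_ctx ctx_;
};

int main() {
  stfExecutorSketch exec;
  exec.sync();   // no tasks submitted yet; the fence is still valid
  return 0;      // finalize() runs in the destructor
}
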
33 changes: 29 additions & 4 deletions examples/fft_conv.cu
@@ -73,7 +73,12 @@ int main([[maybe_unused]] int argc, [[maybe_unused]] char **argv)
{
MATX_ENTER_HANDLER();
using complex = cuda::std::complex<float>;
#if 0
cudaExecutor exec{};
#else
stfExecutor exec{};
auto ctx = exec.getCtx();
#endif

index_t signal_size = 1ULL << 16;
index_t filter_size = 16;
@@ -117,7 +122,11 @@ int main([[maybe_unused]] int argc, [[maybe_unused]] char **argv)
// Perform the FFT in-place on both signal and filter
for (int i = 0; i < iterations; i++) {
if (i == 1) {
#if 0
cudaEventRecord(start, stream);
#else
cudaEventRecord(start, ctx.task_fence());
#endif
}
(sig_freq = fft(sig_time, filtered_size)).run(exec);
(filt_freq = fft(filt_time, filtered_size)).run(exec);
@@ -129,18 +138,30 @@ int main([[maybe_unused]] int argc, [[maybe_unused]] char **argv)

}

#if 0
cudaEventRecord(stop, stream);
Review thread on event timing:

Collaborator: Eventually we should mask these events behind the executor as well, so the timing is the same regardless of the executor.

Reply: Yes, this makes it look like the code is very different for the two executors, but that timing is the sole reason, especially if finalize is moved to the dtor.

#else
cudaEventRecord(stop, ctx.task_fence());
#endif
exec.sync();
cudaEventElapsedTime(&separate_ms, start, stop);

for (int i = 0; i < iterations; i++) {
if (i == 1) {
cudaEventRecord(start, stream);
#if 0
cudaEventRecord(start, stream);
#else
cudaEventRecord(start, ctx.task_fence());
#endif
}
(sig_freq = ifft(fft(sig_time, filtered_size) * fft(filt_time, filtered_size))).run(exec);
}


#if 0
cudaEventRecord(stop, stream);
#else
cudaEventRecord(stop, ctx.task_fence());
#endif
exec.sync();
cudaEventElapsedTime(&fused_ms, start, stop);

@@ -153,7 +174,11 @@ int main([[maybe_unused]] int argc, [[maybe_unused]] char **argv)
(time_out = conv1d(sig_time, filt1, matxConvCorrMode_t::MATX_C_MODE_FULL)).run(exec);

exec.sync();


#if 1
ctx.finalize();
#endif

// Compare signals
for (index_t b = 0; b < batches; b++) {
for (index_t i = 0; i < filtered_size; i++) {
@@ -172,4 +197,4 @@ int main([[maybe_unused]] int argc, [[maybe_unused]] char **argv)

CUDA_CHECK_LAST_ERROR();
MATX_EXIT_HANDLER();
}
}
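
The timing thread above suggests hiding the CUDA events behind the executor so the example would not need #if/#else around cudaEventRecord. A hedged sketch of one way to do that follows; fence_stream() is a hypothetical accessor (not in MatX) that would return the executor's stream for the CUDA executor and ctx.task_fence() for the STF executor.

#include <cuda_runtime.h>

// Minimal stand-in executor: for the CUDA case the fence stream is simply the
// executor's stream; an STF version would return ctx.task_fence() instead.
struct StreamExecutorStub {
  cudaStream_t stream_{0};
  cudaStream_t fence_stream() const { return stream_; }   // hypothetical accessor
  void sync() const { cudaStreamSynchronize(stream_); }
};

// Records start/stop around a callable and returns elapsed milliseconds,
// identically for either executor type.
template <typename Executor, typename Body>
float time_body_ms(Executor &exec, Body &&body) {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaEventRecord(start, exec.fence_stream());
  body();                                      // enqueue the work to be timed
  cudaEventRecord(stop, exec.fence_stream());

  exec.sync();
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return ms;
}

int main() {
  StreamExecutorStub exec;
  float ms = time_body_ms(exec, [] { /* launch kernels here */ });
  (void)ms;
  return 0;
}

With a helper along these lines, the fused and non-fused timing loops in fft_conv.cu would read the same for cudaExecutor and stfExecutor.
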
3 changes: 3 additions & 0 deletions include/matx/executors/stf.h
@@ -63,10 +63,12 @@ template <typename T> constexpr bool is_matx_set_op();
stfExecutor(cudaStream_t stream) : stream_(stream) {
Review thread on the stream-based constructor:

Collaborator: What does a stream do here? I thought STF had its own internal streams?

Author: @cliffburdick In STF you can create nested/localized contexts and streams from existing (non-STF-created) streams. This allows the STF mechanisms to be correctly synchronized within the existing stream ecosystem. @caugonnet correct me if I am wrong.

cuda::experimental::stf::async_resources_handle handle;
ctx_ = cuda::experimental::stf::stream_ctx(stream, handle);
//ctx_ = cuda::experimental::stf::graph_ctx(stream, handle);
}
stfExecutor(int stream) : stream_(reinterpret_cast<cudaStream_t>(stream)) {
cuda::experimental::stf::async_resources_handle handle;
ctx_ = cuda::experimental::stf::stream_ctx(reinterpret_cast<cudaStream_t>(stream), handle);
//ctx_ = cuda::experimental::stf::graph_ctx(reinterpret_cast<cudaStream_t>(stream), handle);
}

/**
@@ -75,6 +77,7 @@ template <typename T> constexpr bool is_matx_set_op();
*/
stfExecutor() : stream_(0) {
ctx_ = cuda::experimental::stf::stream_ctx();
//ctx_ = cuda::experimental::stf::graph_ctx();
}

/**
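
For readers unfamiliar with the pattern described in the reply above, here is a hedged, standalone sketch of attaching an STF context to an existing user-created CUDA stream, mirroring the stream_ctx(stream, handle) call in the constructor, so that STF-submitted work stays ordered with respect to non-STF work on that stream.

#include <cuda/experimental/stf.cuh>
#include <cuda_runtime.h>

int main() {
  cudaStream_t user_stream;
  cudaStreamCreate(&user_stream);

  // ... non-STF kernels or copies may already be enqueued on user_stream ...

  // Build a context that is localized to the user's stream, as in the
  // stfExecutor(cudaStream_t) constructor above.
  cuda::experimental::stf::async_resources_handle handle;
  auto ctx = cuda::experimental::stf::stream_ctx(user_stream, handle);

  // STF tasks submitted through ctx are ordered after the prior stream work.
  // task_fence() returns a stream that can be used to rejoin the surrounding
  // stream ecosystem (event recording, further enqueues, etc.).
  cudaStream_t fence = ctx.task_fence();
  cudaStreamSynchronize(fence);

  ctx.finalize();
  cudaStreamDestroy(user_stream);
  return 0;
}
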
28 changes: 27 additions & 1 deletion include/matx/operators/all.h
@@ -72,9 +72,35 @@ namespace detail {
return tmp_out_(indices...);
};

template <typename Task>
__MATX_INLINE__ void apply_dep_to_task(Task &&task, int perm=1) const noexcept {
/* Scenario where the all() operator is on the RHS and its reduction has already
run, so tmp_out_ gets read permission as it will only be read from. */
tmp_out_.apply_dep_to_task(std::forward<Task>(task), 1);
}

template <typename Out, typename Executor>
void Exec(Out &&out, Executor &&ex) const {
all_impl(cuda::std::get<0>(out), a_, ex);
auto output = cuda::std::get<0>(out);
// stfexecutor case
if constexpr (!is_cuda_executor_v<Executor>) {
auto ctx = ex.getCtx();
auto tsk = ctx.task();
tsk.set_symbol("all_task");

output.PreRun(out_dims_, std::forward<Executor>(ex));
output.apply_dep_to_task(tsk, 0);
Review comment (Collaborator): Why isn't apply_dep_to_task just part of PreRun? It looks like it's called in the same place.

a_.apply_dep_to_task(tsk, 1);

tsk->*[&](cudaStream_t s) {
auto exec = cudaExecutor(s);
all_impl(output, a_, exec);
};
}
// cudaExecutor case
else if constexpr (is_cuda_executor_v<Executor>) {
all_impl(output, a_, ex);
}
}

static __MATX_INLINE__ constexpr __MATX_HOST__ __MATX_DEVICE__ int32_t Rank()
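
For readers unfamiliar with the STF side of the Exec body above, here is a standalone sketch of the same task pattern in plain CUDASTF (create a task, name it with set_symbol, declare data dependencies, then launch on the stream the task hands back). It is adapted from the public CUDASTF examples rather than from MatX, and declares the dependency at task creation instead of via apply_dep_to_task.

#include <cuda/experimental/stf.cuh>
using namespace cuda::experimental::stf;

__global__ void scale(slice<double> x, double a) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < static_cast<int>(x.size())) {
    x(i) *= a;
  }
}

int main() {
  context ctx;                       // stream-backed by default
  double X[16];
  for (int i = 0; i < 16; i++) {
    X[i] = static_cast<double>(i);
  }

  auto lx = ctx.logical_data(X);     // STF manages transfers for this data

  // Same shape as Exec above: build the task, label it, declare the
  // dependency (read/write here), then run the body on the task's stream.
  auto tsk = ctx.task(lx.rw());
  tsk.set_symbol("scale_task");
  tsk->*[&](cudaStream_t s, auto sx) {
    scale<<<1, 16, 0, s>>>(sx, 2.0);
  };

  ctx.finalize();                    // write back and release resources
  return 0;
}
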
8 changes: 8 additions & 0 deletions include/matx/operators/cast.h
@@ -83,6 +83,14 @@ namespace matx
return static_cast<NewType>(op_(indices...));
}

template <typename Task>
__MATX_INLINE__ void apply_dep_to_task(Task &&task, int perm=1) const noexcept
{
if constexpr (is_matx_op<T>()) {
op_.apply_dep_to_task(std::forward<Task>(task), perm);
}
}

template <typename ShapeType, typename Executor>
__MATX_INLINE__ void PreRun(ShapeType &&shape, Executor &&ex) const noexcept
{
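
The cast change above is one instance of a broader pattern in this commit: dependency declaration has to recurse through wrapper operators (cast/as_int, element-wise expressions) until it reaches the tensors that actually own storage. A simplified, self-contained model of that recursion is sketched below; the types are stand-ins for illustration, not MatX's real class hierarchy.

#include <utility>
#include <vector>

// Stand-in task that just records (storage, permission) pairs; the real code
// would register read/write dependencies on STF logical data instead.
struct TaskModel {
  std::vector<std::pair<const void *, int>> deps;
  void add(const void *storage, int perm) { deps.emplace_back(storage, perm); }
};

// Leaf: a tensor registers its own storage with the task.
struct TensorModel {
  float data_[8] = {};
  template <typename Task>
  void apply_dep_to_task(Task &&task, int perm) const { task.add(data_, perm); }
};

// Wrapper: a cast-like operator owns no storage and simply forwards,
// mirroring the cast.h change in this commit.
template <typename Op>
struct CastModel {
  Op op_;
  template <typename Task>
  void apply_dep_to_task(Task &&task, int perm) const {
    op_.apply_dep_to_task(std::forward<Task>(task), perm);
  }
};

int main() {
  TaskModel task;
  CastModel<TensorModel> casted{};     // e.g. as_int(tensor)
  casted.apply_dep_to_task(task, 1);   // read permission reaches the tensor
  return 0;
}
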
4 changes: 2 additions & 2 deletions include/matx/operators/cgsolve.h
@@ -90,8 +90,8 @@

template <typename Out, typename Executor>
void Exec(Out &&out, Executor &&ex) const{
static_assert(is_cuda_executor_v<Executor>, "cgsolve() only supports the CUDA executor currently");
cgsolve_impl(cuda::std::get<0>(out), a_, b_, tol_, max_iters_, ex.getStream());
//static_assert(is_cuda_executor_v<Executor>, "cgsolve() only supports the CUDA executor currently");
cgsolve_impl(cuda::std::get<0>(out), a_, b_, ex, tol_, max_iters_, ex.getStream());
}

template <typename ShapeType, typename Executor>
2 changes: 1 addition & 1 deletion include/matx/operators/fft.h
@@ -146,7 +146,7 @@ namespace matx
}

template <typename Task>
__MATX_INLINE__ void apply_dep_to_task(Task &&task, int perm=1) const noexcept
__MATX_INLINE__ void apply_dep_to_task(Task &&task, [[maybe_unused]] int perm=1) const noexcept
{
/* Scenario where the matvec() operator is on the RHS and op has already
run previously. So we make tmp_out have a read permission as it will be read from */
28 changes: 27 additions & 1 deletion include/matx/operators/max.h
@@ -72,9 +72,35 @@ namespace detail {
return tmp_out_(indices...);
}

template <typename Task>
__MATX_INLINE__ void apply_dep_to_task(Task &&task, int perm=1) const noexcept {
/* Scenario where the max() operator is on the RHS and its reduction has already
run, so tmp_out_ gets read permission as it will only be read from. */
tmp_out_.apply_dep_to_task(std::forward<Task>(task), 1);
}

template <typename Out, typename Executor>
void Exec(Out &&out, Executor &&ex) const {
max_impl(cuda::std::get<0>(out), a_, ex);
auto output = cuda::std::get<0>(out);
// stfexecutor case
if constexpr (!is_cuda_executor_v<Executor>) {
auto ctx = ex.getCtx();
auto tsk = ctx.task();
tsk.set_symbol("max_task");

output.PreRun(out_dims_, std::forward<Executor>(ex));
output.apply_dep_to_task(tsk, 0);
a_.apply_dep_to_task(tsk, 1);

tsk->*[&](cudaStream_t s) {
auto exec = cudaExecutor(s);
max_impl(output, a_, exec);
};
}
// cudaExecutor case
else if constexpr (is_cuda_executor_v<Executor>) {
max_impl(output, a_, ex);
}
}

static __MATX_INLINE__ constexpr __MATX_HOST__ __MATX_DEVICE__ int32_t Rank()
34 changes: 22 additions & 12 deletions include/matx/transforms/cgsolve.h
@@ -58,8 +58,8 @@ namespace matx
* cuda Stream to execute on
*
*/
template <typename XType, typename AType, typename BType>
__MATX_INLINE__ void cgsolve_impl(XType X, AType A, BType B, double tol=1e-6, int max_iters=4, cudaStream_t stream=0)
template <typename XType, typename AType, typename BType, typename Executor>
__MATX_INLINE__ void cgsolve_impl(XType X, AType A, BType B, Executor &&exec, double tol=1e-6, int max_iters=4, cudaStream_t stream=0)
{
using value_type = typename XType::value_type;
const int VRANK = XType::Rank();
@@ -120,15 +120,19 @@ namespace matx
auto pApc = clone<VRANK>(pAp, clone_shape);

// A*X
(Ap = matvec(A, X)).run(stream);
//(Ap = matvec(A, X)).run(stream);
(Ap = matvec(A, X)).run(exec);

Review comment: Is it the same to call run(exec) and run(stream) when we have a "classic" executor? (Won't it trigger much more work?)

// r0 = B - A*X
// p = r0
(p = r0 = B - Ap).run(stream);
//(p = r0 = B - Ap).run(stream);
(p = r0 = B - Ap).run(exec);

(r0r0 = sum(r0*r0)).run(stream);
//(r0r0 = sum(r0*r0)).run(stream);
(r0r0 = sum(r0*r0)).run(exec);

if(tol>0.0f) {
(converged = matx::all(as_int(sqrt(r0r0) < tol))).run(stream);
//(converged = matx::all(as_int(sqrt(r0r0) < tol))).run(stream);
(converged = matx::all(as_int(sqrt(r0r0) < tol))).run(exec);

cudaEventRecord(event, stream);
cudaStreamWaitEvent(d2h, event);
@@ -137,10 +141,12 @@
int i;
for (i = 0 ; i < max_iters; i++) {
// Ap = matvec(A, p)
(Ap = matvec(A, p)).run(stream);
//(Ap = matvec(A, p)).run(stream);
(Ap = matvec(A, p)).run(exec);

// pAp = dot(p,Ap)
(pAp = sum(p*Ap)).run(stream);
//(pAp = sum(p*Ap)).run(stream);
(pAp = sum(p*Ap)).run(exec);

// if pAp is zero then we have exactly numerically converged.
// However, this is batched so we may iterate more. Iterating
@@ -152,10 +158,12 @@
auto updateOp = ( r1 = r0 - (r0r0c/pApc) * Ap,
X = X + (r0r0c/pApc) * p);

(IF( pApc != value_type(0), updateOp)).run(stream);
//(IF( pApc != value_type(0), updateOp)).run(stream);
(IF( pApc != value_type(0), updateOp)).run(exec);

// r1r1 = dot(r1, r1)
(r1r1 = sum(r1*r1)).run(stream);
//(r1r1 = sum(r1*r1)).run(stream);
(r1r1 = sum(r1*r1)).run(exec);

if(tol>0.0f) {
// copy convergence criteria to host.
@@ -168,15 +176,17 @@
break;
}

(converged = matx::all(as_int(sqrt(r1r1) < tol))).run(stream);
//(converged = matx::all(as_int(sqrt(r1r1) < tol))).run(stream);
(converged = matx::all(as_int(sqrt(r1r1) < tol))).run(exec);

cudaEventRecord(event, stream);
cudaStreamWaitEvent(d2h, event);
}

// p = r1 + b * p
auto updateP = ( p = r1 + (r1r1c/r0r0c) * p);
(IF( pApc != value_type(0), updateP)).run(stream);
//(IF( pApc != value_type(0), updateP)).run(stream);
(IF( pApc != value_type(0), updateP)).run(exec);

// Advance residual
swap(r0r0, r1r1);
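
To close the loop on the review question above: if run(stream) simply wraps the stream in a cudaExecutor, as the MatX sources suggest, then run(exec) with a classic executor should enqueue the same work, and only the STF executor takes the extra task-creation path. A hedged usage sketch of the executor-aware cgsolve, condensed from the examples/cgsolve.cu changes in this commit (matrix setup elided, tensor shapes assumed), follows.

#include <matx.h>
using namespace matx;

int main() {
  MATX_ENTER_HANDLER();

  constexpr index_t BATCH = 1;
  constexpr index_t N = 16;
  auto A = make_tensor<float>({BATCH, N, N});
  auto B = make_tensor<float>({BATCH, N});
  auto X = make_tensor<float>({BATCH, N});
  // ... fill A with an SPD system (e.g. the Poisson stencil from the example)
  //     and B with a right-hand side; X holds the initial guess ...

  stfExecutor exec{};                        // STF-backed executor from this PR
  auto ctx = exec.getCtx();                  // not needed with cudaExecutor

  (X = cgsolve(A, B, 1e-6, 20)).run(exec);   // exec is forwarded to cgsolve_impl

  exec.sync();                               // ctx.task_fence() under the STF executor
  ctx.finalize();                            // STF only; still explicit in this commit

  MATX_EXIT_HANDLER();
}
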