Profiling
Since the primary motivator of Celeritas is performance on GPU hardware, profiling is a necessity. Celeritas uses NVTX (CUDA), ROCTX (HIP) or Perfetto (CPU) to annotate the different sections of the code, allowing for fine-grained profiling and improved visualization.
Timelines
A detailed timeline of the Celeritas construction, steps, and kernel launches can be gathered using NVIDIA Nsight Systems.
Here is an example using the celer-sim app to generate a timeline:
1$ CELER_ENABLE_PROFILING=1 \
2> nsys profile \
3> -c nvtx --trace=cuda,nvtx,osrt \
4> -p celer-sim@celeritas \
5> --osrt-backtrace-stack-size=16384 --backtrace=fp \
6> -f true -o report.qdrep \
7> celer-sim inp.json
To use the NVTX ranges, you must set the CELER_ENABLE_PROFILING environment variable and use the NVTX “capture” option (lines 1 and 3). The celer-sim range in the celeritas domain (line 4) enables profiling over the whole application. Additional system backtracing is specified in line 5; line 6 writes to (and overwrites) a particular output file; the final line invokes the application.
Timelines can also be generated on AMD hardware using the ROCProfiler applications. Here’s an example that writes out timeline information:
1$ CELER_ENABLE_PROFILING=1 \
2> rocprof \
3> --roctx-trace \
4> --hip-trace \
5> celer-sim inp.json
It will output a results.json file that contains profiling data for both the Celeritas annotations (line 3) and HIP function calls (line 4) in a “trace event format” which can be viewed in the Perfetto data visualization tool.
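For readers unfamiliar with the trace event format, the sketch below builds a single "complete" event (phase "X") as a JSON object. It only illustrates the general format: the struct `CompleteEvent` and the function `to_json` are hypothetical names introduced here, the values are made up, and the exact fields that rocprof writes to results.json may differ.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper type (not part of Celeritas or rocprof): one
// "complete" event in the trace event format. Times are in microseconds.
struct CompleteEvent
{
    std::string name;      // label shown on the timeline
    std::string category;  // free-form grouping string
    long long ts_us;       // start timestamp
    long long dur_us;      // duration
    int pid;               // process ID
    int tid;               // thread ID
};

// Serialize the event as a JSON object. For brevity this sketch does not
// escape special characters in the name or category.
std::string to_json(CompleteEvent const& e)
{
    assert(e.dur_us >= 0);
    std::ostringstream os;
    os << "{\"name\":\"" << e.name << "\",\"cat\":\"" << e.category
       << "\",\"ph\":\"X\",\"ts\":" << e.ts_us << ",\"dur\":" << e.dur_us
       << ",\"pid\":" << e.pid << ",\"tid\":" << e.tid << "}";
    return os.str();
}
```

A trace file is then typically a JSON array of such objects, or an object holding them under a "traceEvents" key.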
On CPU, timelines are generated using Perfetto, which is supported only when CUDA and HIP are disabled. Perfetto supports application-level and system-level profiling.
To use application-level profiling, set the tracing_file input key:
$ CELER_ENABLE_PROFILING=1 \
> celer-sim inp.json
System-level profiling, which captures both system and application events, requires starting external services. To use this mode, the tracing_file key must be absent or empty. Details on how to set up the system services can be found in the Perfetto documentation. Root access on the system is required.
If you integrate Celeritas in your application, you need to create a TracingSession instance. The profiling session will end when the object goes out of scope, but it can be moved to extend its lifetime.
#include "corecel/sys/TracingSession.hh"

int main()
{
    // System-level profiling: pass a filename to use application-level profiling
    celeritas::TracingSession session;
    session.start();
}
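The scope-based lifetime described above can be illustrated without Celeritas. The following is a minimal sketch that uses a hypothetical stand-in class, `MockSession` (not the real `TracingSession`, whose interface may differ), together with a hypothetical helper `make_started_session`. The destructor ends the session unless ownership has been moved elsewhere, so returning the object from a helper keeps the session alive in the caller.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a tracing session; NOT the Celeritas class.
// It appends "start"/"stop" to a log so that the lifetime is observable.
class MockSession
{
  public:
    explicit MockSession(std::vector<std::string>* log) : log_(log) {}

    // Moving transfers ownership: the moved-from object becomes inert and
    // does not stop the session when it is destroyed.
    MockSession(MockSession&& other) noexcept
        : log_(std::exchange(other.log_, nullptr))
        , started_(std::exchange(other.started_, false))
    {
    }

    MockSession(MockSession const&) = delete;
    MockSession& operator=(MockSession const&) = delete;
    MockSession& operator=(MockSession&&) = delete;

    void start()
    {
        assert(log_ != nullptr);  // must not be called on a moved-from object
        if (!started_)
        {
            log_->push_back("start");
            started_ = true;
        }
    }

    // The session ends when its current owner goes out of scope.
    ~MockSession()
    {
        if (log_ && started_)
        {
            log_->push_back("stop");
        }
    }

  private:
    std::vector<std::string>* log_{nullptr};
    bool started_{false};
};

// Start a session inside a helper and hand it to the caller, so that the
// session outlives the helper's own scope.
MockSession make_started_session(std::vector<std::string>* log)
{
    MockSession session(log);
    session.start();
    return session;
}
```

In this sketch, a caller that stores the result of `make_started_session(&log)` in a local variable sees "stop" logged only when that variable is destroyed, not when the helper returns.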
Kernel profiling
Detailed kernel diagnostics, including occupancy and memory bandwidth, can be gathered with the NVIDIA Nsight Compute profiler.
This example gathers kernel statistics for 10 “propagate” kernels (for both charged and uncharged particles) starting with the 300th launch.
$ CELER_ENABLE_PROFILING=1 \
> ncu \
> --nvtx --nvtx-include "celeritas@celer-sim/step/*/propagate" \
> --launch-skip 300 --launch-count 10 \
> -f -o propagate \
> celer-sim inp.json
It will write to the propagate.ncu-rep output file. Note that the domain and range are flipped compared to nsys since the kernel profiling allows detailed top-down stack specification.
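To make the "top-down stack" idea concrete, the sketch below models nested push/pop ranges with two hypothetical classes, `RangeStack` and `ScopedRange`. Neither is part of Celeritas or NVTX; they only mimic how nested annotations build a path of the form domain@outer/.../inner, listed from the outermost range to the innermost. In a real run the profiler tracks this stack itself.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model of a profiler's push/pop range stack for one domain.
class RangeStack
{
  public:
    void push(std::string name) { names_.push_back(std::move(name)); }

    void pop()
    {
        assert(!names_.empty());
        names_.pop_back();
    }

    // Join the active ranges from outermost to innermost with '/',
    // prefixed by "<domain>@".
    std::string path(std::string const& domain) const
    {
        std::string result = domain + "@";
        for (std::size_t i = 0; i < names_.size(); ++i)
        {
            if (i > 0)
            {
                result += '/';
            }
            result += names_[i];
        }
        return result;
    }

  private:
    std::vector<std::string> names_;
};

// RAII guard: pushes a range on construction and pops it on destruction,
// so nested scopes in the code become nested ranges on the stack.
class ScopedRange
{
  public:
    ScopedRange(RangeStack& stack, std::string name) : stack_(stack)
    {
        stack_.push(std::move(name));
    }
    ~ScopedRange() { stack_.pop(); }

    ScopedRange(ScopedRange const&) = delete;
    ScopedRange& operator=(ScopedRange const&) = delete;

  private:
    RangeStack& stack_;
};
```

With ranges named "celer-sim", "step", some action name, and "propagate" opened in that order, `path("celeritas")` yields a string of the same shape as the filter above, where the `*` stands in for the action-level range.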