Profiling¶
Since the primary motivator of Celeritas is performance on GPU hardware, profiling is a necessity. Celeritas uses NVTX (CUDA), ROCTX (HIP) or Perfetto (CPU) to annotate the different sections of the code, allowing for fine-grained profiling and improved visualization.
Tracing events in Celeritas¶
Celeritas includes a number of NVTX, ROCTX, and Perfetto events that can be used to trace different aspects of the code execution. These events are enabled when the environment variable CELER_ENABLE_PROFILING (see Environment variables) is set.
The CUDA and Perfetto backends support both the slice (timeline) and counter events detailed below; the HIP backend supports slices only.
Profiling backends allow grouping various events into “namespaces” (NVTX/HIP domains, Perfetto categories) so that users can selectively enable events they are interested in. Celeritas groups all events in the “celeritas” namespace.
Slices¶
Detailed timing of each step iteration is recorded with “slice” events in Celeritas. The step slice contains a nested slice for each action composing the step; some actions, such as the along-step actions, contain further nested slices for fine-grained profiling.
In addition to the slices in the simulation loop, slice events are also recorded when setting up the problem (e.g., detector construction).
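The nesting described above can be illustrated with a scope-based guard. This is only a sketch: `SliceRecorder` and `ScopedSlice` are hypothetical stand-ins for a profiling backend and are not Celeritas classes, and the action name `interact` is made up for illustration. The real events are emitted through NVTX, ROCTX, or Perfetto.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical record of one slice boundary (not a Celeritas type)
struct SliceEvent
{
    std::string name;
    int depth;   // nesting level at which the slice was opened
    bool begin;  // true when the slice opens, false when it closes
};

// Hypothetical in-memory recorder standing in for a profiling backend
class SliceRecorder
{
  public:
    void push(std::string const& name)
    {
        events_.push_back({name, depth_, true});
        open_.push_back(name);
        ++depth_;
    }

    void pop()
    {
        assert(!open_.empty());
        --depth_;
        events_.push_back({open_.back(), depth_, false});
        open_.pop_back();
    }

    std::vector<SliceEvent> const& events() const { return events_; }

  private:
    std::vector<SliceEvent> events_;
    std::vector<std::string> open_;
    int depth_ = 0;
};

// RAII guard: the slice opens on construction and closes on destruction
class ScopedSlice
{
  public:
    ScopedSlice(SliceRecorder& rec, std::string const& name) : rec_(rec)
    {
        rec_.push(name);
    }
    ~ScopedSlice() { rec_.pop(); }
    ScopedSlice(ScopedSlice const&) = delete;
    ScopedSlice& operator=(ScopedSlice const&) = delete;

  private:
    SliceRecorder& rec_;
};

// Record one step with two actions, the first containing a nested slice
inline std::vector<SliceEvent> record_example_step()
{
    SliceRecorder rec;
    {
        ScopedSlice step(rec, "step");
        {
            ScopedSlice action(rec, "along-step");
            ScopedSlice inner(rec, "propagate");
        }
        {
            ScopedSlice action(rec, "interact");  // hypothetical action name
        }
    }
    return rec.events();
}
```

Because each guard closes its slice in its destructor, inner slices always end before the slices that enclose them, which is what gives a timeline viewer a properly nested stack.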
Counters¶
Celeritas provides a few counter events. Currently it writes:
active, alive, and dead track counts at each step iteration, and
the number of hits in a step.
Profiling Celeritas example app¶
A detailed timeline of the Celeritas construction, steps, and kernel launches can be gathered with NVIDIA Nsight Systems. Here is an example using the celer-sim app to generate a timeline:
1$ CELER_ENABLE_PROFILING=1 \
2> nsys profile \
3> -c nvtx --trace=cuda,nvtx,osrt \
4> -p celer-sim@celeritas \
5> --osrt-backtrace-stack-size=16384 --backtrace=fp \
6> -f true -o report.qdrep \
7> celer-sim inp.json
To use the NVTX ranges, you must enable the CELER_ENABLE_PROFILING variable and use the NVTX “capture” option (lines 1 and 3). The celer-sim range in the celeritas domain (line 4) enables profiling over the whole application. Additional system backtracing is specified in line 5; line 6 writes (and overwrites) to a particular output file; the final line invokes the application.
On AMD hardware using the ROCProfiler, here’s an example that writes out timeline information:
1$ CELER_ENABLE_PROFILING=1 \
2> rocprof \
3> --roctx-trace \
4> --hip-trace \
5> celer-sim inp.json
It will output a results.json file that contains profiling data for both the Celeritas annotations (line 3) and HIP function calls (line 4) in a “trace event format” which can be viewed in the Perfetto data visualization tool.
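For readers unfamiliar with the trace event format, the sketch below builds a tiny document in that JSON layout, where a timed slice is a “complete” event (phase `X`) with a start timestamp and a duration in microseconds. The helper names are hypothetical, the values are made up, and the exact set of fields written by rocprof may differ from this minimal form. The sketch also assumes names contain no characters that need JSON escaping.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Format one "complete" event: a named slice with start time and duration,
// both in microseconds. No JSON escaping is done (assumes plain names).
inline std::string complete_event(std::string const& name,
                                  std::string const& category,
                                  long ts_us,
                                  long dur_us,
                                  int pid,
                                  int tid)
{
    std::ostringstream os;
    os << "{\"name\":\"" << name << "\",\"cat\":\"" << category
       << "\",\"ph\":\"X\",\"ts\":" << ts_us << ",\"dur\":" << dur_us
       << ",\"pid\":" << pid << ",\"tid\":" << tid << "}";
    return os.str();
}

// Wrap already-formatted events in the top-level "traceEvents" array
inline std::string trace_document(std::vector<std::string> const& events)
{
    std::ostringstream os;
    os << "{\"traceEvents\":[";
    for (std::size_t i = 0; i < events.size(); ++i)
    {
        if (i != 0)
        {
            os << ",";
        }
        os << events[i];
    }
    os << "]}";
    return os.str();
}
```

A file with this layout can be opened directly in the Perfetto UI, which draws each complete event as a bar on the timeline of its process and thread.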
On CPU, timelines are generated using Perfetto, which is only supported when CUDA and HIP are disabled. Perfetto supports application-level and system-level profiling. To use the application-level profiling, see Diagnostics.
$ CELER_ENABLE_PROFILING=1 \
> celer-sim inp.json
The system-level profiling, capturing both system and application events, requires starting external services. Details on how to set up the system services can be found in the Perfetto documentation. Root access on the system is required.
Integration with user applications¶
When using a CUDA or HIP backend, nothing needs to be done on the user side: the commands shown in the previous sections can be used to profile your application. If your application already uses NVTX or ROCTX, you can exclude Celeritas events by excluding the “celeritas” domain.
When using Perfetto, you need to create a TracingSession instance. The profiling session needs to be explicitly started and will end when the object goes out of scope, but it can be moved to extend its lifetime.
#include "corecel/sys/TracingSession.hh"

int main()
{
    // No filename given: system-level profiling.
    // Pass a filename to use application-level profiling instead.
    celeritas::TracingSession session;
    session.start();
}
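The “move to extend its lifetime” behavior is the usual pattern for move-only RAII objects. The sketch below shows it with a hypothetical `MockSession` class that only mimics the start-explicitly, end-at-destruction semantics described above. It is not the real `celeritas::TracingSession`, whose constructor and members may differ.

```cpp
#include <utility>

// Hypothetical stand-in for a move-only tracing session
class MockSession
{
  public:
    // The counter tracks how many sessions are currently running
    explicit MockSession(int* running) : running_(running) {}

    // Moving transfers ownership: the moved-from object no longer ends
    // the session when it is destroyed
    MockSession(MockSession&& other) noexcept
        : running_(std::exchange(other.running_, nullptr))
        , started_(std::exchange(other.started_, false))
    {
    }

    MockSession(MockSession const&) = delete;
    MockSession& operator=(MockSession const&) = delete;

    // The session ends when the owning object goes out of scope
    ~MockSession()
    {
        if (running_ && started_)
        {
            --*running_;
        }
    }

    // The session must be started explicitly
    void start()
    {
        if (running_ && !started_)
        {
            ++*running_;
            started_ = true;
        }
    }

  private:
    int* running_;
    bool started_ = false;
};

// Start a session in a helper and hand it to the caller, so it outlives
// the scope in which it was created
inline MockSession make_started_session(int* running)
{
    MockSession session(running);
    session.start();
    return session;
}
```

Under this pattern, a session created during setup can be stored in a longer-lived object (for example, a member of an application-level class) so that tracing covers the whole run rather than just the setup function.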
As mentioned above, Perfetto can either profile application events only, or application and system events. The system-level profiling requires starting external services. Details on how to set up the system services can be found in the Perfetto documentation. Root access on the system is required.
When the tracing session is started with a filename, the application-level profiling is used and written to the specified file.
Omitting the filename will use the system-level profiling, in which case you must have the external Perfetto tracing processes started. The container in scripts/docker/interactive provides an example Perfetto configuration for tracing both system-level and Celeritas events.
As with NVTX and ROCTX, if your application already uses Perfetto, you can exclude Celeritas events by excluding the “celeritas” category.
Kernel profiling¶
Detailed kernel diagnostics, including occupancy and memory bandwidth, can be gathered with the NVIDIA Nsight Compute profiler (ncu).
This example gathers kernel statistics for 10 “propagate” kernels (for both charged and uncharged particles) starting with the 300th launch.
$ CELER_ENABLE_PROFILING=1 \
> ncu \
> --nvtx --nvtx-include "celeritas@celer-sim/step/*/propagate" \
> --launch-skip 300 --launch-count 10 \
> -f -o propagate \
> celer-sim inp.json
It will write to the propagate.ncu-rep output file. Note that the domain and range are flipped compared to nsys since the kernel profiling allows detailed top-down stack specification.