PyTorch model(x) to GPU: The Hidden Journey of Neural Network Execution

Stephen Carmody · October 10, 2025

When you call y = model(x) in PyTorch and it spits out a prediction, it’s easy to gloss over the details of what PyTorch is doing behind the scenes. That single line cascades through half a dozen software layers until your GPU is executing thousands of threads in parallel. Exactly what those steps were wasn’t always clear to me, so I decided to dig a little deeper.

The Stack: A Bird’s Eye View

Here’s the full stack at a glance. When you make a request to an ML inference service (feed input into your model), once it reaches PyTorch the work flows down through a series of layers, from high-level Python code all the way down to GPU hardware:

PyTorch GPU Journey Overview

Each layer translates the work into something more concrete: your model definition becomes tensor operations, those operations become library calls or kernel launches, those become hardware commands, and finally the GPU executes them in parallel across thousands of processing units. So with that high-level view, let’s journey down through the stack.

PyTorch / Application Layer

At the top, PyTorch builds a computation graph, determined by the architecture of your model. Each operation you write (Linear, ReLU, Convolution, etc.) adds a node that represents a tensor transformation. In fact we can visualise it [1], and you’ll see a DAG like the one below: tensors (the data) flow along the edges, with ops (transformations) happening at the nodes.

Here’s what that looks like for a simple neural network:

import torch.nn as nn

model = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Linear(8, 2)
)

PyTorch Computation Graph DAG

In our toy model, the input flows through each operation, in our case linear layers and activations, with the tensor shapes transforming along the way. This graph is what PyTorch moves through during the forward pass, dispatching each operation to an implementation.
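
To make those shape transformations concrete, here’s a quick sketch that steps through the same toy model one layer at a time (purely illustrative; normally you’d just call model(x)):

import torch
import torch.nn as nn

# Step through the toy model layer by layer to watch the shapes transform.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

x = torch.randn(1, 4)   # input: batch of one, 4 features -> shape (1, 4)
h = model[0](x)         # Linear(4, 8) -> shape (1, 8)
a = model[1](h)         # ReLU         -> shape (1, 8)
y = model[2](a)         # Linear(8, 2) -> shape (1, 2)

print(x.shape, h.shape, a.shape, y.shape)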

PyTorch provides a Tensor API that defines those operations on our data. When you call a tensor operation in Python, it converts that call into a C++ binding via ATen (PyTorch’s C++ tensor library). ATen does the CPU-side work: allocating output memory, checking tensor shapes, and picking which implementation to use. But what happens when ATen makes those calls? Let’s look at the next layer down.
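
You can actually peek at this boundary from Python: the operators ATen registers are exposed under torch.ops.aten. As a small sketch (not something you’d write in real code), the high-level call and the underlying ATen operator give the same result:

import torch

x = torch.randn(3, 4)
w = torch.randn(4, 8)

y1 = torch.matmul(x, w)            # high-level Python entry point
y2 = torch.ops.aten.matmul(x, w)   # the registered ATen operator, called directly

assert torch.allclose(y1, y2)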

CUDA Libraries / CUDA Runtime API

At this point we have C++ function calls, but these still need to be translated into highly parallel GPU operations. This is where CUDA libraries and the CUDA Runtime API come in.

You can think of CUDA libraries as a collection of useful, optimized kernels (specialised functions designed to run on the GPU in parallel across thousands of threads).

There are two main CUDA libraries we are likely to encounter:

  • cuBLAS (Basic Linear Algebra Subprograms): think basic linear algebra ops like matrix multiply.
  • cuDNN (Deep Neural Network): think operations specific to DNNs like convolutions or pooling.

These CUDA libraries in turn call the CUDA Runtime API. This API translates high-level requests like “allocate this much GPU memory” or “launch this kernel with these dimensions” into low-level driver calls, which build command buffers (blocks of memory containing binary instructions the GPU can execute).
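
You won’t usually touch the Runtime API from PyTorch, but you can observe its effects from Python. A minimal sketch, assuming a CUDA device is available:

import torch

# Allocations are ultimately served by cudaMalloc (via PyTorch's caching
# allocator), and torch.cuda.synchronize() maps down to cudaDeviceSynchronize.
before = torch.cuda.memory_allocated()
x = torch.randn(1024, 1024, device="cuda")      # ~4 MB tensor on the GPU
print(torch.cuda.memory_allocated() - before)   # bytes added by the allocation

torch.cuda.synchronize()                        # wait for all queued GPU work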

So now let’s make the link back to PyTorch.

ATen, the PyTorch tensor library we mentioned earlier, has two paths it can take:

  • Some calls go to CUDA libraries (like the cuBLAS gemv kernel for the matrix-vector multiply in our Linear layers, or, in the case of a CNN, a cuDNN routine like cudnnConvolutionForward for convolutions).
  • Others go straight to the Runtime API (like element-wise operations such as the ReLU activation).


Seeing it in action: Tracing a Matrix Multiply

Let’s try to illustrate this with a simplified example of a single matmul operation flowing through all these layers:

Python: result = torch.matmul(A, B)
  ↓
C++ ATen: at::matmul(tensor A, tensor B)
  ↓
  CPU-side work:
    - Validate: Check A and B have compatible shapes
    - Allocate: Call cudaMalloc() to allocate output tensor memory on GPU
    - Dispatch: Examine tensor shapes and dtypes to pick best algorithm
  ↓
  GPU-side work:
    - Call cuBLAS: cublasSgemm(handle, ..., A_ptr, B_ptr, C_ptr)
      ↓
      cuBLAS internally calls: cudaLaunchKernel(matmul_kernel, grid, block, ...)
      ↓
      Kernel executes on GPU

You can see the translation in action: a single Python line becomes C++ calls, which become CUDA library calls, which become kernel launches in the CUDA runtime API, each layer translating the work into something more concrete.

What this looks like in practice

To see these layers in action from a real PyTorch call, let’s look at a profiler trace from running our simple neural network. Again, notice how high-level PyTorch ops (like linear and relu) break down into lower-level operations, eventually triggering actual GPU kernel launches:

         
Event | Description / Layer | CPU Time | GPU Time | Device Type
--- | --- | --- | --- | ---
Op::to | PyTorch op moving tensor between devices (CPU → GPU). | 3519.846 | 0 | DeviceType.CPU
Runtime::Unrecognized | Overhead or internal framework event not mapped to a known op. | 3462.635 | 0 | DeviceType.CPU
Op::linear | High-level PyTorch Linear layer op (calls addmm). | 378.252 | 8.576 | DeviceType.CPU
Op::t | Tensor transpose op (creates view). | 69.277 | 0 | DeviceType.CPU
Op::transpose | Tensor layout manipulation. | 36.092 | 0 | DeviceType.CPU
Op::as_strided | Low-level tensor view op, stride adjustment only (no kernel). | 11.421 | 0 | DeviceType.CPU
Op::addmm | Matrix multiply + bias add op; triggers GEMM kernel. | 288.802 | 8.576 | DeviceType.CPU
Runtime::LaunchKernel | CUDA runtime API call launching GPU kernels. | 86.909 | 0 | DeviceType.CPU
Kernel::gemv (cuBLAS) | Low-level cuBLAS GEMV kernel (matrix–vector multiply). | 0 | 8.576 | DeviceType.CUDA
Op::relu | High-level activation op (calls clamp_min). | 80.595 | 3.295 | DeviceType.CPU
Op::clamp_min | Tensor elementwise clamp (ReLU implementation). | 60.201 | 3.295 | DeviceType.CPU
Kernel::clamp_elementwise | Compiled CUDA elementwise kernel for clamp/ReLU. | 0 | 3.295 | DeviceType.CUDA
Mem::[memory] | Alloc/free bookkeeping event (no compute). | 0 | 0 | DeviceType.CPU
Runtime::DeviceSynchronize | CUDA synchronization (wait for GPU to finish). | 22.214 | 0 | DeviceType.CPU

Although these events are not strictly in order (they have been aggregated to make them easier to digest), you can see the pattern: a PyTorch operation like linear decomposes into addmm (matrix multiply + add), which triggers a Runtime::LaunchKernel call, which finally launches the actual Kernel::gemv on the GPU. The separation between CPU time (host-side overhead) and GPU time (actual compute) becomes more apparent in this view.
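
If you want to generate a trace like this yourself, here’s a minimal sketch using torch.profiler (assuming a CUDA device; the table above is an aggregated and hand-annotated version of this kind of output):

import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

# Profile one forward pass of the toy model on the GPU.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2)).to("cuda")
x = torch.randn(1, 4, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    y = model(x)
    torch.cuda.synchronize()

# CPU-side ops (aten::linear, aten::addmm, aten::relu) appear alongside
# the GPU kernels they launch.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))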

CUDA Driver API

Beneath the CUDA Runtime API sits the CUDA Driver, which talks directly to the NVIDIA GPU driver inside your operating system. This layer assembles command buffers (blocks of memory filled with binary commands) and submits them to the GPU over the PCI Express bus. The driver’s job is to schedule work, manage memory mappings, and move data between the GPU and CPU using DMA (Direct Memory Access).
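
This layer is normally invisible from Python, but just to show it exists: a minimal sketch that loads the driver library with ctypes and calls two Driver API functions (assuming Linux with the NVIDIA driver installed; on some systems the library is named libcuda.so.1):

import ctypes

# Load the user-space driver library and call the CUDA Driver API directly.
libcuda = ctypes.CDLL("libcuda.so")

libcuda.cuInit(0)                                   # initialise the driver API
version = ctypes.c_int()
libcuda.cuDriverGetVersion(ctypes.byref(version))   # e.g. 12040 for CUDA 12.4
print("Driver API version:", version.value)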

NVIDIA GPU Driver & Hardware

Finally, the GPU gets involved. It reads those command buffers, decodes kernel launch commands, and assigns them to streaming multiprocessors, the heart of the GPU (loosely equivalent to CPU cores, but massively parallel). Each kernel runs as hundreds of parallel threads, crunching through the tensor math.
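
PyTorch exposes a little of this hardware view; a small sketch, assuming a CUDA device:

import torch

# Ask PyTorch about the device the kernels end up running on.
props = torch.cuda.get_device_properties(0)
print(props.name)                      # e.g. "NVIDIA GeForce RTX 3090"
print(props.multi_processor_count)     # number of streaming multiprocessors (SMs)
print(props.total_memory // 2**20)     # device memory in MiB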

The Journey Back Up

When the kernels finish, the GPU signals completion back through the driver. Results land in the output tensors PyTorch allocated at the start of the process and are copied back into CPU memory when you ask for them, and control returns to your Python code. The whole round trip from Python down to silicon and back happens in milliseconds.
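
As a small sketch of that hand-off (assuming a CUDA device): kernel launches are asynchronous, and it’s typically the copy back to CPU memory, or an explicit synchronize, that actually waits for the GPU to finish:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2)).to("cuda")
x = torch.randn(1, 4, device="cuda")

y = model(x)       # kernel launches are queued; control returns to Python immediately
y_cpu = y.cpu()    # copying the result to CPU memory waits for the GPU to finish
print(y_cpu)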

Wrapping Up

There’s an astonishing amount of orchestration hiding behind y = model(x). A single Python call expands into a graph of tensor ops, those ops become CUDA library calls, the runtime and driver convert them into command buffers, and the GPU executes them as hardware instructions across thousands of parallel threads. Hopefully this simplified journey from PyTorch to GPU has given you a little insight into what’s happening under the hood!




  1. The computation graph is produced by a library called TorchVista