Abstract
Training deep neural networks incurs growing computational cost and environmental impact, both exacerbated by traditional layer-by-layer execution as model depth increases. DeepPCR, an algorithm based on Parallel Cyclic Reduction (PCR), addresses this bottleneck by reformulating sequential operations as a system of equations that PCR solves in parallel. This dissertation presents the first reproducibility study of DeepPCR, reimplementing and extending its original single-device parallelisation with targeted optimisations and cross-platform analysis. Performance is assessed using roofline-guided profiling, solver microbenchmarks, parameter sweeps, and machine learning applications across NVIDIA V100 and AMD Instinct MI300X GPUs. The results confirm speedups consistent with the original authors' findings, while also delineating the regime in which width dominates depth and costs rise sharply. This work strengthens the empirical foundations of DeepPCR, provides a clearer understanding of its performance boundaries, and highlights directions for extending its applicability to deep, large-scale architectures.