Requirements

- An NVIDIA GPU; tensor cores increase performance when available. All shown results come from an RTX 3090.
- A C++14 capable compiler. The following choices are recommended and have been tested:
  - Windows: Visual Studio 2019 or 2022
  - Linux: GCC/G++ 8 or higher
- A recent version of CUDA. The following choices are recommended and have been tested:
  - Windows: CUDA 11.5 or higher
  - Linux: CUDA 10.2 or higher
- CMake v3.21 or higher.
- The fully fused MLP component of this framework requires a very large amount of shared memory in its default configuration. It will likely only work on an RTX 3090, an RTX 2080 Ti, or higher-end GPUs. Lower-end cards must reduce the n_neurons parameter or use the CutlassMLP (better compatibility but slower) instead.
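
The fallback described in the last requirement can be sketched as a small helper that builds the network part of a configuration. This is only an illustration: the helper name `network_config`, its `use_fully_fused` flag, and the example values (64 neurons, 2 hidden layers, ReLU activation) are assumptions made here, not values prescribed by the framework. Whether a given card can run the fully fused MLP still has to be determined by trying it.

```python
import json


def network_config(use_fully_fused: bool, n_neurons: int = 64, n_hidden_layers: int = 2) -> dict:
    """Build the network section of a config (hypothetical helper).

    use_fully_fused=False selects CutlassMLP, the slower but more
    compatible option for cards without enough shared memory.
    """
    return {
        "otype": "FullyFusedMLP" if use_fully_fused else "CutlassMLP",
        "activation": "ReLU",            # example choice, an assumption
        "output_activation": "None",     # example choice, an assumption
        "n_neurons": n_neurons,          # lower this on cards with less shared memory
        "n_hidden_layers": n_hidden_layers,
    }


if __name__ == "__main__":
    # Option 1 for a lower-end card: keep the fully fused MLP with fewer neurons.
    print(json.dumps(network_config(use_fully_fused=True, n_neurons=32), indent=2))
    # Option 2: switch to CutlassMLP and keep the layer width.
    print(json.dumps(network_config(use_fully_fused=False), indent=2))
```

Either option addresses the shared memory limit: a narrower fully fused network keeps the faster code path, while CutlassMLP trades speed for compatibility.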