Thesis
Selected thesis proposals from EFCL researchers. For a complete list, see the webpages of the individual EFCL groups.
All the flavours of FFT on MemPool
MemPool is an IIS-born many-core system with 256 Snitch cores and 1024 banks of shared, tightly coupled L1 data memory. Leveraging its hierarchical architecture, we can scale the system to TeraPool, a cluster of 1024 Snitch cores with 4096 banks of shared memory. The huge parallel computing power and the low latency of shared-memory accesses in TeraPool are perfectly suited to accelerating embarrassingly parallel tasks, such as matrix-matrix multiplication. Things get trickier with kernels that have irregular memory accesses, such as the Fast Fourier Transform (FFT). In the framework of a project where MemPool accelerates 5G processing workloads, we have already implemented a performant version of the Cooley-Tukey FFT [link], and we are now looking into different algorithmic strategies to execute up to 128 FFT tasks in less than 0.5 ms.
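To give a flavour of such a kernel, below is a minimal C sketch of one radix-2 Cooley-Tukey butterfly stage split across cores. It is illustrative only: the core_id/num_cores parameters and the twiddle-factor layout are assumptions, not the actual MemPool runtime API, and the barrier between stages is only hinted at.

```c
#include <stdint.h>

typedef struct { float re, im; } cplx_t;

// One radix-2 decimation-in-time stage over n points. The n/2
// butterflies of a stage are independent, so each core takes an
// interleaved share of them. `tw` holds n/2 precomputed twiddle
// factors; `span` is the butterfly distance of this stage.
void fft_radix2_stage(cplx_t *x, const cplx_t *tw, uint32_t n,
                      uint32_t span, uint32_t core_id, uint32_t num_cores) {
  for (uint32_t b = core_id; b < n / 2; b += num_cores) {
    uint32_t group = b / span;           // which butterfly group
    uint32_t pos   = b % span;           // position inside the group
    uint32_t i     = group * 2 * span + pos;
    uint32_t j     = i + span;
    cplx_t w = tw[pos * (n / (2 * span))];
    cplx_t t = { x[j].re * w.re - x[j].im * w.im,
                 x[j].re * w.im + x[j].im * w.re };
    x[j].re = x[i].re - t.re;  x[j].im = x[i].im - t.im;
    x[i].re = x[i].re + t.re;  x[i].im = x[i].im + t.im;
  }
  // A cluster-wide barrier is required here before the next stage
  // (omitted). The bit-reversal permutation is also left out.
}
```

The irregularity the proposal mentions is visible in the indices i and j: their stride changes at every stage, which makes bank conflicts hard to avoid without careful data placement.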
The goal of this project is to implement and optimize different FFT kernels: extending our work on the Cooley-Tukey FFT to different radices, implementing and optimizing other FFT kernels (e.g., the six-step FFT) on MemPool and TeraPool, and adding hardware extensions to specialize MemPool for the execution of FFT and other key algorithms in the field of wireless communications. Another option is the integration of a PULP FFT accelerator in the MemPool Tile.
Runtime partitioning of L1 memory in MemPool
MemPool is an IIS-born many-core system with 256 Snitch cores and 1024 banks of shared, tightly coupled L1 data memory. Leveraging its hierarchical architecture, we can scale the system to TeraPool, a cluster of 1024 Snitch cores with 4096 banks of shared memory. The huge parallel computing power of TeraPool is perfectly suited to accelerating embarrassingly parallel tasks, such as matrix-matrix multiplication. Other kernels benefit less from large-scale parallelization, and the best speed-up with respect to single-core execution is obtained when the algorithm runs in parallel over a subset of the cores. In both cases, the kernels are optimized to avoid conflicts in the access to the memory interconnect resources, which would stall the processors' LSUs.
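As a rough illustration of where these stalls come from, the sketch below shows how a word address maps to a bank under word-level interleaving and when two simultaneous accesses collide. The interleaving granularity and bank count are assumptions for illustration, not the exact MemPool address map.

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_BANKS 1024u   // MemPool configuration; 4096 for TeraPool

// Under word-level interleaving, consecutive 32-bit words land in
// consecutive banks, so the bank index is just the low bits of the
// word address.
static inline uint32_t bank_of(uint32_t byte_addr) {
  return (byte_addr >> 2) & (NUM_BANKS - 1u);
}

// Two accesses issued in the same cycle conflict, and stall one LSU,
// when they target the same bank.
static inline bool conflicts(uint32_t addr_a, uint32_t addr_b) {
  return bank_of(addr_a) == bank_of(addr_b);
}
```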
When MemPool and TeraPool are employed for the execution of composite algorithmic chains, different kernels may be allocated to subsets of the processors in the cluster. The sequential addressing scheme of MemPool and TeraPool makes conflicts more likely when different kernels running at the same time access their respective data structures, which are allocated at the same addresses of L1 memory. This is the case, for instance, when different stages of the 5G baseband signal-processing chain are executed on the platform.
The goal of this project is to create a runtime partitioning of MemPool's and TeraPool's L1 memory that separates the memory regions where the data structures of concurrently running kernels are allocated.
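A minimal sketch of what such a partitioning could look like, assuming a simple region-based bump allocator (illustrative names and API, not the actual MemPool runtime): each kernel allocates only inside its own disjoint slice of L1, so its data can never alias another kernel's region.

```c
#include <stddef.h>
#include <stdint.h>

// One L1 region per concurrently running kernel.
typedef struct {
  uintptr_t base;   // start of this kernel's L1 region
  size_t    size;   // region size in bytes
  size_t    used;   // bump-pointer offset
} l1_region_t;

void l1_region_init(l1_region_t *r, uintptr_t base, size_t size) {
  r->base = base;
  r->size = size;
  r->used = 0;
}

// Word-aligned bump allocation; returns NULL when the region is full.
void *l1_region_alloc(l1_region_t *r, size_t bytes) {
  bytes = (bytes + 3u) & ~(size_t)3;   // keep 32-bit word alignment
  if (r->used + bytes > r->size) return NULL;
  void *p = (void *)(r->base + r->used);
  r->used += bytes;
  return p;
}
```

The interesting part of the thesis is what this sketch hides: choosing region boundaries so that they also separate banks under the platform's addressing scheme, and resizing partitions at runtime as kernels come and go.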
Efficient Execution of Transformers in RISC-V Vector Machines with Custom HW acceleration (taken)
Transformers have set a new standard in natural language processing and other machine learning tasks (e.g., computer vision, molecular dynamics). In contrast to recurrent neural networks, the entire input is processed at once, and a learned attention mechanism provides context for any local position in the input sequence. These models have proven to train significantly faster than LSTMs and similar models, and have therefore enabled much larger and more complex models.
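For reference, a naive single-head scaled dot-product attention, out = softmax(QK^T / sqrt(d)) V, makes the structure of the computation explicit. This is a textbook formulation in plain C, not an optimized kernel.

```c
#include <math.h>
#include <stddef.h>

// Q, K, V, out: row-major n x d matrices; scores: scratch of n floats.
// Complexity is O(n^2 * d) with little data reuse per element loaded.
void attention(const float *Q, const float *K, const float *V,
               float *out, float *scores, size_t n, size_t d) {
  float scale = 1.0f / sqrtf((float)d);
  for (size_t i = 0; i < n; i++) {
    // scores[j] = <Q_i, K_j> / sqrt(d), tracking the row max for stability
    float max = -INFINITY;
    for (size_t j = 0; j < n; j++) {
      float s = 0.0f;
      for (size_t k = 0; k < d; k++) s += Q[i * d + k] * K[j * d + k];
      scores[j] = s * scale;
      if (scores[j] > max) max = scores[j];
    }
    // numerically stabilized softmax over row i
    float sum = 0.0f;
    for (size_t j = 0; j < n; j++) {
      scores[j] = expf(scores[j] - max);
      sum += scores[j];
    }
    // out_i = sum_j softmax_ij * V_j
    for (size_t k = 0; k < d; k++) {
      float acc = 0.0f;
      for (size_t j = 0; j < n; j++) acc += scores[j] * V[j * d + k];
      out[i * d + k] = acc / sum;
    }
  }
}
```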
Machine learning accelerators have been effective at exploiting the inherent redundancy of convolutional neural network layers. Unfortunately, the building blocks of transformers have significantly lower operational intensity (i.e., compute-to-data-load ratio) and rely on more general-purpose compute (e.g., softmax).
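A back-of-envelope comparison illustrates the gap: the sketch below contrasts the operational intensity of a square matmul with that of the QK^T score computation, whose small inner dimension keeps the compute-to-load ratio low. The sizes are illustrative, and only the obvious operand traffic is counted.

```c
#include <stdio.h>

// Operational intensity (FLOPs per byte of operand traffic), fp32.
int main(void) {
  double n = 1024, d = 64;
  // Square matmul C = A*B (n x n): 2n^3 FLOPs over 3n^2 * 4 bytes.
  double oi_matmul = (2 * n * n * n) / (3 * n * n * 4);
  // QK^T scores (n x d inputs, n x n output):
  // 2n^2*d FLOPs over (2n*d + n*n) * 4 bytes.
  double oi_scores = (2 * n * n * d) / ((2 * n * d + n * n) * 4);
  printf("matmul OI ~ %.1f FLOP/B\n", oi_matmul);  // ~170 FLOP/B
  printf("QK^T   OI ~ %.1f FLOP/B\n", oi_scores);  // ~28 FLOP/B
  return 0;
}
```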
In this master's thesis, we want to analyze the bottlenecks of transformers on an energy-efficient general-purpose vector processor, and extend the vector processor to cope with the new challenges of running transformer models efficiently.
Vector-based Parallel Programming Optimization of Communication Algorithms
Flexible and scalable solutions will be needed for future communications processing systems. Vector processors provide an efficient means of exploiting data-level parallelism (DLP), which is abundant in communications kernels. Spatz, a small and energy-efficient vector unit based on the RISC-V vector extension specification, is introduced to improve efficiency and performance. Spatz's lean Processing Element (PE) acts as an accelerator to a scalar core, making it a good candidate for achieving high hardware utilization and enabling scalability. Building on these strengths, we implemented Spatz on the TeraPool architecture, a scaled-up version of MemPool with 1024 Snitch cores and 4096 banks of shared, tightly coupled L1 data memory, as our hardware platform. In this project, we will exploit the considerable DLP of typical baseband signal-processing kernels and implement them on this platform.
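The kind of inner loop we target looks like the strip-mined AXPY below, written with RVV intrinsics. This assumes a toolchain with RVV 1.0 intrinsics; since Spatz implements a subset of the RVV specification, the supported operations and element widths should be checked against its documentation.

```c
#include <riscv_vector.h>
#include <stdint.h>
#include <stddef.h>

// Strip-mined 32-bit AXPY (y += a * x), a DLP-heavy inner loop of the
// kind found in baseband kernels. vsetvl picks the largest legal
// vector length per iteration, so the loop handles any n.
void axpy_i32(int32_t a, const int32_t *x, int32_t *y, size_t n) {
  while (n > 0) {
    size_t vl = __riscv_vsetvl_e32m8(n);           // elements this pass
    vint32m8_t vx = __riscv_vle32_v_i32m8(x, vl);  // load x
    vint32m8_t vy = __riscv_vle32_v_i32m8(y, vl);  // load y
    vy = __riscv_vmacc_vx_i32m8(vy, a, vx, vl);    // y += a * x
    __riscv_vse32_v_i32m8(y, vy, vl);              // store y
    x += vl; y += vl; n -= vl;
  }
}
```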