This repository contains minimal, educational implementations of three key reinforcement learning algorithms: DPO (Direct Preference Optimization), PPO (Proximal Policy Optimization), and GRPO (Group ...
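As a rough illustration of the kind of objective involved, here is a minimal sketch of the standard DPO loss for a single preference pair, written in plain Python. It is not taken from this repository: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions chosen for the example.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities.

    Hypothetical helper for illustration only; names and the default beta
    are assumptions, not this repository's API.
    """
    # Log-ratio of policy to reference model for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled preference margin.
    z = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(z)) equals softplus(-z); computed in a numerically stable way.
    x = -z
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


if __name__ == "__main__":
    # When the policy matches the reference, the margin is zero.
    print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
    # The policy favors the chosen response more than the reference does.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

When the policy and the reference assign the same log-probabilities, the margin is zero and the loss equals log 2. The loss decreases as the policy raises the chosen response relative to the rejected one, measured against the reference.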
For every task, the user can specify the Clifford gate set and the qubit connectivity. Reinforcement learning with a non-cumulative reward, following [2], can also be enabled by setting ...