• nn.Linear(bias=False) in training
  • Forward
    • output = input @ P.t()
  • Backward
    • grad_input = grad_output @ P
    • G = grad_output.t() @ input
  • Optimizer step
    • adam_step(P, G, M, V, lr=…)

Stage           States involved    Execution time
Forward         P, activation      Input-dependent
Backward        P, G, activation   Input-dependent
Optimizer step  P, G, M, V         Input-independent; usually much faster than forward/backward
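The three stages above can be sketched explicitly in PyTorch. This is a minimal sketch, not the framework's internals: `adam_step` is a hypothetical helper matching the pseudocode's signature (a bare Adam update with bias correction omitted), and the shapes are illustrative.

```python
import torch

# Illustrative shapes for a Linear(in_features=4, out_features=3, bias=False)
P = torch.randn(3, 4)       # weight parameter, shape (out, in)
M = torch.zeros_like(P)     # Adam first-moment state
V = torch.zeros_like(P)     # Adam second-moment state

x = torch.randn(8, 4)       # input activation, batch of 8

# Forward: output = input @ P.t()
out = x @ P.t()             # shape (8, 3)

# Backward, given an upstream grad_output of the same shape as out
grad_output = torch.randn_like(out)
grad_input = grad_output @ P        # flows to the previous layer, shape (8, 4)
G = grad_output.t() @ x             # weight gradient, same shape as P

# Optimizer step: hypothetical adam_step helper (bias correction omitted)
def adam_step(P, G, M, V, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    M.mul_(beta1).add_(G, alpha=1 - beta1)          # update first moment
    V.mul_(beta2).addcmul_(G, G, value=1 - beta2)   # update second moment
    P.sub_(lr * M / (V.sqrt() + eps))               # in-place parameter update

adam_step(P, G, M, V, lr=1e-3)
```

Note that forward and backward cost scales with the batch of activations, while the optimizer step touches only P, G, M, V, which is why it is input-independent.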