Mixture-of-Experts Pattern
Cameron Rohn · Category: frameworks_and_exercises
A sparse mixture-of-experts (MoE) architecture activates only about 32 B parameters per token at inference, which makes it cost-effective to scale the total model to a trillion parameters.
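The core mechanism is a router that sends each token to only a few expert sub-networks, so the parameters of the unselected experts stay idle for that token. Below is a minimal sketch of top-k expert routing in PyTorch; the class name, layer sizes, and the choice of k=2 out of 8 experts are illustrative assumptions, not the configuration of any particular trillion-parameter model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Sparse mixture-of-experts feed-forward layer with top-k routing.

    Only k of num_experts experts run per token, so the active parameter
    count per token is a small fraction of the layer's total parameters.
    """
    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts)  # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        logits = self.router(x)                           # (tokens, num_experts)
        weights, idx = torch.topk(logits, self.k, dim=-1) # keep the k best experts per token
        weights = F.softmax(weights, dim=-1)              # renormalize over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                  # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Example: 16 tokens, each processed by only 2 of the 8 experts.
if __name__ == "__main__":
    layer = TopKMoE(d_model=64, d_hidden=256, num_experts=8, k=2)
    tokens = torch.randn(16, 64)
    print(layer(tokens).shape)  # torch.Size([16, 64])
```

In this sketch the compute (and active parameters) per token scales with k rather than with num_experts, which is the property that lets total parameter counts grow far faster than inference cost.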