Two systems with identical parameter counts can behave dramatically differently depending on how they are built.
When I watch our trade start handing its tests to language models, I don't feel relief. I feel the same itch I get when a release goes too quiet.