@hilde45 One thing I keep turning over is that LLM users weren’t just equivalent to the control group on condition identification -- they were measurably worse. Unless I’m confused, that finding alone seems like the sharpest challenge to the "skilled users will figure it out" interpretation, and I’m not sure the thread has fully reckoned with it. But it’s time for a walk.
We should look at the big picture, i.e., 95% accuracy by LLMs alone versus 34% when used by end users for condition identification. This suggests that AI is highly sensitive to how people ask questions and to the quality of those questions.
There’s nothing wrong with trying to rationalize why Google/web search by end users (42%) slightly outperforms LLM use by end users (34%). One possible reason is that, when performing web searches, people tend to refine their queries and compare or cross-reference multiple sources, which may improve accuracy relative to relying on a single synthesized answer from an LLM.
That said, we’re looking at a difference that, although statistically significant, is only marginal (42% vs. 34%). Compared with the 95% accuracy achieved by LLMs alone, however, either figure represents a substantial (and disappointing) drop-off. It’s important to keep the big picture in perspective.

