@hilde45 The abstract does not spell out the results, so it is a bit confusing to read. The best way to understand the whole experiment is Fig. 2 in the 'Results' section.
Two outcomes are tested: (1) identifying the condition and (2) choosing the proper action (disposition). Both are measured for two groups: (1) direct queries to the LLMs (by professionals) and (2) responses from end users working with the LLMs through a user interface (UI).
Here is the confusing part you ran into. End users who are randomly assigned to use one of the LLMs are called the 'treatment group' in the article, and end users who are instructed to use whatever assistance they would typically use at home (including internet search) are called the 'control group'. The researchers also want to find out whether there is a significant difference between these two groups of end users. The findings are:
(1) The 'control group' end users perform statistically significantly better at identifying the condition than the 'treatment group' (upper right chart); and
(2) The 'control group' end users do NOT perform better at taking the proper action than the 'treatment group' (lower right chart).
Now for the main part of the results, which compares (1) end users' responses using LLMs and (2) direct queries to LLMs. Using GPT-4o for illustration, the results indicate:
(1) GPT-4o identifies 94.7% of conditions (upper left chart), while the end user groups identify only somewhere in the low 30% to low 40% range (upper right chart);
(2) GPT-4o's accuracy in recommending the proper action is 64.7% (lower left chart), while the end user group's accuracy is merely 40% (lower right chart).
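To make the size of the gap concrete, here is a rough back-of-the-envelope sketch using only the figures quoted above. The variable names are my own, and the 30% to 40% bounds are just my approximate reading of "low 30% to low 40%" off the chart, so treat them as an assumption rather than values from the paper.

```python
# Percentages quoted above for GPT-4o (read off Fig. 2).
llm_condition = 94.7                 # direct query: conditions identified
llm_action = 64.7                    # direct query: proper action recommended
user_action = 40.0                   # end users with the LLM: proper action

# ASSUMPTION: rough bounds for "low 30% to low 40%", not exact paper values.
user_condition_range = (30.0, 40.0)


def gap(llm_pct, user_pct):
    """Difference in percentage points between direct query and end users."""
    return round(llm_pct - user_pct, 1)


# Smallest and largest gap implied by the approximate end-user range.
condition_gap = (
    gap(llm_condition, user_condition_range[1]),
    gap(llm_condition, user_condition_range[0]),
)
action_gap = gap(llm_action, user_action)

print(f"Condition gap: {condition_gap[0]} to {condition_gap[1]} points")
print(f"Action gap: {action_gap} points")
```

So under those rough bounds, the model queried directly is ahead by more than 50 percentage points on identifying the condition, and by about 25 points on recommending the proper action.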
Hope this helps.
