AudiogoN

  Why is Double Blind Testing Controversial?
I noticed that the concept of "double blind testing" of cables is a controversial topic. Why? A/B switching seems like the only definitive way of determining how one cable compares to another, or any other component such as speakers, for example. While A/B testing (and particularly double blind testing, where you don't know which cable is A or B) does not show the long term listenability of a cable or other component, it does show the specific and immediate differences between the two. It shows the differences, if any: how slight they are, how important, etc. It seems obvious that without knowing which cable you are listening to, you eliminate bias and preconceived notions as well. So, why is this a controversial notion?
Moto_man  (System | Answers)

12-11-02
  Responses (1-67 of 67)

12-11-02   Doesn't double blind mean that the experimenter also does no ...   Drubin

12-11-02   I do not know why it is so controversial, but i can tell you ...   Wellfed

12-11-02   One reason blind (or double blind) testing is controversial ...   Jwrobinson

12-11-02   I think that double blind testing was invented to confuse pe ...   Twl

12-11-02   I wonder if a real "blind test" is what is meant h ...   Tobias

12-11-02   Drubin's right about double-blind: it means that nobody in t ...   Bomarc

12-11-02   Blind testing serves no useful purpose. it presumes that by ...   Judit

12-11-02   I think that double blind testing is essential. i have actua ...   Elmuncy

12-11-02   It's amazing that anyone would find a totally objective and ...   Labtec

12-11-02   The main reason why i am not a fan of a/b testing methods is ...   Redkiwi

12-11-02   A): the audible differences between cables is usually smalle ...   Plato

12-11-02   Having started this thread, i will weigh in with a comment. ...   Moto_man

12-11-02   I have had elmuncy's experience many times: noticing a drama ...   Drubin

12-11-02   Sad, isn't it?   Danvetc

12-11-02   I once blind tested a ford and a chevy. with the ford i bou ...   Garfish

12-11-02   Moto man, i do not question the importance of cable comparis ...   Judit

12-11-02   Some years ago i agreed to test prescription eyeglasses for ...   Albertporter

12-11-02   Tobias and redkiwi hit it on the head. it takes prolonged e ...   Ozfly

12-11-02   Well, you don't need to use the blind test to make your deci ...   Drubin

12-12-02   We've covered this before. as such, all i'll add to this th ...   Sean

12-12-02: Nirp
The use of blind and double-blind procedures presumes one is employing the logic of hypothesis testing. That is, that there is a null hypothesis (i.e., that there are no differences between two treatments, in this case two sets of interconnects) and an alternate hypothesis (there is indeed a difference). Experimenters are more than experimental custodians. Their biases and expectations can profoundly influence a study. To the extent that all people (including experimenters) have biases, one would double-blind the treatments to reduce, among other things, "experimenter effects." It’s surprisingly easy for an experimenter to influence a study (e.g., Stanley Milgram’s famous obedience studies). It is also easy for other participants (formerly known as “subjects”) to influence each other (e.g., Asch’s line judgment experiments, where participants tended to agree with Asch’s confederates that clearly dissimilar lines were the same).

There is a famous researcher/psychologist/statistician by the name of Robert Rosenthal who once told his students that he had obtained two breeds of rats from another famous researcher. One type of rat was called “maze smart” and the other was “maze dull.” Dr. Rosenthal asked the students to teach these rats to run through mazes (ah, the power of cheese). After a few weeks or so the students were asked to show off their rats’ maze prowess (as it were). The “maze smart” rats performed significantly better than their “dumb” counterparts. The kicker here is that the rats were ALL OF THE SAME STOCK, labeled “smart” or “dull” at random. One cannot infer that the students intentionally influenced the training, but the influence was most certainly measurable. Moreover, when the experimenter bias was measured, it turned out that the “smart” rats’ owners had "imparted" a greater positive measurement bias than the “dumb” rats’ owners had imparted a negative one.

There are probably much better examples than these, but I’m in a hurry to go downstairs for dinner :-) so I’ll wrap this up soon.

Something else to consider is that “different” does not mean “better.” People’s ability to remember sounds and colors varies greatly but rarely is the memory accurate after a short decay period. With audio equipment evaluation, it tends to result in a bias for a certain “sound” regardless of whether or not that sound is authentic. When it comes to making a decision as to whether one component is better than another, it probably makes the most sense to have a reference. In the case of audio, I’d say that reference should be THE REAL THING. It’s not practical to have a live orchestra tag along on equipment tests but it doesn’t hurt to keep that in mind. Some people go on and on about how they prefer one cable to another because their favorite is “warm” or whatever. Real sounds from an orchestra or a band are not necessarily “warm.”

All that said, if one believes that a $6,000 set of interconnects sounds better (they just might sound *different*) than a $70 pair then let ‘em. The more expensive cable might even sound closer to reality. One would hope that the more expensive cables aren’t just mostly cosmetics and markup.

--Paul

p.s. and yes, spending time with a set of cables or anything else in the system is a great way to know if one really likes the sound. On a marginally related note, a friend of mine once said “I’ve never owned a handheld device that I liked after having it for a week.”

Nirp  (Answers)


12-12-02   Well said bomarc. i'll go a little further along that road. ...   Paulwp

12-12-02   I have done blind testing myself, on many occasions, as i ha ...   Twl

12-12-02   Twl, thanks for taking the time to clearly articulate what i ...   Wellfed

12-12-02   Blind and double-blind is a way, as we all know, to attempt ...   Cpdunn99

12-12-02   What an interesting thread. i've othen wondered if there wou ...   Brianjh

12-12-02   Twl, i'm not following your logic or maybe i just didn't rea ...   Labtec

12-12-02   I advocate this testing as a way of attempting to control va ...   Drubin

12-12-02   Labtec, just because i have used blind testing in the past, ...   Twl

12-12-02   Because when you dbt, some people will hear differences wher ...   Gs5556

12-12-02   Gs5556, of course we did that. do you think that a couple of ...   Twl

12-12-02   To audition equipment in their home with their equipment usi ...   Petehul

12-12-02   Most people who claim to hear differences in cables, or what ...   Jwrobinson

12-12-02   I don't mind the double blind reviews. the ones that bother ...   Unsound

12-12-02   Hats off and a low bow unsound.   Ozfly

12-12-02   It was like that with me. some of the things that i have pur ...   Elmuncy

12-13-02   Paulwp: why specifically drag me into this ? i made no com ...   Sean

12-13-02   Oh, sean, i didnt mean anything by it. i was referring to a ...   Paulwp

12-13-02   Sean, i was not referring specifically to cables, though it ...   Jwrobinson

12-13-02   It's controversial mainly because most people don't understa ...   Hearhere

12-13-02   Sean, sean, sean. dbts exist to serve ". . . those that ...   Hearhere

12-13-02   Hearhere, again why the need to insist or imply that the ave ...   Wellfed

12-14-02   Wellfed: i think that hearhear was saying that well conduct ...   Sean

12-15-02   Sean, my response to hearhere pertained to his/her first pos ...   Wellfed

12-15-02   Wellfed: i was referring to people collected off the street ...   Sean

12-15-02   Sean, i always thought one of the key features of a true dbt ...   Onhwy61

12-15-02   Those that are administering the tests do not know what is b ...   Sean

12-16-02   Onhwy61 -- it is specifically the motivations that yield the ...   Ozfly

12-16-02   A true double blind test wouldn't be easy to set up, but as ...   Onhwy61

12-16-02   I have a proposal ... double blind posting. audiogon allows ...   Seandtaylor99

12-16-02   This is a very important proposal and needs urgent considera ...   Redkiwi

12-16-02   To address sean's and other's points about pulling random fo ...   Hearhere

12-16-02   Wellfed, in no way did i mean to imply that audiophiles are ...   Hearhere

12-16-02   Tinged spectacles. are we getting carried away with this? wh ...   Unsound

12-16-02   Did i hit a nerve ? or did i miss sarcasm in redkiwi's resp ...   Seandtaylor99

12-16-02   This is exactly why this topic is off limits at audio asylum ...   Albertporter

12-16-02   Huh? redkiwi and seandt were obviously joking. none of the ...   Paulwp

12-17-02   Banning topics such as this is a very bad idea. despite lim ...   Onhwy61

12-17-02   To answer the original question, dbt is controversial becaus ...   Hearhere

12-17-02   Very well stated hearhere.   Wellfed

12-17-02   All this talk of dbt, could anyone provide a link to any suc ...   Socrates

12-17-02   Hearhere, cool user name.   Unsound

12-17-02   Socrates: you've asked a mouthful of questions. i'd suggest ...   Bomarc

12-18-02: Rzado
I think Hearhere summed up the issue well in his last post, but I would come at it from a slightly different angle. Simply put, DBT is not, in and of itself, "controversial." However, there is a great deal of misunderstanding/ disagreement regarding its use and applicability. More particularly, DBT is simply a tool, the results of which are interpreted based on statistical analysis, and must be understood in that context. While DBT does have some applicability in the audio context, it is not the be-all and end-all that some make it out to be.

There are two main problems with how DBTs are used/viewed by certain audiophiles. First and foremost, what many do not understand (but what anyone with experience in statistics can tell you) is that if the result is not statistically significant, the DBT has not “proven” there are no differences between conditions! Rather, all that can be concluded is that the DBT failed to reject the null hypothesis in favor of the alternative hypothesis.

Second, small-trial (aka "small-N") listening tests analyzed at commonly used statistical significance levels (e.g. <.05) lead to large Type 2 error risks, thereby masking the very differences the tests are supposed to reveal.

Now breaking that down into English is a pain, but I'll give it a shot (I’m an engineer, as opposed to a statistician - thus any stats guys feel free to correct me). In a simple DBT, one attempts to determine if there are audible differences between two conditions (such as by inserting a new interconnect in a given system). This is more commonly called a hypothesis test - the goal is to determine whether you can reject a "null hypothesis" (there are in fact no differences between the two conditions) in favor of a "conjectured hypothesis" (there are in fact differences between the two conditions).

In a DBT, there are four possible results: 1) there are differences and the listener correctly identifies that there are differences; 2) there are no differences and the listener correctly identifies there are no differences; 3) there are no differences, but the listener believes there are differences; and 4) there are differences, but the listener believes there are no differences. Obviously, 1 and 2 are correct results. Circumstance 3 (concluding that differences exist when in reality they don’t) is commonly referred to as "Type 1 error". Circumstance 4 (missing a true difference) is commonly referred to as "Type 2 error". Put in terms of the hypothesis test stated above, Type 1 error occurs when the null hypothesis is true and wrongly rejected, and Type 2 error occurs when the null hypothesis is false and wrongly accepted.

Now, things get a little complicated. First we need to introduce a variable, p_u, which is the probability of success of the underlying process. In the listening context, this is the probability that a listener can identify a difference between conditions, which is based on the acuity of the listener, the magnitude of the differences, and the conditions of the trial (e.g. the quality of the components, recording, ambient noise, etc). Unfortunately, we can never “know” p_u, but can only make reasonable guesses at it.

We also need to introduce the variable "alpha". Alpha, or the significance level, is the threshold that the test result must beat before we reject the null hypothesis in favor of the alternative hypothesis. By selecting a suitable significance level during the data analysis, you select the risk of Type 1 error that you are willing to tolerate. A common significance level used in DBT testing is .05.

Finally, we need to look at the probability value. In hypothesis testing, the probability value is the probability of obtaining data as extreme or more extreme than the results achieved by the experiment assuming the null hypothesis is true (put another way, it is the likelihood of an observed statistic occurring on the basis of the sampling distribution).

Once the DBT is performed, one compares the probability value to alpha to determine whether the result of the test is statistically significant, such that we can reject the null hypothesis. In our example, if the null hypothesis is rejected, we can conclude there are in fact audible differences between ICs.
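To make that comparison concrete, here is a minimal sketch of the probability value for a simple same/different listening test. It assumes a one-sided binomial test with a 0.5 guessing rate; the function name `p_value`, the 16-trial count, and the two scores are illustrative choices, not anything prescribed by a particular DBT protocol.

```python
from math import comb

def p_value(correct: int, trials: int, chance: float = 0.5) -> float:
    """One-sided binomial tail: P(X >= correct) if the listener is only guessing."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )

ALPHA = 0.05  # significance level (assumed here; pick your own)

for correct in (11, 12):
    p = p_value(correct, trials=16)
    verdict = "reject the null" if p <= ALPHA else "fail to reject the null"
    print(f"{correct}/16 correct: p = {p:.4f} -> {verdict}")
```

Note that "fail to reject" is the only thing the second branch says. It makes no claim that the two conditions sound the same.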

Now, here comes the fun part. It might seem that you want to set the smallest possible significance level to test the data, thereby producing the smallest possible risk of Type 1 error (i.e., set alpha to .01 as opposed to .05). However, this doesn’t work, because, for a fixed number of trials, as you reduce the risk of Type 1 error (lower alpha), the risk of Type 2 error necessarily increases.
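A rough sketch of that trade-off, under the same one-sided binomial assumptions (0.5 guessing rate). The helper names `tail`, `critical_count` and `type2_risk` are made up for this illustration, and the trial count (16) and listener success rate (0.7) are just assumed values to have something to plug in.

```python
from math import comb

def tail(k_min: int, n: int, p: float) -> float:
    """P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))

def critical_count(n: int, alpha: float, chance: float = 0.5) -> int:
    """Smallest number of correct answers whose p-value is at or below alpha."""
    for k in range(n + 1):
        if tail(k, n, chance) <= alpha:
            return k
    return n + 1  # no possible score is significant at this alpha

def type2_risk(n: int, alpha: float, p_u: float, chance: float = 0.5) -> float:
    """Chance of missing a real difference: P(score < critical count | p_u)."""
    return 1 - tail(critical_count(n, alpha, chance), n, p_u)

# Same test, same assumed listener, two different significance levels.
for alpha in (0.05, 0.01):
    k = critical_count(16, alpha)
    beta = type2_risk(16, alpha, p_u=0.7)
    print(f"alpha={alpha}: need {k}/16 correct, Type 2 risk = {beta:.4f}")
```

Tightening alpha raises the score needed to pass, so a listener who really does hear a difference some of the time is more likely to fall short.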

A further, and greater, impediment to practical DBT testing is that the risk of Type 2 error increases not only as you reduce Type 1 error risk, but also with reductions in the number of trials (N) and in the listener's true ability to hear the differences under test. Since you really never know p_u, and can only speculate on how to increase it (e.g., by selecting only high quality recordings of unamplified music using a high quality system to test the ICs), the best ways to reduce the risk of Type 2 error in a practical listening test are to increase either N or the risk of Type 1 error.

Now for some examples. Let's assume we use 16 trials on the IC in question. For purposes of the example, further assume that the probability of randomly guessing correctly whether the new IC was inserted is 0.5. Finally, we must make a guess at “p_u”, which we could say is 0.7. In this instance, the minimum number of correct results for the probability value to fall below .05 is 12 (our Type 1 error in this case is 0.0384). However, our Type 2 error in this case goes through the roof - in this example, it is .5501, which is huge! Thus, this test suffers from a high level of Type 2 error, and is therefore unlikely to resolve differences that actually exist between the interconnects.
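The 16-trial figures can be checked with a few lines of arithmetic. This is only a sketch under the assumptions stated in the example (one-sided binomial test, guessing rate 0.5, p_u guessed at 0.7); the constant and function names are illustrative.

```python
from math import comb

N, CHANCE, P_U, ALPHA = 16, 0.5, 0.7, 0.05  # values taken from the example

def pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k correct answers out of n."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Smallest score whose upper-tail probability under pure guessing is <= ALPHA.
critical = next(
    k for k in range(N + 1)
    if sum(pmf(j, N, CHANCE) for j in range(k, N + 1)) <= ALPHA
)

# Type 1 risk: a guesser reaches the critical score anyway.
type1 = sum(pmf(j, N, CHANCE) for j in range(critical, N + 1))

# Type 2 risk: a listener with true success rate P_U falls short of it.
type2 = sum(pmf(j, N, P_U) for j in range(critical))

print(f"critical score: {critical}/{N}")
print(f"Type 1 risk: {type1:.4f}")
print(f"Type 2 risk: {type2:.4f}")
```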

What happens if there were only 11 correct results? Our p value is then .1051, which exceeds alpha. Thus, we are not able to reject the null hypothesis in favor of the alternative hypothesis. However, this does not allow us to conclude that there are in fact no audible differences between ICs. In other words, data not sufficient to show convincingly that a difference between conditions is not zero do not prove that the difference is zero.

So now let's increase the number of trials to 50. Now, the number of correct results needed to yield statistically significant results is 32 (p value = .0325). Assuming again p_u is 70%, our Type 2 error drops to ~ 0.14, which is more acceptable, and thus differences between conditions are more likely to be revealed by the test.

OK, one last variation. Let’s assume that the differences are really minor, or we are using a boom box to test the interconnects, such that p_u is only 60%. What happens to Type 2 error? It goes up - in the 50 trial example above, it goes from .1406 to .6644 - again, the test likely masks any true difference between ICs.
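To see how N and p_u move the Type 2 risk together, here is a small sweep over the cases discussed above. Same hedges as before: a one-sided binomial test, alpha of .05, guessing rate of 0.5, and helper names (`tail`, `critical_count`, `type2_risk`) that are invented for this sketch.

```python
from math import comb

ALPHA, CHANCE = 0.05, 0.5  # assumed significance level and guessing rate

def tail(k_min: int, n: int, p: float) -> float:
    """P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))

def critical_count(n: int) -> int:
    """Smallest score that is significant at ALPHA under pure guessing."""
    return next((k for k in range(n + 1) if tail(k, n, CHANCE) <= ALPHA), n + 1)

def type2_risk(n: int, p_u: float) -> float:
    """Probability that a listener with success rate p_u misses the critical score."""
    return 1 - tail(critical_count(n), n, p_u)

print(" N  p_u  needed  Type 2 risk")
for n in (16, 50):
    for p_u in (0.7, 0.6):
        print(f"{n:2d}  {p_u:.1f}  {critical_count(n):6d}  {type2_risk(n, p_u):.4f}")
```

Swapping in larger values of N shows how many trials it takes before a subtle difference (p_u near 0.6) has a decent chance of showing up.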

To sum up, DBT is a tool that can be very useful in the audio context if used and understood correctly. Indeed, this is where I take issue with Bomarc, when he says "I don't want to get into statistics, except to say that's usually not the weak link in a DBT". Rather, the (mis)understanding of statistics is precisely the weak link in applicability of DBTs.

Rzado  (System | Threads | Answers)


12-18-02   Thanks, rzado, for the refresher course. let me try to summa ...   Bomarc

12-18-02: Rzado
Good post, Bomarc - I agree with 98% of what you had to say. I guess the one thing I'm not sure about is the point you are making with respect to multiple inconclusive tests leading to a strong inference that a difference is inaudible. If you have multiple tests with high Type 2 error (e.g. Beta ~.4-.7), I do not believe this is accurate. However, if you have multiple tests where you take steps to minimize Type 2 error (high N trials), I can see where you are going. But you are correct, that can start getting messy.

Thanks for clarifying your point about statistics, though. In general, I tend to give experimenters the benefit of the doubt with respect to setting up the DBT, unless I have a specific problem with the setup. But I agree, there are numerous ways to screw it up.

However, the few studies in high-end audio with which I am familiar (e.g. the ones done by Audio magazine back in the 80's) in general suffered from the problems outlined above (small N leading to high Type 2 error, erroneous conclusions based on non-rejection of null hypothesis due to tests not achieving p value < .05). There have been a couple of AES studies with which I'm familiar where the setup was such that p_u was probably no better than chance - in that circumstance, you can say either the setup is screwed up or the interpretation of the statistics is screwed up. At least one or two studies, though, were pretty demonstrative (e.g. the test of the Genesis Digital Lens, which resulted in 124 out of 124 correct identifications).
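For a sense of scale on that 124-for-124 result, here is a back-of-envelope check. It assumes each identification would be an independent 50/50 guess under the null hypothesis; that guessing rate is an assumption for illustration, since the design of that test is not described here.

```python
TRIALS = 124
CHANCE = 0.5  # assumed guessing rate per trial

# With a perfect score, the one-sided p-value is just the chance of
# guessing every single trial correctly.
p_all_correct = CHANCE ** TRIALS

print(f"p-value for {TRIALS}/{TRIALS} by luck alone: {p_all_correct:.2e}")
```

Under that assumption, luck is not a serious explanation at any significance level anyone would use.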

My biggest beef with DBT in Audio is that you just need to do the work - i.e. use high N trials - which is a lot easier said than done.

Rzado  (System | Threads | Answers)


12-18-02   Rzado: my point on retesting is this: if something really is ...   Bomarc

