Using Claude (or another AI) to design a systematic amp comparison protocol


I had a great exchange with AI today about my audio trials of a class D amp. Here is a summary of what it advised. Possibly of interest.

Claude:

"I have 20 days to decide whether to keep a TRIAL CLASS D amp or return it, and I need to compare it against a borrowed Hypex DIY amp and my Pass amp, using both solid-state and tube preamps. The problem was figuring out how to do meaningful comparisons without getting lost in endless testing or relying on faulty auditory memory.

I asked Claude to help me design a testing protocol. I started with my basic idea—compare all the amps with both preamps—and Claude helped me think through the methodology more carefully. It pointed out issues I hadn't considered, like auditory memory decay making day-to-day comparisons unreliable, the importance of controlling for listening order bias, and why I should test the two Class D amps against each other first before bringing the Pass into the picture.

The back-and-forth was really helpful. Claude initially suggested some things that were too complicated (three amps per day), and I pushed back. It also caught an oversight I made about not comparing the Pass with the tube preamp. We went through several iterations, and I kept asking it to revise based on what would actually be practical for my schedule and listening stamina.

What we ended up with was a progressive elimination structure: Days 1-6 compare TRIAL CLASS D amp vs Hypex with both preamps, with built-in order reversals to control for bias. If the TRIAL CLASS D amp doesn't beat the Hypex, I'm done—return it. Only if it survives does it earn the right to be compared against the Pass for another 6 days. The protocol includes specific note-taking templates for each session, decision checkpoints, and early exit options if results are clear.

Claude also emphasized things I might have glossed over—like limiting to just 3 reference tracks for consistency, taking real breaks between listening sessions, and writing notes after (not during) each session. It created a complete 12-day schedule with dates, checkboxes, and decision trees, formatted as a printable RTF document I can mark up as I go.

If you're facing a similar decision with gear comparisons, I'd recommend trying this approach. You basically describe what you're trying to figure out, let Claude ask clarifying questions, push back when something doesn't fit your situation, and iterate until you have a plan that makes sense. It's like having a conversation with someone who understands experimental design and can help you think through the methodology without getting emotionally invested in the outcome.

HOW DOES IT KNOW? WHAT IS THE SOURCE OF THIS ADVICE?
 
When I asked Claude about its sources, it explained that the methodology draws on research in psychoacoustics (like studies on auditory memory decay), experimental psychology (order effects and position bias in preference testing), and sensory evaluation methods used in fields like food science and wine tasting. The general principles about counterbalanced presentation, same-day comparisons for better memory retention, and progressive elimination structures come from established experimental design practices across multiple domains.

The audio-specific knowledge comes from Claude's training on technical literature and enthusiast discussions over many years, though it can't cite specific papers. It mentioned that if you want to dig deeper into the research foundations, look into the psychoacoustics literature on auditory discrimination, sensory evaluation work on paired comparison testing, and audio researchers like Floyd Toole and Sean Olive at Harman who've published extensively on listening test methodology.

Claude was also transparent that some of the practical details, such as the specific session lengths (30-40 minutes), break durations (15 minutes), and when to take notes, are informed heuristics based on how human attention and fatigue work rather than figures taken from rigorous audio testing protocols. So it's a mix of research-backed principles and reasonable practical judgments about what makes testing sustainable and reliable."
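
For anyone who would rather generate a schedule like this than write it out by hand, here is a rough Python sketch of the progressive elimination structure described above. It is only an illustration under my own assumptions, not the schedule Claude actually produced: the names `Session`, `build_phase`, and `build_protocol` are made up, the split of each 6-day phase into three days per preamp is a guess, and the `trial_beats_hypex` flag simply stands in for the listener's own verdict at the Day 6 checkpoint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    day: int
    preamp: str
    first: str   # amp heard first in the session
    second: str  # amp heard second in the session


def build_phase(amp_a, amp_b, preamps, days, start_day=1):
    """Build one comparison phase with alternating listening order.

    Assumption: days are split evenly among the preamps, in blocks.
    """
    if days % len(preamps) != 0:
        raise ValueError("days must divide evenly among the preamps")
    per_preamp = days // len(preamps)
    sessions = []
    for i in range(days):
        preamp = preamps[i // per_preamp]
        # Reverse the listening order every other day to offset order bias.
        first, second = (amp_a, amp_b) if i % 2 == 0 else (amp_b, amp_a)
        sessions.append(Session(start_day + i, preamp, first, second))
    return sessions


def build_protocol(trial_beats_hypex):
    """Days 1-6: trial amp vs Hypex. Days 7-12 vs Pass only if the trial amp survives."""
    preamps = ["solid-state preamp", "tube preamp"]
    schedule = build_phase("TRIAL CLASS D", "Hypex", preamps, days=6)
    if trial_beats_hypex:
        # The trial amp earned a second phase against the Pass.
        schedule += build_phase("TRIAL CLASS D", "Pass", preamps, days=6, start_day=7)
    return schedule


if __name__ == "__main__":
    for s in build_protocol(trial_beats_hypex=True):
        print(f"Day {s.day:2d}: {s.preamp:18s} {s.first} then {s.second}")
```

With six days per phase, each amp goes first three times, so neither one always benefits from being heard first or last. Calling `build_protocol(trial_beats_hypex=False)` returns only the first six sessions, which corresponds to the early exit where the trial amp goes back.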

hilde45

@erik_squires I realize that. Thanks. I use it to help with permutations and with subjects where I have some preexisting knowledge. For other things, I ask for citations and sources.

The hallucination errors you cite have become a prejudice for many, preventing them from making some very good use of these tools. Perhaps that is good (for the environment), but it also perpetuates misunderstandings of the tools’ uses. Smart prompts and careful vetting of answers are prerequisites for getting something out of AI, or really any other tool.

A prejudice is an opinion that is not based on reason or actual experience.  This is not what I have. 

How Claude and other AI tools misbehave is a known behavior pattern, something I know from personal experience AND from reading about the experiences of others.

The belief that a smart prompt will shield you from AI scheming has been shown to be untrue. At best, smart prompts need to be paired with constant checking of sources.

Despite these known failure modes, I continue to use these tools. That doesn’t mean I pretend they don’t fail. If you want a fascinating read, look up "scheming" in AI models: what it means and the attempts to reduce it.

The reason I say they turn "evil" is that several times I've been getting really good information and then, from one sentence to the next, I'm getting fake data, with no lights, buzzers, or contextual warnings. That is what can trip you up.

Fascinating that you don't seem to trust your ears to make these decisions.

 

Didn’t say you had a prejudice. Many people do, though. Didn’t mean to imply it was you.

The belief that a smart prompt will shield you from AI scheming has been shown to be untrue. At best, smart prompts need to be paired with constant checking of sources.

That’s what I do. I have been studying and writing about AI for a couple years now.

Fascinating that you don’t seem to trust your ears to make these decisions.

I'm not a simpleton.

Have you tried prompting for great prompts? Basically, asking AI to help you ask the right questions to create the prompt you’re looking for. Sounds like you did some of that with your prompt refinements. It’s an interesting way to get AI constrained to more of the context of your question.