I think the vast majority of terms audiophiles use to describe what they are hearing (imaging, soundstage, detail, transparency, attack/decay, etc.) are related to something real, acoustically speaking.
I have yet to hear an explanation, rooted in acoustics, that could account for one system being more rhythmically accurate than another.
As for one system getting someone to bob their head or tap their foot along with the music, that is much more likely down to other sonic factors that simply make the listener more engaged with the system.
The only timing issues I believe are audible come from badly misaligned drivers, which cause step-response, phase, and arrival-time problems. But those would show up as transient response, imaging/soundstage, and vertical/horizontal off-axis response issues, not as rhythm.
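For a sense of scale, here is a rough sketch of the arrival-time difference a driver offset produces. The 5 cm offset and 2 kHz crossover are made-up illustration values, not measurements of any real speaker, and the function names are just mine. The only physics assumed is delay = distance / speed of sound.

```python
# Rough sketch: arrival-time and phase difference from a driver offset.
# All input values below are hypothetical, chosen only for illustration.

SPEED_OF_SOUND_M_S = 343.0  # approximate speed of sound in air at room temperature


def arrival_delay_s(offset_m: float) -> float:
    """Extra travel time (seconds) for sound from a driver set back by offset_m."""
    return offset_m / SPEED_OF_SOUND_M_S


def phase_shift_deg(offset_m: float, freq_hz: float) -> float:
    """Phase difference (degrees) that the delay amounts to at freq_hz."""
    return 360.0 * freq_hz * arrival_delay_s(offset_m)


if __name__ == "__main__":
    offset_m = 0.05        # assumed 5 cm offset between acoustic centers
    crossover_hz = 2000.0  # assumed crossover frequency
    beat_s = 60.0 / 120.0  # one beat at 120 BPM, for comparison

    delay_s = arrival_delay_s(offset_m)
    print(f"delay: {delay_s * 1000:.3f} ms")
    print(f"phase at crossover: {phase_shift_deg(offset_m, crossover_hz):.1f} degrees")
    print(f"delay as a fraction of one beat: {delay_s / beat_s:.6f}")
```

With those assumed numbers the delay works out to roughly 0.15 ms, which is about 105 degrees of phase at 2 kHz but only a tiny fraction of the 500 ms between beats at 120 BPM. That is plenty to affect how the drivers sum through the crossover region, and nowhere near the time scale of musical rhythm.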

