Markov-Switching Models for Musical Interpretation
Daniel J. McDonald
“Jazzy sound”
“Strong beat”
“good lyrics”
“anything by Taylor Swift”
Harder to describe a performance
Classical music is mainly about performances of the same music
How do we think about which ones we like?
The real difference is that when a composer writes a piece of what’s usually called classical music, he puts down the exact notes that he wants, the exact instruments or voices that he wants to play or sing those notes—even the exact number of instruments or voices; and he also writes down as many directions as he can think of.
–Leonard Bernstein
Generally much more musically complicated
Notes change speed
Sometimes purposeful
Speed is important for interpretation
Notes: all those little black dots
Beat: strongly felt impetus
Measure: collections of notes delimited by vertical “barlines”
Time signature: number of beats per measure; type of note that gets the beat
Tempo: the prevailing speed, measured in beats per minute (bpm)
Dynamics: loudness of the note
Focus on timing only (dynamics also available)
50 recordings: Chopin Mazurka Op. 68 No. 3
Recorded between 1931 and 2006
45 different performers
Fryderyk Chopin (1810–1849)
Born in Poland
Moved to Paris at 21
Virtuoso pianist
Wrote mainly piano music
Mazurka
A Polish dance
Chopin composed at least 58 for piano
Repetition is very important
Certain rhythmic figures
1. Constant tempo
2. Accelerando (speeding up)
3. Ritardando (slowing down)
4. Tenuto (emphasis)
Before, \(|\mathcal{S}| = 4\). Now, \(|\mathcal{S}| = 11\)
Viterbi algorithm gives estimates of \(\{S_k\}\)
Computation is \(O(n)\): likelihood of \(p\) and estimates of \(\{S_k\}\)
Inside of EM (Baum-Welch) gives estimates of \(p\)
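As a point of reference, here is a minimal Viterbi sketch for a generic discrete HMM in Python. It is not the implementation used here; the transition matrix and emission log-likelihoods are placeholders.

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """Most likely state sequence for a discrete HMM.

    log_init:  (S,) log initial-state probabilities
    log_trans: (S, S) log transition probabilities, rows = from-state
    log_emit:  (n, S) log-likelihood of each observation under each state
    Runs in O(n * S^2) time, i.e., linear in the sequence length n.
    """
    n, S = log_emit.shape
    delta = np.zeros((n, S))             # best log-prob ending in each state
    back = np.zeros((n, S), dtype=int)   # argmax backpointers
    delta[0] = log_init + log_emit[0]
    for k in range(1, n):
        cand = delta[k - 1][:, None] + log_trans   # (S, S): from i to j
        back[k] = np.argmax(cand, axis=0)
        delta[k] = cand[back[k], np.arange(S)] + log_emit[k]
    # backtrack from the best terminal state
    states = np.empty(n, dtype=int)
    states[-1] = np.argmax(delta[-1])
    for k in range(n - 2, -1, -1):
        states[k] = back[k + 1, states[k + 1]]
    return states
```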
\[ \begin{aligned} X_{k+1} &= d_k + T_k X_k + \eta_{k+1}, & \eta_{k+1} &\sim \textrm{N}(0, Q_{k+1}),\\ Y_k &= c_k + Z_k X_k + \epsilon_{k}, &\epsilon_k & \sim \textrm{N}(0, G_k).\\ \end{aligned} \]
Assume \(X_0\) is Gaussian
KF just tracks mean and variance of \(X_k\ |\ \{Y_i\}_{i=1}^k\)
Does this iteratively for each \(k\)
kfilter() gives estimates of \(\{X_k\; |\; Y_1,\ldots,Y_k\}_{k=1}^n\)
kfilter() also gives the likelihood of \(\theta\)
ksmoother() gives estimates of \(\{X_k\; |\; Y_1,\ldots,Y_n\}_{k=1}^n\)
\[ \begin{aligned} X_{k+1} &= {\color{Plum} d_k} + {\color{Plum} T_k} X_k + \eta_{k+1}, & \eta_{k+1} &\sim \textrm{N}(0, {\color{Plum} Q_{k+1}}),\\ Y_k &= {\color{Plum} c_k} + {\color{Plum} Z_k} X_k + \epsilon_{k}, &\epsilon_k & \sim \textrm{N}(0, {\color{Plum} G_k}).\\ \end{aligned} \]
Parameter matrices may depend on unknown parameters \(\theta\)
kfilter() \(+\) ksmoother() algorithms give estimates of \(\{X_k\}\)
Computation is \(O(n)\): likelihood of \(\theta\) and estimates of \(\{X_k\}\)
Inside of EM gives estimates of \(\theta\)
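Since the filter returns the likelihood of \(\theta\), one could also fit \(\theta\) by direct numerical maximization instead of EM. A sketch, where build_matrices(), y, and theta_init are hypothetical stand-ins for the actual parametrization and data:

```python
from scipy.optimize import minimize

def negloglik(theta):
    # build_matrices() is a hypothetical map from the free parameters
    # theta to the system matrices; details depend on the parametrization
    d, T, Q, c, Z, G, x0, P0 = build_matrices(theta)
    # kalman_filter() as sketched above; its third output is the log-likelihood
    return -kalman_filter(y, d, T, Q, c, Z, G, x0, P0)[2]

theta_hat = minimize(negloglik, theta_init, method="Nelder-Mead").x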
\[ \begin{aligned} X_{k} &= d(s_k,\ s_{k-1}) + T(s_k,\ s_{k-1}) X_{k-1} + \eta_{k}\\ Y_k &= c(s_k) + Z(s_k) X_k + \epsilon_{k}\\ \eta_{k} &\sim \textrm{N}(0, Q(s_k,\ s_{k-1}))\\ \epsilon_k &\sim \textrm{N}(0, G(s_k)) \end{aligned} \]
We want to estimate: \(\theta\), \(p\), \(\{X_k\}\), and \(\{S_k\}\)
Exact inference costs \(O(|\mathcal{S}|^n)\), with \(|\mathcal{S}|=11\) and \(n=231\)
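For a sense of scale, the number of state sequences being summed over:

```python
print(len(str(11 ** 231)))  # 241: the count of digits in 11**231
```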
dpf()
Track at most \(J\) paths through the \(|\mathcal{S}|^n\) tree
At time \(k\), receive \(J\) paths, each a state sequence with a weight
Propagate each of the \(J\) paths forward, creating \(\approx J^2\) possibilities
Using the weights, sample the \(\approx J^2\) possibilities to keep only \(J\)
Iterate forward through time until done; return the \(J\) sequences and their weights (sketched below)
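A minimal sketch of the idea behind dpf(), in Python. The step_loglik() hook is a hypothetical stand-in for the per-step likelihood (one switching Kalman-filter update in this application), and for brevity the weighted sampling step is replaced by greedy selection of the top \(J\) paths:

```python
import numpy as np

def discrete_particle_filter(n_steps, log_trans, step_loglik, J):
    """Track at most J high-weight paths through the |S|^n tree.

    log_trans: (S, S) array of log transition probabilities.
    step_loglik(k, path): hypothetical hook returning the log-likelihood
        contribution of observation k when the path ends in state path[-1].
    """
    S = log_trans.shape[0]
    # initialize with one path per state (uniform initial weights for brevity)
    particles = [((s,), step_loglik(0, (s,))) for s in range(S)]
    particles = sorted(particles, key=lambda p: -p[1])[:J]
    for k in range(1, n_steps):
        candidates = []
        for path, lw in particles:        # propagate each surviving path
            for s in range(S):            # ... branching over next states
                lt = log_trans[path[-1], s]
                if np.isneginf(lt):
                    continue              # skip disallowed transitions
                new_path = path + (s,)
                candidates.append((new_path, lw + lt + step_loglik(k, new_path)))
        # prune back to J paths; the slide samples J using the weights,
        # here we greedily keep the J largest for simplicity
        particles = sorted(candidates, key=lambda p: -p[1])[:J]
    return particles  # J state sequences with their log weights
```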
For each performance:
dpf() gives greedy state sequence \(\{\widehat{S}_k\}_{k=1}^n\)
Run kfilter(), dpf(), and ksmoother() at \(\widehat{\theta}\) to get \(\{\widehat{S}_k\}_{k=1}^n\) and \(\{\widehat{X}_k\}_{k=1}^n\)
For each performance, we estimate \(p\) and \(\theta\) by penalized maximum likelihood.
Abuse notation and redefine \(\theta := (p, \theta)\), which includes:
average speed in different states
some variance parameters
transition probabilities
We have strong prior information.
Use the Mahalanobis distance \[d(\theta,\theta') = \sqrt{(\theta-\theta')^\mathsf{T} \mathbf{V}^{-1}(\theta-\theta')}\]
\(\mathbf{V}\) is the prior covariance matrix
Correctly incorporates correlations among the entries of probability vectors
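A short sketch of this distance in Python, computed via a Cholesky solve rather than forming \(\mathbf{V}^{-1}\) explicitly:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def mahalanobis(theta, theta_prime, V):
    """d(theta, theta') = sqrt((theta - theta')^T V^{-1} (theta - theta'))."""
    diff = np.asarray(theta) - np.asarray(theta_prime)
    # solve V z = diff via Cholesky instead of inverting V explicitly
    z = cho_solve(cho_factor(V), diff)
    return np.sqrt(diff @ z)
```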
Some performances have no close neighbors
We develop a switching model for tempo decisions
We give an algorithm for performing likelihood inference
We estimate our model using a large collection of recordings of the same composition
We demonstrate how the model recovers performer intentions
We use the learned representations to compare and contrast recordings
Similar model for dynamics
Can we do this fast and “online”?
Use it for teaching?
Markov Models for Music — dajmcdon/hmms-chopin-2025