Massey Ratings for Match Outcome Prediction in Table Tennis: Evidence of Greater Stability than the ITTF World Ranking

Yuki Takasaki, Eiji Konaka (Meijo University, Japan)

Outline
1. Background
2. Research Objective
3. Prediction Models
4. Evaluation Metrics
5. Results and Discussion
6. Summary

Background: Player Strength Estimation in Sports
⚫ Numerical indicators of player strength (rankings) are essential in competitive sports
⚫ Used for seeding, participation eligibility, and fan engagement
⚫ Representative examples
  ⚫ Tennis → ATP / WTA Rankings
  ⚫ Table tennis → ITTF World Ranking
Image sources: Ref. A), Ref. B)

Background: The ITTF World Ranking System
⚫ Official world ranking established by the ITTF
⚫ Ranking points are earned across tournaments of varying tier and point allocation
⚫ Rankings are determined by points accumulated over the past 52 weeks
⚫ Only the best 8 tournament results count toward the ranking
⚫ Rankings shown: Week 44, 2025
  ⚫ Tomokazu Harimoto: 4th (5,500 pts)
  ⚫ Sora Matsushima: 15th (↑1)
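The two rules above (52-week window, best 8 results) can be sketched as follows. This is a simplified illustration only: the function name `ranking_points`, the `(event_date, points)` input format, and the handling of the window boundary are assumptions, and the official ITTF regulations contain details that are not modeled here.

```python
from datetime import date, timedelta

def ranking_points(results, as_of, window_weeks=52, best_n=8):
    """Sum the best `best_n` results earned within the last `window_weeks` weeks.

    `results` is a list of (event_date, points) pairs (hypothetical format).
    """
    cutoff = as_of - timedelta(weeks=window_weeks)
    # Keep only results inside the window (boundary handling is an assumption).
    recent = [pts for event_date, pts in results if cutoff < event_date <= as_of]
    return sum(sorted(recent, reverse=True)[:best_n])

# Hypothetical player: ten recent results plus one large but expired result.
results = [(date(2025, 1, 1) + timedelta(weeks=4 * k), 100 * (k + 1)) for k in range(10)]
results.append((date(2023, 1, 1), 5000))
print(ranking_points(results, as_of=date(2025, 11, 1)))
```

Only the eight largest recent results contribute, so the two smallest recent results and the expired one are ignored.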

Background: WTT Tournament Structure
[Figure: "What is WTT?" / WTT tournament structure]
Tier   Category
I      Grand Smash
II     WTT Finals
III    WTT Champions
IV     WTT Star Contender
V      WTT Contenders
⚫ The WTT tour, operated by WTT, the ITTF's commercial arm, was launched in 2021
⚫ Points are awarded based on the round reached in each event
  ⚫ Grand Smash winner: 2,000 pts
  ⚫ Champions winner: 1,000 pts

Background: WTT Point Allocation
• Points roughly double with each additional win (the winner earns about 1.4 times the finalist's points)
• Higher-tier events award more points

                 W      F      SF     QF     R16    R32    …
Grand Smash      2000   1400   700    350    175    90     …
Champions        1000   700    350    175    90     15     …
Star Contender   600    420    210    105    55     25     …
Contender        400    280    140    70     35     4      …
Ratio between adjacent rounds: W/F ≃ ×1.4, F/SF = ×2, SF/QF = ×2, QF/R16 ≃ ×2

→ Does this ranking system accurately reflect player strength?
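The multipliers quoted on this slide can be checked directly from the table values. The short sketch below computes the ratio between each round and the next-lower round; the dictionary layout and the helper name `round_ratios` are illustrative choices, not part of the original deck.

```python
# Points by round reached, copied from the table (W, F, SF, QF, R16, R32).
POINTS = {
    "Grand Smash":    [2000, 1400, 700, 350, 175, 90],
    "Champions":      [1000, 700, 350, 175, 90, 15],
    "Star Contender": [600, 420, 210, 105, 55, 25],
    "Contender":      [400, 280, 140, 70, 35, 4],
}

def round_ratios(points):
    """Ratio of the points for each round to those of the next-lower round."""
    return [round(hi / lo, 2) for hi, lo in zip(points, points[1:])]

for tier, pts in POINTS.items():
    print(f"{tier:15s}", round_ratios(pts))
```

The first four ratios are close to 1.4, 2, 2, 2 in every tier, while the R16/R32 step varies considerably between tiers.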

Research Objective
⚫ The ITTF ranking is widely used as a measure of player strength
⚫ However, its predictive validity has not been rigorously examined
⚫ Ranking points reflect only the round reached, not the margin of victory
⚫ Our approach: ratings derived from match results can capture the margin of victory
Objective: Build prediction models based on both approaches and compare their predictive performance.

Prediction Models: Two Approaches

                Ranking-based                 Rating-based (proposed)
Input           WTT ranking points            Massey ratings
Training data   2022–2024 (12,494 matches)    2023–2024 (10,488 matches)
Test data       2025 (4,536 matches)          2025 (5,221 matches)

Common to both approaches:
• Model: 3-PLM
• Evaluation: Accuracy, Log Loss, ECE
• Source: 19,477 matches from WTT tournaments and the Olympics (2022–Sep 2025)

Prediction Models: Massey Rating
⚫ Massey rating method
  ⚫ Simple algorithm; no prior ranking required
  ⚫ Finds the rating (strength estimate) for each player
  ⚫ Rating differences correspond to expected score differences
$$ r_i - r_j + \epsilon_k = s_k $$
• $r_i, r_j$: ratings of players $i$ and $j$
• $s_k$: score difference in match $k$
• $\epsilon_k$: error term

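Prediction Models: Massey Rating (Numerical Example)
⚫ Massey rating method
$$ r_i - r_j + \epsilon_k = s_k \quad\Longleftrightarrow\quad r_i - r_j = s_k - \epsilon_k $$
• $r_i, r_j$: ratings of players $i$ and $j$
• $s_k$: score difference in match $k$
• $\epsilon_k$: error term

Match results ($s_k = s_i - s_j$):
k   i   j   s_i   s_j   s_k
1   1   2   5     0     5
2   1   2   4     1     3
3   2   3   3     2     1
4   2   3   1     2     -1
5   3   1   0     1     -1
6   3   1   4     2     2

Example:
$$
\begin{aligned}
r_1 - r_2 &= 5 - \epsilon_1 \\
r_1 - r_2 &= 3 - \epsilon_2 \\
r_2 - r_3 &= 1 - \epsilon_3 \\
r_2 - r_3 &= -1 - \epsilon_4 \\
r_3 - r_1 &= -1 - \epsilon_5 \\
r_3 - r_1 &= 2 - \epsilon_6
\end{aligned}
$$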

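Prediction Models: Massey Rating (Numerical Example)
⚫ Massey rating method
The six match equations ($r_1 - r_2 = 5 - \epsilon_1$, …, $r_3 - r_1 = 2 - \epsilon_6$) in matrix-vector form:
$$
\begin{pmatrix}
 1 & -1 &  0 \\
 1 & -1 &  0 \\
 0 &  1 & -1 \\
 0 &  1 & -1 \\
-1 &  0 &  1 \\
-1 &  0 &  1
\end{pmatrix}
\begin{pmatrix} r_1 \\ r_2 \\ r_3 \end{pmatrix}
=
\begin{pmatrix} 5 \\ 3 \\ 1 \\ -1 \\ -1 \\ 2 \end{pmatrix}
-
\begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \epsilon_4 \\ \epsilon_5 \\ \epsilon_6 \end{pmatrix}
$$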

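Prediction Models: Massey Rating (Numerical Example)
⚫ Massey rating method
Matrix-vector form:
$$ X r = s - \epsilon $$
where, for the numerical example,
$$
X = \begin{pmatrix}
 1 & -1 &  0 \\
 1 & -1 &  0 \\
 0 &  1 & -1 \\
 0 &  1 & -1 \\
-1 &  0 &  1 \\
-1 &  0 &  1
\end{pmatrix},
\quad
r = \begin{pmatrix} r_1 \\ r_2 \\ r_3 \end{pmatrix},
\quad
s = \begin{pmatrix} 5 \\ 3 \\ 1 \\ -1 \\ -1 \\ 2 \end{pmatrix}
$$
Least-squares method: minimize $\epsilon^\top \epsilon$
$$ r = (X^\top X)^{-1} X^\top s $$
• $r$: vector of player ratings
• Note: every row of $X$ sums to zero, so $X^\top X$ is singular and ratings are determined only up to an additive constant. The inverse is therefore understood together with a constraint such as $\sum_i r_i = 0$, or as a pseudoinverse.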
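The least-squares step can be sketched in plain Python for the three-player example. This is a minimal illustration, not the authors' implementation: the function names `massey_ratings` and `solve` are hypothetical, and the constraint that the ratings sum to zero is an assumption added to make the normal equations solvable (a common choice for Massey ratings).

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    aug = [row[:] + [b[k]] for k, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def massey_ratings(matches, n_players):
    """Least-squares ratings from (i, j, s_k) triples with 0-based indices."""
    # Build the normal equations M r = p with M = X^T X and p = X^T s.
    M = [[0.0] * n_players for _ in range(n_players)]
    p = [0.0] * n_players
    for i, j, s in matches:
        M[i][i] += 1
        M[j][j] += 1
        M[i][j] -= 1
        M[j][i] -= 1
        p[i] += s
        p[j] -= s
    # M is singular, so replace the last equation with sum(r) = 0 (assumption).
    M[-1] = [1.0] * n_players
    p[-1] = 0.0
    return solve(M, p)

# The six matches of the numerical example (players 1..3 become indices 0..2).
matches = [(0, 1, 5), (0, 1, 3), (1, 2, 1), (1, 2, -1), (2, 0, -1), (2, 0, 2)]
ratings = massey_ratings(matches, 3)
print([round(r, 3) for r in ratings])
```

Under the sum-to-zero constraint the example yields r ≈ (1.167, −1.333, 0.167), so player 1 is rated strongest and player 2 weakest.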

Prediction Models: Massey Rating (How to Include Score Difference)
⚫ Instead of raw game counts, we use a transformed game win ratio:
$$ y_k = \log\frac{s_{ij}}{1 - s_{ij}}, \qquad s_{ij} = \frac{s_i + 1}{s_i + s_j + 2} $$
• $s_i, s_j$: games won by players $i$ and $j$
⚫ Captures the margin of victory as a continuous measure
⚫ Additive smoothing ($s_i \to s_i + 1$) prevents undefined log values
• Example: player $i$ wins 3 games, player $j$ wins 0 games
$$ s_{ij} = \frac{3 + 1}{3 + 0 + 2} = \frac{4}{5} = 0.8, \qquad y_k = \log\frac{0.8}{1 - 0.8} = \log 4 \approx 1.386 $$
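The transformation can be written as a small function to check the worked example. The name `transformed_margin` is an illustrative choice; the natural logarithm is assumed, which matches log 4 ≈ 1.386.

```python
import math

def transformed_margin(s_i, s_j):
    """Log-odds of the smoothed game win ratio from player i's point of view."""
    s_ij = (s_i + 1) / (s_i + s_j + 2)   # additive smoothing avoids log(0)
    return math.log(s_ij / (1 - s_ij))

print(round(transformed_margin(3, 0), 3))   # 1.386, the 3-0 example
print(round(transformed_margin(0, 3), 3))   # -1.386, same match seen from player j
print(transformed_margin(2, 2))             # 0.0 for an even game count
```

The value is antisymmetric in the two players, so it can take the place of the score difference $s_k$ in the Massey equations.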

Ranking vs. Rating: Rank Correlation
⚫ Massey ratings are computed on the same dates as the weekly ranking updates
⚫ The Spearman rank correlation between the ITTF ranking order and the Massey rating order is computed
⚫ High correlation overall, but some discrepancies exist
⚫ → Do they perform equally well in match outcome prediction?
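For reference, the Spearman rank correlation is the Pearson correlation of the two rank vectors. The sketch below implements it in plain Python with average ranks for ties; the five players and their values are hypothetical and are not taken from the deck.

```python
def average_ranks(values):
    """Rank values from 1 (largest) downward, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda k: -values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for k in order[pos:end + 1]:
            ranks[k] = (pos + end) / 2 + 1
        pos = end + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical ranking points and Massey ratings for five players.
points = [5500, 4200, 3900, 2500, 1800]
ratings = [1.9, 1.2, 1.5, 0.4, 0.6]
print(round(spearman(points, ratings), 2))   # 0.8: two adjacent pairs are swapped
```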

Prediction Accuracy and McNemar's Test
⚫ Target: 4,536 matches in 2025
⚫ Prediction rule: the player with more ranking points (or the higher rating) is predicted to win
⚫ Rating-based (proposed) accuracy: 0.7086
⚫ Ranking-based accuracy: 0.6693
⚫ McNemar's test shows that the rating-based model is significantly more accurate

                    Rating correct   Rating incorrect   Total
Ranking correct     2589             447                3036 (0.670)
Ranking incorrect   625              875                1500
Total               3214 (0.709)     1322               4536

McNemar's test: $p = 5.43 \times 10^{-8}$
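The accuracies and the test statistic can be recomputed from the four cell counts. The sketch below uses the chi-square form of McNemar's test without continuity correction; that choice is an assumption, although it appears consistent with the reported p-value. The function name `mcnemar` is illustrative.

```python
import math

# Cell counts from the contingency table.
both_correct, ranking_only, rating_only, both_wrong = 2589, 447, 625, 875
n = both_correct + ranking_only + rating_only + both_wrong

rating_accuracy = (both_correct + rating_only) / n
ranking_accuracy = (both_correct + ranking_only) / n

def mcnemar(b, c):
    """Chi-square McNemar test on the discordant counts b and c (no correction)."""
    chi2 = (b - c) ** 2 / (b + c)
    # For 1 degree of freedom the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

chi2, p_value = mcnemar(ranking_only, rating_only)
print(f"accuracy: rating {rating_accuracy:.4f}, ranking {ranking_accuracy:.4f}")
print(f"chi2 = {chi2:.2f}, p = {p_value:.2e}")
```

Only the 447 + 625 discordant matches enter the test; the matches on which both models agree carry no information about which model is better.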

Limitations of Prediction Accuracy
⚫ Accuracy and McNemar's test confirm that the rating-based model is significantly more accurate than the official ranking
⚫ However, accuracy only evaluates whether the winner was correctly predicted
⚫ It does not assess the quality of the predicted probabilities
→ Construct probabilistic prediction models for both approaches and compare the predicted probabilities using Log Loss and ECE

Prediction Models: 3-PLM (Ranking-based)
$$ p_{i,j} = c + (1 - 2c)\,\frac{r_i^{\alpha}}{r_i^{\alpha} + r_j^{\alpha}} $$
⚫ $r_i, r_j$: WTT ranking points of players $i$ and $j$
⚫ $\alpha$: sensitivity to ranking point differences
⚫ $c$: lower bound on win probability. Even a lower-ranked player retains a non-negligible win probability
⚫ Estimated parameters: $\hat{\alpha} = 1.056$, $\hat{c} = 0.201$
→ Even a player with very few ranking points retains about a 20% chance of winning
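A direct transcription of the ranking-based 3-PLM, using the estimated parameters from this slide as defaults, shows how the lower bound $c$ caps the predictions. The function name `win_prob_ranking` and the example point totals are illustrative assumptions.

```python
def win_prob_ranking(r_i, r_j, alpha=1.056, c=0.201):
    """3-PLM win probability of player i from ranking points (both must be > 0)."""
    share = r_i ** alpha / (r_i ** alpha + r_j ** alpha)
    return c + (1 - 2 * c) * share

print(round(win_prob_ranking(5500, 5500), 3))   # 0.5 for equal ranking points
print(round(win_prob_ranking(5500, 50), 3))     # close to, but below, 1 - c
print(round(win_prob_ranking(50, 5500), 3))     # close to, but above, c
```

Because the output is confined to the interval [c, 1 − c], the ranking-based model never predicts a win probability above about 0.8, however large the gap in points.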

Prediction Models: 3-PLM (Rating-based, Proposed)
$$ p_{i,j} = c + (1 - 2c)\,\frac{\exp(\alpha r_i)}{\exp(\alpha r_i) + \exp(\alpha r_j)} $$
⚫ $r_i, r_j$: Massey ratings of players $i$ and $j$
⚫ $\alpha$: sensitivity to rating differences
⚫ $c$: lower bound on win probability
⚫ Estimated parameters: $\hat{\alpha} = 2.082$, $\hat{c} = 0.054$ (< 0.201 for the ranking-based model)
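The rating-based 3-PLM depends only on the rating difference, because the ratio of exponentials equals a logistic function of $r_i - r_j$. The sketch below uses that equivalent form with the estimated parameters as defaults; the function name `win_prob_rating` and the example ratings are illustrative assumptions.

```python
import math

def win_prob_rating(r_i, r_j, alpha=2.082, c=0.054):
    """3-PLM win probability of player i from Massey ratings."""
    # exp(a*r_i) / (exp(a*r_i) + exp(a*r_j)) rewritten as a logistic function.
    share = 1 / (1 + math.exp(-alpha * (r_i - r_j)))
    return c + (1 - 2 * c) * share

print(round(win_prob_rating(0.3, 0.3), 3))   # 0.5 for equal ratings
print(round(win_prob_rating(1.0, 0.0), 3))   # favourite with a one-point rating lead
print(round(win_prob_rating(3.0, 0.0), 3))   # approaches the cap 1 - c = 0.946
```

With the smaller estimate of $c$, this model can assign probabilities up to about 0.95, which leaves room to express strong favourites.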

Evaluation Metrics
Log Loss
⚫ Quality of predicted probabilities
⚫ Smaller → predicted probabilities closer to true outcomes
⚫ Penalizes overconfident wrong predictions
ECE (Expected Calibration Error)
⚫ Reliability of predicted probabilities
⚫ Smaller → predicted probability matches empirical win rate
⚫ Example: predicted 0.8, actual win rate 80% → good calibration
Both metrics: smaller is better

Results: Training (Ranking-point-based)
⚫ Trained on 2022–2024 data
⚫ Horizontal axis: log ranking point ratio; vertical axis: win rate
⚫ Lower histogram: number of matches per bin
⚫ The empirical win rate is unstable at extreme ranking point ratios because of small sample sizes

Results: Training (Ranking-point-based)
⚫ The 3-PLM captures the empirical win rate closely
⚫ → Apply the trained model to 2025 test data

Results: Training (Rating-based, Proposed)
⚫ Trained on 2023–2024 data
⚫ Horizontal axis: rating difference; vertical axis: win rate
⚫ Lower histogram: number of matches per bin
⚫ The 3-PLM captures the empirical win rate closely
⚫ Compared to the ranking-based model, the empirical win rate is more stable
⚫ → Apply the trained model to 2025 test data

Results: Predicted vs. Actual Win Rate
[Figure: predicted vs. actual win rate; panels: Ranking-based, Rating-based (proposed)]
⚫ Ranking-based: stronger players are underestimated
⚫ Rating-based (proposed): accurate estimation

Results: Evaluation Summary
⚫ The rating-based (proposed) model achieves lower Log Loss on test data → more accurate probability predictions
⚫ The rating-based model achieves lower ECE on test data → better calibration

            Train                Test (2025)
Method      LogLoss    ECE       LogLoss    ECE
Ranking     0.9165     0.0026    0.8871     0.0192
Rating      0.8201     0.0026    0.8341     0.0155

The proposed rating-based model outperformed the ranking-based model

Results: Discussion
⚫ The rating-based model achieved higher accuracy and smaller ECE → more stable win rate estimation
⚫ Ranking points are affected by year-to-year changes in tournament composition (see table)
⚫ Massey ratings are updated from match results only → less affected by tournament structure changes

Number of events per year:
                 '21   '22   '23   '24   '25 (9/25)
Grand Smash      0     1     1     3     4
Champions        0     2     3     5     4
Star Contender   2     2     4     4     4
Contender        5     8     11    10    7
Feeder           1     11    15    19    15

Summary
⚫ Constructed match outcome prediction models using WTT ranking points and Massey ratings
⚫ Used the 3-PLM as the prediction model
⚫ Evaluated with accuracy, Log Loss, and ECE
Findings
⚫ Accuracy and McNemar's test: the rating-based model is significantly more accurate than the official ranking
⚫ Log Loss and ECE: the rating-based model shows better and more stable probability estimation
⚫ Ranking points are sensitive to changes in tournament structure; Massey ratings are less affected by such changes
Massey rating: more accurate and stable

References
A) https://jp.reuters.com/life/sports/5UE3WRUQOJL7LKY5FISW5C2O4M-2024-12-24/
B) https://www.sankei.com/article/20240221-WIKN55PIBNJ5XEKXTID2SZIG3Y/photo/X2MKVX4MN5MC7NWUJ4PX2X2KXQ/

Evaluation Metric: Log Loss
• We use Log Loss to evaluate the performance of probabilistic predictions
• Let $y_i \in \{0, 1\}$ be the true outcome of match $i$ (win = 1, loss = 0), and let $p_i \in [0, 1]$ be the predicted probability of winning
$$ \mathrm{LogLoss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log_2 p_i + (1 - y_i) \log_2 (1 - p_i) \right] $$
• A smaller Log Loss indicates that the predicted probabilities are closer to the actual outcomes
• It imposes a large penalty on predictions that assign a low probability to the true outcome
• Lower values indicate better probabilistic predictions
Log Loss evaluates not only whether the predicted winner is correct, but also how well the predicted probability reflects the actual outcome
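The Log Loss definition translates into a few lines of Python. The sketch keeps the base-2 logarithm of the formula, so the result is measured in bits; the clipping constant `eps` is an added safeguard against log(0) and is not part of the slide.

```python
import math

def log_loss(y_true, p_pred, eps=1e-12):
    """Mean binary log loss in bits (base-2 logarithm)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)   # added safeguard, not in the original formula
        total += y * math.log2(p) + (1 - y) * math.log2(1 - p)
    return -total / len(y_true)

print(log_loss([1, 0], [0.5, 0.5]))             # 1.0: a coin-flip forecast costs one bit
print(round(log_loss([1, 0], [0.9, 0.2]), 3))   # confident and correct: well below 1
print(round(log_loss([1, 0], [0.1, 0.8]), 3))   # confident and wrong: well above 1
```

With base 2, a value of 1.0 corresponds to always predicting 0.5, which gives a reference point for reading the Log Loss values reported in the results.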

Evaluation Metric: Expected Calibration Error (ECE)
• Expected Calibration Error (ECE) evaluates how well the predicted probabilities match the observed win rates (calibration)
• Lower ECE indicates better calibration
• A well-calibrated model satisfies: predicted probability = 0.8 → observed win rate ≈ 80%
• Predicted probability = 0.8 → observed win rate = 60% → overconfident prediction
• Divide the prediction probability range [0, 1] into $M$ bins
• Let $B_m$ denote the set of matches in bin $m$, and let $n_m$ be the number of matches in that bin
$$ \mathrm{conf}(m) = \frac{1}{n_m} \sum_{i \in B_m} p_i, \qquad \mathrm{acc}(m) = \frac{1}{n_m} \sum_{i \in B_m} y_i $$
$$ \mathrm{ECE} = \sum_{m=1}^{M} \frac{n_m}{N} \left| \mathrm{acc}(m) - \mathrm{conf}(m) \right| $$
• $\mathrm{conf}(m)$: average predicted win probability in bin $m$
• $\mathrm{acc}(m)$: observed win rate in bin $m$
In this study, the prediction probabilities were divided into 10 bins to calculate ECE
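A minimal ECE sketch with 10 bins follows. Equal-width bins that are closed on the left, with p = 1.0 placed in the last bin, are an assumed convention; the slide states only the number of bins. The function name is illustrative.

```python
def expected_calibration_error(y_true, p_pred, n_bins=10):
    """ECE with equal-width bins over [0, 1] (binning convention is an assumption)."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, p_pred):
        m = min(int(p * n_bins), n_bins - 1)   # p = 1.0 goes into the last bin
        bins[m].append((y, p))
    n = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue   # empty bins contribute nothing
        acc = sum(y for y, _ in members) / len(members)
        conf = sum(p for _, p in members) / len(members)
        ece += len(members) / n * abs(acc - conf)
    return ece

# The overconfident case from the slide: predicted 0.8, observed win rate 60%.
outcomes = [1, 1, 1, 0, 0]
predictions = [0.8] * 5
print(round(expected_calibration_error(outcomes, predictions), 3))   # 0.2
```

A perfectly calibrated bin contributes zero, so the metric is driven by the bins where the average prediction and the observed win rate disagree, weighted by how many matches fall in each.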
