Automatic Data Repair without Format Specifications

The University of Sydney

Zijian Luo, University of Sydney
Lukas Kirschner, Saarland University
Ezekiel Soremekun, Singapore University of Technology and Design
Rahul Gopinath, University of Sydney

Background

Data loss because of …
– Human mistakes
– Software bugs
– Hardware failures

Corruption in input files poses a significant threat to software reliability.

File truncation due to an unreliable channel

Human mistakes

08/08/2025; 08-08-2025; 8.8.2025; 2025,8,8; 8,8,2025
12:30 P.M.; 12:30 pm; 12.30

Software error due to inconsistent parsers

{ "example": "\uD800\uZZZZ" }

The problem: inconsistent implementations

– Dependence on formal specifications
– Lack of standardization
– Parser variability

26.8K implementations of JSON parsers on GitHub

Automated data repair

Given a corrupted input file and a black-box interpreter, the objective is to generate a new file that is accepted by the interpreter, while minimizing the difference between the new and original file.

[Figure: black-box parser]
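One way to write this objective down, as a sketch only: the symbols below are notation assumed here rather than taken from the slides. $x$ is the corrupted input, $P$ the black-box parser, $\Sigma$ the input alphabet, and $d$ a difference measure such as edit distance.

```latex
\[
  x^{*} \in \operatorname*{arg\,min}_{\substack{y \in \Sigma^{*} \\ P(y) = \mathrm{accept}}} d(x, y)
\]
```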

Automatic data repair

– Aho and Peterson, SIAM J. Comput. 1972
– Diekmann and Tratt, ECOOP 2020
– Parr and Fisher, SIGPLAN Notices 2011

Related work: DDMax

DDMax in ICSE, 2020.

Limitations of DDMax

1. Deletion-Only Repairs
2. Need for a Valid Empty Baseline and Waypoints
3. Poor Handling of Multi-Character or Multi-Fault Errors

Invalid JSON string: “{

Leveraging failure feedback for better repair

Input rejected:
$ echo -n '{"name" : "Dave" "age":42}' | jq .
parse error: Expected separator at line 1, column 21

Input incomplete:
$ echo -n '{"name": "Dave" ' | jq .
parse error: Unfinished JSON term at EOF at line 1, column 16

Input accepted:
$ echo -n '{"name" : "Dave", "age":42}' | jq .
{
  "name": "Dave",
  "age": 42
}

Input rejected:
$ echo -n '{"name" : "Dave" "age": ' | jq .
parse error: Expected separator at line 1, column 21
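The three outcomes above can be reproduced with any parser that reports where it failed. The following is a minimal sketch, assuming Python's standard `json` module as a stand-in for the black-box parser (the slides use `jq`, whose messages and column numbers differ). The function name `classify` and the end-of-input heuristic are assumptions made for illustration, not the authors' implementation.

```python
import json


def classify(text: str) -> str:
    """Return 'accept', 'incomplete', or 'reject' for a JSON candidate.

    Assumption: a failure reported at the end of the input, or an
    unterminated string, means the parser wanted more characters.
    This heuristic misjudges some prefixes (for example a cut-off
    literal such as 'tr'), so it is only a rough oracle.
    """
    try:
        json.loads(text)
        return "accept"
    except json.JSONDecodeError as err:
        if err.pos >= len(text) or err.msg.startswith("Unterminated string"):
            return "incomplete"
        return "reject"


for sample in (
    '{"name" : "Dave" "age":42}',   # missing separator
    '{"name": "Dave" ',             # truncated
    '{"name" : "Dave", "age":42}',  # well-formed
    '{"name" : "Dave" "age": ',     # missing separator, then truncated
):
    print(classify(sample), repr(sample))
```

With this oracle the four inputs come out as reject, incomplete, accept, and reject, matching the labels on the slide.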

ϵREPAIR

Input: {"name" : "Dave" "age":42}
Output: {"name" : "Dave" ,"age":42}

[Figure: using binary search on {"name" : "Dave" "age":42} to find the boundary between prefixes partially/fully accepted by the parser, such as {"name": "D, and prefixes rejected by the parser, such as {"name" : "Dave" "age": ]

ϵREPAIR

Binary search the boundary of ⋫ and ▷
– ⋫ means incomplete/accept
– ▷ means reject

Delete/insert characters at the boundary.
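The two steps on this slide can be sketched as follows. This is a simplified illustration under stated assumptions, not the ϵREPAIR implementation. Python's `json` module stands in for the black-box parser. `classify`, `boundary`, and `repair` are hypothetical names. The binary search assumes that once a prefix is rejected, every longer prefix is rejected too. Candidates are explored breadth-first by number of edits.

```python
import json
import string


def classify(text):
    """'accept', 'incomplete', or 'reject' (heuristic oracle, an assumption)."""
    try:
        json.loads(text)
        return "accept"
    except json.JSONDecodeError as err:
        if err.pos >= len(text) or err.msg.startswith("Unterminated string"):
            return "incomplete"
        return "reject"


def boundary(text):
    """Length of the longest prefix that the parser does not reject."""
    if classify(text) != "reject":
        return len(text)
    lo, hi = 0, len(text)  # invariant: text[:lo] not rejected, text[:hi] rejected
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if classify(text[:mid]) == "reject":
            hi = mid
        else:
            lo = mid
    return lo


def repair(text, alphabet=string.printable, max_edits=3):
    """Return (repaired_text, edits), or None if no repair is found."""
    frontier, seen = [text], {text}
    for edits in range(max_edits + 1):
        for cand in frontier:
            if classify(cand) == "accept":
                return cand, edits
        if edits == max_edits:
            break
        next_frontier = []
        for cand in frontier:
            b = boundary(cand)
            options = []
            if b < len(cand):
                # Delete the character sitting at the boundary.
                options.append(cand[:b] + cand[b + 1:])
            for ch in alphabet:
                # Insert a character only if the parser tolerates it there.
                if classify(cand[:b] + ch) != "reject":
                    options.append(cand[:b] + ch + cand[b:])
            for opt in options:
                if opt not in seen:
                    seen.add(opt)
                    next_frontier.append(opt)
        frontier = next_frontier
    return None


print(repair('{"name" : "Dave" "age":42}'))
```

For the running example the boundary falls after `{"name" : "Dave" ` (17 characters), and a single inserted comma yields an accepted record. The search space grows quickly when many characters are viable at the boundary (for instance inside a string), so `max_edits` is kept small in this sketch.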

Examples

{"name" : "Dave" "age":42}
{"name" : "Dave" |"age":42}
{"name" : "Dave" ,"age":42}   accepted (edit distance=1)

{"name" : "Dave" "age":42a}
{"name" : "Dave" |"age":42a}
{"name" : "Dave" ,"age":42|a}
{"name" : "Dave" ,"age":42}   accepted (edit distance=2)
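The edit distances quoted in these examples can be checked with a standard Levenshtein computation. The sketch below is a textbook dynamic-programming version. The function name `levenshtein` is chosen here for illustration, and the slides do not say which distance implementation the authors used.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


repaired = '{"name" : "Dave" ,"age":42}'
print(levenshtein('{"name" : "Dave" "age":42}', repaired))   # 1
print(levenshtein('{"name" : "Dave" "age":42a}', repaired))  # 2
```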

Evaluation of ϵREPAIR against DDMax

– RQ1: What is the quality of data repair by ϵREPAIR in comparison to its competitors?
– RQ2: How many corrupt records can be repaired by ϵREPAIR in comparison to its competitors?
– RQ3: How does ϵREPAIR compare to DDMax in performance?

Evaluation of ϵREPAIR against DDMax

RQ1: What is the quality of data repair by ϵREPAIR in comparison to its competitors?
RQ2: How many corrupt records can be repaired by ϵREPAIR in comparison to its competitors?

Parsers used in evaluation

Name    LOC    Parser Lang.   Input Format   Development
ini     511    C              INI            2009-2022
cjson   3413   C              JSON           2009-2022
sexp    978    C              SExp           2016-2016
tinyc   421    C              TinyC          2011-2018

Name    Record Len.     Single Corr.   Double Corr.   Truncated
INI     102.0 ± 20.4    1000           100            100 (29.1%)
JSON    146.6 ± 46.6    1000           100            100 (26.7%)
SExp    66.8 ± 31.2     1000           100            100 (26.8%)
TinyC   45.3 ± 20.4     1000           100            100 (24.8%)

Evaluation of ϵREPAIR against DDMax

RQ3: How does ϵREPAIR compare to DDMax in performance?

We conducted our experiments on a Mac M2 Ultra machine with 192 GB of RAM. During the experiment …

Evaluation of ϵREPAIR against DDMax

RQ1: What is the quality of data repair by ϵREPAIR in comparison to its competitors?

Overall, ϵREPAIR produced repairs that were on average 7.0 edits away from the original record, compared to 16.0 for DDMax, an improvement of 2.3×.

Subject            ϵREPAIR        DDMax           DDmaxG         ANTLR
Single     INI     1.4 σ 0.8      2.5 σ 0.3       26.2 σ 6.4     25.0 σ 5.6
           JSON    5.1 σ 19.0     26.0 σ 43.7     48.5 σ 31.5    40.4 σ 24.3
           SExp    10.3 σ 19.0    7.6 σ 14.7      36.9 σ 27.2    n.a
           TinyC   4.1 σ 6.5      9.2 σ 13.5      25.1 σ 11.7    21.8 σ 10.4
Double     INI     1.5 σ 0.9      3.0 σ 0.34      26.6 σ 6.8     24.8 σ 5.7
           JSON    7.0 σ 24.3     43.0 σ 52.1     41.0 σ 27.0    40.0 σ 27.0
           SExp    12.0 σ 19.5    10.9 σ 16.7     39.5 σ 27.3    n.a
           TinyC   6.6 σ 5.1      27.6 σ 15.4     27.3 σ 12.5    20.7 σ 10.8
Truncated  INI     1.0 σ 0.2      2.0 σ 0.7       18.7 σ 5.6     17.8 σ 4.6
           JSON    3.3 σ 17.5     74.3 σ 29.6     63.2 σ 36.1    n.a
           SExp    1.8 σ 0.4      22.2 σ 18.7     35.2 σ 22.1    n.a
           TinyC   1.9 σ 0.8      22.3 σ 9.0      28.0 σ 9.4     n.a
Average            5.1 σ 13.9     13.5 σ 27.3     35.0 σ 24.8    29.0 σ 17.4
Recovery           94% σ 0.16     83% σ 0.25      80% σ 0.23     91% σ 0.13

Subject            ϵREPAIR        DDMax           DDmaxG         ANTLR
Single     INI     2.4 σ 0.8      3.3 σ 2.7       27.3 σ 6.3     25.8 σ 5.6
           JSON    5.3 σ 19.0     26.1 σ 43.6     48.5 σ 31.5    40.4 σ 24.3
           SExp    13.3 σ 19.2    8.6 σ 14.5      37.4 σ 27.0    n.a
           TinyC   4.1 σ 6.5      9.2 σ 13.5      25.0 σ 11.8    21.4 σ 10.4
Double     INI     3.5 σ 0.9      3.8 σ 0.30      28.6 σ 6.8     25.0 σ 6.0
           JSON    7.8 σ 24.2     43.7 σ 51.6     41.0 σ 27.0    40.0 σ 27.0
           SExp    13.3 σ 19.3    12.5 σ 16.3     39.5 σ 27.4    n.a
           TinyC   6.5 σ 5.2      27.6 σ 15.4     27.0 σ 12.5    20.4 σ 10.7
Truncated  INI     28.1 σ 15.9    30.6 σ 16.2     45.5 σ 15.6    44.7 σ 15.0
           JSON    35.1 σ 23.1    119.0 σ 51.1    55.1 σ 27.1    n.a
           SExp    15.7 σ 11.4    40.6 σ 26.0     51.1 σ 27.0    n.a
           TinyC   7.8 σ 7.6      36.6 σ 13.4     39.1 σ 11.7    n.a
Average            7.0 σ 15.4     16.0 σ 29.8     37.2 σ 27.2    30.4 σ 17.7
Recovery           92% σ 0.17     82% σ 0.26      79% σ 0.24     90% σ 0.13

Evaluation of ϵREPAIR against DDMax

RQ2: How many corrupt records can be repaired by ϵREPAIR in comparison to its competitors?

ϵREPAIR was able to repair 97% of all records, which is comparable to 98% from DDMax and DDmaxG. Moreover, ϵREPAIR is the only method capable of perfectly fixing corruption.

Subject            ϵREPAIR   DDMax   DDmaxG   ANTLR
Single     INI     1000      1000    1000     884
           JSON    999       971     982      703
           SExp    966       1000    1000     0
           TinyC   1000      984     984      481
Double     INI     100       100     100      91
           JSON    98        99      98       68
           SExp    94        100     100      0
           TinyC   100       98      98       28
Truncated  INI     100       100     100      100
           JSON    82        90      100      1
           SExp    39        100     100      0
           TinyC   82        77      77       4
Total              4660      4719    4739     2355

Subject   ϵREPAIR   DDMax   DDmaxG   ANTLR
INI       0         0       0        0
JSON      25        0       0        0
SExp      7         0       0        0
TinyC     63        0       0        0

Evaluation of ϵREPAIR against DDMax

RQ3: How does ϵREPAIR compare to DDMax in performance?

Although ϵREPAIR is 40% slower than DDMax, its average runtime of 3.8 seconds per record is still practical for data repair.

           Format-free             Format-dependent
Metric     ϵREPAIR     DDMax       DDmaxG      ANTLR
Runtime    3.87 secs   2.7 secs    2.0 secs    0.3 secs

Repairing regular data-formats with fixed length

ϵREPAIR is the only technique capable of repairing corrupt records validated by regular expressions.

Formats    Total   Success rate   Repair-Distance
filepath   100     100%           0.96 (0.2)
date       100     100%           1.13 (0.7)
ipv6       100     100%           0.88 (0.4)
time       100     100%           0.90 (0.7)
url        100     100%           0.45 (0.5)
ipv4       100     100%           0.91 (0.5)
isbn       100     100%           1.67 (0.5)

Discussion

ϵREPAIR is applicable when:
• A parser provides meaningful error feedback (common in software engineering).
• The parser can be instrumented for feedback (e.g., in fuzzing).
• A formal grammar or regex is available.

ϵREPAIR’s key innovations include:
• Relaxed parser constraints, relying on parser feedback instead of requiring valid waypoints.
• Support for a wider range of repair operations.

Thanks for listening…
