Quality of Automated Program Repair on Real-World Defects
by Manish Motwani, Mauricio Soto, Yuriy Brun, René Just, Claire Le Goues
Abstract:
Automated program repair is a promising approach to reducing the costs of manual debugging and increasing software quality. However, recent studies have shown that automated program repair techniques can be prone to producing patches of low quality, overfitting to the set of tests provided to the repair technique, and failing to generalize to the intended specification. This paper rigorously explores this phenomenon on real-world Java programs, analyzing the effectiveness of four well-known repair techniques, GenProg, Par, SimFix, and TrpAutoRepair, on defects made by the projects' developers during their regular development process. We find that: (1) When applied to real-world Java code, automated program repair techniques produce patches for between 10.6% and 19.0% of the defects, which is less frequent than when applied to C code. (2) The produced patches often overfit to the provided test suite, with only between 13.8% and 46.1% of the patches passing an independent set of tests. (3) Test suite size has an extremely small but significant effect on the quality of the patches, with larger test suites producing higher-quality patches, though, surprisingly, higher-coverage test suites correlate with lower-quality patches. (4) The number of tests that a buggy program fails has a small but statistically significant positive effect on the quality of the produced patches. (5) Test suite provenance, whether the test suite is written by a human or automatically generated, has a significant effect on the quality of the patches, with developer-written tests typically producing higher-quality patches. And (6) the patches exhibit insufficient diversity to improve quality through some method of combining multiple patches. We develop JaRFly, an open-source framework for implementing techniques for automatic search-based improvement of Java programs. Our study uses JaRFly to faithfully reimplement GenProg and TrpAutoRepair to work on Java code, and makes the first public release of an implementation of Par. Unlike prior work, our study carefully controls for confounding factors and produces a methodology, as well as a dataset of automatically-generated test suites, for objectively evaluating the quality of Java repair techniques on real-world defects.
Citation:
Manish Motwani, Mauricio Soto, Yuriy Brun, René Just, and Claire Le Goues, Quality of Automated Program Repair on Real-World Defects, IEEE Transactions on Software Engineering (TSE), vol. 48, no. 2, February 2022, pp. 637–661.
Bibtex:
@article{Motwani22tse,
  author = {Manish Motwani and Mauricio Soto and Yuriy Brun and Ren{\'{e}} Just and Claire {Le Goues}},
  title = {\href{http://people.cs.umass.edu/brun/pubs/pubs/Motwani22tse.pdf}{Quality of Automated Program Repair on Real-World Defects}},
  journal = {IEEE Transactions on Software Engineering (TSE)},
  venue = {TSE},
  issn = {0098-5589},
  year = {2022},
  volume = {48},
  number = {2},
  month = {February},
  pages = {637--661},

  doi = {10.1109/TSE.2020.2998785},
  note = {\href{https://doi.org/10.1109/TSE.2020.2998785}{DOI:
  10.1109/TSE.2020.2998785}},

  abstract = {Automated program repair is a promising approach to reducing
  the costs of manual debugging and increasing software quality. However,
  recent studies have shown that automated program repair techniques can be
  prone to producing patches of low quality, overfitting to the set of tests
  provided to the repair technique, and failing to generalize to the intended
  specification. This paper rigorously explores this phenomenon on real-world
  Java programs, analyzing the effectiveness of four well-known repair
  techniques, GenProg, Par, SimFix, and TrpAutoRepair, on defects made by the
  projects' developers during their regular development process. We find
  that: (1) When applied to real-world Java code, automated program repair
  techniques produce patches for between 10.6% and 19.0% of the defects,
  which is less frequent than when applied to C code. (2) The produced
  patches often overfit to the provided test suite, with only between 13.8%
  and 46.1% of the patches passing an independent set of tests. (3) Test
  suite size has an extremely small but significant effect on the quality of
  the patches, with larger test suites producing higher-quality patches,
  though, surprisingly, higher-coverage test suites correlate with
  lower-quality patches. (4) The number of tests that a buggy program fails
  has a small but statistically significant positive effect on the quality of
  the produced patches. (5) Test suite provenance, whether the test suite is
  written by a human or automatically generated, has a significant effect on
  the quality of the patches, with developer-written tests typically
  producing higher-quality patches. And (6) the patches exhibit insufficient
  diversity to improve quality through some method of combining multiple
  patches. We develop JaRFly, an open-source framework for implementing
  techniques for automatic search-based improvement of Java programs. Our
  study uses JaRFly to faithfully reimplement GenProg and TrpAutoRepair to
  work on Java code, and makes the first public release of an implementation
  of Par. Unlike prior work, our study carefully controls for confounding
  factors and produces a methodology, as well as a dataset of
  automatically-generated test suites, for objectively evaluating the quality
  of Java repair techniques on real-world defects.},
  
  fundedBy = {NSF CCF-1453474, NSF CCF-1563797, NSF CCF-1564162, 
  high performance computing equipment obtained under a grant from the
  Collaborative R\&D Fund managed by the Massachusetts Technology Collaborative},
  
}