Towards Issue Impact Prediction: Predicting Files for Issues

Software developers typically coordinate their work by creating issues in a project management system (e.g. Jira). Issues are used to track bugs, feature requests, and other work items. They typically contain a problem description, or a description of work to be done.

When resolving issues, software developers spend a significant amount of their time on code comprehension, including determining which files must be changed to resolve the issue. Code comprehension can be even more time-consuming for developers who are new to a given project.

The goal of this project is to develop a machine learning model that predicts which files are involved in an issue, given the issue text and the current codebase. The model should rank all source files according to their relevance to the issue.
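To make the input and output of the ranking task concrete, the following is a minimal sketch that scores every file against an issue using TF-IDF weighted bag-of-words vectors and cosine similarity. It is only an illustrative baseline, not the model the project aims for. The function names `tokenize` and `rank_files`, the file paths, and the file contents are hypothetical, and the example assumes that file contents are available as plain strings.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphanumeric runs; underscores act as separators."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_files(issue_text, files):
    """Return (path, score) pairs, most relevant first.

    `files` maps a file path to its source text (an assumption of this sketch).
    """
    docs = {path: Counter(tokenize(src)) for path, src in files.items()}
    n_docs = len(docs)
    # Document frequency: in how many files each token occurs.
    df = Counter(tok for counts in docs.values() for tok in counts)

    def weigh(counts):
        # Smoothed IDF, so tokens present in every file keep a small weight.
        return {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
                for t, c in counts.items()}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    query = weigh(Counter(tokenize(issue_text)))
    scores = [(path, cosine(query, weigh(counts))) for path, counts in docs.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical toy codebase and issue.
files = {
    "src/auth/login.py": "def login(user, password): check password hash and create session token",
    "src/billing/invoice.py": "def create_invoice(order): compute total tax and render invoice pdf",
    "src/utils/dates.py": "def parse_date(text): parse iso date string",
}
ranking = rank_files("Login fails when password contains unicode characters", files)
for path, score in ranking:
    print(f"{score:.3f}  {path}")
```

In this toy example only `src/auth/login.py` shares vocabulary with the issue, so it is ranked first and the other files receive a score of zero. A real issue often uses different words than the code it touches, which is one motivation for the learned approaches.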

This project can accommodate multiple students (including groups), each exploring a different approach to the problem:

Pure-NLP models: Use NLP techniques such as bag-of-words representations, word embeddings, and transformer models to predict the files involved in a given issue.
Multi-modal models: Use models that combine information from different modalities, such as issue text and historical changes, to predict the files involved in a given issue. This approach is more technical from a machine learning perspective. The idea is to map issue text and graph-based embeddings of historical information into a common embedding space, similar to CLIP [1].
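The common embedding space in the multi-modal approach can be illustrated with a small sketch of a CLIP-style symmetric contrastive loss. It assumes that two encoders, not shown here, have already produced one vector per issue and one vector per matching historical-change representation, with row i of each list forming a true pair. The function names, the fixed temperature of 0.07 (CLIP learns this value during training), and the toy vectors are assumptions made for illustration.

```python
import math


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over one list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def symmetric_contrastive_loss(issue_emb, change_emb, temperature=0.07):
    """CLIP-style loss: issue_emb[i] and change_emb[i] are a matching pair."""
    issues = [normalize(v) for v in issue_emb]
    changes = [normalize(v) for v in change_emb]
    n = len(issues)
    # Pairwise cosine similarities, scaled by the temperature.
    logits = [[sum(a * b for a, b in zip(i, c)) / temperature for c in changes]
              for i in issues]
    # Cross-entropy with the diagonal (true pair) as target, in both directions.
    issue_to_change = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    columns = [list(col) for col in zip(*logits)]
    change_to_issue = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return (issue_to_change + change_to_issue) / 2


# Hypothetical 2-dimensional embeddings for two issue/change pairs.
issue_vectors = [[1.0, 0.0], [0.0, 1.0]]
aligned_changes = [[2.0, 0.0], [0.0, 3.0]]     # each pair points the same way
misaligned_changes = [[0.0, 3.0], [2.0, 0.0]]  # pairs are swapped

aligned_loss = symmetric_contrastive_loss(issue_vectors, aligned_changes)
misaligned_loss = symmetric_contrastive_loss(issue_vectors, misaligned_changes)
print(aligned_loss, misaligned_loss)
```

The loss is close to zero when each issue embedding points in the same direction as its paired change embedding, and large when the pairs are swapped. Minimizing it during training would pull matching pairs together in the shared space, so that at prediction time files could be ranked by their similarity to the embedded issue text.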

References

[1] Radford, Alec, et al. “Learning transferable visual models from natural language supervision.” International Conference on Machine Learning. PMLR, 2021.
