Speed at the Cost of Quality: How Cursor AI Increases Short-Term Velocity and Long-Term Complexity in Open-Source Projects
This repository contains the replication package for the following paper:
Hao He, Courtney Miller, Shyam Agarwal, Christian Kästner, and Bogdan Vasilescu. 2026. Speed at the Cost of Quality: How Cursor AI Increases Short-Term Velocity and Long-Term Complexity in Open-Source Projects. In 23rd International Conference on Mining Software Repositories (MSR ’26), April 13–14, 2026, Rio de Janeiro, Brazil. ACM, New York, NY, USA, 19 pages. https://doi.org/10.1145/3793302.3793349
The additional .gitignored files are available on Zenodo: 10.5281/zenodo.18368661
- The `scripts/` folder contains all the scripts we used to collect data, run propensity score matching, and collect SonarQube metrics. These scripts are volatile in nature: rerunning them will likely not reproduce the exact dataset used in our paper (GitHub repositories grow and get taken down, the GitHub API changes, SonarQube changes, etc.), but they may still be a valuable reference.
- The `data/` folder contains the exact dataset we used for our paper. Some data files are .gitignored in the GitHub repository but available on Zenodo. A minimal loading sketch follows this list.
  - `repos.csv`: Metadata for treatment group repositories (Cursor-adopting projects)
  - `cursor_commits.csv`: Commit-level data on commits modifying Cursor configuration files
  - `cursor_files.csv`: File-level data for Cursor configuration files
  - `repo_events.csv` / `repo_events_control.csv`: GitHub event data for treatment/control repositories
  - `matching.csv`: Propensity score matching results (treatment-control pairs)
  - `panel_event_monthly.csv`: Main panel dataset used for the DiD analysis (monthly aggregated)
  - `ts_repos_monthly.csv` / `ts_repos_control_monthly.csv`: Monthly time series for treatment/control groups
  - `repo_metrics.csv`: Additional repository-level metrics (stars, contributors, etc.)
  - `sonarqube_warnings.csv`: SonarQube static analysis warnings per repository (as available)
  - `sonarqube_warning_definitions.csv`: Definitions and categories of SonarQube warning types
  - `control_repo_candidates_*.csv`: Monthly snapshots of candidate control repositories
- The `notebooks/` folder contains R notebooks that read from `data/` and reproduce the results in the paper.
  - `DataCollection.md`: Dataset description
  - `PropensityScoreMatching.md`: Matching results
  - `DiffInDiffBorusyak.md`: Main DiD results
  - `DynamicPanel.md`: Main panel GMM results
  - `DiffInDiffAll.md`: Comparing alternative DiD estimators
  - `AnalyzeSonarQubeWarnings.md`: Appendix analysis on SonarQube warnings
  - `NonCausalMethods.md`: Some interesting results replicating the same RQ with only descriptive and correlational methods
  - `DiffInDiffCallaway.md`, `DiffInDiffPosterFigures.md`, `DiffInDiffTWFE.md`: Older notebooks that have been deprecated
- The `plots/` folder contains all the plots in the paper (.gitignored in the GitHub repository but available on Zenodo).
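To get started with the dataset, the main panel file can be loaded directly in R. The snippet below is a minimal sketch using `data.table`; it assumes the repository root as the working directory and only references file names listed above (inspect the columns yourself before reusing specific variables).

```r
# Minimal sketch: load the main panel dataset and the matching results.
# Assumes the working directory is the repository root and that the
# .gitignored files from Zenodo have been placed in data/.
library(data.table)

stopifnot(file.exists("data/panel_event_monthly.csv"),
          file.exists("data/matching.csv"))

panel    <- fread("data/panel_event_monthly.csv")  # monthly panel used for DiD
matching <- fread("data/matching.csv")             # treatment-control pairs

str(panel)     # inspect the available columns before analysis
str(matching)
```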
All results were obtained using R 4.3.3 for statistical analysis and visualization (DiD estimation, panel models, plotting) and Python 3.11.4 for the data collection scripts (GitHub API, BigQuery, SonarQube). For replication purposes, we list the packages and their exact versions below; a small version-check sketch follows the tables. Because the R ecosystem is constantly evolving, different R and package versions may lead to slightly different results than those reported in the paper.
R packages (R 4.3.3):

| Package | Version | Description |
|---|---|---|
| tidyverse | 2.0.0 | Data manipulation and visualization |
| ggplot2 | 3.5.1 | Grammar of graphics plotting |
| data.table | 1.15.4 | Fast data manipulation |
| dplyr | 1.1.4 | Data manipulation verbs |
| tidyr | 1.3.1 | Data tidying |
| tibble | 3.2.1 | Modern data frames |
| lubridate | 1.9.3 | Date/time manipulation |
| scales | 1.3.0 | Scale functions for visualization |
| fixest | 0.12.1 | Fixed effects estimation |
| did | 2.1.2 | Callaway & Sant'Anna DiD estimator |
| didimputation | 0.3.0 | Borusyak et al. imputation estimator |
| plm | 2.6-4 | Panel data models (GMM) |
| bacondecomp | 0.1.1 | Bacon decomposition for TWFE |
| modelsummary | 2.1.1 | Model summary tables |
| knitr | 1.48 | Dynamic report generation |
| kableExtra | 1.4.0 | Enhanced table formatting |
| gridExtra | 2.3 | Arrange multiple plots |
| cowplot | 1.1.3 | Plot composition |
| corrplot | 0.94 | Correlation visualization |
| RColorBrewer | 1.1-3 | Color palettes |
| Cairo | 1.6-2 | High-quality graphics device |
| showtext | 0.9-7 | Custom fonts in plots |
| ggfx | 1.0.1 | Graphics effects for ggplot2 |
Python packages (Python 3.11.4):

| Package | Version | Description |
|---|---|---|
| pandas | 2.2.0 | Data manipulation |
| numpy | 1.26.4 | Numerical computing |
| requests | 2.31.0 | HTTP requests |
| python-dotenv | 1.0.1 | Environment variable management |
| GitPython | 3.1.43 | Git repository interaction |
| PyGithub | 2.3.0 | GitHub API client |
| google-cloud-bigquery | 3.25.0 | BigQuery client for GH Archive |
| scikit-learn | 1.5.0 | Machine learning (propensity scores) |
| semver | 3.0.2 | Semantic versioning |
| node-semver | 0.9.0 | Node.js semver parsing |
| gql | 3.5.0 | GraphQL client |
| aiohttp | 3.9.5 | Async HTTP for GraphQL |
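If you want to confirm that your local R setup matches the versions above before knitting, a quick check like the following can help. This is a minimal sketch covering only the estimation-critical packages; the version strings are copied from the table, and the packages are assumed to be already installed.

```r
# Minimal sketch: compare installed versions of key estimation packages
# against the versions used in the paper (exact matches are not strictly
# required, but large deviations may change results slightly).
required <- c(
  fixest = "0.12.1", did = "2.1.2", didimputation = "0.3.0",
  plm = "2.6-4", data.table = "1.15.4", tidyverse = "2.0.0"
)
for (pkg in names(required)) {
  installed <- as.character(utils::packageVersion(pkg))
  if (installed != required[[pkg]]) {
    message(sprintf("%s: installed %s, paper used %s",
                    pkg, installed, required[[pkg]]))
  }
}
```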
Clone this repository and download the full dataset from Zenodo:
```bash
git clone https://github.com/hehao98/CursorStudy.git
cd CursorStudy
# (Optional) Download data files from Zenodo and place them in the data/ folder
```

Install R 4.3.3 and the required packages:
```r
install.packages(c(
  "tidyverse", "ggplot2", "data.table", "dplyr", "tidyr", "tibble",
  "lubridate", "scales", "fixest", "did", "didimputation", "plm",
  "bacondecomp", "modelsummary", "knitr", "kableExtra", "gridExtra",
  "cowplot", "corrplot", "RColorBrewer", "Cairo", "showtext", "ggfx"
))
```

Knit the notebooks in the following order (using RStudio or the R extension for VS Code):
1. `notebooks/DataCollection.Rmd`: Overview of the dataset (Table 1, Figure 2)
2. `notebooks/PropensityScoreMatching.Rmd`: Matching diagnostics (Appendix)
3. `notebooks/DiffInDiffBorusyak.Rmd`: Main DiD results (Table 2, Figure 3, Figure 4)
4. `notebooks/DynamicPanel.Rmd`: Panel GMM results (Table 3)
5. `notebooks/DiffInDiffAll.Rmd`: Comparing DiD estimators (Appendix)
6. `notebooks/AnalyzeSonarQubeWarnings.Rmd`: SonarQube analysis (Appendix)
Each notebook reads from data/ and outputs tables/figures to plots/.
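If you prefer to knit from an R console instead of RStudio, the loop below is one possible way to do it. This is a minimal sketch: it assumes you run it from the repository root and that the notebooks resolve `data/` relative to that root; if a notebook sets its own working directory, knit it interactively instead.

```r
# Minimal sketch: knit the notebooks in the order listed above, writing the
# resulting .md files next to their .Rmd sources. Assumes the working
# directory is the repository root.
library(knitr)

notebooks <- c(
  "notebooks/DataCollection.Rmd",
  "notebooks/PropensityScoreMatching.Rmd",
  "notebooks/DiffInDiffBorusyak.Rmd",
  "notebooks/DynamicPanel.Rmd",
  "notebooks/DiffInDiffAll.Rmd",
  "notebooks/AnalyzeSonarQubeWarnings.Rmd"
)

for (nb in notebooks) {
  knitr::knit(nb, output = sub("\\.Rmd$", ".md", nb))
}
```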
To rerun the data collection scripts (not recommended for exact replication):
```bash
# Set up the Python environment
python -m venv venv
source venv/bin/activate
pip install pandas numpy requests python-dotenv GitPython PyGithub google-cloud-bigquery scikit-learn semver node-semver gql aiohttp

# Configure API keys in a .env file
cp .env.example .env
# Edit .env with your GitHub token and BigQuery credentials

# Run scripts (see scripts/ for individual script usage)
```

Note: Data collection scripts interact with live APIs and will produce different results due to repository changes, API updates, or rate limiting.