Replication Package: The Structure of Cross-National Collaboration in Open-Source Software Development
This repository contains the replication package for the following paper:
Henry Xu, Katy Yu, Hao He, Hongbo Fang, Bogdan Vasilescu, and Patrick S. Park. 2025. The Structure of Cross-National Collaboration in Open-Source Software Development. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management (CIKM ’25), November 10–14, 2025, Seoul, Republic of Korea. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3746252.3761237
This replication package works with Python 3.10+, Gephi 0.10+, and a recent release of R. The required Python dependencies are listed in requirements.txt.
- data/economy_collaborators.csv: GitHub collaboration graph data from the GitHub Innovation Graph.
- data/ctry_civ_labels.csv: Country civilization labels as categorized by Huntington (1996).
- data/gdp_per_capita.csv: Country-level GDP per capita data collected from the World Bank.
This section documents the disparity filter.

To generate the alpha-selection plots:

python edge_filtering/disparity_filter_alpha_plots.py --inputFilePath data/economy_collaborators.csv --year 2023 --quarter 1 --normalize outgoing

Flags:
- --inputFilePath data/economy_collaborators.csv: the original GitHub Innovation Graph data
- --year 2023: the year to analyze
- --quarter 1: the quarter to analyze
- --normalize outgoing: normalize each edge weight by the sender's total outgoing weight (see the sketch below)
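For reference, here is a minimal sketch of the disparity-filter computation (Serrano, Boguñá & Vespignani, 2009), assuming a pandas edge list with illustrative column names source, target, and weight; it is not the script's exact code:

```python
import pandas as pd

def disparity_alpha(edges: pd.DataFrame) -> pd.DataFrame:
    """Attach a disparity-filter alpha to each edge of a directed, weighted edge list.

    Assumes illustrative columns 'source', 'target', 'weight'.
    """
    # Outgoing normalization: each weight divided by the sender's total outgoing weight.
    p = edges["weight"] / edges.groupby("source")["weight"].transform("sum")
    # Out-degree of each sender (number of distinct outgoing edges).
    k = edges.groupby("source")["weight"].transform("count")
    # Serrano et al. (2009): alpha_ij = (1 - p_ij)^(k_i - 1); smaller = more significant.
    return edges.assign(alpha=(1.0 - p) ** (k - 1))

# Keep only edges significant at a chosen threshold, e.g.:
# filtered = disparity_alpha(edges).query("alpha < 0.09")
```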
Key metrics:
1. Cumulative degree distribution
2. Distribution of link weights
3. Clustering coefficient, percent of total weight, and percent of total nodes
4. Largest connected component ratio (after vs. before alpha filtering)
For metrics 3 and 4, the purple vertical line marks the chosen alpha value, selected from the largest connected component ratio: we pick the alpha just before the biggest drop. We use the largest weakly connected component for this measurement.
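For reference, a sketch of the largest weakly connected component ratio using networkx (assumed edge-list inputs, not the script's exact code):

```python
import networkx as nx

def lcc_ratio(edges_before, edges_after) -> float:
    """Ratio of largest weakly connected component sizes, filtered vs. unfiltered.

    Each argument is an iterable of (source, target) pairs.
    """
    def lwcc_size(edge_list):
        g = nx.DiGraph()
        g.add_edges_from(edge_list)
        return max(len(c) for c in nx.weakly_connected_components(g))

    return lwcc_size(edges_after) / lwcc_size(edges_before)
```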
To generate data for visualization:
python edge_filtering/disparity_filter.py --inputFilePath data/economy_collaborators.csv --outputFilePath data/filtered/economy_collaborators_outgoing --normalize outgoing --excludeCountries EU --optimalAlpha 0.09

Flags:
- --inputFilePath data/economy_collaborators.csv: use the original GitHub Innovation Graph data
- --outputFilePath: prefix for the output filenames. We used filtered_graph_test_combine_all_exclude_EU_normalize_sender to reflect what we did: sum the edge weights across all years and quarters into one entry per edge, exclude the EU, and normalize by sender total weight.
- --normalize outgoing: normalization mode for the edge weights. The options are outgoing, incoming, log, or none (default). We normalized by outgoing, i.e., by the sender country's total weight across all of its edges.
- --excludeCountries: country codes to exclude from the data. We excluded EU.
- --optimalAlpha: a list of alpha values to filter at; the script generates one output CSV per alpha value. We used 0.09, 0.12, 0.15, and 1 (no filtering).
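Continuing the earlier sketch, a hypothetical driver loop that mirrors the --optimalAlpha behavior (output paths are illustrative, not the script's actual naming):

```python
import pandas as pd

# `disparity_alpha` is the helper from the sketch above; column names are assumed.
edges = disparity_alpha(pd.read_csv("data/economy_collaborators.csv"))

for alpha in [0.09, 0.12, 0.15, 1]:
    # alpha = 1 keeps every edge, i.e., no filtering.
    edges[edges["alpha"] <= alpha].to_csv(
        f"data/filtered/economy_collaborators_outgoing_alpha{alpha}.csv", index=False
    )
```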
The following two Python files replicate Figure 1 and Figure 2.
python blockmodeling.py
python reciprocity.py

The ERGM results were obtained from the following R file.
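As background for reciprocity.py: the unweighted edge reciprocity of the directed collaboration graph can be computed with networkx. This is only a sketch with assumed column names, and the actual script may measure a weighted variant:

```python
import networkx as nx
import pandas as pd

# Assumed column names; adjust to the actual CSV schema.
edges = pd.read_csv("data/economy_collaborators.csv")
g = nx.from_pandas_edgelist(edges, source="source", target="target",
                            edge_attr="weight", create_using=nx.DiGraph)

# Fraction of directed edges whose reverse edge also exists.
print(nx.reciprocity(g))
```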
The Node2Vec results can be obtained by following these 3 steps:
- Embedding generation:
Download all files in github-innovation-graph/node2vec/, including the edge list and the two Python files:
- graph/economy_collaborators_no_US_with_weights.edgelist
- src/node2vec.py
- src/main.py
Note: run the following commands from inside the node2vec folder; otherwise, update the paths accordingly.
Install the node2vec requirements from node2vec/node2vec_requirements.txt:

pip install -r node2vec_requirements.txt

Then run:
python node2vec_no_us.py

This generates all 25 embedding files in the node2vec folder (emb/no_US_experiments/).
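Assuming the reference node2vec implementation's word2vec text output, an individual embedding file can be inspected with gensim. The filename and node key below are illustrative, not the actual ones:

```python
from gensim.models import KeyedVectors

# Illustrative filename; the 25 files vary by p and q.
emb = KeyedVectors.load_word2vec_format(
    "emb/no_US_experiments/p4_q0.25.emb", binary=False
)
print(emb["DE"])  # embedding vector for one node, assuming country-code node IDs
```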
- Silhouette score:
Run:

python silhouette_analysis.py

This runs the elbow and silhouette analyses and displays the plots. We use the 6-cluster solution from the silhouette analysis for both the homophily and structural-equivalence cases. In the homophily setting (p = 4, q = 0.25), the silhouette score was highest at 3 clusters, followed by 6, but 3 clusters merged too many countries into one cluster (and was thus less informative), so we used 6. Similarly, in the structural-equivalence setting (p = 0.25, q = 4), the silhouette score was highest at 2 clusters, followed by 6, but 2 clusters was too few to be informative.
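A minimal sketch of this kind of silhouette analysis, assuming the embedding vectors are already loaded into a NumPy array (not the script's exact code):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def silhouette_by_k(vectors: np.ndarray, ks=range(2, 11)) -> dict:
    """Silhouette score for each candidate number of clusters."""
    scores = {}
    for k in ks:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(vectors)
        scores[k] = silhouette_score(vectors, labels)
    return scores
```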
- K-means clustering based on 6 clusters from silhouette analysis:
Run:

python k_means_clustering.py

This generates the CSV files for k-means clustering with the 6 clusters chosen from the silhouette analysis. The CSV files are saved in the KMeans_Community_Results folder; you can then import them into Gephi to visualize the communities.
Note that the current settings use only two p and q values (0.25 and 4) and one cluster target (6 clusters). You can modify the P_VALUES, Q_VALUES, and CLUSTER_TARGETS variables in k_means_clustering.py to change these settings.
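A minimal sketch of one such clustering run, with illustrative filenames and the assumption that node keys are country codes:

```python
import pandas as pd
from gensim.models import KeyedVectors
from sklearn.cluster import KMeans

# Illustrative paths; the script iterates over P_VALUES, Q_VALUES, and CLUSTER_TARGETS.
emb = KeyedVectors.load_word2vec_format(
    "emb/no_US_experiments/p0.25_q4.emb", binary=False
)
labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(emb.vectors)

pd.DataFrame({"country": emb.index_to_key, "cluster": labels}).to_csv(
    "KMeans_Community_Results/p0.25_q4_k6.csv", index=False
)
```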