- A Linux-based operating system and 16GB of RAM for dataset generation.
- For model training, a GPU with 16GB+ of memory is strongly recommended.
First, install the project with either pip:

```shell
pip install .
```

or poetry:

```shell
poetry install
```

External data sources should be downloaded from the links provided below or requested from the author.
All data processing can be completed by executing the scripts in `src/mscproject/jobs` in the indicated order.
You may need to install the Apache Spark framework with `bash scripts/spark-install.sh`.
We recommend using a GPU with a minimum of 16GB of memory for training the GNN models.
- Create a virtual machine with a GPU on Google Cloud Platform with `scripts/create-vm.sh`.
- If data processing has been completed locally, upload the data to cloud storage with `scripts/upload-data.sh` (remember to set the `GCP_BUCKET_NAME` environment variable).
- SSH into the virtual machine and build the Docker image with `scripts/build-docker.sh`.
- Download the data to the virtual machine with `scripts/download-data.sh` (don't forget to set the `GCP_BUCKET_NAME` environment variable on the VM).
- Run `scripts/exec-docker-optuna.sh` to perform the neural architecture search. This script can be modified to change the number of trials and other parameters.
- Run `scripts/exec-docker-gnn-evaluation.sh` to produce test set predictions for the best GNN model architectures.
- Run `scripts/upload-data.sh` to upload the test set predictions to cloud storage.
- Run `scripts/download-data.sh` on your local machine to retrieve the test set predictions for further analysis.
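Both the upload and download scripts read the `GCP_BUCKET_NAME` environment variable. A minimal sketch of the expected setup (the bucket name is a placeholder, and the guard is illustrative rather than the scripts' actual contents):

```shell
# Set the bucket name once per shell session (placeholder value),
# then fail fast if it is somehow unset before calling the data scripts.
export GCP_BUCKET_NAME="my-project-bucket"
: "${GCP_BUCKET_NAME:?GCP_BUCKET_NAME must be set before running the data scripts}"
echo "data/ will sync with gs://${GCP_BUCKET_NAME}/"
```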
The entire CatBoost training and tuning process can be completed by running `notebooks/13-catboost-model.ipynb`.
Model performance is evaluated on the test set predictions in `notebooks/17-full-evaluation.ipynb`.
Here is an overview of the key files within this project:
```
├── Dockerfile                         <---- Project container image
├── config
│   └── conf.yaml                      <---- Configuration file for data processing
├── figures
├── models/pyg                         <---- Trained PyG models
│   ├── regularised
│   │   ├── GCN.pt (kGNN)
│   │   └── GraphSAGE.pt
│   └── unregularised
│       ├── GCN.pt (kGNN)
│       └── GraphSAGE.pt
├── notebooks                          <---- Notebooks for training and evaluating models
│   ├── archive
│   ├── 13-catboost-model.ipynb
│   ├── 16-gnn-evaluation.ipynb
│   ├── 17-full-evaluation.ipynb
│   ├── 18-graph-visualisation.ipynb
│   ├── addresses.html
│   ├── companies.html
│   ├── persons.html
│   └── relationships.html
├── poetry.lock
├── pyproject.toml
├── reports
├── requirements.txt
├── scripts                            <---- Scripts for building the project on cloud
│   ├── create-vm.sh                   <---- Create VM on Google Cloud Platform with a GPU
│   ├── build-docker.sh                <---- Build the project Docker image
│   ├── download-data.sh               <---- Download data/ from cloud storage
│   ├── exec-docker-gnn-evaluation.sh  <---- GNN evaluation process in Docker container
│   ├── exec-docker-optuna.sh          <---- NN search and optimisation process in Docker container
│   ├── launch-gnn-evaluation.sh       <---- Launch GNN evaluation process
│   ├── launch-optuna.sh               <---- Configure and launch NN search and optimisation
│   ├── spark-install.sh               <---- Install Spark (Linux only)
│   └── upload-data.sh                 <---- Upload data/ to cloud storage
└── src
    └── mscproject
        ├── dataprep.py                <---- Initial data preparation
        ├── datasets.py                <---- PyG class for the Dataset
        ├── experiment.py              <---- NN architecture search and optimisation
        ├── features.py                <---- Graph feature generation
        ├── graphprep.py               <---- Graph building and processing
        ├── metrics.py                 <---- Metrics for evaluating the models
        ├── models.py                  <---- PyG model classes
        ├── preprocess.py              <---- Data preprocessing (normalisation, etc.)
        ├── pygloaders.py              <---- PyG data loaders
        ├── simulate.py                <---- Graph anomaly simulation
        ├── transforms.py              <---- Misc data transforms
        └── jobs                       <---- Scripts for running the data processing pipeline
            ├── 01_process_raw_data.py
            ├── 02_process_interim_data.py
            ├── 03_make_connected_components.py
            ├── 04_make_graph_data.py
            ├── 05_make_anomalies.py
            ├── 06_make_graph_features.py
            ├── 07_preprocess_features.py
            └── 08_make_pyg_data.py
```
Data used in this project can be obtained from the providers listed below. Please email the study authors if you require a copy of the original snapshots used.
| Data | Provider | Date Retrieved |
|---|---|---|
| Company Data | Companies House | 2022-05-24 |
| PSC Data | Companies House | 2022-05-24 |
| Ownership Data | Open Ownership | 2022-05-24 |
| Offshore Leaks | ICIJ | 2022-06-02 |