55 changes: 37 additions & 18 deletions README.md
@@ -1,14 +1,32 @@
# HackingBuddyGPT

How can LLMs aid or even emulate hackers? Threat actors are already using LLMs, so we need to create testbeds and ground truth for whitehats to learn and prepare. Currently we are using linux privilege escalation attacks as a test use-case, but we are branching out into web-security and Active Directory testing too.
How can LLMs aid or even emulate hackers? Threat actors are [already using LLMs](https://arxiv.org/abs/2307.00691),
creating the danger that defenders will not be prepared for this new threat.

How are we doing this? We are providing testbeds as well as tools. The initial tool `wintermute` targets linux priv-esc attacks. It uses SSH to connect to a (presumably) vulnerable virtual machine and then asks OpenAI GPT to suggest linux commands that could be used for finding security vulnerabilities or escalating privileges. The provided command is then executed within the virtual machine, the output is fed back to the LLM, and, finally, a new command is requested from it.
We aim to become **THE** framework for testing LLM-based agents for security testing.
To create common ground truth, we strive to build shared security testbeds and
benchmarks, evaluate multiple LLMs and techniques against them, and publish our
prototypes and findings as open-source/open-access reports.

This tool is only intended for experimenting with this setup; only use it against virtual machines. Never use it in any production or public setup, and please also see the disclaimer. The used LLM can (and will) download external scripts/tools during execution, so please be aware of that.
We strive to make our code-base as accessible as possible to allow for easy experimentation.
Our experiments are structured into `use-cases`, e.g., privilege escalation attacks. A researcher
wanting to create a new experiment just creates a new use-case that mostly consists
of the control loop and corresponding prompt templates. We provide multiple helper and base
classes, so that a new experiment can be implemented in a few dozen lines of code, as
connecting to the LLM, logging, etc. is taken care of by our framework. For further information (esp. if you want to contribute use-cases), please take a look at [docs/use_case.md](docs/use_case.md).


Our initial forays focused on evaluating the efficiency of LLMs for [linux
privilege escalation attacks](https://arxiv.org/abs/2310.11409), and we are currently branching out into evaluating
the use of LLMs for web penetration-testing and web API testing.

We release all tooling, testbeds and findings as open-source as this is the only way that comprehensive information will find its way to defenders. APTs have access to more sophisticated resources, so we are only leveling the playing field for blue teams. For information about the implementation, please see our [implementation notes](docs/implementation_notes.md). All source code can be found on [github](https://github.com/ipa-lab/hackingbuddyGPT).

## Current features:
## Privilege Escalation Attacks

How are we doing this? The initial tool `wintermute` targets linux priv-esc attacks. It uses SSH to connect to a (presumably) vulnerable virtual machine and then asks OpenAI GPT to suggest linux commands that could be used for finding security vulnerabilities or escalating privileges. The provided command is then executed within the virtual machine, the output is fed back to the LLM, and, finally, a new command is requested from it.
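
Conceptually, the feedback loop looks roughly like the following standalone sketch. It uses `paramiko` and the OpenAI Python client purely for illustration and is not the project's actual implementation; the prompt wording and the naive root check are placeholders:

~~~python
# Illustrative sketch of the wintermute feedback loop -- not the project's actual code.
import paramiko
from openai import OpenAI


def priv_esc_loop(host: str, user: str, password: str, max_rounds: int = 10) -> str:
    llm = OpenAI()  # reads OPENAI_API_KEY from the environment
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(host, username=user, password=password)

    history = ""
    for _ in range(max_rounds):
        # ask the LLM for the next command to try
        response = llm.chat.completions.create(
            model="gpt-4",
            messages=[{
                "role": "user",
                "content": "You are assisting with an authorized lab exercise against a test VM.\n"
                           f"Previous commands and output:\n{history}\n"
                           "Suggest exactly one linux command for privilege escalation.",
            }],
        )
        cmd = response.choices[0].message.content.strip()

        # execute the suggested command on the target VM and feed the output back
        _, stdout, _ = ssh.exec_command(cmd)
        output = stdout.read().decode()
        history += f"$ {cmd}\n{output}\n"

        # naive root-detection placeholder
        _, whoami, _ = ssh.exec_command("whoami")
        if whoami.read().decode().strip() == "root":
            break

    ssh.close()
    return history
~~~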

### Current features (wintermute):

- connects over SSH (linux targets) or SMB/PSExec (windows targets)
- supports OpenAI REST-API compatible models (gpt-3.5-turbo, gpt-4, gpt-3.5-turbo-16k, etc.)
@@ -18,6 +36,21 @@ We release all tooling, testbeds and findings as open-source as this is the only
- automatic root detection
- can limit rounds (how often the LLM will be asked for a new command)

### Example run

This is a simple example run of `wintermute.py` using GPT-4 against a vulnerable VM. More example runs can be seen in [our collection of historic runs](docs/old_runs/old_runs.md).

![Example wintermute run](docs/example_run_gpt4.png)

Some things to note:

- initially the current configuration is output. Yay, so many colors!
- "Got command from LLM" shows the generated command while the panel afterwards has the given command as title and the command's output as content.
- the table contains all executed commands. ThinkTime denotes the time that was needed to generate the command (Tokens shows the token count for the prompt and its response). StateUpdTime shows the time that was needed to generate a new state (the next column also gives the token count).
- "What does the LLM know about the system?" gives an LLM-generated list of system facts. To generate it, the LLM is given the latest executed command (and its output) as well as the current list of system facts. This is the operation whose time/token usage is shown in the overview table as StateUpdTime/StateUpdTokens. As the state update takes forever, this is disabled by default and has to be enabled through a command line switch.
- Then the next round starts. The next given command (`sudo tar`) will lead to a pwn'd system BTW.


## Academic Research/Exposure

hackingBuddyGPT is described in [Getting pwn'd by AI: Penetration Testing with Large Language Models](https://arxiv.org/abs/2308.00121):
@@ -63,20 +96,6 @@ This work is partially based upon our empiric research into [how hackers work](h
}
~~~

## Example run

This is a simple example run of `wintermute.py` using GPT-4 against a vulnerable VM. More example runs can be seen in [our collection of historic runs](docs/old_runs/old_runs.md).

![Example wintermute run](docs/example_run_gpt4.png)

Some things to note:

- initially the current configuration is output. Yay, so many colors!
- "Got command from LLM" shows the generated command while the panel afterwards has the given command as title and the command's output as content.
- the table contains all executed commands. ThinkTime denotes the time that was needed to generate the command (Tokens shows the token count for the prompt and its response). StateUpdTime shows the time that was needed to generate a new state (the next column also gives the token count).
- "What does the LLM know about the system?" gives an LLM-generated list of system facts. To generate it, the LLM is given the latest executed command (and its output) as well as the current list of system facts. This is the operation whose time/token usage is shown in the overview table as StateUpdTime/StateUpdTokens. As the state update takes forever, this is disabled by default and has to be enabled through a command line switch.
- Then the next round starts. The next given command (`sudo tar`) will lead to a pwn'd system BTW.

## Setup and Usage

We try to keep our python dependencies as light as possible. This should allow for easier experimentation. To run the main priv-escalation program (which is called `wintermute`) together with an OpenAI-based model you need:
88 changes: 88 additions & 0 deletions docs/configurable.md
@@ -0,0 +1,88 @@
# Configurable

Marking a class as `@configurable` allows for the class to be configured via command line arguments or environment variables.

This is done by analyzing the parameters of the class' `__init__` method and, if it is a `@dataclass`, its `__dataclass_fields__` attribute.
Since a `@configurable` that is also a `@dataclass` is easier to extend, it is usually recommended to define a configurable as a `@dataclass`.
Furthermore, using a dataclass allows a more natural use of the `parameter()` definition.

All [use-cases](use_case.md) are automatically configurable.

## Parameter Definition

Parameters can either be defined using type hints and default values, or by using the `parameter()` method.

```python
from dataclasses import dataclass
from utils.configurable import configurable, parameter


@configurable("inner-example", "Inner Example Configurable for documentation")
@dataclass
class InnerConfigurableExample:
    text_value: str


@configurable("example", "Example Configurable for documentation")
@dataclass
class ConfigurableExample:
    inner_configurable: InnerConfigurableExample
    text_value: str
    number_value_with_description: int = parameter(desc="This is a number value", default=42)
    number_value_without_description: int = 43
```

As can be seen, the `parameter()` function additionally allows setting a description for the parameter, while returning a `dataclasses.Field` to allow interoperability with existing tools.

The type of a configurable parameter may only be a primitive type (`int`, `str`, `bool`) or another configurable.

## Usage

When a class is marked as `@configurable`, it can be configured via command line arguments or environment variables.
The name of the parameter is automatically built from the field name (in the case of the example, `text_value`, `number_value_with_description` and `number_value_without_description`).

If a configurable has other configurable fields as parameters, they can be configured recursively; the parameter name is then built from the outer field name and the field name of the inner configurable (here `inner_configurable.text_value`).

These parameters are looked up in the following order:

1. Command line arguments
2. Environment variables (with `.` being replaced with `_`)
3. .env file
4. Default values

When you have a simple use case as follows:

```python
from dataclasses import dataclass
from usecases import use_case, UseCase

@use_case("example", "Example Use Case")
@dataclass
class ExampleUseCase(UseCase):
    conf: ConfigurableExample

    def run(self):
        print(self.conf)
```

You can configure the `ConfigurableExample` class as follows:

```bash
echo "conf.text_value = 'Hello World'" > .env
export CONF_NUMBER_VALUE_WITH_DESCRIPTION=120
export CONF_INNER_CONFIGURABLE_TEXT_VALUE="Inner Hello World"

python3 wintermute.py example --conf.inner_configurable.text_value "Inner Hello World Overwrite"
```

This results in:

```
ConfigurableExample(
    inner_configurable=InnerConfigurableExample(text_value='Inner Hello World Overwrite'),
    text_value='Hello World',
    number_value_with_description=120,
    number_value_without_description=43
)
```

28 changes: 28 additions & 0 deletions docs/use_case.md
@@ -0,0 +1,28 @@
# Use Cases

Wintermute consists of different use-cases (classes that extend `UseCase`, are annotated with `@use_case`, and are imported somewhere from the main `wintermute.py` file), which can be run individually.

The `@use_case` annotation takes a name and description as arguments, which are then used for the sub-commands in the command line interface.

When building a use-case, the `run` method must be implemented; it is called after the (optional) `init` method (note that this is not the `__init__` method).
The `run` method should contain the main logic of the use-case, though it is recommended to split the logic into smaller methods that are called from `run` for better readability (see the code for [`RoundBasedUseCase`](#round-based-use-case) for an example).

A use-case is automatically a `configurable`, which means that all parameters of its `__init__` function (or fields, for dataclasses) can be set via command line or environment parameters. For more information, read the [configurable](configurable.md) documentation.
It is recommended to define a use-case as a `@dataclass`, so that all parameters are directly visible and the use-case can be easily extended.
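
A minimal use-case might look roughly like the following sketch (the import path mirrors the example in [configurable.md](configurable.md); the `init` behaviour shown here is an assumption for illustration):

```python
from dataclasses import dataclass

from usecases import use_case, UseCase


@use_case("minimal", "Minimal example use-case")
@dataclass
class MinimalUseCase(UseCase):
    greeting: str = "hello"  # configurable via the command line / environment (see configurable.md)

    def init(self):
        # optional one-time setup, called before run() (not the same as __init__)
        self._message = f"{self.greeting} from a minimal use-case"

    def run(self):
        # main logic of the use-case
        print(self._message)
```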

## General Use Cases

Usually a use-case follows the pattern of having connections to the log database, an LLM, and a system with which it interacts.

The LLM should be typed as specifically as necessary for the use-case, as prompt templates are dependent on the LLM in use.
If you don't yet want to commit to, e.g., `GPT4Turbo`, you can use `llm: OpenAIConnection` and dynamically specify the LLM to be used via the parameters `llm.model` and `llm.context_size`.

In addition to that, arbitrary parameters and flags can be defined to control the use-case. For consistency, please check whether similar parameters are already used in other use-cases, and try to have yours behave accordingly.

When interacting with an LLM, the prompt and output should always be logged using `add_log_query`, `add_log_analyze_response`, `add_log_update_state`, or similar, to record all interactions.
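
Putting this together, the skeleton of a typical use-case might look roughly like the following; the import paths, the `llm.query()` call, and the exact logging signatures are assumptions for illustration only:

```python
from dataclasses import dataclass

from usecases import use_case, UseCase
from llms.openai import OpenAIConnection  # import path assumed


@use_case("llm-example", "Example use-case that queries an LLM")
@dataclass
class LLMExampleUseCase(UseCase):
    llm: OpenAIConnection  # concrete model chosen via --llm.model / --llm.context_size

    def run(self):
        prompt = "Suggest one linux command to enumerate SUID binaries."
        self.add_log_query(prompt)             # record the outgoing prompt (signature assumed)
        answer = self.llm.query(prompt)        # query method name assumed
        self.add_log_analyze_response(answer)  # record the LLM's response (signature assumed)
        print(answer)
```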

## Round Based Use Case

The `RoundBasedUseCase` is an abstract base class for use-cases that operate in rounds, where the LLM is called with a certain input and the result is evaluated using different capabilities.

An implementation needs to implement the `perform_round` method, which is called for each round. It can also optionally implement the `setup` and `teardown` methods, which are called before and after the rounds, respectively.
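
A round-based use-case might then be sketched roughly as follows; everything beyond the `perform_round`, `setup` and `teardown` hooks named above (the import path, the `perform_round` signature, and its return convention) is an assumption for illustration:

```python
from dataclasses import dataclass

from usecases import use_case, RoundBasedUseCase  # import path assumed


@use_case("round-example", "Example round-based use-case")
@dataclass
class RoundExampleUseCase(RoundBasedUseCase):

    def setup(self):
        # called once before the first round, e.g. to open connections
        self._transcript = []

    def perform_round(self, round_number: int) -> bool:
        # called once per round; returning True is assumed to signal success and stop early
        self._transcript.append(f"round {round_number}")
        return False

    def teardown(self):
        # called once after the last round, e.g. to close connections or write reports
        print("\n".join(self._transcript))
```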