20 changes: 15 additions & 5 deletions docs/settings.md
@@ -1,12 +1,10 @@
# Setting up the config

## Settings

To use this project, you need to have a `.csv` file with the knowledge base and a `.toml` file with your prompt configuration.

We recommend creating a folder inside this project called `data` and putting your CSV and TOML files there.

### `.csv` knowledge base
## `.csv` knowledge base

**fields:**

@@ -54,11 +52,11 @@ salesy way; the loyalty program is our growth strategy."""
prompt = """I'm sorry, I didn't understand your question. Could you rephrase it?"""
```

### Environment Variables
## Environment Variables

Look at the [`.env.sample`](.env.sample) file to see the environment variables needed to run the project.

#### LangSmith
### LangSmith

**Optionally:** if you wish to add observability to your LLM application, you may want to use [LangSmith](https://docs.smith.langchain.com/) (so far, for personal use only) to help debug, test, evaluate, and monitor the chains used in dialog. Follow the [setup instructions](https://docs.smith.langchain.com/setup) and add the env vars to the `.env` file:

@@ -68,3 +66,15 @@ LANGCHAIN_ENDPOINT="https://api.smith.langchain.com"
LANGCHAIN_API_KEY=<YOUR_LANGCHAIN_API_KEY>
LANGCHAIN_PROJECT=<YOUR_LANGCHAIN_PROJECT>
```

## Generate embeddings with `load_csv.py`

Embeddings create a vector representation of a question and answer pair from the knowledge base, enabling semantic search where we look for text passages that are most similar in the vector space.
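As an illustration of the idea, here is a minimal sketch of similarity search in vector space. The toy 3-dimensional vectors stand in for real embeddings, and `cosine_similarity` and the sample data are hypothetical, not part of this project:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings of Q&A pairs.
kb = {
    "How do I reset my password?": [0.9, 0.1, 0.0],
    "What is the loyalty program?": [0.1, 0.9, 0.2],
}
query_vec = [0.85, 0.15, 0.05]  # embedding of the user's question

# Semantic search: pick the knowledge-base entry closest in vector space.
best = max(kb, key=lambda q: cosine_similarity(kb[q], query_vec))
print(best)  # → 'How do I reset my password?'
```

In production the vectors come from an embedding model and the search runs inside the vector database, but the ranking principle is the same.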

We provide a CLI that generates embeddings by reading the knowledge base `.csv`.
By default, `load_csv.py` performs a **diff** between the existing vector database and the questions and answers in the `.csv`, so only new or changed rows are embedded.
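A minimal sketch of that diff, using md5 row hashes the way `load_csv.py` does (the sample rows and the `existing_hashes` set are hypothetical stand-ins for the real CSV and database contents):

```python
import hashlib

import pandas as pd

# New knowledge base rows (normally read with pd.read_csv).
new = pd.DataFrame({
    "question": ["How do I pay?", "Where do you ship?"],
    "content": ["Via credit card.", "Worldwide."],
})

# Hashes already stored alongside the vectors in the database (hypothetical).
existing_hashes = {
    hashlib.md5("How do I pay?Via credit card.".encode()).hexdigest()
}

# Hash each question+answer pair, mirroring the md5 step in load_csv.py.
new["hash"] = (new["question"] + new["content"]).map(
    lambda row: hashlib.md5(row.encode()).hexdigest()
)

# Only rows whose hash is absent from the database need new embeddings.
to_embed = new[~new["hash"].isin(existing_hashes)]
print(to_embed["question"].tolist())  # → ['Where do you ship?']
```

Because the hash covers the full row text, editing an existing answer changes its hash, so the edited row is re-embedded on the next run.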

The **CLI** accepts the following parameters:

* `--path`: path to the CSV knowledge base (defaults to `./know.csv`)
* `--cleardb`: deletes all previously imported vectors and reimports everything
52 changes: 50 additions & 2 deletions poetry.lock


1 change: 1 addition & 0 deletions pyproject.toml
@@ -26,6 +26,7 @@ alembic = "^1.12.1"
langchain-community = "^0.0.20"
importlib-metadata = "^7.0.1"
langchain-openai = "^0.0.6"
pyarrow = "^15.0.0"

[tool.poetry.group.dev.dependencies]
ipdb = "^0.13.13"
9 changes: 7 additions & 2 deletions src/load_csv.py
@@ -9,7 +9,7 @@
from dialog.models.db import session


def load_csv_and_generate_embeddings(path):
def load_csv_and_generate_embeddings(path, cleardb=False):
df = pd.read_csv(path)
necessary_cols = ["category", "subcategory", "question", "content"]
for col in necessary_cols:
@@ -27,6 +27,10 @@ def load_csv_and_generate_embeddings(path):
lambda row: hashlib.md5(row.encode()).hexdigest()
)

if cleardb:
session.query(CompanyContent).delete()
session.commit()

df_in_db = pd.read_sql(
text(
f"SELECT category, subcategory, question, content, dataset FROM {CompanyContent.__tablename__}"
@@ -62,6 +66,7 @@ def load_csv_and_generate_embeddings(path):
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--path", type=str, required=False, default="./know.csv")
parser.add_argument("--cleardb", action="store_true")
args = parser.parse_args()

load_csv_and_generate_embeddings(args.path)
load_csv_and_generate_embeddings(args.path, args.cleardb)