20 changes: 15 additions & 5 deletions docs/settings.md
@@ -1,12 +1,10 @@
# Setting up the config

## Settings

To use this project, you need to have a `.csv` file with the knowledge base and a `.toml` file with your prompt configuration.

We recommend creating a folder inside this project called `data` and putting your CSV and TOML files there.

### `.csv` knowledge base
## `.csv` knowledge base

**fields:**

@@ -54,11 +52,11 @@ salesy way; the loyalty program is our growth strategy."""
prompt = """I'm sorry, I didn't understand your question. Could you rephrase it?"""
```

### Environment Variables
## Environment Variables

Look at the [`.env.sample`](.env.sample) file to see the environment variables needed to run the project.

#### LangSmith
### LangSmith

**Optionally:** if you wish to add observability to your LLM application, you may want to use [LangSmith](https://docs.smith.langchain.com/) (so far, for personal use only) to help debug, test, evaluate, and monitor the chains used in dialog. Follow the [setup instructions](https://docs.smith.langchain.com/setup) and add the env vars to the `.env` file:

@@ -68,3 +66,15 @@ LANGCHAIN_ENDPOINT="https://api.smith.langchain.com"
LANGCHAIN_API_KEY=<YOUR_LANGCHAIN_API_KEY>
LANGCHAIN_PROJECT=<YOUR_LANGCHAIN_PROJECT>
```

## Generate embeddings with `load_csv.py`

Embeddings create a vector representation of a question and answer pair from the knowledge base, enabling semantic search where we look for text passages that are most similar in the vector space.
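As an illustration of the idea, here is a minimal sketch of similarity search in vector space. The toy 3-dimensional vectors stand in for real embeddings, and `cosine_similarity` and the sample data are hypothetical, not part of this project:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings of Q&A pairs.
kb = {
    "How do I reset my password?": [0.9, 0.1, 0.0],
    "What is the loyalty program?": [0.1, 0.9, 0.2],
}
query_vec = [0.85, 0.15, 0.05]  # embedding of the user's question

# Semantic search: pick the knowledge-base entry closest in vector space.
best = max(kb, key=lambda q: cosine_similarity(kb[q], query_vec))
print(best)  # → 'How do I reset my password?'
```

In production the vectors come from an embedding model and the search runs inside the vector database, but the ranking principle is the same.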

We provide a CLI that generates embeddings by reading the knowledge base `.csv`.
By default, `load_csv.py` performs a **diff** between the existing vector database and the questions and answers in the `.csv`, so only new or changed rows are embedded.
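A minimal sketch of that diff, using md5 row hashes the way `load_csv.py` does (the sample rows and the `existing_hashes` set are hypothetical stand-ins for the real CSV and database contents):

```python
import hashlib

import pandas as pd

# New knowledge base rows (normally read with pd.read_csv).
new = pd.DataFrame({
    "question": ["How do I pay?", "Where do you ship?"],
    "content": ["Via credit card.", "Worldwide."],
})

# Hashes already stored alongside the vectors in the database (hypothetical).
existing_hashes = {
    hashlib.md5("How do I pay?Via credit card.".encode()).hexdigest()
}

# Hash each question+answer pair, mirroring the md5 step in load_csv.py.
new["hash"] = (new["question"] + new["content"]).map(
    lambda row: hashlib.md5(row.encode()).hexdigest()
)

# Only rows whose hash is absent from the database need new embeddings.
to_embed = new[~new["hash"].isin(existing_hashes)]
print(to_embed["question"].tolist())  # → ['Where do you ship?']
```

Because the hash covers the full row text, editing an existing answer changes its hash, so the edited row is re-embedded on the next run.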

The **CLI** accepts the following parameters:

* `--path`: path to the CSV knowledge base (defaults to `./know.csv`)
* `--cleardb`: deletes all previously imported vectors and reimports everything
52 changes: 50 additions & 2 deletions poetry.lock


1 change: 1 addition & 0 deletions pyproject.toml
@@ -26,6 +26,7 @@ alembic = "^1.12.1"
langchain-community = "^0.0.20"
importlib-metadata = "^7.0.1"
langchain-openai = "^0.0.6"
pyarrow = "^15.0.0"

[tool.poetry.group.dev.dependencies]
ipdb = "^0.13.13"
9 changes: 7 additions & 2 deletions src/load_csv.py
@@ -9,7 +9,7 @@
from dialog.models.db import session


def load_csv_and_generate_embeddings(path):
def load_csv_and_generate_embeddings(path, cleardb=False):
df = pd.read_csv(path)
necessary_cols = ["category", "subcategory", "question", "content"]
for col in necessary_cols:
@@ -27,6 +27,10 @@ def load_csv_and_generate_embeddings(path):
lambda row: hashlib.md5(row.encode()).hexdigest()
)

if cleardb:
session.query(CompanyContent).delete()
session.commit()

df_in_db = pd.read_sql(
text(
f"SELECT category, subcategory, question, content, dataset FROM {CompanyContent.__tablename__}"
@@ -62,6 +66,7 @@ def load_csv_and_generate_embeddings(path):
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--path", type=str, required=False, default="./know.csv")
parser.add_argument("--cleardb", action="store_true")
args = parser.parse_args()

load_csv_and_generate_embeddings(args.path)
load_csv_and_generate_embeddings(args.path, args.cleardb)