CSV Validator finds and fixes data issues in CSV files. It specifically handles negative numbers, missing data, and duplicate values.
Some data issues are always handled the same way, while the handling of others can be configured.
You will need Node.js and NPM installed on your local machine.
To install the project, follow these steps:
- Clone or download this repo to your local machine using https://github.com/James9446/csv_validator.git.
- Navigate to the project directory and install the dependencies using: npm install
This will install the following dependencies:
- "@inquirer/prompts"^3.0.4
- "csv-parser"^3.0.0
- "fast-csv"^4.3.6
You can start the application using:
npm start
You can test the application using the provided dummy data CSV files located in the raw folder. To validate a new CSV, add it to the raw folder, which is located in the data folder.
CSV Validator requires three specific columns in every CSV file it processes:
email
customerId
points
These three column names MUST appear exactly as displayed above. CSV Validator needs these columns to perform all of its functions.
Your CSV file can contain additional columns. Any extra columns beyond these three are included in the newly generated CSV files. Whichever action you select (handling missing data, duplicates, or negative numbers), any data that is not explicitly altered is carried over unchanged to the new CSV files.
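The column requirement above can be checked before any other processing. The following is a minimal sketch, not the project's actual code: `REQUIRED_COLUMNS` and `findMissingColumns` are hypothetical names, the header row is assumed to be already parsed into an array of strings, and matching is assumed to be case-sensitive.

```javascript
// Hypothetical helper, not taken from the project source.
// The three names come from the list above; case-sensitive matching is an assumption.
const REQUIRED_COLUMNS = ["email", "customerId", "points"];

// Returns the required columns that are absent from the parsed header row.
function findMissingColumns(headers) {
  return REQUIRED_COLUMNS.filter((name) => !headers.includes(name));
}

// Extra columns are allowed, so only the absence of a required column is reported.
console.log(findMissingColumns(["email", "customerId", "points", "region"]));
console.log(findMissingColumns(["Email", "customerId", "points"]));
```

The first call reports nothing missing; the second reports `email`, because `Email` does not match under the case-sensitive assumption.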
Action | If selected folder = raw | If selected folder = modified | Additional action |
---|---|---|---|
Run Tests | No data modifications | No data modifications | None |
Handle Missing Data | Creates a new modified CSV with the same file name and adds it to the modified folder | Modifies and replaces the selected file | Creates an additional CSV with all of the rows that were removed due to missing data. It includes a column that describes what data was missing. The file name will be based on the selected file (e.g. missing_data_[selected file name].csv). This file will be added to the modified folder. If a file with this exact same file name already exists in the modified folder then it will be replaced. |
Handle Duplicates | Creates a new modified CSV with the same file name and adds it to the modified folder | Modifies and replaces the selected file | Creates an additional CSV with all of the rows that were removed due to duplicate data. It includes a column that describes what data was duplicated. The file name will be based on the selected file (e.g. duplicates_[selected file name].csv). This file will be added to the modified folder. If a file with this exact same file name already exists in the modified folder then it will be replaced. |
Handle Negatives | Creates a new modified CSV with the same file name and adds it to the modified folder | Modifies and replaces the selected file | None |
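The file naming in the table above can be summarized in a few lines. This is only a sketch under stated assumptions: `outputPaths`, `MODIFIED_DIR`, and `EXTRA_FILE_PREFIX` are hypothetical names, and the `./data/modified/` location is taken from the console output shown later in this README.

```javascript
// Hypothetical sketch; names are assumptions, not the project's API.
const MODIFIED_DIR = "./data/modified/";

// Prefix of the additional CSV per action. Handle Negatives writes no additional CSV.
const EXTRA_FILE_PREFIX = {
  "Handle Missing Data": "missing_data_",
  "Handle Duplicates": "duplicates_",
};

function outputPaths(action, fileName) {
  // Run Tests makes no data modifications, so it writes no files.
  if (action === "Run Tests") return { main: null, extra: null };
  const prefix = EXTRA_FILE_PREFIX[action];
  return {
    // The cleaned CSV keeps the selected file's name.
    main: MODIFIED_DIR + fileName,
    // null when the action writes no additional CSV.
    extra: prefix ? MODIFIED_DIR + prefix + fileName : null,
  };
}

console.log(outputPaths("Handle Missing Data", "dummy_data_0.csv"));
```

Because the main output keeps the selected file's name, running an action on a file in the raw folder replaces any same-named file already in the modified folder.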
The Run Tests action will use all 3 of the other actions to run tests on the selected file. The results will be logged to the console. For each of these 3 tests it will specify whether the selected file passed or failed. If a test fails then the number of rows that have a data issue will also be logged to the console.
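One way to picture a single test is as a count of problem rows, where zero means the test passes. The sketch below is illustrative only: `runCheck` and `isNegative` are hypothetical names, rows are assumed to be plain objects keyed by column name, and the real console formatting is not reproduced.

```javascript
// Hypothetical helper; `hasIssue` is any predicate that flags a problem row.
function runCheck(name, rows, hasIssue) {
  const issueCount = rows.filter(hasIssue).length;
  return { name, passed: issueCount === 0, issueCount };
}

// Example predicate: a row fails the negatives check when its points value is below 0.
const isNegative = (row) => Number(row.points) < 0;

const sampleRows = [
  { email: "a@example.com", customerId: "1", points: "10" },
  { email: "b@example.com", customerId: "2", points: "-5" },
];

// One of the two sample rows is negative, so this check fails with issueCount 1.
console.log(runCheck("Negatives Test", sampleRows, isNegative));
```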
When using the Handle Missing Data action, you will select a configuration option. This option only comes into play if neither the customerId nor the email is missing.
Configuration Options: What should be done if a customer record is missing the points value?
Skip the customer record and capture the details
Set the value to 0
Data Issue | Configuration Option | Record Included in New CSV | Record Included in Missing Data CSV |
---|---|---|---|
Missing customerId | Either configuration option selected | No | Yes |
Missing email | Either configuration option selected | No | Yes |
Missing points | Skip the customer record and capture the details | No | Yes |
Missing points | Set the value to 0 | Yes | No |
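The rules in the table above can be sketched as a single pass over the rows. This is a hedged illustration, not the project's implementation: `handleMissingData`, `isMissing`, and the `missing` description column are hypothetical names, and treating empty or whitespace-only strings as missing is an assumption.

```javascript
// Hypothetical sketch of the table above; all names are assumptions.
// Assumption: a value is missing when it is undefined, null, or blank.
const isMissing = (value) =>
  value === undefined || value === null || String(value).trim() === "";

const SET_TO_ZERO = "Set the value to 0";

function handleMissingData(rows, option) {
  const kept = [];     // rows for the new CSV
  const captured = []; // rows for the missing data CSV, with a description column
  for (const row of rows) {
    const missing = ["customerId", "email", "points"].filter((col) => isMissing(row[col]));
    const idMissing = missing.includes("customerId") || missing.includes("email");
    if (missing.length === 0) {
      kept.push(row);
    } else if (!idMissing && option === SET_TO_ZERO) {
      // Only points is missing and the user chose to default it.
      kept.push({ ...row, points: "0" });
    } else {
      // Missing customerId or email is always removed, whichever option is selected.
      captured.push({ ...row, missing: missing.join(", ") });
    }
  }
  return { kept, captured };
}
```

Spreading `row` into each new object preserves any extra columns, which matches the requirement that unaltered data is carried over.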
When using the Handle Duplicates action, you will select a configuration option. This option only comes into play if the duplicate customer record has the exact same email and customerId values but does not have the same points value.
Configuration Options: What should be done if a customer record is duplicated and has different point values?
Assign whichever point value is lower
Assign whichever point value is higher
Set the value to 0 and capture the details
Data Issue | Configuration Option | Record Included in New CSV | Record Included in Duplicates CSV |
---|---|---|---|
Exact duplicate; the email, customerId, and points values are all the same | Any configuration option selected | No | No |
Duplicate customerId; the customerId is associated with more than one email | Any configuration option selected | No | Yes |
Duplicate email; the email is associated with more than one customerId | Any configuration option selected | No | Yes |
Duplicate points; email and customerId values match but the points value does not | Assign whichever point value is lower | Yes | No |
Duplicate points; email and customerId values match but the points value does not | Assign whichever point value is higher | Yes | No |
Duplicate points; email and customerId values match but the points value does not | Set the value to 0 and capture the details | Yes | Yes |
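The four duplicate cases in the table above can be told apart by comparing two records field by field. The sketch below is an illustration under assumptions: `classifyDuplicate` and `resolvePoints` are hypothetical names, points values are compared numerically, and missing data is assumed to have been handled already so every points value parses as a number.

```javascript
// Hypothetical sketch; function names are assumptions, not the project's API.
// Classifies how record `b` duplicates record `a`, following the four table cases.
function classifyDuplicate(a, b) {
  const sameEmail = a.email === b.email;
  const sameId = a.customerId === b.customerId;
  const samePoints = Number(a.points) === Number(b.points);
  if (sameEmail && sameId) return samePoints ? "exact" : "points";
  if (sameId) return "customerId"; // one customerId, more than one email
  if (sameEmail) return "email";   // one email, more than one customerId
  return null;                     // not a duplicate
}

// Resolves a "points" duplicate according to the selected configuration option.
function resolvePoints(a, b, option) {
  const pa = Number(a.points);
  const pb = Number(b.points);
  switch (option) {
    case "Assign whichever point value is lower":
      return { points: Math.min(pa, pb), capture: false };
    case "Assign whichever point value is higher":
      return { points: Math.max(pa, pb), capture: false };
    case "Set the value to 0 and capture the details":
      return { points: 0, capture: true };
    default:
      throw new Error(`Unknown option: ${option}`);
  }
}
```

Only the "points" case consults the configuration option; the other three cases are handled the same way whichever option is selected.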
All customer records with negative points values are set to 0 and added to the new CSV. There is no additional CSV for capturing customer records that had a negative value.
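The negatives rule is a simple clamp. A minimal sketch, assuming rows are plain objects keyed by column name and that points values are numeric strings; `handleNegatives` is a hypothetical name.

```javascript
// Hypothetical sketch; not the project's actual function.
// Rows with a negative points value are copied with points set to "0";
// all other rows, and all other columns, pass through unchanged.
function handleNegatives(rows) {
  return rows.map((row) => (Number(row.points) < 0 ? { ...row, points: "0" } : row));
}

const cleaned = handleNegatives([
  { email: "a@example.com", customerId: "1", points: "-25" },
  { email: "b@example.com", customerId: "2", points: "40" },
]);
console.log(cleaned.map((row) => row.points)); // the two values are "0" and "40"
```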
- Run the application using npm start.
- Pick an action you'd like to execute:
  - Run Tests
  - Handle Missing Data
  - Handle Duplicates
  - Handle Negatives
- Select the folder containing the file you wish to take an action on.
- Select a CSV file from the chosen folder.
- Based on the action chosen, you may need to follow the prompts to configure your action.
Note: It is generally best to run actions in the order they are listed in the prompt. Handle missing data before handling duplicates or negatives, and handle duplicates before handling negatives:
Run Tests >>> Handle Missing Data >>> Handle Duplicates >>> Handle Negatives >>> Run Tests
1. Add a CSV file to the raw folder (or try using one of the test CSV files provided).
2. Run the application using npm start.
3. Select the Run Tests action.
4. Select the raw folder.
5. Select a CSV file.
6. See which tests fail and handle the data issues accordingly. (Let's imagine all tests fail.)
7. Run the application again using npm start.
8. Select the Handle Missing Data action.
9. Select the raw folder.
10. Select the same CSV file from step 5.
11. Select a Configuration Option (e.g. Skip the customer record and capture the details).
12. A new CSV (with the same file name) will be added to the modified folder.
13. ALL ADDITIONAL ACTIONS should be run on the new file located in the modified folder.
14. If you perform the Run Tests action on the new file in the modified folder, it should now pass the Missing Data Test.
15. The Handle Duplicates and Handle Negatives actions should now be run on the new CSV in the modified folder. Doing so will update this CSV. If you run these actions on the original CSV located in the raw folder, the CSV in the modified folder will be replaced instead of updated, and the modifications from the Handle Missing Data action will be lost.
16. Complete all actions until all tests pass!
npm start
What action would you like to take? (Use arrow keys)
❯ Run Tests
Handle Missing Data
Handle Duplicates
Handle Negatives
Select a Folder
modified
❯ raw
Select a File (Use arrow keys)
❯ dummy_data_0.csv
dummy_data_1.csv
dummy_data_2.csv
dummy_data_3.csv
dummy_data_million_records.csv
test_bad_headers.csv
Missing Data Test: FAILED
• Count of Missing: 7
Duplicates Test: FAILED
• Count of Duplicates: 6
• Count of Exact Duplicates: 1 (Note: Exact Duplicates will not be added to duplicates_dummy_data_0.csv)
Negatives Test: FAILED
• Negatives total: 2
npm start
What action would you like to take?
Run Tests
❯ Handle Missing Data
Handle Duplicates
Handle Negatives
Select a Folder
modified
❯ raw
Select a File (Use arrow keys)
❯ dummy_data_0.csv
dummy_data_1.csv
dummy_data_2.csv
dummy_data_3.csv
dummy_data_million_records.csv
test_bad_headers.csv
What should be done if a customer record is missing the points value? (Use arrow keys)
❯ Skip the customer record and capture the details
Set the value to 0
Created file: dummy_data_0.csv. It can be found at ./data/modified/
Created file: missing_data_dummy_data_0.csv. It can be found at ./data/modified/
npm start
What action would you like to take?
Run Tests
Handle Missing Data
❯ Handle Duplicates
Handle Negatives
Select a Folder (Use arrow keys)
❯ modified
raw
Select a File (Use arrow keys)
❯ dummy_data_0.csv
missing_data_dummy_data_0.csv
What should be done if a customer record is duplicated and has different point values? (Use arrow keys)
❯ Assign whichever point value is lower
Assign whichever point value is higher
Set the value to 0 and capture the details
Modified file dummy_data_0.csv. It can be found at ./data/modified/
Created file duplicates_dummy_data_0.csv. It can be found at ./data/modified/
npm start
What action would you like to take?
Run Tests
Handle Missing Data
Handle Duplicates
❯ Handle Negatives
Select a Folder (Use arrow keys)
❯ modified
raw
Select a File (Use arrow keys)
❯ dummy_data_0.csv
duplicates_dummy_data_0.csv
missing_data_dummy_data_0.csv
Modified file dummy_data_0.csv. It can be found at ./data/modified/
npm start
What action would you like to take? (Use arrow keys)
❯ Run Tests
Handle Missing Data
Handle Duplicates
Handle Negatives
Select a Folder (Use arrow keys)
❯ modified
raw
Select a File (Use arrow keys)
❯ dummy_data_0.csv
duplicates_dummy_data_0.csv
missing_data_dummy_data_0.csv
Missing Data Test: PASSED
Duplicates Test: PASSED
Negatives Test: PASSED
This package is licensed under the ISC license.
This project was developed by James Reynolds.