Commit 27f4a96

[DOCS] Capital One Data Profiler README update (#5387)
* update to include profile stats descriptions
* add GE integration author and link

Co-authored-by: Austin Ziech Robinson <44794138+austiezr@users.noreply.github.com>
1 parent 34aac03 commit 27f4a96

File tree

1 file changed: +96 -2 lines changed
  • contrib/capitalone_dataprofiler_expectations

contrib/capitalone_dataprofiler_expectations/README.md

Lines changed: 96 additions & 2 deletions
@@ -88,8 +88,8 @@ The format for a structured profile is below:
8888
"min": [null, float, str],
8989
"max": [null, float, str],
9090
"mode": float,
91-
"median", float,
92-
"median_absolute_deviation", float,
91+
"median": float,
92+
"median_absolute_deviation": float,
9393
"sum": float,
9494
"mean": float,
9595
"variance": float,
@@ -165,6 +165,100 @@ The format for an unstructured profile is below:
165165
}
166166
}
167167
```
168+
# Profile Statistic Descriptions
169+
170+
### Structured Profile
171+
172+
#### global_stats:
173+
174+
* `samples_used` - number of input data samples used to generate this profile
175+
* `column_count` - the number of columns contained in the input dataset
176+
* `row_count` - the number of rows contained in the input dataset
177+
* `row_has_null_ratio` - the proportion of rows that contain at least one null value to the total number of rows
178+
* `row_is_null_ratio` - the proportion of rows that consist entirely of null values (null rows) to the total number of rows
179+
* `unique_row_ratio` - the proportion of distinct rows in the input dataset to the total number of rows
180+
* `duplicate_row_count` - the number of rows that occur more than once in the input dataset
181+
* `file_type` - the format of the file containing the input dataset (ex: .csv)
182+
* `encoding` - the encoding of the file containing the input dataset (ex: UTF-8)
183+
* `correlation_matrix` - matrix of shape `column_count` x `column_count` containing the correlation coefficients between each pair of columns in the dataset
184+
* `chi2_matrix` - matrix of shape `column_count` x `column_count` containing the chi-square statistics between each pair of columns in the dataset
185+
* `profile_schema` - a description of the format of the input dataset labeling each column and its index in the dataset
186+
* `string` - the label of the column in question and its index in the profile schema
187+
* `times` - the duration of time it took to generate the global statistics for this dataset in milliseconds
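The row-level ratios above can be illustrated with a minimal sketch in plain Python. This is not the Data Profiler implementation: the function name `global_row_stats` is hypothetical, `None` is assumed to be the only null marker, and `duplicate_row_count` is assumed to count every row beyond the first occurrence of each distinct row.

```python
def global_row_stats(rows):
    """Compute row-level statistics for a list of equal-length tuples.

    Assumption of this sketch: None is the only null marker.
    """
    row_count = len(rows)
    if row_count == 0:
        raise ValueError("need at least one row")

    # Rows with at least one null, and rows made up only of nulls.
    rows_with_null = sum(1 for row in rows if any(v is None for v in row))
    null_rows = sum(1 for row in rows if all(v is None for v in row))
    distinct_rows = len(set(rows))

    return {
        "column_count": len(rows[0]),
        "row_count": row_count,
        "row_has_null_ratio": rows_with_null / row_count,
        "row_is_null_ratio": null_rows / row_count,
        "unique_row_ratio": distinct_rows / row_count,
        # Assumption: rows beyond the first occurrence count as duplicates.
        "duplicate_row_count": row_count - distinct_rows,
    }


rows = [(1, "a"), (1, "a"), (None, "b"), (None, None)]
stats = global_row_stats(rows)
print(stats)
```

In this toy dataset, two of the four rows contain a null, one row is entirely null, and one row repeats an earlier row.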
188+
189+
#### data_stats:
190+
191+
* `column_name` - the label/title of this column in the input dataset
192+
* `data_type` - the primitive Python data type that is contained within this column
193+
* `data_label` - the label/entity of the data in this column as determined by the Labeler component
194+
* `categorical` - ‘true’ if this column contains categorical data
195+
* `order` - the way in which the data in this column is ordered, if any, otherwise “random”
196+
* `samples` - a small subset of data entries from this column
197+
* `statistics` - statistical information on the column
198+
* `sample_size` - number of input data samples used to generate this profile
199+
* `null_count` - the number of null entries in the sample
200+
* `null_types` - a list of the different null types present within this sample
201+
* `null_types_index` - a dict mapping each null type to the list of indices at which it appears within this sample
202+
* `data_type_representation` - the percentage of samples used that are identified as each data_type
203+
* `min` - minimum value in the sample
204+
* `max` - maximum value in the sample
205+
* `mode` - mode of the entries in the sample
206+
* `median` - median of the entries in the sample
207+
* `median_absolute_deviation` - the median absolute deviation of the entries in the sample
208+
* `sum` - the total of all sampled values from the column
209+
* `mean` - the average of all entries in the sample
210+
* `variance` - the variance of all entries in the sample
211+
* `stddev` - the standard deviation of all entries in the sample
212+
* `skewness` - the statistical skewness of all entries in the sample
213+
* `kurtosis` - the statistical kurtosis of all entries in the sample
214+
* `num_zeros` - the number of entries in this sample that have the value 0
215+
* `num_negatives` - the number of entries in this sample that have a value less than 0
216+
* `histogram` - contains histogram relevant information
217+
* `bin_counts` - the number of entries within each bin
218+
* `bin_edges` - the thresholds of each bin
219+
* `quantiles` - the value at each percentile, in the order the percentiles are listed, computed from the entries in the sample
220+
* `vocab` - a list of the characters used within the entries in this sample
221+
* `avg_predictions` - average of the data label prediction confidences across all data points sampled
222+
* `categories` - a list of each distinct category within the sample if `categorical` = 'true'
223+
* `unique_count` - the number of distinct entries in the sample
224+
* `unique_ratio` - the proportion of the number of distinct entries in the sample to the total number of entries in the sample
225+
* `categorical_count` - number of entries sampled for each category if `categorical` = 'true'
226+
* `gini_impurity` - a measure of how often a randomly chosen element from the sample would be incorrectly labeled if it were labeled at random according to the distribution of labels in the sample
227+
* `unalikeability` - a value denoting how frequently entries differ from one another within the sample
228+
* `precision` - a dict of statistics with respect to the number of digits in a number for each sample
229+
* `times` - the duration of time it took to generate this sample's statistics in milliseconds
230+
* `format` - list of possible datetime formats
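Several of the column statistics above are easier to read next to a small worked example. The sketch below uses only the Python standard library and is not the Data Profiler implementation. The helper names are hypothetical, and the `unalikeability` formula (the share of ordered pairs of entries whose values differ) is an assumed, commonly used definition.

```python
from collections import Counter
from statistics import median


def numeric_stats(values):
    """A few of the numeric statistics for one column sample."""
    center = median(values)
    return {
        "median": center,
        # Median of the absolute distances from the median.
        "median_absolute_deviation": median(abs(v - center) for v in values),
        "num_zeros": sum(1 for v in values if v == 0),
        "num_negatives": sum(1 for v in values if v < 0),
    }


def categorical_stats(values):
    """A few of the categorical statistics for one column sample."""
    counts = Counter(values)
    n = len(values)
    # Gini impurity: 1 minus the sum of squared category proportions.
    gini = 1.0 - sum((c / n) ** 2 for c in counts.values())
    # Assumed definition: fraction of ordered pairs of entries that differ.
    if n > 1:
        unalike = sum(c * (n - c) for c in counts.values()) / (n * (n - 1))
    else:
        unalike = 0.0
    return {
        "unique_count": len(counts),
        "unique_ratio": len(counts) / n,
        "categorical_count": dict(counts),
        "gini_impurity": gini,
        "unalikeability": unalike,
    }


num = numeric_stats([-2, 0, 0, 3, 100])
cat = categorical_stats(["a", "a", "b", "c"])
print(num)
print(cat)
```

Note how the outlier `100` leaves `median_absolute_deviation` almost untouched, which is why it is reported alongside `stddev`.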
231+
232+
### Unstructured Profile
233+
234+
#### global_stats:
235+
236+
* `samples_used` - number of input data samples used to generate this profile
237+
* `empty_line_count` - the number of empty lines in the input data
238+
* `file_type` - the file type of the input data (ex: .txt)
239+
* `encoding` - file encoding of the input data file (ex: UTF-8)
240+
* `memory_size` - size of the input data in MB
241+
* `times` - duration of time it took to generate this profile in milliseconds
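As a rough illustration of two of these fields, the sketch below counts empty lines and measures the encoded size of a text. It is not the Data Profiler implementation: the function name is hypothetical, whitespace-only lines are assumed to count as empty, and 1 MB is assumed to be 1024 * 1024 bytes.

```python
def unstructured_global_stats(text, encoding="utf-8"):
    """Count empty lines and measure the encoded size of a text."""
    lines = text.splitlines()
    # Assumption: a line holding only whitespace counts as empty.
    empty_line_count = sum(1 for line in lines if not line.strip())
    # Assumption: 1 MB is taken to be 1024 * 1024 bytes.
    memory_size = len(text.encode(encoding)) / 1024 ** 2
    return {
        "empty_line_count": empty_line_count,
        "encoding": encoding,
        "memory_size": memory_size,
    }


text = "first line\n\nsecond line\n   \n"
global_stats = unstructured_global_stats(text)
print(global_stats)
```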
242+
243+
#### data_stats:
244+
245+
* `data_label` - labels and statistics on the labels of the input data
246+
* `entity_counts` - the number of times a specific label or entity appears inside the input data
247+
* `word_level` - the number of words counted within each label or entity
248+
* `true_char_level` - the number of characters counted within each label or entity as determined by the model
249+
* `postprocess_char_level` - the number of characters counted within each label or entity as determined by the postprocessor
250+
* `entity_percentages` - the percentages of each label or entity within the input data
251+
* `word_level` - the percentage of words in the input data that are contained within each label or entity
252+
* `true_char_level` - the percentage of characters in the input data that are contained within each label or entity as determined by the model
253+
* `postprocess_char_level` - the percentage of characters in the input data that are contained within each label or entity as determined by the postprocessor
254+
* `times` - the duration of time it took for the data labeler to predict on the data
255+
* `statistics` - statistics of the input data
256+
* `vocab` - a list of each character in the input data
257+
* `vocab_count` - the number of occurrences of each distinct character in the input data
258+
* `words` - a list of each word in the input data
259+
* `word_count` - the number of occurrences of each distinct word in the input data
260+
* `times` - the duration of time it took to generate the vocab and words statistics in milliseconds
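The `vocab` and `words` statistics can be sketched with `collections.Counter`. This is an illustration only, not the Data Profiler implementation: the function name is hypothetical, words are assumed to be split on whitespace with no lowercasing or stop-word removal, and every character (including spaces) is assumed to count toward `vocab`.

```python
from collections import Counter


def text_statistics(text):
    """Character-level and word-level counts for a piece of text."""
    # Assumption: every character, including whitespace, is counted.
    vocab_count = Counter(text)
    # Assumption: words are whitespace-separated tokens, used as-is.
    word_count = Counter(text.split())
    return {
        "vocab": sorted(vocab_count),
        "vocab_count": dict(vocab_count),
        "words": sorted(word_count),
        "word_count": dict(word_count),
    }


text_stats = text_statistics("to be or not to be")
print(text_stats["word_count"])
print(text_stats["vocab"])
```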
261+
168262
# Support
169263

170264
### Supported Data Formats
