[DOCS] Capital One Data Profiler README update (#5387)
* update to include profile stats descriptions
* add GE integration author and link
Co-authored-by: Austin Ziech Robinson <44794138+austiezr@users.noreply.github.com>
File changed: `contrib/capitalone_dataprofiler_expectations/README.md` (96 additions, 2 deletions)
```diff
@@ -88,8 +88,8 @@ The format for a structured profile is below:
     "min": [null, float, str],
     "max": [null, float, str],
     "mode": float,
-    "median", float,
-    "median_absolute_deviation", float,
+    "median": float,
+    "median_absolute_deviation": float,
     "sum": float,
     "mean": float,
     "variance": float,
```
The second hunk (`@@ -165,6 +165,100 @@`) appends a new section immediately after the closing code fence of the unstructured profile format example:

# Profile Statistic Descriptions
### Structured Profile

#### global_stats:

* `samples_used` - number of input data samples used to generate this profile
* `column_count` - the number of columns in the input dataset
* `row_count` - the number of rows in the input dataset
* `row_has_null_ratio` - the proportion of rows that contain at least one null value
* `row_is_null_ratio` - the proportion of rows composed entirely of null values (null rows)
* `unique_row_ratio` - the proportion of distinct rows in the input dataset
* `duplicate_row_count` - the number of rows that occur more than once in the input dataset
* `file_type` - the format of the file containing the input dataset (e.g. `.csv`)
* `encoding` - the encoding of the file containing the input dataset (e.g. UTF-8)
* `correlation_matrix` - a `column_count` x `column_count` matrix of the correlation coefficients between each pair of columns in the dataset
* `chi2_matrix` - a `column_count` x `column_count` matrix of the chi-square statistics between each pair of columns in the dataset
* `profile_schema` - a description of the format of the input dataset, labeling each column and its index
  * `string` - the label of the column in question and its index in the profile schema
* `times` - the time taken to generate the global statistics for this dataset, in milliseconds
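The row-level ratios above have simple definitions. A minimal plain-Python sketch (hypothetical helper name, not the DataProfiler implementation; `duplicate_row_count` is interpreted here as the number of extra copies beyond each row's first occurrence):

```python
def global_row_stats(rows):
    """Compute the row-level global_stats fields for a list of row tuples.

    A row "has null" if any cell is None; it "is null" if every cell is None.
    """
    total = len(rows)
    has_null = sum(1 for row in rows if any(c is None for c in row))
    is_null = sum(1 for row in rows if all(c is None for c in row))
    distinct = set(rows)
    return {
        "row_count": total,
        "row_has_null_ratio": has_null / total,
        "row_is_null_ratio": is_null / total,
        "unique_row_ratio": len(distinct) / total,
        # one interpretation: copies beyond the first occurrence of each row
        "duplicate_row_count": total - len(distinct),
    }

stats = global_row_stats([(1, "a"), (1, "a"), (2, None), (None, None)])
```

Here `(2, None)` and `(None, None)` both "have null", only `(None, None)` "is null", and the repeated `(1, "a")` contributes one duplicate.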
#### data_stats:

* `column_name` - the label/title of this column in the input dataset
* `data_type` - the primitive Python data type contained in this column
* `data_label` - the label/entity of the data in this column, as determined by the Labeler component
* `categorical` - 'true' if this column contains categorical data
* `order` - the ordering of the data in this column, if any; otherwise 'random'
* `samples` - a small subset of data entries from this column
* `statistics` - statistical information on the column
  * `sample_size` - number of input data samples used to generate this profile
  * `null_count` - the number of null entries in the sample
  * `null_types` - a list of the different null types present in this sample
  * `null_types_index` - a dict mapping each null type to the list of indices at which it appears in this sample
  * `data_type_representation` - the percentage of samples identifying as each data type
  * `min` - minimum value in the sample
  * `max` - maximum value in the sample
  * `mode` - mode of the entries in the sample
  * `median` - median of the entries in the sample
  * `median_absolute_deviation` - the median absolute deviation of the entries in the sample
  * `sum` - the total of all sampled values from the column
  * `mean` - the average of all entries in the sample
  * `variance` - the variance of all entries in the sample
  * `stddev` - the standard deviation of all entries in the sample
  * `skewness` - the statistical skewness of all entries in the sample
  * `kurtosis` - the statistical kurtosis of all entries in the sample
  * `num_zeros` - the number of entries in this sample equal to 0
  * `num_negatives` - the number of entries in this sample with a value less than 0
  * `histogram` - histogram information for the sample
    * `bin_counts` - the number of entries within each bin
    * `bin_edges` - the thresholds of each bin
  * `quantiles` - the value at each listed percentile, based on the entries in the sample
  * `vocab` - a list of the characters used within the entries in this sample
  * `avg_predictions` - the average of the data-label prediction confidences across all sampled data points
  * `categories` - a list of each distinct category in the sample, if `categorical` = 'true'
  * `unique_count` - the number of distinct entries in the sample
  * `unique_ratio` - the proportion of distinct entries to total entries in the sample
  * `categorical_count` - the number of entries sampled for each category, if `categorical` = 'true'
  * `gini_impurity` - a measure of how often a randomly chosen element from the set would be mislabeled if it were labeled randomly according to the distribution of labels in the subset
  * `unalikeability` - a value denoting how frequently entries within the sample differ from one another
  * `precision` - a dict of statistics on the number of digits in each sampled number
  * `times` - the time taken to generate this sample's statistics, in milliseconds
  * `format` - a list of possible datetime formats
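A handful of the numeric fields above (including the `median` and `median_absolute_deviation` keys corrected in the first hunk) can be sketched with the standard-library `statistics` module. The helper name and the choice of fields are illustrative, not the DataProfiler implementation:

```python
import statistics

def column_stats(values):
    """Compute a few of the data_stats "statistics" fields for one column sample."""
    non_null = [v for v in values if v is not None]
    med = statistics.median(non_null)
    return {
        "sample_size": len(values),
        "null_count": values.count(None),
        "unique_count": len(set(non_null)),
        "unique_ratio": len(set(non_null)) / len(non_null),
        "min": min(non_null),
        "max": max(non_null),
        "median": med,
        # median of the absolute deviations from the median
        "median_absolute_deviation": statistics.median(abs(v - med) for v in non_null),
        "num_zeros": sum(1 for v in non_null if v == 0),
        "num_negatives": sum(1 for v in non_null if v < 0),
    }

stats = column_stats([0, 1, 2, 2, -3, None])
```

Note that nulls are excluded from the numeric statistics but still counted in `sample_size` and `null_count`.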
### Unstructured Profile

#### global_stats:

* `samples_used` - number of input data samples used to generate this profile
* `empty_line_count` - the number of empty lines in the input data
* `file_type` - the file type of the input data (e.g. `.txt`)
* `encoding` - the file encoding of the input data (e.g. UTF-8)
* `memory_size` - the size of the input data in MB
* `times` - the time taken to generate this profile, in milliseconds
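Two of these fields can be sketched directly from raw text. This is a plain-Python illustration with a hypothetical helper name; in particular, computing `memory_size` as the encoded byte length divided by 1024² is an assumption about the definition, not taken from the library:

```python
def unstructured_global_stats(text, encoding="utf-8"):
    """Compute a few unstructured global_stats fields from raw text."""
    lines = text.splitlines()
    return {
        # lines that are empty or whitespace-only
        "empty_line_count": sum(1 for line in lines if not line.strip()),
        # assumed definition: size of the encoded input, in megabytes
        "memory_size": len(text.encode(encoding)) / (1024 * 1024),
        "encoding": encoding,
    }

stats = unstructured_global_stats("first line\n\n  \nlast line\n")
```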
#### data_stats:

* `data_label` - labels and statistics on the labels of the input data
  * `entity_counts` - the number of times each label or entity appears in the input data
    * `word_level` - the number of words counted within each label or entity
    * `true_char_level` - the number of characters counted within each label or entity, as determined by the model
    * `postprocess_char_level` - the number of characters counted within each label or entity, as determined by the postprocessor
  * `entity_percentages` - the percentage of the input data covered by each label or entity
    * `word_level` - the percentage of words in the input data contained within each label or entity
    * `true_char_level` - the percentage of characters in the input data contained within each label or entity, as determined by the model
    * `postprocess_char_level` - the percentage of characters in the input data contained within each label or entity, as determined by the postprocessor
  * `times` - the time taken for the data labeler to predict on the data
* `statistics` - statistics of the input data
  * `vocab` - a list of each character in the input data
  * `vocab_count` - the number of occurrences of each distinct character in the input data
  * `words` - a list of each word in the input data
  * `word_count` - the number of occurrences of each distinct word in the input data
  * `times` - the time taken to generate the vocab and word statistics, in milliseconds
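The vocab and word statistics are simple frequency counts. A minimal sketch with `collections.Counter` (hypothetical helper name; the naive whitespace tokenization is an assumption, not the library's tokenizer):

```python
from collections import Counter

def unstructured_text_stats(text):
    """Compute the vocab and word statistics for unstructured input text."""
    words = text.split()  # naive whitespace tokenization (an assumption)
    chars = [c for c in text if not c.isspace()]
    return {
        "vocab": sorted(set(chars)),
        "vocab_count": dict(Counter(chars)),
        "words": words,
        "word_count": dict(Counter(words)),
    }

stats = unstructured_text_stats("to be or not to be")
```

For the sample text, `word_count` maps "to" and "be" to 2 each, and "or" and "not" to 1 each.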