From the course: Learning the R Tidyverse
What is tidy data?
- [Instructor] So what is tidy data, and why is the tidyverse named after it? Understanding this is crucial to learning how to use the tidyverse for problem-solving. Tidy data has a very specific definition: each column represents a single variable, and each row represents a single observation. Untidy data, by contrast, has no consistent logic to how it's organized.

This is a screenshot of an Excel file I made of data from the United Nations. Does this data feel tidy to you? It's not, for two reasons. There is a separate column for each month, and there's nothing in the data to tell us what the values represent. This could be any variable. It could be GDP, millions of internet users, or even hectares of agricultural land per country.

This is the same data tidied up. Each column represents a unique variable (country, year, month), and we now know that the values represent births per month.

Now, how would we add deaths to this dataset? That's an interesting question. When we talk about tidy data in the tidyverse, we also talk about data being wide or long. It's not a simple one-to-one relationship between untidy and wide. If I described a dataset as being wide, I'd often be talking about a dataset that is in many ways quite tidy, but where a few columns could be made tidier. In our example, I've added a new column called deaths, which has added some width to the dataset. The theoretical tidy data representation of this dataset would collapse the births and deaths columns into a single column of values, plus a new column to record which UN measure each value represents. Admittedly, when we're working with just two variables, this doesn't feel like a big difference. But if I added 10 variables, it would be very clear that the first approach creates wide data, and the second creates long data. The tidyverse is fully equipped to transform your data between wide and long formats through the pivoting functions.

So that's tidy data. Why is it useful? This beautiful cartoon from Allison Horst explains it.
Tidy datasets all look alike and are fairly literate, whereas messy or wide datasets are each messy or wide in their own unique way. In our example, we didn't know our data was about births in the initial wide representation. Tidying data forces us to name our variables sensibly and to structure them so they are ready for wrangling, visualizing, or modeling with the tidyverse.
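
The wide-to-long reshaping described in this video can be sketched with the pivoting functions from tidyr. This is a minimal illustration rather than the instructor's own code: the country names, the numbers, and the object names (`births_wide`, `births_tidy`, `vitals_wide`, `vitals_long`) are all made up for the example and are not the United Nations data shown on screen.

```r
library(tibble)
library(tidyr)

# Hypothetical untidy data: one column per month, and nothing that says
# what the values measure. All values are invented for illustration.
births_wide <- tribble(
  ~country,    ~year, ~January, ~February, ~March,
  "Country A",  2020,      100,        90,    110,
  "Country B",  2020,      200,       210,    190
)

# Tidy it: one row per country, year, and month, with a named value column.
births_tidy <- pivot_longer(
  births_wide,
  cols      = c(January, February, March),
  names_to  = "month",
  values_to = "births"
)

# Adding deaths as its own column makes the dataset a little wider.
# The order matches births_tidy: three months for Country A, then Country B.
vitals_wide <- births_tidy
vitals_wide$deaths <- c(80, 70, 75, 150, 160, 155)

# The long representation collapses births and deaths into a value column,
# plus a column recording which measure each value belongs to.
vitals_long <- pivot_longer(
  vitals_wide,
  cols      = c(births, deaths),
  names_to  = "measure",
  values_to = "value"
)

# pivot_wider() goes the other way, from long back to wide.
vitals_wide_again <- pivot_wider(
  vitals_long,
  names_from  = measure,
  values_from = value
)

print(vitals_long)
```

With only two measures, the wide and long versions look similar. With ten measures, `vitals_wide` would gain ten columns, while `vitals_long` would keep the same five columns and simply gain rows.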
Contents
- What is tidy data? (3m 2s)
- Why does ggplot2 want tidy data? (4m 22s)
- Using pivot_longer() to tidy data into a long format (4m 26s)
- Cleaning column names with the janitor package (3m 21s)
- Tidying columns containing multiple values with separate_*() (4m 44s)
- List columns and nested tibbles (5m 3s)