From the course: Complete Guide to Generative AI for Data Analysis and Data Science

Supervised and unsupervised learning

- [Instructor] Data analysis and data science are heavily influenced by machine learning, and the boundary between being a machine learning engineer and being a data scientist or data analyst is often blurred. In this course, we're going to focus on supervised and unsupervised learning. Before we delve into the details of those two subareas of machine learning, I want to talk about the broader domains of AI, and in particular I want to distinguish between generative AI and discriminative AI.

Generative AI is what we've been using throughout this course to create things like scripts that help us analyze data, prepare data, or generate visualizations. Generative AI focuses on creating new content similar to the content used in training: we might have models trained on images or on documents, and we use those models to generate new images or new text documents. Generative AI applications are often built on foundation models. These models are large and expensive to train, but they can be reused for many different applications, which is where the term foundation model comes from.

The other broad domain of AI and machine learning is discriminative AI, or discriminative machine learning, where we focus on learning boundaries between classes and making predictions about numerical attributes. Compared to generative AI, discriminative AI is computationally efficient.

Discriminative machine learning includes supervised learning, in which models are trained with labeled data. For example, if we have images, then in addition to each image we have a label, like "this is a dog" or "this is a cat." The label indicates what we want the model to classify that data point as. Supervised learning is also used to make predictions based on training data, particularly when we want to predict a numeric value from some set of attributes. The quality of a supervised learning model depends largely on the quality of its training data, so we spend a lot of time picking datasets that fit our problem framing and making sure there are no data quality issues in the training data.

The other area within discriminative AI that we're going to talk about is unsupervised learning, where we train models without labels. We might have images, but we don't label them dog or cat. Instead, the task is more like: here's a group of images, break them into subgroups and identify the natural groupings. We use unsupervised learning to explore datasets and find these subgroups. One challenge with unsupervised learning is that once you have the subgroups, it can be hard to measure their quality. Assessing whether the groupings make sense often requires expert knowledge of the domain, and if they don't quite make sense, you may have to adjust some parameters and try another clustering approach.

Now let's look at some examples of supervised learning. Classifying documents is a great example: given different types of news stories, you might want to label each one as domestic politics, foreign relations, economics, or local news.
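Here's a minimal sketch of that kind of document classifier, assuming scikit-learn is installed. The category names and headlines are invented for illustration; a real training set would need many labeled examples per category.

```python
# Supervised text classification: every training document carries a label.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny invented labeled dataset.
docs = [
    "Senate passes new budget bill after long debate",
    "Central bank raises interest rates to curb inflation",
    "City council approves funding for local park renovation",
    "Trade talks between the two countries resume this week",
]
labels = ["domestic politics", "economics", "local news", "foreign relations"]

# TF-IDF turns raw text into numeric features; logistic regression
# then learns decision boundaries between the labeled classes.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(docs, labels)

# Classify an unseen headline.
print(model.predict(["Parliament votes on the new tax proposal"]))
```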
Document classification, then, is an example of supervised learning, as is identifying objects in images. Medical diagnosis is an important area: we might look at a set of symptoms and try to determine what underlying disease could produce them, or we might analyze an image and try to determine whether something in it indicates a particular pathological state. Fraud detection is important, especially in financial services but in other industries as well. For example, a company that streams videos and manages accounts may want to ensure that multiple people aren't using one account at the same time; that can be treated as fraud or invalid-use detection. Generally, categorizing objects broadly falls under the umbrella of supervised learning.

Now, there are some challenges to supervised learning. As we get going, you'll quickly see that you really need quality labeled data, and that can be hard to get. There has been a lot of research in supervised machine learning on generating additional labeled data, such as synthetic data, to complement small existing datasets.

Another challenge is that we may overfit the training data. Overfitting means the model has, in effect, memorized the training set: it can repeat back exactly which category each training example belongs to, but it tends not to generalize, so when the model sees something that wasn't in the training dataset, it doesn't perform well. We want to watch for overfitting; I'll sketch a quick way to check for it below.

In some cases, the performance we get from supervised learning, the quality of the model we build, depends on feature selection and feature engineering. This varies by algorithm: some algorithms are good at detecting useful features for us, and others aren't. So depending on the algorithm and how we go about feature engineering, we may have a more or less difficult time building a model that meets our needs.

We also want to watch for bias in training data, because biased training data leads to biased predictions. For example, if our training set is a skewed sample that doesn't reflect the population we're working with, then our model will be skewed the same way and may not work well when we apply it to the general population. So we want to identify and eliminate bias in our training data.

There are a large number of supervised learning algorithms out there, and we'll talk a little bit about different categories, but some specific examples are linear regression and logistic regression, which are very popular and relatively straightforward to use. Decision trees, and variations on decision tree algorithms, work quite well; those variations include random forests. XGBoost is also tree-based, but it adds a technique called gradient boosting: as it trains, it focuses on the examples it got wrong and tries to improve on those errors. Neural networks work really well when the simpler algorithms don't. For example, linear regression works well when we have linear relationships, but the problems we tackle with supervised and unsupervised learning often involve non-linear relationships, and neural networks handle those well.
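Here's the overfitting check I promised: hold out a test set and compare training accuracy with test accuracy. A large gap suggests the model has memorized the training data. This is a minimal sketch using scikit-learn's built-in digits dataset.

```python
# Detecting overfitting by comparing train vs. test accuracy.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# An unconstrained decision tree can memorize the training data.
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("train accuracy:", tree.score(X_train, y_train))  # typically near 1.0
print("test accuracy: ", tree.score(X_test, y_test))    # noticeably lower

# A large train/test gap signals overfitting; constraining the model
# (for example, with max_depth) is one way to regularize a tree.
```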
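And to make the tree-based ensembles concrete: XGBoost is a separate third-party library, so as a stand-in, here's a sketch using scikit-learn's own random forest and gradient boosting classifiers, which illustrate the same two ideas of averaging many trees and of sequentially correcting errors.

```python
# Random forests average many independently trained trees; gradient
# boosting trains trees sequentially, each one focusing on the errors
# of the ensemble built so far.
from sklearn.datasets import load_digits
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
boosted = GradientBoostingClassifier(random_state=42)

for name, model in [("random forest", forest), ("gradient boosting", boosted)]:
    model.fit(X_train, y_train)
    print(name, "test accuracy:", model.score(X_test, y_test))
```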
Now let's shift gears and look at some examples of unsupervised learning. One example is clustering for market segmentation. In business, if you're trying to understand your customer base and the subgroups within it, perhaps so you can target different groups with different offers, that's market segmentation, and it's an example of unsupervised learning.

Anomaly detection is another type of unsupervised learning, and it's useful because it can be the foundation for more specific kinds of problems. For example, you can use anomaly detection to understand when a device or machine might be about to break down: there may be detectable anomalies that precede a sudden failure. Anomaly detection can also be useful in areas like fraud detection, since fraud can be treated as a type of anomaly.

Recommendation systems are really popular: whenever we get recommendations about books to buy or videos to watch, we're seeing the product of an unsupervised learning model at work.

Document clustering is another example, and it's a type of clustering, like market segmentation. Customer data can be highly structured; we might have a well-defined relational model that describes a customer, or we might have a document model, like a JSON structure, holding a lot of information about a customer. Those are examples of applying clustering, that is, unsupervised learning, to structured and semi-structured data. Document clustering is an example of applying unsupervised learning to unstructured data. Unstructured data includes things like text, images, and video; they don't fit well into JSON or tabular structures, so we consider them unstructured. Image clustering is another example of applying unsupervised learning to unstructured data.

Now, there are some challenges to unsupervised learning. As I mentioned before, the results can be difficult to interpret because we don't have a ground truth to compare against. We can't say, "Right, this particular data point belongs in group A or group B"; we don't have those labels. We also run the risk of overfitting the training data: the model can work really well on the training data but fail on data it hasn't seen before.

Some unsupervised learning algorithms include K-means clustering, which is easy to implement and easy to understand; it's the straightforward one. Hierarchical clustering works by building a hierarchy of clusters. Think of the entire training set as one cluster, which we break down into two or three clusters, and then each of those gets broken down into some number of clusters, and we keep doing that until we no longer divide our clusters into subgroups.
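Here's a minimal K-means sketch, assuming scikit-learn; the synthetic blobs stand in for customer attributes in a segmentation problem. Because there are no labels to compare against, an internal measure like the silhouette score is one common way to gauge cluster quality.

```python
# K-means clustering, with a silhouette score as a label-free
# measure of how well separated the clusters are.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic "customer" data: 300 points around 4 hidden centers.
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)
print("cluster sizes:", [list(kmeans.labels_).count(c) for c in range(4)])
print("silhouette score:", silhouette_score(X, kmeans.labels_))  # closer to 1 is better
```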
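For hierarchical clustering, note that most library implementations, including scikit-learn's, build the hierarchy bottom-up by merging clusters (agglomerative) rather than splitting top-down, but the resulting hierarchy is the same idea. A minimal sketch:

```python
# Agglomerative (bottom-up) hierarchical clustering.
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Cut the hierarchy at 3 clusters; "ward" linkage merges whichever pair
# of clusters least increases within-cluster variance at each step.
hier = AgglomerativeClustering(n_clusters=3, linkage="ward").fit(X)
print("labels for the first ten points:", hier.labels_[:10])
```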
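And going back to anomaly detection for a moment: an isolation forest is one common approach, and scikit-learn ships one. Here's a sketch with invented sensor-style readings where a few points sit far from the rest.

```python
# Anomaly detection with an isolation forest: points that are easy to
# isolate from the bulk of the data get flagged as anomalies.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))   # typical readings
outliers = rng.uniform(low=6.0, high=8.0, size=(5, 2))   # far-off readings
X = np.vstack([normal, outliers])

detector = IsolationForest(contamination=0.02, random_state=42).fit(X)
flags = detector.predict(X)  # +1 = normal, -1 = anomaly
print("points flagged as anomalies:", int((flags == -1).sum()))
```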
DBSCAN is a density-based clustering algorithm. It looks for dense regions within a training set and treats each high-density region as its own cluster, while points that don't fall in any dense region are left out as noise.

Principal component analysis is an interesting one. It uses linear algebra methods to reduce the number of dimensions we're working with, and it explains the variance in the data: which combinations of attributes account for the variation we see.

Autoencoders are a really interesting method for reducing the number of dimensions, or the number of features. With autoencoders, we take a large set of features and build a small vector of numerical values that describes the input dataset in such a way that we can regenerate it. An encoder maps the input to this smaller representation, and a decoder maps it back to the more explicit set of features. Autoencoders have a lot of interesting use cases in unsupervised learning.

The Apriori algorithm is an interesting one too. It's used when we want to understand commonly occurring collections of items, often called market basket analysis. The idea of a market basket stems from shopping: if you buy one thing, you often buy another. For example, in the United States, if you like peanut butter, you often buy jelly. The Apriori algorithm is a good way of identifying those sets of items that frequently show up together in transactions.
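scikit-learn doesn't include Apriori, so this sketch assumes the third-party mlxtend library; the transactions are invented for illustration.

```python
# Market basket analysis with the Apriori algorithm (via mlxtend).
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

# Invented transactions.
transactions = [
    ["peanut butter", "jelly", "bread"],
    ["peanut butter", "jelly"],
    ["bread", "milk"],
    ["peanut butter", "bread", "milk"],
]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Keep itemsets that appear in at least half the transactions,
# then derive association rules like {jelly} -> {peanut butter}.
itemsets = apriori(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```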
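Circling back to principal component analysis, here's a minimal scikit-learn sketch that projects the 64-pixel digits down to two dimensions and reports how much of the variance each component explains.

```python
# PCA: reduce 64 features to 2 and inspect the explained variance.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)   # 1797 images, 64 pixels each

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print("reduced shape:", X_reduced.shape)  # (1797, 2)
print("explained variance ratio:", pca.explained_variance_ratio_)
```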
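And here's a minimal sketch of the encoder/decoder idea behind autoencoders, assuming TensorFlow/Keras is installed: compress those same 64-pixel digits to an 8-number code, then train the network to reconstruct the input from that code.

```python
# A small dense autoencoder: 64 features -> 8-number code -> 64 features.
from sklearn.datasets import load_digits
from tensorflow import keras

X, _ = load_digits(return_X_y=True)
X = X / 16.0  # pixel values run 0-16; scale to 0-1

encoder = keras.Sequential([
    keras.Input(shape=(64,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(8, activation="relu"),      # the compressed code
])
decoder = keras.Sequential([
    keras.Input(shape=(8,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(64, activation="sigmoid"),  # reconstructed features
])
autoencoder = keras.Sequential([encoder, decoder])

# Train the network to reproduce its own input: no labels needed.
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=20, batch_size=32, verbose=0)

print("compressed representation shape:", encoder.predict(X[:5]).shape)  # (5, 8)
```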
