From the course: Creating a Book Search Engine from Scratch Using Java and GitHub Copilot
Implementing full-text search in PostgreSQL
From the course: Creating a Book Search Engine from Scratch Using Java and GitHub Copilot
Implementing full-text search in PostgreSQL
- [Instructor] All right, so as we can see we have four books, and now we need to create a full text search index. And in order to do that, the first thing that we need to do is to create a vector, and we will do that on the books table. So, I'm going to paste the command. So we are altering the books table, and we are adding a new column called search_vector of type tsvector, and this is where a similar inverted index that we saw in the spreadsheet, that was an example, but this one is where the index will be stored. So, first thing is we need to create this column. So I'm going to run this, and it says, "Query return successfully." So I'm going to see what we have now, so running everything. So if I go all the way at the end, we have search vector, which is of type tsvector, but all of the data is null. So the next thing we need to do is to populate this search index. So, we are going to run the update script, and let me just paste this query first. So, we are updating the books table, and we are setting the information in the tsvector. So, we are converting it using the English language, and we are using three columns to create the search index, which is title, description, and ISBN. And A, B, C refers to the priority, so it means that when the search happens, if the search matches first in the title, it'll get the higher preference over the description and over ISPN and so on and so forth. And we are doing a to_tsvector, which means we are taking the exact content that is available in these columns and performing the rules, such as normalization and stemming so that they are stored in the index in a format which is easier to look up. Now I'm going to run this query and see what happens. It gave me an error because I selected only tsvector, so I'm going to select this entire piece of code and then run it again, and it updated the query. So it says it updated four rows. So we'll go back to the first select statement, run it, and then go all the way to the right. And now, as you can see, it created this search index and populated this index for us. It looks similar, but it's not exactly similar, it is much more involved with regard on how this activity works. I gave you a very simple example on (indistinct) search index, but this code that is implemented by Postgres is much more involved and advanced. So now that we have our search vector updated, we need to start making our search queries. So for that, what I'm going to do is take a query that I already created about a select statement, paste it so that we can talk through it. So, we are selecting title, and we are also using ts_rank, which is going to provide us the relevance or the ranking score on what the search score was. And the way it works is we tell it which column contains our vector and then we are taking our query, which is algorithms in English language, and we are converting that into a query object. Now, you see this to_tsquery appeared twice. It's because any query that the user is going to provide you, which is algorithms in this case, needs to pass in through to_tsquery in order to make it a queryable object so that when it is searched against the search vector, they both are doing apples-to-apples comparison. Now, we are using it two times, first time is to help create the relevance score, and the second time is to actually make the filter query. So we are first doing the select, what all we need. We need title and we need the rank. We are searching in the books table and in the WHERE clause we are saying WHERE search_vector, which is the column which we populated a few minutes ago. And we are using the user query and converting that into a to_tsquery. So we are saying select all these things from the table where the search vector is a queryable object based on the user query called algorithms. And we are ordering by the rank, which is the rank, a ts_rank, and let's go ahead and run this now. So when we run, we found only one book with a score. And this is one way where when the user provides algorithms and it matches anywhere in the database, it is being looked up into the search index and getting returned. Now, this same query can get much more advanced. So for example, if somebody is searching for anything that has algorithms or designs, so in this case I will say algorithms or design, the pipe is or, and as we have used to_tsquery in two places, I need to make this change at two places. And if I now go ahead select this and run, we found two rows matching the criteria. If I move this a little bit here, the first one is matching because of the design keyword, and the second one is matching because of the algorithm's keyword with a different score altogether. So this is where the user is giving you A or B kind of queries. The same query can be updated using the pipe operator. If you want to model A and B, you use the and symbol, and I'm going to change it here again. And if I select and run it again, nothing matched because there is no such book that contains the word algorithms and design. But if you go back and change it to or condition, we found two books as we saw a few moments ago. So let me just run this. We saw two books. Now the point is that how this to_tsquery works, so in order to find how this works, I'm going to copy this and say select and paste it here. So we are saying give me the output for calling this function to_tsquery on this query, which is algorithms or design. So I'm going to select and run it. So, behind the scenes, see what it did? From algorithms, it went to algorithm, which means it performed these stemming and from design, it kept design. So if I change it to designers and algorithms to algorithmic, algorithmic, and then run it again, let's see if anything changes. So it changed it again. I think I had a spelling mistake. It was algorithmic. Let me just run it again. So changed it back to algorithm and design. So, the algorithm, the search index creation is smart enough to know how to normalize and how to stem the words so that it's easier to match based on the user query, because the user query is also going through the same function for normalization and stemming. And the matching is easier, because it does a lookup, finds which book ID it is, and returns the result. Now, if I changed it to capital Algorithmic and D capital and then run it again, it again changed it to the same word because it normalized from capital A to small a and small d from capital D. Great, so I hope that this gave you some sense of how full text search in Postgres works. Now this is a very small subset that I have covered, but if you're interested in finding more or you have a use case which is similar or require more changes, I would recommend looking at the Postgres documentation, chapter 12 on full text search, and you will find all the different sort of things, including the examples. All right, fantastic. So to recap what we have done so far, so we have a docker container, we have a database, we are leveraging pgAdmin in Docker to manage the database. We have successfully inserted four books, and then now we have created a search vector, which is the search index on the books, and we saw how the full text search queries work in the context of Postgres database. Great, now we are only working in the context of four books. As I mentioned that, we have 20,000 books that we have to work on. So, in the next step what we are going to do is we are going to write code that can insert the books information sitting remote somewhere on GitHub servers into our database that is sitting in our Docker containers before we move on to business logic and APIs. All right, fantastic, and I will see you in the next video.