PostgreSQL #8- Tuning Part 2 - Indexes

PostgreSQL #8- Tuning Part 2 - Indexes


In the last post I described how we can gather data from the database to identify some performance issues. In this post I I will focus mostly on indexes. In this post I'm going to answer the next questions : 

-What is an index ?
-How does an index look like ?
-How does index saved in disk ?
-Why using an index is sometimes faster then reading the table directly ?
-How the database decides whether to use an index or not ? 
-What are seq_page_cost and random_page_cost?
-What is an index only scan/ bitmap index scan ?
-How can I force the database to use a specific scan ?
-How to find tables that need an index ?

As I already said in the previous posts, you dont need to be a professional Data engineer to understand what I'm talking about. I'm trying to write those posts in way that most of the people can understand so just try to read it :)

So lets start : 

So what is an index ? A lot of people are familiar with that word but not everyone really understand what it is. Index is a data structure that improves the time it takes to retrieve data from tables. Exactly like tables, indexes are stored as an array of pages(or blocks if you prefer, 8KB as default) on your disk(More about the Database page layout).
Data files can contain data of a table or of an index. The difference between index`s data file to table`s data file is in the structure of the pages.

How does an index look like logically?  Lets try to understand the most basic and common index - btree index. As you can understand from its name, btree stands for binary tree. The btree nodes store the column`s data and a pointer to the relevant table`s page. Each node in the tree can have at most 2 children (left son, right son) : 

So why using an index is faster ? One thing that I didn't mention is that every right son is bigger then his parent and every left son is smaller then his parent. Now when you know that and how an index looks like I will try to explain when scanning an index  is actually useful. In postgres we have 2 important parameters that you should be familiar with : 

seq_page_cost is the cost of reading one page as part of a sequential scan while random_page cost is the cost of reading one page as part of a non-sequential scan - an index scan. Now lets assume we have the next table that contain the first name,last name and age of every student(1026 students for this case) and an index on the age column : 



Now, for the next examples lets assume that every record consume 1 page on disk(It isnt necessary the case.. I will focus on it in the bitmap index scan ). In other words, the cost of reading one row from the table will be exactly seq/random page cost(depends on the execution plan).

In the next 2 examples I will show you the execution plan that the database chose for some queries. I know that I didn't explain about it yet, but basically what you should know is that an execution plan is the plan that the database uses to run your query.

Example 1 :
In this example we want to retrieve all the content of the table. In such case, if we will use a sequential scan the cost will be 1*1026(seq_page_cost * num_of_records). However, if we will use an index to scan the entire table the cost will be 4*1026(random_page_cost * num of records). 


Example 2:
In this example I want to retrieve only a few record from the table. The database will decide to use the index that I created : 


As you saw earlier when we want to retrieve most of the table it is better to preform a sequential scan. In case it isnt clear why, reading most of the table means reading almost all the table`s pages. However, when we tried to query a small number of records we saw that its better to use an index. The rule of thumb here is that if you want to retrieve less then 20% of the table you should always use an index scan.


Does the database bring me the data from the index ? Well it depends what you want the database to retrieve from the disk. In our last example we asked the database to bring all the row and not part of it(the * in the query). If we have asked the database to bring us only the age column then the index itself already had that data and that saved the database time to access the table`s pages : 

This type of scan is called index only scan. In the regular index scan, when the db find an index entry that suits the query`s filter (the where clause) it uses the pointer in the entry and access the table`s relevant page. It does it for every index entry that meets the where clause conditions. 

Now what if I dont want to retrieve all the table but I also want to retrieve more then 20% of the table, for example I want 60% of the tables`s data. What will be more efficient then ?

Bitmap index scan : In this type of scan we combine both index and seq scan. The database decides to use the index and runs over the index entries. Instead of accessing the table`s page for every index entry  the database saves a list of all the values in the index that met the where clause. Afterwards, the database access the disk and if some records located in the same page, the database reads that page one time and not multiple times like it does in the regular index scan. Bitmap index scan is very often used when you try to filter your query with columns that 
each of them has it own index : 

Sometimes we dont understand why the database chose a to use a specific plan. I mean, a lot of times you might ask your self "why the hell the db didnt use an index here ?". Well, postgres has a feature that will might help you understand. By default all types of scans are allowed :

If we set some of those parameters to off, we can force the database to choose a different scan. In the next example, I forced the database to use an index/bitmap scan and not a sequential scan:


As you can see, the cost of using a bitmap scan was higher then the cost of a seq scan. This is exactly why the database preferred to use a seq scan in the first query.

I think that after seeing all these scans you should ask yourself a very important question :
How the database knows to calculate the total cost ? I mean, in order to decide the total cost of the query the database needs to know how many pages it gonna scan and calculate the cost based on that number. Well, that is a good question. The database is calculating the cost and decides what plan to use based on statistics that it gathers. The main tables that the database uses are the pg_statistics and pg_stats. Those tables contains very useful data like number of distinct values in the column,most common values,histograms. The next picture shows the content of the pg_stats for the students table that I used in this post : 

The last thing that I would like to focus on in the post is how we can identify tables that need an index . Well I think that the best 2 ways are checking the logs and the pg_stat_all_tables. If you followed my suggestions in the last post you should see in your database`s logs slow queries. A lot of queries are slow because they are missing an index. For example, I increased the size of the students table to 131,328 rows and then I tried to query it and filter with lname : 

I looked in the logs and saw that the same query was logged :

Another option is to query the table pg_stat_all_tables. This table contains statistics data about all the tables in your database : 

Search for tables that has a high value in the seq_scan column. It means that a lot of queries decided to scan that table sequentially. Now, in a lot of cases it might be the better option, but sometimes there might be a missing index.

In this post I used only the regular btree index. However, there are more type of indexes in postgres. I didnt want to explain about all of them because that is not what I tried to achive in this post. Postgres support many type of indexes that are useful in many cases. I wanted to show you what indexes are and how indexes improve the database performance. I hope that after reading this post you will feel more comfortable with indexes. I promise to you that I will write another post about all the indexes and their uses in this tuning series.


That is all for today. in the next posts I'm going to focus on : 
-Table bloats
-Explanation plans
-Vacuum - full & autovacuum
-Partitions

If you have questions feel free to leave a comment or send me a mail.

תגובות

פוסטים פופולריים מהבלוג הזה

PostgreSQL #7 - Tuning PART 1 - Where to start ?

PostgreSQL #6 - Backup & Restore