PostgreSQL #4 - Physical Layer - Files

PostgreSQL #4 - Physical Layer - Files


In the last post in the blog I described most of PostgreSQL`s architecture. If you didnt read that post I suggest you to do it in order to help you better  understand the things that I'm going to describe in this post and in the next posts.

In the title of the post I mentioned "Physical Layer". One might ask - what does it mean ? In this post I'm going to talk about PostgreSQL`s file system layer.  I'm going to explain the importance and the meaning of the directories and the files that were created during the installation (check it out in the second post in the blog). I'm going to focus mainly on the data directory - pg_hba.conf,postgresql.conf, pg_xlog,base,pg_tblspc.  

Installation directory

When you installed PostgreSQL you had the option to choose where to install it. If you installed it via yum or you didnt manualy choose the installation directory it is likely that it was installed in /var/lib/pgsql. If you followed me in the second post it was installed in /PostgreSQL/9.6.3. In this directory you will find the following directories : 

-The bin directory includes many commands/tools that we can use to manage our instance like : start/stop the instance,create/drop database,backup and restore operations,upgrade and so on.. Therefore, you should add the path of that bin directory to the $PATH variable if you didnt do it until now.
-The share directory consists of documentation files,man directory and samples of important files in the data directory. To be honest there are some important files in that directory but we will not focus on them. You should add the man`s directory path to the $MANPATH variable in order to allow you to use the man command to get help on some postgresql`s tools/commands.
-The lib directory contains many libraries that postgresql is using. 
-The include directory contains PostgreSQL`s headers files.
-The most important directory is the data directory and in this post we will focus mainly on that directory. You should set the variable $PGDATA to be the path of the data directory.

The data directory

The data directory might look different in some versions of PostgreSQL but most of the important files/directories have the same name/structure. My data directory is located in /PostgreSQL/9.6.3/data. When I list  the content of that directory(ls -ltr) I get the following output : 



-pg_hba : One of the most important configuration files in PostgreSQL. This file determines who can connect to our database. In other words it is the bouncer to our party. In this file we can control which hosts are allowed to connect,how clients are authenticated,which PostgreSQL user names they can use and which databases they can access. The file contains records that has the next structure : 

1)The allowed values for 'connection_type' are : 
local' - connections that come from the local machine'-
host' - connection attempts that are made via tcp/ip'-
-'hostssl' - connection attempts that are made via tcp/ip with ssl encryption.
-'hostnossl' - connection attempts that are made  via tcp/ip with out ssl encryption.

2)The allowed values for 'database' are :  
-'all' - which means that the user can connect to all the databases
-specific_db_name - which means that the user will be able to connect only to that specific database. For example if i want to allow connections only to the database pgmc I will write in this column pgmc.
-'sameuser' or 'samerole' - allow connections to the database that has the same name as the user or as the role.
-'replication' - allow connections to the replication database. In order to allow a user to have connection rights to that database you must specify this record! The all option doesnt cover the replication database.
-a list of databases names seperated by comma for example : db1,db2,db3

3)The allowed values for 'user' are : 
-'all' - all users
-specific user or role(a group name should start with +)
-a list of users names seperated by coma for example : user1,user2,user3

4)The allowed values for 'address' are :
-hostname or ip address(ipv4 or ipv6) of a specific hostname. If you choose to use an ip you must specify the mask of that address.
-Multiple hosts - you can specify an ip address and the net mask of your network. As a result all the hosts in your network will be able to connect to the database

5)There are many connection methods, the most common are : 
-'password' -> password is needed to connect but during the check the password isnt encrypted.
-'md5' -> same as password but the password is encrypted.
-'trust' ->no password is needed.
-reject ->reject connections
-pam -> authentication via pam
-ldap -> authentication via ldap

6)The option is optional and depends on the connection method you chose.

Now lets use my current pg_hba file to show you some examples : 

The first row allows to the user mariel connect from the local machine without typing password. The second row rejects all connections from the server 192.168.54.20. The third row allows incoming connections from all users to all databases with md5 authentication.

Now let me ask you a question, in the next screenshot, do you think that the user postgres will be able to connect localy ? 


The answer : 

Remember that the first record that is related to a specific user is the one that matters. As you can see in the last example I had 2 records that influenced the user postgres. One directly should rejected connections from the user postgres but the other one should allowed all connections including connections from the user postgres.

Note 1 : When you use special keywords like all,sameuser,samerole,replication dont quote them. When you quote one of those keywords they lose their "special" meaning.

Note 2 : Changes in this file doesnt require reboot to the instance, you can just reload the settings via the command pg_ctl reload (make sure the bin directory is part of your $PATH).


-postgresql.conf - the main configuration file of the database. This file contains ALL the parameters of the database. The file itself contains explanations for each parameter. In my machine there is 16GB of ram so I adjusted the parameters according to my environment. There are many parameters so I wont explain all of them but I'll focus on some of them which are very important:

1)listen_address='*' 
This parameter indicates on what network interfaces postgresql listens for connections. By default it is set to 'localhost' so remote connection would get the next error   " could not connect to server: Connection refused Is the server running on host "" and accepting TCP/IP connections on port 5432? "

2)port=5432
By default this parameter is set to 5432. The postmaster listens on that port for incoming connections.

3)max_connections=100
This parameter assigns the max number of connections allowed to the instance.

4)shared_buffers= 4GB (1/4 of your ram memory)
By default this parameters is 128Mb. It is recommended to set this value to 1/4 of your ram memory. Without expanding too much, the shared buffer contains data that is accessed often(reading from memory is cheaper than reading from disk). For more info, check the previous post about PostgreSQL`s architecture.

5)work_mem = 128MB               
The work mem is a memory that is allocated for sort or related join operations. Because it is allocated per operation dont set this value to hight because you can find yourself in a situation where is used all your memory. By default it is set to 4MB.

6)maintenance_work_mem = 4GB (1/4 of your ram memory)
The maintenance work mem is a  memory that is allocated for maintenance operations like : building an index, alter table, vacuum. Dont be afraid to set this parameter to a high value because in contrary to the work_mem there arent many maintenance operations running in the same time. 

7)wal_level = minimal/replica/logical
This parameter defines how much information is written to the wals(Write Ahead log). Well it depends what environment you are configuring right now and if you want to have the ability to restore your database to a point in time.The default value is 'minimal' and it is enough to be able to start your database after a crash or after an immediate shudtown.The 'replica' option adds data to the wals that allows creation of archives->Point in time recovery and allows running read-only queries on standby(In previous versions it was known as hot_standby and archives). The last value is 'logical' and it allows logical decoding.

8)wal_buffers = -1
The default value is -1 and it depends on the size of the shared_buffers. There isnt any recommended size to set it because I dont know what is the purpose of your database and how much work it will handle. However, I can suggest you to start with the default, and check what is the frequency of your checkpoints(remember what it is ?). If you see that it happens very often start from increasing this parameter. On my environment it is set to 16MB.

9)checkpoint_timeout=10min,checkpoint_completion_target = 0.7
Maximum time between two checkpoints. If since last checkpoint there wasnt any checkpoint for 10 minutes force a checkpoint.The default is 5 minutes. The checkpoint_completion_target
 means that the current checkpoint operation aims to finish by the time 70% of the next checkpoint is here.

10)max_wal_size/min_wal_size/archive_command
I wont describe both of the parameters,but if you decided to set the wal_level to replica/logical you should read about those parameters and try to play with them..

11)effective_cache_size=8G
The default is 4GB. This parameter tells the optimizer how much memory is available in the os level. This allows the optimizer to have a better look on the cost of seq scan and index scan. When the server doesnt have a high load and works normalls check with free -m how much memory is available. Use that value as the effective_cache_size.

12)logging_collector=on
By default it is set to off. I recommend you to set it to on.  It enables a background process that captures log messages sent to stderr and redirects them into the log files(PostgreSQL`s documentation).

13)log_min_error_statement = notice
Controls which SQL statements that cause an error condition are recorded in the server log.


14)autovacuum = on
Enables the background process autovacuum.

15)log_Statement='all'
Write to the serverlog all the ddl/dml operations. Mybe when your database will be big it wont be usefull because the server log will be huge. If you have elastic search or logs analytic tool I recommend you to enable the log_statement to max and analyze it with a specific tool for it.

16)log_checkpoints = on
You will be able to see in the server log when checkpoint accured and what is its frequency.

17)log_line_prefix = '%t%u%d' 
Show more details in the serverlog for each message(username,timestamp,databases)

There are many more parameters. As you dig deeper into postgresql you will find many more parameters that are relevant for you. DO IT! You can gain alot from it.

-base directory : This directory contains all the data files(by default) of the instance. When you cd to this directory you will see that it consist from some directories:

Each one of those directories represent a different database in the cluster. Each database in the cluster has a unique number which is known as oid - Object Identifier. For example, the directory 1 is the directory that contains the datafiles of the database that his oid is 1 - template 0. 

Under the directories in the base directory you can find alot of data files. Those files represent tables/indexes. Like the oid of the databases in this case the number that identifies the object is called filenode. You can find the filenode of a table/index in one of two methods : 

There are many more terms in the base directory that we can cover like free space map,visibility map,tuples,ctid and so on.. But I think that it is too early right now and it just will make this post longer.

-PG_VERSION - includes the major version of your postgresql database.

-pg_log - contains the server log and all the logs that created as a rotation of the server log. The most important log in postgresql!

-pg_global - contains data files of the system tables -> all the pg_* like pg_class,pg_tables and so on..
-pg_xlog- This directory is also very important. It consist of all the wal files that are generated. It is very common to set this directory as a soft link to a different directory on a different disk because wal generates alot of I/O. This usually improve the performance of the database very much. 
(Do not do it when the database is up!)

pg_tblspc - tablespaces allows the dba to save the data files (those we saw in the base directory) in different location and not in the default path. Why this is usefull ? Well, firstly, we can put heavy objects in fast disks and then many operations on those object will be much more faster. Secondly, if the mount the database was created on is full, we can work on a different mount (different tablespace) until we can solve our space limition. By default 2 tablespaces are created : pg_default which includes databases template 1/0 and pg_global which includes the system tables. We can create new tablespaces and in pg_tblspc you will find soft links to the real location of the datafiles in that tablespace.


To sum it up, you should understand in the end of this post the physical layer of PostgreSQL. What are the config files and where the datafiles are located. What basic action we can do to improve the performance of our database and what parameters are very important to its daily work. I didnt explain about all the parameters and about all the directories and files in the data directory. You can find more data in Postgresql`s documentation - https://www.postgresql.org/docs/9.6/. If you have any questions or you think that I have a mistake or I forgot to mention something please reply or send me an email :)







תגובות

פוסטים פופולריים מהבלוג הזה

PostgreSQL #7 - Tuning PART 1 - Where to start ?

PostgreSQL #8- Tuning Part 2 - Indexes

PostgreSQL #6 - Backup & Restore