Tuesday, February 10, 2015

Evolution of GIN index - PostgreSQL9.4

0. Introduction
PostgreSQL9.4, the latest version of PostgreSQL has been released at 18th Dec, 2014. This changes contain not only various newly functions but also evaluation existing function. One of them, the evolution of that GIN index is used for full text search, are
  1. Compression of GIN index
  2. Improvement of query speed in specialized case
This article would show the detail of these improvement. If you thought that what's the GIN index? or what's the full text search?, you can read and refer to this(Japanese).

1. Compression of GIN index
In version 9.4, the size of GIN index has reduced in large amounts using by compression method called "verbyte encoding". The image are followings.

In 9.3 or earlier, posting list has each ctid of tuple(block number and offset number). In contrast version 9.4, posting list has only a number which show the difference between one before item. Therefore, we can reduce the size of index 19 bytes in above case. The difference which shows its block id and its offset number is used under the following rules.
  • Use 1-6 bytes for showing difference
  • Therefore, we can show number up to 127  in one byte.
  • 8th bit is set to '1' for showing over 128.
  • Therefore, the setting 1 to 8th bit means "we use also next byte for showing difference"
  • The first 10 bits shows offset number, other bits show block number
For example,
127 = 01111111
128 = 11111111 00000001

2. Improvemtn of query speed in specialized case
In full text searching, some keys are used in searching are generated based on query key word. And postgres returns query result using by these keys. (For example, pg_bigm generates 2-characters keys). version 9.3 or earlier has been using all keys(ctid) of each posting list to judge when postgres search tuple which matches to all key of posting list, so far.

But, in above case, there are no need to judge all keys. It's just enough to use two keys "と京" has. That is, postgres need to judge in "some ctids whose posting list less most keys has" to avoid unnecessary scans.

In version 9.4, postgres uses keys whose posting list is less most keys9.4 has, to match keys. This changes leaded to improve performance.  But it does not mean for all cases. It will be effective for query which has rare data and frequent data, especially.