Posts tagged ‘postgresql’

Full text search in Postgresql

I was always a fan of the Postgresql rdbm, not that I have anything against other databases, but Postgres is just to powerful for something that is free. One feature that I love is full text search, given the importance of text search in modern websites. This is not a Postgres exclusive feature, but in my humble opinion, postgres does it very well. Here are some of the coolest advantages of this feature in Postgres (or skip to the tutorial part):

Out of the box support for many languages

As far as I know there’s Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish and Turkish out of the box. And there’s also something called “simple” for general use. Other languages can be easily added.

Distinguishing between languages is important because the way Postgres works with documents. When parsing a document, Postgres considers many aspects, some of them are language specific, like:

  • stop words, which are those very common words we prefer to ignore, like “and”, “of” and “in”, we don’t want a document to rank better than other just because the word “and” shows up an absurd amount of times, and the search term include the word “and”;
  • synonym dictionary, which allows us to consider synonyms equally. so the word “daring” would match against a search for “fearless”;
  • thesaurus dictionary, pretty much the same as above, but we can work with full sentences. This is specially nice to match abbreviations.
  • word normalization, also known as snowball dictionary, postgres converts every common variants of a word to a simple base, like plurals and verbal forms.
  • ispell dictionary, basically the same as the snowball dictionary, but way more powerfull. By default postgres does not include any ispell dictionary, but they can be easily added.

We can store parsed documents, and create indexes for quicker queries

Obviously, all this dictionary parsing has some processing costs, but we can dramatically reduce them storing the normalized documents. For example, lets assume that we have a table which stores our website news. After inserting a news in the database, we can create a normalized document and store it in a different column, and then use that column to perform our queries, and of course, we can use a trigger to perform this operation automatically. Best of all, is that we can add different weights to different parts of the news (see bellow).

Then we have two types of indexes, GIN (Generalized Inverted Index) and GIST (Generalized Search Tree). We can create these indexes on our normalized column, just like we create a primary key index. I’m not getting into much details about this, but generally we’ll use a GIN index for static data, like the news table for example, which does not change often, and GIST for highly dynamic data.

Different weights for different parts of a document

This is the nicer part. When creating the normalized document (by the way, this is usually called a tsvector), we can define different weights for each part of the table.  Once again, lets assume we have our news table, which contains a title, a teaser and a body, if we assign different weights to each of these columns, when we perform a search, the place where the keywords show up will be an important factor when ranking the search results.

And some other stuff, but enough chit chat and lets go for an example

Let’s start building our news table then. First of all, we need to create it:

CREATE TABLE news
(
title character varying,
teaser text,
body text,
id serial NOT NULL,
CONSTRAINT news_pkey PRIMARY KEY (id)
)

Simple, just the title the teaser the body, and a very standard primary key field. There’s still a couple of columns missing, but that’s ok, we can add them later. First there are some examples I want to show. We can begin by inserting some test data, and now we’re ready to make some full text searches. This is where all this stuff differs from standard pattern matching or regular expression matching, basically, what we are going to compare are not two strings, but two different data types, a tsquery and a tsvector, also known as a document. A tsquery works like a pattern, we can build one suing the funtion to_tsquery, which gets two parameters, the language of the query (optional), and the query itself. Optionally we can just typecast a query string, and the language will be set to the default for your database, usually english.

A tsvector is our normalized document. It holds a vector of lexemes (a lexeme is a normalized word) among with information on where on the document that lexeme appears and it’s weight. So a simple query could look like this:

SELECT * FROM news WHERE to_tsvector(title || teaser || body) @@ to_tsquery('some & text & search');

The @@ operator is used to match a tsquery against a tsvector. As you can see, the string passed to the to_tsquery function is not just a plain list of words, the & or | operators are mandatory between words, and represent how we want to match that word, just like the AND and OR operators  in a google search. We can use Parentheses as well, and quotes which represent a full expression, usually to be matched against a thesaurus dictionary term. What this means is that if we are using user input in our queries, we need to either parse the query first and convert it to a valid tsquery expression, or educate our users to make their searches using the correct syntax… But I would rather go for the first choice.

Now you may be asking how do you include different weights in the document? There is another function called setweight, which gets a tsvector and a weight label represented by the letters A to D. Also, the tsvector type accepts the || operator which performs a concatenation of two documents. So basically, we just need to create three documents for each part, set a weight for each and then concatenate the resulting tsvector’s. But there is a catch (this is a recommendation found on the official PostgreSql documentation), the to_tsvector function will return NULL if it founds an empty string or NULL, and trying to cocatenate a document with a NULL value would cause an error, so, to prevent this, we should use the coalesce function, which fallbacks a variable to a certain value if a NULL value is found. So here’s how we can create a tsvector with all that weight stuff:

setweight(to_tsvector('english', coalesce(title,'')), 'A') ||
	     setweight(to_tsvector('english', coalesce(teaser,'')), 'B') ||
	     setweight(to_tsvector('english', coalesce(body,'')), 'B')

Now we can just match the resulting document to a tsquery using the @@ operator, but that would make much of the previous work useless, because the @@ operator just returns TRUE or FALSE, whether its a positive match or not. What we really need is something to give us some feedback on how well the documents matched. So instead of the @@ operator we can use the ts_rank or ts_rank_cd functions. They work about the same, except for the computation algorithm, because the latter will also consider the position of the lexemes in the document, ranking documents where the keyword appear close to each other higher than documents where the keywords appear in very different positions. The basic usage is to provide a tsquery and a tsvector as arguments, and we’ll get a floating point value ranking our queries, but we can optionally add a first argument, containing the an array of four float values indicating the weight of each weight label, from D to A. Default value for this is 0.1, 0.2, 0.4 ad 1.0. We can also add a fourth optional argument which is a bit flag indicating how a document length can interfere in its ranking. More on that later. So now our query could look like this:

SELECT *, ts_rank('{0.2, 0.5, 0.75, 1.0}',
	          setweight(to_tsvector('english', coalesce(title,'')), 'A') ||
		  setweight(to_tsvector('english', coalesce(teaser,'')), 'B') ||
		  setweight(to_tsvector('english', coalesce(body,'')), 'B'),
		  to_tsquery('english', 'some & text & search')) AS ranking
FROM news ORDER BY ranking DESC;

This is now getting a bit scary… To much code for a text search. Here’s where our column with the pre-normalized document will become handy, along with a trigger that would automatically populate this field each time a row is inserted/updated:

ALTER TABLE news ADD COLUMN tsv tsvector;
 
CREATE FUNCTION news_trigger() RETURNS TRIGGER AS $$
begin
  new.tsv :=
     setweight(to_tsvector('english', coalesce(title,'')), 'A') ||
     setweight(to_tsvector('english', coalesce(teaser,'')), 'B') ||
     setweight(to_tsvector('english', coalesce(body,'')), 'B');
  RETURN new;
end
$$ LANGUAGE plpgsql;
 
CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
ON news FOR EACH ROW EXECUTE PROCEDURE news_trigger();
 
UPDATE news SET title = title;

That last update instruction is just to force the trigger to be triggered (this sounds weird) for all already existing columns. This has the advantage of not only simplify queries, but also saves lots of processing time when executing them. Here’s our query now:

SELECT *, ts_rank('{0.2, 0.5, 0.75, 1.0}', tsv, to_tsquery('english', 'some & text & search')) AS ranking
FROM news ORDER BY ranking DESC;

Much simpler huh? But we can even save some time only ranking documents where there is effectively a match, and pre-processing our query and our array:

SELECT *, ts_rank(p.factors, tsv, p.query) AS ranking
FROM news, (SELECT '{0.2, 0.5, 0.75, 1.0}'::float4[] AS factors, to_tsquery('some & text & search') AS query) p
WHERE tsv @@ query
ORDER BY ranking DESC;

Nice huh? Now I will tell you a great hint, if you’re creating a multi language website. Usually, assuming that you’ll be using the same table for every language, there is a column which will hold the language settings for each row. But it would be nice too to build our pre-normalized document based on that language. Well, we can use the language field on the trigger function to achieve that. So assuming that the language column will hold only valid and available languages, we can add the column and rewrite the trigger function:

ALTER TABLE news ADD COLUMN lang character varying;
 
CREATE OR REPLACE FUNCTION news_trigger() RETURNS TRIGGER AS $$
begin
  new.tsv :=
     setweight(to_tsvector(new.lang::regconfig, coalesce(title,'')), 'A') ||
     setweight(to_tsvector(new.lang::regconfig, coalesce(teaser,'')), 'B') ||
     setweight(to_tsvector(new.lang::regconfig, coalesce(body,'')), 'B');
  RETURN new;
end
$$ LANGUAGE plpgsql;
 
--update the already existent rows with the new lang field, and at the same time call the trigger on every row
UPDATE news SET lang = 'english';

Notice that the language parameter on the to_tsvector functions is typecasted to regconfig. You need to do this because that’s the type expected by the function, and when you use a hard coded string, that string has no type until the postgres interpreter decides which one fits better for that purpose. Optionally, we could have used the regconfig type when adding the lang column. Now we can safely query our documents in a multi language way…

This concludes my “small” introduction to full text search in PostgreSql, I have plans to continue this explaining how to build GIN and GIST indexes and adding dictionaries and dictionary terms.

See you then…