Thursday, April 29, 2010

Write-Ahead Logging in PostgreSQL 9.0

PostgreSQL 9.0 will feature a new setting, currently called wal_level (we still have a ways to go before the final release, so that could change, but probably not), which will control the amount of information written to the system's write-ahead log. There are three levels:

- "minimal" will write just enough information to the log to ensure proper crash recovery.
- "archive" will write enough extra information to support a standby server.
- "hot_standby" will write the same information as "archive", plus enough additional information to allow the standby server to run queries (the feature popularly known as "Hot Standby").
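For example, a master meant to feed a standby that runs queries might carry a postgresql.conf fragment like this (a sketch only; the setting names are as of the 9.0 betas and could still change, and max_wal_senders is the separate knob that enables streaming connections):

```
# postgresql.conf on the master (sketch, 9.0 beta settings)
wal_level = 'hot_standby'    # minimal | archive | hot_standby
max_wal_senders = 3          # allow up to 3 streaming replication connections
archive_mode = off           # archiving is now independent of wal_level
```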

If you're familiar with setting up a standby server, you may know that in 8.4 and prior this behavior was controlled by a configuration variable called "archive_mode". While that design worked OK for earlier releases, we outgrew it in 9.0 due to the addition of two new features: streaming replication and hot standby.

In 8.4, archive_mode controls three separate behaviors: what information is written to WAL, whether or not to start the archiver process, and whether or not to retain unarchived WAL segments indefinitely. In 9.0, it's possible to set up a system where streaming replication is enabled but archiving is not, and streaming replication has the same WAL requirements as archiving. Rather than implicitly writing the necessary information to WAL whenever either archiving or streaming replication is enabled, it seemed cleaner to have a separate setting that explicitly controls what gets written to WAL.

While streaming replication and archiving have essentially the same WAL requirements, Hot Standby requires all of those same things to be logged plus a few more. Initially, we just had a setting for whether or not to log those few extra things, and it simply didn't do anything unless either archiving or streaming replication was also enabled. But that led to some confusing error conditions. Streaming replication was enabled by setting a variable called max_wal_senders to a value greater than 0, so the WAL needed for Hot Standby would get written if:

(archive_mode = 'on' OR max_wal_senders > 0) AND recovery_connections = 'on'

Again, it seemed cleaner to have a separate setting. When Hot Standby failed to start on the standby because of incorrect settings on the master, the error message needed to say something like this:
We can't start Hot Standby because the master server either has max_wal_senders=0 and archive_mode=off, or else it has recovery_connections=off. So please either go enable recovery_connections, or if it's already enabled, then either increase max_wal_senders or turn on archive_mode.
With the new wal_level setting, we can just say (in effect) this:
We can't start Hot Standby because you didn't set wal_level='hot_standby' on the master.
That's not the exact text of the error message. We actually refer to the thing that we can't start on the standby as "recovery_connections", and there is a "recovery_connections" setting on the standby that controls whether or not we try to start it. I'm not sure whether this is all good terminology - these are all new features so we're hammering it out as we go along! Hopefully we'll get some further feedback as we move into beta. Beta1 should be wrapped later today!

Thursday, April 22, 2010

Materialized Views in PostgreSQL

About 9 months ago, Peter Eisentraut reported on the results of a UserVoice survey he had set up. The purpose of the survey was to gather information from PostgreSQL users on which features they were most interested in seeing added to PostgreSQL. Of course, to what extent the features that got high marks in that survey represent the real interests of the community is questionable: certainly, not every PostgreSQL user (or potential user) reads Peter's blog. Still, the results seem to indicate a lot of interest in materialized views.

Recently, Pavel Baros proposed a possible implementation of materialized views on the pgsql-hackers mailing list as a Google Summer of Code project. What he's proposing is just about the simplest implementation of materialized views I can imagine, comprising just two commands:

CREATE MATERIALIZED VIEW name AS SELECT ...
ALTER MATERIALIZED VIEW name REFRESH;

Of course, right away this opens a can of worms. Basically, there are two things that you might mean when you talk about a "materialized view":

1. A materialized view that is always up to date. In other words, whenever the underlying tables are updated, there is some incremental-update process that concurrently modifies the materialized view data so that it remains up to date also. The tricky part is that it's not easy to update a materialized view of an arbitrary query for a reasonable cost. If the query is complex, the only feasible way to keep the materialized view up to date may be to re-execute the entire query whenever any of the underlying tables are modified (or when any of the functions or operators are redefined), which is probably not feasible for most people. But the plus side is that, when a cheap incremental update IS possible, you don't really need to know that you're working with a materialized view at all. It's indistinguishable from a regular view, up to performance.

2. A materialized view that isn't always up to date. This would include the proposed implementation, which is manual-refresh-only, as well as, for example, views that are refreshed on a regular schedule. (It's pretty easy to convert manual-refresh-only to refreshed-on-a-regular-schedule if you have access to cron!) Materialized views of this type are possibly useful in cases where incremental refresh isn't easily done, such as aggregates over a huge data set that produce a much smaller output table. Of course, you can do this kind of thing today using triggers, but maybe it would be nice to have something in the database that would do it for you automatically.
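As a sketch of the cron approach, assuming the proposed syntax above and a hypothetical view called order_summary in a database called mydb (both names are made up for illustration):

```
# crontab entry: refresh the materialized view at the top of every hour
0 * * * *  psql -d mydb -c 'ALTER MATERIALIZED VIEW order_summary REFRESH;'
```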

In the end, I suspect we'll want to support both of these features in PostgreSQL, but I'm curious which one people think is more useful. Our initial implementation of either feature is likely to be a bit rough around the edges, so don't assume we're going to provide all the optimizations we'd ever like to have on day one!

Sunday, April 18, 2010

Materialization in PostgreSQL 9.0

Some plan types, such as nested loops and, to a lesser degree, merge joins, rely on the ability to "rescan" their inner input. Rescanning basically means backing up: the executor asks the target node to produce the same output tuples it has already spit out all over again - either all of them, or just a specified portion. This can be expensive: if the plan that produced those tuples has a high cost associated with it, we'd very much prefer not to do it more than once.

Enter Materialize. When considering a Nested Loop or Merge Join, the planner considers inserting a "Materialize" node atop the inner plan. This node stores a copy of the tuples generated by the inner plan in an in-memory buffer or in a temporary file, making it very cheap to return them over again if needed.

In versions 8.4 and prior, the logic for inserting Materialize nodes when considering Nested Loop plans was thoroughly bogus: the planner had no idea that rescanning a Materialize node would be any cheaper than scanning it the first time - which is pretty horrible when you consider that making rescans cheap is precisely the point of having Materialize nodes. In 9.0, the logic has been extensively rewritten. Based on the limited testing I've done so far, 9.0 will materialize the inner side of a nested loop a great deal more frequently than 8.4, and most of those cases seem to be wins.

It's probably not too surprising that if the node being rescanned is, say, a join, it's faster to reread the tuples from a buffer than it is to redo the whole join. Similarly, if you're rescanning a base table with a filter condition - e.g. find all the rows in table foo where x = 1 - it's faster to save the rows that meet the condition than it is to rescan the whole table and test the condition over again for each row. Amazingly, however, at least in some cases, it seems to be faster to use materialization even when rescanning a base table WITHOUT a filter condition. In other words, if we need to read all the rows in table foo three times, it's at least sometimes faster to read the table once, dump all the tuples into a tuplestore, and then reread the tuplestore twice more than it is to read the base table three times.
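A simple way to provoke this behavior is a join with no usable equality condition, which forces a nested loop (a hypothetical example; whether the planner actually inserts a Materialize node depends on costs and table sizes):

```
-- bar must be rescanned once for every row of foo; in 9.0 the planner
-- will often put a Materialize node above the scan of bar, so the
-- repeated scans read a tuplestore instead of the base table
EXPLAIN SELECT * FROM foo, bar WHERE foo.x + bar.y > 100;
```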

How is that possible, you ask? I haven't played with this enough to be sure, but I suspect the win comes from eliminating tuple visibility checks and other overhead.

Even though this isn't the sort of feature that tends to make the headlines, it's a fairly significant adjustment to the planner, so I expect to find some lurking problems (a few have been found and fixed already). It may be that the planner is now too aggressive about materializing, whereas before it wasn't aggressive enough. Or it may turn out that materializing increases query memory usage too much, or just isn't faster in some as-yet-unknown set of circumstances. On the whole, though, the new logic is far more sensible than the old, so I'm hopeful that we'll see some modest performance improvements out of this change once the dust settles.

Wednesday, April 14, 2010

Finding Our Way to LATERAL

One of the few types of SQL query that we don't yet implement in PostgreSQL is LATERAL, and we seem to get fairly regular requests for it. The typical use case for this feature is where you have a table called, say, foo, and a set-returning function called, say, func, and you'd like to do this:

SELECT foo.x, bar.x FROM foo, func(foo.x) AS bar;

But that doesn't work:

ERROR: function expression in FROM cannot refer to other relations of same query level

In fact, the SQL standard apparently specifies that this is not legal syntax. If you want to reference a variable at the same query level, you need to wrap it using LATERAL():

SELECT foo.x, bar.x FROM foo, LATERAL(func(foo.x)) AS bar;

Currently, PostgreSQL does not support LATERAL(), so this query fails with a complaint that there's no function called LATERAL available. But wouldn't it be neat if it worked? Right now, there are only a limited number of ways to call SRFs, and it's hard to get the equivalent of the above query without major gymnastics.
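The most common workaround today is to call the SRF in the select list rather than in the FROM clause, a longstanding PostgreSQL extension (a sketch; note that if func returns a composite type, its result comes back as a single column unless you expand it):

```
-- set-returning function in the target list: one output row is produced
-- for each row func(foo.x) returns, without needing LATERAL
SELECT foo.x, func(foo.x) FROM foo;
```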

I wanted to see how much work it would be to implement LATERAL() in PostgreSQL, so, just for fun, I beat the parser with a large enough sledgehammer to make it accept the first of the two queries listed above. This effort was doomed to failure, however. When the planner goes to join foo to func(foo.x), it doesn't understand the implicit dependency between those two items. By jiggering the costs, you can get the planner to decide on a nonsensical plan like this: for each row in the output of func(foo.x), scan all the rows in foo. Of course, to us human beings, it's obvious that we have to do those two steps in the opposite order, but the planner doesn't know that. Its only knowledge about the connection between different tables comes from the join clauses, and there are no join clauses in this query.

As it turns out, fixing this is closely related to solving a problem with nested-loop-with-inner-indexscan plans, which could more generally be called "partial index scan" plans. Currently, we find these plans using some special-case crocks that cover most, but not quite all, of the interesting cases. Andrew Gierth supplied this example:
SELECT ... FROM t1 LEFT JOIN (t2 JOIN t3 ON (t2.a=t3.a)) on (t1.a=t2.a);
If t1 is small and t2 and t3 are large, we might like to get a plan that looks something like this:

Nested Loop Left Join
  Join Filter: (t1.a = t2.a)
  ->  Seq Scan on t1
  ->  Nested Loop
        Join Filter: (t2.a = t3.a)
        ->  Index Scan using t2_a on t2
              Index Qual: (t2.a = t1.a)
        ->  Index Scan using t3_a on t3
              Index Qual: (t3.a = t1.a)

But we won't. Instead, we'll spend a lot of energy joining all of t2 to all of t3, and then throw away most of those results when we do the join to t1. This happens because the planner will only consider a partial index scan on t2 or t3 if the index scan is the inner side of a nested loop and the table providing the value to look for is on the outer side of the same nested loop. Here, the index quals reference t1, which isn't part of the join: it's one level up. In many cases, the planner can work around the inability to form these types of plans by reordering the joins, but swapping the order of the LEFT and INNER joins in this query would change its meaning, so that doesn't work here.

Tom Lane had an idea about how to generalize the inner-indexscan machinery to handle this. Instead of considering partial index scans using special-case logic at the time we form a nested loop, Tom proposed, essentially, promoting them to first-class citizens, to be considered along with all the other paths we examine when forming a join. To make that work, a little bit of tracking data must be added: if we choose a partial index scan path for t2, we don't need to join it directly to t1, but we do need to *eventually* put whichever portion of the plan contains it on the inner side of a nested loop that has t1 on the outer side, so that t1 can supply the values for the index lookup. As it happens, set-returning functions called via LATERAL need exactly the same thing: a way to make sure that the portion of the query tree where they end up being called is somewhere under the inner side of a nested loop whose outer side supplies the necessary variables.

There's one more fly in the ointment: this is not just a planning problem. We also need to make sure that whatever plan the planner creates can be executed by the executor, and it turns out that the executor has an almost identical limitation. Nested loops pass down an "outer tuple" to their inner plan, and if that plan happens to be an index scan, then the outer tuple can be used to constrain the index lookup. Unfortunately, again, this only works if the nested loop is the immediate parent of the index scan. Tom Lane has an idea for fixing this, too: replace the existing mechanism with executor parameters, which are more general and can pass values between far-distant parts of the query plan. Again, we'll use this same design for LATERAL, when a set-returning function has input parameters that must be supplied by a table scan elsewhere in the query.

So, when might we see actual LATERAL support in PostgreSQL? Tom Lane has indicated that he plans to fix the executor limitation described above for 9.1; I'm not sure whether he's planning to fix the planner limitation as well. Once those are fixed, I think we'll be within striking distance of the core feature. So, I'm hopeful that we'll get LATERAL in either 9.1 or 9.2. Stay tuned!

Monday, April 12, 2010

PostgreSQL East 2010 Slides

Quite a few people seemed to like my talk on the PostgreSQL Query Planner at PostgreSQL East, and I got a few requests for copies of my slides. I'll be giving a shortened version of that talk at PGCon 2010. I didn't get as many positive comments on my other talk, an Introduction to Triggers, though I had a pretty full room and the people who came seemed to find the material interesting. Anyhow, I've put both presentations online.