|
| 1 | +# Design decisions |
| 2 | + |
| 3 | +Explanations of why things are done the way they are. |
| 4 | + |
| 5 | +## Why does pglogical_output exist when there's wal2json etc? |
| 6 | + |
| 7 | +`pglogical_output` does plenty more than convert logical decoding change |
| 8 | +messages to a wire format and send them to the client. |
| 9 | + |
| 10 | +It handles format negotiations, sender-side filtering using pluggable hooks |
| 11 | +(and the associated plugin handling), etc. The protocol itself is also |
| 12 | +important, and incorporates elements like binary datum transfer that can't be |
| 13 | +easily or efficiently achieved with json. |
| 14 | + |
| 15 | +## Custom binary protocol |
| 16 | + |
| 17 | +Why do we have a custom binary protocol inside the walsender / copy-both protocol, |
| 18 | +rather than using a json message representation? |
| 19 | + |
| 20 | +Speed and compactness. It's expensive to create json, with lots of allocations. |
| 21 | +It's expensive to decode it too. You can't represent raw binary in json, and must |
| 22 | +encode it, which adds considerable overhead for some data types. Using the |
| 23 | +obvious, easy to decode json representations also makes it difficult to do |
| 24 | +later enhancements planned for the protocol and decoder, like caching row |
| 25 | +metadata. |
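
To make the encoding overhead concrete, here is a minimal sketch that measures what happens when a binary value has to be carried inside json. It assumes base64 as the text encoding and a hypothetical 1 KiB value in a column named `col`; neither comes from the plugin itself.

```python
import base64
import json
import os

# Hypothetical 1 KiB binary datum; any raw bytes behave the same way.
raw = os.urandom(1024)

# json strings cannot hold arbitrary bytes, so the value must be
# text-encoded first. base64 is an assumption, chosen as a common option.
encoded = base64.b64encode(raw).decode("ascii")
as_json = json.dumps({"col": encoded})

# Ratio of bytes on the wire to bytes of actual data.
ratio = len(as_json) / len(raw)
print(f"raw={len(raw)} json={len(as_json)} ratio={ratio:.2f}")
```

base64 alone expands every 3 input bytes to 4 output characters, before any json quoting or key names are added, and the receiver has to undo both layers.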
| 26 | + |
| 27 | +The protocol implementation is fairly well encapsulated, so in future it should |
| 28 | +be possible to emit json instead for clients that request it. Right now that's |
| 29 | +not the priority as tools like wal2json already exist for that. |
| 30 | + |
| 31 | +## Column metadata |
| 32 | + |
| 33 | +The output plugin sends metadata for columns - at minimum, the column names - |
| 34 | +before each row. It will soon be changed to send the metadata only before the |
| 35 | +first row from a new, different table, so that streams of inserts from COPY etc |
| 36 | +don't repeat the metadata each time. That's just a pending feature. |
| 37 | + |
| 38 | +The reason metadata must be sent is that the upstream and downstream tables' |
| 39 | +attnos don't necessarily correspond. The column names might, and their ordering |
| 40 | +might even be the same, but any column drop or column type change will result |
| 41 | +in a dropped column on one side. So at the user level the tables look the same, |
| 42 | +but their attnos don't match, and if we rely on attno for replication we'll get |
| 43 | +the wrong data in the wrong columns. Not pretty. |
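
The attno mismatch can be illustrated with a small sketch. The table layouts and the `map_by_name` helper below are hypothetical, made up for illustration: the downstream table once had a column at attno 2 that was dropped, so its visible columns sit at different attnos than upstream.

```python
# Hypothetical layouts: both tables look like (id, name, email) to the user.
upstream_cols = {1: "id", 2: "name", 3: "email"}
downstream_cols = {1: "id", 3: "name", 4: "email"}  # attno 2 was dropped


def map_by_name(upstream, downstream):
    """Return {upstream attno: downstream attno}, matching on column name."""
    by_name = {name: attno for attno, name in downstream.items()}
    return {attno: by_name[name] for attno, name in upstream.items()}


mapping = map_by_name(upstream_cols, downstream_cols)

# Applying by raw attno instead would put upstream "email" (attno 3)
# into whatever downstream has at attno 3, which is "name".
wrong_target = downstream_cols[3]
print(mapping, wrong_target)
```

Matching on names is only possible because the names are sent with the data, which is the point of the metadata.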
| 44 | + |
| 45 | +That could be avoided by requiring that the downstream table be strictly |
| 46 | +maintained by DDL replication, but: |
| 47 | + |
| 48 | +* We don't want to require DDL replication |
| 49 | +* That won't work with multiple upstreams feeding into a table |
| 50 | +* The initial table creation still won't be correct if the table has dropped |
| 51 | + columns, unless we (ab)use `pg_dump`'s `--binary-upgrade` support to emit |
| 52 | + tables with dropped columns, which we don't want to do. |
| 53 | + |
| 54 | +So despite the bandwidth cost, we need to send metadata. |
| 55 | + |
| 56 | +In future a client-negotiated cache is planned, so that clients can announce |
| 57 | +to the output plugin that they can cache metadata across change series, and |
| 58 | +metadata need only be sent when invalidated by relation changes or when a new |
| 59 | +relation is seen. |
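
Since the cache is only planned, the following is a sketch of the bookkeeping it implies rather than the plugin's design. The `MetadataCache` class, its method names, and the relation ids are all hypothetical.

```python
class MetadataCache:
    """Tracks which relations the client already holds metadata for."""

    def __init__(self):
        self._sent = set()

    def needs_metadata(self, relid):
        """True the first time a relation is seen, or after invalidation."""
        if relid in self._sent:
            return False
        self._sent.add(relid)
        return True

    def invalidate(self, relid):
        """Forget a relation, e.g. after its definition changes."""
        self._sent.discard(relid)


cache = MetadataCache()
# Two rows from one relation, then a row from another.
sent = [cache.needs_metadata(r) for r in (16384, 16384, 16390)]
cache.invalidate(16384)          # relation changed upstream
after_change = cache.needs_metadata(16384)
print(sent, after_change)
```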
| 60 | + |
| 61 | +Support for type metadata is penciled in to the protocol so that clients that |
| 62 | +don't have table definitions at all - like queueing engines - can decode the |
| 63 | +data. That'll also permit type validation sanity checking on the apply side |
| 64 | +with logical replication. |
| 65 | + |
| 66 | +## Hook entry point as a SQL function |
| 67 | + |
| 68 | +The hooks entry point is a SQL function that populates a passed `internal` |
| 69 | +struct with hook function pointers. |
| 70 | + |
| 71 | +The reason for this is that hooks are specified by a remote peer over the |
| 72 | +network. We can't just let the peer say "dlsym() this arbitrary function name |
| 73 | +and call it with these arguments" for fairly obvious security reasons. At bare |
| 74 | +minimum all replication using hooks would have to be superuser-only if we did |
| 75 | +that. |
| 76 | + |
| 77 | +The SQL entry point is only called once per decoding session and the rest of |
| 78 | +the calls are plain C function pointers. |
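
The shape of this arrangement can be sketched outside of C. In the Python analogy below, `CATALOG` stands in for the set of SQL functions the server already knows about, and `Hooks`, `my_hooks_setup` and `skip_audit_rows` are hypothetical names. It is an illustration of the lookup-once, call-directly pattern, not the plugin's actual code.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Hooks:
    """Stand-in for the struct of hook function pointers."""
    row_filter: Optional[Callable[[dict], bool]] = None


def skip_audit_rows(row):
    # Hypothetical hook: filter out changes to one table.
    return row.get("table") != "audit_log"


def my_hooks_setup(hooks):
    # Entry point: fills in the struct it is handed.
    hooks.row_filter = skip_audit_rows


# Only functions registered here can be named by the peer.
CATALOG = {"my_hooks_setup": my_hooks_setup}


def start_session(setup_name):
    setup = CATALOG.get(setup_name)
    if setup is None:
        raise LookupError(f"no such hook setup function: {setup_name}")
    hooks = Hooks()
    setup(hooks)  # called once per session
    return hooks


hooks = start_session("my_hooks_setup")
rows = [{"table": "users"}, {"table": "audit_log"}]
# Per-row calls go straight to the stored function, with no further lookup.
kept = [r for r in rows if hooks.row_filter is None or hooks.row_filter(r)]
print(kept)
```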
| 79 | + |
| 80 | +## The startup reply message |
| 81 | + |
| 82 | +The protocol design choices available to `pglogical_output` are constrained by being |
| 83 | +contained in the copy-both protocol within the fe/be protocol, running as a |
| 84 | +logical decoding plugin. The plugin has no direct access to the network socket |
| 85 | +and can't send or receive messages whenever it wants, only under the control of |
| 86 | +the walsender and logical decoding framework. |
| 87 | + |
| 88 | +The only opportunity for the client to send data directly to the logical |
| 89 | +decoding plugin is in the `START_REPLICATION` parameters, and the plugin can't |
| 90 | +send anything to the client before that point. |
| 91 | + |
| 92 | +This means there's no opportunity for multi-step negotiation between |
| 93 | +client and server. We have to do all the negotiation we're going to in a single |
| 94 | +exchange of messages - the setup parameters and then the replication start |
| 95 | +message. All the client can do if it doesn't like the offer the server makes is |
| 96 | +disconnect and try again with different parameters. |
| 97 | + |
| 98 | +That's what the startup message is for. It reports the plugin's capabilities |
| 99 | +and tells the client which requested options were honoured. This gives the |
| 100 | +client a chance to decide if it's happy with the output plugin's decision |
| 101 | +or if it wants to reconnect and try again with different options. Iterative |
| 102 | +negotiation, effectively. |
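
From the client's side, the reconnect-and-retry loop might look like the sketch below. `connect` stands in for opening a connection and issuing `START_REPLICATION`, and the parameter names `want_binary` and `binary_enabled` are hypothetical, not taken from the protocol.

```python
def negotiate(connect, attempts):
    """Try each parameter set in order; return the first acceptable one.

    connect(params) returns the server's startup reply as a dict.
    Each attempt is (params, acceptable), where acceptable(reply) says
    whether the client can live with what the server actually enabled.
    """
    for params, acceptable in attempts:
        reply = connect(params)
        if acceptable(reply):
            return params, reply
        # Nothing more can be said on this connection: drop it and retry.
    raise RuntimeError("no acceptable parameter set")


def fake_server(params):
    # Hypothetical server that never enables binary transfer.
    return {"binary_enabled": "0"}


attempts = [
    ({"want_binary": "1"}, lambda r: r.get("binary_enabled") == "1"),
    ({"want_binary": "0"}, lambda r: True),  # fallback the client accepts
]
chosen, reply = negotiate(fake_server, attempts)
print(chosen, reply)
```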
| 103 | + |
| 104 | +## Unrecognised parameters MUST be ignored by client and server |
| 105 | + |
| 106 | +To ensure upward and downward compatibility, the output plugin must ignore |
| 107 | +parameters set by the client if it doesn't recognise them, and the client |
| 108 | +must ignore parameters it doesn't recognise in the server's startup reply |
| 109 | +message. |
| 110 | + |
| 111 | +This ensures that older clients can talk to newer servers and vice versa. |
| 112 | + |
| 113 | +For this to work, the server must never enable new functionality such as |
| 114 | +protocol message types, row formats, etc without the client explicitly |
| 115 | +specifying via a startup parameter that it understands the new functionality. |
| 116 | +Everything must be negotiated. |
| 117 | + |
| 118 | +Similarly, a newer client talking to an older server may ask the server to |
| 119 | +enable functionality, but it can't assume the server will actually honour that |
| 120 | +request. It must check the server's startup reply message to see if the server |
| 121 | +confirmed that it enabled the requested functionality. It might choose to |
| 122 | +disconnect and report an error to the user if the server didn't do what it |
| 123 | +asked. This can be important, e.g. when a security-significant hook is |
| 124 | +specified. |
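
Both halves of the rule can be sketched together. The parameter names (`want_binary`, `hook_setup`, `binary_enabled`, `hook_enabled`) and the functions below are hypothetical; they only illustrate ignoring what is unrecognised and confirming what was requested.

```python
KNOWN_SERVER_PARAMS = {"want_binary", "hook_setup"}  # hypothetical names


def server_startup(client_params, supports_hooks=False):
    """Ignore unrecognised parameters; report only what was enabled."""
    recognised = {k: v for k, v in client_params.items()
                  if k in KNOWN_SERVER_PARAMS}
    reply = {"binary_enabled": recognised.get("want_binary", "0")}
    if supports_hooks and "hook_setup" in recognised:
        reply["hook_enabled"] = "1"
    return reply


def client_check(reply, required):
    """Ignore unknown reply keys; fail if a required feature is missing."""
    missing = [k for k in required if reply.get(k) != "1"]
    if missing:
        raise ConnectionError(f"server did not enable: {missing}")
    return True


# A newer client talks to an older server without hook support.
reply = server_startup(
    {"want_binary": "1", "hook_setup": "my_hooks_setup", "future_option": "x"}
)
binary_ok = client_check(reply, ["binary_enabled"])
try:
    client_check(reply, ["hook_enabled"])  # security-significant: must hold
    hook_ok = True
except ConnectionError:
    hook_ok = False
print(reply, binary_ok, hook_ok)
```

The server neither fails on `future_option` nor pretends to have enabled the hook, and the client treats the missing confirmation as an error instead of assuming success.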