OutbackCDX replication
One of the exciting new features contributed by the Archive-It team in OutbackCDX 0.7.0 is James Kafader’s implementation of primary-secondary replication support. This enables deployments such as high-availability failover, load balancing, or hosting indexes in multiple geographic locations to reduce query latency.
How it works
One instance of OutbackCDX is designated the primary and configured to preserve its transaction log.
Secondary instances poll the primary’s transaction log for changes at /{collection}/changes?since={seqno}. This uses RocksDB’s GetUpdatesSince API to return a list of write batches, which the secondary applies as an incremental update to its index.
[{"sequenceNumber": "1", "writeBatch": "... base64 data ..."},
{"sequenceNumber": "3", "writeBatch": "... base64 data ..."}]
Secondary instances are read-only and will refuse record updates made directly via POST and DELETE API calls.
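To make the shape of the changes response concrete, here is a minimal Python sketch of how a client might decode it. It is not OutbackCDX's own code: the function name decode_changes and the sample payloads are hypothetical, and it only assumes what the example response shows, namely string sequence numbers and base64-encoded write batches. How a real secondary applies the batches and chooses its next since value is not covered here.

```python
import base64
import json


def decode_changes(body: str):
    """Decode a /changes response body into (sequence_number, raw_bytes) pairs."""
    batches = []
    for entry in json.loads(body):
        seqno = int(entry["sequenceNumber"])  # sequence numbers arrive as strings
        data = base64.b64decode(entry["writeBatch"])  # opaque write batch bytes
        batches.append((seqno, data))
    return batches


# Hypothetical sample payloads standing in for real write batches.
sample = json.dumps([
    {"sequenceNumber": "1", "writeBatch": base64.b64encode(b"batch-one").decode("ascii")},
    {"sequenceNumber": "3", "writeBatch": base64.b64encode(b"batch-two").decode("ascii")},
])

decoded = decode_changes(sample)
highest = max(seqno for seqno, _ in decoded)  # highest sequence number seen
```

The decoded bytes are only meaningful to RocksDB, so a sketch like this is mostly useful for inspecting or monitoring the feed rather than for applying it.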
Configuring the primary instance
When running in primary mode we should use the --replication-window option to tell OutbackCDX how many seconds to keep the transaction log for. For example, if you’re certain you’ll never need to update a secondary that’s more than 7 days out of date, you could use 604800 to save some disk space. In this case we’ll use 0, which means we keep the transaction log forever.
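The window is given in seconds, so it can help to compute it rather than type it by hand. The following is a small Python sketch; the helper name replication_window_seconds is hypothetical, and only the --replication-window option itself comes from OutbackCDX.

```python
from datetime import timedelta


def replication_window_seconds(days: int) -> int:
    """Convert a retention period in whole days to seconds."""
    return int(timedelta(days=days).total_seconds())


# Seven days of transaction log retention.
window = replication_window_seconds(7)

# Arguments that could be appended to the java command line.
args = ["--replication-window", str(window)]
```

For a 7-day window this gives the 604800 figure used above.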
Let’s create a new directory to store our primary index data in and start the primary:
$ mkdir /tmp/primary
$ java -jar outbackcdx-0.7.0.jar -d /tmp/primary --replication-window 0
OutbackCDX http://localhost:8080
Now that the primary is running let’s create a collection named ‘example’ and populate it with a cdx line:
$ echo '- 20190101000000 http://example.org/ text/html 200 - - - 1043 333 example.warc.gz' > example.cdx
$ curl --data-binary @example.cdx http://localhost:8080/example
Added 1 records
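The CDX line above packs eleven space-separated fields into one record. As a rough illustration, this Python sketch splits it into named fields. The field labels are an assumption based on the common 11-column CDX layout rather than anything OutbackCDX defines here, and parse_cdx_line is a hypothetical helper.

```python
# Assumed labels for the common 11-column CDX layout (not taken from OutbackCDX).
CDX11_FIELDS = (
    "urlkey", "timestamp", "url", "mimetype", "status", "digest",
    "redirect", "robotflags", "length", "offset", "filename",
)


def parse_cdx_line(line: str) -> dict:
    """Split one space-separated CDX line into a dict keyed by assumed field labels."""
    values = line.split()
    if len(values) != len(CDX11_FIELDS):
        raise ValueError(f"expected {len(CDX11_FIELDS)} fields, got {len(values)}")
    return dict(zip(CDX11_FIELDS, values))


record = parse_cdx_line(
    "- 20190101000000 http://example.org/ text/html 200 - - - 1043 333 example.warc.gz"
)
```

A "-" marks a field that was left empty in the example record.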
Configuring a secondary instance
We’ll run a secondary instance with a data directory of /tmp/secondary on port 8081. When running a secondary we need to give it the URL of the primary collection we want to replicate from using the --primary option. If you need to replicate multiple collections, use --primary multiple times.
$ mkdir /tmp/secondary
$ java -jar outbackcdx-0.7.0.jar -d /tmp/secondary -p 8081 --primary http://localhost:8080/example
OutbackCDX http://localhost:8081
Tue Jan 14 17:21:57 KST 2020 ChangePollingThread(http://localhost:8080/example): replicated 1 write batches (1..1) with total length 132 in 0.647s from http://localhost:8080/example/changes?size=10485760&since=0 and our latest sequence number is now 2
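The log line shows the URL the secondary polled, with size and since query parameters. Below is a minimal Python sketch that builds a URL of the same shape, for example to inspect the feed by hand. The helper name changes_url is hypothetical, and the default size is simply copied from the log line above rather than from any documented default.

```python
from urllib.parse import urlencode


def changes_url(primary: str, since: int, size: int = 10485760) -> str:
    """Build a changes-feed URL for a primary collection URL."""
    query = urlencode({"size": size, "since": since})
    return f"{primary.rstrip('/')}/changes?{query}"


url = changes_url("http://localhost:8080/example", since=0)
```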
Changing the replication interval
By default, secondaries will poll for changes every 10 seconds. This can be adjusted with the --update-interval option.