Design and Implementation of Multi Part AOF in Redis 7.0
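I AOF
II AOFRW
III Problems with AOFRW
1 Memory overhead
During AOFRW the main process writes the changes made after the fork into aof_rewrite_buf, and most of this data duplicates what is already in aof_buf. The aof_rewrite_buffer_length field in INFO shows how much memory aof_rewrite_buf currently uses. Under heavy write traffic it is almost as large as aof_buffer_length, which roughly doubles the buffer memory: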
aof_pending_rewrite:0
aof_buffer_length:35500
aof_rewrite_buffer_length:34000
aof_pending_bio_fsync:0
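The following Redis log of one AOFRW shows the cost: 2135.60 MB of AOF diff data was sent from the parent to the child, and aof_rewrite_buf still held 100 MB when the rewrite finished:
3351:M 25 Jan 2022 09:55:39.655 * Background append only file rewriting started by pid 6817
3351:M 25 Jan 2022 09:57:51.864 * AOF rewrite child asks to stop sending diffs.
6817:C 25 Jan 2022 09:57:51.864 * Parent agreed to stop sending diffs. Finalizing AOF...
6817:C 25 Jan 2022 09:57:51.864 * Concatenating 2135.60 MB of AOF diff received from parent.
3351:M 25 Jan 2022 09:57:56.545 * Background AOF buffer size: 100 MB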
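The duplication can be modeled in a few lines. The sketch below is a simplified model of the pre-7.0 write path, not Redis code: `buf_t`, `server_model` and `feed_append_only_file` are hypothetical names, and we assume every write command is appended once to aof_buf and, while a rewrite is running, a second time to aof_rewrite_buf.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Minimal growable byte buffer (hypothetical stand-in for an sds string). */
typedef struct {
    char *data;
    size_t len, cap;
} buf_t;

static void buf_append(buf_t *b, const char *s, size_t n) {
    if (b->len + n > b->cap) {
        size_t ncap = b->cap ? b->cap * 2 : 64;
        while (ncap < b->len + n) ncap *= 2;
        char *p = realloc(b->data, ncap);
        if (p == NULL) { perror("realloc"); exit(1); }
        b->data = p;
        b->cap = ncap;
    }
    memcpy(b->data + b->len, s, n);
    b->len += n;
}

/* Hypothetical model of the two buffers kept by the main process. */
typedef struct {
    buf_t aof_buf;          /* flushed to the current AOF */
    buf_t aof_rewrite_buf;  /* forwarded to the rewrite child */
    int rewrite_in_progress;
} server_model;

/* Every write command goes to aof_buf; during AOFRW it is copied again. */
static void feed_append_only_file(server_model *s, const char *cmd) {
    size_t n = strlen(cmd);
    buf_append(&s->aof_buf, cmd, n);
    if (s->rewrite_in_progress) buf_append(&s->aof_rewrite_buf, cmd, n);
}
```

After a command is fed during a rewrite, both buffers hold byte-identical copies of it, which is the redundancy visible in the INFO output above.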
2 CPU overhead
During AOFRW, the main process spends CPU time writing data into aof_rewrite_buf, and uses the event loop to send the data in aof_rewrite_buf to the child process:
/* Append data to the AOF rewrite buffer, allocating new blocks if needed. */
void aofRewriteBufferAppend(unsigned char *s, unsigned long len) {
    // Other details omitted...
    /* Install a file event to send data to the rewrite child if there is
     * not one already. */
    if (!server.aof_stop_sending_diff &&
        aeGetFileEvents(server.el,server.aof_pipe_write_data_to_child) == 0)
    {
        aeCreateFileEvent(server.el, server.aof_pipe_write_data_to_child,
            AE_WRITABLE, aofChildWriteDiffData, NULL);
    }
    // Other details omitted...
}
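In the late stage of the rewrite, the child process repeatedly reads the incremental data that the main process sends through the pipe and appends it to the temporary AOF file: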
int rewriteAppendOnlyFile(char *filename) {
    // Other details omitted...
    /* Read again a few times to get more data from the parent.
     * We can't read forever (the server may receive data from clients
     * faster than it is able to send data to the child), so we try to read
     * some more data in a loop as soon as there is a good chance more data
     * will come. If it looks like we are wasting time, we abort (this
     * happens after 20 ms without new data). */
    int nodata = 0;
    mstime_t start = mstime();
    while(mstime()-start < 1000 && nodata < 20) {
        if (aeWait(server.aof_pipe_read_data_from_parent, AE_READABLE, 1) <= 0)
        {
            nodata++;
            continue;
        }
        nodata = 0; /* Start counting from zero, we stop on N *contiguous*
                       timeouts. */
        aofReadDiffFromParent();
    }
    // Other details omitted...
}
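The loop's stopping rule (give up after 1000 ms overall, or after 20 consecutive 1 ms waits with no data) can be reproduced with plain POSIX calls. In this sketch `poll()` stands in for aeWait and `now_ms()` for mstime(); `drain_parent_diffs` is a hypothetical helper that copies into a caller-supplied buffer instead of appending to a file.

```c
#define _POSIX_C_SOURCE 200809L
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Monotonic milliseconds; plays the role of mstime() in the excerpt. */
static long long now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* Hypothetical helper: read diff data from `fd` into `out` until `max_ms`
 * has elapsed or `max_idle` consecutive 1 ms waits saw no data.
 * Returns the number of bytes read. */
static size_t drain_parent_diffs(int fd, char *out, size_t cap,
                                 long long max_ms, int max_idle) {
    size_t total = 0;
    int nodata = 0;
    long long start = now_ms();

    while (now_ms() - start < max_ms && nodata < max_idle) {
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        if (poll(&pfd, 1, 1) <= 0) {  /* timeout (or error): count as idle */
            nodata++;
            continue;
        }
        nodata = 0;                   /* only *contiguous* idle waits stop us */
        if (total == cap) break;      /* caller's buffer is full */
        ssize_t n = read(fd, out + total, cap - total);
        if (n <= 0) break;            /* EOF (writer closed) or read error */
        total += (size_t)n;
    }
    return total;
}
```

With the write end still open, the call returns shortly after the pipe runs dry, so a parent that keeps producing data cannot hold the child in this loop for much more than a second.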
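After the child process finishes the rewrite, the main process does the wrap-up work in backgroundRewriteDoneHandler. One of its tasks is to write the data in aof_rewrite_buf that was not consumed during the rewrite into the temporary AOF file. If a lot of data is left in aof_rewrite_buf, this step also consumes CPU time.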
void backgroundRewriteDoneHandler(int exitcode, int bysignal) {
    // Other details omitted...
    /* Flush the differences accumulated by the parent to the rewritten AOF. */
    if (aofRewriteBufferWrite(newfd) == -1) {
        serverLog(LL_WARNING,
            "Error trying to flush the parent diff to the rewritten AOF: %s", strerror(errno));
        close(newfd);
        goto cleanup;
    }
    // Other details omitted...
}
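3 Disk IO overhead
4 Code complexity
To exchange data and control messages during a rewrite, the parent and child use three pipes, whose six file descriptors are kept in redisServer: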
/* AOF pipes used to communicate between parent and child during rewrite. */
int aof_pipe_write_data_to_child;
int aof_pipe_read_data_from_parent;
int aof_pipe_write_ack_to_parent;
int aof_pipe_read_ack_from_child;
int aof_pipe_write_ack_to_child;
int aof_pipe_read_ack_from_parent;
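The six descriptors form three pipes: one carries diff data from parent to child, and two carry the "stop sending diffs" handshake that appears in the log earlier. The single-process sketch below illustrates that handshake. The struct and function names are hypothetical, and the one-byte '!' message is an assumption about the protocol, not something this article shows.

```c
#include <unistd.h>

/* Hypothetical grouping of the six descriptors into three pipes. */
typedef struct {
    int data[2];          /* parent -> child: incremental diffs */
    int ack_to_parent[2]; /* child -> parent: "please stop sending diffs" */
    int ack_to_child[2];  /* parent -> child: "stopped" */
} rewrite_pipes;

static int open_rewrite_pipes(rewrite_pipes *p) {
    int *all[3] = { p->data, p->ack_to_parent, p->ack_to_child };
    for (int i = 0; i < 3; i++) {
        if (pipe(all[i]) == -1) {
            /* Close the pipes that were already created before failing. */
            for (int j = 0; j < i; j++) { close(all[j][0]); close(all[j][1]); }
            return -1;
        }
    }
    return 0;
}

static void close_rewrite_pipes(rewrite_pipes *p) {
    int *all[3] = { p->data, p->ack_to_parent, p->ack_to_child };
    for (int i = 0; i < 3; i++) { close(all[i][0]); close(all[i][1]); }
}

/* Child: ask the parent to stop sending diffs. */
static int child_request_stop(const rewrite_pipes *p) {
    return write(p->ack_to_parent[1], "!", 1) == 1 ? 0 : -1;
}

/* Parent: on '!', stop forwarding diffs and acknowledge. */
static int parent_handle_ack(const rewrite_pipes *p, int *stop_sending_diff) {
    char c;
    if (read(p->ack_to_parent[0], &c, 1) != 1 || c != '!') return -1;
    *stop_sending_diff = 1;
    return write(p->ack_to_child[1], "!", 1) == 1 ? 0 : -1;
}

/* Child: wait for the acknowledgement before finalizing the AOF. */
static int child_wait_ack(const rewrite_pipes *p) {
    char c;
    return (read(p->ack_to_child[0], &c, 1) == 1 && c == '!') ? 0 : -1;
}
```

In a real fork the two sides run in different processes, and the child would presumably bound its wait with a timeout; the blocking read here is a simplification.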
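IV MP-AOF implementation
1 Solution overview
MP-AOF splits the single AOF into multiple files of three types:
BASE: the base AOF, usually produced by the child process through a rewrite. There is at most one such file.
INCR: an incremental AOF, usually created when AOFRW starts. There may be several such files.
HISTORY: a historical AOF, turned from a BASE or INCR AOF. Each time an AOFRW completes successfully, the BASE and INCR AOFs that predate this AOFRW become HISTORY, and HISTORY AOFs are deleted automatically by Redis.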
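2 Key implementation
Manifest
Because MP-AOF consists of multiple files, Redis uses a manifest to track them. The manifest has both an in-memory and an on-disk representation.
1) In-memory representation
aofInfo: information about one AOF file; currently only the file name, file sequence number and file type
base_aof_info: the BASE AOF information; NULL when there is no BASE AOF
incr_aof_list: information on all INCR AOF files, kept in the order in which the files were opened
history_aof_list: information on HISTORY AOFs; its elements are all moved over from base_aof_info and incr_aof_list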
typedef struct {
    sds file_name;                /* file name */
    long long file_seq;           /* file sequence */
    aof_file_type file_type;      /* file type */
} aofInfo;
typedef struct {
    aofInfo *base_aof_info;       /* BASE file information. NULL if there is no BASE file. */
    list *incr_aof_list;          /* INCR AOFs list. We may have multiple INCR AOF when rewrite fails. */
    list *history_aof_list;       /* HISTORY AOF list. When the AOFRW success, The aofInfo contained in
                                     `base_aof_info` and `incr_aof_list` will be moved to this list. We
                                     will delete these AOF files when AOFRW finish. */
    long long curr_base_file_seq; /* The sequence number used by the current BASE file. */
    long long curr_incr_file_seq; /* The sequence number used by the current INCR file. */
    int dirty;                    /* 1 Indicates that the aofManifest in the memory is inconsistent with
                                     disk, we need to persist it immediately. */
} aofManifest;
struct redisServer {
    // Other details omitted...
    aofManifest *aof_manifest;    /* Used to track AOFs. */
    // Other details omitted...
};
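How the three lists evolve across rewrites can be sketched with fixed-size arrays in place of Redis's list type. `manifest_model`, `start_rewrite` and `finish_rewrite` are hypothetical. Following the HISTORY description above, we assume that on success the old BASE and every INCR except the one opened at the start of the rewrite move to the history list; the appendonly.aof prefix and the .rdb BASE format are hard-coded for brevity.

```c
#include <stdio.h>
#include <string.h>

#define MAX_FILES 16

typedef enum { AOF_TYPE_BASE = 'b', AOF_TYPE_INCR = 'i', AOF_TYPE_HIST = 'h' } aof_type;

typedef struct {
    char name[64];
    long long seq;
    aof_type type;
} file_info;

typedef struct {
    int has_base;               /* 0 plays the role of base_aof_info == NULL */
    file_info base;
    file_info incr[MAX_FILES];  /* kept in opening order */
    int n_incr;
    file_info hist[MAX_FILES];
    int n_hist;
    long long curr_base_seq, curr_incr_seq;
} manifest_model;

static void manifest_init(manifest_model *m) { memset(m, 0, sizeof *m); }

/* AOFRW starts: open a new INCR AOF with the next INCR sequence number. */
static int start_rewrite(manifest_model *m) {
    if (m->n_incr == MAX_FILES) return -1;
    file_info *f = &m->incr[m->n_incr++];
    f->seq = ++m->curr_incr_seq;
    f->type = AOF_TYPE_INCR;
    snprintf(f->name, sizeof f->name, "appendonly.aof.%lld.incr.aof", f->seq);
    return 0;
}

/* AOFRW succeeds: the old BASE and all INCRs except the newest become
 * HISTORY, and the rewritten file becomes the BASE with the next BASE seq. */
static int finish_rewrite(manifest_model *m) {
    if (m->n_incr < 1) return -1;
    int moved = m->has_base + (m->n_incr - 1);
    if (m->n_hist + moved > MAX_FILES) return -1;

    if (m->has_base) {
        m->base.type = AOF_TYPE_HIST;
        m->hist[m->n_hist++] = m->base;
    }
    for (int i = 0; i < m->n_incr - 1; i++) {
        m->incr[i].type = AOF_TYPE_HIST;
        m->hist[m->n_hist++] = m->incr[i];
    }
    m->incr[0] = m->incr[m->n_incr - 1]; /* keep only the INCR opened by this rewrite */
    m->n_incr = 1;

    m->has_base = 1;
    m->base.seq = ++m->curr_base_seq;
    m->base.type = AOF_TYPE_BASE;
    snprintf(m->base.name, sizeof m->base.name, "appendonly.aof.%lld.base.rdb", m->base.seq);
    return 0;
}
```

After two rewrites the model holds BASE 2 and INCR 2, with BASE 1 and INCR 1 on the history list, waiting to be deleted.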
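2) On-disk representation
On disk, the manifest is a plain-text file in which each line describes one AOF as key/value pairs (file name, seq and type):
file appendonly.aof.1.base.rdb seq 1 type b
file appendonly.aof.1.incr.aof seq 1 type i
file appendonly.aof.2.incr.aof seq 2 type i
The format is designed to be extensible: key order does not matter, unknown keys are ignored, and lines starting with # are comments. The following manifest is therefore also valid:
file appendonly.aof.1.base.rdb seq 1 type b newkey newvalue
file appendonly.aof.1.incr.aof type i seq 1
# this is annotations
seq 2 type i file appendonly.aof.2.incr.aof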
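A parser that honors these rules only needs to split each line into key/value pairs and pick out the keys it knows. This is a sketch under simplifying assumptions: `manifest_entry` and `parse_manifest_line` are hypothetical, tokens are separated by spaces, and file names are assumed to contain no spaces or quotes (a real parser would have to handle those).

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <string.h>

/* One parsed manifest line (hypothetical type). */
typedef struct {
    char file[128];
    long long seq;
    char type; /* 'b' = BASE, 'i' = INCR, 'h' = HISTORY */
} manifest_entry;

/* Parse one manifest line in place.
 * Returns 1 for an AOF entry, 0 for a blank or comment line, -1 if malformed. */
static int parse_manifest_line(char *line, manifest_entry *e) {
    line[strcspn(line, "\r\n")] = '\0';
    if (line[0] == '\0' || line[0] == '#') return 0;

    memset(e, 0, sizeof *e);
    int have_file = 0, have_seq = 0, have_type = 0;
    char *save = NULL;
    for (char *key = strtok_r(line, " ", &save); key != NULL;
         key = strtok_r(NULL, " ", &save)) {
        char *val = strtok_r(NULL, " ", &save);
        if (val == NULL) return -1;           /* key without a value */
        if (strcmp(key, "file") == 0) {
            if (strlen(val) >= sizeof e->file) return -1;
            strcpy(e->file, val);
            have_file = 1;
        } else if (strcmp(key, "seq") == 0) {
            e->seq = strtoll(val, NULL, 10);
            have_seq = 1;
        } else if (strcmp(key, "type") == 0) {
            e->type = val[0];
            have_type = 1;
        }
        /* Any other key (e.g. "newkey") is skipped for forward compatibility. */
    }
    return (have_file && have_seq && have_type) ? 1 : -1;
}
```

Because keys are matched by name rather than by position, the reordered line and the line with the extra newkey pair both parse to the same kind of entry.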
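File naming rules
Each file name starts with the configured appendfilename (appendonly.aof in the examples below), followed by seq, type and format:
seq is the file's sequence number; it starts at 1 and increases monotonically, and BASE and INCR files have independent sequences
type is the AOF type, indicating whether the file is a BASE or an INCR AOF
format indicates how the AOF is encoded internally. Because Redis supports the RDB preamble mechanism, a BASE AOF may be encoded either in RDB format or in AOF format:
appendonly.aof.1.base.rdb // RDB preamble enabled
appendonly.aof.1.base.aof // RDB preamble disabled
appendonly.aof.1.incr.aof
appendonly.aof.2.incr.aof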
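The naming rule fits in one formatting call. `make_aof_file_name` is a hypothetical helper; it assumes, as in the examples above, that only a BASE AOF can use the RDB encoding and that INCR AOFs always use the AOF format.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes "<prefix>.<seq>.<type>.<format>" into `out`.
 * Returns 0 on success, -1 if the prefix is empty or `out` is too small. */
static int make_aof_file_name(char *out, size_t cap, const char *prefix,
                              long long seq, int is_base, int rdb_preamble) {
    if (strlen(prefix) == 0) return -1;
    const char *type = is_base ? "base" : "incr";
    /* Only a BASE AOF may be RDB-encoded; INCR AOFs always use the AOF format. */
    const char *format = (is_base && rdb_preamble) ? "rdb" : "aof";
    int n = snprintf(out, cap, "%s.%lld.%s.%s", prefix, seq, type, format);
    return (n >= 0 && (size_t)n < cap) ? 0 : -1;
}
```

Since BASE and INCR keep independent sequences, the caller would pass the counter that matches the file type.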
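Compatibility with upgrades from older versions
If a file named server.aof_filename exists in dir, Redis may be starting from an AOF written by an older version. It enters upgrade mode in any of the following cases:
the appenddirname directory does not exist;
the appenddirname directory exists but contains no manifest file;
the appenddirname directory exists and contains a manifest file that records only a BASE AOF, the name of this BASE AOF equals server.aof_filename, and no file named server.aof_filename exists in the appenddirname directory.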
/* Load the AOF files according the aofManifest pointed by am. */
int loadAppendOnlyFiles(aofManifest *am) {
    // Other details omitted...
    /* If the 'server.aof_filename' file exists in dir, we may be starting
     * from an old redis version. We will use enter upgrade mode in three situations.
     *
     * 1. If the 'server.aof_dirname' directory not exist
     * 2. If the 'server.aof_dirname' directory exists but the manifest file is missing
     * 3. If the 'server.aof_dirname' directory exists and the manifest file it contains
     *    has only one base AOF record, and the file name of this base AOF is 'server.aof_filename',
     *    and the 'server.aof_filename' file not exist in 'server.aof_dirname' directory
     * */
    if (fileExist(server.aof_filename)) {
        if (!dirExists(server.aof_dirname) ||
            (am->base_aof_info == NULL && listLength(am->incr_aof_list) == 0) ||
            (am->base_aof_info != NULL && listLength(am->incr_aof_list) == 0 &&
             !strcmp(am->base_aof_info->file_name, server.aof_filename) && !aofFileExist(server.aof_filename)))
        {
            aofUpgradePrepare(am);
        }
    }
    // Other details omitted...
}
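The three upgrade conditions can be restated as a pure predicate over what Redis observes at startup. `startup_view` and `need_upgrade` are hypothetical; the boolean fields stand in for the fileExist, dirExists and aofFileExist checks in the excerpt.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical flattened view of the startup checks. */
typedef struct {
    bool old_aof_in_dir;       /* server.aof_filename exists in dir */
    bool aof_dir_exists;       /* the appenddirname directory exists */
    const char *base_name;     /* BASE recorded in the manifest, or NULL */
    int incr_count;            /* number of INCR entries in the manifest */
    bool old_name_in_aof_dir;  /* a file named aof_filename exists inside appenddirname */
} startup_view;

static bool need_upgrade(const startup_view *v, const char *aof_filename) {
    if (!v->old_aof_in_dir) return false;                         /* nothing to upgrade */
    if (!v->aof_dir_exists) return true;                          /* case 1 */
    if (v->base_name == NULL && v->incr_count == 0) return true;  /* case 2 */
    return v->base_name != NULL && v->incr_count == 0 &&          /* case 3 */
           strcmp(v->base_name, aof_filename) == 0 &&
           !v->old_name_in_aof_dir;
}
```

The third case appears to cover a crash in the middle of a previous upgrade: the manifest already names the old file as BASE, but the file has not yet been moved into appenddirname, so the upgrade is simply repeated.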
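aofUpgradePrepare then performs the upgrade in three steps: it builds a BASE AOF record that uses server.aof_filename as the file name, persists this BASE AOF record to the manifest file, and uses rename to move the old AOF file into the appenddirname directory.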
void aofUpgradePrepare(aofManifest *am) {
    // Other details omitted...
    /* 1. Manually construct a BASE type aofInfo and add it to aofManifest. */
    if (am->base_aof_info) aofInfoFree(am->base_aof_info);
    aofInfo *ai = aofInfoCreate();
    ai->file_name = sdsnew(server.aof_filename);
    ai->file_seq = 1;
    ai->file_type = AOF_FILE_TYPE_BASE;
    am->base_aof_info = ai;
    am->curr_base_file_seq = 1;
    am->dirty = 1;
    /* 2. Persist the manifest file to AOF directory. */
    if (persistAofManifest(am) != C_OK) {
        exit(1);
    }
    /* 3. Move the old AOF file to AOF directory. */
    sds aof_filepath = makePath(server.aof_dirname, server.aof_filename);
    if (rename(server.aof_filename, aof_filepath) == -1) {
        sdsfree(aof_filepath);
        exit(1);
    }
    // Other details omitted...
}
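Multi-file loading and progress calculation
loadAppendOnlyFiles loads the BASE AOF first and then every INCR AOF in order. The total size of all these files is computed in advance and passed to startLoading, so loading progress is reported against all files together rather than per file: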
int loadAppendOnlyFiles(aofManifest *am) {
    // Other details omitted...
    /* Here we calculate the total size of all BASE and INCR files in
     * advance, it will be set to `server.loading_total_bytes`. */
    total_size = getBaseAndIncrAppendOnlyFilesSize(am);
    startLoading(total_size, RDBFLAGS_AOF_PREAMBLE, 0);
    /* Load BASE AOF if needed. */
    if (am->base_aof_info) {
        aof_name = (char*)am->base_aof_info->file_name;
        updateLoadingFileName(aof_name);
        loadSingleAppendOnlyFile(aof_name);
    }
    /* Load INCR AOFs if needed. */
    if (listLength(am->incr_aof_list)) {
        listNode *ln;
        listIter li;
        listRewind(am->incr_aof_list, &li);
        while ((ln = listNext(&li)) != NULL) {
            aofInfo *ai = (aofInfo*)ln->value;
            aof_name = (char*)ai->file_name;
            updateLoadingFileName(aof_name);
            loadSingleAppendOnlyFile(aof_name);
        }
    }
    server.aof_current_size = total_size;
    server.aof_rewrite_base_size = server.aof_current_size;
    server.aof_fsync_offset = server.aof_current_size;
    stopLoading();
    // Other details omitted...
}
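Summing the sizes first is what lets the progress figure run from 0 to 100% once across all files instead of restarting for each file. The arithmetic, using a hypothetical helper `loading_progress` where `sizes` lists the BASE file first and then the INCR files:

```c
#include <stddef.h>

/* Hypothetical helper: percentage loaded when `offset_in_file` bytes of
 * file number `file_idx` have been read and all earlier files are done. */
static double loading_progress(const long long *sizes, size_t nfiles,
                               size_t file_idx, long long offset_in_file) {
    long long total = 0, done = 0;
    for (size_t i = 0; i < nfiles; i++) {
        total += sizes[i];                 /* like getBaseAndIncrAppendOnlyFilesSize */
        if (i < file_idx) done += sizes[i];
    }
    done += offset_in_file;
    return total > 0 ? 100.0 * (double)done / (double)total : 100.0;
}
```

For example, with a 600-byte BASE and INCR files of 300 and 100 bytes, being 150 bytes into the first INCR means 750 of 1000 bytes, i.e. 75%.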
AOFRW Crash Safety
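Two measures keep AOFRW crash-safe: the BASE AOF's name contains its file sequence number, so a newly created BASE AOF never collides with an earlier one; and the AOF rename is always performed before the manifest file is modified.
For example, suppose the manifest looks like this before an AOFRW:
file appendonly.aof.1.base.rdb seq 1 type b
file appendonly.aof.1.incr.aof seq 1 type i
When the AOFRW starts, a new INCR AOF is opened:
file appendonly.aof.1.base.rdb seq 1 type b
file appendonly.aof.1.incr.aof seq 1 type i
file appendonly.aof.2.incr.aof seq 2 type i
After the AOFRW succeeds, the rewritten file becomes the new BASE (seq 2), and the previous BASE and INCR AOFs are marked as HISTORY:
file appendonly.aof.2.base.rdb seq 2 type b
file appendonly.aof.1.base.rdb seq 1 type h
file appendonly.aof.1.incr.aof seq 1 type h
file appendonly.aof.2.incr.aof seq 2 type i
Because appendonly.aof.2.base.rdb does not overwrite appendonly.aof.1.base.rdb, a crash before the manifest is updated leaves the old manifest and every file it references intact.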
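After the child finishes, backgroundRewriteDoneHandler commits the rewrite in the following steps:
1. Before touching server.aof_manifest in memory, dup a temporary manifest and apply all subsequent changes to it. If any later step fails, we simply destroy the temporary manifest to roll back the whole operation, without polluting the global server.aof_manifest.
2. Get the new BASE AOF file name (new_base_filename) from the temporary manifest, and mark the previous BASE AOF (if any) as HISTORY.
3. Rename the temporary file temp-rewriteaof-bg-<pid>.aof produced by the child to new_base_filename.
4. Mark all INCR AOFs from before this rewrite in the temporary manifest as HISTORY.
5. Persist the temporary manifest to disk (persistAofManifest guarantees that the manifest update itself is atomic).
6. If all steps above succeeded, point server.aof_manifest at the temporary manifest and free the previous one; only now does the change become visible to Redis.
7. Clean up the HISTORY AOFs. This step is allowed to fail, because a failure here cannot cause any data consistency problem.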
void backgroundRewriteDoneHandler(int exitcode, int bysignal) {
    // Other details omitted...
    snprintf(tmpfile, 256, "temp-rewriteaof-bg-%d.aof",
        (int)server.child_pid);
    /* 1. Dup a temporary aof_manifest for subsequent modifications. */
    temp_am = aofManifestDup(server.aof_manifest);
    /* 2. Get a new BASE file name and mark the previous (if we have)
     * as the HISTORY type. */
    new_base_filename = getNewBaseFileNameAndMarkPreAsHistory(temp_am);
    /* 3. Rename the temporary aof file to 'new_base_filename'. */
    if (rename(tmpfile, new_base_filename) == -1) {
        aofManifestFree(temp_am);
        goto cleanup;
    }
    /* 4. Change the AOF file type in 'incr_aof_list' from AOF_FILE_TYPE_INCR
     * to AOF_FILE_TYPE_HIST, and move them to the 'history_aof_list'. */
    markRewrittenIncrAofAsHistory(temp_am);
    /* 5. Persist our modifications. */
    if (persistAofManifest(temp_am) == C_ERR) {
        bg_unlink(new_base_filename);
        aofManifestFree(temp_am);
        goto cleanup;
    }
    /* 6. We can safely let `server.aof_manifest` point to 'temp_am' and free the previous one. */
    aofManifestFreeAndUpdate(temp_am);
    /* 7. We don't care about the return value of `aofDelHistoryFiles`, because the history
     * deletion failure will not cause any problems. */
    aofDelHistoryFiles();
    // Other details omitted...
}
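Step 5 relies on persistAofManifest updating the manifest atomically. A common way to get that property, sketched here as an assumption rather than as Redis's exact code, is to write the new content to a temporary file, fsync it, and rename it over the old one; `persist_file_atomically` and the `.tmp` suffix are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: replace `path` with `content` so that a reader (or a
 * restarted server) sees either the old file or the new one, never a
 * half-written file. Returns 0 on success, -1 on error. */
static int persist_file_atomically(const char *path, const char *content) {
    char tmp[4096];
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof tmp) return -1;

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1) return -1;

    size_t len = strlen(content), off = 0;
    while (off < len) {                       /* handle short writes */
        ssize_t w = write(fd, content + off, len - off);
        if (w == -1) { close(fd); unlink(tmp); return -1; }
        off += (size_t)w;
    }
    if (fsync(fd) == -1) { close(fd); unlink(tmp); return -1; }
    if (close(fd) == -1) { unlink(tmp); return -1; }

    /* rename() atomically replaces any existing file at `path`. */
    if (rename(tmp, path) == -1) { unlink(tmp); return -1; }
    return 0;
}
```

To make the rename itself durable across a power loss, a production version would typically also fsync the directory that contains the manifest.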
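AOF truncate support
When a write to the AOF completes only partially, Redis truncates the file to remove the partial data. With MP-AOF, server.aof_current_size covers all BASE and INCR files (see loadAppendOnlyFiles above), so the truncation must use server.aof_last_incr_size, the size of the current INCR AOF:
if (ftruncate(server.aof_fd, server.aof_last_incr_size) == -1) {
    // Other details omitted...
}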
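The idea behind the truncate can be sketched as "append, and on a short write cut the file back to its last known-good size". `open_incr_aof` and `append_or_rollback` are hypothetical. We assume the INCR AOF is opened with O_APPEND, so writes after a truncate continue at the new end of file, and we track the size in a variable that plays the role of server.aof_last_incr_size.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: open an INCR AOF for appending. O_APPEND makes every
 * write land at the current end of file, including after an ftruncate(). */
static int open_incr_aof(const char *path) {
    return open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
}

/* Hypothetical helper: append `len` bytes. `*last_incr_size` is the file size
 * before the write. Returns 0 on success, -1 if the write failed or was short
 * (partial data is cut back off), -2 if the partial data could not be removed. */
static int append_or_rollback(int fd, off_t *last_incr_size,
                              const char *buf, size_t len) {
    ssize_t n = write(fd, buf, len);
    if (n >= 0 && (size_t)n == len) {
        *last_incr_size += (off_t)n;
        return 0;
    }
    /* Short write: drop the partial command so the file ends on a boundary. */
    if (n > 0 && ftruncate(fd, *last_incr_size) == -1) return -2;
    return -1;
}
```

Truncating to the per-file size rather than a total across all AOFs is the point of the change: the offset passed to ftruncate must be meaningful for the one file that fd refers to.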
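AOFRW rate limiting
The automatic AOFRW trigger additionally checks aofRewriteLimited(); a rewrite starts only when the AOF has grown past the configured percentage and rewrites are not currently being limited:
if (server.aof_state == AOF_ON &&
    !hasActiveChildProcess() &&
    server.aof_rewrite_perc &&
    server.aof_current_size > server.aof_rewrite_min_size &&
    !aofRewriteLimited())
{
    long long base = server.aof_rewrite_base_size ?
        server.aof_rewrite_base_size : 1;
    long long growth = (server.aof_current_size*100/base) - 100;
    if (growth >= server.aof_rewrite_perc) {
        rewriteAppendOnlyFileBackground();
    }
}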
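The growth check is plain integer arithmetic, shown below with hypothetical helper names. The back-off function is an assumption about what a limiter like aofRewriteLimited might do, not taken from this article: after 3 consecutive failures it delays the next automatic rewrite by 1, 2, 4, ... minutes, capped at 60. Some limit is useful because, as the incr_aof_list comment notes, there may be multiple INCR AOFs when rewrites fail, so unlimited retries could keep adding files.

```c
#include <stdbool.h>

/* Same arithmetic as the excerpt: growth of the AOF relative to its base size. */
static long long aof_growth_percent(long long current_size, long long base_size) {
    long long base = base_size ? base_size : 1;
    return current_size * 100 / base - 100;
}

/* Hypothetical back-off: 0 while below `threshold` failures, then
 * 1, 2, 4, ... minutes, never more than `max_minutes`. */
static long long rewrite_delay_minutes(int consecutive_failures, int threshold,
                                       long long max_minutes) {
    if (consecutive_failures < threshold) return 0;
    long long delay = 1;
    for (int i = threshold; i < consecutive_failures && delay < max_minutes; i++)
        delay *= 2;
    return delay < max_minutes ? delay : max_minutes;
}

/* Mirrors the size-related conditions of the excerpt (state checks omitted). */
static bool should_auto_rewrite(long long current_size, long long base_size,
                                long long min_size, int perc, bool limited) {
    return perc && current_size > min_size && !limited &&
           aof_growth_percent(current_size, base_size) >= perc;
}
```

With a 100-percent threshold, an AOF that has doubled since the last rewrite (200 MB against a 100 MB base) qualifies unless the limiter is active.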
V Summary
Copyright Disclaimer: The copyright of contents (including texts, images, videos and audios) posted above belong to the User who shared or the third-party website which the User shared from. If you found your copyright have been infringed, please send a DMCA takedown notice to [email protected]. For more detail of the source, please click on the button "Read Original Post" below. For other communications, please send to [email protected].