NCCL源码解析⑥：Channel搜索

作者｜KIDGINBROOK

更新｜潘丽晨

上节讲到已经计算出GPU和NIC节点到其他任意节点的最优路径了，本节看下NCCL中channel的搜索过程。

NCCL中channel的概念表示一个通信路径，为了更好地利用带宽和网卡，以及同一块数据可以通过多个channel并发通信，另外后续可以看到一个channel对应了一个GPU SM，所以基于这些原因，NCCL会使用多channel，搜索的过程就是搜索出来一组channel。

如上节所述，单机的情况下会在ncclTopoTrimSystem函数里删除网卡，因此我们先看下单机八卡这种简化的情况，最后再看下多机引入网卡之后的情况。

staticfloatgetMaxWidth(struct ncclTopoSystem* system, struct ncclTopoNode* gpu, int type){float maxWidth = 0.0;for (int i=0; i<system->nodes[type].count; i++) {structncclTopoLinkList* path = gpu->paths[type]+i;float width = path->width;if (path->count == 0) continue; maxWidth = std::max(maxWidth, width); }return maxWidth;}ncclResult_t ncclTopoSearchInit(struct ncclTopoSystem* system){ system->maxWidth = 0.0;int inter = system->nodes[NET].count;if (inter == 0 && system->nodes[GPU].count == 1) { system->maxWidth = LOC_WIDTH;return ncclSuccess; }for (int g=0; g<system->nodes[GPU].count; g++) {structncclTopoNode* gpu = system->nodes[GPU].nodes+g; system->maxWidth = std::max(system->maxWidth, getMaxWidth(system, gpu, inter ? NET : GPU)); }return ncclSuccess;}

ncclTopoSearchInit就是初始化system->maxWidth，如果是单机单卡的情况，那么maxWidth设置为LOC_WIDTH，否则就遍历每个GPU节点，查看到其他所有GPU节点或者网卡最大带宽。

struct ncclTopoGraph ringGraph; ringGraph.id = 0; ringGraph.pattern = NCCL_TOPO_PATTERN_RING; ringGraph.crossNic = ncclParamCrossNic(); ringGraph.collNet = 0; ringGraph.minChannels = 1; ringGraph.maxChannels = MAXCHANNELS/2; NCCLCHECK(ncclTopoCompute(comm->topo, &ringGraph)); NCCLCHECK(ncclTopoPrintGraph(comm->topo, &ringGraph));

nccl执行集合通信时支持ring，tree和collnet三种算法，这里我们以ring来举例，后续专门介绍ring和tree。

structncclTopoGraph {// Input / outputint id; // ring : 0, tree : 1, collnet : 2int pattern;int crossNic;int collNet;int minChannels;int maxChannels;// Outputint nChannels; // 搜索到的channel数量float speedIntra; // 节点内单个channel带宽float speedInter; // 节点间单个channel带宽int typeIntra; // 节点内channel的路径类型int typeInter; // 节点间channel的路径类型int sameChannels; // channel是否一样int nHops;int intra[MAXCHANNELS*NCCL_TOPO_MAX_NODES]; // 节点内每个channel路径int inter[MAXCHANNELS*2]; // 节点间每个channel路径};

ncclTopoGraph记录了搜索到的结果，具体含义见注释。

然后看下ncclTopoCompute，这里就是实际搜索channel的过程，目标是搜索出来尽可能多，带宽尽可能大的一系列channel，本质就是暴力搜索，先设置一系列的条件搜答案，如果搜不出来则降低条件继续搜。

由于此时没有NET节点，所以crossNic为0，然后初始化graph，首先设置最高的条件，限制节点内部只能使用不超过PATH_NVL路径，节点间只能使用不超过PATH_PIX的路径，然后通过system-maxWidth设置speedIntra和speedInter，接着执行ncclTopoSearchRec搜索出一个答案存储到tmpGraph中。

如果此时就是最优的结果，channel数等于maxChannel，并且speedInter也等于maxWidth，则直接退出，否则就开始逐步降低条件，比如将sameChannel设置为0，允许channel之间不一样；调大typeIntra和typeInter；允许crossNic；调小speedInter和speedIntra。

ncclResult_t ncclTopoCompute(ncclTopoSystem* system, struct ncclTopoGraph* graph) { int ngpus = system->nodes[GPU].count; int crossNic = (system->nodes[NET].count > 1) && graph->crossNic ? 1 : 0; graph->speedIntra = graph->speedInter = 0;if (graph->crossNic == 2) graph->crossNic = 0; graph->typeIntra = ngpus == 1 ? PATH_LOC : PATH_NVL; graph->typeInter = PATH_PIX; graph->nChannels = 0; graph->sameChannels = 1;if (ngpus == 1) if (graph->pattern != NCCL_TOPO_PATTERN_RING) graph->pattern = NCCL_TOPO_PATTERN_TREE; struct ncclTopoGraph tmpGraph; memcpy(&tmpGraph, graph, sizeof(struct ncclTopoGraph));// First try crossnic, then decrease speed and finally increase speedIntra. tmpGraph.pattern = graph->pattern; int pass = 1; int speedIndex = 0;while (speedArray[speedIndex] > system->maxWidth && speedIndex < NSPEEDS-1) speedIndex++; tmpGraph.speedIntra = tmpGraph.speedInter = speedArray[speedIndex]; int64_t globalTimeout = NCCL_SEARCH_GLOBAL_TIMEOUT;search: int time = tmpGraph.sameChannels ? NCCL_SEARCH_TIMEOUT_SAMECHANNELS : tmpGraph.pattern == NCCL_TOPO_PATTERN_TREE ? NCCL_SEARCH_TIMEOUT_TREE : NCCL_SEARCH_TIMEOUT; tmpGraph.nChannels = 0; globalTimeout -= time; NCCLCHECK(ncclTopoSearchRec(system, &tmpGraph, graph, &time));// Optimal solution, stop hereif (graph->nChannels == graph->maxChannels && graph->speedInter == system->maxWidth) goto done;if (pass == 1) {// First pass, we don't have a solution yet ; try other options// Try having different channelsif (tmpGraph.sameChannels == 1) { tmpGraph.sameChannels = 0;goto search; } tmpGraph.sameChannels = 1;if (time != -1) globalTimeout += time;else globalTimeout = NCCL_SEARCH_GLOBAL_TIMEOUT;if (globalTimeout < 0) goto done; int maxTypeIntra = system->nodes[NET].count > 0 ? tmpGraph.typeInter : PATH_SYS;if (tmpGraph.typeIntra < maxTypeIntra && (graph->nChannels == 0 || tmpGraph.typeIntra < graph->typeIntra)) { tmpGraph.typeIntra += 1;goto search; } tmpGraph.typeIntra = ngpus == 1 ? PATH_LOC : PATH_NVL;

if (system->nodes[NET].count > 0 && tmpGraph.typeInter < PATH_SYS && (graph->nChannels == 0 || tmpGraph.typeInter < graph->typeInter || tmpGraph.typeInter < PATH_PXB)) {

tmpGraph.typeInter += 1;goto search; } tmpGraph.typeInter = PATH_PIX;// Try a simpler treeif (tmpGraph.pattern == NCCL_TOPO_PATTERN_SPLIT_TREE_LOOP) { tmpGraph.pattern = NCCL_TOPO_PATTERN_SPLIT_TREE;goto search; }if (tmpGraph.pattern == NCCL_TOPO_PATTERN_SPLIT_TREE) { tmpGraph.pattern = NCCL_TOPO_PATTERN_TREE;goto search; } tmpGraph.pattern = graph->pattern;if (crossNic && tmpGraph.crossNic == 0) {// Try again with crossNic if permitted tmpGraph.crossNic = crossNic;goto search; } tmpGraph.crossNic = 0;// Decrease speed until we find a solutionif ((speedIndex < NSPEEDS-1) && (graph->nChannels == 0 || (speedArray[speedIndex+1]/graph->speedInter > .49))) { tmpGraph.speedInter = tmpGraph.speedIntra = speedArray[++speedIndex];goto search; } speedIndex = 0;while (speedArray[speedIndex] > system->maxWidth && speedIndex < NSPEEDS-1) speedIndex++; tmpGraph.speedIntra = tmpGraph.speedInter = speedArray[speedIndex]; }done:// We have a solution. Start from that solution and move to pass 2.if (pass == 1) { time = -1; memcpy(&tmpGraph, graph, sizeof(tmpGraph)); speedIndex = 0;while (speedArray[speedIndex] > graph->speedInter && speedIndex < NSPEEDS-1) speedIndex++; tmpGraph.speedIntra = tmpGraph.speedInter = speedArray[speedIndex]; tmpGraph.minChannels = graph->nChannels; pass = 2; }// 3. See if we can increase speedIntra for trees (2 nodes or collnet)if (pass == 2) {if (time != 0 && graph->pattern != NCCL_TOPO_PATTERN_RING && tmpGraph.speedIntra == graph->speedIntra && tmpGraph.speedIntra < tmpGraph.speedInter*2 && speedIndex > 0) { tmpGraph.speedIntra = speedArray[--speedIndex];goto search; } time = -1; memcpy(&tmpGraph, graph, sizeof(tmpGraph)); }if (graph->nChannels == 0 && graph->collNet == 0) { WARN("Could not find a path for pattern %d, falling back to simple order\n", graph->pattern);for (int i=0; i<ngpus; i++) graph->intra[i] = system->nodes[GPU].nodes[i].gpu.rank; graph->inter[0] = graph->inter[1] = 0; graph->speedIntra = graph->speedInter = 0.1; graph->typeIntra = graph->typeInter = PATH_SYS; graph->nChannels = 1; }if (graph->speedIntra >= 25.0) { int dupChannels = std::min(graph->nChannels*2, graph->maxChannels); memcpy(graph->intra+graph->nChannels*ngpus, graph->intra, (dupChannels-graph->nChannels)*ngpus*sizeof(int)); memcpy(graph->inter+graph->nChannels*2,graph->inter, (dupChannels-graph->nChannels)*2*sizeof(int)); graph->speedIntra /= DIVUP(dupChannels, graph->nChannels); graph->speedInter /= DIVUP(dupChannels, graph->nChannels); graph->nChannels = dupChannels; }return ncclSuccess;}

然后开始搜索channel，对于ringGraph来说其实就是搜索出来一系列的环，每个rank对应这个环的一个节点，记录了环的prev和next，这里是一个回溯的过程，执行一次ncclTopoSearchRec就会得到一个环，执行一次ncclTopoSearchTryGpu看选择出来的下一个点能不能到达，执行一次ncclTopoSearchRecGpu用来找下一个GPU，接下来具体看下。

ncclResult_t ncclTopoSearchRec(struct ncclTopoSystem* system, struct ncclTopoGraph* graph, struct ncclTopoGraph* saveGraph, int* time) {

int backToNet, backToFirstRank; NCCLCHECK(ncclTopoSearchParams(system, graph->pattern, &backToNet, &backToFirstRank));if (system->nodes[NET].count) {// Start from NET ncclTopoSearchRecNet(system, graph, saveGraph, backToNet, backToFirstRank, time); } else {// Intra-node only.if (graph->nChannels == 0) {// Try PCI order first

 NCCLCHECK(ncclTopoSearchTryGpu(system, graph, saveGraph, 0, backToNet, backToFirstRank, FORCED_ORDER_PCI, time, -1, -1, 0));

} else {// Also try to replay previous channelint g; NCCLCHECK(ncclTopoReplayGetGpu(system, graph, -1, &g));

 NCCLCHECK(ncclTopoSearchTryGpu(system, graph, saveGraph, 0, backToNet, backToFirstRank, FORCED_ORDER_REPLAY, time, -1, -1, g));

}if (graph->sameChannels == 0 || graph->nChannels == 0) {// Finally, try all other possibilities unless we are forced to use the same channelsfor (int g=0; g<system->nodes[GPU].count; g++) { NCCLCHECK(ncclTopoSearchTryGpu(system, graph, saveGraph, 0, backToNet, backToFirstRank, 0, time, -1, -1, g)); } } }return ncclSuccess;}

通过ncclTopoSearchParams设置backToNet和backToFirstRank参数，单机八卡ringGraph场景下这俩参数会分别设置为-1和7，此时nchannel为0，执行ncclTopoSearchTryGpu，强制为pci顺序，就是devid的顺序，从dev0开始。

ncclResult_t ncclTopoSearchParams(struct ncclTopoSystem* system, int pattern, int* backToNet, int* backToFirstRank) {if (system->nodes[NET].count) {if (pattern == NCCL_TOPO_PATTERN_RING) *backToNet = system->nodes[GPU].count-1;elseif (pattern == NCCL_TOPO_PATTERN_TREE) *backToNet = 0;else *backToNet = 1;if (pattern == NCCL_TOPO_PATTERN_SPLIT_TREE_LOOP) *backToFirstRank = system->nodes[GPU].count-1;else *backToFirstRank = -1; } else { *backToNet = -1;

if (pattern == NCCL_TOPO_PATTERN_RING || pattern == NCCL_TOPO_PATTERN_SPLIT_TREE_LOOP) *backToFirstRank = system->nodes[GPU].count-1;

else *backToFirstRank = -1; }return ncclSuccess;}

然后执行ncclTopoSearchTryGpu，这里会判断下一个点能不能到达，因为type为-1，ncclTopoFollowPath会设置gpu为0号卡，直接执行ncclTopoSearchRecGpu，从0号卡开始搜，step为0。

ncclResult_t ncclTopoSearchTryGpu(struct ncclTopoSystem* system, struct ncclTopoGraph* graph, struct ncclTopoGraph* saveGraph, int step, int backToNet, int backToFirstRank, int forcedOrder, int *time, int type, intindex, int g) {

const uint64_t flag = 1ULL<<(graph->nChannels); struct ncclTopoNode* gpu; NCCLCHECK(ncclTopoFollowPath(system, graph, type, index, GPU, g, 1, &gpu));if (gpu) { gpu->used ^= flag; NCCLCHECK(ncclTopoSearchRecGpu(system, graph, saveGraph, gpu, step, backToNet, backToFirstRank, forcedOrder, time)); gpu->used ^= flag; NCCLCHECK(ncclTopoFollowPath(system, graph, type, index, GPU, g, -1, &gpu)); }return ncclSuccess;}

ncclResult_t ncclTopoSearchRecGpu(struct ncclTopoSystem* system, struct ncclTopoGraph* graph, struct ncclTopoGraph* saveGraph, struct ncclTopoNode* gpu, int step, int backToNet, int backToFirstRank, int forcedOrder, int *time) {

if ((*time) <= 0) return ncclSuccess; (*time)--;int ngpus = system->nodes[GPU].count;if (step == ngpus) { ... } graph->intra[graph->nChannels*ngpus+step] = gpu->gpu.rank;int g = gpu - system->nodes[GPU].nodes;if (step == backToNet) { ... } elseif (step < system->nodes[GPU].count-1) {// Go to next GPUintnext[NCCL_TOPO_MAX_NODES];int count;if (forcedOrder == FORCED_ORDER_PCI) { // Try the PCI ordernext[0] = step+1; count = 1; } elseif (forcedOrder == FORCED_ORDER_REPLAY) { // Try last channel order NCCLCHECK(ncclTopoReplayGetGpu(system, graph, step, next)); count = 1; } else { // Normal search

 NCCLCHECK(ncclTopoSearchNextGpuSort(system, graph, gpu, next, &count, backToNet == -1 ? 0 : backToNet == step+1 ? 1 : -1 ));

}for (int i=0; i<count; i++) {

 NCCLCHECK(ncclTopoSearchTryGpu(system, graph, saveGraph, step+1, backToNet, backToFirstRank, forcedOrder, time, GPU, g, next[i]));

} } elseif (step == backToFirstRank) { ... } else {// Next path NCCLCHECK(ncclTopoSearchRecGpu(system, graph, saveGraph, gpu, ngpus, -1, -1, forcedOrder, time)); }return ncclSuccess;}

然后看下ncclTopoSearchRecGpu，这里会选择下一个节点，先将0号卡节点写入到graph->intra的对应位置；由于当前step是0，因此会在xx行选择下一个GPU，next数组表示候选的GPU节点，由于forcedOrder == FORCED_ORDER_PCI，所以候选只有一个，即1号卡，然后对所有候选执行ncclTopoSearchTryGpu判断这一步是否可行并继续选择下一个节点。

然后回到ncclTopoSearchRec开始尝试判断是否可达1号卡，看下ncclTopoFollowPath，这个函数就是判断能否从type1的index1节点到达type2的index2节点，这里可以看到之前在选起点的时候type1为-1，因此直接将node设置为type2的index2就返回；这次我们要判断gpu0到gpu1是否可达，获取index1到index2的路径path，如果index1和index2的类型都是GPU那么speed就设置为graph->speedIntra，即搜索之前设置的条件，mult是函数的入参，表示需要在path上加还是减去speed，向下搜环的时候需要在path上减去speed，当回溯回去的时候需要将speed加回去，然后判断path的type是否大于之前设置的type，即graph->typeIntra，大于的话说明不可达，然后通过followPath将path上的边全都减去speed，如果有边剩下的带宽不够speed，那么通过rewind加回去，此时路径不可达；如果足够的话，则设置node为index2。

static ncclResult_t ncclTopoFollowPath(struct ncclTopoSystem* system, struct ncclTopoGraph* graph, int type1, int index1, int type2, int index2, int mult, struct ncclTopoNode** node) {

// First handle easy cases *node = system->nodes[type2].nodes+index2;if (type1 == -1) return ncclSuccess;struct ncclTopoNode* node1 = system->nodes[type1].nodes+index1;struct ncclTopoLinkList* path = node1->paths[type2]+index2;if (path->count == 0 ) return ncclSuccess;// Now check link type *node = NULL;int intra = type1 == GPU && type2 == GPU;float speed = intra ? graph->speedIntra : graph->speedInter;int type = intra ? graph->typeIntra : graph->typeInter;if (mult == 1 && (path->type > type)) return ncclSuccess; speed *= mult;// Check there is enough bandwidth on paths.int step = 0; NCCLCHECK(followPath(path, node1, path->count, speed, &step));if (step < path->count) goto rewind;// Enough bandwidth : return destination node. graph->nHops += mult*path->count; *node = system->nodes[type2].nodes+index2;return ncclSuccess;rewind:// Not enough bandwidth : rewind and exit. NCCLCHECK(followPath(path, node1, step, -speed, &step));return ncclSuccess;}

接着递归执行ncclTopoSearchRecGpu，重复上述过程，直到gpu7，这个时候graph->intra中的第一个环是[0,1,2,3,4,5,6,7]，此时step为backToFirstRank，然后通过获取第一个gpu，即gpu0，然后继续执行ncclTopoFollowPath判断7到0是否可达，如果可达的话继续递归执行ncclTopoSearchRecGpu，此时step == ngpus，即搜索到了一个环，那会将现有的graph去更新最优的saveGraph，判断标准主要是看总的带宽，即环的数量乘以speedIntra；如果搜到的环的数量已经达到maxChannel了，则结束本次搜索，否则继续递归执行ncclTopoSearchRec搜索下一个环。

ncclResult_t ncclTopoSearchRecGpu(struct ncclTopoSystem* system, struct ncclTopoGraph* graph, struct ncclTopoGraph* saveGraph, struct ncclTopoNode* gpu, int step, int backToNet, int backToFirstRank, int forcedOrder, int *time) {

if ((*time) <= 0) return ncclSuccess; (*time)--;int ngpus = system->nodes[GPU].count;if (step == ngpus) {// Determine whether we found a better solution ornotint copy = 0; graph->nChannels++; NCCLCHECK(ncclTopoCompareGraphs(graph, saveGraph, &copy));if (copy) { memcpy(saveGraph, graph, sizeof(struct ncclTopoGraph));if (graph->nChannels == graph->maxChannels) *time = -1; }if (graph->nChannels < graph->maxChannels) { NCCLCHECK(ncclTopoSearchRec(system, graph, saveGraph, time)); } graph->nChannels--;return ncclSuccess; } graph->intra[graph->nChannels*ngpus+step] = gpu->gpu.rank;int g = gpu - system->nodes[GPU].nodes;if (step == backToNet) { ... } elseif (step < system->nodes[GPU].count-1) { ... } elseif (step == backToFirstRank) {// Find first GPU and loop back to itint p; NCCLCHECK(getGpuIndex(system, graph->intra[graph->nChannels*ngpus], &p)); struct ncclTopoNode* firstGpu; NCCLCHECK(ncclTopoFollowPath(system, graph, GPU, g, GPU, p, 1, &firstGpu));if (firstGpu) { NCCLCHECK(ncclTopoSearchRecGpu(system, graph, saveGraph, firstGpu, step+1, backToNet, -1, forcedOrder, time)); NCCLCHECK(ncclTopoFollowPath(system, graph, GPU, g, GPU, p, -1, &firstGpu)); } } else {// Next path NCCLCHECK(ncclTopoSearchRecGpu(system, graph, saveGraph, gpu, ngpus, -1, -1, forcedOrder, time)); }return ncclSuccess;}

假设现在开始搜索下一个环，回到ncclTopoSearchRec，接下来会尝试复制刚刚的环，ncclTopoReplayGetGpu会获取上一个环的第step + 1个gpu，这里其实就是gpu0，然后继续执行ncclTopoSearchTryGpu，这里设置forcedOrder为FORCED_ORDER_REPLAY。

ncclResult_t ncclTopoSearchRec(struct ncclTopoSystem* system, struct ncclTopoGraph* graph, struct ncclTopoGraph* saveGraph, int* time) {

{// Also try to replay previous channelint g; NCCLCHECK(ncclTopoReplayGetGpu(system, graph, -1, &g));

 NCCLCHECK(ncclTopoSearchTryGpu(system, graph, saveGraph, 0, backToNet, backToFirstRank, FORCED_ORDER_REPLAY, time, -1, -1, g));

}}ncclResult_t ncclTopoReplayGetGpu(struct ncclTopoSystem* system, struct ncclTopoGraph* graph, int step, int* g) { *g = -1;if (graph->nChannels == 0) return ncclInternalError;int ngpus = system->nodes[GPU].count;int nextRank = graph->intra[(graph->nChannels-1)*ngpus+step+1];for (int i=0; i<ngpus; i++) if (system->nodes[GPU].nodes[i].gpu.rank == nextRank) { *g = i;return ncclSuccess; }if (*g == -1) return ncclInternalError;return ncclSuccess;}

然后FORCED_ORDER_REPLAY会在寻找下一个节点时通过ncclTopoReplayGetGpu获取上一个环对应step的gpu，因此就是一直在复制上一个环。

ncclResult_t ncclTopoSearchRecGpu(struct ncclTopoSystem* system, struct ncclTopoGraph* graph, struct ncclTopoGraph* saveGraph, struct ncclTopoNode* gpu, int step, int backToNet, int backToFirstRank, int forcedOrder, int *time) {

...elseif (step < system->nodes[GPU].count-1) {// Go to next GPUintnext[NCCL_TOPO_MAX_NODES];int count;if (forcedOrder == FORCED_ORDER_PCI) { // Try the PCI ordernext[0] = step+1; count = 1; } elseif (forcedOrder == FORCED_ORDER_REPLAY) { // Try last channel order NCCLCHECK(ncclTopoReplayGetGpu(system, graph, step, next)); count = 1; } else { // Normal search

 NCCLCHECK(ncclTopoSearchNextGpuSort(system, graph, gpu, next, &count, backToNet == -1 ? 0 : backToNet == step+1 ? 1 : -1 ));

}for (int i=0; i<count; i++) {

 NCCLCHECK(ncclTopoSearchTryGpu(system, graph, saveGraph, step+1, backToNet, backToFirstRank, forcedOrder, time, GPU, g, next[i]));

} } ...}

到这里就完成了第一次搜索，如前文所述，如果搜索出来的结果没有达到条件，就开始逐步降低条件继续搜索，接下来的过程比较类似，就不再赘述了。

然后看下多机场景下，比如两机十六卡场景，这个时候有网卡，所以ncclTopoSearchParams设置参数为backToFirstRank = -1，backToNet = 7，ncclTopoSearchRec直接执行ncclTopoSearchRecNet。

ncclResult_t ncclTopoSearchRec(struct ncclTopoSystem* system, struct ncclTopoGraph* graph, struct ncclTopoGraph* saveGraph, int* time) {

ncclTopoSearchRecNet会搜索出来一个答案，这里会遍历每个网卡，尝试用每个网卡作为起点搜索环，首先是网卡0，将0写入到inter中第一个channel中，然后将网卡0的带宽减去speedInter，maxChannel减去1，然后后边过程和上述很像，会通过ncclTopoSearchTryGpu搜索出一个环。

ncclResult_t ncclTopoSearchRecNet(struct ncclTopoSystem* system, struct ncclTopoGraph* graph, struct ncclTopoGraph* saveGraph, int backToNet, int backToFirstRank, int* time) {

const int speed = graph->speedInter;for (int n=0; n<system->nodes[NET].count; n++) { struct ncclTopoNode* net = system->nodes[NET].nodes+n; struct ncclTopoNode* gpu;if (graph->collNet && net->net.collSupport == 0) continue;if (net->net.width < speed) continue;if (net->net.maxChannels == 0) continue; graph->inter[graph->nChannels*2] = net->id;for (int i=0; i<system->nodes[NET].count; i++) {if ((system->nodes[NET].nodes[i].net.asic == net->net.asic) && (system->nodes[NET].nodes[i].net.port == net->net.port)) { system->nodes[NET].nodes[i].net.width -= speed; } } net->net.maxChannels--;// First try to replay the last channelif (graph->nChannels > 0) { int g; NCCLCHECK(ncclTopoReplayGetGpu(system, graph, -1, &g));

 NCCLCHECK(ncclTopoSearchTryGpu(system, graph, saveGraph, 0, backToNet, backToFirstRank, FORCED_ORDER_REPLAY, time, NET, n, g));

}if (graph->nChannels == 0 || graph->sameChannels == 0) {if (graph->nChannels == 0) {// Always try the PCI order first to set a reference, but don't count in the timeout nor let it run for long int t = 1 << 10;

 NCCLCHECK(ncclTopoSearchTryGpu(system, graph, saveGraph, 0, backToNet, backToFirstRank, FORCED_ORDER_PCI, &t, NET, n, 0));

if (t == -1) *time = -1; } ... }return ncclSuccess;}

ncclTopoSearchTryGpu还是会调用ncclTopoSearchRecGpu，当没有遍历完所有GPU节点时，仍然通过递归执行ncclTopoSearchRecGpu来填充graph->intra，最后遍历所有GPU之后step等于7，即backToNet，这里首先拿出来起始网卡，即网卡0，如果搜索参数支持crossNic的话就选一个合法的网卡即可，如果不支持的话就判断网卡0是否合法，合法的话将网卡0填充到graph->inter，一个环就搜索完成了。这里有一个小的疑惑点，在将出口网卡选择好后，并没有将该网卡的带宽减去speed。

ncclResult_t ncclTopoSearchRecGpu(struct ncclTopoSystem* system, struct ncclTopoGraph* graph, struct ncclTopoGraph* saveGraph, struct ncclTopoNode* gpu, int step, int backToNet, int backToFirstRank, int forcedOrder, int *time) {

if ((*time) <= 0) return ncclSuccess; (*time)--;int ngpus = system->nodes[GPU].count;if (step == ngpus) {// Determine whether we found a better solution ornotint copy = 0; graph->nChannels++; NCCLCHECK(ncclTopoCompareGraphs(graph, saveGraph, &copy));if (copy) { memcpy(saveGraph, graph, sizeof(struct ncclTopoGraph));if (graph->nChannels == graph->maxChannels) *time = -1; }if (graph->nChannels < graph->maxChannels) { NCCLCHECK(ncclTopoSearchRec(system, graph, saveGraph, time)); } graph->nChannels--;return ncclSuccess; } graph->intra[graph->nChannels*ngpus+step] = gpu->gpu.rank;int g = gpu - system->nodes[GPU].nodes;if (step == backToNet) {// first get back to NICif (system->nodes[NET].count) {int startNetIndex; NCCLCHECK(getNetIndex(system, graph->inter[graph->nChannels*2], &startNetIndex)); struct ncclTopoNode* startNet = system->nodes[NET].nodes+startNetIndex;for (int n=0; n<system->nodes[NET].count; n++) { struct ncclTopoNode* net = system->nodes[NET].nodes+n;if (graph->pattern == NCCL_TOPO_PATTERN_TREE && net->id != startNet->id) continue; // Trees are symmetricif (graph->crossNic != 1 && (net->net.asic != startNet->net.asic || net->net.port != startNet->net.port)) continue; NCCLCHECK(ncclTopoFollowPath(system, graph, GPU, g, NET, n, 1, &net));if (net) { graph->inter[graph->nChannels*2+1] = net->id; NCCLCHECK(ncclTopoSearchRecGpu(system, graph, saveGraph, gpu, step, -1, backToFirstRank, forcedOrder, time)); NCCLCHECK(ncclTopoFollowPath(system, graph, GPU, g, NET, n, -1, &net)); } } } } elseif (step < system->nodes[GPU].count-1) {// Go to next GPUintnext[NCCL_TOPO_MAX_NODES];int count;if (forcedOrder == FORCED_ORDER_PCI) { // Try the PCI ordernext[0] = step+1; count = 1; } elseif (forcedOrder == FORCED_ORDER_REPLAY) { // Try last channel order NCCLCHECK(ncclTopoReplayGetGpu(system, graph, step, next)); count = 1; } else { // Normal search

 NCCLCHECK(ncclTopoSearchNextGpuSort(system, graph, gpu, next, &count, backToNet == -1 ? 0 : backToNet == step+1 ? 1 : -1 ));

}for (int i=0; i<count; i++) {

 NCCLCHECK(ncclTopoSearchTryGpu(system, graph, saveGraph, step+1, backToNet, backToFirstRank, forcedOrder, time, GPU, g, next[i]));

} } elseif (step == backToFirstRank) { ... } else {// Next path NCCLCHECK(ncclTopoSearchRecGpu(system, graph, saveGraph, gpu, ngpus, -1, -1, forcedOrder, time)); }return ncclSuccess;}

回到ncclTopoSearchRecNet，接下来会尝试复制刚刚搜索出来的环，当搜索出一个答案后，回到第一次ncclTopoSearchRecNet，接下来会尝试从离网卡0最近的GPU开始搜索，而不是从GPU0开始，假设为GPUn，这里会先判断GPUn到PCIe switch的双向带宽是否还有空闲，如果有空闲的话才从GPUn开始搜索。但是和这里的注释表述不太相符，注释的意思是说不会将一个GPU既用来发送，又用来接收（说这种情况会影响带宽，这一点比较疑惑）。

ncclResult_t ncclTopoSearchRecNet(struct ncclTopoSystem* system, struct ncclTopoGraph* graph, struct ncclTopoGraph* saveGraph, int backToNet, int backToFirstRank, int* time) {

const int speed = graph->speedInter;for (int n=0; n<system->nodes[NET].count; n++) { ... // Then try the most local GPUs float maxWidth = 0;int minHops = 0xfffffff; struct ncclTopoLinkList* paths = net->paths[GPU];for (int g=0; g<system->nodes[GPU].count; g++) {if (paths[g].width > maxWidth) { maxWidth = paths[g].width; minHops = paths[g].count; } elseif (paths[g].width == maxWidth && paths[g].count < minHops) { minHops = paths[g].count; } }if (maxWidth >= speed) {// In the first loop, avoid using GPUs in both directions between channels (one channel // sending from that GPU and one channel receiving to that GPU), since that usually leads // to lower BW.for (int tryGpuBidir=0; tryGpuBidir<2; tryGpuBidir++) {for (int g=0; g<system->nodes[GPU].count; g++) {if (paths[g].width == maxWidth && paths[g].count == minHops) { gpu = system->nodes[GPU].nodes+g;int gpuUsed = gpuPciWidth(gpu) > 0 ? 0 : 1;if (tryGpuBidir == gpuUsed) { NCCLCHECK(ncclTopoSearchTryGpu(system, graph, saveGraph, 0, backToNet, backToFirstRank, 0, time, NET, n, g)); } } } } } } net->net.maxChannels++;for (int i=0; i<system->nodes[NET].count; i++) {if ((system->nodes[NET].nodes[i].net.asic == net->net.asic) && (system->nodes[NET].nodes[i].net.port == net->net.port)) {system->nodes[NET].nodes[i].net.width += speed; } } }return ncclSuccess;}

到这里就完成了channel的搜索。总结一下，本节就是基于机器拓扑，搜索出一组channel用于数据的通信，并记录到ncclTopoGraph。

（本文经授权后由OneFlow发布。原文：https://blog.csdn.net/KIDGIN7439/article/details/128074716）

其他人都在看

揭示GPT Tokenizer的工作原理
通俗解构语言大模型的工作原理
语言大模型100K上下文窗口的秘诀
大模型的无限上下文与数据集组合艺术
GPT-4.5前瞻：代码解释器，编程新纪元
Transformer作者：指令型智能体的构建之法
OpenAI对齐负责人：“驾驭”超级智能四年计划
试用OneFlow: github.com/Oneflow-Inc/oneflow/

继续阅读

阅读原文