About the authors:
Jie Zhao (B.Eng., Tsinghua University; Ph.D. in engineering, École Normale Supérieure, Paris, France) has long worked on advanced compiler optimization and code generation, and currently focuses on deep learning compilers.
Baoliang Li (B.Eng., Tsinghua University; Ph.D. in engineering, National University of Defense Technology) has long worked on architecture-level performance analysis and optimization, and currently focuses on the development of dedicated deep learning accelerator chips and the associated system software and programming languages.
With the rapid growth of artificial intelligence, big data, and cloud computing, chip design based on domain-specific architectures (DSAs) [1-5] is attracting more and more attention. DSAs, however, also differ greatly from general-purpose CPUs and GPUs in microarchitecture and programming model [6-8]. To exploit the performance advantages of a DSA while keeping programming flexible, the compiler must bridge the wide gap between the high-level programming model and the underlying hardware, reducing the programming burden as much as possible without giving up hardware performance. System performance optimization covers a large and varied set of techniques, and optimizations of different granularities and with different goals can be applied at every level of the stack. For deep-learning compilation in particular, several years of effort [9-14] have led the community to an end-to-end optimization scheme that combines computation-graph optimizations specific to deep learning with classical compiler optimizations from general-purpose computing.
In this flow, the deep learning application is first converted into a computation-graph description, on which hardware-independent graph-level optimizations are applied [15-21]; subgraphs of the computation graph are then lowered to a loop-centric computational description, on which hardware-dependent loop optimizations are performed [12,14]. Within the hardware-dependent loop optimization stage, the compiler's work can again be divided into two parts: high-level loop optimization performs loop transformations such as tiling and fusion, which practitioners in the chip and architecture communities know relatively well, while low-level loop optimization applies mature intra-loop optimizations such as register allocation and instruction pipelining. This article focuses on high-level loop optimization.

How to use compiler optimization theory and techniques to deploy a deep learning algorithm described as a computation graph onto concrete hardware, and to make it run efficiently there, is an important research topic in both academia and industry. More on compilation techniques for deep learning chips can be found in the new Tsinghua University Press book 《多面体编译理论与深度学习实践》 (Polyhedral Compilation Theory and Deep Learning Practice), currently available for pre-order at 50% off (the discount runs until December 20, 2022).

Throughout this process, loop optimization plays a crucial role, because the core work of a deep learning algorithm is carried out inside loops. Loop optimization involves dependence analysis, loop transformation, instruction scheduling, memory management, code generation, and more, making it a very complex step. Recent advances in compilation techniques for deep learning have produced compiler frameworks built around loop tiling [22,23] and frameworks that treat loop fusion as the central optimization [12,14,25,26], but high-level loop optimization covers far more than these two transformations. Before an algorithm described as a computation graph can be analyzed, optimized, and turned into source code or an executable that runs on the target platform, high-level loop optimization goes through at least the following ten steps.
① Lowering/Hoisting to IR
An IR (Intermediate Representation) is the unified internal representation a compiler builds so that it can support language specifications for different programming models on one side and code generation for different backends on the other [27-31]. Compared with high-level programming models, an IR sits at a lower level, but as the pass-based approach to building compilation systems has spread across domains [13], compiler IRs have themselves appeared at many levels. Lowering the high-level algorithm description into an IR the compiler can accept is the first step of loop optimization, but this step is not necessarily a lowering from a higher-level programming language into the compiler IR; it may also be a hoisting from some lower-level IR into a higher-level one.

Either way, whether lowering from a high-level language or hoisting from a lower-level IR, the compiler must extract all kinds of information about the loops at this stage, including but not limited to loop bounds, strides, and types, the constraints imposed by control-flow statements, and the array and scalar access relations inside the loops. Extracting this information is the foundation for the optimizations that follow. To represent it accurately, the front end of a loop-optimizing compiler relies on established theoretical tools such as basic sets and maps, interval analysis, abstract interpretation, and predicate logic, and, to preserve program semantics as completely as possible, these tools need to be applied as broadly as possible. Choosing a theoretical foundation powerful enough to describe and cover all of this information accurately is therefore the key question at this stage.
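As a concrete (and purely illustrative) example, the C loop nest below is annotated with the kind of information such a front end would extract; the set-and-map notation in the comments is informal and only meant to convey the idea.

```c
/* A simple 2-D loop nest. The comments sketch the iteration domain and
 * access relations a loop-optimizing front end would extract from it.  */
void scale_add(int N, int M, float A[N][M], const float B[N][M], float c)
{
    for (int i = 0; i < N; i++)        /* iteration domain of S:                */
        for (int j = 0; j < M; j++)    /*   { S[i,j] : 0 <= i < N, 0 <= j < M } */
            /* write access: S[i,j] -> A[i][j]
             * read accesses: S[i,j] -> A[i][j] and S[i,j] -> B[i][j]           */
            A[i][j] = A[i][j] + c * B[i][j];   /* statement S */
}
```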
② Dependence Analysis
Once the loop information has been extracted, the optimizing compiler can determine the dependences in the program. Here, dependence analysis refers mainly to data dependences between loop iterations [32]. Dependences decide whether the loop optimizations that follow can be applied legally. The main kinds are flow dependences, where the same memory location is written and then read; anti dependences, where it is read and then written; and output dependences, where it is written twice. Dataflow analysis of a program refers specifically to the write-then-read flow dependences, because only these correspond to data transfers that the algorithm's execution actually requires. Anti and output dependences, by contrast, arise only because limited machine resources force storage to be shared; given sufficient resources, they can be eliminated. Two reads of the same memory location carry no dependence between them, but exploiting such repeated reads effectively can improve the program's spatial locality.

Based on these dependence types, the optimizing compiler can decide whether dependences exist between loop statements. The test can be quite simple, because in some domain-specific programs the array access relations between statements are simple; the dependences between statements in deep learning algorithms are one example. In the general case, however, dependence testing can be a very complex process: in some early optimizing compilers, limited compute power meant dependences could only be determined approximately, or by proving or disproving their existence [33]. With the computing power of modern machines, dependence testing can be cast as fairly complex integer linear programming problems [34-36], so the compiler can analyze dependences not only qualitatively but also, more precisely, quantitatively, and it can analyze them not only between statements but also, at a finer granularity, between the different dynamic instances of each statement. Using the results of dependence analysis, the optimizing compiler can then apply transformations and optimizations that improve performance.
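The small C loop below (my own sketch, not drawn from the article) contains all three dependence types discussed above:

```c
/* Illustrative loop exhibiting the three dependence types. */
void deps(int N, float a[N], float b[N])
{
    for (int i = 1; i < N; i++) {
        a[i] = b[i] + 1.0f;       /* S1: reads b[i], writes a[i]                      */
        b[i] = a[i - 1] * 2.0f;   /* S2: reads a[i-1] written one iteration earlier
                                     -> loop-carried flow (read-after-write) dep;
                                     writes b[i] after S1 read it
                                     -> anti (write-after-read) dep in this iteration */
        a[i] = a[i] + b[i];       /* S3: writes a[i] already written by S1
                                     -> output (write-after-write) dep                */
    }
}
```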
③ Scheduling for Validity
Scheduling can be viewed as the process by which the optimizing compiler, while respecting the dependences, transforms the program to maximize one or several program properties; the component that performs it is called the scheduler or scheduling algorithm [37-40]. Guaranteeing that the transformed program still satisfies the dependence constraints is a condition every scheduling algorithm must meet, but how well it exploits the dependences to maximize one or more program properties is what separates a good scheduler from a poor one. As noted in ②, read-read relations on the same memory location can also be exploited by the scheduler to improve the program's spatial locality, and minimizing the distance between statement instances linked by flow dependences improves the program's temporal locality.

Besides locality, the scheduling algorithm must also consider parallelism, and a loop-optimizing compiler is mainly concerned with parallelism across different loop iterations. When a statement sits inside a multi-level loop nest, the scheduler should try to place the parallelism at the outer levels, which maximizes the granularity of parallelism; when dependences exist between statement instances from different iterations, outer-level parallelism also reduces communication and synchronization between the parallel compute units. Furthermore, since loop tiling has become indispensable in modern large-scale applications, the scheduler must also make sure the transformed loop nest can be tiled legally. Scheduling has thus evolved from maximizing a single program property into an optimization problem that maximizes several properties at once [41,42], and trading these properties off against one another is itself something the scheduling algorithm has to address.
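A tiny example of such a schedule transformation (again only a sketch of my own): loop interchange moving the parallel dimension outward.

```c
/* Before: the outer i-loop carries a flow dependence (a[i][j] uses a[i-1][j]),
 * so only the inner j-loop is parallel.                                        */
void before(int N, int M, float a[N][M])
{
    for (int i = 1; i < N; i++)
        for (int j = 0; j < M; j++)
            a[i][j] = a[i - 1][j] + 1.0f;
}

/* After interchange (legal because the dependence distance (1,0) stays
 * lexicographically positive as (0,1)), the dependence-free j-loop is now
 * outermost, which is the outer-level parallelism a scheduler tries to expose. */
void after(int N, int M, float a[N][M])
{
    for (int j = 0; j < M; j++)        /* parallel */
        for (int i = 1; i < N; i++)    /* sequential: carries the dependence */
            a[i][j] = a[i - 1][j] + 1.0f;
}
```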
④ Fusion for Consecutivity
For practitioners in deep learning compilation, loop fusion is probably a very familiar transformation. For deep learning algorithms with relatively simple compute patterns, a compiler engineer can apply it by specifying the fusion scheme by hand, but in more general applications, where the computation and memory-access behavior of the statements takes more complex forms, a dedicated fusion algorithm is needed to automate the process. On top of a schedule that has already maximized several program properties, fusion can turn arrays that would otherwise have to be exchanged across an outer loop into communication between inner loops. This reduction in the granularity of communication or synchronization serves several goals, such as relieving bandwidth pressure and converting accesses to slow memory into accesses to fast on-chip buffers.

One side effect fusion may have is to destroy some other program property. For example, after several loops are fused, the fused loop may lose its parallelism, or the loop level that used to be tileable may no longer be. Nor can loops be fused without limit: the benefit also depends on the capacity of the fast memory, and if, because of resource limits, the fused loop fails to obtain the intended memory-access optimization, fusion brings hardly any gain. Like scheduling, then, fusion is a process of trading off against parallelism, tiling, and other concerns.
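A minimal producer-consumer sketch of the effect (my own example, not taken from the article):

```c
/* Unfused: the intermediate array t is produced in full and then consumed,
 * so every element of t makes a round trip through memory.                 */
void unfused(int N, const float x[N], float y[N], float t[N])
{
    for (int i = 0; i < N; i++)
        t[i] = x[i] * 2.0f;        /* producer loop */
    for (int i = 0; i < N; i++)
        y[i] = t[i] + 1.0f;        /* consumer loop */
}

/* Fused: producer and consumer share each iteration, so the intermediate
 * value can live in a register instead of an N-element buffer.             */
void fused(int N, const float x[N], float y[N])
{
    for (int i = 0; i < N; i++) {
        float t = x[i] * 2.0f;
        y[i] = t + 1.0f;
    }
}
```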
⑤ Tiling for Atomicity
Loop tiling is probably the loop transformation best known to compiler-optimization researchers, and in recent years scheduling and optimization frameworks built around tiling have been proposed for deep learning compilation. As the discussion in ③ showed, scheduling is in many cases itself a transformation in service of tiling legality, and as ④ showed, tiling must also be traded off against fusion and other transformations. Implementing tiling breaks down into deciding the tile shape and deciding the tile size. The tile shape is usually dictated by the program's dependences. In traditional high-performance computing applications, regular shapes such as rectangles and parallelograms are the most common [39], but as the need arose to tile special computation patterns, many other shapes were developed [43,44,45]; the classic example in deep learning is the overlapped tiling used for convolutions [25]. Every special tile shape, however, sacrifices one program property to improve another: overlapped tiling, for instance, trades redundant computation for parallelism between tiles.

Choosing the tile size, on the other hand, is a problem independent of the program's dependences, yet it plays a decisive role in the final performance. Today, tile sizes are determined either through performance tuning [46-49], which searches for the best size, or through parameterized tiling, which shifts the burden from static compile-time analysis to run time and fixes the concrete sizes by measuring compute-resource utilization and memory efficiency during execution. Parameterized tiling is also closely related to the dynamic-shape problem [15,50-52] that deep learning compilers are currently working hard to solve.
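The canonical example is a tiled matrix multiplication; the version below is only a sketch with an arbitrary tile size (in practice the size would come from tuning, see ⑩):

```c
#define T 64   /* illustrative tile size; normally chosen by tuning */

/* Tiled matrix multiplication: each (ii, jj, kk) tile works on three T x T
 * blocks, so the working set can be sized to fit a cache or scratchpad.
 * The extra "< N" tests handle boundary tiles when N is not a multiple of T.
 * C is assumed to be zero-initialized by the caller.                        */
void matmul_tiled(int N, float C[N][N], const float A[N][N], const float B[N][N])
{
    for (int ii = 0; ii < N; ii += T)
        for (int jj = 0; jj < N; jj += T)
            for (int kk = 0; kk < N; kk += T)
                for (int i = ii; i < ii + T && i < N; i++)
                    for (int j = jj; j < jj + T && j < N; j++)
                        for (int k = kk; k < kk + T && k < N; k++)
                            C[i][j] += A[i][k] * B[k][j];
}
```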
⑥ Intra-tile Rescheduling
The schedule computed in step ③ was designed to maximize the program properties that favor fusion and tiling. Once those transformations are in place, the properties the program needs maximized may change, and may even become the opposite of what was wanted before [53]. Take fusion as an example: before tiling, bringing accesses to the same memory location as close together as possible helps reduce how often the program has to reach out to the outer levels of the memory hierarchy; after tiling, however, a tile can be treated as an atomic operation, and distributing the previously fused statements apart again makes it easier to map different computation statements onto different functional units.

The other issue rescheduling must consider is parallelism in the inner loops, which pursues the same goal as classical vectorization or the tensorization step in deep learning compilers. Because the desired program properties have changed, the optimizing compiler may need a different scheduling algorithm here. In particular, tensorization in deep learning workloads also has to deal with the performance differences caused by different tensor shapes, so the rescheduling step must be able to produce tensorized code for these different shapes, and picking the best among the resulting variants may again call for performance tuning.
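As one assumed illustration of intra-tile rescheduling for inner-loop parallelism, the tile body below is ordered so that the innermost loop has unit-stride accesses, the form a vectorizer (or a tensorization pass targeting a matrix unit) can exploit:

```c
/* Inside one T x T tile of a matrix multiplication: the loops are ordered
 * i-k-j so that A[i][k] is invariant in the innermost loop and C and B are
 * accessed with unit stride there, making the j-loop easy to vectorize or
 * to hand to a hardware matrix/vector instruction.                         */
void tile_body(int T, float C[T][T], const float A[T][T], const float B[T][T])
{
    for (int i = 0; i < T; i++)
        for (int k = 0; k < T; k++) {
            float a = A[i][k];                 /* hoisted loop-invariant load */
            for (int j = 0; j < T; j++)        /* unit stride in C and B      */
                C[i][j] += a * B[k][j];
        }
}
```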
⑦ Storage Management
Unlike CPU caches, which are managed by hardware, the fast memories of a DSA usually have to be managed manually or by software, for example moving data from global memory into shared memory on a GPU, or copying data written to shared memory back out to global memory. Inserting these extra data-movement statements requires choosing not only an appropriate insertion point but also a transfer size that satisfies the program's correctness constraints. If the copy statements are placed too deep inside the loop nest, the number of transfers grows and performance suffers accordingly; at the same time, each copy must be matched with the corresponding computation step, so that the data is already sitting in the right memory when the computation executes.

How arrays are allocated in fast memory can also be affected by the tile shape. Under some special tile shapes, the memory footprint of a tile may itself be irregular [6,14,54]; carving out and allocating such irregular regions in the address space is usually impractical, so the space the data occupies has to be over-approximated, which may waste part of the address space. Minimizing this waste is the problem this stage has to solve.
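A rough sketch of software-managed staging (the local buffer here simply stands in for a GPU's shared memory or a DSA scratchpad; it is an assumption for illustration, not the API of any particular platform):

```c
#define TS 64   /* illustrative tile size */

/* Matrix-vector product with explicit staging: one TS x TS tile of B is
 * copied into a small local buffer, the computation runs on the staged
 * copy, and only then does the next tile get loaded. The copy loops sit
 * outside the compute loops so each element is transferred once per tile.
 * y is assumed to be zero-initialized by the caller.                      */
void staged_matvec(int N, float y[N], const float B[N][N], const float x[N])
{
    static float Bbuf[TS][TS];             /* stand-in for fast local memory */

    for (int ii = 0; ii < N; ii += TS)
        for (int jj = 0; jj < N; jj += TS) {
            /* copy-in: one tile of B */
            for (int i = 0; i < TS && ii + i < N; i++)
                for (int j = 0; j < TS && jj + j < N; j++)
                    Bbuf[i][j] = B[ii + i][jj + j];
            /* compute on the staged tile */
            for (int i = 0; i < TS && ii + i < N; i++)
                for (int j = 0; j < TS && jj + j < N; j++)
                    y[ii + i] += Bbuf[i][j] * x[jj + j];
        }
}
```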
⑧ Hoisting/Lowering to IR
This stage can be the inverse of step ①: the optimized compiler IR is hoisted or lowered into a unified representation, which may be an abstract syntax tree or the IR that served as input to step ①. Since an optimizing compiler usually has to support multiple backends, this unified representation is the basis from which code generation for each backend is later customized.

Note that this hoisting/lowering stage is more than a simple IR translation [55]; some special loop transformations are in fact easier to implement here. A typical example is unrolling a loop by a given factor: with the earlier optimizations already in place, unrolling an innermost loop can amount to merely marking that loop earlier and performing the actual unrolling at this stage. Choosing the unroll factor is, once again, a trade-off against other program properties. In addition, because the earlier loop transformations may leave the IR with very complicated loop bounds, this stage can also simplify loop bounds, distribute loops, and so on, reducing the complexity of the generated code [56,57]. Provided the loop-bound constraints are judged correctly, redundant loop control conditions can also be removed here, lowering the control-flow complexity and even eliminating dead loops.
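For concreteness, a sketch of what unrolling an innermost loop by a factor of 4 produces (the factor is arbitrary here):

```c
/* saxpy with the innermost loop unrolled by 4. The main loop handles
 * groups of four iterations; the epilogue loop covers trip counts that
 * are not a multiple of the unroll factor.                              */
void saxpy_unrolled(int N, float y[N], const float x[N], float a)
{
    int i = 0;
    for (; i + 3 < N; i += 4) {       /* unrolled body */
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }
    for (; i < N; i++)                /* remainder (epilogue) loop */
        y[i] += a * x[i];
}
```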
⑨ Code Generation
Building on step ⑧, the optimizing compiler generates source code, or lower-level binary code, that fits the target architecture. This step is relatively simple: the result of step ⑧ is translated and emitted statement by statement. The stage can still perform some finer-grained customization based on the target platform's characteristics, for example choosing signed or unsigned integer types for loop index variables on certain platforms [54], but compared with the preceding optimizations this customization is far simpler.
⑩ Tuning for Performance
The goal of performance tuning [46-49,58-61] is to select, from the many generated code versions, the one with the best performance as the final output handed to the user for efficient execution on the target platform. Although the preceding steps transform and optimize the program according to the characteristics of the target platform, the program's final performance still cannot be fully predicted, because some program properties cannot be pinned down during those optimizations, such as the tile sizes and the unroll factors. Moreover, the loop transformations above do not form one fixed sequence, so finding an ordering of these transformations that yields better performance is also part of what tuning has to solve. Performance tuning is an important direction of current research on deep learning compiler optimization; it has seen a fair number of breakthroughs in recent years and has become a key part of how many chip products are delivered.
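Production autotuners [46-49,58-61] use learned cost models and much smarter search, but the core idea can be sketched in a few lines; the candidate tile sizes and the timing method below are purely assumptions for illustration:

```c
#include <stdio.h>
#include <time.h>

#define N 512   /* problem size chosen so every candidate tile size divides it */

static float A[N][N], B[N][N], C[N][N];

/* Tiled matmul with the tile size as a runtime parameter. */
static void matmul_tiled(int T)
{
    for (int ii = 0; ii < N; ii += T)
        for (int kk = 0; kk < N; kk += T)
            for (int jj = 0; jj < N; jj += T)
                for (int i = ii; i < ii + T; i++)
                    for (int k = kk; k < kk + T; k++)
                        for (int j = jj; j < jj + T; j++)
                            C[i][j] += A[i][k] * B[k][j];
}

/* Naive empirical tuning: run each candidate, time it, keep the fastest. */
int main(void)
{
    const int candidates[] = { 16, 32, 64, 128 };
    int best = candidates[0];
    double best_time = 1e30;

    for (int c = 0; c < 4; c++) {
        clock_t t0 = clock();
        matmul_tiled(candidates[c]);
        double elapsed = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("tile %3d: %.3f s\n", candidates[c], elapsed);
        if (elapsed < best_time) { best_time = elapsed; best = candidates[c]; }
    }
    printf("best tile size: %d\n", best);
    return 0;
}
```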
From the analysis of these steps it is not hard to see that high-level loop compilation is a very complex process. The purpose of this article is to give readers who want to do research or build the supporting software in this area an initial picture; the process described here is not a single, standardized flow, and many of the specifics will need to be adapted and extended to the situation at hand. As noted above, loop optimization involves dependence analysis, loop transformation, instruction scheduling, storage management, and many other tightly coupled concerns. After a long period of exploration, academia and industry have gradually converged on a loop optimization framework built on polyhedral compilation theory. Polyhedral theory, grounded in sets and relations, unifies the representation and optimization of loops within one theoretical framework and has been applied successfully in both academia and industry. The book introduced above is a summary of this body of prior work.
- End -

References
[1] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. 2017. In-Datacenter Performance Analysis of a Tensor Processing Unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture (Toronto, ON, Canada) (ISCA’17). ACM, New York, NY, USA, 1-12. https://doi.org/10.1145/3079856.3080246
[2] Zhe Jia, Blake Tillman, Marco Maggioni, and Daniele Paolo Scarpazza. 2019. Dissecting the Graphcore IPU Architecture via Microbenchmarking. arXiv:1912.03413 [cs.DC]
[3] Kamil Rocki, Dirk Van Essendelft, Ilya Sharapov, Robert Schreiber, Michael Morrison, Vladimir Kibardin, Andrey Portnoy, Jean Francois Dietiker, Madhava Syamlal, and Michael James. 2020. Fast Stencil Code Computation on a Wafer-Scale Processor. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (Atlanta, Georgia) (SC’20). IEEE Press, Article 58, 14 pages.
[4] Yongwei Zhao, Zidong Du, Qi Guo, Shaoli Liu, Ling Li, Zhiwei Xu, Tianshi Chen, and Yunji Chen. 2019. Cambricon-F: Machine Learning Computers with Fractal von Neumann Architecture. In Proceedings of the 46th International Symposium on Computer Architecture (Phoenix, Arizona) (ISCA’19). ACM, New York, NY, USA, 788-801. https://doi.org/10.1145/3307650.3322226
[5] Liao, H., Tu, J., Xia, J., Liu, H., Zhou, X., Yuan, H., and Hu, Y. Ascend: a scalable and unified architecture for ubiquitous deep neural network computing : Industry track paper. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 789–801, 2021. doi: 10.1109/HPCA51647.2021.00071.
[6] Jie Zhao, Bojie Li, Wang Nie, Zhen Geng, Renwei Zhang, Xiong Gao, Bin Cheng, Chen Wu, Yun Cheng, Zheng Li, Peng Di, Kun Zhang, and Xuefeng Jin. 2021. AKG: Automatic Kernel Generation for Neural Processing Units Using Polyhedral Transformations. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation (Virtual, Canada) (PLDI 2021). Association for Computing Machinery, New York, NY, USA, 1233–1248. https://doi.org/10.1145/3453483.3454106
[7] Zheng, S., Chen, R., Wei, A., Jin, Y., Han, Q., Lu, L., ... & Liang, Y. (2022, June). AMOS: enabling automatic mapping for tensor computations on spatial accelerators with hardware abstraction. In ISCA (pp. 874-887).
[8] Zhou, Y., Dong, X., Meng, T., Tan, M., Akin, B., Peng, D., ... & Laudon, J. (2022). Towards the Co-design of Neural Networks and Accelerators. Proceedings of Machine Learning and Systems, 4, 141-152.
[9] Google. 2017. XLA: Optimizing Compiler for Machine Learning. https://www.tensorflow.org/xla
[10] Venmugil Elango, Norm Rubin, Mahesh Ravishankar, Hariharan Sandanagobalane, and Vinod Grover. 2018. Diesel: DSL for Linear Algebra and Neural Net Computations on GPUs. In Proceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages (Philadelphia, PA, USA) (MAPL 2018). ACM, New York, NY, USA, 42–51. https://doi.org/10.1145/3211346.3211354
[11] Riyadh Baghdadi, Jessica Ray, Malek Ben Romdhane, Emanuele Del Sozzo, Abdurrahman Akkas, Yunming Zhang, Patricia Suriana, Shoaib Kamil, and Saman Amarasinghe. 2019. Tiramisu: A Polyhedral Compiler for Expressing Fast and Portable Code. In Proceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization (Washington, DC, USA) (CGO 2019). IEEE Press, Piscataway, NJ, USA, 193-205. http://dl_acm.gg363.site/citation.cfm?id=3314872.3314896
[12] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: An Automated End-to-end Optimizing Compiler for Deep Learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (Carlsbad, CA, USA) (OSDI'18). USENIX Association, Berkeley, CA, USA, 579-594. http://dl.acm.org/citation.cfm?id=3291168.3291211
[13] Chris Lattner, Mehdi Amini, Uday Bondhugula, Albert Cohen, Andy Davis, Jacques Pienaar, River Riddle, Tatiana Shpeisman, Nicolas Vasilache, and Oleksandr Zinenko. 2021. MLIR: Scaling Compiler Infrastructure for Domain Specific Computation. In 2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). 2-14. https://doi.org/10.1109/CGO51591.2021.9370308
[14] Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary Devito, William S. Moses, Sven Verdoolaege, Andrew Adams, and Albert Cohen. 2019. The Next 700 Accelerated Layers: From Mathematical Expressions of Network Computation Graphs to Accelerated GPU Kernels, Automatically. ACM Trans. Archit. Code Optim. 16, 4, Article 38 (Oct. 2019), 26 pages. https://doi.org/10.1145/3355606
[15] Kwon, W., Yu, G.-I., Jeong, E., and Chun, B.-G. Nimble: Lightweight and parallel gpu task scheduling for deep learning. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 8343–8354. Curran Associates, Inc., 2020. https://proceedings.neurips.cc/paper/2020/file/5f0ad4db43d8723d18169b2e4817a160-Paper.pdf
[16] Jie Zhao, Xiong Gao, Ruijie Xia, Zhaochuang Zhang, Deshi Chen, Lei Chen, Renwei Zhang, Zhen Geng, Bin Cheng, and Xuefeng Jin. 2022. Apollo: Automatic Partition-based Operator Fusion through Layer by Layer Optimization. In Proceedings of Machine Learning and Systems, Diana Marculescu, Yuejie Chi, and Carole-Jean Wu (Eds.), Vol. 4. 1–19.
[17] Zhou, Y., Roy, S., Abdolrashidi, A., Wong, D., Ma, P., Xu, Q., ... & Laudon, J. (2020). Transferable graph optimizers for ml compilers. Advances in Neural Information Processing Systems, 33, 13844-13855.
[18] Yaoyao Ding, Ligeng Zhu, Zhihao Jia, Gennady Pekhimenko, and Song Han. Ios: Inter-operator scheduler for cnn acceleration. In A. Smola, A. Dimakis, and I. Stoica, editors, Proceedings of Machine Learning and Systems, volume 3, pages 1–14, 2021.
[19] Jia, Z., Padon, O., Thomas, J., Warszawski, T., Zaharia, M., and Aiken, A. Taso: Optimizing deep learning computation with automatic generation of graph substitutions. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, SOSP’19, pp. 47–62, New York, NY, USA, 2019a. ACM. https://doi.org/10.1145/3341301.3359630.
[20] Yang, Y., Phothilimthana, P., Wang, Y., Willsey, M., Roy, S., & Pienaar, J. (2021). Equality saturation for tensor graph superoptimization. Proceedings of Machine Learning and Systems, 3, 255-268.
[21] Wang, H., Zhai, J., Gao, M., Ma, Z., Tang, S., Zheng, L., ... & Jia, Z. (2021). {PET}: Optimizing Tensor Programs with Partially Equivalent Transformations and Automated Corrections. In 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21) (pp. 37-54).
[22] Lingxiao Ma, Zhiqiang Xie, Zhi Yang, Jilong Xue, Youshan Miao, Wei Cui, Wenxiang Hu, Fan Yang, Lintao Zhang, and Lidong Zhou. 2020. Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). USENIX Association, 881-897. https://www.usenix.org/conference/osdi20/presentation/ma
[24] Hongyu Zhu, Ruofan Wu, Yijia Diao, Shanbin Ke, Haoyu Li, Chen Zhang, Jilong Xue, Lingxiao Ma, Yuqing Xia, Wei Cui, Fan Yang, Mao Yang, Lidong Zhou, Asaf Cidon, and Gennady Pekhimenko. ROLLER: Fast and efficient tensor compilation for deep learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 233–248, Carlsbad, CA, July 2022. USENIX Association.
[25] Jie Zhao and Peng Di. 2020. Optimizing the Memory Hierarchy by Compositing Automatic Transformations on Computations and Data. In Proceedings of the 53rd IEEE/ACM International Symposium on Microarchitecture (Virtual Event, Athens, Greece) (MICRO-53). IEEE Press, Piscataway, NJ, USA, 427–441. https://doi.org/10.1109/MICRO50266.2020.00044
[26] Abhinav Jangda and Uday Bondhugula. 2018. An Effective Fusion and Tile Size Model for Optimizing Image Processing Pipelines. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (Vienna, Austria) (PPoPP'18). ACM, New York, NY, USA, 261–275. https://doi.org/10.1145/3178487.3178507
[27] Feng, S., Hou, B., Jin, H., Lin, W., Shao, J., Lai, R., ... & Chen, T. (2022). Tensorir: An abstraction for automatic tensorized program optimization. arXiv preprint arXiv:2207.04296.
[28] Tillet, P., Kung, H. T., & Cox, D. (2019, June). Triton: an intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages (pp. 10-19).
[29] Roesch, J., Lyubomirsky, S., Kirisame, M., Weber, L., Pollock, J., Vega, L., ... & Tatlock, Z. (2019). Relay: A high-level compiler for deep learning. arXiv preprint arXiv:1904.08368.
[30] Cyphers, S., Bansal, A. K., Bhiwandiwalla, A., Bobba, J., Brookhart, M., Chakraborty, A., Constable, W., Convey, C., Cook, L., Kanawi, O., Kimball, R., Knight, J., Korovaiko, N., Kumar, V., Lao, Y., Lishka, C. R., Menon, J., Myers, J., Narayana, S. A., Procter, A., and Webb, T. J. Intel ngraph: An intermediate representation, compiler, and executor for deep learning, 2018.
[31] Nadav Rotem, Jordan Fix, Saleem Abdulrasool, Garret Catron, Summer Deng, Roman Dzhabarov, Nick Gibson, James Hegeman, Meghan Lele, Roman Levenstein, Jack Montgomery, Bert Maher, Satish Nadathur, Jakob Olesen, Jongsoo Park, Artem Rakhov, Misha Smelyanskiy, and Man Wang. 2018. Glow: Graph Lowering Compiler Techniques for Neural Networks. arXiv:1805.00907 [cs.PL]
[32] Ken Kennedy and John R. Allen. Optimizing Compilers for Modern Architectures: A Dependence-based Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2002.
[33] Utpal Banerjee, Rudolf Eigenmann, Alexandru Nicolau, et al. Automatic program parallelization. Proceedings of the IEEE 81.2 (1993), pp. 211–243.
[34] William Pugh. A practical algorithm for exact array dependence analysis. Communications of the ACM 35.8 (1992), pp. 102-114.
[35] Paul Feautrier. Dataflow analysis of array and scalar references. International Journal of Parallel Programming, 20(1):23–53, Feb 1991.
[36] William Pugh and David Wonnacott. An exact method for analysis of value-based array data dependences. In Proceedings of the 6th International Workshop on Languages and Compilers for Parallel Computing, pages 546–566, London, UK, 1994. Springer-Verlag.
[37] Paul Feautrier. Some efficient solutions to the affine scheduling problem. i. one-dimensional time. International Journal of Parallel Programming, 21(5):313–347, Oct 1992.
[38] Paul Feautrier. 1992. Some efficient solutions to the affine scheduling problem. Part II. Multidimensional time. International journal of parallel programming 21, 6 (1992), 389-420.
[39] Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan. 2008. A Practical Automatic Polyhedral Parallelizer and Locality Optimizer. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation (Tucson, AZ, USA) (PLDI’08). ACM, New York, NY, USA, 101-113. https://doi.org/10.1145/1375581.1375595
[40] Sven Verdoolaege and Gerda Janssens. 2017. Scheduling for PPCG. Report CW 706 (2017).
[41] Martin Kong and Louis-Noël Pouchet. 2019. Model-Driven Transformations for Multi- and Many-Core CPUs. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation (Phoenix, AZ, USA) (PLDI 2019). ACM, New York, NY, USA, 469-484. https://doi.org/10.1145/3314221.3314653
[42] Lorenzo Chelini, Tobias Gysi, Tobias Grosser, Martin Kong, and Henk Corporaal. 2020. Automatic Generation of Multi-Objective Polyhedral Compiler Transformations. In Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques (Virtual Event, GA, USA) (PACT’20). ACM, New York, NY, USA, 83-96. https://doi.org/10.1145/3410463.3414635
[43] S. Krishnamoorthy, M. Baskaran, U. Bondhugula, J. Ramanujam, A. Rountev, and P. Sadayappan, “Effective automatic parallelization of stencil computations,” in Proceedings of the 28th ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI ’07. New York, NY, USA: ACM, 2007, pp. 235–244. http://doi.acm.org/10.1145/1250734.1250761
[44] T. Grosser, A. Cohen, J. Holewinski, P. Sadayappan, and S. Verdoolaege, “Hybrid hexagonal/classical tiling for gpus,” in Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization, ser. CGO ’14. New York, NY, USA: ACM, 2014, pp. 66:66–66:75. [Online]. Available: http://doi.acm.org/10.1145/2544137.2544160
[45] U. Bondhugula, V. Bandishti, and I. Pananilath, “Diamond tiling: Tiling techniques to maximize parallelism for stencil computations,” IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 5, pp. 1285–1298, Oct. 2017.
[46] Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, Joseph E. Gonzalez, and Ion Stoica. 2020. Ansor: Generating High-Performance Tensor Programs for Deep Learning. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). USENIX Association, 863–879. https://www.usenix.org/conference/osdi20/presentation/zheng
[47] Size Zheng, Yun Liang, Shuo Wang, Renze Chen, and Kaiwen Sheng. 2020. FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (Lausanne, Switzerland) (ASPLOS’20). ACM, New York, NY, USA, 859-873. https://doi.org/10.1145/3373376.3378508
[48] Andrew Adams, Karima Ma, Luke Anderson, Riyadh Baghdadi, Tzu-Mao Li, Michaël Gharbi, Benoit Steiner, Steven Johnson, Kayvon Fatahalian, Frédo Durand, and Jonathan Ragan-Kelley. 2019. Learning to Optimize Halide with Tree Search and Random Programs. ACM Trans. Graph. 38, 4, Article 121 (July 2019), 12 pages. https://doi.org/10.1145/3306346.3322967
[49] Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. Learning to optimize tensor programs. In Advances in Neural Information Processing Systems. 3389-3400.
[50] Zhu, K., Zhao, W. Y., Zheng, Z., Guo, T. Y., Zhao, P. Z., Bai, J. J., ... & Lin, W. (2021, April). DISC: A dynamic shape compiler for machine learning workloads. In Proceedings of the 1st Workshop on Machine Learning and Systems (pp. 89-95).
[51] Fegade, P., Chen, T., Gibbons, P., & Mowry, T. (2022). The CoRa Tensor Compiler: Compilation for Ragged Tensors with Minimal Padding. Proceedings of Machine Learning and Systems, 4, 721-747.
[52] Zheng, B., Jiang, Z., Yu, C. H., Shen, H., Fromm, J., Liu, Y., ... & Pekhimenko, G. (2022). DietCode: Automatic Optimization for Dynamic Tensor Programs. Proceedings of Machine Learning and Systems, 4, 848-863.
[53] Martin Kong, Richard Veras, Kevin Stock, Franz Franchetti, Louis-Noël Pouchet, and P. Sadayappan. 2013. When Polyhedral Transformations Meet SIMD Code Generation. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (Seattle, Washington, USA) (PLDI’13). ACM, New York, NY, USA, 127-138. https://doi.org/10.1145/2491956.2462187
[54] Sven Verdoolaege, Juan Carlos Juega, Albert Cohen, José Ignacio Gómez, Christian Tenllado, and Francky Catthoor. 2013. Polyhedral Parallel Code Generation for CUDA. ACM Trans. Archit. Code Optim. 9, 4, Article 54 (Jan. 2013), 23 pages. https://doi.org/10.1145/2400682.2400713
[55] Tobias Grosser, Sven Verdoolaege, and Albert Cohen. 2015. Polyhedral AST Generation Is More Than Scanning Polyhedra. ACM Trans. Program. Lang. Syst. 37, 4, Article 12 (July 2015), 50 pages. https://doi.org/10.1145/2743016
[56] Cedric Bastoul. Code generation in the polyhedral model is easier than you think. In Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques, PACT ’04, pages 7–16, Washington, DC, USA, 2004. IEEE Computer Society.
[57] Chun Chen. Polyhedra scanning revisited. In Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI’12.
[58] Byung Hoon Ahn, Prannoy Pilligundla, Amir Yazdanbakhsh, and Hadi Esmaeilzadeh. 2020. Chameleon: Adaptive Code Optimization for Expedited Deep Neural Network Compilation. In International Conference on Learning Representations. https://openreview.net/forum?id=rygG4AVFvH
[59] Jason Ansel, Shoaib Kamil, Kalyan Veeramachaneni, Jonathan Ragan-Kelley, Jeffrey Bosboom, Una-May O'Reilly, and Saman Amarasinghe. 2014. OpenTuner: An Extensible Framework for Program Autotuning. In Proc. of the 23rd Intl. Conf. on Parallel Architectures and Compilation (Edmonton, AB, Canada) (PACT'14). ACM, New York, NY, USA, 303–316. https://doi.org/10.1145/2628071.2628092
[60] Riyadh Baghdadi, Massinissa Merouani, Mohamed-Hicham Leghettas, Kamel Abdous, Taha Arbaoui, Karima Benatchba, and Saman Amarasinghe. 2021. A Deep Learning Based Cost Model for Automatic Code Optimization. In Proceedings of Machine Learning and Systems, A. Smola, A. Dimakis, and I. Stoica (Eds.), Vol. 3. 181–193. https://proceedings.mlsys.org/paper/2021/file/3def184ad8f4755ff269862ea77393dd-Paper.pdf
[61] Kartik Hegde, Po-An Tsai, Sitao Huang, Vikas Chandra, Angshuman Parashar, and Christopher W. Fletcher. 2021. Mind Mappings: Enabling Efficient Algorithm-Accelerator Mapping Space Search. In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (Virtual, USA) (ASPLOS 2021). Association for Computing Machinery, New York, NY, USA, 943–958. https://doi.org/10.1145/3445814.3446762

The header image comes from the internet; copyright belongs to its original owner.
This article was written out of personal interest; it represents only the author's own views and is unrelated to the author's employer.