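The First Conference on Machine Learning and Statistics will be held on August 24-26, 2023, at the Putuo Campus of East China Normal University. The conference is hosted by the Machine Learning Branch of the Chinese Association for Applied Statistics (中国现场统计研究会机器学习分会) and jointly organized by four units of East China Normal University: the School of Statistics, the Academy of Statistics and Interdisciplinary Sciences, the Key Laboratory of Advanced Theory and Application in Statistics and Data Science (Ministry of Education), and the Innovation and Talent Introduction Base for Statistical Applications and Theoretical Research. The conference aims to foster academic exchange among machine learning and statistics researchers in China and abroad, to cultivate an academic culture in which the two fields develop together, and to advance them as the foundational disciplines of data science and artificial intelligence, thereby supporting the growth of the related digital economy industries.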
Thematic Session 11
Trustworthy Machine Learning
Time:
August 25, 2023, 15:30-17:00
Venue:
Room 201, Wenshi Building, Putuo Campus, East China Normal University
Organizer:
Xiangyu Chang (常象宇), Xi'an Jiaotong University
01
Xiangyu Chang (常象宇), Xi'an Jiaotong University
Title: 2D-Shapley: A Framework for Fragmented Data Valuation
Abstract: Data valuation, that is, quantifying the contribution of individual data sources to certain predictive behaviors of a model, is of great importance to enhancing the transparency of machine learning and designing incentive systems for data sharing. Existing work has focused on evaluating data sources that share the same feature or sample space. How to evaluate fragmented data sources, each of which contains only partial features and samples, remains an open question. We start by presenting a method to calculate the counterfactual of removing a fragment from the aggregated data matrix. Based on the counterfactual calculation, we further propose 2D-Shapley, a theoretical framework for fragmented data valuation that uniquely satisfies some appealing axioms in the fragmented data context. 2D-Shapley empowers a range of new use cases, such as selecting useful data fragments, providing interpretation for sample-wise data values, and fine-grained data issue diagnosis.
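Bio: Xiangyu Chang is a professor at the School of Management, Xi'an Jiaotong University. He currently serves as Secretary-General of the Machine Learning Branch of the Chinese Association for Applied Statistics and as chair of Capital of Statistics (统计之都), a data science and statistics community. His main research area is statistical machine learning. His current work addresses the societal issues that arise when AI models and algorithms are applied to data-driven decision making, with a particular focus on data pricing in machine learning, fair machine learning, and privacy-preserving machine learning.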
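For readers new to Shapley-based data valuation, the sketch below may help fix ideas. It computes the classical (one-dimensional) Shapley value of whole data sources by averaging marginal contributions over all orderings. It is not the 2D-Shapley method described in the abstract. The source names "A", "B", "C" and the function `toy_utility` are hypothetical stand-ins for real data sources and a real model-performance measure.

```python
from itertools import permutations


def shapley_values(players, utility):
    """Exact Shapley values: average each player's marginal contribution
    over every ordering in which the players can join the coalition."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += utility(with_p) - utility(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}


def toy_utility(coalition):
    """Hypothetical utility (assumption for illustration only):
    sources A and B are redundant copies, source C is complementary."""
    score = 0.0
    if "A" in coalition or "B" in coalition:
        score += 0.6
    if "C" in coalition:
        score += 0.3
    return score


values = shapley_values(["A", "B", "C"], toy_utility)
for name in sorted(values):
    print(name, round(values[name], 4))
```

In this toy setup the redundant sources A and B split the credit for the shared 0.6, and the values sum to the utility of the full data set. Exact enumeration grows factorially with the number of sources, so it is only practical for very small examples.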
02
Xiao Guo (郭骁), Northwest University
Title: Privacy-Preserving Community Detection for Locally Distributed Multiple Networks
Abstract: Modern network analysis often involves multi-layer network data in which the nodes are aligned, but the edges on different layers come from various sources (e.g., hospitals, companies, and banks). The multi-layer stochastic block model (SBM) has been popularly used for analyzing this type of network data. In the literature, different estimation and clustering methods have been proposed for multi-layer SBMs based on the assumption that the networks are collected and preserved centrally. However, in practice, the networks are commonly stored and analyzed in a local and distributed fashion because of privacy, ownership, and communication costs. This paper proposes a new method for consensus community detection and estimation in a multi-layer SBM using locally stored network data with privacy protection and local computational constraints. A novel algorithm named privacy-preserving Distributed Spectral Clustering (ppDSC) is developed. To preserve the edges' privacy, we adopt the randomized response (RR) mechanism to perturb the network edges, which satisfies the strong notion of differential privacy. The ppDSC algorithm is performed on the squared RR-perturbed adjacency matrices to prevent possible cancellation of communities among different layers. To remove the bias incurred by RR and the squared network matrices, we develop a two-step bias-adjustment procedure. Then we perform eigen-decomposition on the debiased matrices, aggregation of the local eigenvectors with weights computed by orthogonal Procrustes transformation (OPT), and k-means clustering. We provide theoretical analysis on the statistical errors of ppDSC in terms of eigenvector estimation and clustering and show that under mild conditions, the errors from the privacy protection and the local computation are asymptotically negligible. In addition, the blessings and curses of network heterogeneity are well explained by our bounds. Numerical and real data experiments support our theoretical findings.
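Bio: Xiao Guo received a Ph.D. in statistics from Northwest University in 2019 and was a jointly supervised doctoral student in the Department of Statistics at Columbia University in 2018-2019. He is currently a lecturer in the Department of Statistics, School of Mathematics, Northwest University. His research focuses on data privacy protection and network data analysis.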
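The randomized response step mentioned in the abstract can be illustrated with a small sketch. It covers only the textbook symmetric mechanism on a single adjacency matrix and its first-order debiasing. It does not implement ppDSC, the squared-matrix adjustment, or the Procrustes aggregation. The assumption here is that each edge indicator is kept with probability q = e^ε / (1 + e^ε) and flipped otherwise, in which case E[Ã] = (2q - 1)A + (1 - q) off the diagonal. The function names are illustrative, not taken from the paper.

```python
import math
import random


def keep_probability(epsilon):
    """Keep probability of symmetric randomized response under an
    epsilon-differential-privacy budget for a single edge."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))


def randomized_response(adj, q, rng):
    """Flip each upper-triangular entry with probability 1 - q and
    mirror it so the perturbed matrix stays symmetric."""
    n = len(adj)
    out = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            bit = adj[i][j]
            if rng.random() >= q:  # happens with probability 1 - q
                bit = 1 - bit
            out[i][j] = out[j][i] = bit
    return out


def debias(perturbed, q):
    """Invert E[A_tilde] = (2q - 1) A + (1 - q) entrywise (off-diagonal)."""
    n = len(perturbed)
    return [
        [0.0 if i == j else (perturbed[i][j] - (1.0 - q)) / (2.0 * q - 1.0)
         for j in range(n)]
        for i in range(n)
    ]


# Toy 4-node network (made up for illustration).
adj = [[0, 1, 1, 0],
       [1, 0, 0, 0],
       [1, 0, 0, 1],
       [0, 0, 1, 0]]
q = keep_probability(1.0)
rng = random.Random(0)
reps = 2000
avg = [[0.0] * 4 for _ in range(4)]
for _ in range(reps):
    est = debias(randomized_response(adj, q, rng), q)
    for i in range(4):
        for j in range(4):
            avg[i][j] += est[i][j] / reps
```

Averaged over many independent perturbations, the debiased entries should land close to the original 0/1 adjacency values, which is the sense in which the estimator is unbiased. A single debiased matrix is still noisy.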
03
Haishan Ye (叶海山), Xi'an Jiaotong University
Title: Eigencurve: Optimal Learning Rate Schedule for SGD on Quadratic Objectives with Skewed Hessian Spectrums
Abstract: Learning rate schedulers have been widely adopted in training deep neural networks. Despite their practical importance, there is a discrepancy between their practice and their theoretical analysis. For instance, it is not known what schedules of SGD achieve the best convergence, even for simple problems such as optimizing quadratic objectives. In this paper, we propose Eigencurve, the first family of learning rate schedules that can achieve minimax optimal convergence rates (up to a constant) for SGD on quadratic objectives when the eigenvalue distribution of the underlying Hessian matrix is skewed. The condition is quite common in practice. Experimental results show that Eigencurve can significantly outperform step decay in image classification tasks on CIFAR-10, especially when the number of epochs is small. Moreover, the theory inspires two simple learning rate schedulers for practical applications that can approximate Eigencurve. For some problems, the optimal shape of the proposed schedulers resembles that of cosine decay, which sheds light on the success of cosine decay in such situations. For other situations, the proposed schedulers are superior to cosine decay.
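Bio: Haishan Ye is an associate professor at the School of Management, Xi'an Jiaotong University. His long-term research interests are in machine learning and optimization.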
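The two baseline schedules named in the abstract, step decay and cosine decay, can be written in a few lines. The sketch below uses their common textbook forms and is not the Eigencurve schedule itself, whose formula is not given in this announcement. The parameter names and the values 0.1, 0.5, 30, and 100 are arbitrary choices for illustration.

```python
import math


def step_decay(t, eta0, gamma, step_size):
    """Multiply the learning rate by gamma once every step_size iterations."""
    return eta0 * gamma ** (t // step_size)


def cosine_decay(t, total_steps, eta0, eta_min=0.0):
    """Anneal from eta0 to eta_min along half a cosine period."""
    progress = min(t, total_steps) / total_steps
    return eta_min + 0.5 * (eta0 - eta_min) * (1.0 + math.cos(math.pi * progress))


total_steps = 100
step_schedule = [step_decay(t, 0.1, 0.5, 30) for t in range(total_steps + 1)]
cosine_schedule = [cosine_decay(t, total_steps, 0.1) for t in range(total_steps + 1)]
```

Step decay holds the rate constant and then drops it abruptly, while cosine decay lowers it smoothly at every iteration. Both start at the same initial rate here, so the two lists can be plotted against each other directly.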
04
Pengfei Liu (刘鹏飞), Shanghai Jiao Tong University
Title: A Discussion on the Safety and Reliability of Generative AI Models
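Abstract: Generative AI, built around the pre-trained language models of natural language processing, has gradually moved from academic papers into products such as Jasper and ChatGPT, and its potential at the application layer is still being explored. Generative AI makes it possible to develop complex applications that use large pre-trained models as a backbone and, with appropriate prompting, produce high-quality text, images, and other outputs. However, evaluating these generative systems (for example, assessing the quality of text generated by ChatGPT) may be a harder task than generation itself. The outputs of generative models are often difficult to control and can include offensive, biased, or factually incorrect content, so how to automatically and quantitatively evaluate model reliability from multiple angles has become a key question. This talk will discuss how to evaluate and build a safe and reliable large model.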
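Bio: Pengfei Liu is an associate professor at the Qing Yuan Research Institute, Shanghai Jiao Tong University, and head of the Generative AI Research group (GAIR). His research focuses on pre-training, generation, and evaluation for natural language. He has published more than 60 papers in natural language processing and artificial intelligence, with more than 6,000 citations on Google Scholar. He was the first in the history of the ACL conference to win the System & Demo Paper Award in two consecutive years, and he is among the earliest proponents of the concept of prompt engineering. His representative work includes ExplainaBoard, Gaokao English AI (高考英语AI), and LIMA.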
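There is no registration fee for this conference. Please scan the QR code below to complete registration.
For more information, please visit the conference website: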
 https://ml-stat.github.io/MLSTAT2023/