刚入门时老师非要合并dataset时batcheffect困扰许久

作者：蓝俊明伟伦淑舜 | 来源：互联网 | 2023-05-19 15:44

一、什么是批次效应批次效应（batch effect），表示样品在不同批次中处理和测量产生的与试验期间记录的任何生物变异无关的技术差异。批次效应是高通量试验中常见的变异来源，受日期、环境、处理组

一、什么是批次效应

批次效应（batch effect），表示样品在不同批次中处理和测量产生的与试验期间记录的任何生物变异无关的技术差异。批次效应是高通量试验中常见的变异来源，受日期、环境、处理组、实验人员、试剂、平台等一些非生物因素的影响。
合并分析不同批次的数据时，平常的标准化方法不足以调整批次之间的差异。如果批次效应比较严重，这些差异就会干扰实验结果，我们就不能够判断得到的差异表达的基因是来源于想要研究的因素，还是和批次相关。
批次效应不能被消除，只有尽可能的降低。校正批次效应的目的是，减少批次之间的差异，尽量让多个批次的数据重新组合在一起，这样下游分析就可以只考虑生物学差异因素。

Batch effects are sub-groups of measurements that have qualitatively different behaviour across conditions and are unrelated to the biological or scientific variables in a study. For example, batch effects may occur if a subset of experiments was run on Monday and another set on Tuesday, if two technicians were responsible for different subsets of the experiments or if two different lots of reagents, chips or instruments were used.

Normalization is a data analysis technique that adjusts global properties of measurements for individual samples so that they can be more appropriately compared. Including a normalization step is now standard in data analysis of gene expression experiments. But normalization does not remove batch effects, which affect specific subsets of genes and may affect different genes in different ways. [1]

二、处理方法
目前已经有多种处理批次差异的方法。[2]

image.png

批次效应处理方法
三、哪种方法比较好

一项研究比较了6种去除批次效应的方法，其中包括ComBat方法（parametric prior method，ComBat_p和non-parametric method，ComBat_n）、代理变量法（Surrogate variable analysis，SVA）、基于比值的方法（Geometric ratio-based method，Ratio_G）、平均中心方法（Mean-centering，PAMR）和距离加权判别（Distance-weighted discrimination，DWD）方法，综合多种指标认为ComBat在精确性、准确性和整体性能方面（precision, accuracy and overall performance）总体优于其他方法。

A number of approaches have been used or developed for identifying and removing batch effects from microarray data, of which we have chosen six for evaluation. We systematically evaluated six of these programs using multiple measures of precision, accuracy and overall performance. ComBat, an Empirical Bayes method, outperformed the other five programs by most metrics. Its parametric and non-parametric algorithms both worked well in both kinds of data sets, controlling the variation attributable to batch effects, increasing the correlation among replicates, and producing the largest AUC in our assessment of overall performance. [3]

四、ComBat方法的算法

模型的假设是基于位置和尺度（Location and scale，L/S）的调整。L/S调整可以定义为一系列广泛的调整，其中为数据在批次内的位置（均值）和/或规模（方差）假设了一个模型，然后调整批次以满足假设模型的规范。因此，L/S批次调整假设批次效应可以通过标准化批次之间的均值和方差来建模。这些调整可以从简单的基因范围的均值和方差标准化，到复杂的基因间线性或非线性调整。

模型：Yijg = αg + Xβg + γig + δigεijg
Yijg表示来自批次i的样品j的基因g的表达值。其中αg是基因g的平均表达值，X是样本条件的设计矩阵，βg是对应于X的回归系数向量。误差项εijg服从期望值为0和方差为σg的正态分布N(0，σg )，γig和δig表示批次i中基因g加法和乘法的批次效应。

算法分为3步： [4]

image.png

算法

Location and scale (L/S) adjustments can be defined as a wide family of adjustments in which one assumes a model for the location (mean) and/or scale (variance) of the data within batches and then adjusts the batches to meet assumed model specifications. Therefore, L/S batch adjustments assume that the batch effects can be modeled out by standardizing means and variances across batches. These adjustments can range from simple gene-wise mean and variance standardization to complex linear or non-linear adjustments across the genes. [5]

五、推荐的解决方法

（1）根据分析目的确定批次效应处理方法：差异表达分析，在模型中添加批次因素；可视化，先对原数据进行校正，再使用校正后的数据进行分析。
（2）已知的批次，removeBatchEffect或ComBat；未知的批次，sva。
（3）removeBatchEffect和ComBat、sva输入的数据需要进行转换，例如取对数（rlog或logCPM）。
（4）read counts数据可使用ComBat-Seq或svaseq。

六、差异表达分析的批次校正

很多人以为去除批次效应是要改变你的表达矩阵，新的表达矩阵然后去走差异分析流程，其实大部分的差异分析流程包里面，人家内置好了考虑你的批次效应这样的混杂因素的函数用法设计。例如：
构建DESeq2对象时的设计公式：design = ~ batch + conditions

image.png

DESeq2
因为我们把批次这个因素，写在了DESeq2里面，所以它在帮我们做差异分析的时候，其实就考虑了，得到的差异分析结果，就是去除了批次效应的。

如果要合并不同批次的数据进行差异表达分析，建议直接把批次信息加入到构建模型里面。但是这种方法并没有改变原数据。如果你确实一定要亲眼看看批次效应到底是如何影响这个表达矩阵的，就需要对表达矩阵做另外的处理，例如removeBatchEffect或ComBat。但是处理之后会改变counts矩阵，之后就没办法走DESeq2差异分析流程啦，仅仅是为了拿到去除批次效应前后对比的表达矩阵而已。

image.png

PCA

七、使用limma的removeBatchEffect处理批次效应

removeBatchEffect这个函数用于进行聚类或无监督分析之前，移除与杂交时间或其他技术变异相关的批次效应。它是针对芯片设计的，因此不要直接使用read counts，数据需要经过一定的标准化操作，如log转化。

removeBatchEffect只用于衔接聚类、PCA等可视化展示，不要在线性建模之前使用。因为用矫正后的数据进行差异表达分析，有两个缺陷：一是批次因素和分组因素可能重叠，所以直接对原数据矫正批次可能会抵消一部分真实生物学因素；二是低估了误差。所以，如果想做差异表达分析，但数据中又有已知的批次问题，则最好把批次效应纳入线性模型中。

This function is useful for removing batch effects, associated with hybridization time or other technical variables, prior to clustering or unsupervised analysis such as PCA, MDS or heatmaps. The design matrix is used to describe comparisons between the samples, for example treatment effects, which should not be removed. The function (in effect) fits a linear model to the data, including both batches and regular treatments, then removes the component due to the batch effects. This function is intended for use with clustering or PCA, not for use prior to linear modelling. If linear modelling is intended, it is better to include the batch effect as part of the linear model.

image.png

removeBatchEffect用法

image.png

removeBatchEffect
八、使用SVA的ComBat处理批次效应

sva这个R包可以处理已知的和未知的批次效应，sva函数可以通过构建高维数据集的代理变量，移除批次效应和其它所有不需要的变异。如果是芯片数据用sva，高通量测序数据使用svaseq。ComBat函数可以处理已知的批次效应。

sva has functionality to estimate and remove artifacts from high dimensional data the sva function can be used to estimate artifacts from microarray data. the svaseq function can be used to estimate artifacts from count-based RNA-sequencing (and other sequencing) data. The ComBat function can be used to remove known batch effecs from microarray data. The fsva function can be used to remove batch effects for prediction problems.

ComBat allows users to adjust for batch effects in datasets where the batch covariate is known, using methodology described in Johnson et al. 2007. It uses either parametric or non-parametric empirical Bayes frameworks for adjusting data for batch effects. Users are returned an expression matrix that has been corrected for batch effects. The input data are assumed to be cleaned and normalized before batch effect removal.

image.png

ComBat用法

image.png

ComBat
九、使用sva处理未知的批次

SVA具有移除批次效应和高通量测序中其它不需要的变异的功能。使用sva识别和构建高维数据集代理变量(surrogate variables)，代理变量是由高维数据直接构建的协变量(covariates)，可用于后续分析，以调整未知、未建模或潜在的噪声源。
sva函数的输出本身就是代理变量。它们可以包含在全模型矩阵和空模型矩阵中，然后与数据矩阵一起传递给SVA包中的f.pvalue函数，以计算参数F检验p值，从而调整代理变量。

根据经验，当存在大量已知或未知的潜在混杂因素时，代理变量调整（sva）可能更合适。当一个或多个生物学分组已知是异质的，并且有已知的批次变量时，直接调整（ComBat）可能更合适。

sva考虑了两种类型的变量：调整变量和感兴趣的变量。例如，感兴趣变量(variables of interest)可能是癌症组与对照组；调整变量(adjustment variables)可能是病人的年龄、性别、测序时间。
建立两个模型矩阵：全模型(full model)和空模型(null model)。空模型包括所有调整变量，而不包含感兴趣的变量；全模型包括所有调整变量和感兴趣的变量。我们将试图分析感兴趣的变量与基因表达之间的关联，调整调整变量。模型矩阵可以使用model.matrix创建。sva的目标是消除所有不需要的变异来源，同时通过mod中包含的主要变量来检测对比。

注意：在我们最初的工作中，使用识别函数来测量在近似对称和连续尺度上的数据。对于通常表示为counts的测序数据，更合适的模型可能涉及使用适度的对数函数。例如，我们首先用log(counts+1)转换基因表达数据。

The goal of sva is to remove all unwanted sources of variation while protecting the contrasts due to the primary variables specified in the function call. This leads to the identification of features that are consistently different between groups, removing all common sources of latent variation.

As a rule of thumb, when there are a large number of known or unknown potential confounders, surrogate variable adjustment may be more appropriate. Alternatively, when one or more biological groups is known to be heterogeneous, and there are known batch variables, direct adjustment may be more appropriate. [6]

用法：
（1）使用sva得到代理变量

##准备数据 data(bladderdata) pheno = pData(bladderEset) head(pheno) #分组信息 edata = exprs(bladderEset) head(edata) #表达量信息 ##建立两个模型 pheno$cancer mod = model.matrix(~as.factor(cancer), data=pheno) mod #全模型包括所有调整变量和感兴趣的变量,这里只有感兴趣的变量 mod0 = model.matrix(~1,data=pheno) mod0 #空模型只有调整变量,这里没有调整变量,所以只有一个截距intercept ##进行SVA n.sv = num.sv(edata,mod,method="leek") #估计潜在因素的数量 n.sv svobj = sva(edata,mod,mod0,n.sv=n.sv) #应用SVA函数来估计代理变量 str(svobj) #包含4项的列表 head(svobj$sv) #SV是一个矩阵,其列对应于估计的代理变量,它们可以用于下游分析 #pprob.gam是每个基因与一个或多个潜在变量相关的后验概率 #pprob.b是每个基因与感兴趣的变量相关的后验概率 #n.sv是SVA估计的代理变量数

（2）使用f.pvalue函数调整代理变量（calculate parametric F-test P-values adjusted for surrogate variables）

#f.pvalue可以用来计算数据矩阵每一行(基因)的参数F检验p值 #F检验比较了模型mod和mod0 #没有调整代理变量的原始矩阵 pValues = f.pvalue(edata,mod,mod0) qValues = p.adjust(pValues,method="BH") #adjust them for multiple testing head(qValues) sum(qValues

（3）sva可以与差异表达分析程序一起使用，如limma、DESeq2。

##sva+DESeq2 dds 1,] dds 1,] head(dat) mod

References
[1] Chen C, Grennan K, Badner J, Zhang D, Gershon E, Jin L, et al. Removing batch effects in analysis of expression microarray data: an evaluation of six batch adjustment methods[J]. PloS One. 2011;6(2):e17238.
[2] 李飒,赵毅强.基因表达数据批次效应去除方法的研究进展[J].南京农业大学学报,2019,42(03):389-397.
[3] Chen C, Grennan K, Badner J, Zhang D, Gershon E, Jin L, et al. Removing batch effects in analysis of expression microarray data: an evaluation of six batch adjustment methods[J]. PloS One. 2011;6(2):e17238.
[4]陈天成,侯艳,李康.基因组学数据整合中的批次效应移除算法[J].中国卫生统计,2016,33(03):527-529+533.
[5] Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods[J]. Biostatistics. 2007;8(1):118-27.
[6] Leek JT, Johnson WE, Parker HS, Jaffe AE, Storey JD. The sva package for removing batch effects and other unwanted variation in high-throughput experiments[J]. Bioinformatics. 2012;28(6):882-3.

推荐阅读

input
九度OnlineJudge之1002：Grading问题的解决方法

本文介绍了九度OnlineJudge中的1002题目“Grading”的解决方法。该题目要求设计一个公平的评分过程，将每个考题分配给3个独立的专家，如果他们的评分不一致，则需要请一位裁判做出最终决定。文章详细描述了评分规则，并给出了解决该问题的程序。 ... [详细]

蜡笔小新 2023-12-14 13:00:09
const
node . js require 和 ES6 导入导出的区别

node.jsrequire和ES6导入导出的区别原 ... [详细]

蜡笔小新 2023-12-10 11:12:31
int
Linux虚拟化部署中的VLAN配置方法详解

本文详细介绍了在Linux虚拟化部署中进行VLAN配置的方法。首先要确认Linux系统内核是否已经支持VLAN功能，然后配置物理网卡、子网卡和虚拟VLAN网卡的关系。接着介绍了在Linux配置VLAN Trunk的步骤，包括将物理网卡添加到VLAN、检查添加的VLAN虚拟网卡信息以及重启网络服务等。最后，通过验证连通性来确认配置是否成功。 ... [详细]

蜡笔小新 2023-12-09 03:55:11
const
Simple Tips on C++(对于C++的一些建议)

Introduction（简介）Forbeingapowerfulobject-orientedprogramminglanguage,Cisuseda ... [详细]

蜡笔小新 2023-10-17 19:48:02
default
VScode格式化文档换行或不换行的设置方法

本文介绍了在VScode中设置格式化文档换行或不换行的方法，包括使用插件和修改settings.json文件的内容。详细步骤为：找到settings.json文件，将其中的代码替换为指定的代码。 ... [详细]

蜡笔小新 2023-12-14 17:15:38
import
Linux重启网络命令实例及关机和重启示例教程

本文介绍了Linux系统中重启网络命令的实例，以及使用不同方式关机和重启系统的示例教程。包括使用图形界面和控制台访问系统的方法，以及使用shutdown命令进行系统关机和重启的句法和用法。 ... [详细]

蜡笔小新 2023-12-14 15:52:52
input
CF：3D City Model（小思维）问题解析和代码实现

本文通过解析CF：3D City Model问题，介绍了问题的背景和要求，并给出了相应的代码实现。该问题涉及到在一个矩形的网格上建造城市的情景，每个网格单元可以作为建筑的基础，建筑由多个立方体叠加而成。文章详细讲解了问题的解决思路，并给出了相应的代码实现供读者参考。 ... [详细]

蜡笔小新 2023-12-13 14:17:11
js
MooTools和JQuery并排 - MooTools and JQuery Side by Side

IjustinheritedsomewebpageswhichusesMooTools.IneverusedMooTools.NowIneedtoaddsomef ... [详细]

蜡笔小新 2023-12-12 13:43:58
const
差分约束系统求解House Man跳跃问题的思路与方法

本文讨论了使用差分约束系统求解House Man跳跃问题的思路与方法。给定一组不同高度，要求从最低点跳跃到最高点，每次跳跃的距离不超过D，并且不能改变给定的顺序。通过建立差分约束系统，将问题转化为图的建立和查询距离的问题。文章详细介绍了建立约束条件的方法，并使用SPFA算法判环并输出结果。同时还讨论了建边方向和跳跃顺序的关系。 ... [详细]

蜡笔小新 2023-12-14 11:49:51
const
Express App如何提供不需要的静态文件？

本文介绍了如何使用Express App提供静态文件，同时提到了一些不需要使用的文件，如package.json和/.ssh/known_hosts，并解释了为什么app.get('*')无法捕获所有请求以及为什么app.use(express.static(__dirname))可能会提供不需要的文件。 ... [详细]

蜡笔小新 2023-12-12 14:38:07
int
java boolean 大小_java boolean 大小

先看官方文档TheJavaTutorialshavebeenwrittenforJDK8.Examplesandpracticesdescribedinthispagedontta ... [详细]

蜡笔小新 2023-12-12 13:36:56
future
to_a和to_ary有什么区别？ - What's the difference between to_a and to_ary?

Whatsthedifferencebetweento_aandto_ary?to_a和to_ary有什么区别？ ... [详细]

蜡笔小新 2023-12-11 19:30:04
future
微软的STL容器类实现是否线程安全？

本文讨论了微软的STL容器类是否线程安全。根据MSDN的回答，STL容器类包括vector、deque、list、queue、stack、priority_queue、valarray、map、hash_map、multimap、hash_multimap、set、hash_set、multiset、hash_multiset、basic_string和bitset。对于单个对象来说，多个线程同时读取是安全的。但如果一个线程正在写入一个对象，那么所有的读写操作都需要进行同步。 ... [详细]

蜡笔小新 2023-12-11 11:53:23
regex
数组或散列中的正则表达式排序 - Regex in array or hash - sorting

Ihaveaworkfolderdirectory.我有一个工作文件夹目录。holderDir.glob(*)>holder[ProjectOne, ... [详细]

蜡笔小新 2023-12-10 12:41:53
regex
Spring框架《一》简介

Spring框架《一》1.Spring概述1.1简介1.2Spring模板二、IOC容器和Bean1.IOC和DI简介2.三种通过类型获取bean3.给bean的属性赋值3.1依赖 ... [详细]

蜡笔小新 2023-12-09 20:10:11

蓝俊明伟伦淑舜

这个家伙很懒，什么也没留下！

Tags | 热门标签

RankList | 热门文章