Perl (or R, or SQL): Count how often string appears across columns

Author: 萍萍jean | Source: Internet | 2023-05-17 06:45

I have a text file that looks like this:

gene1   gene2   gene3
a       d       c
b       e       d
c       f       g
d       g       
        h
        i

(Each column is a human gene, and each contains a variable number of proteins (strings, shown as letters here) that can bind to those genes).

What I want to do is count how many columns each string is represented in, output that number and all the column headers, like this:

a   1   gene1
b   1   gene1
c   2   gene1 gene3
d   3   gene1 gene2 gene3
e   1   gene2
f   1   gene2
g   2   gene2 gene3
h   1   gene2
i   1   gene2

I have been trying to figure out how to do this in Perl and R, but without success so far. Thanks for any help.

5 Solutions

#1

This solution seems like a bit of a hack, but it gives the desired output. It relies on using both plyr and reshape packages, though I'm sure you could find base R alternatives. The trick is that function melt lets us flatten the data out into a long format, which allows for easy(ish) manipulation from that point forward.

library(reshape)
library(plyr)

#Recreate your data
dat <- data.frame(gene1 = c(letters[1:4], NA, NA),
                  gene2 = letters[4:9],
                  gene3 = c("c", "d", "g", NA, NA, NA)
                  )

#Melt the data. You'll need to update this if you have more columns
dat.m <- melt(dat, measure.vars = 1:3)

#Tabulate counts
counts <- as.data.frame(table(dat.m$value))

#I'm not sure what to call this column since it's a smooshing of column names
otherColumn <- ddply(dat.m, "value", function(x) paste(x$variable, collapse = " "))

#Merge the two together. You could fix the column names above, or just deal with it here
merge(counts, otherColumn, by.x = "Var1", by.y = "value")

Gives:

> merge(counts, otherColumn, by.x = "Var1", by.y = "value")
  Var1 Freq                V1
1    a    1             gene1
2    b    1             gene1
3    c    2       gene1 gene3
4    d    3 gene1 gene2 gene3
....
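
The long-format idea behind melt is not specific to R. As a rough sketch in Python (not part of the original answers; the names `columns`, `melt`, and `summarize` are made up for illustration), the data is first flattened into (protein, gene) pairs and then grouped by protein:

```python
from collections import defaultdict

# Sample data mirroring the question; None marks an empty cell.
columns = {
    "gene1": ["a", "b", "c", "d", None, None],
    "gene2": ["d", "e", "f", "g", "h", "i"],
    "gene3": ["c", "d", "g", None, None, None],
}


def melt(columns):
    """Flatten a column -> values mapping into (value, column) pairs, skipping empty cells."""
    return [
        (value, gene)
        for gene, values in columns.items()
        for value in values
        if value is not None
    ]


def summarize(pairs):
    """Group long-format pairs by value; return value -> (column count, column names)."""
    genes_by_protein = defaultdict(list)
    for protein, gene in pairs:
        genes_by_protein[protein].append(gene)
    return {
        protein: (len(genes), " ".join(genes))
        for protein, genes in sorted(genes_by_protein.items())
    }


if __name__ == "__main__":
    for protein, (count, genes) in summarize(melt(columns)).items():
        print(protein, count, genes, sep="\t")
```

Like the R version, this counts one entry per cell, so it assumes a protein is listed at most once per column.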

#2

In Perl, assuming the proteins in each column don't contain duplicates that need to be removed. (If they do, use a hash of hashes instead.)

use strict;
use warnings;

my $header = <>;
my %column_genes;
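# Key each gene name by the character offset where it starts ($-[1]).
# Data cells are matched to columns by the same offset, so the input
# must be space-aligned under the header.
while ($header =~ /(\S+)/g) {
    $column_genes{$-[1]} = $1;
}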

my %proteins;
while (my $line = <>) {
    while ($line =~ /(\S+)/g) {
        if (exists $column_genes{$-[1]}) {
            push @{ $proteins{$1} }, $column_genes{$-[1]};
        }
        else {
            warn "line $. column $-[1] unexpected protein $1 ignored\n";
        }
    }
}

for my $protein (sort keys %proteins) {
    print join("\t",
        $protein,
        scalar @{ $proteins{$protein} },
        join(' ', sort @{ $proteins{$protein} } )
    ), "\n";
}

Reads from stdin, writes to stdout.
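
For the duplicate case mentioned above, one possible sketch in Python (not from the original answer) keeps a set of gene names per protein, which plays the same role as a hash of hashes. The function name `count_columns` and the sample rows are assumptions for illustration, and the extra `a` in the gene1 column is added only to show the deduplication. As in the Perl code, a cell is assigned to the header that starts at the same character offset.

```python
import re
from collections import defaultdict


def count_columns(lines):
    """Map each protein to the set of gene columns it appears in.

    Assumes space-aligned input: a cell belongs to the header that
    starts at the same character offset.
    """
    lines = iter(lines)
    header = next(lines)
    gene_at = {m.start(): m.group() for m in re.finditer(r"\S+", header)}

    genes_by_protein = defaultdict(set)  # a set drops repeats within a column
    for line in lines:
        for m in re.finditer(r"\S+", line):
            gene = gene_at.get(m.start())
            if gene is not None:  # cells that match no header offset are skipped
                genes_by_protein[m.group()].add(gene)
    return genes_by_protein


# Build space-aligned sample lines (8 characters per column).
rows = [
    ("gene1", "gene2", "gene3"),
    ("a", "d", "c"),
    ("b", "e", "d"),
    ("c", "f", "g"),
    ("d", "g", ""),
    ("a", "h", ""),  # duplicate "a" in gene1, counted once
    ("", "i", ""),
]
sample = ["".join(cell.ljust(8) for cell in row).rstrip() for row in rows]

if __name__ == "__main__":
    for protein, genes in sorted(count_columns(sample).items()):
        print(protein, len(genes), " ".join(sorted(genes)), sep="\t")
```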

#3

A one liner (or rather 3 liner)

ddply(na.omit(melt(dat, m = 1:3)), .(value), summarize, 
     len = length(variable), 
     var = paste(variable, collapse = " "))

#4

If there aren't many columns, you can do something like this in SQL. You basically flatten the data into a two-column derived table of protein/gene and then summarize it as needed.

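;with cte as (
  -- genes_wide is a placeholder name; the original omitted the source table
  select gene1 as protein, 'gene1' as gene from genes_wide where gene1 is not null
  union select gene2 as protein, 'gene2' as gene from genes_wide where gene2 is not null
  union select gene3 as protein, 'gene3' as gene from genes_wide where gene3 is not null
)

select protein, count(*) as cnt, group_concat(gene) as gene
from cte
group by protein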
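
To try the flatten-then-summarize query without a database server, a sketch using Python's built-in sqlite3 module might look like the following. The table name `genes_wide` and its layout (one row per line of the input file, NULL for empty cells) are assumptions, not something stated in the answer. SQLite does not promise an order inside group_concat, so the gene names are sorted in Python afterwards.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical wide table: one row per input line, NULL for empty cells.
conn.execute("create table genes_wide (gene1 text, gene2 text, gene3 text)")
conn.executemany(
    "insert into genes_wide values (?, ?, ?)",
    [
        ("a", "d", "c"),
        ("b", "e", "d"),
        ("c", "f", "g"),
        ("d", "g", None),
        (None, "h", None),
        (None, "i", None),
    ],
)

# UNION (not UNION ALL) removes repeated protein/gene pairs,
# so count(*) is the number of distinct columns per protein.
query = """
with cte as (
  select gene1 as protein, 'gene1' as gene from genes_wide where gene1 is not null
  union select gene2, 'gene2' from genes_wide where gene2 is not null
  union select gene3, 'gene3' from genes_wide where gene3 is not null
)
select protein, count(*) as cnt, group_concat(gene, ' ') as genes
from cte
group by protein
order by protein
"""
rows = conn.execute(query).fetchall()

# Sort the concatenated gene names for a stable display order.
summary = {
    protein: (cnt, " ".join(sorted(genes.split())))
    for protein, cnt, genes in rows
}

if __name__ == "__main__":
    for protein, (cnt, genes) in summary.items():
        print(protein, cnt, genes, sep="\t")
```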

#5

In MySQL, like so:

select protein, count(*), group_concat(gene order by gene separator ' ') from gene_protein group by protein;

assuming data like:

create table gene_protein (gene varchar(255) not null, protein varchar(255) not null);
insert into gene_protein values ('gene1','a'),('gene1','b'),('gene1','c'),('gene1','d');
insert into gene_protein values ('gene2','d'),('gene2','e'),('gene2','f'),('gene2','g'),('gene2','h'),('gene2','i');
insert into gene_protein values ('gene3','c'),('gene3','d'),('gene3','g');
