Exporting rows of descriptive statistics from R to an Excel worksheet

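I have a large database of more than 85,000 values for more than 100 different companies, labelled with more than 100 variables. My goal is to determine the descriptive statistics (mean, standard deviation, minimum, and number of values) for several of those variables.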

Below is the information for one given company, which I will call Company F.

    Attendance  Number of representatives  Number of Presenters  Company Audience
            29                          2                    30                 2
            20                          3                    30                 4
            30                         10                    20                 5
            40                         20                    10                 5
            10                         30                    13                 5

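What I want is for R to compute the descriptive statistics [mean, standard deviation, minimum, and maximum] for each of these specific columns and export them to Excel in the following layout: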

 Company F Average Number of Attendance Standard Deviation of Number of Attendance Min Number of Attendance Max Number of Attendance and Number of People in Attendance Average of Number of Representatives Standard Deviation of Number of Representatives Min of Number of Representatives Max Number of Representatives Total Number of Values Average Number of Presenters Standard Deviation Number of Presenters Min Number of Presenters Max Number of Presenters Total Number of Presenters Average Company Audience Standard Deviation Company Audience Min Number of Company Audience Max Number of Company Audience Total Number of Company Audience 

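Because this is a very long row, to summarize: I am trying to find the descriptive statistics [mean, standard deviation, minimum, maximum, and n] for each of these columns. All of them should correspond to Company F.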

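How I have tried to solve this:

I have used a descriptive statistics function in R to get a data frame of results for each column. For this I used the psych package: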

    library(psych)
    describe(CompanyF$Attendance)
    describe(CompanyF$NumberofRepresentatives)
    describe(CompanyF$Number_of_Presenters)
    describe(CompanyF$`Company Audience`)

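With this package I can get the data frame, then go into Excel and build the row by hand, typing in the values I received and leaving out anything the psych package reports that I am not interested in. Here is an example of the kind of output I get from psych: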

      vars   n mean   sd median trimmed  mad min max range skew kurtosis   se
    1    1 559 2.02 2.21      1    1.75 1.48   0   9     9 0.78    -0.65 0.09

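This process is very time-consuming and error-prone. After finishing Company F, I create a new row directly below it for another company, say Company G, and repeat the process of finding the descriptive statistics [mean, standard deviation, min, max, and n] for each of the variables of interest (attendance, number of representatives, number of presenters, and company audience).

I have looked at various solutions, one of them from the Stack Overflow post "Export data from R to Excel", but I could not find one that explains how to write information from R to Excel row by row, or how to specify that it should compute the descriptive statistics I listed above.

Ideally, I would have the following output placed in Excel: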

    Company F Average Number of Attendance Standard Deviation of Number of Attendance Min Number of Attendance Max Number of Attendance and Number of People in Attendance Average of Number of Representatives Standard Deviation of Number of Representatives Min of Number of Representatives Max Number of Representatives Total Number of Values Average Number of Presenters Standard Deviation Number of Presenters Min Number of Presenters Max Number of Presenters Total Number of Presenters Average Company Audience Standard Deviation Company Audience Min Number of Company Audience Max Number of Company Audience Total Number of Company Audience
    Company G Average Number of Attendance Standard Deviation of Number of Attendance Min Number of Attendance Max Number of Attendance and Number of People in Attendance Average of Number of Representatives Standard Deviation of Number of Representatives Min of Number of Representatives Max Number of Representatives Total Number of Values Average Number of Presenters Standard Deviation Number of Presenters Min Number of Presenters Max Number of Presenters Total Number of Presenters Average Company Audience Standard Deviation Company Audience Min Number of Company Audience Max Number of Company Audience Total Number of Company Audience
    Company H Average Number of Attendance Standard Deviation of Number of Attendance Min Number of Attendance Max Number of Attendance and Number of People in Attendance Average of Number of Representatives Standard Deviation of Number of Representatives Min of Number of Representatives Max Number of Representatives Total Number of Values Average Number of Presenters Standard Deviation Number of Presenters Min Number of Presenters Max Number of Presenters Total Number of Presenters Average Company Audience Standard Deviation Company Audience Min Number of Company Audience Max Number of Company Audience Total Number of Company Audience

And so on.

A raw subset of my data is as follows:

 structure(list(sn = structure(c(2L, 2L, 3L, 5L, 2L, 7L, 1L, 9L, 1L, 9L, NA, 9L, 1L, 26L, 11L, 9L, 7L, NA, NA, 7L, 17L, 9L, NA, 21L, 7L, 17L, 7L, 7L, 16L, 7L, 7L, 7L, 7L, 26L, 7L, 6L, 26L, 22L, NA, NA, 11L, 23L, 23L, 26L, NA, 7L, 23L, 1L, NA, 1L, 7L, 11L, 12L, 13L, 9L, NA, 15L, NA, 20L, 15L, NA, 17L, 5L, NA, 22L, 15L, NA, NA, 5L, 8L, 32L, 29L, 23L, 33L, 1L, 23L, 14L, 6L, 7L, 15L), .Label = c("Broome Street", "Company A", "Company B", "Company BC", "Company C", "Company CC", "Company D Clinton", "Company DD", "Company E", "Company ED BroadCompany", "Company G", "Company H BroadCompany", "Company I BroadCompany", "Company I Studio", "Company J", "Company K", "Company L", "Company M", "Company M BroadCompany", "Company M HS BroadCompany", "Company MCC BroadCompany", "Company N", "Company P", "Company Q", "Company Q Company N", "Company Q Company ZZ", "Company R - Company ZZ", "Company SLab", "Company Z", "Company ZE", "Company ZED", "Company ZEQ", "Company ZZ", "Company ZZQ", "Company ZZQ Company N"), class = "factor"), earn_tot = c(21.85, 20.8, NA, 8.16, NA, NA, NA, NA, NA, NA, NA, NA, NA, 7.16, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 43.32, NA, 30.48, NA, NA, 34.9, NA, NA, NA, NA, NA, 25.82, 40.75, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0, NA, NA, NA, 30, NA, NA, NA, NA, NA, NA, 39.1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 52.29, 44.32, NA, 7, 38.32, 0, NA, NA, 8.25, NA, NA), earn_and_current_tot = c(29.43, 20.8, NA, 8.16, NA, NA, NA, NA, NA, NA, NA, NA, NA, 7.16, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 49.9, NA, 37.56, NA, NA, 41.98, NA, NA, NA, NA, NA, 37.32, 49, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0, NA, NA, NA, 37, NA, NA, NA, NA, NA, NA, 47.68, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 57.29, 48.48, NA, 7, 45.9, 0, NA, NA, 15.75, NA, NA), pass_99 = c(0L, 0L, NA, NA, NA, NA, 1L, NA, NA, NA, NA, 5L, NA, 0L, NA, 5L, NA, NA, NA, 0L, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0L, NA, NA, NA, NA, 5L, NA, NA, NA, NA, 4L, 0L, NA, NA, NA, 4L, 
4L, NA, NA, NA, NA, NA, NA, 0L, NA, NA, NA, NA, 1L, NA, NA, NA, NA, 1L, NA, NA, 0L, 4L, 0L, NA, NA, 0L, NA, NA), pass_65 = c(0L, 0L, 5L, 0L, 6L, NA, 0L, 5L, NA, 5L, NA, 6L, NA, 0L, 5L, 2L, NA, NA, NA, 0L, 5L, 5L, NA, NA, NA, 0L, NA, 1L, 4L, 7L, 5L, 5L, 7L, 0L, 5L, NA, 0L, 1L, NA, NA, NA, 2L, 0L, 6L, NA, 8L, 2L, 0L, NA, 4L, 0L, 1L, 3L, NA, NA, NA, NA, NA, 4L, 0L, NA, 5L, 7L, NA, 0L, NA, NA, NA, 5L, 0L, 5L, 4L, 0L, 2L, 0L, 0L, 7L, 0L, NA, 5L)), .Names = c("sn", "earn_tot", "earn_and_current_tot", "pass_99", "pass_65"), row.names = c(NA, 80L), class = "data.frame") 

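Four columns of this subset matter most: "earn_tot", "earn_and_current_tot", "pass_99", and "pass_65". Many of the companies listed here have been anonymized. I am working with about 100 companies. The company names are in the column named "sn". The whole subset data set is called Subset.MergedEx.SO.

I apologize for not providing a good reproducible example. Thank you for your patience. I have been reading up on how to build one, and used the following code: dput(head(Subset.MergedEx.SO, 80))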

This may not be an optimal solution, but it only uses base R and the psych package.

Here is the data:

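    # company is a factor so that levels() works on it later
    # (since R 4.0, data.frame() no longer converts strings to factors by default)
    df <- data.frame(company = factor(rep(c("A", "B", "C", "D"), each = 5)),
                     attendance = sample(5:10, 20, TRUE),
                     representatives = sample(2:30, 20, TRUE),
                     presenters = sample(20:30, 20, TRUE),
                     audience = sample(50:70, 20, TRUE))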

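I wrote a function to get the values you need. I am assuming you only have five kinds of information: company name, attendance, representatives, presenters, and audience.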

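    get.values <- function(x) {
      require(psych)
      # One describe() table per company: rows are variables, columns are statistics
      info <- describeBy(x[, 2:5], group = x[, 1])
      # Count the groups from the result itself rather than from the global df
      n.companies <- length(info)
      n <- list()
      mean <- list()
      sd <- list()
      min <- list()
      max <- list()
      for (i in 1:n.companies) {
        # describe() columns: 2 = n, 3 = mean, 4 = sd, 8 = min, 9 = max
        n[[i]] <- info[[i]][, 2]
        mean[[i]] <- info[[i]][, 3]
        sd[[i]] <- info[[i]][, 4]
        min[[i]] <- info[[i]][, 8]
        max[[i]] <- info[[i]][, 9]
      }
      # One vector per company: all means, then all sds, mins, maxes, and counts
      l <- Map(c, mean, sd, min, max, n)
      valuedf <- do.call(rbind, l)
      return(valuedf)
    }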

I also wrote a function to generate the column names you want; you can name them whatever you like:

    get.names <- function(x) {
      require(psych)
      names <- rownames(describe(x[, 2:5]))
      # paste() is vectorised, so each call labels every variable at once
      avg <- paste("average number of", names)
      sd <- paste("standard deviation of", names)
      min <- paste("min number of", names)
      max <- paste("max number of", names)
      total <- paste("total number of", names)
      # Same order as the values: averages, sds, mins, maxes, counts
      cnames <- c(avg, sd, min, max, total)
      return(cnames)
    }

Combine the values and the names into a new table:

    output <- get.values(df)
    col.names <- get.names(df)
    colnames(output) <- col.names
    # factor() makes this work whether the company column is character or factor
    rownames(output) <- levels(factor(df[, 1]))
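
The functions above group the columns by statistic (all averages first, then all standard deviations, and so on), while the layout in the question groups them by variable. A minimal sketch of reordering the columns is below. It uses a stand-in matrix `output_demo` and shortened column labels that I made up for illustration; the same index should apply to `output` as long as it has the five statistics for four variables in the order produced above.

```r
vars  <- c("attendance", "representatives", "presenters", "audience")
stats <- c("average", "sd", "min", "max", "total")

# Stand-in for `output`: 2 companies x 20 columns in statistic-major order
# (the labels are hypothetical and shorter than the ones from get.names)
col_names <- as.vector(outer(vars, stats, function(v, s) paste(s, "of", v)))
output_demo <- matrix(seq_len(2 * length(col_names)), nrow = 2,
                      dimnames = list(c("A", "B"), col_names))

# Lay the column positions out as a variables x statistics grid,
# then read the grid row by row to get variable-major order
idx <- as.vector(t(matrix(seq_along(col_names), nrow = length(vars))))
by_variable <- output_demo[, idx]

print(colnames(by_variable)[1:5])
```

After this step the first five columns all belong to the first variable, the next five to the second, and so on.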

Export to Excel:

    library(xlsx)
    write.xlsx(output, "descriptives.xlsx")
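
The data in the question contains many missing values, and its company column is called `sn`. As a rough illustration only, here is a base R sketch that builds one row per company with mean, sd, min, max, and n for each variable while skipping NAs. The toy data frame and the helper names `five_stats` and `summarise_by_company` are my own assumptions, not part of the approach above, and the result is written as CSV to a temporary file simply to avoid extra packages.

```r
# Hypothetical toy data using three of the column names from the question
toy <- data.frame(
  sn       = c("Company A", "Company A", "Company A", "Company B", "Company B"),
  earn_tot = c(10, 20, NA, 5, 15),
  pass_65  = c(0, 2, 4, NA, 6)
)

# The five statistics from the question, computed on non-missing values only
five_stats <- function(v) {
  v <- v[!is.na(v)]
  if (length(v) == 0) {
    # Avoid min()/max() warnings when a company has no values at all
    return(c(mean = NA, sd = NA, min = NA, max = NA, n = 0))
  }
  c(mean = mean(v), sd = sd(v), min = min(v), max = max(v), n = length(v))
}

# One row per company: apply five_stats to every variable, then flatten
summarise_by_company <- function(data, group, vars) {
  rows <- lapply(split(data[vars], data[[group]]), function(d) {
    unlist(lapply(d, five_stats))  # names become e.g. "earn_tot.mean"
  })
  do.call(rbind, rows)
}

wide <- summarise_by_company(toy, "sn", c("earn_tot", "pass_65"))
print(wide)

# Written to a temporary CSV here; Excel can open this file
out_file <- file.path(tempdir(), "descriptives.csv")
write.csv(wide, out_file)
```

Rows whose company is NA are dropped by `split()`, and the columns come out grouped by variable, which is close to the layout requested in the question. The resulting matrix could presumably be passed to `write.xlsx` in the same way as `output`.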

What you can do is melt your data into long format and then cast it back to wide format with multiple aggregation functions:

    library(data.table)
    dat.new <- dcast(melt(dat, id = "company"), company ~ variable,
                     fun.aggregate = list(mean, sd), value.var = "value")

This gives:

    > dat.new
       company value_mean_attendance value_mean_presenters value_mean_audience value_sd_attendance value_sd_presenters value_sd_audience
    1:       A                   8.0                  24.8                60.6            1.870829            4.207137          7.668116
    2:       B                   8.2                  23.8                64.2            2.489980            2.387467          2.049390
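
To make the melt-then-cast idea more concrete, here is a small base R sketch of the same two steps. It is only an illustration under my own assumptions: the toy data frame `wide` is made up (it is not `dat`), and `stack()` and `tapply()` stand in for `melt()` and `dcast()`.

```r
# Hypothetical toy data (not the `dat` used in this answer)
wide <- data.frame(
  company    = c("A", "A", "B", "B"),
  attendance = c(6, 10, 7, 9),
  presenters = c(20, 30, 22, 24)
)

# "Melt": stack the measurement columns into (company, value, variable) rows
long <- data.frame(
  company = rep(wide$company, times = 2),
  stack(wide[c("attendance", "presenters")])  # yields columns `values` and `ind`
)
names(long) <- c("company", "value", "variable")

# "Cast": one cell per company x variable, computed once per aggregation function
means <- tapply(long$value, long[c("company", "variable")], mean)
sds   <- tapply(long$value, long[c("company", "variable")], sd)

# Label the columns in the same style as the data.table output
colnames(means) <- paste0("value_mean_", colnames(means))
colnames(sds)   <- paste0("value_sd_", colnames(sds))
result <- cbind(means, sds)
print(result)
```

Each aggregation function contributes one block of columns, which is why the wide result has one mean column and one sd column per variable.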

Now you can write it to an Excel file with the WriteXLS package:

    library(WriteXLS)
    WriteXLS("dat.new", "companies.xls")

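Because you want to compute many statistics for each company, you may want to consider writing each company's summary statistics to a separate sheet in the Excel file.

Again, you melt the data into long format, and then compute lapply(.SD, function(x) list(average = mean(x), sdev = sd(x)))$value for each company and each variable. Split the resulting data.table by company into a list of data.tables. Finally, write this list to an Excel file: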

    dat.new <- melt(dat, id = "company")[, lapply(.SD, function(x) list(average = mean(x), sdev = sd(x)))$value,
                                         .(company, variable)]
    company.list <- split(dat.new, dat.new$company)
    WriteXLS(company.list, "companies.xls")

Now you have an Excel file with a separate tab for each company.
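
For readers who want to see what such a per-company list looks like before it is written out, here is a base R sketch of the same summarise-then-split step. The toy long-format data is made up, and `aggregate()` is used here in place of the data.table syntax, so treat it as an approximation of the idea rather than the code above.

```r
# Hypothetical long-format data: one row per company, variable, and value
long <- data.frame(
  company  = rep(c("A", "B"), each = 4),
  variable = rep(c("attendance", "attendance", "presenters", "presenters"), 2),
  value    = c(6, 10, 20, 30, 7, 9, 22, 24)
)

# Average and standard deviation for every company x variable combination
stats <- aggregate(value ~ company + variable, data = long,
                   FUN = function(v) c(average = mean(v), sdev = sd(v)))

# aggregate() returns the two statistics as one matrix column; flatten it
# into ordinary columns named value.average and value.sdev
stats <- do.call(data.frame, stats)

# One data frame per company; the list names are the company names
company.list <- split(stats, stats$company)
print(names(company.list))
print(company.list$A)
```

Each element of `company.list` holds the rows for a single company, which is the shape that gets written as one sheet per element.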


Data used:

    set.seed(21)
    dat <- data.table(company = rep(c("A", "B"), each = 5),
                      attendance = sample(5:10, 10, TRUE),
                      presenters = sample(20:30, 10, TRUE),
                      audience = sample(50:70, 10, TRUE))