Error running maximum likelihood estimation on a three-parameter Weibull CDF

Problem description

I am working with the cumulative emergence of flies over time (taken at irregular intervals) over many summers (though first I am just trying to make one year work). The cumulative emergence follows a sigmoid pattern and I want to create a maximum likelihood estimation of a 3-parameter Weibull cumulative distribution function. The three-parameter models I've been trying to use in the fitdistrplus package keep giving me an error. I think this must have something to do with how my data is structured, but I cannot figure it out. Obviously I want it to read each point as an x (degree days) and a y (emergence) value, but it seems to be unable to read two columns. The main error I'm getting says "Non-numeric argument to mathematical function" or (with slightly different code) "data must be a numeric vector of length greater than 1". Below is my code including added columns in the df_dd_em dataframe for cumulative emergence and percent emergence in case that is useful.

    degree_days <-   c(998.08,1039.66,1111.29,1165.89,1236.53,1293.71,
                      1347.66,1387.76,1445.47,1493.44,1553.23,1601.97,
                      1670.28,1737.29,1791.94,1849.20,1920.91,1967.25,
                      2036.64,2091.85,2152.89,2199.13,2199.13,2263.09,
                      2297.94,2352.39,2384.03,2442.44,2541.28,2663.90,
                      2707.36,2773.82,2816.39,2863.94)
    emergence <-  c(0,0,0,1,1,0,2,3,17,10,0,0,0,2,0,3,0,0,1,5,0,0,0,0,
                   0,0,0,0,1,0,0,0,0,0)
    cum_em <- cumsum(emergence)
    df_dd_em <- data.frame (degree_days, emergence, cum_em)
    df_dd_em$percent <- ave(df_dd_em$emergence, FUN = function(df_dd_em) 100*(df_dd_em)/46)
    df_dd_em$cum_per <- ave(df_dd_em$cum_em, FUN = function(df_dd_em) 100*(df_dd_em)/46)
    x <- pweibull(df_dd_em[c(1,3)],shape=5)
    dframe2.mle <- fitdist(x, "weibull",method='mle')

Recommended answer

Here's my best guess at what you're after:

Set up the data:

dd <- data.frame(degree_days=c(998.08,1039.66,1111.29,1165.89,1236.53,1293.71,
                      1347.66,1387.76,1445.47,1493.44,1553.23,1601.97,
                      1670.28,1737.29,1791.94,1849.20,1920.91,1967.25,
                      2036.64,2091.85,2152.89,2199.13,2199.13,2263.09,
                      2297.94,2352.39,2384.03,2442.44,2541.28,2663.90,
                      2707.36,2773.82,2816.39,2863.94),
                 emergence=c(0,0,0,1,1,0,2,3,17,10,0,0,0,2,0,3,0,0,1,5,0,0,0,0,
                 0,0,0,0,1,0,0,0,0,0))
dd <- transform(dd,cum_em=cumsum(emergence))

We're actually going to fit to an "interval-censored" distribution (i.e. probability of emergence between successive degree day observations: this version assumes that the first observation refers to observations before the first degree-day observation, you could change it to refer to observations after the last observation).

library(bbmle)
## y*log(p) allowing for 0/0 occurrences:
y_log_p <- function(y,p) ifelse(y==0 & p==0,0,y*log(p))
NLLfun <- function(scale,shape,x=dd$degree_days,y=dd$emergence) {
    prob <- pmax(diff(pweibull(c(-Inf,x),      ## or (c(x,Inf))
             shape=shape,scale=scale)),1e-6)
    ## multinomial probability
    -sum(y_log_p(y,prob))
}    

I should probably have used something more systematic like the method of moments (i.e. matching the mean and variance of a Weibull distribution with the mean and variance of the data), but I just hacked around a bit to find plausible starting values:

## preliminary look (method of moments would be better)
scvec <- 10^(seq(0,4,length=101))
plot(scvec,sapply(scvec,NLLfun,shape=1))
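
The method-of-moments alternative mentioned above could look roughly like the sketch below, which uses base R only. It rests on one simplifying assumption that is not in the original answer: each fly is assigned the degree-day value at which it was counted, ignoring the interval censoring. The names `weibull_cv2` and `startvals_mom` are made up for illustration.

```r
## Data from the question: degree-day readings and emergences counted at each reading
x <- c(998.08,1039.66,1111.29,1165.89,1236.53,1293.71,
       1347.66,1387.76,1445.47,1493.44,1553.23,1601.97,
       1670.28,1737.29,1791.94,1849.20,1920.91,1967.25,
       2036.64,2091.85,2152.89,2199.13,2199.13,2263.09,
       2297.94,2352.39,2384.03,2442.44,2541.28,2663.90,
       2707.36,2773.82,2816.39,2863.94)
y <- c(0,0,0,1,1,0,2,3,17,10,0,0,0,2,0,3,0,0,1,5,0,0,0,0,
       0,0,0,0,1,0,0,0,0,0)

## Emergence-weighted mean and variance of degree days
## (assumption: each fly emerged exactly at the reading where it was counted)
m <- sum(x * y) / sum(y)
v <- sum(y * (x - m)^2) / sum(y)

## The squared coefficient of variation of a Weibull depends on the shape k only:
## CV^2 = gamma(1 + 2/k) / gamma(1 + 1/k)^2 - 1
weibull_cv2 <- function(k) exp(lgamma(1 + 2 / k) - 2 * lgamma(1 + 1 / k)) - 1

## Solve CV^2(k) = v / m^2 for the shape, then back out the scale from the mean
shape0 <- uniroot(function(k) weibull_cv2(k) - v / m^2,
                  interval = c(0.1, 100))$root
scale0 <- m / gamma(1 + 1 / shape0)

startvals_mom <- list(scale = scale0, shape = shape0)
print(startvals_mom)
```

A list like `startvals_mom` could then be passed as `start` (and, unlisted, as `parscale`) in place of hand-picked starting values. It only covers the two-parameter case; a threshold parameter would need a separate guess.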

It's important to use parscale to let R know that the parameters are on very different scales:

startvals <- list(scale=1000,shape=1)
m1 <- mle2(NLLfun,start=startvals,
     control=list(parscale=unlist(startvals)))

Now try with a three-parameter Weibull (as originally requested) -- requires only a slight modification of what we already have:

library(FAdist)
NLLfun2 <- function(scale,shape,thres,
                    x=dd$degree_days,y=dd$emergence) {
    prob <- pmax(diff(pweibull3(c(-Inf,x),shape=shape,scale=scale,thres=thres)),
                 1e-6)
    ## multinomial probability
    -sum(y_log_p(y,prob))
}    
startvals2 <- list(scale=1000,shape=1,thres=100)
m2 <- mle2(NLLfun2,start=startvals2,
     control=list(parscale=unlist(startvals2)))

Looks like the three-parameter fit is much better:

library(emdbook)
AICtab(m1,m2)
##    dAIC df
## m2  0.0 3 
## m1 21.7 2

Here is a graphical summary:

with(dd,plot(cum_em~degree_days,cex=3))
with(as.list(coef(m1)),curve(sum(dd$emergence)*
                             pweibull(x,shape=shape,scale=scale),col=2,
                             add=TRUE))
with(as.list(coef(m2)),curve(sum(dd$emergence)*
                             pweibull3(x,shape=shape,
                                       scale=scale,thres=thres),col=4,
                             add=TRUE))

(could also do this more elegantly with ggplot2 ...)

These don't seem like spectacularly good fits, but they're sane. (You could in principle do a chi-squared goodness-of-fit test based on the expected number of emergences per interval, and accounting for the fact that you've fitted a three-parameter model, although the expected counts might be a bit low ...)
Confidence intervals on the fit are a bit of a nuisance; your choices are (1) bootstrapping; (2) parametric bootstrapping (resample parameters assuming the estimates follow a multivariate normal sampling distribution); (3) delta method.
Using bbmle::mle2 makes it easy to do things like get profile confidence intervals:

 confint(m1)
 ##             2.5 %      97.5 %
 ## scale 1576.685652 1777.437283
 ## shape    4.223867    6.318481
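
The expected-counts idea behind the chi-squared check mentioned above can be sketched in base R. This is an illustration under stated assumptions, not the answer's own code: it refits the two-parameter model with `optim()` on a log scale (instead of `mle2`) so the block is self-contained, and the names `interval_prob`, `nll` and `pearson` are hypothetical.

```r
## Data from the question
x <- c(998.08,1039.66,1111.29,1165.89,1236.53,1293.71,
       1347.66,1387.76,1445.47,1493.44,1553.23,1601.97,
       1670.28,1737.29,1791.94,1849.20,1920.91,1967.25,
       2036.64,2091.85,2152.89,2199.13,2199.13,2263.09,
       2297.94,2352.39,2384.03,2442.44,2541.28,2663.90,
       2707.36,2773.82,2816.39,2863.94)
y <- c(0,0,0,1,1,0,2,3,17,10,0,0,0,2,0,3,0,0,1,5,0,0,0,0,
       0,0,0,0,1,0,0,0,0,0)

## Probability of emerging between successive readings; the first interval
## runs from -Inf to the first reading, as in the answer's likelihood
interval_prob <- function(scale, shape)
    diff(pweibull(c(-Inf, x), shape = shape, scale = scale))

## Negative log-likelihood, parameterised on the log scale so that
## optim() cannot propose a negative scale or shape
nll <- function(logpar) {
    p <- pmax(interval_prob(exp(logpar[1]), exp(logpar[2])), 1e-6)
    -sum(y * log(p))
}
fit <- optim(c(log(1500), log(5)), nll)   # Nelder-Mead, hand-picked start
est <- exp(fit$par)                       # back-transformed c(scale, shape)

## Expected emergences per interval under the fitted model
expected <- sum(y) * interval_prob(est[1], est[2])

## Pearson statistic; the repeated reading at 2199.13 gives a zero-width
## interval with zero expected count, which is dropped
keep <- expected > 0
pearson <- sum((y[keep] - expected[keep])^2 / expected[keep])
print(round(cbind(degree_days = x, observed = y, expected = expected), 2))
```

Many of the expected counts are well below the usual rule-of-thumb minimum for a chi-squared test, so comparing `pearson` against a chi-squared distribution is doubtful without first pooling adjacent intervals; the degrees of freedom would also have to be reduced by the number of fitted parameters.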
