Efficient and accurate age calculation (in years, months, or weeks) in R, given a date of birth and an arbitrary date

Problem description

I am facing the common task of calculating age (in years, months, or weeks) given a date of birth and an arbitrary date. Quite often I have to do this over a very large number of records (more than 300 million), so performance is a key issue here.

After a quick search on SO and Google I found 3 alternatives:

  • A common arithmetic procedure (/365.25) (link)
  • Using functions new_interval() (since renamed interval()) and duration() from package lubridate (link)
  • Function age_calc() from package eeptools (link, link, link)

So, here's my toy code:

# Some toy birthdates
birthdate <- as.Date(c("1978-12-30", "1978-12-31", "1979-01-01", 
                       "1962-12-30", "1962-12-31", "1963-01-01", 
                       "2000-06-16", "2000-06-17", "2000-06-18", 
                       "2007-03-18", "2007-03-19", "2007-03-20", 
                       "1968-02-29", "1968-02-29", "1968-02-29"))

# Given dates to calculate the age
givendate <- as.Date(c("2015-12-31", "2015-12-31", "2015-12-31", 
                       "2015-12-31", "2015-12-31", "2015-12-31", 
                       "2050-06-17", "2050-06-17", "2050-06-17",
                       "2008-03-19", "2008-03-19", "2008-03-19", 
                       "2015-02-28", "2015-03-01", "2015-03-02"))

# Using a common arithmetic procedure ("Time differences in days"/365.25)
(givendate-birthdate)/365.25

# Use the package lubridate
# (interval() replaces new_interval(), which was deprecated in later releases)
library(lubridate)
interval(start = birthdate, end = givendate) / 
                     duration(num = 1, units = "years")

# Use the package eeptools
library(eeptools)
age_calc(dob = birthdate, enddate = givendate, units = "years")

Let's talk about accuracy later and focus first on performance. Here's the code:

# Now let's compare the performance of the alternatives using microbenchmark
library(microbenchmark)
library(ggplot2)  # provides the autoplot() generic used below
mbm <- microbenchmark(
    arithmetic = (givendate - birthdate) / 365.25,
    lubridate = interval(start = birthdate, end = givendate) /
                                     duration(num = 1, units = "years"),
    eeptools = age_calc(dob = birthdate, enddate = givendate, 
                        units = "years"),
    times = 1000
)

# And examine the results
mbm
autoplot(mbm)

Here are the results:

[Figure: microbenchmark timings for the arithmetic, lubridate, and eeptools methods]

Bottom line: the lubridate and eeptools functions perform much worse than the arithmetic method (/365.25 is at least 10 times faster). Unfortunately, the arithmetic method is not accurate enough, and I cannot afford the few mistakes it will make.
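To see where the /365.25 shortcut goes wrong, here is a minimal base-R sketch using one pair from the toy data. The person born on 1978-12-31 turns 37 on 2015-12-31, yet the elapsed days fall just short of 37 average years. The variable names are illustrative only.

```r
# One pair from the toy data: the given date is exactly the 37th birthday
birth <- as.Date("1978-12-31")
given <- as.Date("2015-12-31")

# Elapsed days: 37 * 365 plus the 9 leap days between the two dates
days_elapsed <- as.numeric(given - birth)

# 37 "average" years would need 37 * 365.25 = 13514.25 days,
# so the quotient lands just below 37
approx_age <- days_elapsed / 365.25

days_elapsed
floor(approx_age)  # one year short of the true age of 37
```

The error appears whenever the number of leap days actually elapsed is below the 0.25-per-year average, which is why it shows up on or just after a birthday.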

From what I read in some posts, lubridate and eeptools make no such mistakes (though I haven't looked at the code or read more about those functions to know which method they use). That's why I wanted to use them, but they are too slow for my real application.

Any ideas on an efficient and accurate method to calculate the age?

Oops, it seems lubridate also makes mistakes. Based on this toy example, it apparently makes more mistakes than the arithmetic method (see rows 3, 6, 9, 12 of the output below). Am I doing something wrong?

toy_df <- data.frame(
    birthdate = birthdate,
    givendate = givendate,
    arithmetic = as.numeric((givendate - birthdate) / 365.25),
    lubridate = interval(start = birthdate, end = givendate) /
        duration(num = 1, units = "years"),
    eeptools = age_calc(dob = birthdate, enddate = givendate,
                        units = "years")
)
toy_df[, 3:5] <- floor(toy_df[, 3:5])
toy_df

    birthdate  givendate arithmetic lubridate eeptools
1  1978-12-30 2015-12-31         37        37       37
2  1978-12-31 2015-12-31         36        37       37
3  1979-01-01 2015-12-31         36        37       36
4  1962-12-30 2015-12-31         53        53       53
5  1962-12-31 2015-12-31         52        53       53
6  1963-01-01 2015-12-31         52        53       52
7  2000-06-16 2050-06-17         50        50       50
8  2000-06-17 2050-06-17         49        50       50
9  2000-06-18 2050-06-17         49        50       49
10 2007-03-18 2008-03-19          1         1        1
11 2007-03-19 2008-03-19          1         1        1
12 2007-03-20 2008-03-19          0         1        0
13 1968-02-29 2015-02-28         46        47       46
14 1968-02-29 2015-03-01         47        47       47
15 1968-02-29 2015-03-02         47        47       47
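
One possible explanation, stated as an assumption rather than something checked against lubridate's source: a duration is a fixed length of time, so if one "year" of duration is 365 days in the version used here, dividing by it overcounts once leap days accumulate. The base-R sketch below (with illustrative names bd and gd) compares a fixed 365-day divisor with the 365.25 divisor for rows 1 to 3.

```r
# Rows 1-3 of the toy data
bd <- as.Date(c("1978-12-30", "1978-12-31", "1979-01-01"))
gd <- as.Date(rep("2015-12-31", 3))

days_elapsed <- as.numeric(gd - bd)

# Assumption: a one-year duration is a fixed 365 days
fixed_365 <- floor(days_elapsed / 365)

# The plain arithmetic method from the question
average_365_25 <- floor(days_elapsed / 365.25)

data.frame(bd, gd, fixed_365, average_365_25)
```

Under that assumption the fixed-365 column matches the lubridate column above, which would suggest the discrepancy comes from the divisor rather than from a misuse of the functions.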

Recommended answer

OK, so I found this function in another post:

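age <- function(from, to) {
    # POSIXlt exposes the calendar fields: year, mon (0-11), mday (1-31)
    from_lt <- as.POSIXlt(from)
    to_lt <- as.POSIXlt(to)

    # Difference in calendar years
    age <- to_lt$year - from_lt$year

    # Subtract one year if the birthday has not yet occurred in the year of `to`
    ifelse(to_lt$mon < from_lt$mon |
               (to_lt$mon == from_lt$mon & to_lt$mday < from_lt$mday),
           age - 1, age)
}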

It was posted by @Jim, who said: "The following function takes a vector of Date objects and calculates the ages, correctly accounting for leap years. Seems to be a simpler solution than any of the other answers".

It is indeed simpler, and it does the trick I was looking for. On average, it is actually faster than the arithmetic method (about 75% faster).

mbm <- microbenchmark(
    arithmetic = (givendate - birthdate) / 365.25,
    lubridate = interval(start = birthdate, end = givendate) /
        duration(num = 1, units = "years"),
    eeptools = age_calc(dob = birthdate, enddate = givendate, 
                        units = "years"),
    age = age(from = birthdate, to = givendate),
    times = 1000
)
mbm
autoplot(mbm)

At least in my examples it does not make any mistakes (and it should not in any example; it's a pretty straightforward function built on ifelse()).

toy_df <- data.frame(
    birthdate = birthdate,
    givendate = givendate,
    arithmetic = as.numeric((givendate - birthdate) / 365.25),
    lubridate = interval(start = birthdate, end = givendate) /
        duration(num = 1, units = "years"),
    eeptools = age_calc(dob = birthdate, enddate = givendate,
                        units = "years"),
    age = age(from = birthdate, to = givendate)
)
toy_df[, 3:6] <- floor(toy_df[, 3:6])
toy_df

    birthdate  givendate arithmetic lubridate eeptools age
1  1978-12-30 2015-12-31         37        37       37  37
2  1978-12-31 2015-12-31         36        37       37  37
3  1979-01-01 2015-12-31         36        37       36  36
4  1962-12-30 2015-12-31         53        53       53  53
5  1962-12-31 2015-12-31         52        53       53  53
6  1963-01-01 2015-12-31         52        53       52  52
7  2000-06-16 2050-06-17         50        50       50  50
8  2000-06-17 2050-06-17         49        50       50  50
9  2000-06-18 2050-06-17         49        50       49  49
10 2007-03-18 2008-03-19          1         1        1   1
11 2007-03-19 2008-03-19          1         1        1   1
12 2007-03-20 2008-03-19          0         1        0   0
13 1968-02-29 2015-02-28         46        47       46  46
14 1968-02-29 2015-03-01         47        47       47  47
15 1968-02-29 2015-03-02         47        47       47  47
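
Since the function only compares month and day, the ifelse() call can also be written as plain arithmetic on the logical condition. The sketch below is a hypothetical variant (the name age_no_ifelse is not from the original post); whether it is actually faster on 300 million records would need to be benchmarked.

```r
# Hypothetical variant of age(): same comparison, without ifelse()
age_no_ifelse <- function(from, to) {
    from_lt <- as.POSIXlt(from)
    to_lt <- as.POSIXlt(to)

    # TRUE where the birthday has not yet occurred in the year of `to`
    before_birthday <- to_lt$mon < from_lt$mon |
        (to_lt$mon == from_lt$mon & to_lt$mday < from_lt$mday)

    # A logical is coerced to 0/1, so this subtracts one year where needed
    to_lt$year - from_lt$year - before_birthday
}

# Leap-day birthdate from the toy data (rows 13 to 15)
leap_ages <- age_no_ifelse(as.Date("1968-02-29"),
                           as.Date(c("2015-02-28", "2015-03-01", "2015-03-02")))
leap_ages
```

Note the convention this implies for someone born on February 29: in a non-leap year the age increases on March 1, which agrees with the eeptools and age columns above.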

I do not consider it a complete solution, because I also wanted the age in months and weeks, and this function only handles years. I am posting it here anyway because it solves the problem for age in years. I will not accept it because:

  1. I will wait for @Jim to post it as an answer.
  2. I will wait to see whether someone else comes up with a complete solution (efficient, accurate, and producing age in years, months, or weeks as desired).
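
As a rough sketch of how the same idea might extend to months and weeks, the two helpers below are hypothetical (age_months and age_weeks are not from the original post). Weeks are unambiguous because they are a fixed number of days. Months depend on a convention: here a month counts as complete once the day of the month reaches the birth day, so someone born on the 31st completes a month on the 1st of the following month when the current month is shorter.

```r
# Hypothetical helper: completed calendar months between two dates
age_months <- function(from, to) {
    from_lt <- as.POSIXlt(from)
    to_lt <- as.POSIXlt(to)

    # Month difference, minus one if the birth day of the month
    # has not yet been reached in the month of `to`
    (to_lt$year - from_lt$year) * 12 + (to_lt$mon - from_lt$mon) -
        (to_lt$mday < from_lt$mday)
}

# Hypothetical helper: completed weeks between two dates
age_weeks <- function(from, to) {
    # Elapsed days, integer-divided by 7
    as.numeric(as.Date(to) - as.Date(from)) %/% 7
}

# Rows 10 to 12 of the toy data
bd <- as.Date(c("2007-03-18", "2007-03-19", "2007-03-20"))
gd <- as.Date(rep("2008-03-19", 3))
age_months(bd, gd)
age_weeks(bd, gd)
```

Both helpers are vectorised in the same way as age(), so they could be added to the benchmark above for comparison.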
