This article walks through how a stubborn memory leak / garbage collection problem in a Java application was finally tracked down. Hopefully the question and answer below are a useful reference for anyone chasing a similar issue.

**Problem Description**
This is a problem I have been trying to track down for a couple of months now. I have a Java app running that processes XML feeds and stores the results in a database. There have been intermittent resource problems that are very difficult to track down.

**Background:** On the production box (where the problem is most noticeable), I do not have particularly good access to the box and have been unable to get JProfiler running. That box is a 64-bit quad-core, 8 GB machine running CentOS 5.2, Tomcat 6, and Java 1.6.0_11. It starts with these Java opts:

JAVA_OPTS="-server -Xmx5g -Xms4g -Xss256k -XX:MaxPermSize=256m -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseConcMarkSweepGC -XX:+PrintTenuringDistribution -XX:+UseParNewGC"

The technology stack is the following:

- CentOS 64-bit 5.2
- Java 6u11
- Tomcat 6
- Spring/WebMVC 2.5
- Hibernate 3
- Quartz 1.6.1
- DBCP 1.2.1
- MySQL 5.0.45
- Ehcache 1.5.0
- (and of course a host of other dependencies, notably the jakarta-commons libraries)

The closest I can get to reproducing the problem is a 32-bit machine with lower memory requirements, which I do have control over. I have probed it to death with JProfiler and fixed many performance problems (synchronization issues, precompiling/caching XPath queries, reducing the thread pool, removing unnecessary Hibernate pre-fetching, and overzealous "cache-warming" during processing). In each case, the profiler showed these as taking up huge amounts of resources for one reason or another, and they were no longer the primary resource hogs once the changes went in.

**The Problem:** The JVM seems to completely ignore the memory usage settings, fills all memory, and becomes unresponsive. This is an issue for the customer-facing end, which expects a regular poll (on a 5-minute basis with a 1-minute retry), as well as for our operations teams, who are constantly notified that a box has become unresponsive and have to restart it. There is nothing else significant running on this box.

The problem appears to be garbage collection. We are using the ConcurrentMarkSweep collector (as noted above) because the original stop-the-world collector was causing JDBC timeouts and became increasingly slow. The logs show that as memory usage increases, it begins to throw CMS failures and falls back to the original stop-the-world collector, which then seems to not collect properly.

However, running with JProfiler, the "Run GC" button seems to clean up the memory nicely rather than showing an increasing footprint, but since I cannot connect JProfiler directly to the production box, and resolving proven hotspots doesn't seem to be working, I am left with the voodoo of tuning garbage collection blind.

**What I have tried:**

- Profiling and fixing hotspots.
- Using the STW, Parallel, and CMS garbage collectors.
- Running with min/max heap sizes at 1/2, 2/4, 4/5, and 6/6 increments.
- Running with permgen space in 256M increments up to 1 GB.
- Many combinations of the above.
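As an aside that is not part of the original question: when the process size reported by the OS and the heap the JVM thinks it is using disagree (which is exactly what the UPDATE further down observes), one cheap cross-check is to log the JVM's own numbers over time. This is a generic sketch using the standard MemoryMXBean API; the class name, output format, and one-minute interval are arbitrary choices.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

/** Periodically logs the JVM's own view of heap and non-heap usage. */
public class MemoryLogger implements Runnable {

    public void run() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        while (!Thread.currentThread().isInterrupted()) {
            MemoryUsage heap = memory.getHeapMemoryUsage();
            MemoryUsage nonHeap = memory.getNonHeapMemoryUsage();
            System.out.println("heap used=" + toMb(heap.getUsed())
                    + "M committed=" + toMb(heap.getCommitted())
                    + "M max=" + toMb(heap.getMax())
                    + "M | non-heap used=" + toMb(nonHeap.getUsed()) + "M");
            try {
                Thread.sleep(60000L); // once a minute; interval is arbitrary
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    private static long toMb(long bytes) {
        return bytes / (1024 * 1024);
    }
}
```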
I have also consulted the JVM [tuning reference](http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html), but can't really find anything explaining this behavior, or any examples of _which_ tuning parameters to use in a situation like this. I have also (unsuccessfully) tried JProfiler in offline mode and connecting with jconsole and VisualVM, but I can't seem to find anything that will interpret my GC log data.

Unfortunately, the problem also pops up sporadically; it seems to be unpredictable. It can run for days or even a week without having any problems, or it can fail 40 times in a day, and the only thing I can seem to catch consistently is that garbage collection is acting up.

Can anyone give any advice as to:

a) Why a JVM is using 8 physical gigs and 2 GB of swap space when it is configured to max out at less than 6?

b) A reference to GC tuning that actually explains or gives reasonable examples of when and what kind of settings to use the advanced collectors with.

c) A reference to the most common Java memory leaks (I understand unclaimed references, but I mean at the library/framework level, or something more inherent in data structures, like hashmaps).

Thanks for any and all insight you can provide.

**EDIT**

Emil H:

1) Yes, my development cluster is a mirror of production data, down to the media server. The primary difference is the 32/64-bit split and the amount of RAM available, which I can't replicate very easily, but the code, queries, and settings are identical.

2) There is some legacy code that relies on JAXB, but in reordering the jobs to try to avoid scheduling conflicts, I have that execution generally eliminated since it runs once a day. The primary parser uses XPath queries which call down to the javax.xml.xpath package. This was the source of a few hotspots: for one, the queries were not being pre-compiled, and two, the references to them were hardcoded strings. I created a threadsafe cache (hashmap) and factored the references to the XPath queries out into final static Strings, which lowered resource consumption significantly (a sketch of that kind of caching follows this list).

3) An additional note: the other primary consumer is image operations from JAI (reprocessing images from a feed). I am unfamiliar with Java's graphics libraries, but from what I have found they are not particularly leaky.

(Thanks for the answers so far, folks!)
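To make point 2 above concrete, here is a minimal sketch of that kind of pre-compiled, cached XPath setup. It is illustrative rather than the original code: the query constant is hypothetical, and because `javax.xml.xpath` objects are not guaranteed thread-safe, this version keeps the compiled expressions in a per-thread cache instead of a single shared hashmap.

```java
import java.util.HashMap;
import java.util.Map;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/**
 * Illustrative sketch only -- not the original code. Query strings are
 * factored out into constants and compiled once instead of being
 * re-compiled from hardcoded strings on every call.
 */
public final class XPathQueries {

    // Hypothetical query; the real feed structure is not shown in the post.
    public static final String ITEM_TITLE = "/feed/item/title";

    // javax.xml.xpath objects are not guaranteed thread-safe, so compiled
    // expressions are cached per worker thread rather than in one shared map.
    private static final ThreadLocal<Map<String, XPathExpression>> CACHE =
            new ThreadLocal<Map<String, XPathExpression>>() {
                @Override
                protected Map<String, XPathExpression> initialValue() {
                    return new HashMap<String, XPathExpression>();
                }
            };

    private XPathQueries() {
    }

    /** Returns a compiled expression, compiling and caching it on first use. */
    public static XPathExpression compiled(String expression) throws XPathExpressionException {
        Map<String, XPathExpression> cache = CACHE.get();
        XPathExpression compiled = cache.get(expression);
        if (compiled == null) {
            XPath xpath = XPathFactory.newInstance().newXPath();
            compiled = xpath.compile(expression);
            cache.put(expression, compiled);
        }
        return compiled;
    }

    /** Convenience helper: evaluate a cached expression against a DOM node. */
    public static NodeList selectNodes(Node context, String expression)
            throws XPathExpressionException {
        return (NodeList) compiled(expression).evaluate(context, XPathConstants.NODESET);
    }
}
```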
**UPDATE:** I was able to connect to the production instance with VisualVM, but it had disabled the GC visualization / run-GC option (though I could view it locally). The interesting thing: the heap allocation of the VM is obeying the JAVA_OPTS, and the actual allocated heap is sitting comfortably at 1-1.5 gigs and doesn't seem to be leaking, but the box-level monitoring still shows a leak pattern that is not reflected in the VM monitoring. There is nothing else running on this box, so I am stumped.

**Solution**

Well, I finally found the issue that was causing this, and I'm posting a detailed answer in case someone else has these problems.

I tried jmap while the process was acting up, but this usually caused the JVM to hang further, and I would have to run it with --force. This resulted in heap dumps that seemed to be missing a lot of data, or at least missing the references between them. For analysis, I tried jhat, which presents a lot of data but not much in the way of how to interpret it. Secondly, I tried the Eclipse-based memory analysis tool, MAT (http://www.eclipse.org/mat/), which showed that the heap was mostly classes related to Tomcat.

The issue was that jmap was not reporting the actual state of the application; it was only catching the classes on shutdown, which were mostly Tomcat classes.

I tried a few more times and noticed that there were some very high counts of model objects (actually 2-3x more than were marked public in the database). Using this, I analyzed the slow query logs and a few unrelated performance problems. I tried extra-lazy loading (http://docs.jboss.org/hibernate/core/3.3/reference/en/html/performance.html), as well as replacing a few Hibernate operations with direct JDBC queries (mostly where it was dealing with loading and operating on large collections -- the JDBC replacements just worked directly on the join tables), and replaced some other inefficient queries that MySQL was logging.

These steps improved pieces of the frontend performance, but still did not address the issue of the leak; the app was still unstable and acting unpredictably.

Finally, I found the option -XX:+HeapDumpOnOutOfMemoryError. This finally produced a very large (~6.5GB) hprof file that accurately showed the state of the application. Ironically, the file was so large that jhat could not analyze it, even on a box with 16 GB of RAM. Fortunately, MAT was able to produce some nice-looking graphs and showed some better data.

This time what stuck out was that a single Quartz thread was taking up 4.5GB of the 6GB of heap, and the majority of that was a Hibernate StatefulPersistenceContext (https://www.hibernate.org/hib_docs/v3/api/org/hibernate/engine/StatefulPersistenceContext.html). This class is used by Hibernate internally as its primary cache (I had disabled the second-level and query caches backed by EHCache). It is used to enable most of the features of Hibernate, so it can't be directly disabled (you can work around it, but Spring doesn't support the stateless session), and I would be very surprised if it had such a major memory leak in a mature product. So why was it leaking now?

Well, it was a combination of things: the Quartz thread pool instantiates with certain things being threadLocal, and Spring was injecting a session factory that created a session at the start of each Quartz thread's lifecycle, which was then being reused to run the various Quartz jobs that used the Hibernate session. Hibernate was then caching in the session, which is its expected behavior.

The problem is that the thread pool was never releasing the session, so Hibernate stayed resident and maintained the cache for the lifecycle of the session. Since this was using Spring's Hibernate template support, there was no explicit use of the sessions (we use a dao -> manager -> driver -> quartz-job hierarchy; the dao is injected with the Hibernate configs through Spring, so the operations are done directly on the templates). So the session was never being closed, Hibernate kept references to the cached objects so they were never garbage collected, and each time a new job ran it would just keep filling up the cache local to the thread, so there was not even any sharing between the different jobs. Also, since this is a write-intensive job (very little reading), the cache was mostly wasted, and the objects just kept getting created.

The solution: create a dao method that explicitly calls session.flush() and session.clear(), and invoke that method at the beginning of each job (a sketch of what that could look like follows).
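A minimal sketch of what that three-line fix might look like with Spring's HibernateDaoSupport/HibernateTemplate; the class, method, and job names here are invented for illustration, since the original code is not shown in the post.

```java
import org.quartz.Job;
import org.quartz.JobExecutionContext;
import org.springframework.orm.hibernate3.support.HibernateDaoSupport;

/**
 * Illustrative sketch of the fix -- class, method and job names are
 * hypothetical, not the original code.
 */
public class SessionMaintenanceDao extends HibernateDaoSupport {

    /** Flush pending changes and evict everything from the session-level cache. */
    public void resetSession() {
        getHibernateTemplate().flush();
        getHibernateTemplate().clear();
    }
}

/** A Quartz job that resets the thread-bound session before doing any work. */
class FeedProcessingJob implements Job {

    private SessionMaintenanceDao sessionMaintenanceDao; // wired in via Spring (setter omitted)

    public void execute(JobExecutionContext context) {
        // Keeps the session's first-level cache from growing across job runs.
        sessionMaintenanceDao.resetSession();
        // ... actual feed processing through the normal dao -> manager chain ...
    }
}
```

The key point is that flush() writes out any pending changes and clear() detaches every cached entity, so the session bound to the Quartz worker thread starts each job empty instead of accumulating objects for the life of the thread.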
The app has been running for a few days now with no monitoring issues, memory errors, or restarts.

Thanks for everyone's help on this. It was a pretty tricky bug to track down, as everything was doing exactly what it was supposed to, but in the end a three-line method managed to fix all the problems.

That wraps up this article on tracking down memory leak / garbage collection problems in Java. We hope the answer above is helpful, and thanks for your continued support!