众所周知,Oracle JDK  是 Java 语言的绝对权威,很多时候 JDK 与 Java 语言近似一个概念。但我们始终要保持实事求是的精神,敢于质疑。本文记录了一次线上troubleshoot 实战,包含问题分析、解决并提交 Oracle JDK bug 的核心过程。一、背景现象 总之 就是某系统上线后 CLOSE_WAIT数量随着时间增加而大量增加,持续触发多个告警。二、分析定位过程部署了一个节点,用来复现之前出现的问题。 Step1 问题聚焦先查看到底是哪些IP之间的连接产生了大量CLOSE_WAIT,另外系统还会涉及调第三方,总之要确认连接建立的双方。执行命令:   结果: (ps:xxx、yyy、zzz 均无含义,基于信息安全考虑,屏蔽掉 ip)。tcp 3547 0 10.107.17.xxx:34602 yyy.12.230.115:443 CLOSE_WAIT 19819/javatcp 38 0 10.107.17.xxx:59088 yyy.12.230.115:443 CLOSE_WAIT 19819/javatcp 38 0 10.107.17.xxx:58028 yyy.12.230.115:443 CLOSE_WAIT 19819/javatcp 38 0 10.107.17.xxx:51962 yyy.12.230.115:443 CLOSE_WAIT 19819/javatcp 3563 0 10.107.17.xxx:46962 yyy.12.230.115:443 CLOSE_WAIT 19819/javatcp 38 0 10.107.17.xxx:34608 yyy.12.230.115:443 CLOSE_WAIT 19819/javatcp 38 0 10.107.17.xxx:46496 yyy.12.230.115:443 CLOSE_WAIT 19819/javatcp 38 0 10.107.17.xxx:50774 yyy.12.230.115:443 CLOSE_WAIT 19819/javatcp 38 0 10.107.17.xxx:59904 yyy.12.230.115:443 CLOSE_WAIT 19819/javatcp 38 0 10.107.17.xxx:40208 yyy.12.230.115:443 CLOSE_WAIT 19819/javatcp 38 0 10.107.17.xxx:41064 yyy.12.230.115:443 CLOSE_WAIT 19819/javatcp 38 0 10.107.17.xxx:36994 yyy.12.230.115:443 CLOSE_WAIT 19819/javatcp 3547 0 10.107.17.xxx:45080 yyy.12.230.115:443 CLOSE_WAIT 19819/javatcp 6235 0 10.107.17.xxx:60966 yyy.12.230.115:443 CLOSE_WAIT 19819/javatcp 38 0 10.107.17.xxx:56178 yyy.12.230.115:443 CLOSE_WAIT 19819/javatcp 3547 0 10.107.17.xxx:39922 yyy.12.230.115:443 CLOSE_WAIT 19819/javatcp 38 0 10.107.17.xxx:43270 yyy.12.230.115:443 CLOSE_WAIT 19819/javatcp 38 0 10.107.17.xxx:40926 zzz.202.32.242:443 CLOSE_WAIT 19819/javatcp 38 0 10.107.17.xxx:44472 yyy.12.230.115:443 CLOSE_WAIT 19819/javatcp 2891 0 10.107.17.xxx:43036 zzz.202.32.241:443 CLOSE_WAIT 19819/java................tcp 38 0 10.107.17.xxx:33472 yyy.12.230.115:443 CLOSE_WAIT 19819/javatcp 38 0 10.107.17.xxx:51976 yyy.12.230.115:443 CLOSE_WAIT 19819/javatcp 38 0 10.107.17.xxx:57788 yyy.12.230.115:443 CLOSE_WAIT 19819/javatcp 38 0 10.107.17.xxx:35638 yyy.12.230.115:443 CLOSE_WAIT 19819/javatcp 38 0 10.107.17.xxx:43778 yyy.12.230.115:443 CLOSE_WAIT 19819/javatcp 38 0 10.107.17.xxx:46418 yyy.12.230.115:443 CLOSE_WAIT 19819/javatcp 38 0 10.107.17.xxx:49914 yyy.12.230.115:443 CLOSE_WAIT 19819/javatcp 38 0 10.107.17.xxx:49258 yyy.12.230.115:443 CLOSE_WAIT 19819/javatcp 38 0 10.107.17.xxx:48718 yyy.12.230.115:443 CLOSE_WAIT 19819/javatcp 38 0 10.107.17.xxx:51480 yyy.12.230.115:443 CLOSE_WAIT 19819/javatcp 38 0 10.107.17.xxx:59816 yyy.12.230.115:443 CLOSE_WAIT 19819/javatcp 38 0 10.107.17.xxx:49266 yyy.12.230.115:443 CLOSE_WAIT 19819/javatcp 38 0 10.107.17.xxx:50246 yyy.12.230.115:443 CLOSE_WAIT 19819/javatcp 38 0 10.107.17.xxx:39324 yyy.12.230.115:443 CLOSE_WAIT 19819/java总之: 这个三个IP是导火索。 Step2 问题分析这三个IP具体是谁?具体是请求了哪个接口?暂时无法直接获知!最直接的导火索暂时断了线索。接着从侧面开始查看更多信息,JVM信息外部资源、线程 什么的都看了,未发现明显异常抓包要抓包获取更多线索了。对于很久没有碰过TCP层,有些吃力。 得到线索:发现大量的RST那么是什么操作会导致CLOSE_WAIT呢?什么样的连接导致大量RST呢(可参考RST通常原因)?  Step3 代码分析定位运维大佬的协助查询,得知这三个IP是图片CDN服务。至此,可以定位到具体代码逻辑,图片CDN请求可以排查代码。仔细分析这部分源码后,推测因为服务器 发起 URL请求,请求不存在,导致抛出异常,但是JDK中却没有地方关闭Socket。javax.imageio.read(URL)/** * Returns a BufferedImage as the result of decoding * a supplied URL with an ImageReader * chosen automatically from among those currently registered. An * InputStream is obtained from the URL, * which is wrapped in an ImageInputStream. If no * registered ImageReader claims to be able to read * the resulting stream, null is returned. * *The current cache settings from getUseCacheand * getCacheDirectory will be used to control caching in the * ImageInputStream that is created. * *This method does not attempt to locate * ImageReaders that can read directly from a * URL; that may be accomplished using * IIORegistry and ImageReaderSpi. * * @param input a URL to read from. * * @return a BufferedImage containing the decoded * contents of the input, or null. * * @exception IllegalArgumentException if input is * null. * @exception IOException if an error occurs during reading. */ public static BufferedImage read(URL input) throws IOException { if (input == null) { throw new IllegalArgumentException("input == null!"); } InputStream istream = null; try { //此处,建立TCP连接!并且直接获取流,因为流数据不存在,进入cache块,抛出! istream = input.openStream(); } catch (IOException e) { throw new IIOException("Can't get input stream from URL!", e); } ImageInputStream stream = createImageInputStream(istream); BufferedImage bi; try { bi = read(stream); if (bi == null) { stream.close(); } } finally { istream.close(); } return bi; }可以看到JDK并没有关闭 ImageIO.read(url) 代码中封装的Socket连接!CDN会请求超时关闭导致服务器处于CLOSE_WAIT?限于网络经验有限,并不能100%确认我的想法。所以模拟下吧。 Step4  复现与模拟根据系统业务源码,快速模拟:public static void main(String[] args) throws InterruptedException { ExecutorService ex = Executors.newFixedThreadPool(100); for (int i = 0; i 5000; i++) { ex.execute(task()); }} /** * @throws IOException * @throws MalformedURLException */ private static Runnable task() { return new Runnable() { @Override public void run() { // domain must exists,but file doesnot. String vivofsUrl = "https://vivobbs.xx.yy.zz/wiwNWYCFW9ieGbWq/20181129/3a2adfde12cd328d81f965088890eeffff.jpg"; File file = null; BufferedImage image = null; try { file = File.createTempFile("abc", "jpg"); URL url1 = new URL(vivofsUrl); image = ImageIO.read(url1); } catch (Throwable e) { e.printStackTrace(); } finally { if (null != file) { file.delete(); } if (null != image) { image.flush(); image = null; } } } };} 抓包 TCP查看 问题复现! Step5 沟通后提报bugreport 给Oracle。三、Oracle沟通提单之后,Oracle跟我联系沟通。截取部分邮件内容,仅供参考。已被采纳四、疑点与不足TCP状态机的流转不够熟悉透彻。导致一些问题不能从TCP状态机分析推理,知识的全面精通需要不断提高。
09-02 02:12