sxzhou's notebook - Java日志异常栈丢失的原因

Java日志异常栈丢失的原因

最近在排查线上问题时碰到个诡异的问题，error日志中的异常栈没有了，只有一行：

java.lang.NullPointerException

于是去查了网上文章，都指向一个JVM启动参数： OmitStackTraceInFastThrow
这个参数默认是开启的，解释是：

The compiler in the server VM now provides correct stack backtraces for all “cold” built-in exceptions. For performance purposes, when such an exception is thrown a few times, the method may be recompiled. After recompilation, the compiler may choose a faster tactic using preallocated exceptions that do not provide a stack trace. To disable completely the use of preallocated exceptions, use this new flag: -XX:-OmitStackTraceInFastThrow.

JVM对一些特定的异常类型做了Fast Throw优化，如果检测到在代码里某个位置连续多次抛出同一类型异常的话，C2会决定用Fast Throw方式来抛出异常，而异常Trace即详细的异常栈信息会被清空。这种异常抛出速度非常快，因为不需要在堆里分配内存，也不需要构造完整的异常栈信息。
做个小测试：

public class StackLostTest {
    public static void main(String[] args) {
        for (int i = 0; i < 200000; i++) {
            try {
                Integer a = null;
                Integer b = a + 1;
            } catch (Exception e) {
                int length = e.getStackTrace().length;
                if (length < 1) {
                    System.out.println("stack lost! i=" + i);
                    e.printStackTrace();
                    break;
                }
            }
        }
    }
}

在我本地的运行结果是：

stack lost! i=115715
java.lang.NullPointerException

这个抛NPE的的代码在执行了115715次后被优化了，我们可以设置启动参数-XX:-OmitStackTraceInFastThrow，这样可以使得异常栈完整展示，但是在生产环境，这种方式是不好的，可能导致日志风暴，我们应该找到最开始抛异常的时间点，查看当时的日志，所以说日志持久化和异常及时响应才是正解。

更深入些 执行这个优化的代码如下：

  // If this throw happens frequently, an uncommon trap might cause
  // a performance pothole.  If there is a local exception handler,
  // and if this particular bytecode appears to be deoptimizing often,
  // let us handle the throw inline, with a preconstructed instance.
  // Note:   If the deopt count has blown up, the uncommon trap
  // runtime is going to flush this nmethod, not matter what.
  if (treat_throw_as_hot
      && (!StackTraceInThrowable || OmitStackTraceInFastThrow)) {
    // If the throw is local, we use a pre-existing instance and
    // punt on the backtrace.  This would lead to a missing backtrace
    // (a repeat of 4292742) if the backtrace object is ever asked
    // for its backtrace.
    // Fixing this remaining case of 4292742 requires some flavor of
    // escape analysis.  Leave that for the future.
    ciInstance* ex_obj = NULL;
    switch (reason) {
    case Deoptimization::Reason_null_check:
      ex_obj = env()->NullPointerException_instance();
      break;
    case Deoptimization::Reason_div0_check:
      ex_obj = env()->ArithmeticException_instance();
      break;
    case Deoptimization::Reason_range_check:
      ex_obj = env()->ArrayIndexOutOfBoundsException_instance();
      break;
    case Deoptimization::Reason_class_check:
      if (java_bc() == Bytecodes::_aastore) {
        ex_obj = env()->ArrayStoreException_instance();
      } else {
        ex_obj = env()->ClassCastException_instance();
      }
      break;
    }
    ... ...
}

可以看到，触发优化的判断条件是： treat_throw_as_hot && (!StackTraceInThrowable || OmitStackTraceInFastThrow)，即：

这个代码位置持续抛出异常
StackTraceInThrowable=false或者OmitStackTraceInFastThrow=true
命中了如上几种异常

是否能在运行时临时关闭这个优化，让开发者能快速定位问题，这里有些讨论：
https://bugs.openjdk.java.net/browse/JDK-8046503

其实上面这段测试代码，在本地运行时会有个奇怪的现象，异常栈丢失后再执行一段时间后有出现了完整异常栈了，这里也有人复现过：
https://bugs.openjdk.java.net/browse/JDK-8073432 推测是JIT编译优化和反优化造成的，需要更多研究。

如果觉得我的文章对您有用，请在支付宝公益平台找个项目捐点钱。 @sxzhou Aug 8, 2021

奉献爱心