What is Just-in-Time Compilation?
=================================

Just-in-Time compilation (JIT) is the process of turning some form of
interpreted program evaluation into a native program, and doing so at
runtime.

For example, instead of using a facility that can evaluate arbitrary
SQL expressions to evaluate an SQL predicate like WHERE a.col = 3, it
is possible to generate a function that handles just that expression
and can be executed natively by the CPU, yielding a speedup.
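
To make that concrete, here is a purely illustrative sketch (none of
these types or functions exist in PostgreSQL under these names) of the
difference between interpreting such a predicate and running a
function generated specifically for it:

  #include <stdbool.h>

  /* Hypothetical, heavily simplified types for illustration only. */
  typedef struct Row { int col; } Row;
  typedef struct ExprNode ExprNode;
  struct ExprNode
  {
      bool (*opfunc) (const ExprNode *node, const Row *row);
      const ExprNode *left;
      const ExprNode *right;
      int   constval;
  };

  /* Interpreted: at least one indirect call per node, for every row. */
  static bool
  eval_generic(const ExprNode *node, const Row *row)
  {
      return node->opfunc(node, row);
  }

  /* What a function generated just for "WHERE a.col = 3" boils down
   * to: the tree walk, the dispatch and the constant lookup are gone. */
  static bool
  eval_where_a_col_eq_3(const Row *row)
  {
      return row->col == 3;
  }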

This is JIT, rather than ahead-of-time (AOT) compilation, because it
is done at query execution time, and perhaps only in cases where the
relevant task is repeated a number of times. Given the way JIT
compilation is used in PostgreSQL, the lines between interpretation,
AOT and JIT are somewhat blurry.

Note that the interpreted program turned into a native program does
not necessarily have to be a program in the classical sense. E.g. it
is highly beneficial to JIT compile tuple deforming into a native
function just handling a specific type of table, despite tuple
deforming not commonly being understood as a "program".

Why JIT?
========

Parts of PostgreSQL are commonly bottlenecked by comparatively small
pieces of CPU intensive code. In a number of cases that is because the
relevant code has to be very generic (e.g. handling arbitrary SQL
level expressions, over arbitrary tables, with arbitrary extensions
installed). This often leads to a large number of indirect jumps and
unpredictable branches, and generally a high number of instructions
for a given task. E.g. just evaluating an expression comparing a
column in a database to an integer ends up needing several hundred
cycles.

By generating native code, a large number of indirect jumps can be
removed, either by turning them into direct branches (e.g. replacing
the indirect call to an SQL operator's implementation with a direct
call to that function), or by removing them entirely (e.g. by
evaluating the branch at compile time because the input is
constant). Similarly, many conditional branches can be removed
outright by evaluating them at compile time, again because their
inputs are constant. The latter is particularly beneficial for
removing branches during tuple deforming.
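
The tuple deforming case can be sketched the same way (again with
made-up, simplified types): the generic code loops over per-column
metadata and branches on it for every row, whereas code generated for
one specific table needs neither the loop nor the branches:

  #include <stdbool.h>
  #include <string.h>

  typedef struct ColumnDesc { int len; bool byval; } ColumnDesc;

  /* Generic deforming: data-dependent branches per column, per row. */
  static void
  deform_generic(const char *tup, const ColumnDesc *cols, int ncols,
                 int *values)
  {
      int off = 0;

      for (int i = 0; i < ncols; i++)
      {
          if (cols[i].byval && cols[i].len == 4)
              memcpy(&values[i], tup + off, 4);
          /* ... else handle other widths, varlena, alignment, NULLs ... */
          off += cols[i].len;
      }
  }

  /* Deforming specialized for a table with two int4 columns: the loop
   * and all the metadata branches have been compiled away. */
  static void
  deform_two_int4_columns(const char *tup, int *values)
  {
      memcpy(&values[0], tup + 0, 4);
      memcpy(&values[1], tup + 4, 4);
  }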

How to JIT
==========

PostgreSQL, by default, uses LLVM to perform JIT. LLVM was chosen
because it is developed by several large corporations and therefore
unlikely to be discontinued, because it has a license compatible with
PostgreSQL, and because its IR can be generated from C using the Clang
compiler.

Shared Library Separation
-------------------------

To avoid the main PostgreSQL binary directly depending on LLVM, which
would prevent LLVM support being independently installed by OS package
managers, the LLVM dependent code is located in a shared library that
is loaded on-demand.

An additional benefit of doing so is that it is relatively easy to
evaluate JIT compilation that does not use LLVM, by changing out the
shared library used to provide JIT compilation.

To achieve this, code intending to perform JIT (e.g. expression evaluation)
calls an LLVM independent wrapper located in jit.c to do so. If the
shared library providing JIT support can be loaded (i.e. PostgreSQL was
compiled with LLVM support and the shared library is installed), the task
of JIT compiling an expression gets handed off to the shared library. This
obviously requires that the function in jit.c is allowed to fail in case
no JIT provider can be loaded.

Which shared library is loaded is determined by the jit_provider GUC,
defaulting to "llvmjit".
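
As an illustration of the pluggability this enables, a replacement
provider is a shared library that fills in the callback table handed
to it by jit.c when it is loaded. The sketch below follows the
callback structure declared in jit.h (reset_after_error,
release_context, compile_expr); treat the exact names and signatures
as assumptions to be checked against that header rather than as a
definitive recipe:

  /* Minimal sketch of an alternative JIT provider library; stubs only. */
  #include "postgres.h"

  #include "fmgr.h"
  #include "jit/jit.h"
  #include "nodes/execnodes.h"

  PG_MODULE_MAGIC;

  static bool
  myjit_compile_expr(ExprState *state)
  {
      /* returning false means "not JITed"; the interpreter is used */
      return false;
  }

  static void
  myjit_release_context(JitContext *context)
  {
      /* release code and memory emitted under this context */
  }

  static void
  myjit_reset_after_error(void)
  {
      /* clean up provider-global state after an error */
  }

  void
  _PG_jit_provider_init(JitProviderCallbacks *cb)
  {
      cb->compile_expr = myjit_compile_expr;
      cb->release_context = myjit_release_context;
      cb->reset_after_error = myjit_reset_after_error;
  }

Installed as, say, myjit.so in $pkglibdir, setting jit_provider =
'myjit' would make jit.c hand JIT compilation requests to this library
instead of the LLVM based one.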

Cloistering the code that performs JIT into a shared library
unfortunately also means that code doing JIT compilation for various
parts of the backend has to be located separately from the
corresponding non-JIT code. E.g. the JIT version of execExprInterp.c
is located in jit/llvm/ rather than executor/.

JIT Context
-----------

For performance and convenience reasons it is useful to allow JITed
functions to be emitted and deallocated together. It is e.g. very
common to create a number of functions at query initialization time,
use them during query execution, and then deallocate all of them
together at the end of the query.

Lifetimes of JITed functions are managed via JITContext. Exactly one
such context should be created for work in which all created JITed
function should have the same lifetime. E.g. there's exactly one
JITContext for each query executed, in the query's EState.  Only the
release of a JITContext is exposed to the provider independent
facility, as the creation of one is done on-demand by the JIT
implementations.

Emitting individual functions separately is more expensive than
emitting several functions at once, and emitting them together can
provide additional optimization opportunities. To facilitate that, the
LLVM provider separates defining functions from optimizing and
emitting functions in an executable manner.

Creating functions in the current mutable module (a module
essentially is LLVM's equivalent of a translation unit in C) is done
using
  extern LLVMModuleRef llvm_mutable_module(LLVMJitContext *context);
into which the caller can then emit as much code using the LLVM APIs
as it wants. Whenever a function actually needs to be called,
  extern void *llvm_get_function(LLVMJitContext *context, const char *funcname);
returns a pointer to it.

E.g. in the expression evaluation case this setup allows most
functions in a query to be defined during ExecInitNode(), while
delaying their optimization and emission until the first time one of
them is actually used.
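
From the caller's perspective the life cycle therefore looks roughly
as in the following sketch; the body of the generated function and all
error handling are elided, llvm_mutable_module()/llvm_get_function()
are the wrappers quoted above (declared in jit/llvmjit.h), and the
remaining calls are standard LLVM C API:

  #include "postgres.h"

  #include <llvm-c/Core.h>

  #include "jit/llvmjit.h"

  typedef long (*jitted_func) (void *state);

  static jitted_func
  define_then_fetch(LLVMJitContext *context, const char *funcname)
  {
      LLVMModuleRef mod = llvm_mutable_module(context);

      /* declare the function's signature in the current mutable module */
      LLVMTypeRef argtype = LLVMPointerType(LLVMInt8Type(), 0);
      LLVMTypeRef sig = LLVMFunctionType(LLVMInt64Type(), &argtype, 1, 0);
      LLVMValueRef fn = LLVMAddFunction(mod, funcname, sig);

      /* ... build the function body with LLVMBuildXXX() calls on fn ... */
      (void) fn;

      /* optimization and emission happen lazily, triggered by the first
       * lookup of a function from this context */
      return (jitted_func) llvm_get_function(context, funcname);
  }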

Error Handling
--------------

There are two aspects of error handling.  Firstly, generated (LLVM IR)
and emitted functions (mmap()ed segments) need to be cleaned up both
after a successful query execution and after an error. This is done by
registering each created JITContext with the current resource owner,
and cleaning it up on error / end of transaction. If it is desirable
to release resources earlier, jit_release_context() can be used.

The second, less pretty, aspect of error handling is OOM handling
inside LLVM itself. The above resowner based mechanism takes care of
cleaning up emitted code upon ERROR, but there's also the chance that
LLVM itself runs out of memory. LLVM by default does *not* use any C++
exceptions. Its allocations are primarily funneled through the
standard "new" handlers, and some direct use of malloc() and
mmap(). For the former a 'new handler' exists:
http://en.cppreference.com/w/cpp/memory/new/set_new_handler
For the latter LLVM provides callbacks that get called upon failure
(unfortunately mmap() failures are treated as fatal rather than OOM errors).
What we've chosen to do for now is to provide two functions that
LLVM-using code must call before and after interacting with LLVM:
  extern void llvm_enter_fatal_on_oom(void);
  extern void llvm_leave_fatal_on_oom(void);
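
In practice that results in a bracketing pattern like the following
around code paths that call into LLVM (a sketch; real call sites
typically guard a whole sequence of provider operations):

  /* Sketch of the required bracketing; the LLVM work itself is elided. */
  static void
  do_llvm_work(void)
  {
      llvm_enter_fatal_on_oom();

      /* ... any sequence of LLVM C API calls: building IR, optimizing
       * modules, resolving symbols, ... */

      llvm_leave_fatal_on_oom();
  }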

When a libstdc++ new or LLVM error occurs, the handlers set up by the
above functions trigger a FATAL error. We have to use FATAL rather
than ERROR, as we *cannot* reliably throw ERROR inside a foreign
library without risking corrupting its internal state.

Users of the above sections do *not* have to use PG_TRY/CATCH blocks;
instead, the handlers are reset at the toplevel sigsetjmp() level.

Using a relatively small enter/leave protected section of code, rather
than setting up these handlers globally, avoids negative interactions
with extensions that might use C++ such as PostGIS. As LLVM code
generation should never execute arbitrary code, just setting these
handlers temporarily ought to suffice.

Type Synchronization
--------------------

To be able to generate code that can perform tasks done by "interpreted"
PostgreSQL, it obviously is required that code generation knows about at
least a few PostgreSQL types.  While it is possible to inform LLVM about
type definitions by recreating them manually in C code, that is failure
prone and labor intensive.

Instead there is one small file (llvmjit_types.c) which references each of
the types required for JITing. That file is translated to bitcode at
compile time, and loaded when LLVM is initialized in a backend.

That works very well to synchronize the type definition, but unfortunately
it does *not* synchronize offsets as the IR level representation doesn't
know field names.  Instead, required offsets are maintained as defines in
the original struct definition, like so:
#define FIELDNO_TUPLETABLESLOT_NVALID 9
        int                     tts_nvalid;             /* # of valid values in tts_values */
While that still needs to be defined, it's only required for a
relatively small number of fields, and it's bunched together with the
struct definition, so it's easily kept synchronized.
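
For illustration, such a define is what the code generator passes as
the field index when it builds an access to that struct member. A
hedged sketch using the LLVM C API follows; the slot_type/slot_ptr
values and the surrounding builder setup are assumed to be provided
elsewhere:

  /* Sketch: emit a load of slot->tts_nvalid in generated code, using
   * the FIELDNO_* define rather than a hardcoded field index. */
  #include "postgres.h"

  #include <llvm-c/Core.h>

  #include "executor/tuptable.h"   /* struct definition and FIELDNO_* */

  static LLVMValueRef
  emit_load_nvalid(LLVMBuilderRef b, LLVMTypeRef slot_type,
                   LLVMValueRef slot_ptr)
  {
      LLVMValueRef field =
          LLVMBuildStructGEP2(b, slot_type, slot_ptr,
                              FIELDNO_TUPLETABLESLOT_NVALID, "nvalid.ptr");

      /* tts_nvalid is declared as an int in the C struct */
      return LLVMBuildLoad2(b, LLVMInt32Type(), field, "nvalid");
  }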

Inlining
--------

One big advantage of JITing expressions is that it can significantly
reduce the overhead of PostgreSQL's extensible function/operator
mechanism, by inlining the body of called functions/operators.

It obviously is undesirable to maintain a second implementation of
commonly used functions, just for inlining purposes. Instead we take
advantage of the fact that the Clang compiler can emit LLVM IR.

The ability to do so allows us to get the LLVM IR for all operators
(e.g. int8eq, float8pl etc), without maintaining two copies.  These
bitcode files get installed into the server's
  $pkglibdir/bitcode/postgres/
Using existing LLVM functionality (for parallel LTO compilation), an
index over these files is additionally stored to
  $pkglibdir/bitcode/postgres.index.bc

Similarly extensions can install code into
  $pkglibdir/bitcode/[extension]/
accompanied by
  $pkglibdir/bitcode/[extension].index.bc

just alongside the actual library.  An extension's index will be used
to look up symbols located in the corresponding shared library.
Symbols that are used inside the extension will, when inlining, first
be looked up in the main binary's index and then in the extension's.

Caching
-------

Currently it is not yet possible to cache generated functions, even
though that'd be desirable from a performance point of view. The
problem is that the generated functions commonly contain pointers into
per-execution memory. The expression evaluation machinery needs to
be redesigned a bit to avoid that. Basically all per-execution memory
needs to be referenced as an offset into one block of memory stored in
an ExprState, rather than via absolute pointers into memory.
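
Illustratively, the difference is between generated code that in
effect contains "store to 0x7f3a94c01230" (an absolute address only
valid for one execution) and code that computes addresses from a base
pointer passed in at call time, as in this sketch (names made up for
the example):

  #include <stddef.h>

  /* Cache-friendly addressing: all per-execution data is reached via a
   * base pointer plus a constant offset, so the same generated function
   * can be reused by another execution with a different base. */
  static void
  store_result(char *per_exec_base, size_t offset, long value)
  {
      *(long *) (per_exec_base + offset) = value;
  }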

Once that is addressed, adding an LRU cache that's keyed by the
generated LLVM IR will allow the usage of optimized functions even for
faster queries.

A longer term project is to move expression compilation to the planner
stage, allowing e.g. to tie compiled expressions to prepared
statements.

An even more advanced approach would be to use JIT with few
optimizations initially, and build an optimized version in the
background. But that's even further off.

What to JIT
===========

Currently expression evaluation and tuple deforming are JITed. Those
were chosen because they commonly are major CPU bottlenecks in
analytics queries, but are by no means the only potentially beneficial cases.

For JITing to be beneficial a piece of code first and foremost has to
be a CPU bottleneck. But also importantly, JITing can only be
beneficial if overhead can be removed by doing so. E.g. in the tuple
deforming case the knowledge about the number of columns and their
types can remove a significant number of branches, and in the
expression evaluation case a lot of indirect jumps/calls can be
removed.  If neither of these is the case, JITing is a waste of
resources.

Future avenues for JITing are tuple sorting, COPY parsing/output
generation, and later compiling larger parts of queries.

When to JIT
===========

Currently there are a number of GUCs that influence JITing:

- jit_above_cost = -1, 0-DBL_MAX - all queries with a higher total cost
  get JITed, *without* optimization (the expensive part), corresponding
  to -O0. This commonly already results in significant speedups if
  expression evaluation or tuple deforming is a bottleneck (mostly by
  removing dynamic branches).
- jit_optimize_above_cost = -1, 0-DBL_MAX - all queries with a higher
  total cost get JITed, *with* optimization (the expensive part).
- jit_inline_above_cost = -1, 0-DBL_MAX - inlining is tried if the query
  has a higher cost.

Whenever a query's total cost is above these limits, JITing is
performed.
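
A simplified sketch of how the thresholds cascade follows; the GUC
variables are the ones listed above, while the flag names and the
exact shape of this logic (the real decision is made when planning a
query) are illustrative rather than authoritative:

  #include <stdbool.h>

  /* GUC variables as exposed by the backend; the flag names below are
   * made up for this sketch (the backend uses PGJIT_* bits). */
  extern bool jit_enabled;
  extern double jit_above_cost;
  extern double jit_optimize_above_cost;
  extern double jit_inline_above_cost;

  enum
  {
      JIT_PERFORM = 1 << 0,       /* JIT at all, unoptimized (~ -O0) */
      JIT_OPTIMIZE = 1 << 1,      /* run the expensive optimization passes */
      JIT_INLINE = 1 << 2         /* attempt inlining of called functions */
  };

  static int
  decide_jit_flags(double total_cost)
  {
      int flags = 0;

      if (jit_enabled && jit_above_cost >= 0 &&
          total_cost > jit_above_cost)
      {
          flags |= JIT_PERFORM;

          if (jit_optimize_above_cost >= 0 &&
              total_cost > jit_optimize_above_cost)
              flags |= JIT_OPTIMIZE;

          if (jit_inline_above_cost >= 0 &&
              total_cost > jit_inline_above_cost)
              flags |= JIT_INLINE;
      }

      return flags;
  }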

Alternative costing models, e.g. generating separate paths for parts
of a query with lower cpu_* costs, are also a possibility, but it is
doubtful that the benefit would justify the overhead of doing so.
Another alternative would be to count the number of times individual
expressions are estimated to be evaluated, and to JIT compile those
individual expressions.

The seemingly obvious approach of JITing expressions individually,
after a number of executions, turns out not to work too well. This is
primarily because emitting many small functions individually has
significant overhead, and secondarily because the time until JITing
occurs causes relative slowdowns that eat into the gain of JIT
compilation.
