Commit 0a6b7b7813799f76e1859387688611af05db376c
1 parent b314f270
update
git-svn-id: svn://svn.savannah.nongnu.org/qemu/trunk@4581 c046a42c-6fe2-441c-8c8c-71466251a162
Showing 2 changed files with 64 additions and 86 deletions
tcg/README
| ... | ... | @@ -16,14 +16,18 @@ from the host, although it is never the case for QEMU. |
| 16 | 16 | |
| 17 | 17 | A TCG "function" corresponds to a QEMU Translated Block (TB). |
| 18 | 18 | |
| 19 | -A TCG "temporary" is a variable only live in a given | |
| 20 | -function. Temporaries are allocated explicitly in each function. | |
| 19 | +A TCG "temporary" is a variable only live in a basic | |
| 20 | +block. Temporaries are allocated explicitly in each function. | |
| 21 | 21 | |
| 22 | -A TCG "global" is a variable which is live in all the functions. They | |
| 23 | -are defined before the functions defined. A TCG global can be a memory | |
| 24 | -location (e.g. a QEMU CPU register), a fixed host register (e.g. the | |
| 25 | -QEMU CPU state pointer) or a memory location which is stored in a | |
| 26 | -register outside QEMU TBs (not implemented yet). | |
| 22 | +A TCG "local temporary" is a variable only live in a function. Local | |
| 23 | +temporaries are allocated explicitly in each function. | |
| 24 | + | |
| 25 | +A TCG "global" is a variable which is live in all the functions | |
| 26 | +(equivalent of a C global variable). They are defined before the | |
| 27 | +functions defined. A TCG global can be a memory location (e.g. a QEMU | |
| 28 | +CPU register), a fixed host register (e.g. the QEMU CPU state pointer) | |
| 29 | +or a memory location which is stored in a register outside QEMU TBs | |
| 30 | +(not implemented yet). | |
| 27 | 31 | |
| 28 | 32 | A TCG "basic block" corresponds to a list of instructions terminated |
| 29 | 33 | by a branch instruction. |
| ... | ... | @@ -32,11 +36,11 @@ by a branch instruction. |
| 32 | 36 | |
| 33 | 37 | 3.1) Introduction |
| 34 | 38 | |
| 35 | -TCG instructions operate on variables which are temporaries or | |
| 36 | -globals. TCG instructions and variables are strongly typed. Two types | |
| 37 | -are supported: 32 bit integers and 64 bit integers. Pointers are | |
| 38 | -defined as an alias to 32 bit or 64 bit integers depending on the TCG | |
| 39 | -target word size. | |
| 39 | +TCG instructions operate on variables which are temporaries, local | |
| 40 | +temporaries or globals. TCG instructions and variables are strongly | |
| 41 | +typed. Two types are supported: 32 bit integers and 64 bit | |
| 42 | +integers. Pointers are defined as an alias to 32 bit or 64 bit | |
| 43 | +integers depending on the TCG target word size. | |
| 40 | 44 | |
| 41 | 45 | Each instruction has a fixed number of output variable operands, input |
| 42 | 46 | variable operands and always constant operands. |
| ... | ... | @@ -44,14 +48,12 @@ variable operands and always constant operands. |
| 44 | 48 | The notable exception is the call instruction which has a variable |
| 45 | 49 | number of outputs and inputs. |
| 46 | 50 | |
| 47 | -In the textual form, output operands come first, followed by input | |
| 48 | -operands, followed by constant operands. The output type is included | |
| 49 | -in the instruction name. Constants are prefixed with a '$'. | |
| 51 | +In the textual form, output operands usually come first, followed by | |
| 52 | +input operands, followed by constant operands. The output type is | |
| 53 | +included in the instruction name. Constants are prefixed with a '$'. | |
| 50 | 54 | |
| 51 | 55 | add_i32 t0, t1, t2 (t0 <- t1 + t2) |
| 52 | 56 | |
| 53 | -sub_i64 t2, t3, $4 (t2 <- t3 - 4) | |
| 54 | - | |
| 55 | 57 | 3.2) Assumptions |
| 56 | 58 | |
| 57 | 59 | * Basic blocks |
| ... | ... | @@ -62,9 +64,8 @@ sub_i64 t2, t3, $4 (t2 <- t3 - 4) |
| 62 | 64 | - Basic blocks start after the end of a previous basic block, at a |
| 63 | 65 | set_label instruction or after a legacy dyngen operation. |
| 64 | 66 | |
| 65 | -After the end of a basic block, temporaries at destroyed and globals | |
| 66 | -are stored at their initial storage (register or memory place | |
| 67 | -depending on their declarations). | |
| 67 | +After the end of a basic block, the content of temporaries is | |
| 68 | +destroyed, but local temporaries and globals are preserved. | |
| 68 | 69 | |
| 69 | 70 | * Floating point types are not supported yet |
| 70 | 71 | |
| ... | ... | @@ -100,7 +101,7 @@ optimizations: |
| 100 | 101 | is suppressed. |
| 101 | 102 | |
| 102 | 103 | - A liveness analysis is done at the basic block level. The |
| 103 | - information is used to suppress moves from a dead temporary to | |
| 104 | + information is used to suppress moves from a dead variable to | |
| 104 | 105 | another one. It is also used to remove instructions which compute |
| 105 | 106 | dead results. The latter is especially useful for condition code |
| 106 | 107 | optimization in QEMU. |
| ... | ... | @@ -113,47 +114,6 @@ optimizations: |
| 113 | 114 | |
| 114 | 115 | only the last instruction is kept. |
| 115 | 116 | |
| 116 | -- A macro system is supported (may get closer to function inlining | |
| 117 | - some day). It is useful if the liveness analysis is likely to prove | |
| 118 | - that some results of a computation are indeed not useful. With the | |
| 119 | - macro system, the user can provide several alternative | |
| 120 | - implementations which are used depending on the used results. It is | |
| 121 | - especially useful for condition code optimization in QEMU. | |
| 122 | - | |
| 123 | - Here is an example: | |
| 124 | - | |
| 125 | - macro_2 t0, t1, $1 | |
| 126 | - mov_i32 t0, $0x1234 | |
| 127 | - | |
| 128 | - The macro identified by the ID "$1" normally returns the values t0 | |
| 129 | - and t1. Suppose its implementation is: | |
| 130 | - | |
| 131 | - macro_start | |
| 132 | - brcond_i32 t2, $0, $TCG_COND_EQ, $1 | |
| 133 | - mov_i32 t0, $2 | |
| 134 | - br $2 | |
| 135 | - set_label $1 | |
| 136 | - mov_i32 t0, $3 | |
| 137 | - set_label $2 | |
| 138 | - add_i32 t1, t3, t4 | |
| 139 | - macro_end | |
| 140 | - | |
| 141 | - If t0 is not used after the macro, the user can provide a simpler | |
| 142 | - implementation: | |
| 143 | - | |
| 144 | - macro_start | |
| 145 | - add_i32 t1, t2, t4 | |
| 146 | - macro_end | |
| 147 | - | |
| 148 | - TCG automatically chooses the right implementation depending on | |
| 149 | - which macro outputs are used after it. | |
| 150 | - | |
| 151 | - Note that if TCG did more expensive optimizations, macros would be | |
| 152 | - less useful. In the previous example a macro is useful because the | |
| 153 | - liveness analysis is done on each basic block separately. Hence TCG | |
| 154 | - cannot remove the code computing 't0' even if it is not used after | |
| 155 | - the first macro implementation. | |
| 156 | - | |
| 157 | 117 | 3.4) Instruction Reference |
| 158 | 118 | |
| 159 | 119 | ********* Function call |
| ... | ... | @@ -241,6 +201,10 @@ t0=t1|t2 |
| 241 | 201 | |
| 242 | 202 | t0=t1^t2 |
| 243 | 203 | |
| 204 | +* not_i32/i64 t0, t1 | |
| 205 | + | |
| 206 | +t0=~t1 | |
| 207 | + | |
| 244 | 208 | ********* Shifts |
| 245 | 209 | |
| 246 | 210 | * shl_i32/i64 t0, t1, t2 |
| ... | ... | @@ -428,3 +392,34 @@ to apply more optimizations because more registers will be free for |
| 428 | 392 | the generated code. |
| 429 | 393 | |
| 430 | 394 | The exception model is the same as the dyngen one. |
| 395 | + | |
| 396 | +6) Recommended coding rules for best performance | |
| 397 | + | |
| 398 | +- Use globals to represent the parts of the QEMU CPU state which are | |
| 399 | + often modified, e.g. the integer registers and the condition | |
| 400 | + codes. TCG will be able to use host registers to store them. | |
| 401 | + | |
| 402 | +- Avoid globals stored in fixed registers. They must be used only to | |
| 403 | + store the pointer to the CPU state and possibly to store a pointer | |
| 404 | + to a register window. The other uses are to ensure backward | |
| 405 | + compatibility with dyngen during the porting a new target to TCG. | |
| 406 | + | |
| 407 | +- Use temporaries. Use local temporaries only when really needed, | |
| 408 | + e.g. when you need to use a value after a jump. Local temporaries | |
| 409 | + introduce a performance hit in the current TCG implementation: their | |
| 410 | + content is saved to memory at end of each basic block. | |
| 411 | + | |
| 412 | +- Free temporaries and local temporaries when they are no longer used | |
| 413 | + (tcg_temp_free). Since tcg_const_x() also creates a temporary, you | |
| 414 | + should free it after it is used. Freeing temporaries does not yield | |
| 415 | + a better generated code, but it reduces the memory usage of TCG and | |
| 416 | + the speed of the translation. | |
| 417 | + | |
| 418 | +- Don't hesitate to use helpers for complicated or seldom used target | |
| 419 | + intructions. There is little performance advantage in using TCG to | |
| 420 | + implement target instructions taking more than about twenty TCG | |
| 421 | + instructions. | |
| 422 | + | |
| 423 | +- Use the 'discard' instruction if you know that TCG won't be able to | |
| 424 | + prove that a given global is "dead" at a given program point. The | |
| 425 | + x86 target uses it to improve the condition codes optimisation. | ... | ... |
tcg/TODO
| 1 | -- test macro system | |
| 1 | +- Add new instructions such as: andnot, ror, rol, setcond, clz, ctz, | |
| 2 | + popcnt. | |
| 2 | 3 | |
| 3 | -- test conditional jumps | |
| 4 | +- See if it is worth exporting mul2, mulu2, div2, divu2. | |
| 4 | 5 | |
| 5 | -- test mul, div, ext8s, ext16s, bswap | |
| 6 | - | |
| 7 | -- generate a global TB prologue and epilogue to save/restore registers | |
| 8 | - to/from the CPU state and to reserve a stack frame to optimize | |
| 9 | - helper calls. Modify cpu-exec.c so that it does not use global | |
| 10 | - register variables (except maybe for 'env'). | |
| 11 | - | |
| 12 | -- fully convert the x86 target. The minimal amount of work includes: | |
| 13 | - - add cc_src, cc_dst and cc_op as globals | |
| 14 | - - disable its eflags optimization (the liveness analysis should | |
| 15 | - suffice) | |
| 16 | - - move complicated operations to helpers (in particular FPU, SSE, MMX). | |
| 17 | - | |
| 18 | -- optimize the x86 target: | |
| 19 | - - move some or all the registers as globals | |
| 20 | - - use the TB prologue and epilogue to have QEMU target registers in | |
| 21 | - pre assigned host registers. | |
| 6 | +- Support of globals saved in fixed registers between TBs. | |
| 22 | 7 | |
| 23 | 8 | Ideas: |
| 24 | 9 | |
| 25 | 10 | - Move the slow part of the qemu_ld/st ops after the end of the TB. |
| 26 | 11 | |
| 27 | -- Experiment: change instruction storage to simplify macro handling | |
| 28 | - and to handle dynamic allocation and see if the translation speed is | |
| 29 | - OK. | |
| 30 | - | |
| 31 | -- change exception syntax to get closer to QOP system (exception | |
| 12 | +- Change exception syntax to get closer to QOP system (exception | |
| 32 | 13 | parameters given with a specific instruction). |
| 14 | + | |
| | 15 | +- Add float and vector support. | |