Commit 0a6b7b7813799f76e1859387688611af05db376c
1 parent b314f270
update
git-svn-id: svn://svn.savannah.nongnu.org/qemu/trunk@4581 c046a42c-6fe2-441c-8c8c-71466251a162
Showing 2 changed files with 64 additions and 86 deletions
tcg/README
| @@ -16,14 +16,18 @@ from the host, although it is never the case for QEMU. | @@ -16,14 +16,18 @@ from the host, although it is never the case for QEMU. | ||
| 16 | 16 | ||
| 17 | A TCG "function" corresponds to a QEMU Translated Block (TB). | 17 | A TCG "function" corresponds to a QEMU Translated Block (TB). |
| 18 | 18 | ||
| 19 | -A TCG "temporary" is a variable only live in a given | ||
| 20 | -function. Temporaries are allocated explicitly in each function. | 19 | +A TCG "temporary" is a variable only live in a basic |
| 20 | +block. Temporaries are allocated explicitly in each function. | ||
| 21 | 21 | ||
| 22 | -A TCG "global" is a variable which is live in all the functions. They | ||
| 23 | -are defined before the functions defined. A TCG global can be a memory | ||
| 24 | -location (e.g. a QEMU CPU register), a fixed host register (e.g. the | ||
| 25 | -QEMU CPU state pointer) or a memory location which is stored in a | ||
| 26 | -register outside QEMU TBs (not implemented yet). | 22 | +A TCG "local temporary" is a variable only live in a function. Local |
| 23 | +temporaries are allocated explicitly in each function. | ||
| 24 | + | ||
| 25 | +A TCG "global" is a variable which is live in all the functions | ||
| 26 | +(equivalent of a C global variable). They are defined before the | ||
| 27 | +functions defined. A TCG global can be a memory location (e.g. a QEMU | ||
| 28 | +CPU register), a fixed host register (e.g. the QEMU CPU state pointer) | ||
| 29 | +or a memory location which is stored in a register outside QEMU TBs | ||
| 30 | +(not implemented yet). | ||
| 27 | 31 | ||
| 28 | A TCG "basic block" corresponds to a list of instructions terminated | 32 | A TCG "basic block" corresponds to a list of instructions terminated |
| 29 | by a branch instruction. | 33 | by a branch instruction. |
| @@ -32,11 +36,11 @@ by a branch instruction. | @@ -32,11 +36,11 @@ by a branch instruction. | ||
| 32 | 36 | ||
| 33 | 3.1) Introduction | 37 | 3.1) Introduction |
| 34 | 38 | ||
| 35 | -TCG instructions operate on variables which are temporaries or | ||
| 36 | -globals. TCG instructions and variables are strongly typed. Two types | ||
| 37 | -are supported: 32 bit integers and 64 bit integers. Pointers are | ||
| 38 | -defined as an alias to 32 bit or 64 bit integers depending on the TCG | ||
| 39 | -target word size. | 39 | +TCG instructions operate on variables which are temporaries, local |
| 40 | +temporaries or globals. TCG instructions and variables are strongly | ||
| 41 | +typed. Two types are supported: 32 bit integers and 64 bit | ||
| 42 | +integers. Pointers are defined as an alias to 32 bit or 64 bit | ||
| 43 | +integers depending on the TCG target word size. | ||
| 40 | 44 | ||
| 41 | Each instruction has a fixed number of output variable operands, input | 45 | Each instruction has a fixed number of output variable operands, input |
| 42 | variable operands and always constant operands. | 46 | variable operands and always constant operands. |
| @@ -44,14 +48,12 @@ variable operands and always constant operands. | @@ -44,14 +48,12 @@ variable operands and always constant operands. | ||
| 44 | The notable exception is the call instruction which has a variable | 48 | The notable exception is the call instruction which has a variable |
| 45 | number of outputs and inputs. | 49 | number of outputs and inputs. |
| 46 | 50 | ||
| 47 | -In the textual form, output operands come first, followed by input | ||
| 48 | -operands, followed by constant operands. The output type is included | ||
| 49 | -in the instruction name. Constants are prefixed with a '$'. | 51 | +In the textual form, output operands usually come first, followed by |
| 52 | +input operands, followed by constant operands. The output type is | ||
| 53 | +included in the instruction name. Constants are prefixed with a '$'. | ||
| 50 | 54 | ||
| 51 | add_i32 t0, t1, t2 (t0 <- t1 + t2) | 55 | add_i32 t0, t1, t2 (t0 <- t1 + t2) |
| 52 | 56 | ||
| 53 | -sub_i64 t2, t3, $4 (t2 <- t3 - 4) | ||
| 54 | - | ||
| 55 | 3.2) Assumptions | 57 | 3.2) Assumptions |
| 56 | 58 | ||
| 57 | * Basic blocks | 59 | * Basic blocks |
| @@ -62,9 +64,8 @@ sub_i64 t2, t3, $4 (t2 <- t3 - 4) | @@ -62,9 +64,8 @@ sub_i64 t2, t3, $4 (t2 <- t3 - 4) | ||
| 62 | - Basic blocks start after the end of a previous basic block, at a | 64 | - Basic blocks start after the end of a previous basic block, at a |
| 63 | set_label instruction or after a legacy dyngen operation. | 65 | set_label instruction or after a legacy dyngen operation. |
| 64 | 66 | ||
| 65 | -After the end of a basic block, temporaries at destroyed and globals | ||
| 66 | -are stored at their initial storage (register or memory place | ||
| 67 | -depending on their declarations). | 67 | +After the end of a basic block, the content of temporaries is |
| 68 | +destroyed, but local temporaries and globals are preserved. | ||
| 68 | 69 | ||
| 69 | * Floating point types are not supported yet | 70 | * Floating point types are not supported yet |
| 70 | 71 | ||
| @@ -100,7 +101,7 @@ optimizations: | @@ -100,7 +101,7 @@ optimizations: | ||
| 100 | is suppressed. | 101 | is suppressed. |
| 101 | 102 | ||
| 102 | - A liveness analysis is done at the basic block level. The | 103 | - A liveness analysis is done at the basic block level. The |
| 103 | - information is used to suppress moves from a dead temporary to | 104 | + information is used to suppress moves from a dead variable to |
| 104 | another one. It is also used to remove instructions which compute | 105 | another one. It is also used to remove instructions which compute |
| 105 | dead results. The latter is especially useful for condition code | 106 | dead results. The latter is especially useful for condition code |
| 106 | optimization in QEMU. | 107 | optimization in QEMU. |
| @@ -113,47 +114,6 @@ optimizations: | @@ -113,47 +114,6 @@ optimizations: | ||
| 113 | 114 | ||
| 114 | only the last instruction is kept. | 115 | only the last instruction is kept. |
| 115 | 116 | ||
| 116 | -- A macro system is supported (may get closer to function inlining | ||
| 117 | - some day). It is useful if the liveness analysis is likely to prove | ||
| 118 | - that some results of a computation are indeed not useful. With the | ||
| 119 | - macro system, the user can provide several alternative | ||
| 120 | - implementations which are used depending on the used results. It is | ||
| 121 | - especially useful for condition code optimization in QEMU. | ||
| 122 | - | ||
| 123 | - Here is an example: | ||
| 124 | - | ||
| 125 | - macro_2 t0, t1, $1 | ||
| 126 | - mov_i32 t0, $0x1234 | ||
| 127 | - | ||
| 128 | - The macro identified by the ID "$1" normally returns the values t0 | ||
| 129 | - and t1. Suppose its implementation is: | ||
| 130 | - | ||
| 131 | - macro_start | ||
| 132 | - brcond_i32 t2, $0, $TCG_COND_EQ, $1 | ||
| 133 | - mov_i32 t0, $2 | ||
| 134 | - br $2 | ||
| 135 | - set_label $1 | ||
| 136 | - mov_i32 t0, $3 | ||
| 137 | - set_label $2 | ||
| 138 | - add_i32 t1, t3, t4 | ||
| 139 | - macro_end | ||
| 140 | - | ||
| 141 | - If t0 is not used after the macro, the user can provide a simpler | ||
| 142 | - implementation: | ||
| 143 | - | ||
| 144 | - macro_start | ||
| 145 | - add_i32 t1, t2, t4 | ||
| 146 | - macro_end | ||
| 147 | - | ||
| 148 | - TCG automatically chooses the right implementation depending on | ||
| 149 | - which macro outputs are used after it. | ||
| 150 | - | ||
| 151 | - Note that if TCG did more expensive optimizations, macros would be | ||
| 152 | - less useful. In the previous example a macro is useful because the | ||
| 153 | - liveness analysis is done on each basic block separately. Hence TCG | ||
| 154 | - cannot remove the code computing 't0' even if it is not used after | ||
| 155 | - the first macro implementation. | ||
| 156 | - | ||
| 157 | 3.4) Instruction Reference | 117 | 3.4) Instruction Reference |
| 158 | 118 | ||
| 159 | ********* Function call | 119 | ********* Function call |
| @@ -241,6 +201,10 @@ t0=t1|t2 | @@ -241,6 +201,10 @@ t0=t1|t2 | ||
| 241 | 201 | ||
| 242 | t0=t1^t2 | 202 | t0=t1^t2 |
| 243 | 203 | ||
| 204 | +* not_i32/i64 t0, t1 | ||
| 205 | + | ||
| 206 | +t0=~t1 | ||
| 207 | + | ||
| 244 | ********* Shifts | 208 | ********* Shifts |
| 245 | 209 | ||
| 246 | * shl_i32/i64 t0, t1, t2 | 210 | * shl_i32/i64 t0, t1, t2 |
| @@ -428,3 +392,34 @@ to apply more optimizations because more registers will be free for | @@ -428,3 +392,34 @@ to apply more optimizations because more registers will be free for | ||
| 428 | the generated code. | 392 | the generated code. |
| 429 | 393 | ||
| 430 | The exception model is the same as the dyngen one. | 394 | The exception model is the same as the dyngen one. |
| 395 | + | ||
| 396 | +6) Recommended coding rules for best performance | ||
| 397 | + | ||
| 398 | +- Use globals to represent the parts of the QEMU CPU state which are | ||
| 399 | + often modified, e.g. the integer registers and the condition | ||
| 400 | + codes. TCG will be able to use host registers to store them. | ||
| 401 | + | ||
| 402 | +- Avoid globals stored in fixed registers. They must be used only to | ||
| 403 | + store the pointer to the CPU state and possibly to store a pointer | ||
| 404 | + to a register window. The other uses are to ensure backward | ||
| 405 | + compatibility with dyngen during the porting of a new target to TCG. | ||
| 406 | + | ||
| 407 | +- Use temporaries. Use local temporaries only when really needed, | ||
| 408 | + e.g. when you need to use a value after a jump. Local temporaries | ||
| 409 | + introduce a performance hit in the current TCG implementation: their | ||
| 410 | + content is saved to memory at end of each basic block. | ||
| 411 | + | ||
| 412 | +- Free temporaries and local temporaries when they are no longer used | ||
| 413 | + (tcg_temp_free). Since tcg_const_x() also creates a temporary, you | ||
| 414 | + should free it after it is used. Freeing temporaries does not yield | ||
| 415 | + a better generated code, but it reduces the memory usage of TCG and | ||
| 416 | + improves the speed of the translation. | ||
| 417 | + | ||
| 418 | +- Don't hesitate to use helpers for complicated or seldom used target | ||
| 419 | + instructions. There is little performance advantage in using TCG to | ||
| 420 | + implement target instructions taking more than about twenty TCG | ||
| 421 | + instructions. | ||
| 422 | + | ||
| 423 | +- Use the 'discard' instruction if you know that TCG won't be able to | ||
| 424 | + prove that a given global is "dead" at a given program point. The | ||
| 425 | + x86 target uses it to improve the condition codes optimisation. |
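The lifetime rules introduced by the tcg/README changes (temporaries live only within a basic block, local temporaries live within a function, globals live across all functions) can be illustrated with a small toy model. This is only a sketch and not QEMU code: the names `Kind`, `Var`, `end_basic_block`, `end_function` and `live_names` are hypothetical and exist purely for illustration, and a value of `None` is assumed to stand for "content destroyed".

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Kind(Enum):
    """The three variable classes described in the README (toy model)."""
    TEMP = "temporary"              # live only inside one basic block
    LOCAL_TEMP = "local temporary"  # live inside one function (TB)
    GLOBAL = "global"               # live in all functions


@dataclass
class Var:
    name: str
    kind: Kind
    value: Optional[int] = None  # None models "content destroyed"


def end_basic_block(variables: List[Var]) -> None:
    """Destroy plain temporaries; local temporaries and globals are preserved."""
    for v in variables:
        if v.kind is Kind.TEMP:
            v.value = None


def end_function(variables: List[Var]) -> None:
    """At the end of a function only globals keep their content."""
    for v in variables:
        if v.kind is not Kind.GLOBAL:
            v.value = None


def live_names(variables: List[Var]) -> List[str]:
    """Names of the variables that still hold a defined value."""
    return [v.name for v in variables if v.value is not None]


if __name__ == "__main__":
    # Hypothetical variable names chosen for the example only.
    variables = [
        Var("t0", Kind.TEMP, 1),
        Var("loc0", Kind.LOCAL_TEMP, 2),
        Var("g0", Kind.GLOBAL, 3),
    ]
    end_basic_block(variables)
    print(live_names(variables))
    end_function(variables)
    print(live_names(variables))
```

In this model a value that must survive a jump has to sit in a local temporary or a global, which matches the coding rule to use local temporaries only when a value is needed after a jump.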
tcg/TODO
| 1 | -- test macro system | 1 | +- Add new instructions such as: andnot, ror, rol, setcond, clz, ctz, |
| 2 | + popcnt. | ||
| 2 | 3 | ||
| 3 | -- test conditional jumps | 4 | +- See if it is worth exporting mul2, mulu2, div2, divu2. |
| 4 | 5 | ||
| 5 | -- test mul, div, ext8s, ext16s, bswap | ||
| 6 | - | ||
| 7 | -- generate a global TB prologue and epilogue to save/restore registers | ||
| 8 | - to/from the CPU state and to reserve a stack frame to optimize | ||
| 9 | - helper calls. Modify cpu-exec.c so that it does not use global | ||
| 10 | - register variables (except maybe for 'env'). | ||
| 11 | - | ||
| 12 | -- fully convert the x86 target. The minimal amount of work includes: | ||
| 13 | - - add cc_src, cc_dst and cc_op as globals | ||
| 14 | - - disable its eflags optimization (the liveness analysis should | ||
| 15 | - suffice) | ||
| 16 | - - move complicated operations to helpers (in particular FPU, SSE, MMX). | ||
| 17 | - | ||
| 18 | -- optimize the x86 target: | ||
| 19 | - - move some or all the registers as globals | ||
| 20 | - - use the TB prologue and epilogue to have QEMU target registers in | ||
| 21 | - pre assigned host registers. | 6 | +- Support of globals saved in fixed registers between TBs. |
| 22 | 7 | ||
| 23 | Ideas: | 8 | Ideas: |
| 24 | 9 | ||
| 25 | - Move the slow part of the qemu_ld/st ops after the end of the TB. | 10 | - Move the slow part of the qemu_ld/st ops after the end of the TB. |
| 26 | 11 | ||
| 27 | -- Experiment: change instruction storage to simplify macro handling | ||
| 28 | - and to handle dynamic allocation and see if the translation speed is | ||
| 29 | - OK. | ||
| 30 | - | ||
| 31 | -- change exception syntax to get closer to QOP system (exception | 12 | +- Change exception syntax to get closer to QOP system (exception |
| 32 | parameters given with a specific instruction). | 13 | parameters given with a specific instruction). |
| 14 | + | ||
| 15 | +- Add float and vector support. |