Commit 0a6b7b7813799f76e1859387688611af05db376c
1 parent: b314f270
update
git-svn-id: svn://svn.savannah.nongnu.org/qemu/trunk@4581 c046a42c-6fe2-441c-8c8c-71466251a162
Showing 2 changed files with 64 additions and 86 deletions
tcg/README
@@ -16,14 +16,18 @@ from the host, although it is never the case for QEMU. | @@ -16,14 +16,18 @@ from the host, although it is never the case for QEMU. | ||
16 | 16 | ||
17 | A TCG "function" corresponds to a QEMU Translated Block (TB). | 17 | A TCG "function" corresponds to a QEMU Translated Block (TB). |
18 | 18 | ||
19 | -A TCG "temporary" is a variable only live in a given | ||
20 | -function. Temporaries are allocated explicitly in each function. | 19 | +A TCG "temporary" is a variable only live in a basic |
20 | +block. Temporaries are allocated explicitly in each function. | ||
21 | 21 | ||
22 | -A TCG "global" is a variable which is live in all the functions. They | ||
23 | -are defined before the functions defined. A TCG global can be a memory | ||
24 | -location (e.g. a QEMU CPU register), a fixed host register (e.g. the | ||
25 | -QEMU CPU state pointer) or a memory location which is stored in a | ||
26 | -register outside QEMU TBs (not implemented yet). | 22 | +A TCG "local temporary" is a variable only live in a function. Local |
23 | +temporaries are allocated explicitly in each function. | ||
24 | + | ||
25 | +A TCG "global" is a variable which is live in all the functions | ||
26 | +(equivalent of a C global variable). They are defined before the | ||
27 | +functions defined. A TCG global can be a memory location (e.g. a QEMU | ||
28 | +CPU register), a fixed host register (e.g. the QEMU CPU state pointer) | ||
29 | +or a memory location which is stored in a register outside QEMU TBs | ||
30 | +(not implemented yet). | ||
27 | 31 | ||
28 | A TCG "basic block" corresponds to a list of instructions terminated | 32 | A TCG "basic block" corresponds to a list of instructions terminated |
29 | by a branch instruction. | 33 | by a branch instruction. |
@@ -32,11 +36,11 @@ by a branch instruction. | @@ -32,11 +36,11 @@ by a branch instruction. | ||
32 | 36 | ||
33 | 3.1) Introduction | 37 | 3.1) Introduction |
34 | 38 | ||
35 | -TCG instructions operate on variables which are temporaries or | ||
36 | -globals. TCG instructions and variables are strongly typed. Two types | ||
37 | -are supported: 32 bit integers and 64 bit integers. Pointers are | ||
38 | -defined as an alias to 32 bit or 64 bit integers depending on the TCG | ||
39 | -target word size. | 39 | +TCG instructions operate on variables which are temporaries, local |
40 | +temporaries or globals. TCG instructions and variables are strongly | ||
41 | +typed. Two types are supported: 32 bit integers and 64 bit | ||
42 | +integers. Pointers are defined as an alias to 32 bit or 64 bit | ||
43 | +integers depending on the TCG target word size. | ||
40 | 44 | ||
41 | Each instruction has a fixed number of output variable operands, input | 45 | Each instruction has a fixed number of output variable operands, input |
42 | variable operands and always constant operands. | 46 | variable operands and always constant operands. |
@@ -44,14 +48,12 @@ variable operands and always constant operands. | @@ -44,14 +48,12 @@ variable operands and always constant operands. | ||
44 | The notable exception is the call instruction which has a variable | 48 | The notable exception is the call instruction which has a variable |
45 | number of outputs and inputs. | 49 | number of outputs and inputs. |
46 | 50 | ||
47 | -In the textual form, output operands come first, followed by input | ||
48 | -operands, followed by constant operands. The output type is included | ||
49 | -in the instruction name. Constants are prefixed with a '$'. | 51 | +In the textual form, output operands usually come first, followed by |
52 | +input operands, followed by constant operands. The output type is | ||
53 | +included in the instruction name. Constants are prefixed with a '$'. | ||
50 | 54 | ||
51 | add_i32 t0, t1, t2 (t0 <- t1 + t2) | 55 | add_i32 t0, t1, t2 (t0 <- t1 + t2) |
52 | 56 | ||
53 | -sub_i64 t2, t3, $4 (t2 <- t3 - 4) | ||
54 | - | ||
55 | 3.2) Assumptions | 57 | 3.2) Assumptions |
56 | 58 | ||
57 | * Basic blocks | 59 | * Basic blocks |
@@ -62,9 +64,8 @@ sub_i64 t2, t3, $4 (t2 <- t3 - 4) | @@ -62,9 +64,8 @@ sub_i64 t2, t3, $4 (t2 <- t3 - 4) | ||
62 | - Basic blocks start after the end of a previous basic block, at a | 64 | - Basic blocks start after the end of a previous basic block, at a |
63 | set_label instruction or after a legacy dyngen operation. | 65 | set_label instruction or after a legacy dyngen operation. |
64 | 66 | ||
65 | -After the end of a basic block, temporaries at destroyed and globals | ||
66 | -are stored at their initial storage (register or memory place | ||
67 | -depending on their declarations). | 67 | +After the end of a basic block, the content of temporaries is |
68 | +destroyed, but local temporaries and globals are preserved. | ||
68 | 69 | ||
69 | * Floating point types are not supported yet | 70 | * Floating point types are not supported yet |
70 | 71 | ||
@@ -100,7 +101,7 @@ optimizations: | @@ -100,7 +101,7 @@ optimizations: | ||
100 | is suppressed. | 101 | is suppressed. |
101 | 102 | ||
102 | - A liveness analysis is done at the basic block level. The | 103 | - A liveness analysis is done at the basic block level. The |
103 | - information is used to suppress moves from a dead temporary to | 104 | + information is used to suppress moves from a dead variable to |
104 | another one. It is also used to remove instructions which compute | 105 | another one. It is also used to remove instructions which compute |
105 | dead results. The later is especially useful for condition code | 106 | dead results. The later is especially useful for condition code |
106 | optimization in QEMU. | 107 | optimization in QEMU. |
@@ -113,47 +114,6 @@ optimizations: | @@ -113,47 +114,6 @@ optimizations: | ||
113 | 114 | ||
114 | only the last instruction is kept. | 115 | only the last instruction is kept. |
115 | 116 | ||
116 | -- A macro system is supported (may get closer to function inlining | ||
117 | - some day). It is useful if the liveness analysis is likely to prove | ||
118 | - that some results of a computation are indeed not useful. With the | ||
119 | - macro system, the user can provide several alternative | ||
120 | - implementations which are used depending on the used results. It is | ||
121 | - especially useful for condition code optimization in QEMU. | ||
122 | - | ||
123 | - Here is an example: | ||
124 | - | ||
125 | - macro_2 t0, t1, $1 | ||
126 | - mov_i32 t0, $0x1234 | ||
127 | - | ||
128 | - The macro identified by the ID "$1" normally returns the values t0 | ||
129 | - and t1. Suppose its implementation is: | ||
130 | - | ||
131 | - macro_start | ||
132 | - brcond_i32 t2, $0, $TCG_COND_EQ, $1 | ||
133 | - mov_i32 t0, $2 | ||
134 | - br $2 | ||
135 | - set_label $1 | ||
136 | - mov_i32 t0, $3 | ||
137 | - set_label $2 | ||
138 | - add_i32 t1, t3, t4 | ||
139 | - macro_end | ||
140 | - | ||
141 | - If t0 is not used after the macro, the user can provide a simpler | ||
142 | - implementation: | ||
143 | - | ||
144 | - macro_start | ||
145 | - add_i32 t1, t2, t4 | ||
146 | - macro_end | ||
147 | - | ||
148 | - TCG automatically chooses the right implementation depending on | ||
149 | - which macro outputs are used after it. | ||
150 | - | ||
151 | - Note that if TCG did more expensive optimizations, macros would be | ||
152 | - less useful. In the previous example a macro is useful because the | ||
153 | - liveness analysis is done on each basic block separately. Hence TCG | ||
154 | - cannot remove the code computing 't0' even if it is not used after | ||
155 | - the first macro implementation. | ||
156 | - | ||
157 | 3.4) Instruction Reference | 117 | 3.4) Instruction Reference |
158 | 118 | ||
159 | ********* Function call | 119 | ********* Function call |
@@ -241,6 +201,10 @@ t0=t1|t2 | @@ -241,6 +201,10 @@ t0=t1|t2 | ||
241 | 201 | ||
242 | t0=t1^t2 | 202 | t0=t1^t2 |
243 | 203 | ||
204 | +* not_i32/i64 t0, t1 | ||
205 | + | ||
206 | +t0=~t1 | ||
207 | + | ||
244 | ********* Shifts | 208 | ********* Shifts |
245 | 209 | ||
246 | * shl_i32/i64 t0, t1, t2 | 210 | * shl_i32/i64 t0, t1, t2 |
@@ -428,3 +392,34 @@ to apply more optimizations because more registers will be free for | @@ -428,3 +392,34 @@ to apply more optimizations because more registers will be free for | ||
428 | the generated code. | 392 | the generated code. |
429 | 393 | ||
430 | The exception model is the same as the dyngen one. | 394 | The exception model is the same as the dyngen one. |
395 | + | ||
396 | +6) Recommended coding rules for best performance | ||
397 | + | ||
398 | +- Use globals to represent the parts of the QEMU CPU state which are | ||
399 | + often modified, e.g. the integer registers and the condition | ||
400 | + codes. TCG will be able to use host registers to store them. | ||
401 | + | ||
402 | +- Avoid globals stored in fixed registers. They must be used only to | ||
403 | + store the pointer to the CPU state and possibly to store a pointer | ||
404 | + to a register window. The other uses are to ensure backward | ||
405 | + compatibility with dyngen during the porting a new target to TCG. | ||
406 | + | ||
407 | +- Use temporaries. Use local temporaries only when really needed, | ||
408 | + e.g. when you need to use a value after a jump. Local temporaries | ||
409 | + introduce a performance hit in the current TCG implementation: their | ||
410 | + content is saved to memory at end of each basic block. | ||
411 | + | ||
412 | +- Free temporaries and local temporaries when they are no longer used | ||
413 | + (tcg_temp_free). Since tcg_const_x() also creates a temporary, you | ||
414 | + should free it after it is used. Freeing temporaries does not yield | ||
415 | + a better generated code, but it reduces the memory usage of TCG and | ||
416 | + the speed of the translation. | ||
417 | + | ||
418 | +- Don't hesitate to use helpers for complicated or seldom used target | ||
419 | + intructions. There is little performance advantage in using TCG to | ||
420 | + implement target instructions taking more than about twenty TCG | ||
421 | + instructions. | ||
422 | + | ||
423 | +- Use the 'discard' instruction if you know that TCG won't be able to | ||
424 | + prove that a given global is "dead" at a given program point. The | ||
425 | + x86 target uses it to improve the condition codes optimisation. |
tcg/TODO
1 | -- test macro system | 1 | +- Add new instructions such as: andnot, ror, rol, setcond, clz, ctz, |
2 | + popcnt. | ||
2 | 3 | ||
3 | -- test conditional jumps | 4 | +- See if it is worth exporting mul2, mulu2, div2, divu2. |
4 | 5 | ||
5 | -- test mul, div, ext8s, ext16s, bswap | ||
6 | - | ||
7 | -- generate a global TB prologue and epilogue to save/restore registers | ||
8 | - to/from the CPU state and to reserve a stack frame to optimize | ||
9 | - helper calls. Modify cpu-exec.c so that it does not use global | ||
10 | - register variables (except maybe for 'env'). | ||
11 | - | ||
12 | -- fully convert the x86 target. The minimal amount of work includes: | ||
13 | - - add cc_src, cc_dst and cc_op as globals | ||
14 | - - disable its eflags optimization (the liveness analysis should | ||
15 | - suffice) | ||
16 | - - move complicated operations to helpers (in particular FPU, SSE, MMX). | ||
17 | - | ||
18 | -- optimize the x86 target: | ||
19 | - - move some or all the registers as globals | ||
20 | - - use the TB prologue and epilogue to have QEMU target registers in | ||
21 | - pre assigned host registers. | 6 | +- Support of globals saved in fixed registers between TBs. |
22 | 7 | ||
23 | Ideas: | 8 | Ideas: |
24 | 9 | ||
25 | - Move the slow part of the qemu_ld/st ops after the end of the TB. | 10 | - Move the slow part of the qemu_ld/st ops after the end of the TB. |
26 | 11 | ||
27 | -- Experiment: change instruction storage to simplify macro handling | ||
28 | - and to handle dynamic allocation and see if the translation speed is | ||
29 | - OK. | ||
30 | - | ||
31 | -- change exception syntax to get closer to QOP system (exception | 12 | +- Change exception syntax to get closer to QOP system (exception |
32 | parameters given with a specific instruction). | 13 | parameters given with a specific instruction). |
14 | + | ||
15 | +- Add float and vector support. |
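
Note on the README hunks above: the liveness paragraph ("suppress moves from a dead variable", "remove instructions which compute dead results") together with the new temporary / local temporary / global distinction can be illustrated with a small toy model. The sketch below is not QEMU code. Every name in it (ToyInsn, toy_liveness, the bitmask encoding of variables) is hypothetical and chosen only for illustration, and it assumes instructions have no side effects other than writing their output. It walks one basic block backwards, starting from the set of variables that must survive the block (globals and local temporaries, but not plain temporaries, whose content is destroyed at the end of a basic block).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model, not QEMU code: every name below is hypothetical. */

/* One three-operand instruction: out <- in1 op in2.
   Operands are variable indices in the range 0..31. */
typedef struct {
    int out;
    int in1;
    int in2;
    bool dead;   /* set by the pass when the result is never used */
} ToyInsn;

/* Backward liveness pass over a single basic block.
   live_at_end has one bit set per variable that must survive the
   block (globals and local temporaries). Plain temporaries are left
   out because their content is destroyed at the end of the block.
   Assumes instructions have no side effects besides writing 'out'.
   Returns the number of instructions kept. */
static size_t toy_liveness(ToyInsn *insns, size_t n, uint32_t live_at_end)
{
    uint32_t live = live_at_end;
    size_t kept = 0;

    for (size_t i = n; i-- > 0; ) {
        ToyInsn *insn = &insns[i];
        assert(insn->out >= 0 && insn->out < 32);
        assert(insn->in1 >= 0 && insn->in1 < 32);
        assert(insn->in2 >= 0 && insn->in2 < 32);

        uint32_t out_bit = UINT32_C(1) << insn->out;
        if (!(live & out_bit)) {
            /* Nobody reads the result later: the instruction is dead. */
            insn->dead = true;
            continue;
        }
        insn->dead = false;
        kept++;
        /* The output is overwritten here, so its old value is dead
           above this point; the inputs become live. */
        live &= ~out_bit;
        live |= UINT32_C(1) << insn->in1;
        live |= UINT32_C(1) << insn->in2;
    }
    return kept;
}
```

As a usage example under the same assumptions, take variables 0 and 1 as globals (g0, g1) and 2 and 3 as temporaries (t0, t1), with live_at_end = 0x3. For the block "t0 <- g0 + g1; t1 <- g0 + g0; g0 <- t0 + g1; g0 <- g1 + g1", toy_liveness() keeps only the last instruction: the third one is overwritten before g0 is read again, which in turn leaves t0 unused, and t1 is never read at all. Marking a variable as a local temporary or global simply means setting its bit in live_at_end, which is why such variables keep more instructions alive than plain temporaries do.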