Commit 0a6b7b7813799f76e1859387688611af05db376c
1 parent: b314f270
update
git-svn-id: svn://svn.savannah.nongnu.org/qemu/trunk@4581 c046a42c-6fe2-441c-8c8c-71466251a162
Showing 2 changed files with 64 additions and 86 deletions
tcg/README
... | ... | @@ -16,14 +16,18 @@ from the host, although it is never the case for QEMU. |
16 | 16 | |
17 | 17 | A TCG "function" corresponds to a QEMU Translated Block (TB). |
18 | 18 | |
19 | -A TCG "temporary" is a variable only live in a given | |
20 | -function. Temporaries are allocated explicitly in each function. | |
19 | +A TCG "temporary" is a variable only live in a basic | |
20 | +block. Temporaries are allocated explicitly in each function. | |
21 | 21 | |
22 | -A TCG "global" is a variable which is live in all the functions. They | |
23 | -are defined before the functions defined. A TCG global can be a memory | |
24 | -location (e.g. a QEMU CPU register), a fixed host register (e.g. the | |
25 | -QEMU CPU state pointer) or a memory location which is stored in a | |
26 | -register outside QEMU TBs (not implemented yet). | |
22 | +A TCG "local temporary" is a variable only live in a function. Local | |
23 | +temporaries are allocated explicitly in each function. | |
24 | + | |
25 | +A TCG "global" is a variable which is live in all the functions | |
26 | +(equivalent of a C global variable). They are defined before the | |
27 | +functions are defined. A TCG global can be a memory location (e.g. a QEMU | |
28 | +CPU register), a fixed host register (e.g. the QEMU CPU state pointer) | |
29 | +or a memory location which is stored in a register outside QEMU TBs | |
30 | +(not implemented yet). | |
27 | 31 | |
28 | 32 | A TCG "basic block" corresponds to a list of instructions terminated |
29 | 33 | by a branch instruction. |
... | ... | @@ -32,11 +36,11 @@ by a branch instruction. |
32 | 36 | |
33 | 37 | 3.1) Introduction |
34 | 38 | |
35 | -TCG instructions operate on variables which are temporaries or | |
36 | -globals. TCG instructions and variables are strongly typed. Two types | |
37 | -are supported: 32 bit integers and 64 bit integers. Pointers are | |
38 | -defined as an alias to 32 bit or 64 bit integers depending on the TCG | |
39 | -target word size. | |
39 | +TCG instructions operate on variables which are temporaries, local | |
40 | +temporaries or globals. TCG instructions and variables are strongly | |
41 | +typed. Two types are supported: 32 bit integers and 64 bit | |
42 | +integers. Pointers are defined as an alias to 32 bit or 64 bit | |
43 | +integers depending on the TCG target word size. | |
40 | 44 | |
41 | 45 | Each instruction has a fixed number of output variable operands, input |
42 | 46 | variable operands and always constant operands. |
... | ... | @@ -44,14 +48,12 @@ variable operands and always constant operands. |
44 | 48 | The notable exception is the call instruction which has a variable |
45 | 49 | number of outputs and inputs. |
46 | 50 | |
47 | -In the textual form, output operands come first, followed by input | |
48 | -operands, followed by constant operands. The output type is included | |
49 | -in the instruction name. Constants are prefixed with a '$'. | |
51 | +In the textual form, output operands usually come first, followed by | |
52 | +input operands, followed by constant operands. The output type is | |
53 | +included in the instruction name. Constants are prefixed with a '$'. | |
50 | 54 | |
51 | 55 | add_i32 t0, t1, t2 (t0 <- t1 + t2) |
52 | 56 | |
53 | -sub_i64 t2, t3, $4 (t2 <- t3 - 4) | |
54 | - | |
55 | 57 | 3.2) Assumptions |
56 | 58 | |
57 | 59 | * Basic blocks |
... | ... | @@ -62,9 +64,8 @@ sub_i64 t2, t3, $4 (t2 <- t3 - 4) |
62 | 64 | - Basic blocks start after the end of a previous basic block, at a |
63 | 65 | set_label instruction or after a legacy dyngen operation. |
64 | 66 | |
65 | -After the end of a basic block, temporaries at destroyed and globals | |
66 | -are stored at their initial storage (register or memory place | |
67 | -depending on their declarations). | |
67 | +After the end of a basic block, the content of temporaries is | |
68 | +destroyed, but local temporaries and globals are preserved. | |
68 | 69 | |
69 | 70 | * Floating point types are not supported yet |
70 | 71 | |
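
The end-of-basic-block rule in the updated README text (temporaries lose their content, local temporaries and globals keep theirs) can be illustrated with a small toy model. This is only a sketch: the names VarKind, Var, var_set and end_basic_block are hypothetical and are not part of the TCG API, and real TCG tracks this state inside its register allocator rather than with a flag per variable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model of the three TCG variable classes. */
typedef enum { VAR_TEMP, VAR_LOCAL_TEMP, VAR_GLOBAL } VarKind;

typedef struct {
    VarKind kind;
    bool defined;   /* true while the variable holds a usable value */
    int value;
} Var;

/* Store a value in a variable and mark it as usable. */
static void var_set(Var *v, int value)
{
    assert(v != NULL);
    v->value = value;
    v->defined = true;
}

/*
 * Apply the end-of-basic-block rule: the content of plain temporaries
 * is destroyed, local temporaries and globals are preserved.
 */
static void end_basic_block(Var *vars, size_t n)
{
    assert(vars != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (vars[i].kind == VAR_TEMP) {
            vars[i].defined = false;
        }
    }
}
```

In this model, a value that must survive a branch or a label has to live in a VAR_LOCAL_TEMP or VAR_GLOBAL entry; reading a VAR_TEMP entry after end_basic_block() would be a bug in the code generator.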
... | ... | @@ -100,7 +101,7 @@ optimizations: |
100 | 101 | is suppressed. |
101 | 102 | |
102 | 103 | - A liveness analysis is done at the basic block level. The |
103 | - information is used to suppress moves from a dead temporary to | |
104 | + information is used to suppress moves from a dead variable to | |
104 | 105 | another one. It is also used to remove instructions which compute |
105 | 106 | dead results. The latter is especially useful for condition code |
106 | 107 | optimization in QEMU. |
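
The liveness analysis described in this hunk can be sketched as a single backward pass over one basic block. The code below is a simplified illustration, not the TCG implementation: the Op layout, NO_VAR, MAX_VARS and mark_dead_ops are hypothetical names, every instruction is assumed to have exactly one output and no side effects, and only globals are assumed to be live at the end of the block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_VARS 16
#define NO_VAR   (-1)   /* marks an unused or constant input operand */

/* Hypothetical instruction record: one output, up to two variable inputs. */
typedef struct {
    int out;
    int in[2];
    bool dead;          /* set by mark_dead_ops() */
} Op;

/*
 * Walk the basic block backwards. An instruction whose output is not
 * live is marked dead; otherwise its output stops being live above it
 * and its inputs become live. Returns the number of dead instructions.
 */
static size_t mark_dead_ops(Op *ops, size_t n_ops,
                            const bool *is_global, size_t n_vars)
{
    bool live[MAX_VARS];
    size_t removed = 0;

    assert(n_vars <= MAX_VARS);
    /* At the end of the block only globals are assumed to be live. */
    for (size_t v = 0; v < n_vars; v++) {
        live[v] = is_global[v];
    }

    for (size_t i = n_ops; i-- > 0; ) {
        Op *op = &ops[i];

        assert(op->out >= 0 && (size_t)op->out < n_vars);
        if (!live[op->out]) {
            op->dead = true;
            removed++;
            continue;
        }
        op->dead = false;
        live[op->out] = false;
        for (int k = 0; k < 2; k++) {
            if (op->in[k] != NO_VAR) {
                assert((size_t)op->in[k] < n_vars);
                live[op->in[k]] = true;
            }
        }
    }
    return removed;
}
```

With three consecutive writes to the same global and no read in between, the pass marks the first two instructions dead and keeps only the last one, which is the situation condition code computation tends to produce.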
... | ... | @@ -113,47 +114,6 @@ optimizations: |
113 | 114 | |
114 | 115 | only the last instruction is kept. |
115 | 116 | |
116 | -- A macro system is supported (may get closer to function inlining | |
117 | - some day). It is useful if the liveness analysis is likely to prove | |
118 | - that some results of a computation are indeed not useful. With the | |
119 | - macro system, the user can provide several alternative | |
120 | - implementations which are used depending on the used results. It is | |
121 | - especially useful for condition code optimization in QEMU. | |
122 | - | |
123 | - Here is an example: | |
124 | - | |
125 | - macro_2 t0, t1, $1 | |
126 | - mov_i32 t0, $0x1234 | |
127 | - | |
128 | - The macro identified by the ID "$1" normally returns the values t0 | |
129 | - and t1. Suppose its implementation is: | |
130 | - | |
131 | - macro_start | |
132 | - brcond_i32 t2, $0, $TCG_COND_EQ, $1 | |
133 | - mov_i32 t0, $2 | |
134 | - br $2 | |
135 | - set_label $1 | |
136 | - mov_i32 t0, $3 | |
137 | - set_label $2 | |
138 | - add_i32 t1, t3, t4 | |
139 | - macro_end | |
140 | - | |
141 | - If t0 is not used after the macro, the user can provide a simpler | |
142 | - implementation: | |
143 | - | |
144 | - macro_start | |
145 | - add_i32 t1, t2, t4 | |
146 | - macro_end | |
147 | - | |
148 | - TCG automatically chooses the right implementation depending on | |
149 | - which macro outputs are used after it. | |
150 | - | |
151 | - Note that if TCG did more expensive optimizations, macros would be | |
152 | - less useful. In the previous example a macro is useful because the | |
153 | - liveness analysis is done on each basic block separately. Hence TCG | |
154 | - cannot remove the code computing 't0' even if it is not used after | |
155 | - the first macro implementation. | |
156 | - | |
157 | 117 | 3.4) Instruction Reference |
158 | 118 | |
159 | 119 | ********* Function call |
... | ... | @@ -241,6 +201,10 @@ t0=t1|t2 |
241 | 201 | |
242 | 202 | t0=t1^t2 |
243 | 203 | |
204 | +* not_i32/i64 t0, t1 | |
205 | + | |
206 | +t0=~t1 | |
207 | + | |
244 | 208 | ********* Shifts |
245 | 209 | |
246 | 210 | * shl_i32/i64 t0, t1, t2 |
... | ... | @@ -428,3 +392,34 @@ to apply more optimizations because more registers will be free for |
428 | 392 | the generated code. |
429 | 393 | |
430 | 394 | The exception model is the same as the dyngen one. |
395 | + | |
396 | +6) Recommended coding rules for best performance | |
397 | + | |
398 | +- Use globals to represent the parts of the QEMU CPU state which are | |
399 | + often modified, e.g. the integer registers and the condition | |
400 | + codes. TCG will be able to use host registers to store them. | |
401 | + | |
402 | +- Avoid globals stored in fixed registers. They must be used only to | |
403 | + store the pointer to the CPU state and possibly to store a pointer | |
404 | + to a register window. The other uses are to ensure backward | |
405 | + compatibility with dyngen during the porting of a new target to TCG. | |
406 | + | |
407 | +- Use temporaries. Use local temporaries only when really needed, | |
408 | + e.g. when you need to use a value after a jump. Local temporaries | |
409 | + introduce a performance hit in the current TCG implementation: their | |
410 | + content is saved to memory at end of each basic block. | |
411 | + | |
412 | +- Free temporaries and local temporaries when they are no longer used | |
413 | + (tcg_temp_free). Since tcg_const_x() also creates a temporary, you | |
414 | + should free it after it is used. Freeing temporaries does not yield | |
415 | + better generated code, but it reduces the memory usage of TCG and | |
416 | + improves the speed of the translation. | |
417 | + | |
418 | +- Don't hesitate to use helpers for complicated or seldom used target | |
419 | + instructions. There is little performance advantage in using TCG to | |
420 | + implement target instructions taking more than about twenty TCG | |
421 | + instructions. | |
422 | + | |
423 | +- Use the 'discard' instruction if you know that TCG won't be able to | |
424 | + prove that a given global is "dead" at a given program point. The | |
425 | + x86 target uses it to improve the condition codes optimisation. | |
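
The advice to free temporaries can be illustrated with a toy allocator. This sketch assumes that a freed temporary slot is recycled by the next allocation; TempPool, pool_init, pool_temp_new and pool_temp_free are hypothetical names chosen for the example and are not the tcg_temp_new/tcg_temp_free implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define POOL_MAX 8

/* Hypothetical pool of temporary slots for one translated function. */
typedef struct {
    bool in_use[POOL_MAX];
    int n_slots;        /* number of slots ever created (memory footprint) */
} TempPool;

static void pool_init(TempPool *p)
{
    assert(p != NULL);
    for (int i = 0; i < POOL_MAX; i++) {
        p->in_use[i] = false;
    }
    p->n_slots = 0;
}

/* Return a slot index, reusing a freed slot when one exists; -1 if full. */
static int pool_temp_new(TempPool *p)
{
    assert(p != NULL);
    for (int i = 0; i < p->n_slots; i++) {
        if (!p->in_use[i]) {
            p->in_use[i] = true;
            return i;
        }
    }
    if (p->n_slots == POOL_MAX) {
        return -1;
    }
    p->in_use[p->n_slots] = true;
    return p->n_slots++;
}

/* Give a slot back so that a later allocation can reuse it. */
static void pool_temp_free(TempPool *p, int idx)
{
    assert(p != NULL && idx >= 0 && idx < p->n_slots && p->in_use[idx]);
    p->in_use[idx] = false;
}
```

Allocating three temporaries without freeing them grows the pool to three slots, while an allocate/free cycle repeated three times never needs more than one. Under the stated assumption, this is why freeing keeps the number of variables the translator has to track small even though the generated code is the same.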
tcg/TODO
1 | -- test macro system | |
1 | +- Add new instructions such as: andnot, ror, rol, setcond, clz, ctz, | |
2 | + popcnt. | |
2 | 3 | |
3 | -- test conditional jumps | |
4 | +- See if it is worth exporting mul2, mulu2, div2, divu2. | |
4 | 5 | |
5 | -- test mul, div, ext8s, ext16s, bswap | |
6 | - | |
7 | -- generate a global TB prologue and epilogue to save/restore registers | |
8 | - to/from the CPU state and to reserve a stack frame to optimize | |
9 | - helper calls. Modify cpu-exec.c so that it does not use global | |
10 | - register variables (except maybe for 'env'). | |
11 | - | |
12 | -- fully convert the x86 target. The minimal amount of work includes: | |
13 | - - add cc_src, cc_dst and cc_op as globals | |
14 | - - disable its eflags optimization (the liveness analysis should | |
15 | - suffice) | |
16 | - - move complicated operations to helpers (in particular FPU, SSE, MMX). | |
17 | - | |
18 | -- optimize the x86 target: | |
19 | - - move some or all the registers as globals | |
20 | - - use the TB prologue and epilogue to have QEMU target registers in | |
21 | - pre assigned host registers. | |
6 | +- Support of globals saved in fixed registers between TBs. | |
22 | 7 | |
23 | 8 | Ideas: |
24 | 9 | |
25 | 10 | - Move the slow part of the qemu_ld/st ops after the end of the TB. |
26 | 11 | |
27 | -- Experiment: change instruction storage to simplify macro handling | |
28 | - and to handle dynamic allocation and see if the translation speed is | |
29 | - OK. | |
30 | - | |
31 | -- change exception syntax to get closer to QOP system (exception | |
12 | +- Change exception syntax to get closer to QOP system (exception | |
32 | 13 | parameters given with a specific instruction). |
14 | + | |
15 | +- Add float and vector support. | |