First version of the introduction

Jakub Lasoń · Michał Szydłowski
1 parent eef141a4
Showing 7 changed files with 64 additions and 4 deletions
bare_adv.pdf
bare_adv.synctex.gz
bare_conf_compsoc.pdf
bare_conf_compsoc.synctex.gz
bare_jrnl_compsoc.pdf
bare_jrnl_compsoc.synctex.gz
bare_jrnl_compsoc.tex
@@ -366,6 +366,29 @@
 % correct bad hyphenation here
 \hyphenation{op-tical net-works semi-conduc-tor}
+\usepackage{listings}
+\usepackage{color}
+
+\definecolor{dkgreen}{rgb}{0,0.6,0}
+\definecolor{gray}{rgb}{0.5,0.5,0.5}
+\definecolor{mauve}{rgb}{0.58,0,0.82}
+
+\lstset{frame=tb,
+  language=C++,
+  aboveskip=3mm,
+  belowskip=3mm,
+  showstringspaces=false,
+  columns=flexible,
+  basicstyle={\small\ttfamily},
+  numbers=none,
+  numberstyle=\tiny\color{gray},
+  keywordstyle=\color{blue},
+  commentstyle=\color{dkgreen},
+  stringstyle=\color{mauve},
+  breaklines=true,
+  breakatwhitespace=true,
+  tabsize=3
+}
 \begin{document}
 %
@@ -537,12 +560,49 @@ Computer Society, IEEE, IEEEtran, journal, \LaTeX, paper, template.
 % 
 % Here we have the typical use of a "T" for an initial drop letter
 % and "HIS" in caps to complete the first word.
-\IEEEPARstart{T}{his} demo file is intended to serve as a ``starter file''
-for IEEE Computer Society journal papers produced under \LaTeX\ using
-IEEEtran.cls version 1.8b and later.
+\IEEEPARstart{F}{rom} the beginning of computer science, there has been a gap between the speed of processors and the speed of memories.
+Processors usually run at higher frequencies than the memories that hold the data they need for their calculations, and as a result they spend much of their time idle.
+That is why processors have very fast cache memory at their disposal; however, because such memory is expensive, its capacity is limited.
+This is widely known as the data locality problem, and it is especially serious when loops iterate over large amounts of data.
+It manifests itself as a cache miss, which occurs when the processor requests data that is not in cache memory.
+The data must then be fetched from RAM, which is much slower than the processor, and the processor sits idle while it waits.
+The obvious remedy is to reduce the number of cache misses, so that the processor can work without stalling and data is loaded from RAM while the processor performs other tasks.
+
+Many optimizations have been proposed to improve data locality by reducing cache misses; one of them is called tiling.
+It is derived from the strip-mining transformation, which was invented in the era of vector processors.
+Strip mining takes a loop from the program and divides it into smaller ones, called strips, which on vector processors allowed the smaller loops to be vectorized; on modern processors it brings almost no speedup by itself.
+Tiling uses the same idea but is better suited to modern processors: it provides some gains on its own and opens the way to further optimizations.
+Tiling usually works on a loop nest, transforming it into a deeper one by adding one extra loop for every original loop, which doubles the number of loops.
+However, the overall number of iterations remains the same; they are just grouped differently, which can already decrease the number of cache misses.
+Consider a simple loop nest.
+
+\begin{lstlisting}
+for(int i = 0; i < n; i++){
+	for(int j = 0; j < n; j++){
+		Stmt(i,j);
+	}
+}
+\end{lstlisting}
+
+Tiling transforms this loop nest into a more complex form.
+
+\begin{lstlisting}
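+// Outer loops step from tile to tile, inner loops walk one 64x64 tile.
+for(int Ti = 0; Ti < n; Ti += 64){
+	for(int Tj = 0; Tj < n; Tj += 64){
+		// Exclusive upper bound: tile end or loop bound, whichever is smaller.
+		for(int i = Ti; i < min(Ti+64, n); i++){
+			for(int j = Tj; j < min(Tj+64, n); j++){
+				Stmt(i,j);
+			}
+		}
+	}
+}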
+\end{lstlisting}
+
+As can be seen, the number of iterations is exactly the same, but they are executed tile by tile instead of running through the whole j loop before starting the next iteration of the i loop.
+The number 64 in these loops is called the tile size, and it is very important for the efficiency of this optimization: it limits the amount of data loaded into cache memory from RAM, so that while working on one tile the processor has all the data it needs, without having to load additional data. Unfortunately, this number has to be tuned for each computer, because processors differ from each other and have different cache memory configurations.
+
 % You must have at least 2 lines in the paragraph with the drop letter
 % (should never be an issue)
-I wish you the best of success.
 \hfill mds
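
The claim in the new introduction that tiling keeps the set of iterations unchanged can be checked with a small stand-alone program. The following is only a sketch under stated assumptions: the helper names `visit_plain`, `visit_tiled` and `same_iterations` are hypothetical and do not come from the paper, `Stmt(i,j)` is replaced by a visit counter, and each tile is assumed to end at the exclusive bound `min(Ti + tile, n)`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: runs the original n x m loop nest and counts
// how many times each (i, j) pair is visited (stand-in for Stmt(i,j)).
std::vector<int> visit_plain(int n, int m) {
    std::vector<int> visits(static_cast<std::size_t>(n) * m, 0);
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < m; j++) {
            visits[static_cast<std::size_t>(i) * m + j]++;
        }
    }
    return visits;
}

// Hypothetical helper: runs the tiled loop nest with square tiles of
// side `tile` and counts visits in the same way.
std::vector<int> visit_tiled(int n, int m, int tile) {
    assert(tile > 0);  // a zero tile size would never advance the tile loops
    std::vector<int> visits(static_cast<std::size_t>(n) * m, 0);
    for (int Ti = 0; Ti < n; Ti += tile) {
        for (int Tj = 0; Tj < m; Tj += tile) {
            // Exclusive upper bounds, clipped for partial tiles at the edges.
            const int i_end = std::min(Ti + tile, n);
            const int j_end = std::min(Tj + tile, m);
            for (int i = Ti; i < i_end; i++) {
                for (int j = Tj; j < j_end; j++) {
                    visits[static_cast<std::size_t>(i) * m + j]++;
                }
            }
        }
    }
    return visits;
}

// True when both loop nests visit every (i, j) the same number of times.
bool same_iterations(int n, int m, int tile) {
    return visit_plain(n, m) == visit_tiled(n, m, tile);
}
```

Calling `same_iterations(100, 70, 64)` from a test or a `main` function compares the two orders on a space whose sides are not multiples of the tile size, which is the case where an off-by-one in the tile bound would show up as a missing row or column.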