The purpose of this example is to convert the source code of a C program, using the described Data Management techniques, such that the data memory behavior is much improved. This means:
use smaller memories/buffers (faster and less energy consuming)
reducing the number of memory transactions
reduce the total energy consumption in the data memory hierarchy
We will mainly concentrate on using loop transformations, but you may use whatever is suitable.
The program which we will use converts YUV coded pictures to RGB format.
Download and unpack the necessary files and put them in an arbitrary directory. It should contain:
akicif.y akicif.u, akicif.v: these 3 files contain the input picture in yuv format.
ycrcb420torgb.c : the original yuv-rgb algortihm
ppm-format.doc : describes the ppm format
guidelines.doc : these guidelines
cmpp6.c : program to check equivalence of two images
YUV-RGB.ppt : powerpoint slides explaining the algorithm
Study the included documentatation (i.e. doc and ppt files).
Compile and run the yuv-rgb program, using the files akiyo.y , akiyo.u and akiyo.v as input. It generates the output file out.ppm which you can inspect with e.g. a tool like xv, or any other tool which can display ppm format (check the web on a suitable viewer for your operating system). Rename out.ppm to ref.ppm. The picture should look like shown here.
Now you can start to change the program, run it again and check if the newly generated out.ppm is exactly equal to ref.ppm. Use diff (on unix), or the cmpp6.c program to compare these images.
Keep track of the different versions you make (e.g. give them appropriate names).
Make a short report about your results. This report should at least include two tables containing the following information:
array #accesses #reads #writes used in function size energy
X .. .. .. convolV0
.. .. .. convolH0
total of X .. .. .. --- a*b .. nJ
etc.; add all the arrays in this way
The table shows for each array its read and write accesses, as they occur within the different functions, and adds them (total). The row with 'total of X' includes the size of X (in nr of bytes), and the energy consumed by all these accesses.
Note: #accesses = #reads + #writes
Estimate the number of accesses, split into reads and writes (just by inspecting the c code you can calculate these numbers). For energy use the models below.
Include this table at least two times: one for the original algorithm, and one for the best transformation result you got. It could be that intermediate or other solutions are also worth listing and explaining in your report.
Note: for the original algorithm (and therefore the first table) we assume that all arrays are in external DRAM. Scalars can be assumed to be allocated in the (internal) register file. We do not count the energy contribution of register accesses. Also (except for the question below) you may ignore burst access options; so all acceses take the same number of cycles, and the same energy (see energy model below).
Could you also comment on the following 2 issues:
Number of cycles: assuming that internal SRAM based memories can be accessed in 2 cycles, and external DRAM memories in 20, give a rough (but motivated) estimate of the change in number of cycles
Burst mode: it is well known that burst accesses, i.e. accessing N successive memory locations, is cheaper then performing N single location accesses, both in energy and in number of cycles.
How would this impact your results?
Would you have to write your code differently in case burst accesses are supported?
Energy consumption model
For energy consumption we will use the following energy model.:
External DRAM memory
E = 5 nJ /access, independent of its size
We assume a linear energy function of the SRAM size:
E = 0.1 + 0.15*(Size in kbyte) nJ/access for single ported memory.
E = 0.3 + 0.25*(Size in kbyte) nJ/access for dual ported memory.
Higher ported memory is not available in the library. Note that register files can have many more ports, but they are based on flip-flops and multiplexors, and not on special designed SRAM cells.
Mention in your report which transformations you applied and in which order.
Finally add the resulting code.
Note that at the exam you will be asked to explain the steps you made, and how the final solution behaves and how it uses local memories.
The loop (and other) transformations you may need are mainly the following:
Merge loops (note that loop bounds should be made equal first).
Loop bump: this is used to make bounds of two successive loops equal in order to
merge them in the next step.
E.g. for i=1..10 .. A[i] is changed into for i=3..12 .. A[i-2]...
Unrolling outerloop followed by Jam.
Jam means merging the two (or more) innerloops which are created by unrolling the outerloop twice (or more).
This is often one of the aims for doing a merge:
B[i] = f(A[i])
C[i-1] = B[i]
b = f(A[i])
C[i-1] = b
Reorder statements (a statement can also be an inner loop)
E.g. split a loop for i=1..100 into two loops, for i=1..3 and for i=4..100
Inplace mapping (intra or inter inplace).
See the treated slides for examples on this. Inplace mapping exploits the limited lifetime of array elements.
You should always check if the transformations are allowed !! Check the array dependences (between a read/use and a previous write/def of array elements). Anyway, if you make errors, you will notice an incorrect program or picture.
The general goal of all these transformations is to combine most of the loops (that is the 5 transformations: CR and CB horizontal and vertical upsampling, and the yuv-rgb color conversion), such that intermediate arrays can be scalar reduced (see above), i.e. completely removing the required intermediate buffer, or at least greatly reduced in size.
Let's see who can come up which the best result !!