HOWTO install CELL Environment in Gentoo

Gentoo, HPC April 19th, 2007

The CELL BE SDK developed by IBM and Barcelona Supercomputing Center targets a Red Hat Fedora Core 6 host system. Using rpm2targz and rpm, we can install the CELL SDK in Gentoo Linux as well.

Dependencies and Tools

I don’t want to use RPM for package management (that is my motivation for using Gentoo), so I need to resolve the dependencies manually. The following packages are installed to meet the dependency requirements:

emerge glut libXmu libXext

gcc, make, perl, rsync, flex and byacc are already in the base system.

To extract the RPMs, we also need to install rpm2targz and rpm:

emerge rpm rpm2targz
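Before moving on, it may help to confirm that the tools are actually on the PATH. The following is only a minimal sketch: check_tools is a hypothetical helper name, and the command names you pass to it are assumptions about how each package names its binary.

```shell
#!/bin/sh
# Hypothetical helper: report which of the given commands are missing from PATH.
# Prints each missing name on its own line; returns non-zero if any is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, check_tools gcc make perl rsync flex rpm rpm2tar prints nothing when everything is installed.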

Download the RPMs

The CELL SDK files are scattered across AlphaWorks and BSC:

Download CELLSDK21.iso from AlphaWorks; you might need to register before you can access it.

Download all the x86-based files from BSC. To make your life easier, you could use this script:

# ppu
wget -c http://www.bsc.es/projects/deepcomputing/linuxoncell/cellsimulator/sdk2.1/ppu-binutils-2.17.50-8.i686.rpm

wget -c http://www.bsc.es/projects/deepcomputing/linuxoncell/cellsimulator/sdk2.1/ppu-gcc-4.1.1-10.i686.rpm

wget -c http://www.bsc.es/projects/deepcomputing/linuxoncell/cellsimulator/sdk2.1/ppu-gcc-c++-4.1.1-10.i686.rpm

wget -c http://www.bsc.es/projects/deepcomputing/linuxoncell/cellsimulator/sdk2.1/ppu-gcc-fortran-4.1.1-10.i686.rpm

wget -c http://www.bsc.es/projects/deepcomputing/linuxoncell/cellsimulator/sdk2.1/ppu-gdb-6.6-15.i686.rpm

wget -c http://www.bsc.es/projects/deepcomputing/linuxoncell/cellsimulator/sdk2.1/ppu-gcc-debuginfo-4.1.1-10.i686.rpm

wget -c http://www.bsc.es/projects/deepcomputing/linuxoncell/cellsimulator/sdk2.1/ppu-gdb-debuginfo-6.6-15.i686.rpm

# spu
wget -c http://www.bsc.es/projects/deepcomputing/linuxoncell/cellsimulator/sdk2.1/spu-binutils-2.17.50-8.i686.rpm

wget -c http://www.bsc.es/projects/deepcomputing/linuxoncell/cellsimulator/sdk2.1/spu-gcc-4.1.1-9.i686.rpm

wget -c http://www.bsc.es/projects/deepcomputing/linuxoncell/cellsimulator/sdk2.1/spu-gcc-c++-4.1.1-9.i686.rpm

wget -c http://www.bsc.es/projects/deepcomputing/linuxoncell/cellsimulator/sdk2.1/spu-gdb-6.6-12.i686.rpm

wget -c http://www.bsc.es/projects/deepcomputing/linuxoncell/cellsimulator/sdk2.1/spu-newlib-1.15.0-7.i686.rpm

wget -c http://www.bsc.es/projects/deepcomputing/linuxoncell/cellsimulator/sdk2.1/spu-gcc-debuginfo-4.1.1-9.i686.rpm

wget -c http://www.bsc.es/projects/deepcomputing/linuxoncell/cellsimulator/sdk2.1/spu-gdb-debuginfo-6.6-12.i686.rpm

# sysroot
wget -c http://www.bsc.es/projects/deepcomputing/linuxoncell/cellsimulator/sdk2.1/ppu-sysroot-fc6-1.noarch.rpm

wget -c http://www.bsc.es/projects/deepcomputing/linuxoncell/cellsimulator/sdk2.1/ppu-sysroot64-fc6-1.noarch.rpm

wget -c http://www.bsc.es/projects/deepcomputing/linuxoncell/cellsimulator/sdk2.1/sysroot_image-2.1-8.noarch.rpm
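Since every file lives under the same BSC directory, the download list can also be driven by a small loop. This is only a sketch: sdk_urls and fetch_sdk are hypothetical helper names, and the base URL is simply copied from the wget lines above.

```shell
#!/bin/sh
# Base URL copied from the wget lines above.
BASE_URL="http://www.bsc.es/projects/deepcomputing/linuxoncell/cellsimulator/sdk2.1"

# Hypothetical helper: print one download URL per package file name given.
sdk_urls() {
    for name in "$@"; do
        printf '%s/%s\n' "$BASE_URL" "$name"
    done
}

# Hypothetical helper: fetch every named package, resuming partial
# downloads (-c) as the original script does. Stops at the first failure.
fetch_sdk() {
    sdk_urls "$@" | while read -r url; do
        wget -c "$url" || exit 1
    done
}
```

For example, fetch_sdk ppu-gdb-6.6-15.i686.rpm spu-gdb-6.6-12.i686.rpm downloads two of the packages; the stop-on-failure behaviour is a design choice, so that a broken download is noticed before the conversion step.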

Extract the RPMs

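We need to convert all the RPMs to tar archives, then untar them in the root directory:

# Mount the SDK ISO and convert the RPMs it ships
sudo mount CELLSDK21.iso /mnt/loop -o loop
for x in /mnt/loop/software/*i386.rpm
do
        rpm2tar "$x"
done

# Convert the RPMs downloaded from BSC
for x in *.rpm
do
        rpm2tar "$x"
done
BUILDDIR=$(pwd)

# Unpack every tar archive into the root directory
cd /
for x in "$BUILDDIR"/*.tar
do
        sudo tar xvf "$x"
done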
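Unpacking straight into / is hard to undo, so one cautious option is to extract into a scratch directory first and inspect the layout. The sketch below assumes a tar that supports -C; extract_tars is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: extract every *.tar in SRC_DIR into DEST_DIR,
# stopping on the first error.
# Usage: extract_tars SRC_DIR DEST_DIR
extract_tars() {
    src=$1
    dest=$2
    for archive in "$src"/*.tar; do
        # An unmatched glob stays literal, so skip it.
        [ -e "$archive" ] || continue
        tar -xf "$archive" -C "$dest" || return 1
    done
    return 0
}
```

For example, after mkdir /tmp/cell-preview, running extract_tars "$BUILDDIR" /tmp/cell-preview shows which paths under /opt and /usr the archives would create.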

The whole process can take a few hours, depending on your network bandwidth. Unfortunately, sysroot_image-2.1-8.noarch.rpm fails to convert to tar, so use rpm2cpio to the rescue:

rpm2cpio sysroot_image-2.1-8.noarch.rpm | sudo cpio -i --make-directories

Test Drive

cd /opt/ibm/systemsim-cell/run/cell/linux
../run_gui

You MUST change to the directory shown above, where .systemsim.tcl resides. Wait for the GUI window to launch, then click Go. If nothing is wrong, the Linux kernel boots and the sysroot_image is mapped as the guest OS system.

Warning: The CELL simulator is extremely compute-intensive; you can click Stop to pause the simulator when it is idle.
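Because the simulator depends on being started from the right directory, a small guard can catch a wrong working directory early. This is only a sketch: has_sim_config is a hypothetical helper name and SIM_DIR is copied from the commands above.

```shell
#!/bin/sh
# Directory taken from the "Test Drive" commands above.
SIM_DIR=/opt/ibm/systemsim-cell/run/cell/linux

# Hypothetical helper: return 0 if DIR contains .systemsim.tcl,
# otherwise print a hint on stderr and return 1.
has_sim_config() {
    if [ -f "$1/.systemsim.tcl" ]; then
        return 0
    fi
    echo "no .systemsim.tcl in $1; cd there before running ../run_gui" >&2
    return 1
}
```

For example: has_sim_config "$SIM_DIR" && cd "$SIM_DIR" && ../run_gui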

Cleanup

We can unmerge rpm right away if you want a cruft-free system:

emerge -c rpm =db-3.2.9-r11 beecrypt =db-1.85-r3

Glimpse of SC2006: Acceleration

HPC November 22nd, 2006

For medium-scale applications, it is more cost-effective to plug an acceleration board into a desktop to achieve better floating-point performance than to migrate to a PC cluster or a commercial supercomputer. At SC2006, at least four techniques were on show:

GPGPU

A GPU is a dedicated ASIC for multimedia and gaming applications, with an optimized texture/render pipeline architecture. General-purpose GPU (GPGPU) computing relies on the vendors’ APIs to boost floating-point performance. PeakStream unleashes the power of ATI GPUs via ATI’s proprietary interface; other platforms are in development. RapidMind trades performance for portability by using the OpenGL interface.

CELL

CELL may be the first general-purpose CPU designed for multimedia applications. Besides Sony’s PS3, some commercial products are already available on the market, for example the acceleration board from Mercury.

ClearSpeed

The ClearSpeed acceleration board made a big buzz at SC2005. Its computing capacity relative to its power consumption is quite impressive.

ClearSpeed board


FPGA-based reconfigurable computing

This exotic technology has been around for a few years. The main barrier to its popularity is the steep learning curve software developers face when implementing functionality in a Hardware Description Language (HDL). The high-level programming languages available either lack expressiveness for parallelism or lack performance.

The main barrier for the co-processor architecture is memory bandwidth: PCI Express is still the bottleneck for moving data between the CPU and the acceleration board. Maybe one day AMD will embed the GPU into the CPU to replace the floating-point unit, if they can figure out the power consumption and manufacturing.

Another challenge comes from programming languages and libraries. Designers of new programming languages face a dilemma: how can we hide the low-level details to make the language more expressive and intuitive for the programmer, and at the same time exploit low-level features to enhance performance? We must trade one off against the other, but where is the balance point?

Glimpse of SC2006: Visualization

HPC November 22nd, 2006

Boston University visualization demonstration

Data visualization is quite impressive to a novice audience. For example, the Boston University demonstration made a big buzz among the attendees. The photo shows a snapshot: the interaction of the solar wind with the Earth’s magnetic field is simulated and then visualized via OpenDX, with two virtual cameras placed to generate the 3D effect. The audience needs to wear 3D glasses, as at Universal Studios.

NIST visualization demonstration

Another example comes from NIST. This application is used in medical image processing. The ghost image is much clearer in the simple spirals.

Sun visualization architecture

Data visualization is so essential that Sun has developed a distributed environment for it: a very thin client (left) and a dedicated rendering server (right). The user stores the working environment and data on a smart card; once it is plugged in, the server authenticates the user and brings the last environment back. The rendering server is equipped with dedicated nVidia GPUs for the sake of power efficiency.

Prime time for Python in HPC?

Development, HPC November 21st, 2006

Python is a powerful, flexible and elegant dynamic programming language, used pervasively from system administration to web applications. However, because of its poor performance, which is the price we pay for its versatility, Python is seldom used in high-performance computing.

Things have changed recently. GPGPU, Cell and ClearSpeed have appeared on the horizon as new candidates thanks to their outstanding performance/power ratio. They all work in the accelerator-board manner: the host processor prepares the data, pushes it to the co-processor, does some trivial work, and waits until the co-processor returns the results. Python and other dynamic languages may glue the different pieces together. Here are some approaches in progress:

StarP Parallel Framework

StarP targets scientists who demand high performance but are reluctant to parallelize their code. The framework works its magic by attaching the suffix *p to variables, which hides the whole parallel paradigm. The client decomposes the job into basic linear algebra operations, distributes them to the server, and fetches the results if necessary. Here is the exploration of StarP magic in Python.

The essential problem in parallel computing is how to decompose the job into different work spaces and how to minimize communication between the threads. StarP provides some built-in distributions and gives users options. Either way, there is no silver bullet in this field.

Global Array with Python binding

Global Array is another approach. It sits on top of MPI, but provides a PGAS programming model through a set of APIs. Although it is too MPI-Fortran-like, it may still arouse interest in the Python community in implementing a PGAS programming model without introducing new language constructs.

What is next?

Current PGAS implementations (UPC, CAF and Titanium) share the same communication backbone, GASNet. Is it possible to build a GASNet Python binding and take the same approach as Global Array to construct a Python PGAS library?

2nd PGASCON Overview

HPC October 4th, 2006

NOTICE: PGASCON is not the official name of The Second Conference on Partitioned Global Address Space Programming Models, and this post states only my personal opinion, which does not represent GWU HPCL.

PGAS stands for Partitioned Global Address Space, aka Distributed Shared Memory. This memory model has been adopted for DARPA’s next-generation High Productivity Computing Systems program. The shared memory eliminates the tedious message-passing overhead, while the partitioning improves performance by exploiting affinity.

There are a few challenges for parallel computing that PGAS aims to address:

How to express the parallelism more naturally?
People tend to think about a problem sequentially, unless they are born hardware designers. PGAS languages add new keywords to declare a shared variable or vector and adopt the SPMD execution model. Users still need to consider synchronization and atomicity in the parallel computing environment, with little help from the compiler.

How to map the shared vector?
In UPC, due to the limitations of old-style C arrays, the user can manipulate the memory layout through the blocksize, while in CAF the user may specify the layout by using a memory vector. One interesting approach is pMatlab: a mapper object is constructed at runtime and passed as the last argument to other MATLAB functions. pMapper goes even further, with automatic and semi-automatic memory mapping.

How to optimize shared variable access?
This is a really BIG challenge for compiler and runtime developers. First, a carefully designed cache and TLB for shared variables may improve the hit rate and shorten address translation; lazy evaluation and aggregated packet passing help to reduce memory bandwidth contention. New technology, for example Sun’s optically linked chips, may enhance overall performance on the architecture side as well.
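To make the UPC blocksize mapping mentioned above more concrete: for an array declared with block size B, element i has affinity to thread (i / B) mod THREADS, with integer division. The sketch below just evaluates that formula; upc_owner and upc_layout are hypothetical helper names, not part of any UPC toolchain.

```shell
#!/bin/sh
# Thread that owns element INDEX of a UPC shared array with block size BLOCK,
# distributed round-robin over THREADS threads: (INDEX / BLOCK) mod THREADS.
# Usage: upc_owner INDEX BLOCK THREADS
upc_owner() {
    echo $(( ($1 / $2) % $3 ))
}

# Print the owning thread of each of the first COUNT elements on one line.
# Usage: upc_layout COUNT BLOCK THREADS
upc_layout() {
    i=0
    line=""
    while [ "$i" -lt "$1" ]; do
        line="$line$(upc_owner "$i" "$2" "$3") "
        i=$((i + 1))
    done
    # Drop the trailing space.
    echo "${line% }"
}
```

With three threads, upc_layout 8 2 3 prints 0 0 1 1 2 2 0 0, while the default block size of one, upc_layout 8 1 3, prints 0 1 2 0 1 2 0 1. A larger blocksize keeps neighbouring elements on the same thread, which is the affinity that the user is tuning.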