Data Density Program

By Malcolm Mclean

This program shows the data density of a two-dimensional distribution. This can't be shown easily on a scatter plot, because the density of points rapidly becomes too high to tell them apart. There is no definitive way of defining data density at a point, so the program uses the sum, over all the points, of the reciprocal of the squared distance to each point. This definition leads to the problem that data density goes to infinity at a pixel that is exactly on a point. To avoid this, a small fudge factor is added to each point's distance.
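The definition can be sketched in a few lines of C. This is an illustration only, not the program's own source: the function name density_at is made up here, and adding the fudge factor to the squared distance (rather than to the distance itself) is an assumption.

```c
#include <stddef.h>

/* Density at (px, py): the sum over all n points of
   1 / (squared distance to the point + fudge).
   Adding fudge to the squared distance is an assumption of this sketch. */
double density_at(const double *x, const double *y, size_t n,
                  double px, double py, double fudge)
{
    double sum = 0.0;
    size_t i;

    for (i = 0; i < n; i++) {
        double dx = x[i] - px;
        double dy = y[i] - py;
        sum += 1.0 / (dx * dx + dy * dy + fudge);
    }
    return sum;
}
```

With a positive fudge the divisor can never reach zero, so a pixel lying exactly on a point gets a large but finite value.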

The program is implemented in straightforward ANSI C and should be compilable on any platform with a C compiler.

Here's an example. The data density plot gives a much better visual fix on the data density than the scatterplot. The data file used to generate the plots is fourgaaussians.csv. A second file with four columns, gaussplusuni.csv, shows the ability to select the x and y columns. This file consists of four Gaussians in columns "A" and "B", plus columns "C" and "D" representing the Gaussians with a uniform scatter.

Instructions

The program requires a .csv file as input. CSV files simply consist of columns of numbers separated by commas, optionally with a header line of column names, and can be output by virtually all statistical software. By default datadensity takes the first column as x and the second column as y, but this can be changed. It outputs an image of the data density, in JPEG, GIF or BMP format.
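For readers who haven't handled CSV in C, here is a minimal sketch of reading one row of numbers. It is not the csv.c shipped with the program: parse_csv_row is a hypothetical helper, and it handles only plain numeric fields, with no quoting.

```c
#include <stdlib.h>

/* Parse up to maxcols comma-separated numbers from one line of text.
   Returns the number of values stored in out, or -1 if a field does not
   start with a number (for example, a header line of column names). */
int parse_csv_row(const char *line, double *out, int maxcols)
{
    int n = 0;
    char *end;

    while (n < maxcols) {
        out[n] = strtod(line, &end);
        if (end == line)
            return -1;          /* no number where one was expected */
        n++;
        while (*end == ' ' || *end == '\t')
            end++;              /* allow blanks before the comma */
        if (*end != ',')
            break;              /* end of line or trailing text */
        line = end + 1;
    }
    return n;
}
```

Picking the x and y values is then just a matter of indexing into out with the 0-based column numbers.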

Simple usage:

datadensity infile.csv densitymap.jpeg

This will create a data density map with default settings.

Firstly, we might have a data set where x is much bigger or smaller than y. For instance, x might be height in inches, whilst y could be salaries in pounds. Just feeding in the raw data won't give sensible results, because distances between points are then dominated by whichever axis has the larger numbers. However, we can scale x with the -xscale option to get the data into more or less the same range.
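A small worked example shows what the scaling does to distances. The helper below is hypothetical, and multiplying the x difference by the factor is an assumption about how -xscale is applied.

```c
/* Squared distance between (x1, y1) and (x2, y2), with the x difference
   multiplied by xscale first. How the real program applies -xscale is
   assumed here, not taken from its source. */
double scaled_dist2(double x1, double y1, double x2, double y2, double xscale)
{
    double dx = (x1 - x2) * xscale;
    double dy = y1 - y2;

    return dx * dx + dy * dy;
}
```

Take two people who differ by 2 inches in height and 2000 pounds in salary. With xscale 1.0 the squared distance is 4 + 4000000, so height contributes almost nothing. With xscale 1000.0 it is 4000000 + 4000000, and both axes count equally.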

Then we need to play about with the fudge factor. The higher the fudge factor, the more smear in the data. Make the fudge factor too large, and the plot congeals together, so you only get a rough idea of the density. Make it too small, and it reverts to what is effectively a scatterplot. There's no "right" answer for the value to set the fudge factor to. It depends on what question you're asking of the data.

Plots with fudge 0.01, fudge 0.1, fudge 0.5 and fudge 3.
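The smearing can be put into numbers with a rough model. Assume a single point contributes 1 / (d*d + fudge) to a pixel at distance d; both that formula and the helper peak_ratio are assumptions made for illustration. The ratio below says how much brighter the pixel on the point is than a pixel at distance d.

```c
/* Ratio of a point's contribution at distance 0 to its contribution at
   distance d, assuming each point contributes 1 / (d*d + fudge).
   The formula and the function name are assumptions for illustration. */
double peak_ratio(double d, double fudge)
{
    return (d * d + fudge) / fudge;
}
```

At d = 1 a fudge of 0.01 gives a ratio of about 101, so each point shows up as a sharp spike, much like a scatterplot. A fudge of 0.5 gives 3, and a fudge of 3 gives about 1.33, so neighbouring pixels are nearly as bright as the point itself and the plot smears.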

If you just want to look at the data density plot casually to get an idea of what the data looks like, the default output settings will be good enough. But if you want a plot for publication, you'll want to change the output settings. The width and height of the plot in pixels can be set. By default, the range of the plot is from the minimum to the maximum of the points actually in the data, but this can be overridden. JPEG, GIF or BMP files can be output.
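The relationship between the pixel size and the plot range can be sketched as a mapping from pixel positions to data coordinates. The two helpers are hypothetical; sampling at pixel centres, and reading -yaxisdown as "the top row is the minimum y", are assumptions rather than documented behaviour.

```c
/* Data x coordinate at the centre of pixel column col (0 = left edge). */
double column_to_x(int col, int width, double minx, double maxx)
{
    return minx + (col + 0.5) / width * (maxx - minx);
}

/* Data y coordinate at the centre of pixel row row (0 = top of image).
   By default the top row is nearest maxy; if yaxisdown is non-zero
   the top row is nearest miny instead. */
double row_to_y(int row, int height, double miny, double maxy, int yaxisdown)
{
    double t = (row + 0.5) / height;

    return yaxisdown ? miny + t * (maxy - miny)
                     : maxy - t * (maxy - miny);
}
```

Widening the range past the data's own minimum and maximum simply leaves empty margin around the points, and a larger width or height samples the same range more finely.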

The colourscheme is the fun part. By default the plot uses the jet palette, in which cold colours represent areas of low density and hot colours areas of high density. But you can change this. Palettes incorporated into the program include

  • jet
  • grey
  • redgreen
  • rainbow
  • flame
  • blueyellow
  • zebra
  • unionjack
  • ocean

If you want to use your own colourscheme, it's a bit tricky but possible. Set up a GIF or BMP file with a palette corresponding to the colourscheme you want. Then pass the name of the file as the option to -colourscheme. The program then uses that palette.

Here's the same image rendered with the zebra palette.

There are too many lines. This is easily fixed by setting -levels to 12.

Now the image is suitable for black and white representation.
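The effect of -levels can be pictured as snapping each density to one of a few evenly spaced palette entries. The sketch below is a guess at the idea rather than the program's own code: the name density_to_index, the linear scaling between dmin and dmax, and the 256-entry palette are all assumptions.

```c
/* Map a density d to a palette index in 0..255, using only `levels`
   distinct indices. Linear scaling between dmin and dmax is an
   assumption of this sketch. */
int density_to_index(double d, double dmin, double dmax, int levels)
{
    double t;
    int band;

    if (levels < 2 || dmax <= dmin)
        return 0;
    t = (d - dmin) / (dmax - dmin);
    if (t < 0.0) t = 0.0;
    if (t > 1.0) t = 1.0;
    band = (int)(t * levels);           /* which of the bands d falls in */
    if (band == levels)
        band = levels - 1;              /* d == dmax goes in the top band */
    return band * 255 / (levels - 1);   /* spread bands over the palette */
}
```

With levels set to 12, every pixel lands on one of twelve palette entries, so a striped palette such as zebra would show at most twelve bands rather than hundreds of fine lines.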

Options

    Usage:

    datadensity [options] <infile.csv> <outfile.jpeg>

    infile.csv - an input file, with x and y data in columns
    outfile.jpeg - the output file for the data density plot (.jpeg, .bmp, .gif)

    Options:

    -help - display help text
    -width <N> : output image width in pixels (default 256)
    -height <N> : output image height in pixels (default 256)
    -xvals <N or name> : column of csv file to use for x values
    -yvals <N or name> : column of csv file to use for y values
           columns are 0-based, or use the name in the header line
    -colourscheme <name> - jet, grey, redgreen, (see help)
    -levels <N> - number of colours to use (default 256)
    -invertpal - invert the palette
    -yaxisdown - top of image is origin of y axis
    -xscale <value> - adjust x-axis by factor for density calculations (default 1.0)

    -minx, -maxx, -miny, -maxy
           range of plot (defaults to minimum and maximum values in csv file)
    -fudge <value> - fudge factor to use in smearing data

    Note if you have a lot of points the calculation may take some time.

Downloads

Windows executable
Source files in zip format

Source files

  • datadensity.c
  • colourschemes.c   colourschemes.h
  • csv.c   csv.h
  • gif.c   gif.h
  • savejpeg.c   jpeg.h
  • options.c   options.h
  • rbtree.c   rbtree.h
  • bmp.c   bmp.h