Learn Creative Coding (#81) - Parsing Data Files

Last episode we fetched live data from APIs -- weather for Antwerp, ISS positions, city weather portraits. We used fetch() and async/await to pull JSON from URLs and pipe it straight into our canvas. That's powerful when you have a nice REST endpoint handing you clean JSON. But not all data lives behind an API. A lot of it lives in files.

CSV files exported from spreadsheets. JSON dumps downloaded from government open data portals. TSV files from scientific instruments. XML feeds from old-school services. The data you want for your next creative coding piece might be sitting in a .csv you downloaded from Kaggle, or a .json your city publishes with bike-sharing station locations, or a .tsv that a research group uploaded to their website. Before you can map any of that to visuals, you need to parse it -- crack the file open and extract the numbers and strings hiding inside.

This episode is about that: reading structured data files in the browser and turning their contents into arrays and objects your code can work with. We'll start with CSV because it's the simplest and most common, then move to JSON files (different from JSON API responses -- same format, different loading pattern), and finish with the data cleaning and normalization steps that turn raw messy numbers into values you can actually map to visual properties.

CSV: rows and commas

CSV stands for Comma-Separated Values. It's the lowest common denominator of data formats. Every spreadsheet app can export it. Every database can dump to it. Every programming language can read it. A CSV file is just plain text where each line is a row and values within each row are separated by commas. The first row is usually headers -- column names.

Here's what a simple CSV looks like:

city,population,latitude,longitude
Antwerp,529247,51.22,4.40
Brussels,1209000,50.85,4.35
Ghent,263927,51.05,3.72
Bruges,118325,51.21,3.22
Liege,197885,50.63,5.57
Namur,112831,50.47,4.87

Six cities, four columns. Clean, readable, simple. The problem is that CSV looks simple but gets tricky fast. What if a value contains a comma? Then it's wrapped in quotes: "New York, NY",8336817,40.71,-74.01. What if a quoted value contains a quote? Then the quote is doubled: "He said ""hello""",42,0,0. These edge cases are why manual CSV parsing is a useful exercise but robust production parsing needs a library.
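To make those two quoting rules concrete, here is a minimal sketch of a quote-aware splitter for a single line. The name splitCSVLine is my own invention, and it deliberately ignores quoted fields that span multiple lines, so treat it as an illustration rather than a full CSV parser.

```javascript
// Hypothetical helper: split one CSV line into fields, honoring double quotes.
function splitCSVLine(line) {
  const fields = [];
  let current = '';
  let inQuotes = false;

  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (inQuotes) {
      if (ch === '"' && line[i + 1] === '"') {
        current += '"'; // doubled quote inside a quoted field -> literal quote
        i++;            // skip the second quote of the pair
      } else if (ch === '"') {
        inQuotes = false; // closing quote
      } else {
        current += ch;    // commas in here are just text
      }
    } else if (ch === '"') {
      inQuotes = true;    // opening quote
    } else if (ch === ',') {
      fields.push(current); // an unquoted comma ends the field
      current = '';
    } else {
      current += ch;
    }
  }
  fields.push(current); // last field has no trailing comma
  return fields;
}

console.log(splitCSVLine('"New York, NY",8336817,40.71,-74.01'));
// ["New York, NY", "8336817", "40.71", "-74.01"]
console.log(splitCSVLine('"He said ""hello""",42,0,0'));
// ['He said "hello"', "42", "0", "0"]
```

Even this small state machine is already longer than a plain split(','), which gives a feel for why the edge cases pile up.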

Loading files in the browser

In the API episode we used fetch() to grab data from URLs. Turns out fetch() also works for local files -- with a caveat. If you're serving your page from a local development server (like VS Code's Live Server, or python -m http.server), you can fetch files relative to your page:

const response = await fetch('data/cities.csv');
const text = await response.text();
console.log(text);

Notice .text() instead of .json(). CSV is plain text, not JSON, so we get the raw string and parse it ourselves. If your CSV file is in a data/ folder next to your HTML file, this works. If you're opening the HTML file directly from disk (double-clicking it, file:// protocol), fetch won't work due to CORS restrictions on local files. Use a dev server -- it's one command: npx serve or python3 -m http.server.
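One thing worth guarding against: fetch() only rejects on network failures, not on a 404, so a wrong path can hand your parser an HTML error page instead of CSV. A small sketch of a wrapper that checks for this follows. The name loadText and the injectable fetchFn parameter are assumptions of mine, added so the function can also be tried outside a browser.

```javascript
// Hypothetical helper. fetchFn is injectable for experimenting outside a
// browser; by default it uses the global fetch.
async function loadText(path, fetchFn = fetch) {
  const response = await fetchFn(path);
  if (!response.ok) {
    // fetch() resolves even for 404s, so check the status ourselves
    throw new Error(`Could not load ${path}: HTTP ${response.status}`);
  }
  return response.text();
}

// usage, inside an async function:
// const text = await loadText('data/cities.csv');
```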

Manual CSV parsing

Let's parse that cities CSV by hand. It's a good exercise because you understand exactly what's happening:

async function loadCSV(path) {
  const response = await fetch(path);
  const text = await response.text();

  const lines = text.trim().split('\n');
  const headers = lines[0].split(',');

  const rows = [];
  for (let i = 1; i < lines.length; i++) {
    const values = lines[i].split(',');
    const row = {};
    for (let j = 0; j < headers.length; j++) {
      // a short row has no value at index j, so fall back to an empty string
      row[headers[j].trim()] = (values[j] ?? '').trim();
    }
    rows.push(row);
  }

  return rows;
}

// usage
const cities = await loadCSV('data/cities.csv');
console.log(cities[0].city);        // "Antwerp"
console.log(cities[0].population);  // "529247" -- still a string!

Split by newlines to get rows. Split the first row by commas to get headers. For each subsequent row, split by commas, pair each value with its header, build an object. Done.

But look at that last line -- cities[0].population is "529247", a string, not a number. Everything coming out of a CSV is a string. If you try to use it for math or visual mapping without converting, you'll get string concatenation instead of addition, or character-by-character string comparison instead of numeric comparison. This is the most common CSV parsing bug I've seen.
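A quick illustration of how quietly this goes wrong, using nothing but plain JavaScript operators:

```javascript
const population = '529247'; // what the untyped loader hands you

console.log(population + 1);             // "5292471" -- concatenation, not addition
console.log('9' > '10');                 // true -- strings compare character by character
console.log(Number(population) + 1);     // 529248
console.log(Number('9') > Number('10')); // false
```

No errors, no warnings, just wrong results. That's what makes it nasty.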

Type conversion

You need to convert strings to their proper types. Numbers should be numbers. Dates should be dates. Booleans should be booleans. Here's a version of the loader that handles type conversion:

function parseValue(str) {
  // short rows pass undefined here, so treat that as an empty string
  const trimmed = (str ?? '').trim();

  // empty or missing
  if (trimmed === '' || trimmed === 'N/A' || trimmed === 'null') {
    return null;
  }

  // try number
  const num = Number(trimmed);
  if (!isNaN(num) && trimmed !== '') {
    return num;
  }

  // boolean
  if (trimmed.toLowerCase() === 'true') return true;
  if (trimmed.toLowerCase() === 'false') return false;

  // default: keep as string
  return trimmed;
}

async function loadCSVTyped(path) {
  const response = await fetch(path);
  const text = await response.text();

  const lines = text.trim().split('\n');
  const headers = lines[0].split(',');

  const rows = [];
  for (let i = 1; i < lines.length; i++) {
    const values = lines[i].split(',');
    const row = {};
    for (let j = 0; j < headers.length; j++) {
      row[headers[j].trim()] = parseValue(values[j]);
    }
    rows.push(row);
  }

  return rows;
}

const cities = await loadCSVTyped('data/cities.csv');
console.log(typeof cities[0].population);  // "number"
console.log(cities[0].population + 1);     // 529248 -- actual math works now

Now population is a proper number. latitude and longitude are proper numbers. You can feed them directly into your mapping functions. The parseValue function also handles common missing-data markers (N/A, empty strings, null) by converting them to JavaScript null, which is easier to check for than random strings.
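Dates are the one type from that list parseValue leaves as strings. Here is a rough sketch of a converter, assuming your file uses the YYYY-MM-DD format. The name parseISODate is mine, and other date formats would need their own handling.

```javascript
// Hypothetical helper: turn "YYYY-MM-DD" into a Date, anything else into null.
function parseISODate(str) {
  const trimmed = String(str).trim();
  if (!/^\d{4}-\d{2}-\d{2}$/.test(trimmed)) return null;
  const date = new Date(trimmed); // date-only ISO strings are parsed as UTC
  return Number.isNaN(date.getTime()) ? null : date;
}

console.log(parseISODate('2024-03-15').getUTCFullYear()); // 2024
console.log(parseISODate('15/03/2024'));                  // null (wrong format)
console.log(parseISODate('2024-13-45'));                  // null (no month 13)
```

You would call it only on columns you know hold dates, rather than folding it into parseValue, since guessing dates from arbitrary strings is fragile.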

Papa Parse: the real-world CSV library

Our manual parser works for clean data but breaks on quoted fields, escaped characters, different delimiters (tabs, semicolons -- yes, some countries use semicolons because they use commas as decimal separators), and other edge cases. For real-world messy CSV files, use Papa Parse. It's a small library that handles everything:

<script src="https://cdnjs.cloudflare.com/ajax/libs/PapaParse/5.4.1/papaparse.min.js"></script>

// parse from a URL
Papa.parse('data/cities.csv', {
  download: true,
  header: true,
  dynamicTyping: true,
  complete: function (results) {
    console.log(results.data);
    // [{city: "Antwerp", population: 529247, latitude: 51.22, longitude: 4.4}, ...]
  }
});

Three flags and it does everything: header: true uses the first row as column names, dynamicTyping: true auto-converts numbers and booleans (no manual parseValue needed), download: true tells it to fetch the file. The complete callback fires when parsing is done. results.data is your array of row objects, ready to visualize.

Papa Parse also handles the edge cases our manual parser doesn't -- quoted fields with commas inside them, multi-line values, different delimiters. If you're working with data from unknown sources (downloaded from the internet, exported from Excel), use Papa Parse. For data you control and know is clean, the manual parser is fine and avoids a dependency.

Loading JSON files

JSON files loaded from disk work the same as JSON from APIs -- use fetch() and .json():

const response = await fetch('data/earthquakes.json');
const data = await response.json();

console.log(data.features.length);  // number of earthquake records
console.log(data.features[0].properties.mag);  // magnitude of first quake

The difference from APIs is that the data is static -- it's a file on disk, not a live endpoint. It won't change between requests. This is actually an advantage for creative coding: your visualization is deterministic. Same file, same output, every time. No worrying about API rate limits, CORS issues, or servers going down. Download the data once, put it in your project folder, and it's yours forever.

GeoJSON is a common JSON variant for geographic data. It wraps coordinates and properties in a standard structure:

{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "geometry": {
        "type": "Point",
        "coordinates": [4.40, 51.22]
      },
      "properties": {
        "name": "Antwerp",
        "population": 529247
      }
    }
  ]
}

The coordinates are [longitude, latitude] -- note the order. Longitude first, latitude second. This trips people up constantly because we say "lat/lng" verbally but GeoJSON stores [lng, lat]. If your map looks mirrored or your points are in the wrong ocean, check the coordinate order.
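One way to stop thinking about the order is to unpack it once, right after loading, into named fields. A minimal sketch, assuming a FeatureCollection of Point features like the one above (pointsToRows is a made-up name):

```javascript
// Hypothetical helper: flatten GeoJSON Point features into row objects.
function pointsToRows(geojson) {
  return geojson.features
    .filter(f => f.geometry && f.geometry.type === 'Point')
    .map(f => {
      const [longitude, latitude] = f.geometry.coordinates; // lng first!
      return { ...f.properties, longitude, latitude };
    });
}

const sample = {
  type: 'FeatureCollection',
  features: [{
    type: 'Feature',
    geometry: { type: 'Point', coordinates: [4.40, 51.22] },
    properties: { name: 'Antwerp', population: 529247 }
  }]
};

console.log(pointsToRows(sample));
// [{ name: "Antwerp", population: 529247, longitude: 4.4, latitude: 51.22 }]
```

After this step the rows have the same shape as the CSV rows, so the rest of the pipeline doesn't care which format the data came from.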

Data cleaning: the unglamorous essential

Real-world data is messy. Always. Government open data has missing values. Kaggle datasets have typos. Scientific data has outliers from sensor malfunctions. If you skip cleaning and go straight to visualization, your output will be broken in subtle or obvious ways -- a single NaN in your mapping function contaminates every calculation it touches, and canvas calls like arc() silently draw nothing when handed NaN.

Common problems and fixes:

function cleanData(rows) {
  return rows
    // remove rows with missing critical values
    .filter(row => row.population !== null && row.latitude !== null && row.longitude !== null)

    // remove obvious outliers (negative population, coordinates out of range)
    .filter(row => row.population > 0)
    .filter(row => row.latitude >= -90 && row.latitude <= 90)
    .filter(row => row.longitude >= -180 && row.longitude <= 180)

    // fix common string issues
    .map(row => ({
      ...row,
      city: row.city ? row.city.trim() : 'Unknown'
    }));
}

const rawCities = await loadCSVTyped('data/cities.csv');
const cities = cleanData(rawCities);
console.log(`${rawCities.length} raw rows -> ${cities.length} clean rows`);

Filter out rows with missing values. Filter out impossible values (negative populations, coordinates on Mars). Trim whitespace from strings. Replace missing names with a default. Log how many rows you lost so you know the data quality. If cleaning removes half your dataset, the source data might be too messy to use -- find a better source.

The key mindset: never trust incoming data. Always inspect it before visualizing. console.log(data.slice(0, 5)) to see the first few rows. Check for null, undefined, NaN, empty strings, unexpected types. Five minutes of inspection saves an hour of debugging "why is my canvas blank."
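Beyond eyeballing the first few rows, a tiny per-column summary can catch problems hiding further down the file. This is a sketch under my own naming (summarizeColumn is not from any library):

```javascript
// Hypothetical helper: count what kinds of values a column actually holds.
function summarizeColumn(rows, key) {
  const summary = { total: rows.length, missing: 0, nan: 0, types: {} };
  for (const row of rows) {
    const value = row[key];
    if (value === null || value === undefined || value === '') {
      summary.missing++;
    } else if (typeof value === 'number' && Number.isNaN(value)) {
      summary.nan++;
    } else {
      // tally by JavaScript type so stray strings in a numeric column show up
      summary.types[typeof value] = (summary.types[typeof value] || 0) + 1;
    }
  }
  return summary;
}

// made-up rows with typical problems
const rows = [
  { city: 'Antwerp', population: 529247 },
  { city: 'Ghent', population: null },
  { city: 'Bruges', population: 'unknown' },
  { city: 'Namur', population: NaN }
];
console.log(summarizeColumn(rows, 'population'));
// { total: 4, missing: 1, nan: 1, types: { number: 1, string: 1 } }
```

If a column you expect to be numeric reports any strings, that's your cue to look at the raw file before drawing anything.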

Normalization: mapping raw values to visual range

This is where data parsing connects to creative coding. Raw data values live in their own ranges -- population from 100,000 to 1,200,000, latitude from 50 to 52, temperature from -5 to 40. Visual properties live in different ranges -- pixel positions from 0 to 800, hue from 0 to 360, opacity from 0 to 1. You need to map from one range to another.

Min-max normalization converts any value to the 0-1 range:

function normalize(value, min, max) {
  return (value - min) / (max - min);
}

// find the range from your data
const populations = cities.map(c => c.population);
const minPop = Math.min(...populations);
const maxPop = Math.max(...populations);

// normalize each city's population to 0-1
for (const city of cities) {
  city.popNorm = normalize(city.population, minPop, maxPop);
}

// now map to visual properties
// radius: 5px to 40px
// color hue: 200 (blue) to 0 (red)
for (const city of cities) {
  city.radius = 5 + city.popNorm * 35;
  city.hue = 200 - city.popNorm * 200;
}

Once everything is normalized to 0-1, mapping to any visual property is just multiplication and offset. radius = 5 + norm * 35 gives you a range of 5 to 40. hue = 200 - norm * 200 maps from blue (small population) to red (large population). The normalization step is the bridge between data space and visual space.
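The normalize-then-scale pattern can be folded into one function. Here is a sketch, with mapRange as my own name for it and a guard I added for the case where every value is identical (which would otherwise divide by zero):

```javascript
// Hypothetical helper: map a value from [inMin, inMax] to [outMin, outMax].
function mapRange(value, inMin, inMax, outMin, outMax) {
  if (inMax === inMin) return outMin; // all values equal: avoid dividing by zero
  const t = (value - inMin) / (inMax - inMin); // 0-1, same as normalize()
  return outMin + t * (outMax - outMin);
}

// Ghent (263,927) between Namur (112,831) and Brussels (1,209,000)
const radius = mapRange(263927, 112831, 1209000, 5, 40);
const hue = mapRange(263927, 112831, 1209000, 200, 0); // reversed output range works too
console.log(radius.toFixed(1), hue.toFixed(1)); // 9.8 172.4
```

Passing the output range in reverse (200 down to 0) handles the blue-to-red hue mapping without any special casing.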

But min-max normalization has a weakness: outliers dominate. If one city has 10 million people and the rest have 100,000-500,000, the outlier gets normalized to 1.0 and everything else squishes near 0. The visualization shows one big dot and a bunch of tiny identical dots -- useless.

Log scaling for skewed data

When your data spans orders of magnitude (populations from 1,000 to 10,000,000, or earthquake energy, which is why magnitudes are already reported on a logarithmic scale), linear normalization fails. Log scaling compresses the range:

function logNormalize(value, min, max) {
  const logMin = Math.log10(Math.max(min, 1));
  const logMax = Math.log10(max);
  const logVal = Math.log10(Math.max(value, 1));
  return (logVal - logMin) / (logMax - logMin);
}

// with populations ranging from 1,000 to 10,000,000:
// linear: 50000 -> 0.005 (invisible), 10000000 -> 1.0
// log:    50000 -> 0.42,              10000000 -> 1.0

The small city goes from being invisible at 0.005 (linear) to a respectable 0.42 (log). The large city is still the biggest but doesn't dominate everything else. For population data, income data, website traffic, earthquake energy -- anything where the ratio between smallest and largest is more than 10x -- log scaling usually produces better visual distributions.
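If you want to see the difference for your own numbers, it helps to run both normalizations side by side. The sample values below are made up, chosen to span four orders of magnitude:

```javascript
function normalize(value, min, max) {
  return (value - min) / (max - min);
}

function logNormalize(value, min, max) {
  const logMin = Math.log10(Math.max(min, 1));
  const logMax = Math.log10(max);
  const logVal = Math.log10(Math.max(value, 1));
  return (logVal - logMin) / (logMax - logMin);
}

// made-up sample values, from 1,000 to 10,000,000
const samples = [1000, 50000, 500000, 10000000];
const min = Math.min(...samples);
const max = Math.max(...samples);

const comparison = samples.map(v => ({
  value: v,
  linear: Number(normalize(v, min, max).toFixed(3)),
  log: Number(logNormalize(v, min, max).toFixed(3))
}));
console.table(comparison);
```

With this particular sample range, the linear column crams the three smaller values into the bottom 5% of the scale, while the log column spreads them out across it.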

Putting it all together: Belgian cities from CSV

Allez, time for a complete example. We'll load a CSV of Belgian cities, clean the data, normalize population with log scaling, and draw a map where each city is a circle positioned by its coordinates and sized by its population.

const canvas = document.createElement('canvas');
canvas.width = 800;
canvas.height = 600;
document.body.appendChild(canvas);
const ctx = canvas.getContext('2d');

// our typed CSV loader from earlier
async function loadCSVTyped(path) {
  const response = await fetch(path);
  const text = await response.text();
  const lines = text.trim().split('\n');
  const headers = lines[0].split(',');
  const rows = [];
  for (let i = 1; i < lines.length; i++) {
    const values = lines[i].split(',');
    const row = {};
    for (let j = 0; j < headers.length; j++) {
      const val = values[j] ? values[j].trim() : '';
      const num = Number(val);
      row[headers[j].trim()] = (!isNaN(num) && val !== '') ? num : val;
    }
    rows.push(row);
  }
  return rows;
}

async function main() {
  const cities = await loadCSVTyped('data/belgian-cities.csv');

  // clean
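  // Number.isFinite instead of a truthiness check, so a coordinate of
  // exactly 0 (equator, prime meridian) is not thrown away
  const clean = cities.filter(c =>
    c.population > 0 &&
    Number.isFinite(c.latitude) && Number.isFinite(c.longitude)
  );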

  // find ranges for normalization
  const pops = clean.map(c => c.population);
  const lats = clean.map(c => c.latitude);
  const lons = clean.map(c => c.longitude);

  const minPop = Math.min(...pops);
  const maxPop = Math.max(...pops);
  const minLat = Math.min(...lats);
  const maxLat = Math.max(...lats);
  const minLon = Math.min(...lons);
  const maxLon = Math.max(...lons);

  // draw
  ctx.fillStyle = '#0a0a1a';
  ctx.fillRect(0, 0, 800, 600);

  for (const c of clean) {
    // map longitude to x, latitude to y (inverted -- higher lat = higher on screen)
    const x = ((c.longitude - minLon) / (maxLon - minLon)) * 700 + 50;
    const y = 550 - ((c.latitude - minLat) / (maxLat - minLat)) * 500;

    // log-scale population for radius
    const logMin = Math.log10(Math.max(minPop, 1));
    const logMax = Math.log10(maxPop);
    const logPop = Math.log10(Math.max(c.population, 1));
    const popNorm = (logPop - logMin) / (logMax - logMin);

    const radius = 4 + popNorm * 28;
    const hue = 200 - popNorm * 160;

    ctx.beginPath();
    ctx.arc(x, y, radius, 0, Math.PI * 2);
    ctx.fillStyle = `hsla(${hue}, 55%, 50%, 0.5)`;
    ctx.fill();

    ctx.strokeStyle = `hsla(${hue}, 55%, 65%, 0.3)`;
    ctx.lineWidth = 1;
    ctx.stroke();

    // label for large cities
    if (popNorm > 0.6) {
      ctx.fillStyle = 'rgba(200, 200, 220, 0.6)';
      ctx.font = '11px monospace';
      ctx.textAlign = 'center';
      ctx.fillText(c.city, x, y + radius + 14);
    }
  }
}

main();

Longitude maps to x, latitude maps to y (inverted because screen y increases downward but latitude increases upward). Population goes through log normalization before mapping to radius and color. Big cities get large warm-colored circles, small cities get small blue ones. Labels appear only for the biggest cities so the map doesn't drown in text. The coordinates position each city roughly where it sits geographically -- Antwerp in the north, Liege to the east, Brussels in the middle.
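The position math is the part that's easiest to get subtly wrong, so here it is pulled out into a standalone function you can check in isolation. This is a sketch: project and its bounds object are my own naming, and like the example above it simply stretches the data's bounding box to fill the canvas, with no real map projection involved.

```javascript
// Hypothetical helper: project lon/lat into canvas pixels with a margin.
function project(lon, lat, bounds, width, height, margin) {
  const tx = (lon - bounds.minLon) / (bounds.maxLon - bounds.minLon);
  const ty = (lat - bounds.minLat) / (bounds.maxLat - bounds.minLat);
  return {
    x: margin + tx * (width - 2 * margin),
    y: height - margin - ty * (height - 2 * margin) // flip: north is up
  };
}

// bounds taken from the six cities in the sample CSV
const bounds = { minLon: 3.22, maxLon: 5.57, minLat: 50.47, maxLat: 51.22 };
console.log(project(3.22, 51.22, bounds, 800, 600, 50)); // { x: 50, y: 50 }   north-west corner
console.log(project(5.57, 50.47, bounds, 800, 600, 50)); // { x: 750, y: 550 } south-east corner
```

Checking the two extreme corners like this is a cheap sanity test: if the north-west corner doesn't land at the top-left, the y flip is missing.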

This is the same data-to-visual pipeline we'll use for everything going forward: load, clean, normalize, map. The specific visual encoding changes (circles, lines, particles, whatever) but the pipeline stays the same.

Inline data: when you don't have a file

Sometimes you don't have a separate file -- the data is small enough to put directly in your code. CSV as a string works fine:

const csvString = `city,population,lat,lon
Antwerp,529247,51.22,4.40
Brussels,1209000,50.85,4.35
Ghent,263927,51.05,3.72
Bruges,118325,51.21,3.22`;

const lines = csvString.trim().split('\n');
const headers = lines[0].split(',');
const data = [];
for (let i = 1; i < lines.length; i++) {
  const vals = lines[i].split(',');
  const row = {};
  for (let j = 0; j < headers.length; j++) {
    const v = vals[j].trim();
    const n = Number(v);
    row[headers[j].trim()] = (!isNaN(n) && v !== '') ? n : v;
  }
  data.push(row);
}

This is handy for prototyping. You paste a few rows of real data directly into your sketch and iterate on the visual encoding without worrying about file loading. Once the visual works, swap in the full dataset from a file. I do this constantly -- start with 5-10 rows inline, get the mapping right, then scale up. :-)
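To make the swap from inline string to file painless, the parsing loop can live in one function that takes text and doesn't care where it came from. A sketch follows. parseCSVString is a name I made up, and I added tolerance for Windows line endings and short rows, which the loop above does not have.

```javascript
// Hypothetical helper: the same parsing loop, wrapped so inline strings and
// fetched text can share it.
function parseCSVString(text) {
  const lines = text.trim().split(/\r?\n/); // tolerate Windows line endings
  const headers = lines[0].split(',').map(h => h.trim());
  return lines.slice(1).map(line => {
    const vals = line.split(',');
    const row = {};
    headers.forEach((header, j) => {
      const v = (vals[j] ?? '').trim(); // short rows get empty strings
      const n = Number(v);
      row[header] = (!Number.isNaN(n) && v !== '') ? n : v;
    });
    return row;
  });
}

const inline = parseCSVString(`city,population,lat,lon
Antwerp,529247,51.22,4.40
Brussels,1209000,50.85,4.35`);
console.log(inline[1].city, inline[1].population); // Brussels 1209000

// later, inside an async function:
// const full = parseCSVString(await (await fetch('data/cities.csv')).text());
```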

Filtering and aggregation

You rarely want to visualize every single row. A dataset with 50,000 entries produces 50,000 visual elements, which is either a density texture (intentional) or a cluttered mess (accidental). Filtering and aggregating before visualization gives you control over density.

// filter: only cities with population > 100,000
const largeCities = cities.filter(c => c.population > 100000);

// aggregate: average population per region
const regions = {};
for (const c of cities) {
  if (!regions[c.region]) {
    regions[c.region] = { total: 0, count: 0 };
  }
  regions[c.region].total += c.population;
  regions[c.region].count += 1;
}

const regionAvgs = Object.entries(regions).map(([name, r]) => ({
  region: name,
  avgPopulation: r.total / r.count
}));

Filtering cuts down the number of elements. Aggregation combines elements into summaries. Both are creative decisions -- which cities do you include? Do you show individual cities or regional averages? A map of all 581 Belgian municipalities looks very different from a map of the 10 largest cities or the 10 provinces. The level of aggregation shapes the story.
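A threshold filter is one way to thin the data, a "top N" cut is another. Here is a minimal sketch of the latter (topN is a hypothetical helper name):

```javascript
// Hypothetical helper: the n rows with the largest value in a given column.
function topN(rows, key, n) {
  return [...rows]                    // copy so the original order is untouched
    .sort((a, b) => b[key] - a[key])  // descending by the chosen column
    .slice(0, n);
}

const cities = [
  { city: 'Antwerp', population: 529247 },
  { city: 'Brussels', population: 1209000 },
  { city: 'Ghent', population: 263927 },
  { city: 'Bruges', population: 118325 }
];
console.log(topN(cities, 'population', 2).map(c => c.city));
// ["Brussels", "Antwerp"]
```

The copy matters because sort() reorders an array in place, and you usually want the original row order intact for other passes over the data.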

What's coming

We can parse data files now. CSV, JSON, inline strings -- we can crack them open, clean the contents, normalize the values, and pipe everything into our canvas. But we've been mapping data to basic properties: position, size, color. There's a whole vocabulary of visual encodings beyond those basics -- line thickness, texture density, animation speed, shape complexity, sound frequency. The next step is building a richer visual language for data. How many dimensions of data can you encode simultaneously before the visual becomes noise? Where's the line between information and ornamentation? That's where the art really starts.

't Komt erop neer... (what it comes down to)

CSV (Comma-Separated Values) is the simplest data format: one row per line, values separated by commas, first row is headers. Every spreadsheet and database can export it. Load with fetch() and .text(), then split by newlines and commas. Manual parsing works for clean data but breaks on quoted fields and edge cases
Everything from a CSV is a string. Population "529247" is not a number until you convert it with Number() or parseFloat(). Missing values show up as empty strings, "N/A", or "null" -- convert them to JavaScript null for clean handling. Type conversion is boring but skipping it causes subtle bugs
Papa Parse is a small library that handles all CSV edge cases: quoted fields with commas inside, escaped characters, different delimiters (tabs, semicolons), auto type conversion with dynamicTyping: true. Use it for data from unknown sources. For data you control, manual parsing avoids a dependency
JSON files load with fetch() and .json() just like API responses, but they're static files in your project folder. No rate limits, no CORS issues, no server dependency. Download data once, save as JSON, load locally forever. GeoJSON is a common variant for geographic data -- watch out for [longitude, latitude] coordinate order (not lat/lng)
Data cleaning is non-negotiable. Real-world data has missing values, impossible numbers, wrong types, and outliers. Filter out rows with null critical values, remove obvious outliers (negative populations, coordinates out of range), trim whitespace. Log how many rows you lose so you know the data quality
Min-max normalization maps any value to 0-1: (value - min) / (max - min). This is the bridge between data space and visual space. Once normalized, mapping to any visual property is just multiplication and offset. But outliers dominate -- one huge value squishes everything else near zero
Log scaling fixes skewed distributions. When data spans orders of magnitude (populations from 1,000 to 10,000,000), Math.log10() compresses the range so small values are still visible. A city with 50,000 people goes from 0.005 (invisible, linear) to 0.42 (visible, log). Use log scaling whenever the largest value is more than 10x the smallest
The full pipeline: load the file (fetch + text/json), clean the data (filter nulls and outliers), normalize values (min-max or log), then map to visual properties (position, size, color, opacity). This pipeline stays the same regardless of what visual encoding you choose -- the specifics change, the structure doesn't
Inline data (CSV as a string literal in your code) is great for prototyping. Paste 5-10 rows, get the visual mapping right, then swap in the full dataset from a file. Always start small, iterate on the encoding, then scale up
Filtering and aggregation are creative decisions. Showing all 50,000 rows vs the top 100 vs regional averages tells completely different stories from the same dataset. The level of detail you choose shapes what the viewer sees and feels

Sallukes! Thanks for reading.

Hive account: @femdev