Lecture 2: How Computers See the World: Turning Everything into Numbers

[Figure: a cat photo, the word “Hello”, and a sound wave, each converted into its numerical form: a grid of numbers (matrix) for the photo, a list of numbers (vector) such as [0.1, 0.7, -0.2...] for the word, and a digitized waveform for the sound.]

Series: The Sequentia Lectures: Unlocking the Math of AI
Part 1: The Foundation – Thinking Like a Machine

In our last lecture, we established that an AI model is like a recipe that takes in “inputs” to produce an “output.” But this raises a fundamental question: if the model is built on mathematics, how can it possibly understand a photograph of a cat, the words in a sentence, or the sound of a voice?

The answer is both simple and profound: it can’t. Not in the way we do. Before an AI can perform any mathematical operations, it must first translate our rich, messy, analog world into its own native language: numbers.

This process of converting complex data into a numerical format is a crucial, non-negotiable first step in any AI pipeline. Today, we’ll explore how this transformation happens for different types of data.

Images: A Mosaic of Pixels

To us, a picture is a holistic scene filled with objects and meaning. To a computer, a digital image is nothing more than a giant grid of tiny dots called pixels. Each pixel has a specific color, and that color can be represented by numbers.

Grayscale (Black & White) Images: This is the simplest case. An image is a 2D grid (a matrix) in which each pixel is represented by a single number. In the common 8-bit format, that number ranges from 0 (pure black) to 255 (pure white). A dark grey pixel might be 50, while a light grey one could be 200. The entire image becomes a large table of numbers.
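As a minimal sketch of this idea, here is a tiny 3x3 grayscale image written as a plain Python list of lists. The pixel values are invented for illustration; a real photo would have hundreds or thousands of rows and columns.

```python
# A hypothetical 3x3 grayscale image: each number is one pixel's brightness,
# from 0 (pure black) to 255 (pure white).
image = [
    [0, 50, 200],
    [50, 200, 255],
    [200, 255, 255],
]

height = len(image)      # number of rows
width = len(image[0])    # number of columns

# Look up a single pixel by its row and column.
top_left = image[0][0]   # a pure black pixel
center = image[1][1]     # a light grey pixel

# Add up every pixel, then divide by the pixel count for the average brightness.
total = sum(sum(row) for row in image)
average_brightness = total / (height * width)

print(height, width, top_left, center)
print(average_brightness)
```

Even a simple question such as "how bright is this image overall?" turns into arithmetic on the grid, which is exactly the kind of operation a model can perform.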

Color Images (RGB): Color images are slightly more complex. They are typically represented by three stacked grids, one for each primary color channel: Red, Green, and Blue (RGB). Each pixel is therefore represented by three numbers (each from 0 to 255), indicating the intensity of red, green, and blue light needed to create its specific color. A bright red pixel might be (255, 0, 0), while a purple one could be (128, 0, 128). A single color photo is thus a massive 3D block of numbers (width x height x 3 channels).
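The same idea can be sketched for color. The 2x2 image below is made up for illustration, and it stores each pixel as a (red, green, blue) triple, which is one common layout among several.

```python
# A hypothetical 2x2 color image. Each pixel is a (red, green, blue) triple,
# with every channel value between 0 and 255.
RED = (255, 0, 0)
PURPLE = (128, 0, 128)
BLACK = (0, 0, 0)
WHITE = (255, 255, 255)

image = [
    [RED, PURPLE],
    [BLACK, WHITE],
]

height = len(image)
width = len(image[0])
channels = len(image[0][0])

# Total count of numbers needed to store the image.
total_numbers = height * width * channels

# Pull out just the red channel as its own 2D grid.
red_channel = [[pixel[0] for pixel in row] for row in image]

print((height, width, channels), total_numbers)
print(red_channel)
```

The count grows quickly: by the same arithmetic, a 1920 x 1080 photo needs 1920 x 1080 x 3 = 6,220,800 numbers.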

When an AI “sees” a cat, it’s not seeing fur and whiskers. It’s processing a giant matrix of numbers and learning to recognize the numerical patterns that, to us, signify “cat.”

Text: From Words to Vectors

Text is even more abstract. How do you turn the word “apple” into a number? The most basic method is to create a dictionary (or “vocabulary”) of all unique words that the model might encounter.

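Create a Vocabulary: Scan a large body of text and list every unique word. Let’s say our vocabulary is tiny: “a”, “apple”, “is”, “red”.
Assign Unique IDs: Each word is assigned a unique integer ID, for example {“a”: 1, “apple”: 2, “is”: 3, “red”: 4}. The sentence “a red apple” then becomes the sequence 1, 4, 2.
One-Hot Encoding: Each ID is turned into a list as long as the vocabulary, filled with 0s except for a single 1 at the word’s position. With our four-word vocabulary, “red” (ID 4) becomes [0, 0, 0, 1]. This is simple but often inefficient: with a realistic vocabulary of tens of thousands of words, every word needs a list that long, and the list says nothing about what the word means.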
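The three steps above can be sketched in a few lines of Python. The helper name one_hot is chosen here for illustration, and the vocabulary is the toy one from the example.

```python
# The toy vocabulary from the text: each word maps to a unique integer ID.
vocabulary = {"a": 1, "apple": 2, "is": 3, "red": 4}


def one_hot(word, vocab):
    """Return a list of 0s with a single 1 at the position of the word's ID."""
    vector = [0] * len(vocab)
    vector[vocab[word] - 1] = 1  # IDs start at 1, list positions start at 0
    return vector


sentence = "a red apple"
words = sentence.split()

ids = [vocabulary[word] for word in words]
encoded = [one_hot(word, vocabulary) for word in words]

print(ids)      # [1, 4, 2]
print(encoded)  # [[1, 0, 0, 0], [0, 0, 0, 1], [0, 1, 0, 0]]
```

Each encoded word is as long as the whole vocabulary and is almost entirely zeros, which is why this representation becomes wasteful as the vocabulary grows.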

A more powerful technique is called Word Embedding. Instead of a single ID, each word is mapped to a list of several hundred numbers, called a vector. This vector isn’t random; it’s learned in such a way that it captures the word’s meaning and context. Words with similar meanings will have similar vectors. For example, the vectors for “king” and “queen” will be mathematically closer to each other than the vectors for “king” and “banana.”

This process of turning words or sentences into these meaningful lists of numbers is called vectorization. It transforms language into a geometric space where an AI can perform calculations, measure distances, and find relationships.
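To make "closer" concrete, here is a small sketch that compares word vectors using cosine similarity, one common way (though not the only one) to measure how alike two vectors are. The three-number vectors below are invented by hand purely for illustration; real embeddings are learned from data and have far more dimensions.

```python
import math

# Hypothetical 3-number embeddings, invented for illustration only.
embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "banana": [0.1, 0.0, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; values near 1.0 mean
    the vectors point in nearly the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


king_queen = cosine_similarity(embeddings["king"], embeddings["queen"])
king_banana = cosine_similarity(embeddings["king"], embeddings["banana"])

print("king vs queen: ", round(king_queen, 3))
print("king vs banana:", round(king_banana, 3))
```

With these made-up vectors, “king” and “queen” score much higher than “king” and “banana”, which mirrors the behavior that learned embeddings are trained to produce.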

Sound: The Shape of a Wave

A sound is a continuous wave of pressure. To digitize it, we measure this wave many thousands of times every second; CD-quality audio, for example, uses 44,100 samples per second. Each sample records the amplitude (height) of the wave at that precise moment in time as a number. A one-second audio clip can therefore be represented as a long list of tens of thousands of numbers that closely trace the shape of the sound wave. An AI then learns to identify the numerical patterns that correspond to specific spoken words, musical notes, or other sounds.
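As a rough sketch of sampling, the code below generates one second of a pure tone instead of recording a real sound. The sample rate of 44,100 per second matches CD-quality audio; the 440 Hz pitch is an arbitrary choice for this example.

```python
import math

SAMPLE_RATE = 44100  # samples per second (CD-quality audio)
FREQUENCY = 440.0    # pitch of the tone in hertz (assumed for this example)
DURATION = 1.0       # length of the clip in seconds

num_samples = int(SAMPLE_RATE * DURATION)

# Measure the wave's amplitude at evenly spaced moments in time.
samples = [
    math.sin(2 * math.pi * FREQUENCY * (n / SAMPLE_RATE))
    for n in range(num_samples)
]

print(len(samples))
print([round(s, 3) for s in samples[:5]])
```

The resulting list is all a model receives: one number per sample, each between -1 and 1, with no built-in notion of "note" or "word".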

The Universal Language of AI

Whether it’s an image, a sentence, a sound, or even more abstract data like your movie preferences, the first step is always the same: turn it into numbers.

This numerical representation, often in the form of a vector (a list of numbers) or a matrix (a grid of numbers), is the universal input format that allows our “black box” model from Part 1 to start its work. Without this crucial translation step, the mathematical engine of AI would have no fuel to run on.

Now that we understand how AI perceives the world, our next lecture will finally start to open the black box and show the simplest possible model in action: using a straight line to make predictions.
