Neural Networks and Deep Learning — CNNs, Transformers, and Large Language Models

Neural Networks and Deep Learning: How AI Sees, Reads, and Speaks

From One Artificial Neuron to Systems That Write, See, and Speak

In 2023, a neural network passed the bar exam. Another generated a photorealistic image of an astronaut riding a horse on Mars from a single text description. Another translated between 200 languages in real time. None of them were explicitly programmed to perform these tasks. They are all built from the same fundamental building block: an artificial neuron - a mathematical function inspired by, but very different from, the neurons in your brain. Stack millions of these neurons in layers, feed them enough data, and something remarkable emerges. Not consciousness. Not understanding. But a capacity for pattern recognition so powerful that the outputs are often indistinguishable from human work.

Neural networks are not new. The core mathematics dates to the 1940s. But for decades they were a curiosity - too slow to train, too small to be useful, lacking the data to learn from. Three things converged around 2012 to change this: GPUs made parallel computation thousands of times faster, the internet generated training data at unprecedented scale, and algorithmic breakthroughs (better activation functions, dropout, batch normalization) made deep networks trainable. The result was an explosion that took the field from academic research to the most commercially valuable technology on Earth within a decade.

$100M+ - Estimated cost to train GPT-4: compute, data, and engineering combined
3.5% - Error rate of ResNet on ImageNet in 2015, surpassing human-level accuracy of 5.1%
1.8T - Estimated parameters in GPT-4, each one a number learned during training
2017 - Year the Transformer paper, "Attention Is All You Need," was published - it changed everything

The Artificial Neuron: The Simplest Possible Starting Point

Before you can understand a neural network with billions of parameters, you need to understand a network with one neuron. A single artificial neuron does four things, in order: it takes inputs, multiplies each by a weight, adds them up, and passes the sum through an activation function. That is the entire computation.

Figure: Anatomy of a single artificial neuron. Inputs x1 = 0.5, x2 = 0.8, x3 = 0.2 are multiplied by weights w1 = 0.4, w2 = 0.7, w3 = -0.2, summed with a bias, and passed through an activation function: (0.5 × 0.4) + (0.8 × 0.7) + (0.2 × -0.2) + bias = 0.92 → activation(0.92) = 0.72.

The inputs (x1, x2, x3) are the data. For a spam detector, these might be: x1 = number of exclamation marks, x2 = presence of "free money" in subject, x3 = sender reputation score. Each input is a number.

The weights (w1, w2, w3) control how much each input matters. A large positive weight means that input strongly influences the output. A negative weight means that input reduces the output. A weight near zero means that input is irrelevant. The weights are the parameters the network learns during training - they are the "knowledge" of the model.

The sum is simply: (x1 * w1) + (x2 * w2) + (x3 * w3) + bias. The bias is an extra parameter that shifts the output, like the y-intercept in a line equation.

The activation function introduces non-linearity. Without it, stacking neurons would be pointless - any stack of linear functions is just another linear function. The activation function lets the network learn curved, complex decision boundaries. ReLU (Rectified Linear Unit) is the most common: if the sum is positive, pass it through unchanged; if negative, output zero. Simple, but it works.
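The figure's neuron fits in a few lines of code. A minimal sketch - the bias of 0.2 is inferred from the figure's sum of 0.92, and the 0.72 output implies a sigmoid activation (the ReLU discussed above would pass 0.92 through unchanged):

```python
import math

def neuron(inputs, weights, bias, activation):
    """Weighted sum of inputs plus bias, passed through an activation function."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)

def relu(z):
    # If the sum is positive, pass it through unchanged; if negative, output zero.
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Values from the figure; bias = 0.2 is inferred from the stated sum of 0.92.
inputs  = [0.5, 0.8, 0.2]
weights = [0.4, 0.7, -0.2]
bias    = 0.2

print(round(neuron(inputs, weights, bias, relu), 2))     # 0.92
print(round(neuron(inputs, weights, bias, sigmoid), 2))  # 0.72
```

That is the entire computation: a weighted sum and one non-linear function.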

Key Insight

A single neuron is just a weighted sum followed by a non-linear function. One neuron can learn to classify simple, linearly separable things - like separating spam from non-spam when the boundary is a straight line. The power of neural networks comes not from individual neuron complexity but from connecting thousands of simple neurons in layers. Each layer learns increasingly abstract representations of the input data.

Layers: Why Depth Creates Intelligence

A neural network is neurons organized into layers. Data enters the input layer, flows through one or more hidden layers, and exits through the output layer. "Deep learning" simply means a network with many hidden layers - typically dozens to hundreds.

Figure: A feed-forward network, input layer → hidden layers → output. Raw pixel values (x1-x4) enter the input layer; Hidden Layer 1 learns edges and simple textures; Hidden Layer 2 learns shapes and complex patterns; the output layer produces the final classification (cat vs. dog).

Each layer in a deep network learns a different level of abstraction. For an image recognition network, this hierarchy is well understood:

Layer 1 detects edges - horizontal, vertical, diagonal lines. These are the simplest visual features.

Layers 2-3 combine edges into shapes - corners, curves, simple textures like fur or scales.

Layers 4-5 combine shapes into parts - eyes, ears, noses, wheels, windows.

Deeper layers combine parts into objects - faces, cats, cars, buildings.

This is called representation learning. Each layer transforms the raw data into a progressively more useful representation. The input layer sees pixels (numbers with no meaning). The output layer sees "cat" or "dog." The magic is in the middle layers, which learn the intermediate representations that make this transformation possible.

The "deep" in deep learning refers to the depth - the number of hidden layers. A network with 2 hidden layers can learn edges and simple shapes. A network with 10 layers can learn complex objects. ResNet, the architecture that surpassed human accuracy on ImageNet in 2015, had 152 layers. Modern language models have hundreds of layers. More depth allows more abstraction, but also makes training harder - a problem solved by techniques like skip connections, batch normalization, and careful initialization.
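A forward pass through stacked layers is just a chain of weighted sums and activations. A minimal sketch with made-up layer sizes and random, untrained weights - the point is the shape of the computation, not useful predictions:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Illustrative sizes: 4 inputs -> 5 neurons -> 3 neurons -> 2 outputs (cat vs. dog).
# These weights are random; training (next section) is what makes them useful.
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)
W2, b2 = rng.normal(size=(3, 5)), np.zeros(3)
W3, b3 = rng.normal(size=(2, 3)), np.zeros(2)

x = np.array([0.5, 0.8, 0.2, 0.1])   # "raw pixel values"
h1 = relu(W1 @ x + b1)               # hidden layer 1: first representation
h2 = relu(W2 @ h1 + b2)              # hidden layer 2: more abstract representation
probs = softmax(W3 @ h2 + b3)        # output layer: two class probabilities

print(probs, probs.sum())            # two probabilities summing to 1
```

Every deep network, however large, is a longer version of this chain.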

How Neural Networks Learn: Backpropagation

Training a neural network follows the same general loop as any machine learning model: make a prediction, measure the error, adjust parameters, repeat. But the mechanism for adjusting parameters in a multi-layer network is specific and elegant. It is called backpropagation.

Forward pass: data flows through layers, produces prediction
Loss: measure how wrong the prediction is
Backward pass: calculate each weight's contribution to the error
Update: adjust weights to reduce error

The forward pass is straightforward: input data enters the first layer, each neuron computes its weighted sum and activation, the output feeds into the next layer, and so on until the output layer produces a prediction.

The loss function quantifies the error. For classification, cross-entropy loss measures how far the predicted probabilities are from the true labels. If the network predicts "92% cat" and the image is actually a cat, the loss is small. If it predicts "20% cat," the loss is large.

The backward pass (backpropagation) is where the learning happens. Using the chain rule from calculus, the algorithm calculates how much each weight in every layer contributed to the total error. This produces a gradient for every weight - a number that says "if you increase this weight by a tiny amount, the error will change by this much."

Gradient descent uses these gradients to update the weights. If a weight's gradient indicates that increasing it would reduce the error, the weight is increased slightly. If decreasing it would help, it is decreased. The size of these adjustments is controlled by the learning rate - too large and the network overshoots, oscillating wildly; too small and training takes forever.

This entire cycle - forward pass, loss, backward pass, update - happens once per batch of training examples. A batch might be 32 or 64 examples. Training GPT-4 involved running this cycle billions of times across trillions of text tokens.
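The cycle can be made concrete with the smallest possible case: one weight, fitting y = 2x. A toy sketch - real frameworks batch examples and compute gradients automatically, but the loop is the same:

```python
# Learn w in y = w * x from examples generated by y = 2x.
# Same cycle as any network: forward pass, loss, backward pass, update.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (input, target) pairs
w = 0.0                                       # start with an uninformed weight
learning_rate = 0.05                          # too large -> oscillation; too small -> slow

for step in range(100):
    for x, y in data:
        pred = w * x                  # forward pass
        loss = (pred - y) ** 2        # squared-error loss
        grad = 2 * (pred - y) * x     # d(loss)/dw, via the chain rule
        w -= learning_rate * grad     # gradient descent update

print(round(w, 3))  # converges to 2.0
```

Backpropagation is this same chain-rule calculation, applied layer by layer to billions of weights at once.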

Real-World Example

When AlexNet won the ImageNet competition in 2012, it trained for about 6 days on two NVIDIA GTX 580 GPUs. Training GPT-4 in 2023 required an estimated 25,000 NVIDIA A100 GPUs running for approximately 90-100 days. The compute cost went from roughly $1,000 to over $100 million. The algorithm (backpropagation) is fundamentally the same. What changed was the scale: more layers, more parameters, more data, more compute. Scale, it turns out, is the secret ingredient that unlocked capabilities nobody predicted.

Convolutional Neural Networks: Specialized for Vision

Standard neural networks treat every input independently. Feed them an image and they see a flat list of pixel values with no spatial structure. A pixel in the top-left corner has no relationship to its neighbors. This is wildly inefficient for images, where spatial relationships are everything - an eye is defined by the arrangement of pixels, not their individual values.

Convolutional Neural Networks (CNNs) solve this by using filters (also called kernels) - small matrices (typically 3x3 or 5x5) that slide across the image, computing a dot product at each position. Each filter detects a specific pattern. One filter might detect horizontal edges. Another detects vertical edges. Another detects a specific color gradient.

Figure: CNN feature detection, from pixels to recognition. A 224 × 224 input image passes through Layer 1 (edges: lines, gradients), Layers 2-3 (shapes: curves, corners), and Layers 4+ (parts: eyes, ears, noses) to produce the classification "Cat" at 97.3% confidence. Inset, ImageNet error rate over time: 25.8% in 2011 (hand-designed features), 16.4% in 2012 (AlexNet, first deep CNN), 3.5% in 2015 (ResNet, surpassing the human benchmark of 5.1%).

The key insight of CNNs is parameter sharing. The same filter slides across the entire image, so the network learns to detect an edge regardless of where in the image it appears. This dramatically reduces the number of parameters compared to a fully connected network and builds in translation invariance - a cat in the top-left corner and a cat in the bottom-right corner activate the same feature detectors.
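A convolution is simple to write by hand. A minimal sketch of a single 3×3 filter sliding over a tiny image (real CNNs learn the filter values; this one is a classic hand-built vertical-edge detector):

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide a kernel across an image, computing a dot product at each position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# A 6x6 image: dark on the left, bright on the right - a vertical edge.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# The same 3x3 kernel - just 9 shared parameters - is applied at every
# position: this is parameter sharing, and it detects the edge wherever it is.
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]])

response = convolve2d(image, kernel)
print(response)  # strong activations only at positions spanning the edge
```

A full CNN layer runs many such filters in parallel and feeds their responses to the next layer.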

The progression from edge detection to object recognition is not programmed. Nobody tells the CNN which edges to look for or how to combine them into eyes and ears. The network learns these feature hierarchies automatically from the training data. This is the power of deep learning: it discovers the right intermediate representations without human guidance.

The ImageNet competition proved the approach. In 2011, the best systems used hand-designed image features and achieved 25.8% error. In 2012, AlexNet - a deep CNN - dropped the error to 16.4%, a massive leap. By 2015, ResNet reached 3.5%, surpassing human-level accuracy of 5.1%. The classification challenge was considered solved, and the competition wound down soon after.

Key Insight

Transfer learning made CNNs practical for everyone. Training a CNN from scratch on ImageNet requires weeks on expensive GPUs. But you can take a pretrained model (already trained on 14 million images), freeze the early layers (which detect universal features like edges and shapes), and retrain only the final layers on your specific task. A dermatologist with 2,000 labeled skin lesion images can fine-tune a pretrained CNN and get performance competitive with years of specialist training. This is why deep learning spread so rapidly - you do not need Google's data or compute budget to use it.

Recurrent Networks and the Rise of Transformers

CNNs dominate vision. But text, speech, and time-series data are sequences - the order matters. "The dog bit the man" means something entirely different from "the man bit the dog." CNNs do not naturally handle sequences because they process all spatial positions simultaneously.

Recurrent Neural Networks (RNNs) were designed for sequences. An RNN processes one element at a time (one word, one time step) and maintains a hidden state - a memory of what it has seen so far. When processing the word "sat" in "the cat sat on the mat," the hidden state contains information about "the" and "cat" that influences how "sat" is interpreted.

The problem with RNNs is long-range dependencies. By the time the network reaches the 500th word in a document, the information from the 1st word has been diluted through hundreds of processing steps. RNNs suffer from the vanishing gradient problem: gradients shrink exponentially as they propagate backward through time, making it nearly impossible to learn relationships between distant elements. LSTMs (Long Short-Term Memory) and GRUs (Gated Recurrent Units) mitigated this with gating mechanisms, but the fundamental limitation remained: RNNs process sequentially, one step at a time, which is slow and cannot fully leverage parallel hardware.
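The vanishing gradient is easy to see numerically: backpropagation through time multiplies one local derivative per step, and any factor slightly below 1 compounds. A two-line illustration (0.9 is an arbitrary per-step derivative magnitude):

```python
# Backpropagating through a sequence multiplies one local derivative per step.
# If each is slightly below 1, the gradient shrinks exponentially with length.
local_derivative = 0.9   # illustrative per-step derivative magnitude

for steps in (10, 100, 500):
    gradient = local_derivative ** steps
    print(f"after {steps:3d} steps: gradient ~ {gradient:.2e}")
```

By 500 steps the gradient is around 10^-23 - far too small to drive any weight update, which is why distant words effectively cannot teach the network anything.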

In 2017, a team at Google published "Attention Is All You Need," introducing the Transformer architecture. It replaced recurrence with self-attention - a mechanism that lets every element in the sequence directly attend to every other element, regardless of distance.

Figure: Transformer self-attention on the sentence "The cat sat on the mat because it was tired." When processing "it," the attention mechanism assigns 52% of its weight (0.52) to "cat"; every other word receives between 2% and 9%. The model learned that pronouns attend to their referents - no rule was programmed. Contrast: an RNN processing "it" must wait for all prior words, one step at a time; a transformer attends to every word at once, in parallel.

In the example above, when the transformer processes the word "it," the attention mechanism computes a weight for every other word in the sentence. "Cat" gets the highest weight (0.52) because the model has learned that "it" in this context refers to "the cat." This is not a programmed rule. The attention patterns emerge from training on trillions of words.

Transformers won for two reasons. First, parallelism: because every word attends to every other word simultaneously (not sequentially), transformers can fully utilize GPU hardware that excels at parallel computation. RNNs processed one word at a time - a serial bottleneck. Second, long-range dependencies: attention allows direct connections between any two positions in the sequence, regardless of distance. A word at position 1 can directly influence a word at position 10,000 without the signal degrading through intermediate steps.
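The core of self-attention fits in a dozen lines. A single-head sketch with random embeddings and projection matrices (real models learn these, use many heads, and add masking and positional information):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention: every position attends to every other."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # similarity of every position to every other
    weights = softmax(scores)         # each row is an attention distribution
    return weights @ V, weights       # output: weighted mixture of value vectors

rng = np.random.default_rng(0)
seq_len, d_model = 10, 8              # 10 tokens, e.g. the example sentence above
X = rng.normal(size=(seq_len, d_model))            # token embeddings (random here)
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out, weights = self_attention(X, Wq, Wk, Wv)
print(weights.shape)      # 10x10: every token attends to every token
print(weights[7].sum())   # row 7 ("it") is a probability distribution summing to 1
```

Note that nothing here is sequential: the whole 10×10 weight matrix is computed in one batch of matrix multiplies, which is exactly what GPUs are built for.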

Every major language model is a transformer: GPT-4, Claude, Gemini, Llama, Mistral, BERT, T5. The transformer architecture also dominates in vision (Vision Transformers), protein structure prediction (AlphaFold), music generation, and code generation. It is the single most impactful architecture in the history of deep learning.

Large Language Models: What They Actually Are

GPT-4, Claude, Gemini, Llama - these are large language models (LLMs). Strip away the marketing, and an LLM is a transformer neural network trained to predict the next token in a sequence. That is it. The model reads a sequence of tokens (words or word fragments) and predicts which token is most likely to come next.

The training process is conceptually simple: take trillions of tokens of text from the internet, books, code repositories, and academic papers. For every prefix of that text, train the model to predict the token that comes next. "The capital of France is" - predict "Paris." Repeat this billions of times across trillions of tokens, and the model builds an internal representation of language, facts, reasoning patterns, code syntax, and more. It does not memorize the text - it learns statistical regularities at such a fine grain that the output appears creative and intelligent.

What LLMs Can Do

Text generation - Write essays, stories, emails, code

Translation - Between 100+ languages

Summarization - Condense long documents

Question answering - Answer factual and reasoning questions

Code generation - Write, debug, and explain code

Analysis - Extract patterns from unstructured text

What LLMs Cannot Do

Reason about the physical world - No embodied experience

Access real-time information - Without external tools

Be consistently factual - Hallucinations are inherent

Do precise math - They predict likely tokens, not compute

Have genuine understanding - Pattern matching, not comprehension

Remember across sessions - Context window only, no permanent memory

Text generation works by iteratively predicting one token at a time. To generate a response, the model reads the prompt, predicts the most probable next token, appends it to the sequence, and repeats. Each new token is predicted based on everything that came before it. A 1,000-word response requires approximately 1,300 prediction steps. Temperature controls randomness: at temperature 0, the model always picks the single most likely token (deterministic but repetitive); at higher temperatures, it samples from the probability distribution (more creative but less reliable).
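Temperature is just a divisor applied to the model's raw scores before they become probabilities. A small sketch with hypothetical next-token scores (the words and logit values are invented for illustration):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw model scores into a sampling distribution.
    Low temperature sharpens the distribution; high temperature flattens it."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                             # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for the next token after "The capital of France is":
logits = {"Paris": 9.0, "Lyon": 6.0, "pizza": 2.0}

for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(list(logits.values()), t)
    print(f"T={t}: " + ", ".join(f"{w}={p:.3f}" for w, p in zip(logits, probs)))
# Near T=0, almost all probability mass lands on "Paris" (deterministic);
# at higher T, the distribution spreads out (more varied, less reliable).
```

Generation repeats this pick-append-repeat loop once per token, which is why long responses take proportionally longer.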

The Economics of Scale: Training Costs and Scaling Laws

Training large neural networks is among the most expensive engineering endeavors in the world. The costs are driven by three factors: data, compute, and talent.

AlexNet (2012): ~$1,000
GPT-2 (2019): ~$50,000
GPT-3 (2020): ~$4.6M
Gemini Ultra (2023): ~$30-50M
GPT-4 (2023): $100M+

Scaling laws, discovered by researchers at OpenAI and later confirmed independently, show a predictable relationship: model performance improves as a power law with increased compute, data, and parameters. Double the compute and performance improves by a predictable, measurable amount. This predictability is what justifies the enormous investment - companies can estimate the performance of a model before training it.
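A power law of this shape is easy to state explicitly. The sketch below uses made-up constants - real scaling-law papers fit the exponent and scale empirically - but it shows why doubling compute buys a predictable, constant-factor improvement:

```python
# Illustrative power-law scaling: loss falls as compute rises.
# The constants a and alpha here are invented for illustration; real
# scaling-law work estimates them by fitting many training runs.
a, alpha = 10.0, 0.05   # hypothetical fit: loss = a * compute^(-alpha)

def predicted_loss(compute):
    return a * compute ** (-alpha)

for compute in (1e21, 2e21, 4e21):   # each step doubles the compute budget
    print(f"compute {compute:.0e} FLOPs -> predicted loss {predicted_loss(compute):.3f}")

# Every doubling multiplies the loss by the same factor, 2**(-alpha) ~ 0.966 -
# a small but predictable gain, which is what justifies the investment.
```

This is also why the gains eventually get expensive: each constant-factor improvement costs a doubling of compute.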

The implications are profound. If performance scales predictably with compute, then the richest companies build the best models. Training costs function as a moat. A startup cannot train a GPT-4 competitor for $100 million when it might cost $1 billion to match the next generation. This dynamic has concentrated AI capability in a handful of companies: OpenAI, Google, Anthropic, Meta, and Mistral, with a few others at the frontier.

The environmental cost is real. By one widely cited estimate, training a single large language model can emit as much carbon as five cars over their entire lifetimes. Running inference (using the model to generate responses) at scale consumes even more energy over time than training. The AI industry's electricity consumption is growing at roughly 50% per year. Whether the benefits justify the environmental cost is an open and urgent question.

Architectures Beyond Language: GANs, Diffusion, and Multimodal Models

Generative Adversarial Networks (GANs) consist of two networks competing against each other. The generator creates fake images. The discriminator tries to distinguish real images from fake ones. As the discriminator gets better at spotting fakes, the generator gets better at creating convincing ones. The result is a generator that produces photorealistic images. StyleGAN from NVIDIA can generate faces of people who do not exist, with control over specific attributes like age, hair color, and expression.

Diffusion models (DALL-E 3, Stable Diffusion, Midjourney) took a different approach. They learn to reverse the process of adding noise to an image. Start with an image, gradually add random noise until it becomes pure static, and train the model to reverse each step. At generation time, start from pure noise and iteratively denoise until a coherent image emerges. Condition the process on a text description and you get text-to-image generation. These models produce higher-quality images than GANs and are now the dominant approach.
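The forward (noising) half of the process is simple enough to sketch. The noise schedule below is illustrative - real diffusion models vary beta per step and, crucially, train a network to reverse each step:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise_step(x, beta):
    """One forward-diffusion step: shrink the signal, mix in Gaussian noise."""
    noise = rng.normal(size=x.shape)
    return np.sqrt(1 - beta) * x + np.sqrt(beta) * noise

beta, steps = 0.02, 500        # illustrative schedule (real ones vary beta per step)
x = rng.uniform(size=(8, 8))   # a stand-in "image"
for _ in range(steps):
    x = add_noise_step(x, beta)

# Each step scales the surviving signal by sqrt(1 - beta), so after `steps`
# steps only (1 - beta)^(steps / 2) of the original image remains.
remaining_signal = (1 - beta) ** (steps / 2)
print(f"fraction of original signal remaining: {remaining_signal:.4f}")
```

After enough steps the image is effectively pure static; generation runs the learned reverse process from static back to a coherent image, guided by the text prompt.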

Multimodal models (GPT-4V, Gemini, Claude) combine text and vision. They can take an image as input and reason about it in text, or take a text description and relate it to visual concepts. This is achieved by training on datasets of image-text pairs: photos with captions, diagrams with explanations, charts with descriptions. The model learns to align visual representations with text representations in a shared embedding space.

Answers to Questions People Actually Ask

Are neural networks modeled on the brain? The name and the inspiration come from biology, but the resemblance is superficial. Real neurons communicate through electrochemical signals across synapses, operate asynchronously, and exist in complex 3D architectures with feedback loops at every level. Artificial neurons compute a weighted sum and apply a function - they are math, not biology. The brain has roughly 86 billion neurons with 100 trillion connections, operates on about 20 watts of power, and learns from vastly less data than neural networks require. Neural networks are inspired by the brain in the same way that airplanes are inspired by birds: the inspiration helped, but the mechanisms are fundamentally different.

Why do LLMs hallucinate? Because they are next-token predictors, not knowledge databases. An LLM does not look up facts in a table. It predicts which token is statistically likely given the context. If the training data contains a pattern where "the capital of Zymeria is" typically precedes a plausible-sounding city name, the model will generate a plausible-sounding city name even if Zymeria does not exist. Hallucination is not a bug in the conventional sense - it is a fundamental consequence of the architecture. The model generates what sounds right, not what is right. Retrieval-augmented generation (RAG) partially mitigates this by grounding the model's responses in retrieved documents, but hallucination cannot be fully eliminated with current architectures.

What does "parameters" mean when people say GPT-4 has 1.8 trillion parameters? A parameter is a single number (a weight or bias) in the neural network. During training, each parameter is adjusted to minimize the loss function. A model with 1.8 trillion parameters has 1.8 trillion numbers that were tuned during training. More parameters allow the model to capture more complex patterns, but they also require more data and compute to train effectively. The relationship between parameter count and capability is not linear - architecture matters too. A well-designed 70 billion parameter model can outperform a poorly designed 500 billion parameter model.

Can I train my own neural network? Yes. For image classification, you can fine-tune a pretrained model on your own dataset with a few hundred images in minutes on a free Google Colab GPU. For text tasks, you can fine-tune open models like Llama 3 or Mistral on a single consumer GPU using techniques like LoRA (Low-Rank Adaptation), which only updates a small fraction of the weights. Training a frontier model from scratch? That requires millions of dollars and thousands of GPUs. But adapting existing models to specific tasks is increasingly accessible.
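The arithmetic behind LoRA's efficiency is worth seeing directly. A sketch with illustrative sizes - the idea is that instead of updating a full d × d weight matrix, you train two thin matrices whose product is a low-rank correction:

```python
import numpy as np

rng = np.random.default_rng(0)

# LoRA sketch: freeze the pretrained weights W and learn a low-rank
# update A @ B, so the adapted layer computes x @ (W + A @ B).
d, rank = 1024, 8                      # illustrative sizes
W = rng.normal(size=(d, d))            # frozen pretrained weights
A = rng.normal(size=(d, rank)) * 0.01  # trainable
B = np.zeros((rank, d))                # trainable; zero-init leaves W unchanged at start

full_params = d * d
lora_params = d * rank + rank * d
print(f"full fine-tune: {full_params:,} trainable numbers")
print(f"LoRA (rank {rank}): {lora_params:,} - "
      f"{100 * lora_params / full_params:.2f}% of full")

# Because B starts at zero, the adapted layer initially matches the
# pretrained model exactly; training only has to move A and B.
x = rng.normal(size=d)
adapted = x @ W + (x @ A) @ B
print(bool(np.allclose(adapted, x @ W)))  # True at initialization
```

Training under 2% of the weights is what lets a single consumer GPU fine-tune a model whose full weight set would not even fit in its memory during training.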

Will deep learning hit a wall? Nobody knows. Scaling laws suggest that performance improves predictably with compute - but power laws eventually slow down, and the energy and financial costs of scaling are rising faster than the improvements. Some researchers argue that current architectures have fundamental limitations (e.g., lack of genuine reasoning, inability to learn from a few examples like humans do) that more scale cannot fix. Others point to the consistent track record of breakthroughs appearing when predicted to be impossible. The honest answer is that the field is moving too fast for anyone to make confident predictions about its ceiling.

What is the difference between fine-tuning and prompting? Prompting changes the model's behavior by giving it instructions in text without changing any of its weights. Fine-tuning actually modifies the model's weights by training on additional data. Prompting is cheaper and faster - you just write a different prompt. Fine-tuning is more powerful - you can teach the model specialized knowledge or behaviors that prompting cannot achieve. Think of prompting as giving someone written instructions and fine-tuning as teaching them a new skill through practice.

The Trajectory: Where Deep Learning Goes Next

Deep learning has moved from a niche academic pursuit to the most commercially significant technology in the world in roughly a decade. The trajectory shows no signs of slowing, but the direction is shifting.

Multimodality is becoming standard. Models that can see, read, hear, and generate across modalities are replacing single-purpose models. A single model that understands text, images, audio, and video simultaneously is more useful than four separate specialists.

Efficiency is becoming as important as capability. Techniques like quantization (reducing weights from 32-bit floats to 8-bit or even 4-bit numbers), distillation (training small models to mimic large ones), and sparse architectures (activating only relevant parts of the network) are making powerful models run on phones and laptops. Llama 3.2 1B runs on a smartphone. This democratization matters more than frontier capability for most practical applications.

Agentic systems connect LLMs to tools: code execution, web browsing, database access, API calls. Instead of generating text about how to solve a problem, the model actually solves it by writing and running code, searching the web, and taking real-world actions. This shifts LLMs from information tools to action tools.

The fundamentals remain unchanged. At the bottom, it is neurons computing weighted sums. Layers building representations. Backpropagation adjusting weights. Loss functions measuring error. The complexity emerges from scale and architecture, not from any single sophisticated component. Understanding these fundamentals lets you see through the hype and assess what these systems can and cannot do - which is the most important skill in an era where everyone has an opinion about AI and almost nobody understands how it works.

The takeaway: Neural networks are layers of simple mathematical functions - weighted sums followed by non-linear activations - connected in architectures that learn hierarchical representations from data. CNNs learn visual features, from edges to objects. Transformers learn attention patterns across sequences, powering every major language model. The power comes from scale: billions of parameters trained on trillions of examples. The limitations are equally fundamental: hallucination, bias from training data, enormous compute costs, and a gap between statistical pattern matching and genuine understanding. Knowing the mechanics lets you use these tools intelligently and evaluate their outputs critically.