Rasool Abbas

Neural network from scratch

01/31/2026

Demo

I wanted to learn the fundamentals of modern AI, so I decided to build a neural network from scratch. Before we dive deep into the subject, check out the live demo by drawing a single digit number in the grid. Use your mouse to click and drag to draw or tap and hold if on a mobile device.
I hope that was a little fun and interesting. It's certainly not perfect, and we will get into why, but it's pretty good for my first from-scratch attempt!

Introduction

Neural networks are, at their very core, function estimators. They are useful in the real world because almost anything can be modeled as a function, and a neural network can then be applied to approximate that function.

For example, imagine a function whose inputs are a fixed-size array of grayscale pixel values (0 to 255), and whose two outputs are values between 0 and 1: one representing the likelihood that the image is of a dog, the other that it is of a cat. A neural network can take in the pixel values and produce those two numbers, predicting whether the image is a cat or a dog. Obviously, this is a very rough example, but it conveys the main point: if you can quantify a problem as a function, you can use a neural network to approximate that function. It is within this quantification of a problem, and the application of a function approximator to it, that "intelligence" begins to emerge.
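To make that concrete, here is a minimal sketch (with made-up names) of the shape such a function would have; the neural network's job is to learn a good approximation of its body:
python
def classify(pixels: list[int]) -> tuple[float, float]:
    """Hypothetical dog/cat classifier.

    pixels: grayscale values (0 to 255) for a fixed-size image.
    Returns (p_dog, p_cat), each between 0 and 1.
    """
    # A trained neural network approximates this mapping;
    # the body here is just a placeholder.
    raise NotImplementedError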

Fundamentals

AI is a bit of a loose term. It can be used to describe any system that can learn and make decisions based on data, such as:
  • Recommendation systems
  • Fraud detection
  • Anomaly detection
  • Spam detection
  • Image recognition
  • Speech recognition
  • Translation
  • etc.
At the core of all of these systems are neural networks. Neural networks are a type of machine learning system inspired by the human brain. They are composed of layers of (artificial) neurons which are connected to each other in series.

Early research into neural networks dates back to the 1940s, with the invention of the "artificial neuron" by Warren McCulloch and Walter Pitts. The artificial neuron is composed of inputs, an activation function, weights and biases, and an output. Based upon the inputs, the activation function determines an output according to the weights and biases of the neuron. Simply put: the artificial neuron decides whether to activate based upon a calculation over many variables.

This is how a modern artificial neuron/perceptron is represented mathematically:
Perceptron
Modern Artificial Neuron
A modern artificial neuron architecture. The red text represents the biological neuron counterpart. — Source
Where:
  • $x_1 \ldots x_d$ are the inputs to the perceptron
  • $w_1 \ldots w_d$ are the weights between the inputs and the perceptron
  • $b$ is the perceptron's bias
  • $z = \sum_{i=1}^{d} w_i x_i + b$ is the linear function
The only bit we haven't defined yet is the activation function. The activation function takes in the output of the linear function and produces the output of the perceptron. Below are some examples of popular activation functions:
  • ReLU: $f(x) = \max(0, x)$
  • Leaky ReLU: $f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{otherwise} \end{cases}$
  • Sigmoid: $f(x) = \frac{1}{1 + e^{-x}}$
  • Tanh: $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
Typically, you pick the activation function based on the task the neural network is being applied to. For example, if you're trying to classify numbers, you might use sigmoid or softmax activation functions. I used the sigmoid activation function for my neural network implementation and I will expand upon that later.
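For intuition, here is what these activation functions look like as plain Python functions (a minimal sketch; the α in Leaky ReLU is a small constant you pick):
python
import math

def relu(x: float) -> float:
    return max(0.0, x)

def leaky_relu(x: float, alpha: float = 0.01) -> float:
    return x if x > 0 else alpha * x

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x: float) -> float:
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))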

Neural Network Design

I am not an expert on this subject, so I will focus on the only type of neural network I learned during this project: the fully connected feedforward neural network. This is a primitive version of today's more complex neural network architectures, but because it is simple, it helps build early intuition well.

Neurons are the fundamental building blocks of neural networks. As you might guess, to achieve a neural network, these neurons are connected to each other like a network. One neuron's output becomes another neuron's input, and many neurons are stacked together in layers. This pattern repeats and builds upon itself.
Simple Neural Network
A simple and tiny fully connected feedforward neural network. — Source
The first layer in a neural network (first relative to information flow) is called the input layer. This layer contains the neurons which base their activation on the input data rather than on other neurons. The last layer in the network is called the output layer, where the output of the neurons in this layer is the collective output of the entire network. Every layer in between is called a hidden layer; there can be one or many hidden layers, and their number and size can be tuned to optimize how well the network learns and performs.

Information flows in one direction, beginning with the input layer (left hand side) and ending with the output layer (right hand side). This is why it is called a feedforward neural network. Past the first layer, every neuron's input consists of the output of every neuron in the previous layer. For example, in the diagram above, every neuron in the hidden layer receives its input from the output of every neuron in the input layer. This is also why it's called a fully connected feedforward neural network.

Let's apply this knowledge to our example in the introduction of a dog/cat classifier. Let's assume every grayscale image of either a cat or dog will be 200 x 200 pixels. Remember that one image is our input, therefore we need the same number of neurons as image pixels in the input layer of our neural network. Thus, the input layer will be 200 x 200 = 40,000 neurons large. Choosing how many hidden layers (and how many neurons in each) depends on your problem, and it is adjustable. Ideally, you want the smallest network that can learn and generalize your data well. For basic problems with simple data, you can likely get away with a very small network. For problems with higher dimensional data, you will need a larger network to learn nuance. In our dog/cat classifier example, let's go with 2 hidden layers of 20 neurons each. Finally, the output layer is trivial: the number of neurons in the output layer is the number of outputs your network needs to have. In our example, it consists of 2 neurons that each produce a value x where 0 <= x <= 1, representing a probability.

I hope that helps paint a better picture of the basic design reasoning behind a (very simple) neural network. However, I am holding back the most important part: training and data.
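As a rough sketch, the architecture we just reasoned through for the hypothetical dog/cat classifier can be summarized by nothing more than a list of layer sizes (names here are illustrative):
python
# Hypothetical dog/cat classifier architecture from the example above.
INPUT_LAYER_SIZE = 200 * 200   # one neuron per grayscale pixel
HIDDEN_LAYER_SIZES = [20, 20]  # 2 hidden layers of 20 neurons each
OUTPUT_LAYER_SIZE = 2          # one output per class (dog, cat)

layer_sizes = [INPUT_LAYER_SIZE, *HIDDEN_LAYER_SIZES, OUTPUT_LAYER_SIZE]
print(layer_sizes)  # [40000, 20, 20, 2]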

Training a Neural Network

Piggybacking off of our cat/dog classifier example, the neural network we designed still wouldn't be able to predict anything just yet. As you might have guessed, we need to actually teach the neural network what a dog and a cat look like. To do that, we need training data. From our previous outline, we need 200x200 pixel grayscale images of both cats and dogs, ideally as many high quality samples as we can get.

Let's assume we have a perfect training data set of 10,000 samples. Training the network means we tweak the values of the weights which connect every layer of neurons to the next. Tweaking the weights has a direct impact on whether a neuron will activate for a given input. This creates a chain effect of neurons activating other neurons based upon the input, all the way to the output layer.

Before training, we need to do a couple of things. First, we need to split our training data set into 2 buckets: a majority of samples will be for training while the rest will be for validation. The training samples are the samples the neural network will learn from (adjust its weights against). The validation samples are samples the network doesn't learn from, but which are used to gauge the network's understanding at some interval. The validation samples are immensely helpful and will tell us whether our neural network is actually learning or just "memorizing" the training samples (a concept called over-fitting). Typically, you split your dataset 80%/20% between training and validation. Second, we need to initialize our neural network with random weights and bias values. These are just totally random numbers; the neural network will adjust them over training steps to produce increasingly correct results.
Every training step performs the following operations in order:
  • Forward propagation: Pass input values (image pixels in this case) to the first layer, then compute activations for each neuron until the output layer.
  • Backward propagation: Use the gradient descent algorithm to adjust weight values such that the network is more likely to produce the expected output given the current state of the network, starting at the output layer and working backwards until reaching the input layer. It is important to mention that the activations computed in the previous step are used throughout the network while this step is running.
  • Update weights: When you run the gradient descent algorithm during backward propagation, you compute adjustments for all of the network's weights (and biases, though in our simple implementation, just the weights). At the end of each training step, the new weights calculated from backward propagation are applied. Updating the weights after every individual training sample like this is called stochastic gradient descent (SGD), as in the sketch below.
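Put together, one training step looks roughly like the following sketch (purely illustrative; the real training loop for my implementation appears later in this post):
python
# Illustrative sketch of one pass of stochastic gradient descent.
for sample, expected_output in training_data:
    prediction = network.forward(sample)              # forward propagation
    new_weights = network.backward(expected_output)   # backward propagation
    network.weights = new_weights                     # update weights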
Note that I am brushing over a lot of intricate details here; I just want to convey a big picture idea and a brief explanation for each operation. If you want extremely high quality instructive material on neural networks, this playlist from 3Blue1Brown is incredible.

The bullet points above represent 1 training step. Typically, you repeat training steps until the network has converged. You know your network has converged when the accuracy has risen and hit a plateau. Usually at this stage, the network has learned the data well and is only making very precise, somewhat negligible adjustments to the weights.
Illustration of a neural network converging
The movement of the decision boundaries show the NN learning, and eventually the NN converges. — Source
The above GIF dives a bit deeper than where we want to focus right now, but I want to convey the idea that the network converges after some duration of training. After your network converges, it continues making small adjustments for precision. Remember, the only values that have changed since initializing the network are the weights. That is the difference between a completely incoherent network and an intelligent one: just a bunch of very precise numbers that we call weights, updated using complicated algorithms. It's beautiful to think about.

When you train a network, you can export the weights in some format and apply them to another model with the same architecture for inference. This is how the live demo on this page works.

With some high-level neural network concepts out of the way, let's dive into my project, the end result of which you can see as the demo on this page.

My Project

The inspiration for this project is my brother, who is more of a veteran on this topic. He has talked to me about AI related projects he has worked on and tinkered with for a while. In an effort to learn the basic building blocks of modern AI, I wanted to build a neural network from scratch. I also wanted to do it completely on my own, which means no AI assistance of any kind and no looking at other code. I wanted to build pure intuition on the subject.

If you know me, you know that I sometimes have trouble keeping my attention on one thing. I worked on this project on and off for about 3 months, beginning at the end of 2024 and ending around March of 2025. I started writing this blog post in April of 2025, only to finish it now. Better late than never, right?!

When I started this project, I was looking for a good, beginner friendly open-source data set to train on. That's when I found the perfect one. Ladies and gentlemen, I introduce to you the "Hello World" of machine learning, the MNIST dataset ... and the task of classifying handwritten digits.

The MNIST dataset is famously a great dataset for introductory machine learning applications because it is heavily curated and well balanced. The MNIST dataset contains 70,000 black and white images of handwritten single digit numbers, each 28x28 pixels large.
MNIST Dataset Samples
Grid of stitched together samples from the MNIST dataset.
With this labelled dataset, a ton of ambition and a little bit of ego, I began to work on this project.

Data Preparation

I pulled down the MNIST dataset from Huggingface by cloning the remote repository. The repository contains 2 parquet files, one named train that consists of 60k rows, and another named test with 10k rows. As you might have guessed, these are the training samples and validation samples, already separated for us.

The parquet files contain 2 columns, one called image and another called label. The image column contains the image in bytes form, while label contains a single number, the label of the image.
MNIST Parquet File Display
A small preview of one of the parquet files from the MNIST dataset.
I converted the bytes column to PNG files in a local directory using a script I quickly put together. I wanted to have a folder of the actual image files because I was going to read the image pixel by pixel a bit later. For the 60k train samples, the folder ended up being 234MB.
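The quick conversion script isn't included in this post, but a sketch of it might look something like the following, assuming the image column holds raw PNG bytes (column handling and file names here are illustrative):
python
import io
import os
import uuid

import pandas as pd
from PIL import Image

os.makedirs("train_images", exist_ok=True)

# Assumption: the "image" column holds PNG bytes and "label" holds the digit.
df = pd.read_parquet("train.parquet")

for _, row in df.iterrows():
    image = Image.open(io.BytesIO(row["image"]))
    # Named <unique_id>_<label>.png, matching the folder listing below.
    image.save(f"train_images/{uuid.uuid4().hex}_{row['label']}.png")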
File list of MNIST samples
A partial list of the MNIST samples on my local file system, named <unique_id>_<label>.png.
Now that I had the data, I was ready to write my neural network implementation.

Writing the Network

Disclaimer
The code I am about to show you for this project was written without optimization in mind; I was learning new concepts and making a lot of mistakes along the way. Hence the code might look redundant, intensely verbose, and/or bad. Please look past that; this code isn't meant to be pretty.
After my research into this subject, I had some ideas of what my neural network implementation would be. Ultimately, I wanted to keep it simple. Here are the bullet points I had in mind:
  • Fully connected, feed-forward neural network
  • All hidden layers are the same size (fixed-width network, simpler to code)
  • Keep the network as small as possible (parameter count)
  • Support exporting and importing of model weights to and from my own rudimentary format, for testing purposes
To begin, I created a Network class which will encapsulate all of my network logic and variables. I also created a Neuron class. I know, the Neuron class is extremely unnecessary, but we all do strange things when in new territory, right?
python
class Network():
    def __init__(
        self,
        hidden_layer_count: int = HIDDEN_LAYER_COUNT,
        hidden_layer_size: int = HIDDEN_LAYER_SIZE,
        input_layer_size: int = INPUT_LAYER_SIZE,
        output_layer_size: int = OUTPUT_LAYER_SIZE,
        weights: list[list[float]] = [],
        biases: list[list[float]] = [],
        learning_rate: float = LEARNING_RATE,
    ):
        self.layers = []
        self.learning_rate = learning_rate
        self.hidden_layer_count = hidden_layer_count

        # ... some input validation code ...

        # Creating the input layer.
        self.layers.append([Neuron() for _ in range(input_layer_size)])

        # Creating the hidden layers.
        for _ in range(hidden_layer_count):
            self.layers.append([Neuron() for _ in range(hidden_layer_size)])

        # Creating the output layer.
        self.layers.append([Neuron() for _ in range(output_layer_size)])

        # Setting weights and biases.
        self.weights = weights
        self.biases = biases
Being able to initialize a model of any size I wanted was great for testing and iterating. At this point all I had was a data structure, just a basic shell. It was now time to implement a method to perform forward propagation. As a reminder, the forward propagation algorithm is as follows:
  • Populate input layer with input data
  • Calculate activations for all of the neurons past the input layer, sequentially
For neurons in the input layer, the activations are simply just the input values from the data. For every neuron past the input layer, calculating the activation looks like:
Neuron activation formula:
$$a = A\left(\sum_{i=1}^{n} w_i x_i + b\right)$$
Where:
  • $A$ is the activation function
  • $n$ is the number of neurons in the previous layer
  • $w_i$ is the weight between previous layer neuron at index i and the current neuron
  • $x_i$ is the activation of previous layer neuron at index i
  • $b$ is the current neuron's bias
For reference, the activation function we are using is the sigmoid function:
Sigmoid function:
$$f(x) = \frac{1}{1 + e^{-x}}$$
The sigmoid function has horizontal asymptotes at y = 0 and y = 1, therefore any input will be squeezed between 0 and 1. This is especially helpful in classification tasks because sigmoid allows outputs to be interpreted as probabilities. Sigmoid is widely used in classification tasks, although modern approaches use different activation functions for different layers (further complexity I have not explored in this implementation).

Alright, enough yapping about math, let's write the forward propagation algorithm. It's actually quite simple: we just loop through the layers sequentially and compute the activation for each neuron as outlined in the formula above. I found this part to be the easiest.
python
def forward(self, input: list[float]) -> list[float]:
    # ... a bit of input validation code ...

    # Setting activations of the input layer to the input.
    for i in range(len(input)):
        self.layers[0][i].activation = input[i]

    # Feed forward.
    for layer_index, layer in enumerate(self.layers[1:]):
        weights = self.weights[layer_index]
        prev_layer = self.layers[layer_index]

        for neuron_index, neuron in enumerate(layer):
            weights_for_neuron_lower_bound = int(neuron_index * len(prev_layer))
            weights_for_neuron_upper_bound = int(weights_for_neuron_lower_bound + len(prev_layer))

            neuron_bias = self.biases[layer_index][neuron_index]

            weights_for_neuron = [
                w for w in weights[
                    weights_for_neuron_lower_bound:weights_for_neuron_upper_bound
                ]
            ]

            prev_layer_activations = [prev_neuron.activation for prev_neuron in prev_layer]

            activation = sigmoid(array_sum(dot(weights_for_neuron, prev_layer_activations)) + neuron_bias)

            neuron.activation = activation

    output_layer = self.layers[-1]

    return [float(neuron.activation) for neuron in output_layer]
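The forward pass above leans on a few small helpers (sigmoid, dot, array_sum) whose definitions aren't shown in this post. Given how they are called, a minimal sketch of them could look like this:
python
import math

def sigmoid(x: float) -> float:
    # Squashes any real number into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def dot(a: list[float], b: list[float]) -> list[float]:
    # Element-wise products; summed separately by array_sum.
    return [x * y for x, y in zip(a, b)]

def array_sum(values: list[float]) -> float:
    return sum(values)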
That's pretty much it for the forward propagation algorithm. To test what I had so far, I initialized the network with sample weights and biases and followed the math all of the way down to the output layer.
python
network = Network(
    hidden_layer_count=1,
    hidden_layer_size=2,
    input_layer_size=2,
    output_layer_size=2,
    weights=[
        [0.15, 0.20, 0.25, 0.30],
        [0.40, 0.45, 0.50, 0.55]
    ],
    biases=[
        [0.35, 0.35],
        [0.60, 0.60]
    ],
)

print(
    network.forward([0.05, 0.10])
)

# [0.7513650695523157, 0.7729284653214625]
The math checks out. Side note: if I had the weights of a trained neural network and initialized my network with them (assuming the network structures are the same), I could run inference with only the code I've written so far. That's all model inference is, just the forward propagation algorithm running over pre-trained weights.

However, we will not stop here; we will continue until we can train our own model. Let's begin writing the backward propagation algorithm, the real challenge.

Back propagation

Illustration of neural network training
An illustration showing the flow of data during a forward and backward propagation run during training. — Source
To expand on our backward propagation definition from the earlier section, backward propagation reduces the error of the neural network based upon a loss function. The error is simply a measure of how inaccurate the model's prediction was against the correct answer. The loss function quantifies this error in a specific way in preparation for optimization.

You may use many types of loss functions depending on the task. For my neural network implementation, I used a simple loss function called squared error. I used it because it's simple and easy to debug, which matters since my implementation is from scratch. After doing some research in retrospect, it turns out this loss function is not optimal for classification tasks; it is better suited for regression tasks (predicting prices, temperatures, etc.). Maybe I will revisit this project and improve it with this knowledge!

Let's write our squared error loss function:
python
def get_error(self, input: list[float], expected_output: list[float], actual_output: list[float] = None) -> list[float]:
    if not actual_output:
        actual_output = self.forward(input)

    errors: list[float] = []

    for act_out_index, act_out in enumerate(actual_output):
        exp_out = expected_output[act_out_index]

        errors.append(
            0.5 * (pow((exp_out - act_out), 2))
        )

    return errors
This creates an error vector with the same dimensions as the output vector. The sum of all values within this vector is the total error of the network, an important data point to track during training.

Ultimately, back propagation aims to modify the weight values of the network such that error is minimized. We can do this by taking partial derivatives. Back propagation starts at the output layer, and we assume the network has already run forward propagation on a training sample to populate neuron activations across the network.
For every neuron in the output layer:
  • Calculate the partial derivative of error with respect to the neuron's output:
  • First, we need to define the total error of the network
    Total error of the network:
    $$E_{total} = \sum_{n=1}^{N} \frac{1}{2}(t_n - o_n)^2$$
    where $N$ is the number of output neurons, $t_n$ is the target output of neuron $n$, and $o_n$ is its actual output.
    Now to take the partial derivative with respect to the current neuron's output:
    $$\frac{\partial E_{total}}{\partial out_n} = o_n - t_n$$
    When we take the partial derivative with respect to a single neuron, the other terms in the summation drop out.
  • Calculate the partial derivative of the neuron's output with respect to the neuron's input
  • Let's define the neuron's output:
    Neuron output:
    $$out = \frac{1}{1 + e^{-net}}$$
    where $net$ is the input to the neuron. Note the sigmoid activation function making an appearance in the formula above, since it directly determines the neuron's activation (output). Now to take the partial derivative:
    $$\frac{\partial out}{\partial net} = out(1 - out)$$
    where $out$ is the output of the neuron.
Now that we have calculated neuron specific values, we need to perform the following calculations for every weight that feeds into this neuron:
  • Calculate the partial derivative of the neuron's input with respect to the weight
  • Let's define the neuron's input for neurons in the output layer:
    Output layer neuron input:
    $$net = \sum_{i=1}^{H} w_i \, out_i + b$$
    where $H$ is the number of hidden layer neurons in the previous layer, $w_i$ is the weight from hidden neuron at index i to the output neuron, $out_i$ is the output of hidden neuron at index i, and $b$ is the bias of the current output layer neuron.
    Now to take the partial derivative of the input with respect to a weight:
    $$\frac{\partial net}{\partial w_i} = out_i$$
    because constants and non-relevant terms drop out when we take the partial derivative with respect to a specific weight. Now we can finally calculate the final value needed to update the weight.
  • Now to calculate the partial derivative of total error with respect to the weight
  • Partial derivative of total error with respect to the weight:
    $$\frac{\partial E_{total}}{\partial w_i} = \frac{\partial E_{total}}{\partial out} \cdot \frac{\partial out}{\partial net} \cdot \frac{\partial net}{\partial w_i}$$
    where $\frac{\partial E_{total}}{\partial out}$ is the partial derivative of total error w/r/t the neuron's output, $\frac{\partial out}{\partial net}$ is the partial derivative of the neuron's output w/r/t its input, and $\frac{\partial net}{\partial w_i}$ is the partial derivative of the neuron's input w/r/t weight at index i.
Now that we have calculated the most important value, we can update the weight:
Weight update:
$$w_i^{+} = w_i^{-} - \eta \cdot \frac{\partial E_{total}}{\partial w_i}$$
where $w_i^{-}$ is the old weight value, $w_i^{+}$ is the new weight value, $\eta$ is the learning rate, and $\frac{\partial E_{total}}{\partial w_i}$ is the partial derivative of total error w/r/t the weight.
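As a purely illustrative arithmetic example (the numbers are made up for this post): suppose a hidden neuron's output is $out_i = 0.6$, the output neuron produces $o = 0.75$ against a target of $t = 0$, and $\eta = 0.5$. Then

$$\frac{\partial E_{total}}{\partial w_i} = (0.75 - 0) \cdot 0.75(1 - 0.75) \cdot 0.6 \approx 0.084,$$

so the weight is nudged down by $\eta \cdot 0.084 \approx 0.042$.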
The learning rate is a multiplier we use to make the network learn more quickly or slowly, depending on the state of the training process. A network that's just beginning to learn might benefit from a higher learning rate so that it can apply gradient descent more aggressively, and then fine tune itself later. However, too high a learning rate can cause the network's error to oscillate, because it cannot apply gradient descent precisely enough to keep reducing it.

This is the critical update to the weight that we perform in order to train the network. However, these expressions only apply to the output layer. For weight updates for neurons before the output layer, the process is slightly different. Let's outline the steps for updating weights for neurons in the hidden layers.
  • Calculate the partial derivative of total error with respect to the output of the neuron:
  • Since we're calculating back propagation on the hidden layers now, this calculation will be different. Unlike the output layer neurons, the output of neurons in hidden layers affects other neurons. Therefore, calculating the error of hidden layer neurons depends on calculating errors of neurons in the next layer.
    Partial derivative of total error with respect to neuron output:
    $$\frac{\partial E_{total}}{\partial out_h} = \sum_{i=1}^{N} \frac{\partial E_{o_i}}{\partial out_h}$$
    where $N$ is the number of neurons in the next layer and $\frac{\partial E_{o_i}}{\partial out_h}$ is the partial derivative of the error of the next layer neuron at index i w/r/t the current neuron's output.
    Let's continue calculating for the next layer neuron at index i:
    $$\frac{\partial E_{o_i}}{\partial out_h} = \frac{\partial E_{o_i}}{\partial net_{o_i}} \cdot \frac{\partial net_{o_i}}{\partial out_h}$$
    where $\frac{\partial E_{o_i}}{\partial net_{o_i}}$ is the partial derivative of the error of output layer neuron at index i w/r/t that neuron's input, and $\frac{\partial net_{o_i}}{\partial out_h}$ is the partial derivative of that input w/r/t the output of the current hidden layer neuron.
    We know how to calculate the first term from our previous output layer calculations:
    $$\frac{\partial E_{o_i}}{\partial net_{o_i}} = \frac{\partial E_{o_i}}{\partial out_{o_i}} \cdot \frac{\partial out_{o_i}}{\partial net_{o_i}}$$
    where $\frac{\partial E_{o_i}}{\partial out_{o_i}}$ is the partial derivative of the error of next layer neuron at index i w/r/t its own output, and $\frac{\partial out_{o_i}}{\partial net_{o_i}}$ is the partial derivative of that neuron's output w/r/t its own input.
    Notice how we have already calculated both terms for the output layer neurons, therefore we just re-use those calculations. Now let's calculate the second term from our initial calculation. Remembering our last calculation of the input to an output layer neuron:
    $$net_{o_i} = \sum_{j=1}^{H} w_j \, out_j + b$$
    where $H$ is the number of hidden layer neurons in the previous layer, $w_j$ is the weight from hidden neuron at index j to the output neuron, $out_j$ is the output of hidden neuron at index j, and $b$ is the bias of the output layer neuron.
    Now to take the partial derivative of the input of next layer neuron at index i w/r/t the current hidden layer neuron's output:
    $$\frac{\partial net_{o_i}}{\partial out_h} = w_j$$
    where $w_j$ is the weight between the current hidden layer neuron and output layer neuron at index i, because the constants and irrelevant terms drop out.
    This process is repeated for all output layer neurons and summed to get the partial derivative of total error w/r/t the hidden layer neuron's output. Now we are ready for the next step.
  • Calculate the partial derivative of the hidden layer neuron's output w/r/t its own input
  • Let's remember the calculation for neuron output:
    Neuron output:
    $$out_h = \frac{1}{1 + e^{-net_h}}$$
    where $net_h$ is the input to the neuron. Now to take the partial derivative:
    $$\frac{\partial out_h}{\partial net_h} = out_h(1 - out_h)$$
    where $out_h$ is the output of the neuron and $net_h$ is its input.
    As you can see, these are the same calculations we performed in the output layer.
  • Now, to update a weight of a hidden layer neuron, we need to calculate the partial derivative of the hidden layer neuron's input w/r/t the weight value. Remember that at this stage of the calculations we are looping through all weights of the current neuron. Let's use the index k to denote the current weight.
  • Remembering our previous calculation of this value from the output layer:
    $$\frac{\partial net_h}{\partial w_k} = out_j$$
    where $out_j$ is the output of the previous layer neuron at index j, which weight k connects to the current hidden layer neuron, because constants and irrelevant terms drop out.
  • For the final calculation before the weight update, we need to calculate the partial derivative of total error w/r/t weight at index k:
  • Partial derivative of total error w/r/t weight at index k:
    $$\frac{\partial E_{total}}{\partial w_k} = \frac{\partial E_{total}}{\partial out_h} \cdot \frac{\partial out_h}{\partial net_h} \cdot \frac{\partial net_h}{\partial w_k}$$
    where $\frac{\partial E_{total}}{\partial out_h}$ is the partial derivative of total error w/r/t the current hidden layer neuron's output, $\frac{\partial out_h}{\partial net_h}$ is the partial derivative of that output w/r/t the neuron's input, and $\frac{\partial net_h}{\partial w_k}$ is the partial derivative of the current hidden layer neuron's input w/r/t weight at index k.
    As you can see, we've already calculated all of these terms; now we just plug them in and update weight k for the current hidden layer neuron.
    Weight update for weight at index k (of the current hidden layer neuron):
    $$w_k^{+} = w_k^{-} - \eta \cdot \frac{\partial E_{total}}{\partial w_k}$$
    where $w_k^{-}$ is the old weight value at index k, $\eta$ is the learning rate, and $\frac{\partial E_{total}}{\partial w_k}$ is the partial derivative of total error w/r/t weight at index k.
    This process is repeated for all neurons until we update all weights in the network.
That's all training a neural network is! Just calculating how we can modify weights such that error is minimized, and doing it repeatedly over various training samples.
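My actual backward implementation isn't reproduced in this post, but to connect the math above to code, here is a simplified, illustrative sketch of how the output-layer portion of these calculations could look. It is not the exact code from my project; the function name, variable names, and the flat weight layout are assumptions that mirror the forward pass shown earlier.
python
# Illustrative sketch: output-layer weight updates for one training sample.
# Assumes `hidden_outputs` and `outputs` were populated by a forward pass.
def output_layer_weight_updates(
    hidden_outputs: list[float],  # activations of the last hidden layer
    outputs: list[float],         # activations of the output layer
    targets: list[float],         # expected outputs for this sample
    weights: list[float],         # flat list, same layout as in forward()
    learning_rate: float,
) -> list[float]:
    new_weights = list(weights)

    for n, (o, t) in enumerate(zip(outputs, targets)):
        # dE/dout * dout/dnet for this output neuron (its "delta").
        delta = (o - t) * o * (1 - o)

        for i, hidden_out in enumerate(hidden_outputs):
            w_index = n * len(hidden_outputs) + i  # matches forward() indexing
            # dE/dw = delta * dnet/dw = delta * hidden neuron's output
            gradient = delta * hidden_out
            new_weights[w_index] = weights[w_index] - learning_rate * gradient

    return new_weights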

Testing the implementation

Now that we have written the back propagation algorithm, let's make sure it works. There are many ways to do this, but I want a nice demonstration of the network learning and then estimating a function for us. I will initialize a tiny network and "train" it to estimate the y = x function. This sounds a bit silly, but I think it makes a great demonstration for building intuition. We will need a network with 1 input and 1 output, initialized with random weights and biases.
Linear function
y=x
First, I initialized the network with the proper architecture for this task:
python
HIDDEN_LAYER_COUNT = 1
HIDDEN_LAYER_SIZE = 2
INPUT_LAYER_SIZE = 1
OUTPUT_LAYER_SIZE = 1
LEARNING_RATE = 0.01

training_samples = [
    0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0
]

# ... some code to create random weights and biases between -1, 1 ...

network = Network(
    weights=initial_weight_layers,
    biases=initial_biases,
    learning_rate=LEARNING_RATE
)

EPOCHS = 1000

for _ in range(EPOCHS):
    for training_sample in training_samples:
        output = network.forward([training_sample])

        new_weights = network.backward([training_sample])

        network.weights = new_weights

        error = network.get_error(
            input=[training_sample],
            expected_output=[training_sample],
            actual_output=output,
        )
        avg_error = sum(error) / len(error)

# Evaluating the network after training completes
for sample in training_samples:
    output = network.forward([sample])[0]
    print("sample", sample, "output", output)
An epoch is just a training round, and 1 training round means the model trains over all training samples once. Epoch count and learning rate are examples of hyperparameters: configurable values that affect training but are unrelated to the training data itself. Alright, let's run the training over 1000 epochs and see what happens!
1000 epoch training
$ python train.py
sample 0 output 0.49636686529585927
sample 0.1 output 0.4950622189277994
sample 0.2 output 0.493753051034191
sample 0.3 output 0.4924412446499883
sample 0.4 output 0.49112869806688536
sample 0.5 output 0.4898173145487352
sample 0.6 output 0.48850899198885495
sample 0.7 output 0.4872056126285226
sample 0.8 output 0.48590903295476057
sample 0.9 output 0.48462107389194975
sample 1.0 output 0.48334351139604437
Well, the model is supposed to predict much closer to the samples than this. The current model output makes me feel like the model is unsure and is predicting the midpoint of values just to be safe. Let's plot the average error over every epoch to see how it's being minimized.
Error vs epoch, 1000 epochs
Training for y=x, error vs epoch, 1000 epochs
This is actually a good sign. We can see that total network error decreases over training epochs. It means that our network is able to minimize error over training samples and epochs. Let's bump up the number of epochs to 10,000 to see how the predictions and this curve change.
10,000 epoch training
$ python train.py
sample 0 output 0.15739128143104839
sample 0.1 output 0.2188780645114613
sample 0.2 output 0.2955161696484011
sample 0.3 output 0.38202537860069186
sample 0.4 output 0.4700731078647977
sample 0.5 output 0.5516387834851928
sample 0.6 output 0.6217470647604859
sample 0.7 output 0.6789280332242159
sample 0.8 output 0.7240910752003388
sample 0.9 output 0.7591787559653463
sample 1.0 output 0.7862804282139579
Nice, it's certainly not very accurate but it's much better than before. You can see the model's predictions spreading out and becoming closer to the sample values. Let's take a look at the error plot.
Error vs epoch, 10,000 epochs
Training for y=x, error vs epoch, 10,000 epochs
We can see that it decreased dramatically to a much lower error value than before. It still doesn't look like the model has "converged", however, so let's try another 10x increase in epochs.
100,000 epoch training
$ python train.py
sample 0 output 0.06741017560766883
sample 0.1 output 0.10800807415891654
sample 0.2 output 0.17154562453226208
sample 0.3 output 0.26348259631471505
sample 0.4 output 0.3817364722667494
sample 0.5 output 0.512851353330828
sample 0.6 output 0.6370749188831896
sample 0.7 output 0.739466845478218
sample 0.8 output 0.8155818848098558
sample 0.9 output 0.8686737930138851
sample 1.0 output 0.9045891110155017
Sweet, it looks like the predictions are now very close to the actual sample values. Neural networks will not predict exactly the expected value, but somewhere near it, and additional calculations are often applied to a neural network's raw prediction depending on the task. The idea is that the network predicts consistently accurate results based on the input. Let's now plot the error of the network.
Error vs epoch, 100,000 epochs
Training for y=x, error vs epoch, 100,000 epochs
It looks like the model converged around 20,000 epochs into training. This means the model learned to the point where it cannot minimize error further by a notable amount. Obviously, the predictions of the network could be much better, so why does the network think it is finished learning?

Well, it's all about the model's architecture. We could research and experiment with different activation functions for each layer, a different error function, scaling the network size up or down, etc. However, we accomplished our goal of demonstrating the network learning something trivial. It seems like the network is working properly. Let's now apply it to the MNIST dataset!

Begin training for MNIST

Let's initialize the neural network for classifying digits from the MNIST dataset. First, we know that the images are 28x28 pixels large. That means our input layer will have to be 28x28 = 784 neurons large. For the hidden layers, I chose 2 hidden layers of 20 neurons each. Finally, the output layer needs 10 neurons, one for each digit (0-9). The idea is that each neuron represents a digit, and that neuron's activation is the probability that the input is that digit.

There is no single formula for choosing how many hidden layers to use and how large they should be. You may have some good signals depending on your problem space; however, starting small and increasing with trial and error seems to be the best approach, which is what I did in this project.
python
HIDDEN_LAYER_COUNT = 2
HIDDEN_LAYER_SIZE = 20
INPUT_LAYER_SIZE = 784 # 28x28
OUTPUT_LAYER_SIZE = len(output_values)

LEARNING_RATE = 0.02

# ... some code to generate random weights and biases in correct size ...

network = Network(
    weights=initial_weight_layers,
    biases=initial_biases,
    learning_rate=LEARNING_RATE
)

network.print_network()
MNIST Classifier Training
$ python train.py
**************************************************
Layer[0] has size 784
Layer[1] has size 20
Layer[2] has size 20
Layer[3] has size 10
Weights[0] has size 15680
Weights[1] has size 400
Weights[2] has size 200
Biases[0] has size 20
Biases[1] has size 20
Biases[2] has size 10
**************************************************
Looks like our network is initialized properly. At the initial state, the weights and biases are random, and then we train; therefore, no two training runs are identical, but they should produce consistent results. We've already prepared our data earlier, now we just need to read it. The images are in PNG format, and while I could have tried to read PNG files from scratch too, I decided that wasn't necessary and used the Python library Pillow to read the pixel values for me.
python
from PIL import Image

def get_image_pixels(file_path: str) -> tuple[str, list[float]]:
    image = Image.open(file_path)
    pixels = list(image.getdata())
    pixels = [pixel / 255 for pixel in pixels]
    label = file_path.split("/")[-1].split("_")[1]
    label = label.removesuffix(".png")

    return (label, pixels)
This function allows us to grab the pixels as numeric values in [0, 1], representing the image from black to white. Below is a sample of what the pixel array looks like:
Array of numbers that display pixel values
Individual pixel values (after normalization) of one sample from the MNIST dataset.
This is what the output would look like if we printed out the pixels array, 28 numbers at a time. This is essentially the model's input, how it "sees" the numbers, the data it learns from. As humans, this looks unreadable to us but let's round these large decimals so we can see the faint shape of the number.
Array of numbers that display pixel values, structured into a square grid
Individual pixel values (after normalization, and rounding) of one sample from the MNIST dataset.
If you squint hard enough, you may be able to make out the shape of the number "3". Let's now color the larger numbers to make the shape easier to see.
Array of numbers that display pixel values, with larger numbers colored, revealing a '3'
Individual pixel values (after normalization, rounding, and coloring by their value) of one sample from the MNIST dataset.
After coloring the numbers by their relative values, we can see the shape of the number clearly. The rounding we did to the pixel values was only for visual demonstration purposes. We shouldn't pass the rounded values to the model, otherwise we risk losing detail in our training data, which may hurt the model's performance.

Let's write our training algorithm now. We will go over all 60k training samples and, for each, perform a forward pass to populate activations, a backward pass to calculate new weights, and then apply the new weights to the network. Every pass over the 60k images is one epoch. We've mentioned it before, but this is called stochastic gradient descent since we update weights per sample. Other training schemes exist, but we are performing SGD for simplicity.
python
def train_stochastic(epochs: int):
    # ... some setup code for random initial weights and biases ...

    network = Network(
        weights=initial_weight_layers,
        biases=initial_biases,
        learning_rate=LEARNING_RATE,
    )

    network.print_network()

    training_file_paths, validation_file_paths = get_file_paths()

    for i in range(epochs):
        random.shuffle(training_file_paths)

        for sample_path in training_file_paths:
            label, image_pixels = get_image_pixels(sample_path)
            model_input = image_pixels

            expected_output = get_expected_output(label)

            # Populate neuron activations across network
            network.forward(model_input)

            new_weights = network.backward(expected_output)

            network.weights = new_weights

        print(f"epoch={i + 1}")

        # Calculate and record error and accuracy on the validation data set
        validate(validation_file_paths, network)

    network.save_network()
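The loop above relies on a get_expected_output helper that I haven't shown. For a digit classifier it boils down to one-hot encoding the label; a sketch of what it could look like (not necessarily the exact implementation):
python
def get_expected_output(label: str) -> list[float]:
    # One-hot encode the digit: the target is 1.0 for the correct digit's
    # output neuron and 0.0 for the other nine.
    expected = [0.0] * 10
    expected[int(label)] = 1.0
    return expected

# get_expected_output("3") -> [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]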
That's pretty much it for SGD. For every epoch, we will loop over every sample in the training data set and learn from each sample individually. The network's weights update per sample. Let's train for 5 epochs and then plot the error and accuracy over the validation data set!

Model Analysis

Graph of network error, validation data accuracy, over epochs
Model error and validation data accuracy over epochs (5 epochs).
We trained for 5 epochs and almost reached 90% accuracy on the validation data set! We can clearly see the error being minimized as the validation accuracy increases. The validation data set consists of images the model has not seen during training, therefore it is perfect for testing whether the model is actually learning or just memorizing the data.

These are promising results. I purposefully included epoch 0 (no training) in this graph just to illustrate how the model performs without training. The model basically just guesses, so the accuracy hovers around 10%, as expected. At every initialization the model is populated with random weights and biases, therefore exact accuracy numbers will change between runs but should remain close.

We can see that the error is being minimized towards zero; however, it looks like the model could get closer to convergence. Let's try training a new model for 10 epochs and see what results we get!
Graph of network error, validation data accuracy, over epochs
Model error and validation data accuracy over epochs (10 epochs).
Nice, it seems like the model is becoming more accurate over epochs. Obviously, this won't happen forever; there's a certain point a model cannot learn beyond without specialized optimizations (depending on a ton of other factors such as data quality, model architecture, hyperparameters, etc.). The model is learning less after each epoch, but judging by the last few epochs, it is still making notable improvements. Let's train a new model for another 10 epochs (20 total) to see how far we can push it.
Graph of network error, validation data accuracy, over epochs
Model error and validation data accuracy over epochs (20 epochs).
It's looking like the model stops learning around epoch 15, which is why its accuracy remains fairly constant afterwards. This is common when the model has hit a representational limit. The model I've developed is fairly simplistic, so there's a lot of room to improve. To improve the model's ability to learn, we can do many things, such as:
  • Support variable sized hidden layers. Modern model architectures have hidden layers that vary in size to optimize for learning certain kinds of data, coupled with several different activation functions within layers
  • Adjust biases, not just weights. When the network is initialized, we set random numbers as biases for each neuron, and these are never adjusted during back propagation. Modern networks train biases as well, which helps shift decision boundaries for better accuracy
  • Use training optimization methods like dynamically increasing/decreasing the learning rate depending on how the model is training per epoch. If the model gets "stuck" learning during training (stuck in local minima during gradient descent), bumping the learning rate can help the model perform gradient descent properly. Conversely, lowering the learning rate can help the model make smaller and more precise changes if the model is oscillating accuracy or error
I am thinking I will make follow up posts about these enhancements to my implementation, hopefully reaching near perfect accuracy, but for my first shot at a neural network, I am pretty happy with the results so far!

Since biases are initialized randomly and are not updated during training in my implementation, training over the exact same data without making any code changes can actually produce better accuracy (purely due to the pseudo-randomness of the frozen biases influencing activations). Therefore, I trained over 20 epochs a handful of times and was able to reach 95% accuracy, which is the model used for the demo at the top of this page.

Let's plot a confusion matrix for the model to see its accuracy per possible outcome. We can do this by constructing a grid of all of the possible outcomes (digits 0-9) and, for each actual/predicted pair, counting how often the model's predictions were correct versus incorrect, as in the sketch below.
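Here is a sketch of how such a confusion matrix could be computed from the validation set, reusing the helpers from earlier (the function name is mine, not from the project code):
python
def build_confusion_matrix(validation_file_paths: list[str], network: Network) -> list[list[int]]:
    # matrix[actual][predicted] counts how often each true digit was
    # classified as each predicted digit.
    matrix = [[0] * 10 for _ in range(10)]

    for sample_path in validation_file_paths:
        label, image_pixels = get_image_pixels(sample_path)
        output = network.forward(image_pixels)

        predicted = output.index(max(output))  # most activated output neuron
        matrix[int(label)][predicted] += 1

    return matrix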
Confusion matrix
The confusion matrix of the model, calculated from the 10k MNIST validation images
Nice, this image tells us a lot about the model. For example, it seems like the model is really good at predicting the number 1, and it confuses quite a few digits with other similar looking digits. Most notably, it confuses 4 with 9, and 3 with 5. Both pairs can look very similar in handwriting, so it makes sense! Also, the highest confusion count is only 36 samples out of roughly 1k samples per digit, which sounds pretty good!

Exporting the Model & Live Demo

In order to create the live demo at the top of this page, I had to come up with a way to export the trained model. The model's weights only exist in memory during training, so you need to export them to disk in some format. To stick with the theme of simplicity, and because the model is small, I decided to just export the model's weights into a JSON file (don't laugh). This was very simple in my model implementation:
python
class Network():
    # ... rest of implementation ...

    def save_network(self, file_name: str = "model.json"):
        model_data = {}

        model_data["weights"] = []

        for weight_layer_index, weight_layer in enumerate(self.weights):
            weight_data = {}
            weight_data["weight_layer_index"] = weight_layer_index
            weight_data["weights"] = []

            for weight in weight_layer:
                weight_data["weights"].append(weight)

            model_data["weights"].append(weight_data)

        model_data["biases"] = []

        for bias_layer_index, bias_layer in enumerate(self.biases):
            bias_data = {}
            bias_data["bias_layer_index"] = bias_layer_index
            bias_data["biases"] = []

            for bias in bias_layer:
                bias_data["biases"].append(bias)

            model_data["biases"].append(bias_data)

        model_data["input_layer_size"] = len(self.layers[0])
        model_data["output_layer_size"] = len(self.layers[-1])
        model_data["hidden_layer_size"] = len(self.layers[1]) # Same size for all hidden layers
        model_data["learning_rate"] = self.learning_rate
        model_data["hidden_layer_count"] = self.hidden_layer_count

        with open(file_name, "w") as file:
            file.write(json.dumps(model_data))
This is pretty much as simple as it gets, no fancy code here. The generated JSON file contains the architecture of the model alongside its weights and biases. From there, we can reconstruct the model and run forward propagation only. To integrate the model we trained above, I exported it and wrote some TypeScript code to run forward propagation against the weights and biases. The model's JSON file is accessible here if you would like to view it.
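For completeness, reconstructing a Network from that JSON file for inference-only use could look something like this sketch in Python (the TypeScript code in the demo follows the same idea; load_network is a name I'm making up here):
python
import json

def load_network(file_name: str = "model.json") -> Network:
    with open(file_name) as file:
        model_data = json.load(file)

    return Network(
        hidden_layer_count=model_data["hidden_layer_count"],
        hidden_layer_size=model_data["hidden_layer_size"],
        input_layer_size=model_data["input_layer_size"],
        output_layer_size=model_data["output_layer_size"],
        weights=[layer["weights"] for layer in model_data["weights"]],
        biases=[layer["biases"] for layer in model_data["biases"]],
        learning_rate=model_data["learning_rate"],
    )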

Conclusion

This project was very fun. I enjoyed learning the very primitive and fundamental bits of modern AI, researching, writing code, and piecing it all together myself. I set out to not look at any other code and not use AI assistance during this project, to challenge myself to learn as much as possible.

This project was not as smooth from start to finish as it might seem. I had several issues during my first revision of the back propagation algorithm: I was miscalculating partial derivatives by selecting the wrong weights for neurons, I fed the model bad data, and so forth. However, with a few print statements I was able to debug these issues pretty quickly. The first time I saw the model train to >90% accuracy was very exciting.

I also started this project back in early 2025, but shelved it because I started a new job and my focus drifted. I'm sure others can understand the feeling of starting a project with a ton of excitement, only for it to linger. This project was on my mind throughout the entirety of 2025, and I am very happy to finally finish this first iteration of it.

There are tons of improvements I can make to this implementation, which I will probably make and write about in the coming months, so watch out for those! I also have a lot of ideas for other projects I would like to explore. The tech world has changed dramatically in the last year and there are many topics to explore.