5 Infographics to Better Understand Artificial Intelligence

Written By Andrew Rice

As AI and machine learning become a bigger part of day-to-day conversations, it may be helpful to familiarize yourself with some common concepts. Here I present 5 infographics that I hope will illuminate some of the core concepts useful for understanding, interacting with, or investing in machine learning or artificial intelligence models and strategies.

One note before I jump in: I’ve done my best to reduce the number of technical terms used to convey these concepts. The terms that remain are either unavoidable or necessary to explain the concepts. Many of them are explained further later in this article, and as you go, you will see how these concepts fit together and the bigger picture will come into focus.

  1. Neural Network vs. Deep Learning
  2. Supervised vs. Unsupervised Learning
  3. Training, Validation and Testing Data
  4. Optimizer
  5. Loss Function

1. Neural Network vs. Deep Learning

Neural networks, and deep learning in particular, are the bedrock on which many modern AI applications are built, including facial recognition software, spam detection, sentiment analysis, chatbots, image generators and more. You may have heard one or both of these terms used in relation to artificial intelligence and wondered what they are; the truth is, they are largely different ways to describe the same thing.

This image shows a simplified network architecture for a basic neural network compared to a deep learning neural network.

Created internally by Algorithmic Investment Models and Beaumont Capital Management.

 

Input data is connected to a “hidden layer” of nodes. Each node has an associated weight. The weights are multiplied by the incoming data and then submitted to an output layer, where an activation function[1] is applied to the incoming values to produce an output. The output could be a categorization (e.g., cat or dog), a predicted value (e.g., the forward 1-month return for a stock), or a matrix of data that can be translated into an image, to name a few examples.

The real difference between neural networks and deep learning (or deep neural networks) is the number of hidden layers between the input data and the output results. Models referred to as deep learning have many hidden layers, whereas a basic neural network may have as few as one. The algorithms[2] by which they are optimized are the same.
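
For readers comfortable with a little code, here is a minimal sketch, using the NumPy library, a common activation function (ReLU), and made-up layer sizes, of how the same forward-pass logic applies to a basic network and a deep one; the only difference is how many hidden layers the data flows through.

    import numpy as np

    def relu(x):
        # Activation function: returns 0 for negative inputs, x otherwise.
        return np.maximum(0, x)

    def forward(x, layers):
        # For each layer: multiply the incoming data by the layer's weights,
        # add a bias, and apply the activation function.
        for weights, bias in layers:
            x = relu(x @ weights + bias)
        return x

    rng = np.random.default_rng(0)

    # A basic neural network: a single hidden layer between input and output.
    shallow = [(rng.normal(size=(4, 8)), np.zeros(8)),
               (rng.normal(size=(8, 1)), np.zeros(1))]

    # A "deep" network: the same idea, just more hidden layers stacked in sequence.
    deep = [(rng.normal(size=(4, 8)), np.zeros(8)),
            (rng.normal(size=(8, 8)), np.zeros(8)),
            (rng.normal(size=(8, 8)), np.zeros(8)),
            (rng.normal(size=(8, 1)), np.zeros(1))]

    sample = rng.normal(size=(1, 4))   # one input record with 4 features
    print(forward(sample, shallow), forward(sample, deep))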

2. Supervised vs. Unsupervised Learning

There are two primary reasons an analyst might train a machine learning model: (1) to try to predict some future event or state, or (2) to better understand a set of data. Supervised machine learning methods are often used for the first purpose, and unsupervised machine learning methods are often used for the second.

This image shows the generalized algorithmic flow for a supervised machine learning method and an unsupervised machine learning method.

Created internally by Algorithmic Investment Models and Beaumont Capital Management.

Supervised machine learning models are provided with the “answers” by the human initiating the model. These models are generally attempting to return a predicted result as close as possible to the “right answer” (or “actual result”) provided in the training data. An example of a supervised model would be a neural network trained to read handwriting. Each image this neural network trains on would have an associated label or category that the algorithm tries to match accurately. For example, an image of a handwritten 9 would be associated with the supplied label “9”.

The flow of modeling in supervised learning works like this: the model is provided with input data; it applies a function or formula to the incoming data to produce an output answer; the output answer is compared to the human-supplied answers (the “right answers”); a loss is computed; and that loss is fed back into the model to adjust the function weights, hopefully producing a more accurate answer (a smaller loss) the next time through.
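
To make that loop concrete, here is a hedged sketch of the supervised flow using a simple linear model, made-up data, and a basic gradient-descent update (optimizers and loss functions are covered in sections 4 and 5); the variable names and numbers are purely illustrative, not part of any real strategy.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 3))               # input data: 100 records, 3 features
    y = X @ np.array([2.0, -1.0, 0.5])          # the human-supplied "right answers"

    weights = np.zeros(3)                       # the model starts knowing nothing
    learning_rate = 0.1

    for step in range(50):
        predictions = X @ weights               # apply the current formula to the inputs
        errors = predictions - y                # compare to the "right answers"
        loss = np.mean(errors ** 2)             # summarize the errors as a single loss number
        gradient = 2 * X.T @ errors / len(y)    # how the loss changes as each weight changes
        weights -= learning_rate * gradient     # feed the loss back to adjust the weights

    print(loss, weights.round(2))   # loss shrinks toward 0; weights approach [2, -1, 0.5]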

Unsupervised machine learning models are not provided with “answers” but rather seek to uncover underlying patterns within the supplied data. An example of an unsupervised model would be a clustering algorithm such as k-means: using the supplied data, the algorithm attempts to group the records into a targeted number of clusters whose members are as similar to one another as possible. In this case, the human initiating the model does not have any preconceived notion of what the cluster membership should look like; in other words, the model isn’t attempting to replicate a “right answer”.

The flow of modeling in unsupervised learning removes the “my answers” step (the human-supplied answers) from the supervised learning flow. Instead, some loss function, such as the variance of the output value among group members, is applied directly to the model’s answers and fed back into the modeling process until no further improvements can be made.
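
As an illustrative sketch of unsupervised clustering, the snippet below uses k-means from the scikit-learn library on made-up data; no “right answers” are supplied anywhere, and the number of clusters is simply a choice made by the analyst.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(2)
    records = rng.normal(size=(150, 2))   # 150 records, 2 features, no labels supplied

    # Ask for a targeted number of clusters (3 here, chosen by the analyst).
    model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(records)

    print(model.labels_[:10])         # which cluster each of the first 10 records fell into
    print(model.cluster_centers_)     # the "center" of each discovered group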

 

3. Training, Validation and Testing Data

To avoid overfitting their models to the training dataset, quantitative analysts will generally split their data into three distinct sets:

    • Training data: This is the data that the machine learning algorithm will use to develop a model that hopefully makes accurate predictions or generates some other useful output.
    • Validation data: Many machine learning algorithms will use validation data for determining whether they have reached a decent stopping point, generally comparing losses on the validation data in the current stage to losses in the prior stage.
    • Testing data: Machine learning algorithms won’t always hold out a third data set for testing, but when they do, it is generally because the algorithm itself has associated “hyperparameters” – values that control how the model learns, such as the speed at which losses are incorporated into weight changes. To ensure that those parameters aren’t overfit to the training/validation data, a third dataset is held out to confirm the model works out of sample.

To provide a more specific example of how one might bucket a dataset into three parts like this, suppose you wanted to build a model on S&P 500 data from 2013 – 2023 to attempt to predict its price level one month out. I’d start by taking all the dates from 2020 onward and putting those into a “Test” dataset. Then, I’d go through the remaining daily records from 2013 through 2019 and flip a coin for each date: heads, and it goes in the “Train” dataset; tails, and it goes in the “Validation” dataset. (Of course, I would not literally flip a coin but would use a computer algorithm to flip it virtually.)
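
Here is a rough sketch of that split in code, using the pandas library and hypothetical dates only (no actual S&P 500 prices); the 2020 cutoff and the “coin flip” mirror the description above.

    import numpy as np
    import pandas as pd

    # Hypothetical daily business dates from 2013 through 2023 (prices omitted for brevity).
    data = pd.DataFrame({"date": pd.bdate_range("2013-01-01", "2023-12-31")})

    rng = np.random.default_rng(3)

    # Everything from 2020 onward is held out as the "Test" set.
    is_test = data["date"] >= "2020-01-01"

    # For the remaining 2013-2019 dates, "flip a coin": heads -> Train, tails -> Validation.
    heads = rng.integers(0, 2, size=len(data)) == 0
    data["split"] = np.where(is_test, "test", np.where(heads, "train", "validation"))

    print(data["split"].value_counts())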

This image shows how a sample of 10 data records might be split up and assigned to various training/testing/validation datasets.

Created internally by Algorithmic Investment Models and Beaumont Capital Management.

There are a few ways the data may be split into these two or three buckets. A simple randomization process is the most straightforward. However, it may be desirable to split datasets sequentially in time so the out-of-sample validation or testing data also occurs during a time period the model didn’t see in the training data. This can provide greater confidence that the algorithm will work in a real-world environment.

One of the greatest pitfalls of using a model to predict some future or unknown event is that it will be too biased toward the data it was trained on (the in-sample data) and therefore will not work in a real-world setting. In fact, a model with a large number of parameters (such as a deep learning model) might effectively “memorize” the training dataset. Such a model would look extremely accurate during training but would likely be wildly inaccurate when implemented in a real-world environment. Thoughtful sample design can help mitigate these problems.

 

4. Optimizer

An optimizer is an algorithm used in machine learning and artificial intelligence to take the output from a loss function and translate it into changes in model weights (hopefully in a direction that reduces the loss computed at the next step). Think of the optimizer as a golf coach: they watch the player hit several balls and have them make slight tweaks here and there to their stance, hand positioning, club speed, etc. Then they observe some more swings and make further tweaks as necessary, always with an eye toward maximizing distance and accuracy.

This image shows a generalized process flow for how an optimizer uses the data provided to develop a formula that produces “accurate” output (accurate as defined by the human designing the model).

Created internally by Algorithmic Investment Models and Beaumont Capital Management.

 

The various machine learning methods differ primarily in their optimization algorithms. Neural networks use gradient descent via backpropagation as their optimizer (a future part 2 of this blog post might cover these terms in more detail). For genetic algorithms, the optimizer takes the most successful formulas of a given generation (those with the lowest computed loss) and “breeds” and “mutates” them in an effort to find a model that results in less loss than the prior generation.

Particularly in deep learning, but also in other machine learning approaches, optimizers often have parameters associated with them (referred to as hyperparameters). These parameters determine how weights are adjusted in proportion to the computed loss.
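
As a simple illustration, the sketch below shows the most basic optimizer update, plain gradient descent, where the learning rate is the hyperparameter that scales how strongly the loss signal changes each weight; the weights and gradients here are made-up numbers.

    def gradient_descent_step(weights, gradients, learning_rate=0.01):
        # The optimizer's job: translate the loss signal (here, the gradient of the
        # loss with respect to each weight) into weight changes. The learning rate
        # is the hyperparameter that controls how large each adjustment is.
        return [w - learning_rate * g for w, g in zip(weights, gradients)]

    # Two weights and their gradients from the most recent loss computation (made up).
    print(gradient_descent_step([0.5, -1.2], [0.8, -0.3], learning_rate=0.1))
    # -> roughly [0.42, -1.17]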

 

5. Loss Function

A loss function is a mathematical formula that is used to compute the difference between the model’s prediction and the “right answers” for a training dataset (assuming the model is supervised).
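
As an example, here is one common loss function, mean squared error, sketched in a few lines of Python; the predictions and “right answers” are made-up numbers used purely for illustration.

    import numpy as np

    def mean_squared_error(predictions, right_answers):
        # One common loss function: the average squared difference between the
        # model's predictions and the human-supplied "right answers".
        return np.mean((np.asarray(predictions) - np.asarray(right_answers)) ** 2)

    print(mean_squared_error([2.5, 0.0, 2.1], [3.0, -0.5, 2.0]))   # roughly 0.17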

This image shows the relationship between the weight assigned to a variable and the resulting “loss” computation for the model. The dots with numbers show where weights were set for subsequent training steps 1 through 5 and their resulting loss.

Created internally by Algorithmic Investment Models and Beaumont Capital Management.

Loss functions are generally matched to the specific optimization approach being taken as well as the type of prediction being made. There are often deep-in-the-weeds mathematical reasons for preferring one loss function over another, but for the layperson the important things to know are that:

1. The loss function is usually measuring the difference between the model prediction and the “right answers”,

2. Machine learning models are nearly always aiming to minimize the loss, and

3. The change in loss from one training cycle to the next usually governs when the machine learning algorithm will halt training.
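
To illustrate point 3, here is a hedged sketch of one simple stopping rule: training halts once the improvement in loss from one cycle to the next falls below a small tolerance (in practice this check is often run on the validation data described in section 3). The loss values and tolerance are invented for illustration.

    def should_stop(previous_loss, current_loss, tolerance=1e-4):
        # Halt training once the loss stops improving meaningfully
        # from one training cycle to the next.
        return previous_loss - current_loss < tolerance

    losses = [0.90, 0.45, 0.30, 0.295, 0.29495]   # made-up losses over five cycles
    for prev, curr in zip(losses, losses[1:]):
        print(prev, "->", curr, "stop?", should_stop(prev, curr))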

Note how in the image above, an optimizer will not always find the absolute best weight settings to achieve globally minimal loss. At step four, the size of the weight adjustment was not large enough to reach the global minimum, and a potentially adequate local minimum was found instead.

[1] Think of an activation function as a mathematical formula that produces values in an expected range, for example, limiting values to between 0 and 1.

[2] An algorithm is a set of mathematical instructions that, when followed, transforms an input value into some sort of output value.