I'm struggling because the teacher doesn't go into the math behind “batches and epochs”.

Situation:
We have a dataset of 627 rows to train the machine.

The teacher explains:
batchsize=32 (ok) → epochs = number of times the machine gets trained on the same dataset.

My question is:
Why would I load the same set of data into the machine more than once? Loading the same data once or 1000000 times doesn't have an impact on the result, even if we shuffle the rows that fill the batches (in the end it is 627/627).

After I brainstormed:

Is it like a row on a lottery ticket? Meaning: 627 rows → the shuffle algorithm picks 32 rows randomly → these rows get passed as a batch into the machine → epochs++ (32/627) and start over (0/627)? That means the same rows could get picked by the shuffle algorithm again (which somehow makes sense, given that the dataset is also just the result of 1 event and not of 100000 events).

But the teacher mentioned "loading the complete dataset → epochs++".

Because the machine is “learning” by gradient descent. Meaning it calculates the gradient (the n-dimensional slope) of the loss function at its current point, where the loss measures how far its current prediction for the data is from the actual result. Then it moves one step downhill along that gradient, on the assumption that this gets it closer to a minimum. The gradient is computed by back-propagating the error through the machine’s prediction calculations, and the step is then applied to its weights.
After that, the machine is supposedly slightly better at predicting.

However, it did NOT completely memorize all the data, because that would be overfitting.
It did NOT calculate the actual mathematical minimum of the loss function, because that would be computationally freaking expensive.
Because it didn’t calculate the actual minimum, but only took a step in its “direction”, the step has to be small, so it doesn’t overshoot the target and end up worse at predicting.
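To make the "small step" point concrete, here is a minimal sketch on a made-up one-parameter loss, (w - 3)^2. The loss, the starting point and the learning rates are all assumptions picked for illustration, not anything from the course. It shows that one small step only gets you part of the way, and that a step that is too large overshoots and gets worse.

```python
# Toy loss (assumption): loss(w) = (w - 3)^2, with its minimum at w = 3.
def grad(w):
    # Derivative of (w - 3)^2 with respect to w.
    return 2.0 * (w - 3.0)

def step(w, lr):
    # One gradient descent update: move against the gradient, scaled by lr.
    return w - lr * grad(w)

def descend(w, lr, n_steps):
    # Repeat the update n_steps times and return the final parameter.
    for _ in range(n_steps):
        w = step(w, lr)
    return w

one_small_step = step(0.0, lr=0.1)          # closer to 3, but far from done
many_small_steps = descend(0.0, 0.1, 50)    # very close to 3
too_big = descend(0.0, 1.1, 5)              # overshoots, distance to 3 grows

print(one_small_step, many_small_steps, too_big)
```

With a small learning rate a single update is never enough, which is exactly why the same data gets used for many updates.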

Soooo the model didn’t get all the useful information that was in the dataset.
Hence seeing the exact same data again might still be useful. Hence you show it again.
Putting it into batches and shuffling is there so the model doesn’t learn accidental patterns from the order of the rows - by virtue of not allowing such patterns to form.
Also, with batches the loss and the gradient descent step are just a lot cheaper to calculate, while still allowing a decent fit to take place.
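As a rough illustration of why a second (or fiftieth) pass over the same rows still helps, here is a small sketch that fits a straight line with mini-batch gradient descent. The 627 rows and batch size 32 come from the question; the toy data (y = 2x + 1), the learning rate and the function names are assumptions made up for this example.

```python
import random

def make_data(n=627):
    # Toy dataset (assumption): x evenly spaced in [0, 1], y = 2x + 1.
    return [(i / (n - 1), 2.0 * (i / (n - 1)) + 1.0) for i in range(n)]

def mse(w, b, data):
    # Mean squared error of the line w*x + b over the whole dataset.
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def train(data, epochs, batch_size=32, lr=0.1, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    rows = list(data)
    for _ in range(epochs):
        rng.shuffle(rows)  # new random order for every epoch
        for start in range(0, len(rows), batch_size):
            batch = rows[start:start + batch_size]
            # Gradient of the batch's mean squared error w.r.t. w and b.
            gw = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            gb = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= lr * gw
            b -= lr * gb
    return w, b

data = make_data()
w1, b1 = train(data, epochs=1)
w50, b50 = train(data, epochs=50)
loss_1 = mse(w1, b1, data)
loss_50 = mse(w50, b50, data)
print(loss_1, loss_50)
```

In this toy setup the loss after 50 epochs is much lower than after 1 epoch, even though no new data was added. The model simply had not extracted everything from the 627 rows after one pass.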

For shuffling, the training algorithm should shuffle the dataset, then train on it batch by batch, so every row lands in exactly one batch. That’s one epoch.
When that is finished, it shuffles again to create a new random order and goes through it in batch-size chunks again.
This is done number-of-epochs times. That is, unless you use early-stopping methods to stop sooner, because the model can still overfit on the dataset and basically “memorize” it, instead of making good general predictions.
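The shuffle-then-batch procedure can be sketched like this, using the 627 rows and batch size 32 from the question. The helper names epoch_batches and stop_epoch, the patience value and the example validation losses are hypothetical, and keeping the smaller last batch is an assumption (some libraries let you drop it instead).

```python
import random

def epoch_batches(n_rows, batch_size, rng):
    # Shuffle all row indices once, then cut the shuffled order into batches.
    # Every row appears in exactly one batch per epoch (no lottery re-draws).
    order = list(range(n_rows))
    rng.shuffle(order)
    for start in range(0, n_rows, batch_size):
        yield order[start:start + batch_size]

def stop_epoch(val_losses, patience=3):
    # Simple early stopping: return the epoch index at which training stops,
    # i.e. once the validation loss has not improved for `patience` epochs.
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return len(val_losses) - 1

rng = random.Random(0)
first_epoch = list(epoch_batches(627, 32, rng))
second_epoch = list(epoch_batches(627, 32, rng))

print(len(first_epoch))       # 20 batches: 19 of size 32 plus one of size 19
print(len(first_epoch[-1]))   # 19
```

So one epoch is 20 weight updates here (627 = 19 * 32 + 19), and 10 epochs would be 200 updates on the same 627 rows, just in a different order each time.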