Deep Learning in Scala 3 from scratch

Alexey Novakov, Rhein-Main Scala Enthusiasts

About Me
• Solution Architect at EPAM Germany (BigData, Cloud)
• Functional programmer
• 5 years working with Scala, 10 years with Java
• I often talk at the Rhein-Main Scala Enthusiasts Meetup
• I like music, guitars and astronomy
Twitter: @alexey_novakov
Blog: https://novakov-alexey.github.io/

Goal of the Talk
• Get an idea of how a Deep Neural Network computes
• Use Scala 3 features in the implementation
• Inspire someone to write a good, long-lasting Scala library for Deep Learning

Agenda
1. Intro to Deep Learning
2. Tensors
3. Network Implementation
4. Model Training, Test
5. Metrics Visualization

INTRO TO DEEP LEARNING

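Neuron Model
[Figure: biological neuron model next to the artificial neuron model]
Artificial neuron model:
• Inputs: x1, x2, …, xn
• Parameters: weights w1, w2, …, wn and a bias b
• Summing junction: $z = \sum_{i=1}^{n} x_i w_i + b$
• Non-linear activation function: $f(z)$
• Output: $Y = f(z)$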
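A minimal sketch of the artificial neuron above, written with plain arrays rather than the tensors introduced later. The names `neuron` and `sigmoid` are illustrative and are not taken from the mini-library.

```scala
// Sigmoid is used here only as one example of a non-linear activation.
def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

// One neuron: weighted sum of the inputs plus bias, then activation.
def neuron(x: Array[Double], w: Array[Double], b: Double, f: Double => Double): Double =
  require(x.length == w.length, "inputs and weights must have the same length")
  val z = x.zip(w).map((xi, wi) => xi * wi).sum + b // summing junction
  f(z)                                              // non-linear activation

// With zero weights and zero bias z = 0, and sigmoid(0) = 0.5.
val neuronOutput = neuron(Array(1.0, 2.0), Array(0.0, 0.0), 0.0, sigmoid)
```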

Deep Neural Network
For a single neuron only: $z = \sum_{i=1}^{n} X_i W_i + b$ (summing junction), output $f(z)$.
Layers:
• Input: input data (encoded)
• Hidden: trained weights
• Output: predicted value in [0 .. 1]
Dense (fully-connected) layer: every neuron in one layer is connected to every neuron in the next layer.
A deep network has multiple hidden layers for more efficient learning.

Deep Feed Forward Network
1. Transforms patterns from input to output (forward propagation)
2. Consists of dense layers
3. Has no back-loops
4. Backpropagation plus Gradient Descent are the learning algorithms commonly used to update the weights and biases
[Diagram labels: Training Data, Inputs, y, Loss/Cost function, Error (Delta), Training algorithm (Gradient Descent), Adjusting the weights (initialized randomly)]

Loss Curve
[Plot: loss (y-axis) over epoch (x-axis)]
Meaning:
- Lower is better
- The model learns its parameters while training

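Loss/Cost function

Problem: Regression
Output: Numerical
Loss function: 1. Mean Squared Error (MSE) / Quadratic Loss, 2. Mean Absolute Error (MAE), 3. Huber Loss, …
Formula (MSE): $MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

Problem: Classification
Output: Binary
Loss function: Binary Cross Entropy / Log Loss
Formula: $-\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$

Problem: Classification
Output: Single label, multiple classes
Loss function: Cross Entropy
Formula: $-\frac{1}{N} \sum_{i=1}^{N} y_i \log(\hat{y}_i)$

Problem: Classification
Output: Multiple labels, multiple classes
Loss function: Binary Cross Entropy
Formula: same as Binary Cross Entropy above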
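The MSE and Binary Cross Entropy formulas can be sketched over plain arrays as follows. This is not the library's `Loss[T]` implementation; the `eps` clipping is an assumption added here so that `log(0)` never occurs.

```scala
// Mean Squared Error: average of squared differences.
def mse(y: Array[Double], yHat: Array[Double]): Double =
  require(y.length == yHat.length && y.nonEmpty)
  y.zip(yHat).map((a, p) => math.pow(a - p, 2)).sum / y.length

// Binary Cross Entropy / Log Loss.
// eps is a hypothetical safeguard: predictions are clipped to [eps, 1 - eps].
def binaryCrossEntropy(y: Array[Double], yHat: Array[Double], eps: Double = 1e-12): Double =
  require(y.length == yHat.length && y.nonEmpty)
  val terms = y.zip(yHat).map { (a, p) =>
    val q = math.min(math.max(p, eps), 1 - eps)
    a * math.log(q) + (1 - a) * math.log(1 - q)
  }
  -terms.sum / y.length
```

For a single positive record predicted at 0.5, the log loss is ln 2, which is the loss of a coin flip.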

How do we feed one or many data records into a network that has N layers with multiple neurons each?
Network: 12 x 6 x 6 x 1

Linear Algebra: Matrix Multiplication
First hidden layer, N = 12 inputs.
Single record: [1 x 12] · [12 x 6] = [1 x 6] (dot product)
Batch of 16 records: [16 x 12] · [12 x 6] = [16 x 6] (broadcasting)

Activation Functions: f(z) = a
f (in the literature also written as g or 𝝋) is applied element-wise for each neuron: $f(x^T \cdot w + b)$
[Figure: plots of activation functions, with output ranges such as (0, 1) and [-1, 1]]
Image source: https://www.researchgate.net/publication/315667264_Efficient_Processing_of_Deep_Neural_Networks_A_Tutorial_and_Survey
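To make the shapes concrete, here is a sketch of one dense layer applied to a batch of 16 records with 12 features each. The helper names (`Matrix`, `forward`) and the constant weights are assumptions for illustration only; the bias is added to every row, which is what broadcasting amounts to here.

```scala
type Matrix = Array[Array[Double]]

// Plain matrix multiplication: [n x k] · [k x m] = [n x m]
def matMul(a: Matrix, b: Matrix): Matrix =
  require(a.head.length == b.length, "columns of a must equal rows of b")
  val bColumns = b.transpose
  a.map(row => bColumns.map(col => row.zip(col).map((x, w) => x * w).sum))

// z = x · w + b (bias broadcast over rows), a = f(z) applied element-wise
def forward(x: Matrix, w: Matrix, b: Array[Double], f: Double => Double): Matrix =
  matMul(x, w).map(row => row.zip(b).map((z, bias) => f(z + bias)))

val relu: Double => Double = z => math.max(0.0, z)

val batch: Matrix = Array.fill(16, 12)(1.0)  // 16 records, 12 features
val weights: Matrix = Array.fill(12, 6)(0.5) // first hidden layer: 12 x 6
val bias = Array.fill(6)(-1.0)
val activity = forward(batch, weights, bias, relu) // shape 16 x 6
```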

Dot Product (Scala 3)
Math rule: the number of columns in the first matrix must equal the number of rows in the second.

    import scala.math.Numeric.Implicits.infixNumericOps
    import scala.reflect.ClassTag

    def matMul[T: ClassTag](
        a: Array[Array[T]],
        b: Array[Array[T]]
    )(using n: Numeric[T]): Array[Array[T]] =
      // Math rule, checked before any work is done
      assert(
        a.head.length == b.length,
        "The number of columns in the first matrix should be equal to the number of rows in the second"
      )
      val rows = a.length
      val cols = b.head.length
      val out = Array.ofDim[T](rows, cols)
      for i <- 0 until rows do
        for j <- 0 until cols do
          var sum = n.zero
          for k <- b.indices do
            sum = sum + (a(i)(k) * b(k)(j))
          out(i)(j) = sum
      out

GENERIC MULTI-DIMENSIONAL ARRAY

Shape
Any of these can have a different shape:
• x
• weight
• bias
• z, a
• y, yHat
Possible shapes: scalar (1); vector, row or column (n); matrix (n, m); cube (n, m, k)

A tensor is an N-dimensional array of data
• Rank 0 tensor: scalar
• Rank 1 tensor: vector
• Rank 2 tensor: matrix
• Rank 3 tensor
• Rank 4 tensor

Tensor in Scala

    sealed trait Tensor[T]:
      def length: Int
      def shape: List[Int]

    // Scalar
    case class Tensor0D[T: ClassTag](data: T) extends Tensor[T]:
      // length is declared before shape, because shape reads it during initialization
      override val length: Int = 1
      override val shape: List[Int] = length :: Nil

    // Vector
    case class Tensor1D[T: ClassTag](data: Array[T]) extends Tensor[T]:
      override def shape: List[Int] = List(data.length)
      override def length: Int = data.length

    // Matrix
    case class Tensor2D[T: ClassTag](
        data: Array[Array[T]]
    ) extends Tensor[T]:
      override def shape: List[Int] =
        val (r, c) = (data.length, data.headOption.map(_.length).getOrElse(0))
        List(r, c)
      override def length: Int = data.length

Operations (Scala 3)

    extension [T: ClassTag: Numeric](t: Tensor[T])
      // dot product
      def *(that: Tensor[T]): Tensor[T] = TensorOps.mul(t, that)
      // Hadamard product: element-wise multiplication
      def multiply(that: Tensor[T]): Tensor[T] = TensorOps.multiply(t, that)
      def -(that: T): Tensor[T] = TensorOps.subtract(t, Tensor0D(that))
      def -(that: Tensor[T]): Tensor[T] = TensorOps.subtract(t, that)
      def +(that: Tensor[T]): Tensor[T] = TensorOps.plus(t, that)
      def +(that: T): Tensor[T] = TensorOps.plus(t, Tensor0D(that))
      def sum: T = TensorOps.sum(t)

DATASET

Churn Modeling
Binary classifier: does the customer exit the bank? Yes / No

Raw data (not encoded):

    RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
    1,15634602,Hargrave,619,France,Female,42,2,0,1,1,1,101348.88,1
    2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
    3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
    4,15701354,Boni,699,France,Female,39,1,0,2,0,0,93826.63,0

Input features as X: CreditScore, Geography, Gender, Age, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary
Target as Y: Exited

Data Preparation: Encoding

Raw categorical columns:

    Geography  Gender
    France     Female
    Spain      Female
    France     Female
    France     Female
    Spain      Female
    Spain      Male
    France     Male
    Germany    Female
    France     Male
    France     Male
    France     Male
    …          …

Encodings used: Label Encoding and One-hot Encoding.
Classes:
- France -> 0.0, Germany -> 1.0, Spain -> 2.0
- Female -> 0, Male -> 1

Encoded data (Geography becomes three one-hot columns in positions 2 to 4, Gender becomes one label column in position 5; the other columns keep the feature order):

    619.0 1.0 0.0 0.0 0.0 42.0 2.0 0.0 1.0 1.0 1.0 101348.88
    608.0 0.0 0.0 1.0 0.0 41.0 1.0 83807.86 1.0 0.0 1.0 112542.58
    502.0 1.0 0.0 0.0 0.0 42.0 8.0 159660.8 3.0 1.0 0.0 113931.57
    699.0 1.0 0.0 0.0 0.0 39.0 1.0 0.0 2.0 0.0 0.0 93826.63
    850.0 0.0 0.0 1.0 0.0 43.0 2.0 125510.82 1.0 1.0 1.0 79084.1
    645.0 0.0 0.0 1.0 1.0 44.0 8.0 113755.78 2.0 1.0 0.0 149756.71
    822.0 1.0 0.0 0.0 1.0 50.0 7.0 0.0 2.0 1.0 1.0 10062.8
    376.0 0.0 1.0 0.0 0.0 29.0 4.0 115046.74 4.0 1.0 0.0 119346.88
    501.0 1.0 0.0 0.0 1.0 44.0 4.0 142051.07 2.0 0.0 1.0 74940.5
    684.0 1.0 0.0 0.0 1.0 27.0 2.0 134603.88 1.0 1.0 1.0 71725.73
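A small sketch of both encodings over plain strings. The function names `labelEncode` and `oneHot` are hypothetical; the only assumption beyond the slide is that classes are numbered in sorted order, which reproduces the France/Germany/Spain mapping shown above.

```scala
// Label encoding: each distinct class gets an index, in sorted order.
def labelEncode(values: Seq[String]): Map[String, Double] =
  values.distinct.sorted.zipWithIndex.map((v, i) => v -> i.toDouble).toMap

// One-hot encoding: a row of zeros with a single 1.0 at the class index.
def oneHot(value: String, classes: Map[String, Double]): Array[Double] =
  val row = Array.fill(classes.size)(0.0)
  row(classes(value).toInt) = 1.0
  row

val geography = labelEncode(Seq("France", "Spain", "France", "Germany"))
val spain = oneHot("Spain", geography) // Spain has index 2, so the last position is set
```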

Data Preparation: Scaling
For all columns, for each value:

    (value(i, j) - stats.column(j).mean) / stats.column(j).stdDev

[Table: the encoded feature matrix after scaling; values are centered around zero, for example -0.32620511055784646, 0.9971540476049313, -0.5787069743095328, …]
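The scaling formula, sketched for a single column. This is not the library's `StandardScaler`; in particular, the use of the population standard deviation (dividing by n) is an assumption, and a constant column would divide by zero.

```scala
// Standard scaling of one column: subtract the mean, divide by the standard deviation.
def standardScale(column: Array[Double]): Array[Double] =
  val mean = column.sum / column.length
  // Assumption: population variance (divide by n, not n - 1)
  val variance = column.map(v => math.pow(v - mean, 2)).sum / column.length
  val stdDev = math.sqrt(variance)
  column.map(v => (v - mean) / stdDev)

// mean = 2.0 and stdDev = 1.0, so the result is [-1.0, 1.0]
val scaled = standardScale(Array(1.0, 3.0))
```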

Preprocessing API

    case class LabelEncoder[T: ClassTag: Ordering](
        classes: Map[T, T] = Map.empty[T, T]
    ):
      def fit(samples: Tensor1D[T]): LabelEncoder[T] = ???
      def transform(t: Tensor2D[T], col: Int): Tensor2D[T] = ???

    case class OneHotEncoder[
        T: Ordering: ClassTag,
        U: Numeric: Ordering: ClassTag
    ](
        classes: Map[T, U] = Map.empty[T, U]
    ):
      def fit(samples: Tensor1D[T]): OneHotEncoder[T, U] = ???
      def transform(t: Tensor2D[T], col: Int): Tensor2D[T] = ???

Train, Test Data

Loading data from CSV to Tensor[String]:

    val dataLoader = TextLoader(Path.of("data", "Churn_Modelling.csv")).load()
    val data = dataLoader.cols[String](3, -1)

Encode categorical data, scale and transform to Double:

    // returns composition of encoders
    val encoders = createEncoders[Double](data)
    val numericData = encoders(data)
    val scaler = StandardScaler[Double]().fit(numericData)

    val prepareData = (t: Tensor2D[String]) => {
      val numericData = encoders(t)
      scaler.transform(numericData)
    }

    // x is Tensor shape 10_000 x 12
    val x = prepareData(data)
    // y is Tensor shape 10_000 x 1
    val y = dataLoader.cols[Double](-1)

Split x and y into train (8000 records) and test (2000 records):

    val ((xTrain, xTest), (yTrain, yTest)) = (x, y).split(0.2f)

NETWORK IMPLEMENTATION

Network: Layers

    case class Layer[T](
        w: Tensor[T],
        b: Tensor[T],
        f: ActivationFunc[T],
        units: Int = 1
    )

    trait ActivationFunc[T]:
      val name: String
      def apply(x: Tensor[T]): Tensor[T]
      def derivative(x: Tensor[T]): Tensor[T]

    trait Loss[T]:
      def apply(
          actual: Tensor[T],
          predicted: Tensor[T]
      ): T

Model

    sealed trait Model[T]:
      def train(x: Tensor[T], y: Tensor[T], epochs: Int): Model[T]
      def layers: List[Layer[T]]
      def predict(x: Tensor[T]): Tensor[T]

    case class Sequential[T: ClassTag: RandomGen: Fractional, U](
        lossFunc: Loss[T],
        losses: List[T] = Nil,
        learningRate: T,
        batchSize: Int = 16,
        layerStack: Int => List[Layer[T]] = _ => List.empty[Layer[T]],
        layers: List[Layer[T]] = Nil
    )(using optimizer: Optimizer[U]) extends Model[T]

Slide annotations on the Sequential parameters: "To be specified by user", "Hyper params".

User API

    val ann = Sequential[Double, StandardGD](
      binaryCrossEntropy,
      learningRate = 0.002d,
      batchSize = 64 // update weights & biases on every 64 training records
    )
      .add(Dense(relu, 6))
      .add(Dense(relu, 6))
      .add(Dense(sigmoid))

    case class Dense[T](
        f: ActivationFunc[T],
        units: Int = 1
    ) extends LayerCfg[T]

Layer Stack

    sealed trait LayerCfg[T]:
      def units: Int
      def f: ActivationFunc[T]

    // inside case class Sequential ...
    def add(layer: LayerCfg[T]): Sequential[T, U] =
      copy(layerStack = (inputs: Int) =>
        val currentLayers = layerStack(inputs)
        val prevInput = currentLayers.lastOption.map(_.units).getOrElse(inputs)
        val w = random2D(prevInput, layer.units)
        val b = zeros(layer.units)
        (currentLayers :+ Layer(w, b, layer.f, layer.units))
      )

Weights shape w.r.t. units and inputs, if inputs = 12:
• 1st hidden layer shape: 12 x 6
• 2nd hidden layer shape: 6 x 6
• output layer: 6 x 1

Training Algorithm
Input variables: x: Tensor[T], y: Tensor[T]
Internal state: layers: List[Layer[T]]
Repeat while epoch < n:

    activations = activate(x, layers)
    error = predicted - y
    layers = updateWeights(layers, activations, error)

All you need to remember from this presentation!
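The same loop, reduced to a toy case so it can be run on its own: a single linear neuron with no activation function, fitted to y = 2x. The function name `trainLinear`, the averaged gradients and the chosen hyper parameters are assumptions for illustration, not the library's code.

```scala
// Gradient descent for one linear neuron: predicted = w * x + b
def trainLinear(xs: Array[Double], ys: Array[Double], epochs: Int, lr: Double): (Double, Double) =
  (1 to epochs).foldLeft((0.0, 0.0)) { case ((w, b), _) =>
    // forward pass and error = predicted - y
    val errors = xs.zip(ys).map((x, y) => (w * x + b) - y)
    // gradients averaged over the records (an assumption of this sketch)
    val wGradient = xs.zip(errors).map((x, e) => x * e).sum / xs.length
    val bGradient = errors.sum / xs.length
    // update step
    (w - lr * wGradient, b - lr * bGradient)
  }

// Training data generated from y = 2x, so w should approach 2 and b should approach 0.
val fitted = trainLinear(Array(1.0, 2.0, 3.0, 4.0), Array(2.0, 4.0, 6.0, 8.0), epochs = 5000, lr = 0.05)
```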

Train N epochs

    def train(x: Tensor[T], y: Tensor[T], epochs: Int): Model[T] =
      val actualBatches = y.batches(batchSize).toArray
      val batches = x.batches(batchSize).zip(actualBatches).toArray
      val layers = getOrInitLayers(x.cols)
      // 1st loop: over the epochs
      val (updatedLayers, epochLosses) =
        (1 to epochs).foldLeft(layers, List.empty[T]) {
          case ((lrs, losses), epoch) =>
            val (trained, avgLoss) = trainEpoch(batches, lrs, epoch)
            (trained, losses :+ avgLoss)
        }
      copy(
        layers = updatedLayers,
        losses = epochLosses
      )

Train on batches: forward

    private def trainEpoch(
        batches: Array[(Array[Array[T]], Array[Array[T]])],
        layers: List[Layer[T]],
        epoch: Int
    ) =
      val index = (1 to batches.length)
      // 2nd loop: over the batches
      val (trained, losses) =
        batches.zip(index).foldLeft(layers, List.empty[T]) {
          case ((layers, batchLoss), ((xBatch, yBatch), i)) =>
            // forward: goes through the layers
            val activations = activate(xBatch.as2D, layers)
            val actual = yBatch.as2D
            val predicted = activations.last.a
            val error = predicted - actual
            val loss = lossFunc(actual, predicted)

Activation

    // x: layer input, z: layer activation, a: layer activity
    case class Activation[T](x: Tensor[T], z: Tensor[T], a: Tensor[T])

    def activate[T: Numeric: ClassTag](
        input: Tensor[T],
        layers: List[Layer[T]]
    ): List[Activation[T]] =
      layers
        .foldLeft(input, ListBuffer.empty[Activation[T]]) {
          case ((x, acc), Layer(w, b, f, _, _)) =>
            val z = x * w + b
            val a = f(z)
            // current activity is the next layer's input
            (a, acc :+ Activation(x, z, a))
        }
        ._2
        .toList

[Diagram labels: x, w, b, f(z), a]

Train on batches: backward

            // backward
            val updatedLayers = optimizer.updateWeights(
              layers,
              activations,
              error
            )
            (updatedLayers, batchLoss :+ loss)
        }
      (trained, getAvgLoss(losses))

OPTIMIZERS

Optimizer (Scala 3)

    type Stub

    trait Optimizer[U]:
      def updateWeights[T: ClassTag: Fractional](
          layers: List[Layer[T]],
          activations: List[Activation[T]],
          error: Tensor[T],
          learningRate: T
      ): List[Layer[T]] = …

    given Optimizer[Stub] with
      override def updateWeights[T: ClassTag: Fractional](
          layers: List[Layer[T]],
          …
      ): List[Layer[T]] = layers // does nothing

Without Optimizer: Stub

    val model = ann.train(xTrain, yTrain, epochs = 100)

Output:

    epoch: 1/100, avg. loss: NaN, metrics: [accuracy: 0.359]
    epoch: 2/100, avg. loss: NaN, metrics: [accuracy: 0.359]
    epoch: 3/100, avg. loss: NaN, metrics: [accuracy: 0.359]
    epoch: 4/100, avg. loss: NaN, metrics: [accuracy: 0.359]
    epoch: 5/100, avg. loss: NaN, metrics: [accuracy: 0.359]
    epoch: 6/100, avg. loss: NaN, metrics: [accuracy: 0.359]
    epoch: 7/100, avg. loss: NaN, metrics: [accuracy: 0.359]
    epoch: 8/100, avg. loss: NaN, metrics: [accuracy: 0.359]
    epoch: 9/100, avg. loss: NaN, metrics: [accuracy: 0.359]
    epoch: 10/100, avg. loss: NaN, metrics: [accuracy: 0.359]
    …

Loss is greater than Double.MaxValue.

With Optimizer (1)

    type StandardGD

    given Optimizer[StandardGD] with
      override def updateWeights[T: ClassTag](
          weights: List[Layer[T]],
          activations: List[Activation[T]],
          error: Tensor[T],
          learningRate: T
      )(using n: Fractional[T]): List[Layer[T]] = …

With Optimizer (2): Backpropagation + Gradient Descent

    // goes backward through the layers
    layers.zip(activations)
      .foldRight(List.empty[Layer[T]], error, None: Option[Tensor[T]]) {
        case (
              (l @ Layer(w, b, f, _, _), Activation(x, z, _)),
              (lrs, prevDelta, prevWeight)
            ) =>
          val delta = (prevWeight match
            case Some(pw) => prevDelta * pw.T
            case None     => prevDelta
          ) multiply f.derivative(z)
          val wGradient = x.T * delta
          val bGradient = delta.sum
          val newWeight = w - (learningRate * wGradient)
          val newBias = b - (learningRate * bGradient)
          val updated = l.copy(w = newWeight, b = newBias) +: lrs
          (updated, delta, Some(w))
      }
      ._1
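Read as equations, the fold above computes the following per layer. The notation is introduced here for illustration and does not appear on the slides: L is the last layer, l is a layer index, η is the learning rate, ⊙ is the Hadamard product, and W_{l+1} is the next layer's weight matrix before its update (the code passes `Some(w)`, not `newWeight`).

```latex
\begin{aligned}
\delta_L   &= (\hat{y} - y) \odot f_L'(z_L) \\
\delta_l   &= \left(\delta_{l+1} W_{l+1}^{T}\right) \odot f_l'(z_l), \quad l < L \\
\nabla W_l &= x_l^{T} \, \delta_l \\
\nabla b_l &= \textstyle\sum \delta_l \\
W_l        &\leftarrow W_l - \eta \, \nabla W_l \\
b_l        &\leftarrow b_l - \eta \, \nabla b_l
\end{aligned}
```

Here the sum in the bias gradient runs over all entries of the delta tensor, mirroring `delta.sum` in the code.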

With Optimizer (3)

    epoch: 1/100, avg. loss: 0.8061420654867331, metrics: [accuracy: 0.70675]
    epoch: 2/100, avg. loss: 0.5271817345700976, metrics: [accuracy: 0.793875]
    epoch: 3/100, avg. loss: 0.5055016076889828, metrics: [accuracy: 0.793375]
    epoch: 4/100, avg. loss: 0.49368974906385815, metrics: [accuracy: 0.7945]
    epoch: 5/100, avg. loss: 0.48540839233676397, metrics: [accuracy: 0.79525]
    epoch: 6/100, avg. loss: 0.4788697196516788, metrics: [accuracy: 0.7965]
    epoch: 7/100, avg. loss: 0.4732941117845138, metrics: [accuracy: 0.796375]
    epoch: 8/100, avg. loss: 0.46855840601887444, metrics: [accuracy: 0.7985]
    epoch: 9/100, avg. loss: 0.4645757985260151, metrics: [accuracy: 0.8015]
    epoch: 10/100, avg. loss: 0.46127288371357456, metrics: [accuracy: 0.802375]
    …
    epoch: 100/100, avg. loss: 0.35699497553205667, metrics: [accuracy: 0.86125]

Test

    val testPredicted = model.predict(xTest)
    val value = accuracy(yTest, testPredicted)
    println(s"test accuracy = $value")

Output:

    test accuracy = 0.8245

Single test:

    val example = TextLoader(
      "n/a,n/a,n/a,600,France,Male,40,3,60000,2,1,1,50000,n/a"
    ).cols[String](3, -1)
    val testExample = prepareData(example)
    val yHat = model.predict(testExample)
    val exited = predictedToBinary(yHat.as0D.data) == 1
    println(s"Exited customer? $exited")

Output:

    Exited customer? false

yHat is: shape: 1x1, Tensor2D[Double]: [[0.054950115637072916]]
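The slides call `accuracy` and `predictedToBinary` without showing their bodies. A possible sketch over plain arrays follows; the 0.5 threshold and both signatures are assumptions, not the library's actual definitions.

```scala
// Turn a predicted probability into a class label.
// Assumption: values above the threshold count as "exited" (1).
def predictedToBinary(p: Double, threshold: Double = 0.5): Int =
  if p > threshold then 1 else 0

// Share of records whose thresholded prediction equals the actual label.
def accuracy(actual: Array[Double], predicted: Array[Double]): Double =
  require(actual.length == predicted.length && actual.nonEmpty)
  val correct = actual.zip(predicted).count((a, p) => a.toInt == predictedToBinary(p))
  correct.toDouble / actual.length

// The single-test prediction from the slide falls below the threshold.
val exitedLabel = predictedToBinary(0.054950115637072916)
```

Under this assumption the single-test value maps to 0, which agrees with the printed "Exited customer? false".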

How does the “predict” method calculate the target value?

Test

    sealed trait Model[T]:
      def predict(x: Tensor[T]): Tensor[T]

    // inside case class Sequential … extends Model[T]
    // feed forward
    def predict(x: Tensor[T]): Tensor[T] =
      activate(x).last.a

Thank you! Questions?

More Information on ANN:
0. Mini-library source code: https://github.com/novakov-alexey/deep-learning-scala
1. Artificial Neural Network in Scala - part 1: https://novakov-alexey.github.io/ann-in-scala-1/
2. Artificial Neural Network in Scala - part 2: https://novakov-alexey.github.io/ann-in-scala-2/
3. TensorFlow Scala - Linear Regression via ANN: https://novakov-alexey.github.io/tensorflow-scala/
4. Linear Regression with Gradient Descent: https://novakov-alexey.github.io/linear-regression/
5. Linear Regression with Adam Optimizer: https://novakov-alexey.github.io/adam-optimizer/
6. An overview of gradient descent optimization algorithms: https://arxiv.org/pdf/1609.04747.pdf

Twitter: @alexey_novakov
Blog: https://novakov-alexey.github.io/

Images
1. Biological Neuron: https://en.wikipedia.org/wiki/Biological_neuron_model#/media/File:Neuron3.png
2. How a neural network works: https://www.researchgate.net/figure/How-a-neural-network-works_fig1_308094593
