
Deep Learning in Scala 3 from scratch


Scala, as a language that lets you write highly concise and declarative code, is a great match for expressing neural network algorithms. We will leverage features like type inference, the REPL, operator overloading, extension methods, and first-class functions, as well as the new Scala 3 optional-braces syntax, to implement deep learning algorithms.

Alexey Novakov

March 25, 2021


Transcript

  1. About Me • Solution Architect at EPAM Germany (BigData, Cloud)

    • Functional Programmer
    • 5 years working with Scala, 10 years with Java
    • I often talk at the Rhein-Main Scala Enthusiasts Meetup
    • I like music, guitars and astronomy
    ALEXEY NOVAKOV
    Twitter: @alexey_novakov
    Blog: https://novakov-alexey.github.io/
  2. Goal of the Talk

    • Get an idea of the Deep Neural Network computation
    • Use Scala 3 features in the implementation
    • Inspire someone to write a good & long-lasting Scala library for Deep Learning
  3. Agenda

    • Intro to Deep Learning
    • Tensors
    • Network Implementation
    • Model Training, Test Metrics
    • Visualization
  4. Neuron Model • Biological neuron model vs. artificial neuron model

    Artificial neuron: inputs x1, x2, …, xn with weights w1, w2, …, wn and a bias b
    feed a summing junction $z = \sum_i x_i w_i + b$, followed by a non-linear
    activation function $f(z)$ that produces the output Y. The weights and bias
    are the parameters.
  5. Deep Neural Network • for a single neuron: $z = \sum_i X_i W_i + b$, output $f(z)$

    Layers:
    • Input: input data (encoded)
    • Hidden: trained weights
    • Output: predicted value [0 .. 1]
    Dense (fully-connected) layer: every neuron in one layer is connected to every
    neuron in the next layer. A deep network has multiple hidden layers for more
    efficient learning.
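    As a concrete illustration of the summing junction and activation, here is a minimal
    sketch with made-up numbers (plain Scala, not the library's code):

```scala
// one artificial neuron: z = Σ xᵢ·wᵢ + b, a = f(z)
val x = Array(0.5, -1.2, 3.0)   // inputs
val w = Array(0.4, 0.1, -0.7)   // weights
val b = 0.2                     // bias

val z = x.zip(w).map((xi, wi) => xi * wi).sum + b     // summing junction
val sigmoid = (v: Double) => 1.0 / (1.0 + math.exp(-v))
val a = sigmoid(z)                                    // neuron output Y
```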
  6. Deep Feed Forward Network

    1. transforms patterns from input to output (forward propagation)
    2. consists of dense layers
    3. no back-loops
    4. Backpropagation plus Gradient Descent learning algorithms are commonly used
       to update the weights/biases
    Diagram labels: inputs, randomly initialized weights, output y, loss/cost function,
    error (delta), training data, training algorithm (Gradient Descent) adjusting the weights.
  7. Loss Curve • plot of loss vs. epoch

    Meaning:
    - Lower is better
    - The model learns parameters while training
  8. Loss/Cost function

    Problem → Output → Loss function:
    • Regression → numerical → Mean Squared Error (MSE) / Quadratic Loss, Mean Absolute Error (MAE), Huber Loss, …
      $MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
    • Classification → binary → Binary Cross Entropy / Log Loss
      $-\frac{1}{N}\sum_{i=1}^{N}\big(y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\big)$
    • Classification → single label, multiple classes → Cross Entropy
      $-\frac{1}{N}\sum_{i=1}^{N} y_i \log \hat{y}_i$
    • Classification → multiple labels, multiple classes → Binary Cross Entropy
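    The two formulas relevant to the rest of the talk translate directly into Scala; this is
    a hedged sketch over plain Arrays, not the library's Loss[T] implementations:

```scala
// y: actual labels, yHat: predictions, both of length n
def mse(y: Array[Double], yHat: Array[Double]): Double =
  y.zip(yHat).map((yi, pi) => math.pow(yi - pi, 2)).sum / y.length

// binary cross entropy / log loss
def binaryCrossEntropy(y: Array[Double], yHat: Array[Double]): Double =
  -y.zip(yHat).map((yi, pi) =>
    yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
  ).sum / y.length
```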
  9. How to feed one or many data records into the

    network having N layers with multiple neurons each? Network: 12 x 6 x 6 x 1
  10. Linear Algebra: Matrix Multiplication • first hidden layer, N = 12

    A single record [1 x 12], multiplied (dot product) with the first hidden layer's
    weights, gives [1 x 6]. A record batch of 16 is handled the same way, using broadcasting.
  11. Activation Functions: f(z) = a

    f (in the literature also written g or φ) is applied element-wise for each neuron:
    f(xᵀ · w + b). Typical output ranges: sigmoid (0, 1), tanh [-1, 1].
    https://www.researchgate.net/publication/315667264_Efficient_Processing_of_Deep_Neural_Networks_A_Tutorial_and_Survey
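    The functions mentioned on this slide (plus ReLU, used later in the User API) could look
    like this on plain Doubles; the library wraps them as ActivationFunc[T] over tensors:

```scala
val sigmoid = (z: Double) => 1.0 / (1.0 + math.exp(-z)) // range (0, 1)
val tanh    = (z: Double) => math.tanh(z)               // range [-1, 1]
val relu    = (z: Double) => math.max(0.0, z)           // range [0, ∞)

// applied element-wise to the layer's pre-activation z = xᵀ·w + b
val z = Array(-2.0, 0.0, 1.5)
val a = z.map(sigmoid)
```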
  12. Dot Product

    Math rule: the number of columns in the first matrix must equal the number of rows
    in the second. Scala 3 (indentation syntax):

    import scala.reflect.ClassTag
    import math.Numeric.Implicits.infixNumericOps

    def matMul[T: ClassTag](
      a: Array[Array[T]], b: Array[Array[T]]
    )(using n: Numeric[T]): Array[Array[T]] =
      assert(
        a.head.length == b.length,
        "The number of columns in the first matrix should be equal to the number of rows in the second"
      )
      val rows = a.length
      val cols = b.head.length
      val out  = Array.ofDim[T](rows, cols)
      for i <- 0 until rows do
        for j <- 0 until cols do
          var sum = n.zero
          for k <- b.indices do
            sum = sum + (a(i)(k) * b(k)(j))
          out(i)(j) = sum
      out
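    A quick usage check of matMul against the shape rule from the previous slide
    ([1 x 12] · [12 x 6] = [1 x 6]); the values are illustrative:

```scala
val record  = Array(Array.fill(12)(1.0))          // shape 1 x 12
val weights = Array.fill(12)(Array.fill(6)(0.5))  // shape 12 x 6

val hidden = matMul(record, weights)              // shape 1 x 6
assert(hidden.length == 1 && hidden.head.length == 6)
```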
  13. Shape

    Any of these can have a different shape: x, weight, bias, z, a, y, yHat.
    • Scalar: (1)
    • Vector (row or column): (n)
    • Matrix: (n, m)
    • Cube: (n, m, k)
  14. Tensor is an N-dimensional array of data

    • Rank 0 Tensor: Scalar
    • Rank 1 Tensor: Vector
    • Rank 2 Tensor: Matrix
    • Rank 3 Tensor
    • Rank 4 Tensor
  15. Tensor in Scala

    sealed trait Tensor[T]:
      def length: Int
      def shape: List[Int]

    // Scalar
    case class Tensor0D[T: ClassTag](data: T) extends Tensor[T]:
      override val length: Int = 1
      override val shape: List[Int] = length :: Nil
  16. Tensor in Scala: Vector and Matrix

    // Vector
    case class Tensor1D[T: ClassTag](data: Array[T]) extends Tensor[T]:
      override def shape: List[Int] = List(data.length)
      override def length: Int = data.length

    // Matrix
    case class Tensor2D[T: ClassTag](
      data: Array[Array[T]]
    ) extends Tensor[T]:
      override def shape: List[Int] =
        val (r, c) = (data.length, data.headOption.map(_.length).getOrElse(0))
        List(r, c)
      override def length: Int = data.length
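    Constructing the three tensor ranks and checking their shapes (a usage sketch of the
    case classes above):

```scala
val scalar = Tensor0D(42.0)                       // rank 0
val vector = Tensor1D(Array(1.0, 2.0, 3.0))       // rank 1
val matrix = Tensor2D(Array(
  Array(1.0, 2.0),
  Array(3.0, 4.0),
  Array(5.0, 6.0)
))                                                // rank 2

scalar.shape // List(1)
vector.shape // List(3)
matrix.shape // List(3, 2)
```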
  17. Operations (Scala 3 extension methods)

    extension [T: ClassTag: Numeric](t: Tensor[T])
      // dot product
      def *(that: Tensor[T]): Tensor[T] = TensorOps.mul(t, that)
      // Hadamard product – element-wise multiplication
      def multiply(that: Tensor[T]): Tensor[T] = TensorOps.multiply(t, that)
      def -(that: T): Tensor[T] = TensorOps.subtract(t, Tensor0D(that))
      def -(that: Tensor[T]): Tensor[T] = TensorOps.subtract(t, that)
      def +(that: Tensor[T]): Tensor[T] = TensorOps.plus(t, that)
      def +(that: T): Tensor[T] = TensorOps.plus(t, Tensor0D(that))
      def sum: T = TensorOps.sum(t)
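    How the extension methods read at a call site, assuming two small, shape-compatible
    Tensor2D values (a sketch, values made up):

```scala
val x = Tensor2D(Array(Array(1.0, 2.0)))          // shape 1 x 2
val w = Tensor2D(Array(Array(0.5), Array(-0.5)))  // shape 2 x 1

val z = x * w + 0.1      // dot product [1 x 1], then scalar bias added element-wise
val h = x multiply x     // Hadamard (element-wise) product, still 1 x 2
val s = x.sum            // 3.0
```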
  18. Churn Modeling • Does the customer exit the bank? Yes/No → binary classifier

    Raw data (not encoded):
    RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
    1,15634602,Hargrave,619,France,Female,42,2,0,1,1,1,101348.88,1
    2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
    3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
    4,15701354,Boni,699,France,Female,39,1,0,2,0,0,93826.63,0
    Input features as X: CreditScore, Geography, Gender, Age, Tenure, Balance,
    NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary
    Target as Y: Exited
  19. Data Preparation: Encoding

    Geography and Gender are categorical columns (France/Germany/Spain, Female/Male).
    Label Encoding classes:
    - France -> 0.0, Germany -> 1.0, Spain -> 2.0
    - Female -> 0, Male -> 1
    Geography is then One-hot encoded. The slide shows the resulting encoded numeric
    matrix (CreditScore, one-hot Geography columns, Gender, Age, Tenure, Balance, …).
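    The two encodings can be sketched with plain collections (illustrative only; the
    library's LabelEncoder/OneHotEncoder on the next slides work on tensors):

```scala
val geography = Array("France", "Spain", "France", "Germany")

// label encoding: each class gets an ordinal number
val classes = geography.distinct.sorted.zipWithIndex.toMap
// Map(France -> 0, Germany -> 1, Spain -> 2)
val labelEncoded = geography.map(g => classes(g).toDouble)

// one-hot encoding: one column per class, 1.0 for the matching class
val oneHot = geography.map { g =>
  Array.tabulate(classes.size)(i => if i == classes(g) then 1.0 else 0.0)
}
```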
  20. Data Preparation: Scaling

    For all columns, for each value:
    (value(i, j) – stats.column(j).mean) / stats.column(j).stdDev
    The slide shows the standard-scaled feature matrix (values centred around 0 with
    unit standard deviation).
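    The formula above, sketched for a single column of values (the library's StandardScaler
    applies the same per column of a Tensor2D):

```scala
def standardScale(column: Array[Double]): Array[Double] =
  val mean   = column.sum / column.length
  val stdDev = math.sqrt(column.map(v => math.pow(v - mean, 2)).sum / column.length)
  column.map(v => (v - mean) / stdDev)
```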
  21. Preprocessing API

    case class LabelEncoder[T: ClassTag: Ordering](
      classes: Map[T, T] = Map.empty[T, T]
    ):
      def fit(samples: Tensor1D[T]): LabelEncoder[T] = ???
      def transform(t: Tensor2D[T], col: Int): Tensor2D[T] = ???

    case class OneHotEncoder[
      T: Ordering: ClassTag,
      U: Numeric: Ordering: ClassTag
    ](
      classes: Map[T, U] = Map.empty[T, U]
    ):
      def fit(samples: Tensor1D[T]): OneHotEncoder[T, U] = ???
      def transform(t: Tensor2D[T], col: Int): Tensor2D[T] = ???
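    The slide leaves fit/transform as ???. One possible way to fill them in for the String
    case is sketched below; this is purely an assumption about the intent, with hypothetical
    helper names (fitStrings, transformStrings), not the library's actual code:

```scala
// hypothetical completion for T = String — an assumption, not the library's code
def fitStrings(samples: Tensor1D[String]): LabelEncoder[String] =
  val mapping = samples.data.distinct.sorted.zipWithIndex
    .map((cls, i) => cls -> i.toDouble.toString)    // e.g. "France" -> "0.0"
    .toMap
  LabelEncoder(mapping)

def transformStrings(enc: LabelEncoder[String], t: Tensor2D[String], col: Int): Tensor2D[String] =
  // replace the categorical value in the given column with its learned class label
  Tensor2D(t.data.map(row => row.updated(col, enc.classes(row(col)))))
```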
  22. Train, Test Data

    Loading data from CSV to Tensor[String]:

    val dataLoader = TextLoader(Path.of("data", "Churn_Modelling.csv")).load()
    val data = dataLoader.cols[String](3, -1)

    Encode categorical data, scale and transform to Double:

    // returns composition of encoders
    val encoders = createEncoders[Double](data)
    val numericData = encoders(data)
    val scaler = StandardScaler[Double]().fit(numericData)

    val prepareData = (t: Tensor2D[String]) => {
      val numericData = encoders(t)
      scaler.transform(numericData)
    }

    val x = prepareData(data)             // x is Tensor of shape 10_000 x 12
    val y = dataLoader.cols[Double](-1)   // y is Tensor of shape 10_000 x 1
    val ((xTrain, xTest), (yTrain, yTest)) = (x, y).split(0.2f)
    // split: 8000 train / 2000 test
  23. Network: Layers

    case class Layer[T](
      w: Tensor[T],
      b: Tensor[T],
      f: ActivationFunc[T],
      units: Int = 1)

    trait ActivationFunc[T]:
      val name: String
      def apply(x: Tensor[T]): Tensor[T]
      def derivative(x: Tensor[T]): Tensor[T]

    trait Loss[T]:
      def apply(
        actual: Tensor[T],
        predicted: Tensor[T]
      ): T
  24. Model

    sealed trait Model[T]:
      def train(x: Tensor[T], y: Tensor[T], epochs: Int): Model[T]
      def layers: List[Layer[T]]
      def predict(x: Tensor[T]): Tensor[T]

    case class Sequential[T: ClassTag: RandomGen: Fractional, U](
      lossFunc: Loss[T],          // to be specified by user
      losses: List[T] = Nil,
      learningRate: T,            // hyper params
      batchSize: Int = 16,
      layerStack: Int => List[Layer[T]] = _ => List.empty[Layer[T]],
      layers: List[Layer[T]] = Nil
    )(using optimizer: Optimizer[U]) extends Model[T]
  25. User API

    case class Dense[T](
      f: ActivationFunc[T],
      units: Int = 1
    ) extends LayerCfg[T]

    val ann = Sequential[Double, StandardGD](
      binaryCrossEntropy,
      learningRate = 0.002d,
      batchSize = 64   // update weights & biases on every 64 training records
    )
      .add(Dense(relu, 6))
      .add(Dense(relu, 6))
      .add(Dense(sigmoid))
  26. Layer Stack

    sealed trait LayerCfg[T]:
      def units: Int
      def f: ActivationFunc[T]

    case class Sequential ...
      def add(layer: LayerCfg[T]): Sequential[T, U] =
        copy(layerStack = (inputs: Int) =>
          val currentLayers = layerStack(inputs)
          val prevInput = currentLayers.lastOption.map(_.units).getOrElse(inputs)
          val w = random2D(prevInput, layer.units)
          val b = zeros(layer.units)
          currentLayers :+ Layer(w, b, layer.f, layer.units)
        )

    Weights shape w.r.t. units and inputs, if inputs = 12:
    1st hidden layer shape: 12 x 6
    2nd hidden layer shape: 6 x 6
    output layer: 6 x 1
    (see the shape check sketched below)
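    A quick check of those shapes against the network built on the User API slide — this
    assumes layerStack is accessible at the call site, which is an assumption for
    illustration only:

```scala
val builtLayers = ann.layerStack(12)   // build layers for 12 input features
builtLayers.map(_.w.shape)             // List(List(12, 6), List(6, 6), List(6, 1))
```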
  27. Training Algorithm • All you need to remember from this presentation!

    Input variables: x: Tensor[T], y: Tensor[T]
    Internal state: layers: List[Layer[T]]

    Repeat while epoch < n:
      activations = activate(x, layers)
      error = predicted - y
      layers = updateWeights(layers, activations, error)
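    The same control flow as a minimal Scala loop skeleton; names such as initialLayers, x,
    y, epochs, learningRate and optimizer are assumed inputs, and this is a sketch of the
    idea, not the actual train implementation shown next:

```scala
var layers = initialLayers                       // randomly initialized Layer[T] list
for epoch <- 1 to epochs do
  val activations = activate(x, layers)          // forward pass through all layers
  val predicted   = activations.last.a
  val error       = predicted - y                // delta at the output
  layers = optimizer.updateWeights(layers, activations, error, learningRate)
```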
  28. Train N epochs (1st loop)

    def train(x: Tensor[T], y: Tensor[T], epochs: Int): Model[T] =
      val actualBatches = y.batches(batchSize).toArray
      val batches = x.batches(batchSize).zip(actualBatches).toArray
      val layers = getOrInitLayers(x.cols)
      val (updatedLayers, epochLosses) =
        (1 to epochs).foldLeft(layers, List.empty[T]) {
          case ((lrs, losses), epoch) =>
            val (trained, avgLoss) = trainEpoch(batches, lrs, epoch)
            (trained, losses :+ avgLoss)
        }
      copy(
        layers = updatedLayers,
        losses = epochLosses
      )
  29. Train on batches: forward (2nd loop, goes through the layers)

    private def trainEpoch(
      batches: Array[(Array[Array[T]], Array[Array[T]])],
      layers: List[Layer[T]],
      epoch: Int
    ) =
      val index = (1 to batches.length)
      val (trained, losses) =
        batches.zip(index).foldLeft(layers, List.empty[T]) {
          case ((layers, batchLoss), ((xBatch, yBatch), i)) =>
            // forward
            val activations = activate(xBatch.as2D, layers)
            val actual = yBatch.as2D
            val predicted = activations.last.a
            val error = predicted - actual
            val loss = lossFunc(actual, predicted)
  30. Activation • current activity is the next layer's input

    case class Activation[T](x: Tensor[T], z: Tensor[T], a: Tensor[T])
    // x – layer input, z – layer activation, a – layer activity

    def activate[T: Numeric: ClassTag](
      input: Tensor[T], layers: List[Layer[T]]
    ): List[Activation[T]] =
      layers
        .foldLeft(input, ListBuffer.empty[Activation[T]]) {
          case ((x, acc), Layer(w, b, f, _)) =>
            val z = x * w + b
            val a = f(z)
            (a, acc :+ Activation(x, z, a))
        }
        ._2
        .toList
  31. Train on batches: backward

            // backward
            val updatedLayers = optimizer.updateWeights(
              layers, activations, error, learningRate
            )
            (updatedLayers, batchLoss :+ loss)
        }
      (trained, getAvgLoss(losses))
  32. Optimizer (Scala 3 given instances)

    type Stub

    trait Optimizer[U]:
      def updateWeights[T: ClassTag: Fractional](
        layers: List[Layer[T]],
        activations: List[Activation[T]],
        error: Tensor[T],
        learningRate: T
      ): List[Layer[T]] = …

    given Optimizer[Stub] with
      override def updateWeights[T: ClassTag: Fractional](
        layers: List[Layer[T]], …
      ): List[Layer[T]] = layers // does nothing
  33. Without Optimizer: Stub

    val model = ann.train(xTrain, yTrain, epochs = 100)

    epoch: 1/100, avg. loss: NaN, metrics: [accuracy: 0.359]
    epoch: 2/100, avg. loss: NaN, metrics: [accuracy: 0.359]
    epoch: 3/100, avg. loss: NaN, metrics: [accuracy: 0.359]
    epoch: 4/100, avg. loss: NaN, metrics: [accuracy: 0.359]
    epoch: 5/100, avg. loss: NaN, metrics: [accuracy: 0.359]
    epoch: 6/100, avg. loss: NaN, metrics: [accuracy: 0.359]
    epoch: 7/100, avg. loss: NaN, metrics: [accuracy: 0.359]
    epoch: 8/100, avg. loss: NaN, metrics: [accuracy: 0.359]
    epoch: 9/100, avg. loss: NaN, metrics: [accuracy: 0.359]
    epoch: 10/100, avg. loss: NaN, metrics: [accuracy: 0.359]
    …
    The loss grows beyond Double.MaxValue, hence NaN.
  34. With Optimizer (1)

    type StandardGD

    given Optimizer[StandardGD] with
      override def updateWeights[T: ClassTag](
        weights: List[Layer[T]],
        activations: List[Activation[T]],
        error: Tensor[T],
        learningRate: T
      )(using n: Fractional[T]): List[Layer[T]] = …
  35. With Optimizer (2): Backpropagation + Gradient Descent • goes backward through the layers

    layers.zip(activations)
      .foldRight(List.empty[Layer[T]], error, None: Option[Tensor[T]]) {
        case (
              (l @ Layer(w, b, f, _), Activation(x, z, _)),
              (lrs, prevDelta, prevWeight)
            ) =>
          val delta = (prevWeight match
            case Some(pw) => prevDelta * pw.T
            case None     => prevDelta
          ) multiply f.derivative(z)

          val wGradient = x.T * delta
          val bGradient = delta.sum
          val newWeight = w - (learningRate * wGradient)
          val newBias   = b - (learningRate * bGradient)
          val updated   = l.copy(w = newWeight, b = newBias) +: lrs
          (updated, delta, Some(w))
      }
      ._1
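    In equation form, the fold above computes, for each layer l going from the output back
    to the input (⊙ is the Hadamard product, η the learning rate):

```latex
\delta_L = (\hat{y} - y) \odot f'(z_L), \qquad
\delta_l = \big(\delta_{l+1}\, W_{l+1}^{\mathsf T}\big) \odot f'(z_l)

\nabla W_l = x_l^{\mathsf T}\,\delta_l, \qquad
\nabla b_l = \sum_i \delta_{l,i}

W_l \leftarrow W_l - \eta\,\nabla W_l, \qquad
b_l \leftarrow b_l - \eta\,\nabla b_l
```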
  36. With Optimizer (3)

    epoch: 1/100, avg. loss: 0.8061420654867331, metrics: [accuracy: 0.70675]
    epoch: 2/100, avg. loss: 0.5271817345700976, metrics: [accuracy: 0.793875]
    epoch: 3/100, avg. loss: 0.5055016076889828, metrics: [accuracy: 0.793375]
    epoch: 4/100, avg. loss: 0.49368974906385815, metrics: [accuracy: 0.7945]
    epoch: 5/100, avg. loss: 0.48540839233676397, metrics: [accuracy: 0.79525]
    epoch: 6/100, avg. loss: 0.4788697196516788, metrics: [accuracy: 0.7965]
    epoch: 7/100, avg. loss: 0.4732941117845138, metrics: [accuracy: 0.796375]
    epoch: 8/100, avg. loss: 0.46855840601887444, metrics: [accuracy: 0.7985]
    epoch: 9/100, avg. loss: 0.4645757985260151, metrics: [accuracy: 0.8015]
    epoch: 10/100, avg. loss: 0.46127288371357456, metrics: [accuracy: 0.802375]
    …
    epoch: 100/100, avg. loss: 0.35699497553205667, metrics: [accuracy: 0.86125]
  37. Test

    val testPredicted = model.predict(xTest)
    val value = accuracy(yTest, testPredicted)
    println(s"test accuracy = $value")
    // test accuracy = 0.8245

    // Single test
    val example = TextLoader(
      "n/a,n/a,n/a,600,France,Male,40,3,60000,2,1,1,50000,n/a"
    ).cols[String](3, -1)
    val testExample = prepareData(example)
    val yHat = model.predict(testExample)
    // yHat shape: 1x1, Tensor2D[Double]: [[0.054950115637072916]]
    val exited = predictedToBinary(yHat.as0D.data) == 1
    println(s"Exited customer? $exited")
    // Exited customer? false
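    The accuracy metric and the predictedToBinary threshold are not shown in the talk; a
    plausible sketch over plain Arrays, as an assumption rather than the library's code:

```scala
def predictedToBinary(p: Double): Int = if p > 0.5 then 1 else 0

def accuracy(actual: Array[Double], predicted: Array[Double]): Double =
  val correct = actual.zip(predicted)
    .count((y, yHat) => y.toInt == predictedToBinary(yHat))
  correct.toDouble / actual.length
```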
  38. Test • predict is a feed forward pass

    sealed trait Model[T]:
      def predict(x: Tensor[T]): Tensor[T]

    case class Sequential … extends Model[T]:
      def predict(x: Tensor[T]): Tensor[T] = activate(x).last.a
  39. Thank you! Questions?

    More information on ANN:
    0. Mini-library source code – https://github.com/novakov-alexey/deep-learning-scala
    1. Artificial Neural Network in Scala, part 1 – https://novakov-alexey.github.io/ann-in-scala-1/
    2. Artificial Neural Network in Scala, part 2 – https://novakov-alexey.github.io/ann-in-scala-2/
    3. TensorFlow Scala – Linear Regression via ANN – https://novakov-alexey.github.io/tensorflow-scala/
    4. Linear Regression with Gradient Descent – https://novakov-alexey.github.io/linear-regression/
    5. Linear Regression with Adam Optimizer – https://novakov-alexey.github.io/adam-optimizer/
    6. An overview of gradient descent optimization algorithms – https://arxiv.org/pdf/1609.04747.pdf

    Twitter: @alexey_novakov
    Blog: https://novakov-alexey.github.io/