
Things You Might Not Have Tried with Ecto

If you're only using Ecto for schema modules, changeset functions, inserts, and updates, you're missing out on some great features.

Without a database: Ecto.Type can be used to convert stringly-typed data from any source into native Elixir terms.

Ecto migrations can represent primary keys that aren't integers and more closely reflect unique identifiers in your domain; join tables with composite primary keys that capture what's unique about the join instead of an arbitrary id column; and different timestamp configs that accurately represent the difference between timekeeping in the database and in your Ecto schemas.

The full power of Postgres is available through Ecto constraints: unique constraints can be tuned to specific conditions; exclusion constraints make ranges and scheduling easy; and check constraints are even more general, allowing complete validation logic to be modeled in the database. Constraints ensure that different code paths can't corrupt your data and violate business rules. Leaning on database constraints is crucial when doing bulk operations in concurrent ETL pipelines, but Luke will show how you don't need to abandon changesets and validations just because insert_all doesn't support inserting changesets.

Elle Imhoff

March 01, 2019

Transcript

  1. Things You Might Not Have Tried with Ecto Luke Imhoff

    Creator of IntelliJ Elixir Hi, I'm Luke Imhoff. I'm known across the Internet as KronicDeth. I work at DockYard in the R&D Department. I created IntelliJ Elixir, the Elixir plugin for JetBrains IDEs like IntelliJ, RubyMine, WebStorm, etc. This is Things You Might Not Have Tried with Ecto.
  2. •POA Network •DockYard •Gaslight •Plataformatec https://github.com/poanetwork/blockscout The techniques and examples

    I'm about to show you were gathered while working on BlockScout: an open-source Ethereum blockchain explorer. The project is sponsored by POA Network. The work was a collaboration between POA Network, DockYard, Gaslight, and Plataformatec. The examples will be specific to the project and its domain, but the techniques apply to any complex domain that needs to ingest millions to hundreds of millions of rows in days instead of months.
  3. Without a Database With a Database Ecto Even before the

    split of Ecto into `ecto` and `ecto_sql` in version 3, Ecto could be used with or without a database.
  4. •Ecto.Changeset •Ecto.Changeset.cast/4 DB Casting Ecto.Type When I say casting with

    Ecto, you probably think of Ecto.Changeset and Ecto.Changeset.cast, but that's not the source of Ecto's casting. Instead, Ecto.Changeset.cast is internally using Ecto.Type.
  5. •cast/1 •dump/1 •equal?/2 •load/1 •type/0 DB Ecto.Type Ecto.Type is a

    behaviour with 5 callbacks: `cast`, `dump`, `equal?`, `load`, and `type`.
  6. DB Ecto.Type Callback Input Output cast/1 External Elixir dump/1 Elixir

    Adapter equal?/2 Elixir Elixir load/1 Adapter Elixir type/0 N/A Adapter I always forget whether the callbacks are talking about the external, Elixir, or adapter type, so I made this nice table. Cast converts external input, usually strings, to an Elixir type. Dump then takes the Elixir type and passes it to the adapter's type. Load does the opposite of dump.
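To make the callback shapes concrete before the Hash example, here is a minimal sketch (not from the deck) of a hypothetical type that trims and downcases an email string and stores it as a plain `:string` column. The optional `equal?/2` callback is omitted here, as it is in the deck's own Hash example; newer Ecto versions can also `use Ecto.Type` to get defaults for the optional callbacks.

```elixir
defmodule MyApp.Email do
  @behaviour Ecto.Type

  # type/0: the adapter (database) type the value is stored as
  @impl Ecto.Type
  def type, do: :string

  # cast/1: external input -> Elixir term
  @impl Ecto.Type
  def cast(term) when is_binary(term), do: {:ok, term |> String.trim() |> String.downcase()}
  def cast(_), do: :error

  # dump/1: Elixir term -> adapter term
  @impl Ecto.Type
  def dump(term) when is_binary(term), do: {:ok, term}
  def dump(_), do: :error

  # load/1: adapter term -> Elixir term
  @impl Ecto.Type
  def load(term) when is_binary(term), do: {:ok, term}
  def load(_), do: :error
end
```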
  7. Ecto.Type DB External cast Elixir dump insert get/all Adapter load

    Adapter Elixir No Update External cast Adapter load Elixir Elixir equal? dump Adapter update If you're wondering how the callbacks are used when there is a database...
  8. Insert Ecto.Type External cast Elixir dump Adapter For insert, you're

    taking in external inputs, casting to Elixir, and then dumping to the database adapter.
  9. get/all Ecto.Type Adapter load Elixir For read operations, like `get`

    and `all`, you're reading the adapter type from the database and loading it into the Elixir type.
  10. Update Ecto.Type No Update External cast Adapter load Elixir Elixir

    equal? dump Adapter Updates are combining...
  11. Ecto.Type Ecto.Type No Update External cast Adapter load Elixir Elixir

    equal? dump Adapter ... the type from the adapter by loading...
  12. Ecto.Type Ecto.Type No Update External cast Adapter load Elixir Elixir

    equal? dump Adapter ...and the external inputs through casting.
  13. Ecto.Type Ecto.Type No Update External cast Adapter load Elixir Elixir

    equal? dump Adapter If the Elixir type, like Decimal, can't be compared correctly with plain Elixir `==`, the optional `equal?` callback can be implemented.
  14. Ecto.Type Ecto.Type No Update External cast Adapter load Elixir Elixir

    equal? dump Adapter If the values are equal, no change is recorded, so no update happens...
  15. Ecto.Type Ecto.Type No Update External cast Adapter load Elixir Elixir

    equal? dump Adapter ... otherwise the change goes to the database.
  16. Ecto.Type Hash Callback Input Output cast/1 "0x[A-Fa-f0-9]*" binary dump/1 binary

    binary equal?/2 N/A N/A load/1 binary binary type/0 N/A :binary Now that you've seen how Ecto.Type is used, let's see an actual example. Our use case is converting hashes. Hashes are displayed as hexadecimal so that people can read them, but the hash itself is really a run of bytes, so the better Elixir representation is a binary. We're not using a database yet, but for the adapter this would be :binary, or :bytea for Postgrex.
  17. Hash Binary Problems • One Protocol Implementation • Can't Restrict

    Length So, binaries are great for pattern matching in Elixir, but they have some problems: Our project was using Phoenix for the UI, so we'd like the Elixir type to implement String.Chars, Phoenix.HTML.Safe, and Phoenix.Param. We'd also like to restrict the length. I say restrict because we want to ensure a Hash with the wrong length can't be created at all, not merely validate the length of user input. Moving the length check into the type also ensures that it is consistent everywhere and that we can't forget a validate_length call in a schema's changeset function.
  18. Hash Struct %Hash{ byte_count: integer, bytes: binary } To be

    able to implement protocols that aren't just on the generic `binary` type, we need the Elixir representation to be a `struct`. So we wrap the original `binary` in a `Hash` `struct`.
  19. Hash Struct %Hash{ byte_count: integer, bytes: binary } The bytes

    binary has a size that can be pattern matched out, but we want to prevent incorrectly sized binaries from being directly set in the `Hash`, so...
  20. Hash Struct %Hash{ byte_count: integer, bytes: binary } ... we

    also record the expected byte_count in the struct. This caught some bugs where we mixed up Hashes of different sizes.
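Once the binary is wrapped in a struct, the protocol implementations mentioned on slide 17 become possible. The deck doesn't show them; a hedged sketch might look like this, assuming the string form is 0x-prefixed lowercase hexadecimal:

```elixir
# Sketch only: protocol implementations enabled by the Hash struct.
defimpl String.Chars, for: Explorer.Chain.Hash do
  def to_string(%Explorer.Chain.Hash{bytes: bytes}) do
    "0x" <> Base.encode16(bytes, case: :lower)
  end
end

defimpl Phoenix.HTML.Safe, for: Explorer.Chain.Hash do
  # hex digits are HTML-safe, so the string form can be used as iodata directly
  def to_iodata(hash), do: to_string(hash)
end

defimpl Phoenix.Param, for: Explorer.Chain.Hash do
  # lets routes like block_path(conn, :show, block_hash) interpolate the hash
  def to_param(hash), do: to_string(hash)
end
```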
  21. Hash Sizes Name Size Usage Hash.Full 32 Block of Transactions

    or Transaction Hash.Address 20 User or Account Hash.Nonce 16 Proof-of-Work We had 3 sizes of hashes for 4 different usages: full 32-byte hashes are used as primary keys for blocks of transactions and for transactions themselves. Hashes truncated to 20 bytes are used for addresses, which represent individual people or collectives represented by a contract. Hashes truncated to 16 bytes are called nonces and used as proof-of-work: a miner shows they found a number that, when concatenated with the block, prefixes the new hash with a certain number of zeros.
  22. Hash t:t/0 defmodule Explorer.Chain.Hash do defstruct ~w(byte_count bytes)a @max_byte_count 32

    @type byte_count :: 1..unquote(@max_byte_count) @type t :: %__MODULE__{ byte_count: byte_count, bytes: <<_::_*8>> } @callback byte_count() :: byte_count() # ... end The `Hash` module itself defines the shared `Hash` `struct`.
  23. Hash c:byte_count/0 defmodule Explorer.Chain.Hash do defstruct ~w(byte_count bytes)a @max_byte_count 32

    @type byte_count :: 1..unquote(@max_byte_count) @type t :: %__MODULE__{ byte_count: byte_count, bytes: <<_::_*8>> } @callback byte_count() :: byte_count() # ... end Because of the multiple sizes, we made the Hash module be pluggable as a behaviour. The implementation modules such as the Hash.Full implement the byte_count callback to restrict the byte_count field to a specific value.
  24. Hash Full defmodule Explorer.Chain.Hash.Full do alias Explorer.Chain.Hash @behaviour Hash @byte_count

    32 @impl Hash def byte_count, do: @byte_count # ... end Here we see the Hash.Full. It implements the byte_count callback for the Hash behaviour to restrict the size to exactly 32 bytes.
  25. Hash Full defmodule Explorer.Chain.Hash.Full do alias Explorer.Chain.Hash @behaviour Ecto.Type @impl

    Ecto.Type def cast(term) do Hash.cast(__MODULE__, term) end @impl Ecto.Type def dump(term) do Hash.dump(__MODULE__, term) end @impl Ecto.Type def load(term) do Hash.load(__MODULE__, term) end @impl Ecto.Type def type, do: :binary # ... end The Ecto.Type callbacks then defer to Hash module.
  26. Hash cast/2 defmodule Explorer.Chain.Hash do # ... def cast(callback_module, term)

    when is_atom(callback_module) do byte_count = callback_module.byte_count() case term do ... end end # ... end Let's first look at Hash.cast/2 which allows our callback modules to implement Ecto.Type.cast/1, but with a specific byte count.
  27. Hash cast/2 defmodule Explorer.Chain.Hash do # ... def cast(callback_module, term)

    when is_atom(callback_module) do # ... case term do %__MODULE__{ byte_count: ^byte_count, bytes: <<_::big-integer-size(byte_count)-unit(@bits_per_byte)>> } = cast -> {:ok, cast} # ... end end # ... end The first type of input, which all Ecto.Type.cast/1 implementations must handle, is a value that is already cast, in which case we pass it through.
  28. Hash cast/2 defmodule Explorer.Chain.Hash do # ... def cast(callback_module, term)

    when is_atom(callback_module) do # ... case term do %__MODULE__{ byte_count: ^byte_count, bytes: <<_::big-integer-size(byte_count)-unit(@bits_per_byte)>> } = cast -> {:ok, cast} # ... end end # ... end You can see the double check of the byte count in the pattern: both the `byte_count` field and the bytes' `size` in the binary pattern are checked. Calling code can't force through Hashes that are internally inconsistent or whose size differs from the declared type for the field in an Ecto.Schema module.
  29. Hash cast/2 defmodule Explorer.Chain.Hash do # ... def cast(callback_module, term)

    when is_atom(callback_module) do # ... case term do # ... <<_::big-integer-size(byte_count)-unit(@bits_per_byte)>> -> {:ok, %__MODULE__{byte_count: byte_count, bytes: term}} # ... end end # ... end The second input was a plain binary, which we needed to handle to interact with other APIs. We ended up storing a lot of hashes in ETS tables and storing just the binary instead of the struct saved a lot of memory. The binary's size can be checked against the required byte_count for the Hash implementation.
  30. Hash cast/2 defmodule Explorer.Chain.Hash do # ... def cast(callback_module, term)

    when is_atom(callback_module) do # ... case term do # ... <<"0x", hexadecimal_digits::binary>> -> cast_hexadecimal_digits(hexadecimal_digits, byte_count) # ... end end # ... end The third input is what you usually think of as a hash: a human-readable 0x-prefixed hexadecimal string.
  31. Hash cast/2 defmodule Explorer.Chain.Hash do # ... def cast(callback_module, term)

    when is_atom(callback_module) do # ... case term do # ... <<"0x", hexadecimal_digits::binary>> -> cast_hexadecimal_digits(hexadecimal_digits, byte_count) # ... end end # ... end It calls `cast_hexadecimal_digits`
  32. Hash cast_hexadecimal_digits/2 defmodule Explorer.Chain.Hash do # ... defp cast_hexadecimal_digits(hexadecimal_digits, byte_count)

    when is_binary(hexadecimal_digits) do hexadecimal_digit_count = byte_count_to_hexadecimal_digit_count(byte_count) with ^hexadecimal_digit_count <- String.length(hexadecimal_digits), {:ok, bytes} <- Base.decode16(hexadecimal_digits, case: :mixed) do {:ok, %__MODULE__{byte_count: byte_count, bytes: bytes}} else _ -> :error end end # ... end We want bytes as output and want to skip first having to make an integer and then converting it back to bytes, so we use Base.decode16, which efficiently goes from hexadecimal encoding to a binary.
  33. Hash cast/2 defmodule Explorer.Chain.Hash do # ... def cast(callback_module, term)

    when is_atom(callback_module) do # ... case term do # ... integer when is_integer(integer) -> cast_integer(integer, byte_count) # ... end end # ... end The fourth input to cast is a normal Elixir integer.
  34. Hash cast/2 defmodule Explorer.Chain.Hash do # ... def cast(callback_module, term)

    when is_atom(callback_module) do # ... case term do # ... integer when is_integer(integer) -> cast_integer(integer, byte_count) # ... end end # ... end It calls cast_integer
  35. Hash cast_integer/2 defmodule Explorer.Chain.Hash do # ... defp cast_integer(integer, byte_count)

    when is_integer(integer) do max_integer = byte_count_to_max_integer(byte_count) case integer do in_range when 0 <= in_range and in_range <= max_integer -> {:ok, %__MODULE__{ byte_count: byte_count, bytes: <<integer::big-integer-size(byte_count)-unit(@bits_per_byte)>> }} _ -> :error end end # ... end When casting from an integer, the important thing is to check the range...
  36. Hash cast_integer/2 defmodule Explorer.Chain.Hash do # ... defp cast_integer(integer, byte_count)

    when is_integer(integer) do max_integer = byte_count_to_max_integer(byte_count) case integer do in_range when 0 <= in_range and in_range <= max_integer -> {:ok, %__MODULE__{ byte_count: byte_count, bytes: <<integer::big-integer-size(byte_count)-unit(@bits_per_byte)>> }} _ -> :error end end # ... end ...as the binary syntax will silently throw away bits of the integer that don't fit the given size and units. There is no overflow exception for constructed binaries.
  37. Hash Testing {:ok, block_hash} = "block_hash" |> sequence(& &1) |>

    Hash.Full.cast() We don't really expect user or external inputs to be integers as it's not really correct to think of the collection of bytes as one continuous number, but it makes tests easier... because we can use a sequence to generate incrementing numbers and then cast it to a Hash.
  38. Hash cast/2 defmodule Explorer.Chain.Hash do # ... def cast(callback_module, term)

    when is_atom(callback_module) do # ... case term do # ... _ -> :error # ... end end # ... end The final input type is a catchall that returns `:error` as cast callbacks should avoid throwing exceptions and instead return :error for any unrecognized inputs.
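Putting the clauses together, usage might look like this. The values are invented for illustration; only the call shapes follow from the code above.

```elixir
alias Explorer.Chain.Hash

hex = "0x" <> String.duplicate("0a", 32)

{:ok, %Hash{byte_count: 32} = hash} = Hash.Full.cast(hex)
{:ok, ^hash} = Hash.Full.cast(hash)        # already-cast structs pass through
{:ok, ^hash} = Hash.Full.cast(hash.bytes)  # raw 32-byte binaries get wrapped
{:ok, _} = Hash.Full.cast(10)              # in-range integers are padded to 32 bytes
:error = Hash.Full.cast("0xabc")           # wrong number of hexadecimal digits
:error = Hash.Full.cast(:anything_else)    # catchall clause
```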
  39. Hash Full defmodule Explorer.Chain.Hash.Full do alias Explorer.Chain.Hash @behaviour Ecto.Type @impl

    Ecto.Type def cast(term) do Hash.cast(__MODULE__, term) end @impl Ecto.Type def dump(term) do Hash.dump(__MODULE__, term) end @impl Ecto.Type def load(term) do Hash.load(__MODULE__, term) end @impl Ecto.Type def type, do: :binary # ... end After cast, the next callback is dump
  40. Hash dump/2 defmodule Explorer.Chain.Hash do # ... def dump(callback_module, term)

    when is_atom(callback_module) do byte_count = callback_module.byte_count() case term do # ensure inconsistent `t` with a different `byte_count` from the # `callback_module` isn't dumped to the database, # in case `%__MODULE__{}` is set in a field value directly %__MODULE__{ byte_count: ^byte_count, bytes: <<_::big-integer-size(byte_count)-unit(@bits_per_byte)>> = bytes } -> {:ok, bytes} _ -> :error end end # ... end Dump is simpler. We only need to extract the bytes for the binary format used in the adapter. The comment hints at a bug that needed to be prevented - copypasta for the wrong hash type being directly set, as happened in some of our tests. Using the wrong type of hash could make the tests fail or more perniciously, accidentally pass.
  41. Hash load/2 defmodule Explorer.Chain.Hash do # ... def load(callback_module, term)

    do byte_count = callback_module.byte_count() case term do # ensure that only hashes of `byte_count` that matches # `callback_module` can be loaded back from database to # prevent using `Ecto.Type` with wrong byte_count on a # database column <<_::big-integer-size(byte_count)-unit(@bits_per_byte)>> -> {:ok, %__MODULE__{byte_count: byte_count, bytes: term}} _ -> :error end end # ... end Load approaches the bug from the other direction. We have a lot of data, and a lot of ways to write it. In some cases, we're using SQL to directly set one field from another, such as when doing a data migration that involves foreign keys. We want the Ecto.Type at runtime to verify data to catch mistakes in other parts of the code as early as possible.
  42. Hash DB? • Validate Input • Separate Invalid/Not Found •

    Alternative Datastores Wait, wait... I said Ecto.Type was useful even when you don't have a database, but I just showed you dump and load, which are definitely only used for a database. So, how do we use Hash.Full.cast without the database? We use it to check inputs without round-tripping to the database. This pre-check is important because Repo.get and Repo.all don't pass through a changeset, so without it we would pass bad inputs straight to the query. It also means we can use the Hashes to look up data in non-Ecto datastores, such as ETS tables and remote APIs.
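A hedged sketch of that "validate before lookup" pattern, assuming a Block schema whose primary key uses the Hash.Full type; the function and error atoms are illustrative, not from the deck:

```elixir
# Casting up front distinguishes malformed input from a missing record,
# and the cast Hash can also key lookups in ETS tables or remote APIs.
def fetch_block(hash_string) do
  case Hash.Full.cast(hash_string) do
    :error ->
      {:error, :invalid_hash}

    {:ok, hash} ->
      case Repo.get(Block, hash) do
        nil -> {:error, :not_found}
        block -> {:ok, block}
      end
  end
end
```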
  43. Without a Database With a Database Ecto Let's move on

    from without a database to when we are using a database...
  44. DB Migrations • Primary Keys • Join Tables • Timestamps

    • Constraints You can't have a database without first creating it, so a logical point for talking about Ecto with a database is the migrations.
  45. Migrations Primary Keys defmodule Explorer.Repo.Migrations.CreateLogs do use Ecto.Migration def change

    do create table(:logs) do # ... end end end • Primary Key • id • Serial • bigint `create table` creates a default primary key called `id`. In Postgres it is a bigserial: an automatically incrementing bigint. Knowing this is important in case you need to add a primary key to a table later.
  46. Migrations Primary Keys defmodule Explorer.Repo.Migrations.CreateAddress do use Ecto.Migration def change

    do create table(:addresses, primary_key: false) do # ... add(:hash, :bytea, null: false, primary_key: true) # ... end end end When we want a non-integer primary key, such as a binary hash or UUID, we need to tell `create table` not to create the default primary key, and then mark the field that is the primary key instead.
  47. Migrations Composite Primary Keys defmodule Explorer.Repo.Migrations.CreateInternalTransactions do use Ecto.Migration def

    change do create table(:internal_transactions, primary_key: false) do # ... add( :transaction_hash, references(:transactions, column: :hash, on_delete: :delete_all, type: :bytea), null: false, primary_key: true ) add(:index, :integer, null: false, primary_key: true) # ... end # ... end end This also works for composite primary keys: you can tag more than one column as `primary_key: true` and they will be turned into a composite primary key.
  48. Migrations Composite Primary Keys defmodule Explorer.Repo.Migrations.CreateInternalTransactions do use Ecto.Migration def

    change do create table(:internal_transactions, primary_key: false) do # ... add( :transaction_hash, references(:transactions, column: :hash, on_delete: :delete_all, type: :bytea), null: false, primary_key: true ) add(:index, :integer, null: false, primary_key: true) # ... end # ... end end This example about internal transactions also shows that a column can be both a foreign key, using references, and a primary key (or part of a composite primary key) at the same time.
  49. Migrations Terminology Block 1 (consensus) • Transaction 1 • Transaction

    2 • Transaction 3 • Transaction 4 Block 2 (consensus) Block 1 (non-consensus) • Transaction 1 • Transaction 2 Parent Nephew Uncle Block 1 (non-consensus) is an uncle of Block 2 Block 2 is a nephew of Block 1 (non-consensus) A brief interlude to explain some domain terminology: a block is a batch of transactions. The blockchain is like git in that any block hash includes the hash of the previous block, its parent. The blockchain is eventually consistent, so different parts of the network can each think they represent the consensus. Ethereum, as part of its design, decided to pay people who pick the incorrect fork by allowing those blocks to be included as uncles on the consensus linked list. The important part is that the correct block is the nephew while the incorrect block is the uncle. We care to track this information because attacks on the network can be detected by seeing if a transaction appears in both an uncle and a consensus block.
  50. Migrations Join Tables defmodule Explorer.Repo.Migrations.CreateBlockSecondDegreeRelations do use Ecto.Migration def change

    do create table(:block_second_degree_relations, primary_key: false) do add(:nephew_hash, references(:blocks, column: :hash, type: :bytea), null: false) add(:uncle_hash, :bytea, null: false) end create(unique_index(:block_second_degree_relations, [:nephew_hash, :uncle_hash], name: :nephew_hash_to_uncle_hash)) create(unique_index(:block_second_degree_relations, [:uncle_hash, :nephew_hash], name: :uncle_hash_to_nephew_hash)) end end In our project, we need to track the relationship between uncle and nephew blocks. Because the relationship is many-to-many we need a join table. Neither foreign key is really primary, so neither has a `primary_key: true` option.
  51. Migrations Join Tables defmodule Explorer.Repo.Migrations.CreateBlockSecondDegreeRelations do use Ecto.Migration def change

    do create table(:block_second_degree_relations, primary_key: false) do add(:nephew_hash, references(:blocks, column: :hash, type: :bytea), null: false) add(:uncle_hash, :bytea, null: false) # ... end create(unique_index(:block_second_degree_relations, [:nephew_hash, :uncle_hash], name: :nephew_hash_to_uncle_hash)) create(unique_index(:block_second_degree_relations, [:uncle_hash, :nephew_hash], name: :uncle_hash_to_nephew_hash)) end end Instead, we need to do what `primary_key` would do manually: (1) null: false on the columns, and (2) a unique index on all ordered pairs of the foreign keys to the joined tables.
  52. Migrations Join Tables defmodule Explorer.Repo.Migrations.CreateBlockSecondDegreeRelations do use Ecto.Migration def change

    do create table(:block_second_degree_relations, primary_key: false) do add(:nephew_hash, references(:blocks, column: :hash, type: :bytea), null: false) add(:uncle_hash, :bytea, null: false) # ... end create(unique_index(:block_second_degree_relations, [:nephew_hash, :uncle_hash], name: :nephew_hash_to_uncle_hash)) create(unique_index(:block_second_degree_relations, [:uncle_hash, :nephew_hash], name: :uncle_hash_to_nephew_hash)) end end The order of keys in an index matters to Postgres. For a join to use an index efficiently, it must be able to select a subtree of the index, which means the index keys need a prefix that matches the planned join order; so when you have multiple foreign keys, you need an index for each direction the keys can be used.
  53. Migrations Lazy Joins defmodule Explorer.Repo.Migrations.CreateBlockSecondDegreeRelations do use Ecto.Migration def change

    do create table(:block_second_degree_relations, primary_key: false) do add(:nephew_hash, references(:blocks, column: :hash, type: :bytea), null: false) add(:uncle_hash, :bytea, null: false) # ... end # ... end end You may have noticed that nephew_hash is a foreign key using `references` while uncle_hash is only a `bytea`. This is because we're getting the data from a remote API, so we know that the nephew is connected to the uncle before we necessarily know about the uncle as an entry in the blocks table as the response isn't fully linked. That's why I call this a lazy join table, as the uncles are lazily fetched asynchronously.
  54. Migrations Lazy Joins defmodule Explorer.Repo.Migrations.CreateBlockSecondDegreeRelations do use Ecto.Migration def change

    do create table(:block_second_degree_relations, ...) do # ... add(:uncle_fetched_at, :utc_datetime_usec, default: fragment("NULL"), null: true) # ... end # ... end end To keep track of whether we have fetched the uncle yet, we can't use the `uncle_hash` column, because we do get the uncle_hash from the remote API. Instead, we have a separate column that marks when we did the fetch.
  55. Migrations Partial Index defmodule Explorer.Repo.Migrations.CreateBlockSecondDegreeRelations do use Ecto.Migration def change

    do # ... create( unique_index(:block_second_degree_relations, [:nephew_hash, :uncle_hash], name: :unfetched_uncles, where: "uncle_fetched_at IS NULL" ) ) end end We need a way to efficiently query for the nephews that don't have their uncles fetched. We set up a partial index, which is an index with a where clause, over all rows where `uncle_fetched_at` is `NULL`. We came to this pattern of a `NULL` timestamp with a partial index for reboot recovery after first trying anti-joins to find foreign keys that didn't point to existing primary keys in the blocks table. Using the partial index takes only a few seconds, while the anti-join took tens of minutes in some cases once the blocks table contained 5-7 million rows.
  56. Migrations Custom Index defmodule Explorer.Repo.Migrations.CreateUsers do use Ecto.Migration def change

    do create table(:users) do # .. add(:username, :string, null: false) # ... end # ... create(index(:users, ["lower(username)"], unique: true, name: :unique_username)) # ... end end You don't have to index the column values directly; you can index any value computed from the columns of the row. We created a unique index on the lower-cased username so the display case is preserved while uniqueness is case-insensitive, without needing to change the collation of the column.
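For Postgres to use such an expression index, the query has to compare the same expression. A hedged sketch of the matching lookup (the function name is illustrative, and `import Ecto.Query` is assumed):

```elixir
# Compare lower(username) so the query can use the expression index.
def get_user_by_username(username) when is_binary(username) do
  User
  |> where([u], fragment("lower(?)", u.username) == ^String.downcase(username))
  |> Repo.one()
end
```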
  57. Migrations Timestamps defmodule Explorer.Repo.Migrations.CreateBlockSecondDegreeRelations do use Ecto.Migration def change do

    create table(:block_second_degree_relations, ...) do # ... add(:uncle_fetched_at, :utc_datetime_usec, default: fragment("NULL"), null: true) # ... end # ... end end You may have noticed that you've already seen the timestamp type, utc_datetime_usec.
  58. Migrations Timestamps defmodule Explorer.Repo.Migrations.CreateAddress do use Ecto.Migration def change do

    create table(:addresses, primary_key: false) do # ... timestamps(null: false, type: :utc_datetime_usec) end # ... end end For the more standard `timestamps` call, the type key is used to override the default type. By default, `null: false` is set, but I've gotten in the habit of explicitly putting `null: false` or `null: true` in migrations for fields to show I thought about the nullability of the column.
  59. Migrations Timestamps :type Ecto Elixir μsec N/A :naive_datetime %NaiveDateTime{} ❌

    :naive_datetime :naive_datetime %NaiveDateTime{} ❌ :naive_datetime_usec :naive_datetime_usec %NaiveDateTime{} ✅ :utc_datetime :utc_datetime %DateTime{} ❌ :utc_datetime_usec :utc_datetime_usec %DateTime{} ✅ Without the `:type` option, the Ecto type is `:naive_datetime`. If you want Elixir to know the database is using UTC, you need to explicitly use `:utc_datetime` for the Ecto type, and then you'll be able to use `%DateTime{}` in Elixir. In Ecto 3.0, the _usec suffix was added to say you expect microseconds to be included, because not all data sources can track microseconds. Assuming microseconds always worked was an Ecto 2.x bug.
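The schema side should match the migration. A hedged sketch (the field layout is illustrative, not copied from the project) of overriding the timestamp type per schema via `@timestamps_opts`:

```elixir
defmodule Explorer.Chain.Address do
  use Ecto.Schema

  alias Explorer.Chain.Hash

  # match the migration's :utc_datetime_usec columns
  @timestamps_opts [type: :utc_datetime_usec]
  @primary_key {:hash, Hash.Address, autogenerate: false}

  schema "addresses" do
    # ... other fields elided ...
    timestamps()
  end
end
```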
  60. Migrations Check Constraints create( constraint( :transactions, :collated_block_number, check: "block_hash IS

    NULL OR block_number IS NOT NULL" ) ) Because we had multiple writers reaching the same table over multiple code paths, we wanted to make sure that the database was enforcing any validations as constraints. This started simply enough: for a domain example, a transaction can be pending, in which case it isn't associated with a block...
  61. Migrations Check Constraints create( constraint( :transactions, :collated_block_number, check: "block_hash IS

    NULL OR block_number IS NOT NULL" ) ) ... or it is not pending and assigned a denormalized block number.
  62. Migrations Check Constraints create( constraint( :transactions, :status, check: """ (internal_transactions_indexed_at

    IS NULL AND status IS NULL) OR (block_hash IS NOT NULL AND internal_transactions_indexed_at IS NULL) OR (block_hash IS NOT NULL AND status IS NOT NULL) """ ) ) Unlike validations that can be skipped because of imperative logic in your schema's changeset function, check constraints are always run. There must always be some way to calculate TRUE in any valid state and FALSE in any invalid state. You can end up with some pretty complex rules if your table represents multiple states for a single field.
  63. Migrations Compact Constraints • Truth Table • Karnaugh Map •

    Leave a comment I got in the habit of writing out the truth table to check my work, so I could explain how each combination of columns could occur. But a truth table would require a lot of ANDs and ORs to write out the full conjunction of disjunctions, so I used a Karnaugh Map to simplify the tables down to the minimum column checks. Finally, I left those truth tables and Karnaugh Maps in comments in the migrations so that anyone can later check the logic.
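For single-row inserts and updates that do go through changesets, the migration's constraint can be surfaced as a changeset error instead of a raised Postgrex.Error. A hedged sketch (the changeset shape is illustrative; `import Ecto.Changeset` is assumed):

```elixir
def changeset(%Transaction{} = transaction, params) do
  transaction
  |> cast(params, [:block_hash, :block_number])  # other fields elided
  |> check_constraint(:block_number,
    name: :collated_block_number,
    message: "can't be NULL once the transaction is collated into a block"
  )
end
```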
  64. Migrations Exclusion Constraint • Operator instead of Equality • Multiple

    rows instead of 1 row • GiST index instead of B-Tree Beyond check constraints, Postgres supports exclusion constraints. Exclusion constraints are a generalization of unique indexes in that they can use any operator instead of only equality. Exclusion constraints differ from check constraints because exclusion constraints can compare multiple rows while check can only see one row. By using a GiST index, we can apply the exclusion to more complex types, like ranges, tsvectors, and inet types.
  65. Migrations Exclusion Constraint defmodule Explorer.Repo.Migrations.CreateBlockRewards do use Ecto.Migration def change

    do create table(:block_rewards, primary_key: false) do add(:block_range, :int8range) # ... end # ... create(constraint(:block_rewards, :no_overlapping_ranges, exclude: ~s|gist (block_range WITH &&)|)) end end You can create an exclusion constraint using a GiST index and the && operator on a range type. && checks for overlap. For us that was an int8range, because we cared about values assigned to a range of block numbers. This ensured we didn't have multiple rewards for the same block number, without having to list out the reward for all 7 million blocks now and into the future, or do a sequential scan inside a trigger to check for collisions, as we would have with two columns instead of the range type.
  66. Migrations Scheduling defmodule MyApp.CreateSchedules do use Ecto.Migration def change do

    create table(:schedules, primary_key: false) do add(:date_range, :daterange) # ... end # ... create(constraint(:schedules, :no_overlapping_ranges, exclude: ~s|gist (date_range WITH &&)|)) end end This also works for timespans if you use a daterange type. This can prevent double-booking of people or resources.
  67. DB Bulk Operations • Validation • Chaining • Locking •

    Division Moving out of the setup of the database with migrations, we come to the actual work of inserting, updating, and reading the data. Our application functioned as a way of caching and exploring data from an external API, which means we had an Extract-Transform-Load, or ETL, process that involved bulk operations on the database instead of single-row operations.
  68. DB Bulk Operations • insert_all(schema_or_source, entries, opts) • update_all(queryable, updates,

    opts) • delete_all(queryable, opts) Looking at the arguments, none of the operations take changesets, which means we can't pass changesets and get back validations.
  69. DB insert_all entries •%{name => value} •[{name, value}] Focusing on

    insert_all, each entry can be either a map or a keyword list of field names pointing to their values.
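Illustrative call shapes (the `hash1`/`hash2`/`now` bindings are assumed): the entries get no changeset validation, and insert_all does not autogenerate timestamps, so they have to be supplied explicitly when the columns are NOT NULL.

```elixir
now = DateTime.utc_now()

Repo.insert_all(Address, [
  %{hash: hash1, inserted_at: now, updated_at: now},
  %{hash: hash2, inserted_at: now, updated_at: now}
])

# The same entry expressed as a keyword list instead of a map:
Repo.insert_all(Address, [[hash: hash1, inserted_at: now, updated_at: now]])
```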
  70. DB Valid entries changes_list(Address, %{params_list: params_list}) changes_list(Address, %{params_list: params_list, with:

    :balance_changeset}) We would like those field values to be validated first, so how do we do that? Let's plan out our API. We want to specify the ecto schema module that has the changeset function once.
  71. DB Valid entries changes_list(Address, %{params_list: params_list}) changes_list(Address, %{params_list: params_list, with:

    :balance_changeset}) Unlike the `changeset` function, we don't want to pass one map of params, but a whole list of maps.
  72. DB Valid entries changes_list(Address, %{params_list: params_list}) changes_list(Address, %{params_list: params_list, with:

    :balance_changeset}) We also need a way to call a changeset function that isn't called `changeset`, so we take inspiration from Ecto's other APIs and add a `:with` option to override the `changeset` function name.
  73. DB Valid entries defp changes_list(ecto_schema_module, %{params_list: params_list} = options) do

    changeset_function_name = Map.get(options, :with, :changeset) struct = ecto_schema_module.__struct__() params_list |> Stream.map(&apply(ecto_schema_module, changeset_function_name, [struct, &1])) |> Enum.reduce({:ok, []}, fn changeset = %Changeset{valid?: false}, {:ok, _} -> {:error, [changeset]} changeset = %Changeset{valid?: false}, {:error, acc_changesets} -> {:error, [changeset | acc_changesets]} %Changeset{changes: changes, valid?: true}, {:ok, acc_changes} -> {:ok, [changes | acc_changes]} %Changeset{valid?: true}, {:error, _} = error -> error end) end We're going to extract our arguments and options.
  74. DB Valid entries defp changes_list(ecto_schema_module, %{params_list: params_list} = options) do

    changeset_function_name = Map.get(options, :with, :changeset) struct = ecto_schema_module.__struct__() params_list |> Stream.map(&apply(ecto_schema_module, changeset_function_name, [struct, &1])) |> Enum.reduce({:ok, []}, fn changeset = %Changeset{valid?: false}, {:ok, _} -> {:error, [changeset]} changeset = %Changeset{valid?: false}, {:error, acc_changesets} -> {:error, [changeset | acc_changesets]} %Changeset{changes: changes, valid?: true}, {:ok, acc_changes} -> {:ok, [changes | acc_changes]} %Changeset{valid?: true}, {:error, _} = error -> error end) end Validate with the changeset function
  75. DB Valid entries defp changes_list(ecto_schema_module, %{params_list: params_list} = options) do

    changeset_function_name = Map.get(options, :with, :changeset) struct = ecto_schema_module.__struct__() params_list |> Stream.map(&apply(ecto_schema_module, changeset_function_name, [struct, &1])) |> Enum.reduce({:ok, []}, fn changeset = %Changeset{valid?: false}, {:ok, _} -> {:error, [changeset]} changeset = %Changeset{valid?: false}, {:error, acc_changesets} -> {:error, [changeset | acc_changesets]} %Changeset{changes: changes, valid?: true}, {:ok, acc_changes} -> {:ok, [changes | acc_changes]} %Changeset{valid?: true}, {:error, _} = error -> error end) end And then reduce the changesets to a single ok or error tuple.
  76. DB changes_list defp changes_list(...) do # ... params_list |> Stream.map(

    &apply(ecto_schema_module, changeset_function_name, [struct, &1])) # ... end Let's zoom-in to see those steps in detail. This is still pretty dense, so let's fill it in with some specifics.
  77. DB changes_list defp changes_list(...) do # ... params_list |> Stream.map(

    &apply(Address, changeset_function_name, [struct, &1])) # ... end We'll change `ecto_schema_module` to `Address`
  78. DB changes_list defp changes_list(...) do # ... params_list |> Stream.map(

    &apply(Address, :changeset, [struct, &1])) # ... end Then fill in `:changeset` for `changeset_function_name`.
  79. DB changes_list defp changes_list(...) do # ... params_list |> Stream.map(

    &Address.changeset( struct, &1)) # ... end Now that the module and function are no longer variables, we can replace the `apply` with a normal call, which makes it more obvious that we're validating each map of params in the list separately and we'll end up with a `Stream` of `Changesets`.
  80. DB changes_list defp changes_list(...) do # ... |> Enum.reduce({:ok, []},

    fn changeset = %Changeset{valid?: false}, {:ok, _} -> {:error, [changeset]} changeset = %Changeset{valid?: false}, {:error, acc_changesets} -> {:error, [changeset | acc_changesets]} %Changeset{changes: changes, valid?: true}, {:ok, acc_changes} -> {:ok, [changes | acc_changes]} %Changeset{valid?: true}, {:error, _} = error -> error end) end We need to convert those `changesets` into either an `:error` or an `:ok` `tuple`. We're going to use the all or nothing approach, so if we encounter an invalid `changeset` we switch to an `:error` `tuple`.
  81. DB changes_list defp changes_list(...) do # ... |> Enum.reduce({:ok, []},

    fn changeset = %Changeset{valid?: false}, {:ok, _} -> {:error, [changeset]} changeset = %Changeset{valid?: false}, {:error, acc_changesets} -> {:error, [changeset | acc_changesets]} %Changeset{changes: changes, valid?: true}, {:ok, acc_changes} -> {:ok, [changes | acc_changes]} %Changeset{valid?: true}, {:error, _} = error -> error end) end From then on, we ignore any valid changesets...
  82. DB changes_list defp changes_list(...) do # ... |> Enum.reduce({:ok, []},

    fn changeset = %Changeset{valid?: false}, {:ok, _} -> {:error, [changeset]} changeset = %Changeset{valid?: false}, {:error, acc_changesets} -> {:error, [changeset | acc_changesets]} %Changeset{changes: changes, valid?: true}, {:ok, acc_changes} -> {:ok, [changes | acc_changes]} %Changeset{valid?: true}, {:error, _} = error -> error end) end ... and only accumulate more errors.
  83. DB changes_list defp changes_list(...) do # ... |> Enum.reduce({:ok, []},

    fn changeset = %Changeset{valid?: false}, {:ok, _} -> {:error, [changeset]} changeset = %Changeset{valid?: false}, {:error, acc_changesets} -> {:error, [changeset | acc_changesets]} %Changeset{changes: changes, valid?: true}, {:ok, acc_changes} -> {:ok, [changes | acc_changes]} %Changeset{valid?: true}, {:error, _} = error -> error end) end The only way to exit the reduce with an `:ok` tuple is for all changesets to be valid.
  84. DB changes_list defp changes_list(...) do # ... |> Enum.reduce({:ok, []},

    fn changeset = %Changeset{valid?: false}, {:ok, _} -> {:error, [changeset]} changeset = %Changeset{valid?: false}, {:error, acc_changesets} -> {:error, [changeset | acc_changesets]} %Changeset{changes: changes, valid?: true}, {:ok, acc_changes} -> {:ok, [changes | acc_changes]} %Changeset{valid?: true}, {:error, _} = error -> error end) end We can extract out the `map` that `insert_all` needs from the `changes` field in the `changeset`.
  85. DB insert_all ** (Postgrex.Error) ERROR 23505 (unique_violation) duplicate key value

    violates unique constraint "addresses_pkey" table: addresses constraint: addresses_pkey We had multiple Tasks running this `insert_all` and the remote API could supply the information for this table to any number of those Tasks, so we ran into errors like this where we're violating unique constraints. Because `insert_all` doesn't use `Changeset`s, there's no support built into Ecto to convert the `Postgrex.Error` to a `Changeset` error.
  86. DB Upsert Repo.insert_all(Address, changes, conflict_target: :hash, on_conflict: :replace_all) Thankfully, `insert_all`

    supports `conflict_target` and `on_conflict` options, so we can convert an insert into an upsert.
  87. DB on_conflict Query Repo.insert_all(Address, changes, conflict_target: :hash, on_conflict: on_conflict_query) Using

    `:replace_all` for `:on_conflict` led to a problem where a foreign key constraint thought we were deleting a referenced primary key. The executor doesn't check whether the value is the same or not; all it cares about is that the column in the row is being set. When this happened, we shifted away from last-write-wins and instead thought through specific business rules for how each column should update.
  88. DB on_conflict Query from(address in Address, update: [ set: [

    ... ] ] ) The general structure of an `on_conflict` `Ecto.Query` is a `from` of the inserted table with a nested `:update` `:set`.
  89. DB on_conflict Query from(address in Address, update: [ set: [

    contract_code: fragment( "COALESCE(?, EXCLUDED.contract_code)", address.contract_code ) ... ] ] ) The individual `:set` arguments are updates to columns. One useful one is using coalesce to pick the non-null value.
  90. DB on_conflict Query from(address in Address, update: [ set: [

    contract_code: fragment( "COALESCE(?, EXCLUDED.contract_code)", address.contract_code ) ... ] ] ) Although `coalesce` is supported in Ecto 3, Ecto 3 does not understand the EXCLUDED metavariable, ...
  91. DB on_conflict Query from(address in Address, update: [ set: [

    contract_code: fragment( "COALESCE(?, EXCLUDED.contract_code)", address.contract_code ) ... ] ] ) ... so we use fragment so we can use the SQL directly.
  92. DB on_conflict Query from(address in Address, update: [ set: [

    contract_code: fragment( "COALESCE(?, EXCLUDED.contract_code)", address.contract_code ) ... ] ] ) We need to use `?` to bind in address.contract_code because Ecto aliases tables to short names like a0 or t0, which we can't guess cleanly.
  93. DB EXCLUDED • Pseudo-table • Columns from new row •

    Called excluded because it was excluded from being added • Includes BEFORE INSERT triggers The EXCLUDED metavariable presents a pseudo-table. Its purpose is to hold the single row that was excluded from being inserted. The columns in EXCLUDED reflect the row immediately before it would replace the current row, so BEFORE INSERT triggers have already fired. If you take all the EXCLUDED column values, it is the same as using `:replace_all`.
  94. DB on_conflict Query from(address in Address, update: [ set: [

    # ARGMAX on two columns fetched_coin_balance: fragment( """ CASE WHEN EXCLUDED.fetched_coin_balance_block_number IS NOT NULL AND (? IS NULL OR EXCLUDED.fetched_coin_balance_block_number >= ?) THEN EXCLUDED.fetched_coin_balance ELSE ? END """, address.fetched_coin_balance_block_number, address.fetched_coin_balance_block_number, address.fetched_coin_balance ), # MAX on two columns fetched_coin_balance_block_number: fragment( "GREATEST(EXCLUDED.fetched_coin_balance_block_number, ?)", address.fetched_coin_balance_block_number ) ] ]) The other pattern that we found useful was to correlate updates. In this case, we're using GREATEST on `fetched_coin_balance_block_number` to take the most recent block number between what was already in the database and what is being inserted. `GREATEST` is a nice operator that lets you take the max of two values on the same row, as opposed to SQL's MAX aggregate, which takes the max of values in the same column.
  95. DB on_conflict Query from(address in Address, update: [ set: [

    # ARGMAX on two columns fetched_coin_balance: fragment( """ CASE WHEN EXCLUDED.fetched_coin_balance_block_number IS NOT NULL AND (? IS NULL OR EXCLUDED.fetched_coin_balance_block_number >= ?) THEN EXCLUDED.fetched_coin_balance ELSE ? END """, address.fetched_coin_balance_block_number, address.fetched_coin_balance_block_number, address.fetched_coin_balance ), # MAX on two columns fetched_coin_balance_block_number: fragment( "GREATEST(EXCLUDED.fetched_coin_balance_block_number, ?)", address.fetched_coin_balance_block_number ) ] ]) We use `CASE` to emulate an ARGMAX: taking the balance value from whichever of the current row or the EXCLUDED row matches the block number GREATEST picks below. This way we keep the most recent value in real time and the time it was recorded, instead of last-write-wins, which may not match real time because of distributed systems and multiple writing Tasks. This is similar to ensuring that your checking account shows the most recent balance, and when that balance was last updated, on an account summary page.
  96. DB Duplicate Row ** (Postgrex.Error) ERROR 21000 (cardinality_violation) ON CONFLICT

    DO UPDATE command cannot affect row a second time hint: Ensure that no rows proposed for insertion within the same command have duplicate constrained values. Great, that fixed everything, right? Nope. We now got this error. Even though `on_conflict` deals with a row already existing, Postgres does not allow duplicate rows in the same INSERT statement, as it assumes you likely have a bug.
  97. DB De-duplicate •map() •MapSet.t() •Enum.uniq_by So how did we get

    around this? We started to use Maps and MapSets with explicit merge rules to ensure that we wouldn't have duplicate params before calling `changes_list`. I recommend avoiding `Enum.uniq_by` because it throws away the differences in the non-unique keys, which can hide bugs earlier in the ETL process.
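A hedged sketch of that idea (field names are borrowed from the earlier address examples; the merge rule is illustrative): deduplicate by primary key with an explicit merge rule rather than `Enum.uniq_by/2`, which would silently drop one side's data.

```elixir
defp dedup_address_params(params_list) do
  params_list
  |> Enum.reduce(%{}, fn params, acc ->
    # one entry per hash; conflicts go through the explicit merge rule
    Map.update(acc, params.hash, params, &merge_address_params(&1, params))
  end)
  |> Map.values()
end

# Keep the values from whichever params carry the higher block number,
# mirroring the GREATEST/ARGMAX rules used in the on_conflict query.
defp merge_address_params(first, second) do
  if (second[:fetched_coin_balance_block_number] || -1) >=
       (first[:fetched_coin_balance_block_number] || -1) do
    Map.merge(first, second)
  else
    Map.merge(second, first)
  end
end
```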
  98. DB Extraction • Merge rules match between code paths •

    Merge rules match between Elixir and SQL In our specific case, we had so many code paths being able to generate address references that the code for extraction was moved to a separate module so the rules could remain consistent. It was important that the Elixir merge rules for any two addresses used the same logic as the on_conflict option passed to insert_all so whether a conflict was detected in Elixir or SQL, it would be resolved the same.
  99. DB Deadlock ** (Postgrex.Error) ERROR 40P01 (deadlock_detected): deadlock detected Process

    440 waits for ShareLock on transaction 1804; blocked by process 433. Process 433 waits for ShareLock on transaction 1823; blocked by process 440. We got rid of unique index violations and duplicate row errors, but now we hit deadlocks. I didn't even think deadlocks were possible with Postgres's MultiVersion Concurrency Control (MVCC).
  100. DB Sorting def insert_addresses(changes_list) do # order so that row

    ShareLocks are grabbed in a consistent order ordered_changes_list = sort_address_changes_list(changes_list) end defp sort_address_changes_list(changes_list) do Enum.sort_by(changes_list, & &1.hash) end https://dockyard.com/blog/2018/06/15/you-still-need-to-care-about-locks-with-transactions To fix the deadlocks, we had to make sure that the `insert_all`s in different `Task`s wrote the rows in the same order, which means picking an arbitrary order and sorting all writes to the same table in that order. The order is arbitrary because the unique key of an address is a Hash, which by its nature is randomly distributed. If you'd like more details on how I figured this out, I wrote a blog post about it, available on the DockYard blog at the link on the slide.
  101. DB Ecto.Multi Multi.new() |> Multi.delete_all(:delete_addresses, old_addresses_query) |> Multi.insert_all(:insert_addresses, Address, changes)

    |> Multi.update_all(:update_address, address_reference_query) |> Multi.run(:run_dependent, fn repo, changes_so_far -> dependent(repo, changes_so_far) end) |> Repo.transaction() So far, I've only shown a single `insert_all`, which can only insert into a single table. To insert into multiple tables, in one transaction, you can use Ecto.Multi. Ecto.Multi has similar functions to Ecto.Repo. For bulk operations the functions we care about are `delete_all`, `insert_all`, `update_all`, and `run`.
  102. DB Ecto.Multi Multi.new() |> Multi.delete_all(:delete_addresses, old_addresses_query) |> Multi.insert_all(:insert_addresses, Address, changes)

    |> Multi.update_all(:update_address, address_reference_query) |> Multi.run(:run_dependent, fn repo, changes_so_far -> dependent(repo, changes_so_far) end) |> Repo.transaction() All steps of a Multi need a user specified name.
  103. DB Ecto.Multi Multi.new() |> Multi.delete_all(:delete_addresses, old_addresses_query) |> Multi.insert_all(:insert_addresses, Address, changes)

    |> Multi.update_all(:update_address, address_reference_query) |> Multi.run(:run_dependent, fn repo, changes_so_far -> dependent(repo, changes_so_far) end) |> Repo.transaction() Those names are used for keys in the `changes_so_far` map that is used for Multi.run.
  104. DB Ecto.Multi Multi.new() |> Multi.delete_all(:delete_addresses, old_addresses_query) |> Multi.insert_all(:insert_addresses, Address, changes)

    |> Multi.update_all(:update_address, address_reference_query) |> Multi.run(:run_dependent, fn repo, changes_so_far -> dependent(repo, changes_so_far) end) |> Repo.transaction() Multi.run is the most flexible step in a Multi. It can take the output of any previous steps to do dependent database operations, but Multi.run doesn't even need to affect the database. It can run anything, including updating ETS tables or doing remote API interactions that should still be transactional.
  105. DB Ecto.Multi Multi.new() |> Multi.delete_all(:delete_addresses, old_addresses_query) |> Multi.insert_all(:insert_addresses, Address, changes)

    |> Multi.update_all(:update_address, address_reference_query) |> Multi.run(:run_dependent, fn repo, changes_so_far -> dependent(repo, changes_so_far) end) |> Repo.transaction() Multi is like Stream, in that it doesn't do anything until passed to a different API. In Stream's case those are the Enum functions. In Multi's case, it is calling Repo.transaction.
  106. DB Data Dependencies 1. Primary Key 2. Foreign Key references()

    ensures orderability Part of using Ecto.Multi, or any operations in the same transaction, is ensuring the ordering. Primary keys need to be inserted before foreign keys that point to those primary keys. As long as you're using references in your migrations and true foreign key constraints in the database, it shouldn't be possible to have data dependency loops. The order of your Multi steps should be able to match the order of the migrations that set up the tables, as you can't create a table with a foreign key to a non-existent table.
  107. DB Import.Runner defmodule Explorer.Chain.Import.Runner do alias Ecto.Multi @callback ecto_schema_module() ::

    module() @callback run( Multi.t(), changes_list, %{optional(atom()) => term()} ) :: Multi.t() # ... end Having all the tables' validation and Multi calls in one module led to over 1500 lines of code in one file. Inside the file there were a lot of repeated patterns, which to me screamed a refactoring opportunity: extract modules and a behaviour.
  108. DB Import.Runner defmodule Explorer.Chain.Import.Runner do alias Ecto.Multi @callback ecto_schema_module() ::

    module() @callback run( Multi.t(), changes_list, %{optional(atom()) => term()} ) :: Multi.t() # ... end The `ecto_schema_module` callback is used by the validation layer to validate params across all runners before any runner is run. This ensures we don't eat up database resources on transactions that will fail partway through due to validation errors.
  109. DB Import.Runner defmodule Explorer.Chain.Import.Runner do alias Ecto.Multi @callback ecto_schema_module() ::

    module() @callback run( Multi.t(), changes_list, %{optional(atom()) => term()} ) :: Multi.t() # ... end The `run` callback mimics Multi.run, but simplifies the process in that only the `changes_list` for the specific `ecto_schema_module` is passed to the callback, instead of each runner having to pick out its correct `changes_list`.
  110. DB Import.Runner defmodule Explorer.Chain.Import.Runner do alias Ecto.Multi @callback ecto_schema_module() ::

    module() @callback run( Multi.t(), changes_list, %{optional(atom()) => term()} ) :: Multi.t() # ... end Since the `run` callback both takes in and returns a Multi, it doesn't have to add a single Multi operation; it can add any number of operations.
  111. DB Import.Runner defmodule Explorer.Chain.Import.Runner.Blocks do # ... @impl Import.Runner def

    run(multi, changes_list, options) do multi |> Multi.run(:derive_transaction_forks, ...) |> Multi.run(:fork_transactions, ...) |> Multi.run(:lose_consenus, ...) |> Multi.run(:blocks, ...) |> Multi.run(:uncle_fetched_block_second_degree_relations, ...) |> Multi.run(:internal_transaction_transaction_block_number, ...) end # ... end We found chaining operations around the `insert_all` into the Runner's `ecto_schema_module` useful for two different use cases...
  112. DB Import.Runner defmodule Explorer.Chain.Import.Runner.Blocks do # ... @impl Import.Runner def

    run(multi, changes_list, options) do multi |> Multi.run(:derive_transaction_forks, ...) |> Multi.run(:fork_transactions, ...) |> Multi.run(:lose_consenus, ...) |> Multi.run(:blocks, ...) |> Multi.run(:uncle_fetched_block_second_degree_relations, ...) |> Multi.run(:internal_transaction_transaction_block_number, ...) end # ... end Before updating the runner's table for any anticipated updates in the upsert
  113. DB Import.Runner defmodule Explorer.Chain.Import.Runner.Blocks do # ... @impl Import.Runner def

    run(multi, changes_list, options) do multi |> Multi.run(:derive_transaction_forks, ...) |> Multi.run(:fork_transactions, ...) |> Multi.run(:lose_consenus, ...) |> Multi.run(:blocks, ...) |> Multi.run(:uncle_fetched_block_second_degree_relations, ...) |> Multi.run(:internal_transaction_transaction_block_number, ...) end # ... end ... and after the table update to denormalize any data.
  114. DB Import.Runner defmodule Explorer.Chain.Import do # ... @runners [ Import.Addresses,

    Import.Address.CoinBalances, Import.Blocks, Import.Block.Rewards, Import.Block.SecondDegreeRelations, Import.Transactions, Import.Transaction.Forks, Import.InternalTransactions, Import.Logs, Import.Tokens, Import.TokenTransfers, Import.Address.CurrentTokenBalances, Import.Address.TokenBalances ] # ... end With all our validation and Multi.runs separated by table, we could declare the data dependencies with a simple ordered list. This is 13 tables in one transaction, and we ran into issues with transactions timing out even after 2 minutes.
  115. DB Import.Runner defmodule Explorer.Chain.Import do # ... @runners [ Import.Addresses,

    Import.Address.CoinBalances, Import.Blocks, Import.Block.Rewards, Import.Block.SecondDegreeRelations, Import.Transactions, Import.Transaction.Forks, Import.InternalTransactions, Import.Logs, Import.Tokens, Import.TokenTransfers, Import.Address.CurrentTokenBalances, Import.Address.TokenBalances ] # ... end Most of the timeouts were due either to inserting addresses or to Postgres implicitly holding ShareLocks to ensure foreign keys remain valid when inserting into tables that reference addresses. You could run into similar problems if a lot of your tables reference your users or accounts table.
  116. DB Separate Updates • Is it confusing? • Is it

    recoverable? We needed a way to break up the transaction. Thinking through the problem, we realized that although addresses with no other data aren't very useful, having them in the database when the application crashes still leaves the database in a recoverable state.
  117. DB Import.Stage defmodule Explorer.Chain.Import.Stage do @callback runners() :: [Runner.t(), ...]

    @callback multis( runner_to_changes_list, %{optional(atom()) => term()} ) :: {[Multi.t()], runner_to_changes_list} end Since we could separate address upserts, we needed a way to represent separate transactions using the Import.Runner behaviour. I added a layer, called Import.Stage, that is allowed to consume one or more Runners.
  118. DB Import.Stage defmodule Explorer.Chain.Import.Stage do @callback runners() :: [Runner.t(), ...]

    @callback multis( runner_to_changes_list, %{optional(atom()) => term()} ) :: {[Multi.t()], runner_to_changes_list} end Importantly, a Stage, unlike a Runner, can emit a list of Multis, which means we can chunk the problematic, slow updates into separate Multis and therefore separate DB transactions to beat the timeout.
  119. DB Import.Stage defmodule Explorer.Chain.Import do # ... defp runner_to_changes_list_to_multis(runner_to_changes_list, options)

    do {multis, ...} = Enum.flat_map_reduce(@stages, runner_to_changes_list, fn stage, remaining_runner_to_changes_list -> stage.multis(remaining_runner_to_changes_list, options) end) # ... multis end # ... end To use the stages we collect all the Multis.
  120. DB Import.Stage defmodule Explorer.Chain.Import do # ... defp import_transactions(multis, options)

    do Enum.reduce_while(multis, {:ok, %{}}, fn multi, {:ok, acc_changes} -> case Repo.transaction(multi, timeout: options[:timeout]) do {:ok, changes} -> {:cont, {:ok, Map.merge(acc_changes, changes)}} {:error, _, _, _} = error -> {:halt, error} end end) end # ... end ... and run each in its own transaction, halting as soon as one of them hits an error.
  121. DB Autovacuum • Insert writes tuples • Autovacuum defrags old

    tuples • ON CONFLICT always writes Breaking up the transaction got us past the timeout problem, but we were still seeing a lot of activity in Postgres. Autovacuum seemed to be running constantly. Autovacuum runs to reclaim dead tuples and to gather metadata, such as which tuples are active and the cardinality, or approximate row count, of each table. This metadata is used to determine the query plan, such as which join to do first and whether to use an index or a sequential scan. Autovacuum is critical to keeping Postgres fast, so we couldn't just disable it. What we determined was that although `on_conflict` prevents errors and lets us do upserts, it always writes. Even if the address or any other table's row is identical, Postgres does not care. It still writes a new row, which internally Postgres calls a tuple.
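In other words, an upsert whose on_conflict carries no condition, something along the lines of this simplified sketch (not the actual BlockScout call; conflict_target and options trimmed down):

    Repo.insert_all(Address, changes_list,
      conflict_target: :hash,
      on_conflict: :replace_all
    )

rewrites every conflicting row, and therefore leaves a dead tuple behind for autovacuum to clean up, even when the incoming values are identical to what is already stored.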
  122. DB ON CONFLICT WHERE from(address in Address, update: [ set:

    ... ], where: fragment("COALESCE(?, EXCLUDED.contract_code) IS DISTINCT FROM ?", address.contract_code, address.contract_code ) or fragment( "EXCLUDED.fetched_coin_balance_block_number IS NOT NULL AND " <> "(? IS NULL OR EXCLUDED.fetched_coin_balance_block_number >= ?)", address.fetched_coin_balance_block_number, address.fetched_coin_balance_block_number ) or fragment( "GREATEST(?, EXCLUDED.nonce) IS DISTINCT FROM ?", address.nonce, address.nonce ) ) Thankfully, Postgres and Ecto are once again great: The Ecto.Query we pass to `on_conflict`, can have a `where` clause. Only when the `where` clause is true will the `on_conflict` write a new tuple.
  123. DB ON CONFLICT WHERE from(address in Address, update: [ set:

    ... ], where: fragment("COALESCE(?, EXCLUDED.contract_code) IS DISTINCT FROM ?", address.contract_code, address.contract_code ) or fragment( "EXCLUDED.fetched_coin_balance_block_number IS NOT NULL AND " <> "(? IS NULL OR EXCLUDED.fetched_coin_balance_block_number >= ?)", address.fetched_coin_balance_block_number, address.fetched_coin_balance_block_number ) or fragment( "GREATEST(?, EXCLUDED.nonce) IS DISTINCT FROM ?", address.nonce, address.nonce ) ) Postgres has a useful operator, IS DISTINCT FROM, that can be used to compare the update to the old value.
  124. DB IS DISTINCT FROM

    Left     Right    !=       IS DISTINCT FROM
    NULL     NULL     NULL     FALSE
    NULL     VALUE    NULL     TRUE
    VALUE    NULL     NULL     TRUE
    VALUE    VALUE    FALSE    FALSE
    VALUE1   VALUE2   TRUE     TRUE

    IS DISTINCT FROM is better than != because it does not propagate NULL, so you can get a boolean value when comparing NULL with itself and NULL with non-NULL values to detect change.
  125. DB IS DISTINCT FROM TUPLE where: fragment( "(EXCLUDED.block_hash, EXCLUDED.block_number, EXCLUDED.created_contract_address_hash,

    EXCLUDED.cumulative_gas_used, EXCLUDED.from_address_hash, EXCLUDED.gas, EXCLUDED.gas_price, EXCLUDED.gas_used, EXCLUDED.index, EXCLUDED.internal_transactions_indexed_at, EXCLUDED.input, EXCLUDED.nonce, EXCLUDED.r, EXCLUDED.s, EXCLUDED.status, EXCLUDED.to_address_hash, EXCLUDED.v, EXCLUDED.value) IS DISTINCT FROM (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)", transaction.block_hash, transaction.block_number, transaction.created_contract_address_hash, transaction.cumulative_gas_used, transaction.from_address_hash, transaction.gas, transaction.gas_price, transaction.gas_used, transaction.index, transaction.internal_transactions_indexed_at, transaction.input, transaction.nonce, transaction.r, transaction.s, transaction.status, transaction.to_address_hash, transaction.v, transaction.value ) IS DISTINCT FROM also works on tuples, so instead of chaining together `fragment`s with `or`, we can use a single fragment with one IS DISTINCT FROM for all columns. The only downside is that it gets very long and is purely positional, so it is very easy to mess up the column matching between EXCLUDED and the Ecto columns. Using IS DISTINCT FROM with the tuple format does have the advantage that you're sending a smaller query to the server. That was important because we also ran into issues with sending overly long query strings to Postgres: we had to limit our inserts to 500 rows to prevent exceptions due to buffer sizes in Postgres.
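Chunking the params before calling insert_all keeps each statement under that limit. A sketch, where 500 is the batch size we landed on and `address_on_conflict/0` is a hypothetical helper standing in for the conflict query from the earlier slide:

    defp insert_addresses(repo, changes_list, options) do
      # Hypothetical helper: upsert in batches so no single INSERT carries more than 500 rows.
      changes_list
      |> Enum.chunk_every(500)
      |> Enum.flat_map(fn chunk ->
        {_count, addresses} =
          repo.insert_all(Address, chunk,
            conflict_target: :hash,
            on_conflict: address_on_conflict(),
            returning: true,
            timeout: options[:timeout]
          )

        addresses
      end)
    end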
  126. DB ETL 1. Unique 2. Validate 3. Key Dependencies 4.

    Transaction Cracking 5. Multi.run 6. Sort Changes 7. on_conflict 8. SELECT table/EXCLUDED 9. WHERE DISTINCT So, the final version of the ETL process:
    1. Ensure only unique params reach the load step, using maps or MapSets with explicit merge rules that match the updates in the Extract step.
    2. Validate the params and extract the valid changes from the changesets.
    3. Order the inserts across tables so foreign keys are guaranteed to reference existing primary keys.
    4. Break up the inserts into transactions that complete in a reasonable time to prevent timeouts, but still allow recovery on reboot.
    5. Use Multi.run to group operations across tables into a single transaction.
    6. For each insert or update, sort the changes to prevent deadlocks.
    7. When inserting, use on_conflict to upsert.
    8. Don't use replace_all; instead use an Ecto.Query for better-than-last-write-wins rules, which can use EXCLUDED to refer to the new row.
    9. To prevent slowdowns due to autovacuuming, use a WHERE clause on the on_conflict query so you don't write rows unless a change has occurred.
  127. Conclusion Without a Database • Ecto.Changeset • Ecto.Type With a

    Database • Indexes • Partial • GiST • Types • Ecto.Type.dump/load • Constraints • Check • Exclude • Ecto.Multi • insert_all • run • Upsert • on_conflict • where Ecto can be used with or without a database. Without a database, Ecto.Changeset can be used for validation, but for more pervasive type validation and casting, you can use Ecto.Type directly. With a database, Ecto migrations give you access to the full power of Postgres indexes, types, and constraints without having to drop down to executing SQL strings. Ecto.Type can help with a database by allowing your Elixir type to be richer than the Postgres type. Ecto.Multi can help with ETL processes by allowing the code for one DB transaction to be contributed by multiple Elixir functions and modules in charge of different business rules. When the logic becomes too complex for insert_all, Multi.run is an escape hatch to allow pre- and post-processing around the insert_all. To convert those insert_alls to upserts, on_conflict can be used. :replace_all works, but using an Ecto.Query for specific business rules is better. To improve the throughput of the upserts and lower write load, use a where clause, so that only changed rows are written.
  128. Contact Personal Email [email protected] Twitter/Slack @KronicDeth Meetup https://www.meetup.com/Austin-Elixir

    Professional Email Luke.Imhoff@DockYard.com Twitter @DockYard Hire https://dockyard.com/contact/hire-us IntelliJ Elixir IDE https://www.jetbrains.com/idea/download Plugin https://plugins.jetbrains.com/plugin/7522 GitHub https://github.com/KronicDeth/intellij-elixir IntelliJ Community Edition is FREE! Links If you have questions, you can contact me by email or on Twitter or the Elixir Lang Slack as @KronicDeth. I also host the Austin Elixir meetup if you're near Austin, Texas.
If you're interested in working with me, you can contact DockYard with the Hire link or email me directly and I'll get you in touch with our sales department. If you're interested in IntelliJ Elixir, you can download IntelliJ Community Edition for free. Plugin installation happens inside the IDE, but you can also view the plugin in the JetBrains plugin repository, and the source for the plugin is available on GitHub.