A Primer on Blockchain Storage Economics

Shreemoy Mishra
RootstockLabs: Research & Technology
Feb 9, 2021

Most web and mobile applications rely on cloud-storage solutions. Storage costs charged to developers and enterprises are determined using tiered-pricing schemes that depend on several factors, such as the amount of data stored, duration, bandwidth, latency, security, replication and distributed design. There are also contractual obligations and service level guarantees.

Things are quite different in trustless, peer-to-peer blockchain networks like Bitcoin, Ethereum and RSK. In this post we explore some basic engineering economics of storing data in blockchains.

Storing Blocks vs Storing State

When thinking about the costs of data storage in Ethereum-like blockchains, we distinguish historical transaction data (stored in blocks) from state data. The “state”, at any point in time, is a “snapshot” of account balances, the smart contract code that powers decentralized applications, and the data stored on-chain for use by these applications. Usually, “state” refers to the current one. There is no common reference clock in blockchains, so a previous state is typically identified by a block height (or block number).

Using decentralized applications involves interacting with smart contracts. By interaction, we mean

  • a Transaction sent from a user’s account …
  • usually through their cryptographic wallet software in a browser or mobile app…
  • to a smart contract — identified by the contract’s address on the blockchain…
  • with some data for the contract to process.
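The four ingredients listed above can be sketched as a small record type. This is an illustrative sketch only: the field names and the placeholder addresses are assumptions made for this example and do not follow any particular client's data model. The `gas_limit` and `gas_price` fields anticipate the fee discussion later in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    sender: str      # address of the user's account (placeholder format)
    to: str          # address of the smart contract being called
    data: bytes      # payload for the contract to process
    gas_limit: int   # maximum units of gas the sender will pay for
    gas_price: int   # price offered per unit of gas, in gwei


# Hypothetical addresses, used purely for illustration.
tx = Transaction(
    sender="0xSenderAddress",
    to="0xContractAddress",
    data=b"\x01\x02",
    gas_limit=100_000,
    gas_price=20,
)

# The most the sender can be charged, in gwei.
max_fee = tx.gas_limit * tx.gas_price
```

In practice the wallet software fills in these fields and signs the transaction before broadcasting it.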

Most transactions involve reading and modifying some state data, and these changes must be persistent across interactions. Contracts and any contract-specific data that are no longer needed can be removed from state. There are even financial incentives to do so and reduce state size.

At any point in time, the blockchain’s state is the net outcome of all historical transactions until that point. When a new blockchain node starts for the first time, it re-creates the current state of the chain by downloading blocks from (older) peers and executing (i.e. replaying and verifying) all the transactions in every block starting from block 0 (called the “Genesis” block). This initial sync can take hours, or even days, depending on bandwidth and processing power. A node that has been offline for some time has to do something similar, but it can start from wherever it left off, i.e. from the snapshot of the state when it was last online.

Whether starting from scratch or simply catching up, the process of re-executing all prior blocks is sequential — the very nature of blockchains like Bitcoin and RSK prevents parallel execution of transactions. Trying to process transactions in a parallel fashion can fail, or much worse — lead to inconsistent state changes. After all, the whole point of blockchains is for non-trusting peers to reach consensus on a single historical order of transactions.
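The replay process can be illustrated with a toy model. The sketch below assumes a state that is just a dictionary of balances and transactions that are plain transfers; `Transfer`, `Block`, `apply_transfer` and `replay` are hypothetical names chosen for this example, not part of any node software.

```python
from dataclasses import dataclass, field


@dataclass
class Transfer:
    sender: str
    recipient: str
    amount: int


@dataclass
class Block:
    number: int
    transactions: list = field(default_factory=list)


def apply_transfer(state: dict, tx: Transfer) -> None:
    """Apply one transfer in place; reject it if the sender cannot pay."""
    if state.get(tx.sender, 0) < tx.amount:
        raise ValueError(f"insufficient balance for {tx.sender}")
    state[tx.sender] = state.get(tx.sender, 0) - tx.amount
    state[tx.recipient] = state.get(tx.recipient, 0) + tx.amount


def replay(blocks: list, snapshot: dict, from_block: int = 1) -> dict:
    """Rebuild state by executing blocks strictly in height order."""
    state = dict(snapshot)  # work on a copy of the starting snapshot
    for block in sorted(blocks, key=lambda b: b.number):
        if block.number < from_block:
            continue  # already reflected in the snapshot
        for tx in block.transactions:
            apply_transfer(state, tx)
    return state


genesis = {"marina": 100}
chain = [
    Block(1, [Transfer("marina", "celia", 30)]),
    Block(2, [Transfer("celia", "marina", 10)]),
]

full_sync = replay(chain, genesis)                       # new node
snapshot_at_1 = replay(chain[:1], genesis)               # node went offline here
catch_up = replay(chain, snapshot_at_1, from_block=2)    # node comes back
```

Both paths arrive at the same balances. Order matters even in this tiny example: if the transfer in block 2 were executed before the one in block 1, Celia would have no funds and `apply_transfer` would reject it.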

Block and transaction data are typically encoded in a manner specified by the chain’s protocol. These are then stored on disk using key-value databases. The state, however, is stored using tree-like abstract data structures called tries. Ethereum uses variants with 16 branches (hexary tries) while RSK uses a binary trie. See this post for a rich comparison of these approaches. State data is stored in trie leaf nodes. Leaf nodes used to store data that are specific to smart contracts are called Storage Cells. Depending on the implementation (e.g. Geth, Turbo-Geth, RSKJ), the tries may be stored in encoded form or expanded form. But ultimately, as with transaction data, they are backed by key-value databases.
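To make the idea of a binary trie concrete, here is a deliberately simplified sketch. It is not RSK's or Ethereum's actual data structure: it has no path compression or node encoding, and it uses SHA-256 from Python's standard library as a stand-in hash function. The class and function names are assumptions for this example.

```python
import hashlib


class Node:
    def __init__(self):
        self.children = [None, None]  # one child per bit value: 0 or 1
        self.value = None             # data held at this node, if any


def key_bits(key: bytes):
    """Yield the bits of a key, most significant bit first."""
    for byte in key:
        for shift in range(7, -1, -1):
            yield (byte >> shift) & 1


def node_hash(node) -> bytes:
    """Hash a node from its children's hashes and its own value."""
    if node is None:
        return b"\x00" * 32
    h = hashlib.sha256()
    h.update(node_hash(node.children[0]))
    h.update(node_hash(node.children[1]))
    h.update(node.value or b"")
    return h.digest()


class BinaryTrie:
    def __init__(self):
        self.root = Node()

    def put(self, key: bytes, value: bytes) -> None:
        node = self.root
        for bit in key_bits(key):
            if node.children[bit] is None:
                node.children[bit] = Node()
            node = node.children[bit]
        node.value = value  # the leaf at the end of the path holds the data

    def get(self, key: bytes):
        node = self.root
        for bit in key_bits(key):
            node = node.children[bit]
            if node is None:
                return None
        return node.value

    def root_hash(self) -> bytes:
        return node_hash(self.root)


trie = BinaryTrie()
trie.put(b"\x0a", b"balance=5")
before = trie.root_hash()
trie.put(b"\x0a", b"balance=6")
after = trie.root_hash()  # differs from `before`
```

Because every node's hash depends on its children, changing a single leaf changes the root hash, which lets peers compare entire states by comparing one short value.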

The cumulative size of blockchain history (blocks, transactions, transaction receipts) in Ethereum is over a terabyte. It simply won’t fit on my laptop’s 512GB SSD, so I need an external drive. Ethereum’s current state is smaller, less than 100 gigabytes (GB). While the state can fit on my computer’s internal disk, it is far too big to fit in RAM (about 8GB on my machine). These storage limitations have a large impact on node and network performance. The RSK blockchain is younger and smaller than Ethereum. But even then, the storage requirements are not trivial.

Implicit and Explicit Costs of Storage

In Ethereum and RSK, blockchain storage costs, as charged to users via transaction fees, relate almost entirely to storing or modifying state, not to storing historical transaction data. Of course, those operating blockchain nodes have to cover the costs of storing both the state and the (much larger) history.

We cannot process transactions without state data. To execute even a simple transfer of coins from Marina’s account to Celia’s we need to know Marina’s current balance. The record of her past transactions does not play any direct role. In Ethereum-like settings, someone operating a blockchain client (or node) will try to keep as much of the blockchain’s state as practical in RAM. This is unlike Bitcoin, where new transactions are constructed using previous transactions as input! There are no “accounts” or “balances” in Bitcoin. In fact, some people say there is no state in Bitcoin, only transactions with “leftover money”, i.e. Unspent Transaction Outputs (UTXOs). Those familiar with Satoshi’s paper will recall that the only way to avoid Bitcoin’s fundamental security problem, the double spending attack, is to be aware of all prior transactions. This holds for all blockchains. For maximum security, anyone running a full node must maintain the entire set of historical transactions.

As the number of users, accounts, and contracts grows, it becomes impossible to store all of the state data in RAM. Imagine that the size of the state is currently 50GB, and suppose I dedicate 5GB of my computer’s RAM to storing state, enough to hold only 10% of it. Then, if accesses are spread evenly across the state, about 90% of the time the state data needed to execute a transaction will not be in RAM. We need to retrieve it from disk. This IO latency slows things down, and it gets worse as state size continues to increase. IO latency can also be a target for denial of service (DoS) attacks by malicious actors. The fact that reads are blocking and transaction processing cannot be parallelized (not at present anyway) also makes things worse.
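A back-of-the-envelope calculation shows why the miss rate matters so much. The sketch below assumes state accesses are spread uniformly, and the two latency figures are illustrative assumptions (roughly 100 ns for RAM and 100 µs for a disk read), not measurements from any node.

```python
def cache_hit_rate(cache_gb: float, state_gb: float) -> float:
    """Fraction of state lookups served from RAM under uniform access."""
    return min(1.0, cache_gb / state_gb)


def expected_access_ns(hit_rate: float, ram_ns: float, disk_ns: float) -> float:
    """Average lookup time: hits cost a RAM access, misses a disk read."""
    return hit_rate * ram_ns + (1.0 - hit_rate) * disk_ns


RAM_NS = 100        # assumed RAM access latency
DISK_NS = 100_000   # assumed disk read latency

hit_now = cache_hit_rate(5, 50)        # the 5GB cache, 50GB state example
avg_now = expected_access_ns(hit_now, RAM_NS, DISK_NS)

hit_later = cache_hit_rate(5, 100)     # same cache after the state doubles
avg_later = expected_access_ns(hit_later, RAM_NS, DISK_NS)
```

With these assumed figures the average lookup is dominated by disk reads, and it gets slower as the state grows while the cache stays the same size.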

IO latency is an implicit cost of storing blockchain data, and developers can use tools to keep track of such performance costs. Some Ethereum developers try to restrict their smart contracts from accessing too much state data. They are willing to trade higher fees (for additional computation) for lower IO latency. Most developers, however, are much more familiar with the explicit costs of storing data. As mentioned earlier, these costs are primarily driven by state data, not by the storage used for previous transactions.

Smart contracts are executed in the environment of the Ethereum Virtual Machine (EVM). The individual instructions that form the basis of the code are called EVM opcodes. Each EVM opcode has an associated computational cost, measured in a unit called gas. The total cost of all the operations required by a transaction determines that transaction’s fees. Miners (or block producers) collect fees as compensation. When broadcasting their (cryptographically-signed) transactions to the network, users (i.e. transaction senders) must specify the maximum amount of gas they are willing to pay for. This is called a transaction’s gaslimit. After a transaction is executed, any leftover gas is refunded to the sender.
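The relationship between gas limit, gas used and the refund of leftover gas can be sketched in a few lines. This is a simplified model: the function name is hypothetical, it assumes the sender is charged for the full gas limit up front, and it leaves out what happens when a transaction runs out of gas.

```python
def settle(gas_limit: int, gas_used: int, gas_price: int) -> dict:
    """Split the sender's prepayment into the miner's fee and the refund."""
    if gas_used > gas_limit:
        # Out-of-gas handling is outside the scope of this sketch.
        raise ValueError("transaction needs more gas than its gas limit")
    prepaid = gas_limit * gas_price      # reserved when execution starts
    fee = gas_used * gas_price           # what the operations actually cost
    return {
        "prepaid": prepaid,
        "fee_to_miner": fee,
        "refund_to_sender": prepaid - fee,   # leftover gas goes back
    }


# A sender allows up to 100,000 gas at 20 gwei per gas; execution uses 60,000.
result = settle(gas_limit=100_000, gas_used=60_000, gas_price=20)
```

The amounts are in gwei because the gas price is quoted in gwei in this example.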

Saving or modifying data in a smart contract’s storage nodes (called Storage Cells) utilizes an EVM opcode called SSTORE. Each storage cell can hold 32 bytes of data. The cost of an SSTORE operation depends on how it is used: to create a new cell, to reset a value, or to clear data.

When used to create a new storage cell, the SSTORE opcode costs 20,000 gas. Using the same SSTORE opcode to delete a cell (changing the value from non-zero to zero) results in a refund of 15,000 gas. Updating some previously stored (non-zero) value to a non-zero value in a storage cell costs 5,000 gas.
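These three cases can be captured in a small lookup function. The 20,000, 5,000 and 15,000 figures are the ones quoted above. The 5,000 gas charge for the clearing write itself (and for a zero-to-zero write) follows the original Ethereum fee schedule and is an assumption here, since the text only states the refund. The function name is made up for this sketch.

```python
SSTORE_NEW = 20_000        # zero -> non-zero: create a new cell
SSTORE_RESET = 5_000       # any other write (assumed for clear and no-op)
SSTORE_CLEAR_REFUND = 15_000


def sstore_gas(old_value: int, new_value: int) -> tuple:
    """Return (gas_cost, gas_refund) for writing new_value over old_value."""
    if old_value == 0 and new_value != 0:
        return SSTORE_NEW, 0
    if old_value != 0 and new_value == 0:
        return SSTORE_RESET, SSTORE_CLEAR_REFUND
    return SSTORE_RESET, 0


create = sstore_gas(0, 42)    # new cell
update = sstore_gas(42, 43)   # modify an existing value
delete = sstore_gas(43, 0)    # clear the cell and earn a refund
```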

These are some costs of writing data to blockchain state — what about reading? The SLOAD opcode is used to read data from a contract’s storage cell, and it costs 200 gas per read. Reading an account’s balance (using BALANCE) costs 400 gas. These values are for the RSK blockchain.

Of course, these details of various opcodes and gas costs remain hidden from regular users — they only need to be aware that transaction fees increase with complexity — richer interactions cost more.

The gas costs were originally calibrated by Ethereum developers to the execution time (e.g. in nanoseconds) of different operations. These are engineering costs or computational resource costs. As blockchain software and hardware evolve over time, these engineering costs drift away from the initial benchmarks. Therefore, these references ought to be re-calibrated from time to time. One example is a recently adopted proposal in Ethereum, EIP-1884, which altered some of these engineering costs. For instance, in Ethereum, an SLOAD now costs 800 gas (up from 200), while checking the balance now costs 700 gas (up from 400).

The path from engineering costs to actual economic and business costs goes through “fee-market” economics. The first component is gasprice, which links each unit of gas to the native cryptocurrency like Ether or BTC (in RSK). The second factor is more obvious — the exchange rate between the native coin and a fiat currency like the dollar.

The fact that the exchange rate between the US dollar and Ether or Bitcoin changes over time comes as no surprise. However, people are sometimes surprised by the volatility in gas price.
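The two conversion steps, gas to native coin and native coin to fiat, amount to a pair of multiplications. The sketch below uses a hypothetical helper name. The 21,000 gas figure for a basic transfer and the $1800 coin price are taken from later in this post, while the two gas prices are chosen only to show how the same engineering cost maps to very different fiat costs.

```python
GWEI_PER_COIN = 10**9  # a gwei is a billionth of the native coin


def fee_in_fiat(gas_used: int, gas_price_gwei: float, coin_price: float) -> float:
    """Fiat cost = gas used x gas price (in coin) x exchange rate."""
    fee_in_coin = gas_used * gas_price_gwei / GWEI_PER_COIN
    return fee_in_coin * coin_price


BASIC_TRANSFER_GAS = 21_000

quiet_network = fee_in_fiat(BASIC_TRANSFER_GAS, gas_price_gwei=20, coin_price=1800)
busy_network = fee_in_fiat(BASIC_TRANSFER_GAS, gas_price_gwei=200, coin_price=1800)
```

The gas consumed is identical in both cases; only the gas price differs, and the fiat cost scales with it.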

Gas prices change over time because there is limited space in blocks for transactions. Limits are needed to ensure blocks are created at regular intervals without imposing too much computational burden on nodes. Instead of directly limiting the number of transactions, the block limit is expressed in terms of the total gas that the transactions in a block can consume. Currently, this is around 12.5 million gas in Ethereum.

The key thing though is that users (i.e. transaction senders) set their own offer for gas price — and block producers (miners) prioritize transactions that offer them more money for the same computational cost.

This “market competition” between users to have their transactions included in a block (before others’) makes gas price determination somewhat like an auction for limited blockspace. There is a (highly debated) proposal to alter the way gas prices evolve in Ethereum. But that is another topic.
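The auction-like behavior can be illustrated with a greedy block builder that takes the highest-paying transactions first until the block's gas limit is reached. This is a toy model under stated assumptions: real block producers also consider nonce ordering and other constraints, and the transaction records and the `build_block` name are invented for this example.

```python
def build_block(pending: list, block_gas_limit: int) -> list:
    """Pick transactions by descending gas price until the block is full."""
    chosen = []
    gas_left = block_gas_limit
    for tx in sorted(pending, key=lambda t: t["gas_price"], reverse=True):
        if tx["gas"] <= gas_left:
            chosen.append(tx)
            gas_left -= tx["gas"]
    return chosen


# Hypothetical pending transactions; gas prices are in gwei.
pending = [
    {"id": "a", "gas": 6_000_000, "gas_price": 50},
    {"id": "b", "gas": 8_000_000, "gas_price": 200},
    {"id": "c", "gas": 4_000_000, "gas_price": 120},
    {"id": "d", "gas": 21_000, "gas_price": 10},
]

block = build_block(pending, block_gas_limit=12_500_000)
included = [tx["id"] for tx in block]
```

Transaction "a" is left waiting even though it offers more per unit of gas than "d", because it no longer fits in the remaining space. A sender who wants to get in sooner has to outbid the others.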

Gaming the Market: Gas Arbitrage

Introducing economic incentives into any system with limited resources can lead to undesirable side effects and externalities. One example is the interaction between “fixed” engineering costs (in gas) and the “variable” economic ones: gas price and exchange rates.

Two EVM opcodes, SSTORE and SELFDESTRUCT (which deletes a previously deployed contract), offer gas refunds for removing unwanted information from blockchain state. These refunds serve as gas subsidies and are intended to encourage developers to reduce state size. The refunds are not actually paid out in coin! Rather, they are more like coupons: they can only be used to partially offset transaction fees, and they are applied at the very end of a transaction’s execution.

However, some users take advantage of this incentive scheme to implement a sort of “gas bank” (e.g. see gastoken). They strategically store data on chain when gasprice is cheap, and then claim refunds for deleting that data (stored earlier) when gasprice is very high. This behavior is called gas arbitrage and it is frowned upon because it contributes to storage bloat. In Ethereum, rampant use of such patterns has led to proposals to remove refunds entirely.
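A rough model shows why the arbitrage pays off only when the two gas prices differ enough. The sketch rests on several assumptions that are not from this post: clearing a cell is charged 5,000 gas, the total refund is capped at half of the gas used by the transaction (as in Ethereum at the time of writing), and the overhead of the storing transaction itself is ignored. The function names and the example figures are hypothetical.

```python
SSTORE_NEW = 20_000          # gas to create a cell (from the text)
SSTORE_CLEAR_REFUND = 15_000 # refund for clearing a cell (from the text)
SSTORE_CLEAR_GAS = 5_000     # assumed charge for the clearing write


def fee_plain(base_gas: int, price: int) -> int:
    """Fee in gwei for a transaction that does not use stored cells."""
    return base_gas * price


def fee_with_gas_bank(base_gas: int, cells: int, cheap: int, expensive: int) -> int:
    """Total fee in gwei: store cells when cheap, clear them when expensive."""
    store_fee = cells * SSTORE_NEW * cheap
    gas_used = base_gas + cells * SSTORE_CLEAR_GAS
    refund = min(cells * SSTORE_CLEAR_REFUND, gas_used // 2)  # assumed cap
    spend_fee = (gas_used - refund) * expensive
    return store_fee + spend_fee


# A 1,000,000 gas transaction sent when gas costs 200 gwei,
# using 40 cells that were stored earlier at 10 gwei.
without_bank = fee_plain(1_000_000, 200)
with_bank = fee_with_gas_bank(1_000_000, 40, cheap=10, expensive=200)

# The same scheme with no price difference loses money.
no_spread = fee_with_gas_bank(1_000_000, 40, cheap=200, expensive=200)
```

In this model the saving comes entirely from the gap between the cheap and expensive gas prices, while the stored cells occupy state in the meantime.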

Accounting for Storage Costs

Anyone running a full blockchain node has to cover storage costs, including bandwidth, storage drives, and IO access. Transaction senders pay for storage costs as part of transaction fees. At the other end, blockchain miners collect the transaction fees associated with a block. In RSK, transaction fees are their only source of income. In Ethereum, miners also earn a block subsidy. Lately, even in Ethereum, transaction fees have become a significant part of revenue.

What about non-mining full nodes? They do not receive any transaction fees.

Professional node operators — such as crypto exchanges, merchants, oracles and other service providers — operate nodes with massive amounts of RAM and disk storage. Some of them even provide archiving services by storing multiple snapshots of blockchain state at various points in time. This allows them to serve complex queries about an account or contract’s historical balance or transactions. Being service providers, they have revenue streams to cover their large operating costs.

Individuals running full nodes do not receive any compensation. They do so perhaps out of altruism or for security, since a wallet connected to a full node offers maximum security. Reducing storage costs offers the highest benefit to individual peers and can encourage more users to run full nodes.

Current State of Affairs in Ethereum

Early proponents of the DeFi movement had imagined that these innovations would enable ridiculously cheap value transfers and financial services for “the masses”. However, at present, even simple transfers of the native coin (not tokens) cost about $8 in Ethereum. This is because, while the engineering costs of executing transactions have not changed much, the economics has changed dramatically.

With ETH at close to $1800 and gasPrice around 200 gwei (a gwei is a billionth of an ETH), each SSTORE operation to store additional data costs about $8. That’s $8 to store 32 bytes! Modifying an existing value (using the same opcode) is 4 times cheaper. $2 to modify 32 bytes of data is hardly something to cheer about. An ERC20 token transfer involves a balance check (400 gas), two balance updates (5,000 gas each), the basic transaction fee (21,000 gas), and additional fees for interacting with the token’s smart contract. In terms of gas, these costs add up to about 37,000 gas, around $13 at present, for a simple token transfer!
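The arithmetic behind these figures can be reproduced directly. The sketch uses the gas amounts, gas price and ETH price quoted above. The helper name is an assumption, and the "contract overhead" entry is simply the difference between the quoted total of about 37,000 gas and the itemized parts.

```python
GAS_PRICE_GWEI = 200
ETH_USD = 1800
GWEI_PER_ETH = 10**9


def usd(gas: int) -> float:
    """Dollar cost of a given amount of gas at the prices above."""
    return gas * GAS_PRICE_GWEI * ETH_USD / GWEI_PER_ETH


new_cell_usd = usd(20_000)       # SSTORE creating a cell
update_cell_usd = usd(5_000)     # SSTORE modifying a cell
plain_transfer_usd = usd(21_000) # basic transaction fee only

token_transfer_gas = {
    "balance check": 400,
    "two balance updates": 2 * 5_000,
    "basic transaction fee": 21_000,
}
itemized_gas = sum(token_transfer_gas.values())
contract_overhead_gas = 37_000 - itemized_gas   # implied by the quoted total
token_transfer_usd = usd(37_000)
```

At exactly $1800 and 200 gwei, creating a cell comes to $7.20 and modifying one to $1.80, which the text rounds to about $8 and $2. The token transfer comes to $13.32.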

Current State in RSK

At current gasPrices, simple send transactions on RSK cost about $0.05. That is not exactly cheap, but Bitcoin at over $40,000 will do that: RSK’s native currency is pegged 1:1 with Bitcoin. Nevertheless, with a ratio of 8 to 0.05, payments are 160 times cheaper on RSK than on Ethereum.

Staying with the storage costs theme, an SSTORE opcode on RSK also costs about $0.05. And this 1:160 cost advantage over Ethereum carries over generally to all costs of interacting with decentralized applications.

The RSK community has several proposals and research projects to improve the economics of blockchain storage. For example, RSKIP-215 is a proposal to adopt consensus-determined state checkpointing. Such synchronized state checkpoints can allow nodes to prune old blocks and transactions from storage. Another proposal, state access rent, explores implementing state access fees along a time dimension to provide incentives for better use of storage resources.

In the Ethereum community there are proposals called scaling solutions — which can offload hundreds or thousands of transactions to a secondary layer. They can provide significant reductions in storage used and also reduce transaction fees. These approaches are also being actively researched in the RSK community. In addition to these initiatives, the community is also working on integrating novel solutions for incentivized, torrent-like, decentralized storage of arbitrary content using technologies like IPFS and Swarm — e.g. see RIF Storage solutions.
