On-Chain Series V: Leveraging Azure L-Series Machines for High-Performance Blockchain Data Processing

Our NFS-based solution was a major leap forward in optimizing block data retrieval, but Ethereum’s vast address data introduced new challenges. Initially, Redis was our go-to solution for handling address metadata. Redis is ideal for this purpose because it allows us to quickly increment counters and set specific properties on each address. However, as our operations scaled, the costs associated with running large Redis instances became unsustainable. Managing 300 million addresses required approximately 750GB of RAM, an expense well beyond our project budget.

We initially tried to optimize costs by archiving data in batches, moving less frequently accessed data from Redis to blob storage. However, this approach significantly impacted performance; our block processing speed plummeted from 4,000 blocks per minute to just 400 — a tenfold decrease in speed and a corresponding tenfold increase in the time required to process all blocks. It became clear that while Redis was unmatched in speed, it was not practical at this scale due to the prohibitive cost of RAM. This is where Azure’s L-Series machines, equipped with NVMe storage, provided a compelling alternative. Although NVMe storage is slower than Redis, it offers a much more cost-effective solution for batch processing large datasets, especially during the backfill phase.

Ethereum’s ecosystem, with its millions of interactions and addresses, demands an infrastructure capable of high-speed processing without incurring prohibitive costs. The smallest Azure L8 machine, with 1.8TB of NVMe storage at $400 per month, provided sufficient capacity to handle our data processing needs. During the initial backfill, which involved processing and storing metadata for all 300 million Ethereum addresses, we used approximately 450GB of the available storage. This setup allowed us to process the entire Ethereum blockchain — including transactions, logs, and traces — across all addresses in under 48 hours.
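
To put the figures quoted so far side by side, here is a small back-of-the-envelope calculation. It only uses numbers from this post (300 million addresses, 750GB of Redis RAM, 450GB used on a 1.8TB NVMe drive) and assumes decimal units (1GB = 10^9 bytes); the variable and function names are purely illustrative.

```python
# Figures quoted in the post; decimal units are an assumption (1 GB = 10**9 bytes).
ADDRESSES = 300_000_000
REDIS_RAM_GB = 750        # RAM needed to hold all address metadata in Redis
NVME_USED_GB = 450        # NVMe space used by the initial backfill
NVME_CAPACITY_GB = 1_800  # 1.8TB drive on the L8 machine


def bytes_per_address(total_gb: int, addresses: int = ADDRESSES) -> float:
    """Average footprint of one address record, in bytes."""
    return total_gb * 10**9 / addresses


redis_bytes = bytes_per_address(REDIS_RAM_GB)       # average bytes per address in RAM
nvme_bytes = bytes_per_address(NVME_USED_GB)        # average bytes per address on disk
disk_utilisation = NVME_USED_GB / NVME_CAPACITY_GB  # fraction of the drive in use

print(f"Redis: {redis_bytes:.0f} B/address, NVMe: {nvme_bytes:.0f} B/address, "
      f"drive {disk_utilisation:.0%} full")
```

On these numbers, an address costs about 2.5KB in Redis versus about 1.5KB on disk, and the backfill fills only a quarter of the drive.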

To fully leverage the capabilities of the L-Series machines, we optimized our processing pipeline to manage Ethereum’s extensive data efficiently:

Data Ingestion: Using the NVMe storage on the L8 machine, we designed our pipeline to process blocks sequentially in batches of 10,000. Reading a batch from the NFS server took from 10 seconds for the smaller, early Ethereum blocks to 30 seconds for the most recent ones. Parsing all transactions, logs, and traces and counting interactions took a further 4–5 seconds. The final step, saving or merging the address metadata on the NVMe disk, took another 20–30 seconds per batch. Each batch touched approximately 1 million addresses, which means about 1 million reads, merges, and writes in under 30 seconds. With this setup we sustained roughly 10,000 blocks per minute, enabling us to accurately capture critical metrics such as each address’s first transaction and to perform comprehensive analyses based on transactions, logs, and traces across all interacting Ethereum addresses.
Batch Processing for Address Metadata: During the initial backfill, we utilized the L8 machine to process large batches of address metadata efficiently. This capability allowed us to complete the backfill process in just two days, ensuring that our entire dataset was up-to-date and ready for further analysis.
Data Migration to Cloudflare R2: After completing the backfill, we migrated all processed data to Cloudflare R2 blobs for long-term storage. This migration freed up resources on the L8 machine, keeping our infrastructure cost-effective while the data remained easily accessible for future use.
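
The read, merge, and write step from the ingestion stage can be sketched as follows. This is a minimal illustration, not our production code: the one-JSON-file-per-address layout, the field names tx_count and first_block, and the function names are all assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Optional


def merge_record(existing: Optional[dict], update: dict) -> dict:
    """Combine a stored record with one batch's counts for the same address."""
    if existing is None:
        return dict(update)
    return {
        "tx_count": existing["tx_count"] + update["tx_count"],
        # The first transaction is the earliest block seen so far.
        "first_block": min(existing["first_block"], update["first_block"]),
    }


def merge_batch(root: Path, updates: Dict[str, dict]) -> None:
    """Read, merge, and write one file per address touched by the batch."""
    for address, update in updates.items():
        path = root / f"{address}.json"  # hypothetical file naming
        existing = json.loads(path.read_text()) if path.exists() else None
        path.write_text(json.dumps(merge_record(existing, update)))


# Two batches touching the same (made-up) address.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    merge_batch(root, {"0xabc": {"tx_count": 2, "first_block": 150}})
    merge_batch(root, {"0xabc": {"tx_count": 3, "first_block": 90}})
    record = json.loads((root / "0xabc.json").read_text())
```

After both batches, the stored record holds the summed transaction count and the lowest block number seen. Each address costs one read and one write per batch, which is the access pattern that benefits from local NVMe.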

Reacting Quickly to Bugs and Expanding Capabilities

The ability to achieve a 48-hour turnaround on comprehensive stats across all Ethereum addresses has been a game-changer. This rapid processing capability enables us to quickly identify and react to any bugs or missed behaviours in our systems. If an issue arises, we can promptly reprocess the entire dataset, ensuring that our analyses remain accurate and up-to-date. This quick turnaround also facilitates the rapid expansion of our services. As we integrate new blockchains, we can confidently scale our infrastructure to handle the additional data, knowing that we can maintain the same level of performance and accuracy.

Transition to Real-Time Processing

Following the backfill and the migration of data to Cloudflare R2, we scaled down the L8 machine to a more cost-effective F series configuration with 2 CPUs and 4GB of RAM, reducing our ongoing monthly costs to just $70. In this phase, we rely on Redis for state management, keeping the most active addresses in memory up to 70% of the VM’s RAM. Less active data is periodically offloaded to blob storage, so the system stays inexpensive while remaining capable of real-time processing. By selectively managing data in Redis and offloading older data to blob storage, we strike a balance between performance and cost that can continue to scale with our growing needs. In addition, once we are caught up with the chain, we only need to process an average of 5 blocks per minute for Ethereum. The Redis and blob storage combination can handle up to 400 blocks per minute, which makes it an ideal and cost-effective solution.
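
One way to picture the offloading policy is the sketch below. It is an illustration under stated assumptions rather than our actual implementation: it models the in-memory state as plain dictionaries instead of talking to Redis, and it assumes that "least active" means the lowest last-active block number. The function and parameter names are hypothetical.

```python
from typing import Dict, List


def select_for_offload(
    last_active_block: Dict[str, int],
    record_bytes: Dict[str, int],
    ram_bytes: int,
    budget_fraction: float = 0.70,
) -> List[str]:
    """Return the least recently active addresses to move to blob storage
    so that the remaining records fit within the RAM budget."""
    budget = int(ram_bytes * budget_fraction)
    used = sum(record_bytes.values())
    to_offload: List[str] = []
    # Oldest activity first, so the most active addresses stay in memory.
    for address in sorted(last_active_block, key=last_active_block.get):
        if used <= budget:
            break
        to_offload.append(address)
        used -= record_bytes[address]
    return to_offload


# Toy numbers: 1,000 bytes of RAM, so the budget is 700 bytes.
candidates = select_for_offload(
    last_active_block={"a": 10, "b": 50, "c": 99},
    record_bytes={"a": 400, "b": 300, "c": 200},
    ram_bytes=1_000,
)
```

In the toy example the three records use 900 bytes, so only the stalest address, "a", is selected, which brings usage down to 500 bytes.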

Technical Configuration

To maximize performance and cost efficiency, we carefully optimized the technical setup on our L-Series machines:

Partition Structure: The NVMe storage was configured with ext4, utilizing specific tuning parameters (-i 1700 and -O large_dir) to provide over 1 billion inodes on the 1.8TB drive. This configuration is more than capable of managing Ethereum’s 300 million addresses and supports efficient file operations at scale. We experimented with the XFS file system but found no significant performance improvements over ext4, so we standardized on ext4 across our infrastructure.
File Organization: We organized the folders for address metadata based on the first four characters of each address. Because hexadecimal characters can repeat, four characters yield 16^4 = 65,536 possible prefixes, so this structure uses at most 65,536 inodes for the folders, a negligible share of the inode budget. This organization ensures that file operations remain efficient even at scale, maximizing the performance benefits of NVMe storage and enabling rapid processing and retrieval of data.
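
The layout and the inode budget described above can be checked with a short sketch. A few assumptions are baked in: the four-character prefix is taken after stripping the leading 0x, addresses are lower-cased, the drive is treated as exactly 1.8 * 10^12 bytes, and the function name and root path are hypothetical. The real inode count reported by the filesystem will differ slightly from this estimate.

```python
from pathlib import PurePosixPath

HEX_DIGITS = "0123456789abcdef"


def shard_path(root: str, address: str, prefix_len: int = 4) -> PurePosixPath:
    """Map an address to root/<first prefix_len hex chars>/<address>."""
    hex_part = address.lower()
    if hex_part.startswith("0x"):
        hex_part = hex_part[2:]
    if len(hex_part) != 40 or any(c not in HEX_DIGITS for c in hex_part):
        raise ValueError(f"not a 20-byte hex address: {address!r}")
    return PurePosixPath(root) / hex_part[:prefix_len] / hex_part


# Inode budget implied by the ext4 bytes-per-inode setting (-i 1700).
DRIVE_BYTES = 1_800_000_000_000               # 1.8TB, decimal (assumption)
BYTES_PER_INODE = 1_700
inodes = DRIVE_BYTES // BYTES_PER_INODE       # a little over 1 billion
folders = len(HEX_DIGITS) ** 4                # hex characters may repeat
files_per_folder = 300_000_000 / folders      # average, if evenly spread

example = shard_path("/mnt/nvme/addresses", "0x" + "ABCDEF" + "0" * 34)
print(example, inodes, folders, round(files_per_folder))
```

With 300 million addresses spread over 65,536 folders, each folder holds a few thousand files on average, and the whole tree uses well under a third of the available inodes.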

Leveraging Azure’s L-Series machines has dramatically improved our blockchain data processing capabilities. The combination of the NFS server for quick access to full block data, NVMe storage, and a well-structured file system allowed us to process Ethereum’s entire blockchain — including detailed stats based on transactions, logs, and traces across all addresses — in under 48 hours. This rapid turnaround not only allows us to quickly address any issues but also provides the flexibility to expand our capabilities as needed. By migrating processed data to Cloudflare R2 and scaling down the L8 machine, we’ve also managed to maintain a cost-effective infrastructure that supports real-time processing.

In our next post, we’ll discuss the Address Metadata Endpoint, diving into its capabilities and exploring the various use cases it unlocks for developers, analysts, and other blockchain enthusiasts. This endpoint, built on the foundation of our optimized infrastructure, is poised to offer unparalleled insights into blockchain addresses and their interactions.

*NFS (Network File System) is a distributed file system protocol that allows a computer to access files over a network as if they were located on its local storage. NFS was originally developed by Sun Microsystems in the 1980s and has since become a widely used standard for sharing files across UNIX and Linux systems.