Entering the realm of on-chain data processing exposed us to one of the most common challenges in the blockchain world: managing the immense scale and variance of blockchain data. This post outlines our evolution from relying heavily on blob storage to adopting a highly optimised NFS*-based solution, which drastically reduced blockchain data retrieval times — from 30 days to under 24 hours for sequential processing of the entire Ethereum blockchain.
*NFS (Network File System) is a distributed file system protocol that allows access to files over a network as if they were on a local disk. Unlike blob storage, which is object-based and typically optimised for large amounts of unstructured data, NFS is file-based and provides more granular access to files, making it highly suitable for scenarios requiring low-latency access and strong file system semantics.
As our operations grew and we integrated with multiple blockchains and decentralized exchanges, our data storage strategy needed to adapt. Blob storage initially seemed ideal — scalable, cost-effective, and capable of handling vast amounts of data, including entire blocks, transactions, and metadata. However, as we scaled, we ran into challenges that demanded a more advanced solution.
While blob storage proved to be excellent for long-term archival of blockchain data, it quickly became apparent that it was not designed for the high-speed, frequent access required by our real-time processing and backfilling pipelines. Adding new decentralized exchange (DEX) integrations almost daily exacerbated the situation. Here’s why:
- Latency Issues: Each data retrieval operation from blob storage could take up to 100ms. While this might seem trivial, it resulted in significant delays when dealing with millions of blocks. Processing Ethereum’s 20.5 million blocks sequentially would have taken around 30 days (see the back-of-the-envelope calculation after this list), far too long when rapid DEX integration was a priority.
- Inefficient Data Retrieval: The semi-structured nature of blockchain data meant that extracting necessary information often required reading and parsing large, cumbersome chunks of data. This inefficiency severely hampered our ability to process data at the required speed.
- Rising Storage Costs: Although blob storage itself was cost-effective, the inefficiencies in data retrieval forced us to over-provision infrastructure to compensate for the delays. The need to download and unzip large amounts of data in parallel for quick access drove operational costs higher, making the existing setup unsustainable.
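To put the latency figure in perspective, here is the arithmetic behind the 30-day estimate, using only the numbers quoted above. It covers pure retrieval time; parsing and processing add further overhead on top.

```python
# Sequential backfill estimate from the figures above.
blocks = 20_500_000        # Ethereum blocks at the time of writing
read_latency_s = 0.100     # ~100ms per blob retrieval

total_days = blocks * read_latency_s / 86_400
print(f"{total_days:.1f} days of pure retrieval time")  # ~23.7 days, before any parsing
```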
To overcome these challenges, we embarked on a mission to make our data more accessible internally without losing the external cost benefits of blob storage. The solution was a hybrid storage model that balanced the need for long-term, cost-effective storage with the demand for rapid, real-time data access.
Our new approach revolved around a two-tier storage solution:
- Long-Term Blob Storage: We continued to use blob storage for long-term archival. All blockchain data was stored in a gzipped format, significantly reducing storage costs. This setup was ideal for archival purposes where speed wasn’t a critical factor.
- Short-Term Unzipped Storage: For data requiring frequent access, particularly in real-time processing, we introduced a short-term storage solution using an NFS server. This allowed us to store unzipped data in a structured format, ensuring fast access and retrieval.
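To illustrate how the two tiers fit together, the sketch below shows a hypothetical read path: try the unzipped copy on the NFS mount first, and fall back to downloading and decompressing the gzipped archive copy from blob storage if the block is not in the hot tier. The mount point, file naming, and the fetch_blob helper are illustrative assumptions rather than our actual code.

```python
import gzip
import json
from pathlib import Path

NFS_ROOT = Path("/mnt/nfs/ethereum/blocks")  # assumed mount point for the hot tier

def fetch_blob(block_number: int) -> bytes:
    """Placeholder for a gzipped-block download from blob storage (assumed helper)."""
    raise NotImplementedError

def read_block(block_number: int) -> dict:
    # Hot path: unzipped JSON already sitting on the NFS share (on-disk layout described below).
    hot_copy = NFS_ROOT / f"{block_number}.json"
    if hot_copy.exists():
        return json.loads(hot_copy.read_text())
    # Cold path: pull the gzipped archive copy from blob storage and decompress it on the fly.
    return json.loads(gzip.decompress(fetch_blob(block_number)))
```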
Technical Implementation
The foundation of our new infrastructure was a carefully optimised NFS server. Here’s how we implemented it:
- NFS (Network File System): We deployed a reasonably specified NFS server (4 virtual cores, 16GB of RAM) with attached storage as the core of our short-term storage strategy. This setup bypassed the latency issues associated with blob storage, reducing read times from 100ms to under 1ms per full block read. Even for Bitcoin, where full blocks saved in JSON format can be quite large (up to 14MB), we achieved read times under 1ms (with slightly longer parsing times).
- Folder Structure and Data Organisation: To optimise file access, we organised blockchain data into a hierarchical folder structure, with blocks divided into folders containing 10,000 blocks each. This structure allowed us to perform file listing and maintenance efficiently, even with large datasets.
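To make that layout concrete, here is a small sketch of how a block number could map onto the hierarchy. Only the 10,000-blocks-per-folder bucketing comes from the setup described above; the mount point and naming scheme are illustrative assumptions.

```python
from pathlib import Path

NFS_ROOT = Path("/mnt/nfs/ethereum/blocks")  # assumed mount point

def block_path(block_number: int) -> Path:
    """Bucket blocks into folders of 10,000 so no single directory grows unbounded."""
    bucket_start = (block_number // 10_000) * 10_000
    return NFS_ROOT / f"{bucket_start:09d}" / f"{block_number:09d}.json"

print(block_path(18_765_432))
# /mnt/nfs/ethereum/blocks/018760000/018765432.json
```

Zero-padding the names keeps a plain directory listing in block order, which makes range maintenance such as re-syncing or pruning straightforward.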
To further enhance performance, we implemented several key optimisations on our NFS server:
- File System Choice: We selected ext4 as our file system, optimising it with specific parameters to handle a large number of files and directories efficiently. This included setting a block size of 4096 and an inode ratio of 8192, which gave us over 2 billion inodes, more than enough to manage Ethereum’s 20 million blocks (see the quick check after this list). We can even index by transaction (one file per transaction) if we adjust the inode ratio to 4096 on our 16TB attached drive.
- Directory Structure: We structured the directories to avoid the limitations file systems typically hit when handling very large numbers of files, enabling ext4’s large_dir feature by running "tune2fs -O large_dir /dev/{partition}". By splitting blockchain data into directories of 10,000 blocks each, we maintained manageable directory sizes that facilitated fast access and scalability.
- Cost and Performance Balance: This infrastructure setup, including the NFS server on Azure and the 16TB attached drive, costs around $800 per month. This is significantly more cost-effective than constant reading and unzipping blob files or over-provisioning other resources, which would have driven costs into the thousands per month. More importantly, it allowed us to run multiple DEX catchups in parallel off the same disk, maximising efficiency.
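The inode figures above follow directly from the ext4 parameters. A quick check, treating the 16TB drive as 16 TiB of formatted capacity (an approximation, since ext4 reserves some space for metadata):

```python
# ext4 allocates roughly one inode per `inode_ratio` bytes of capacity (mkfs.ext4 -i).
capacity_bytes = 16 * 2**40  # the 16TB attached drive, treated as 16 TiB here

print(f"{capacity_bytes // 8192:,}")  # 2,147,483,648 -> over 2 billion files at ratio 8192
print(f"{capacity_bytes // 4096:,}")  # 4,294,967,296 -> headroom for one file per transaction at ratio 4096
```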
The shift to this hybrid storage model delivered transformative results:
- Dramatic Speed Increase: With the NFS-based short-term storage, we drastically reduced block processing times. We now read and parse 15,000 blocks per minute, cutting the time required to process Ethereum’s entire blockchain history to under 24 hours (see the quick check after this list). This means we can have a new DEX live and contributing to price discovery in under a week for Ethereum: 2 days of research, 1–2 days of testing and writing the integration code, and 1 day to catch up.
- Cost Efficiency: The combination of gzipped blob storage for archival and unzipped NFS storage for real-time access allowed us to maintain a lean operational cost structure, resulting in significant savings without compromising performance.
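Running the same kind of back-of-the-envelope check as before, this time with the new read-and-parse throughput, shows where the sub-24-hour figure comes from:

```python
blocks = 20_500_000            # Ethereum blocks, as above
blocks_per_minute = 15_000     # read-and-parse throughput on the NFS tier

print(f"{blocks / blocks_per_minute / 60:.1f} hours")  # ~22.8 hours for a full sequential pass
```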
Our journey from relying heavily on blob storage to adopting a more sophisticated, tiered storage strategy has been a game-changer for our on-chain data processing infrastructure. By making blockchain data readily accessible in an unzipped, structured format on an NFS server, we’ve achieved the performance necessary to handle the vast scale of blockchain data. This approach not only meets our current needs but also lays a scalable foundation for future growth in on-chain data processing.
In the next post of this series, we’ll explore the specific technical improvements that make this infrastructure so effective and how it is poised to support even more sophisticated on-chain data processing tasks.
If you’re interested in learning more about CCData’s market-leading data solutions and indices, please contact us directly.