
On-Chain Data Series I: Ingesting Blockchain Data – The Backbone of On-Chain Intelligence


  • February 15, 2024
  • Vlad Cealicu

Welcome to our latest blog series on blockchain data analytics and management. This series aims to shed light on the complexities involved in managing and utilising blockchain data, from the extraction of raw data directly from the blockchain to its transformation into actionable insights for diverse applications.

This series is broken into three parts:

  • Part One: Ingesting Blockchain Data – The Backbone of On-Chain Intelligence
  • Part Two: Streamlining Blockchain Data Output and Distribution
  • Part Three: Unlocking DeFi's Potential: CCData's Uniswap V3 Integration Explained

This series is designed for anyone curious about how blockchain data works and its impact on the digital asset space. Whether you're new to blockchain or looking to deepen your knowledge, we invite you to join us as we explore the crucial role of data in unlocking the potential of blockchain technology.

The first part of this series, Ingesting Blockchain Data – The Backbone of On-Chain Intelligence, explores the basics of data ingestion, the first step in understanding blockchain activities. We'll show you how we connect to blockchain networks, process real-time data, and ensure that every piece of information is accurate and reliable.

In the digital asset sector, data ingestion plays a key role in driving the information we obtain from on-chain activities. At CCData, our proficiency lies in capturing this real-time data with precision and reliability, ensuring that each transaction and block is accounted for and processed accurately. In this blog post, we explore the mechanics of our data ingestion process, which is the foundation for the advanced analytics and DeFi integrations that follow.


Establishing Robust Connections

To collect blockchain data, we must first establish secure and reliable connections with blockchain nodes. This task involves more than simply maintaining a continuous link to the node; it also requires ensuring the robustness and resilience of these connections.

Our multi-source node ingestion system is designed to connect to multiple blockchain data sources. For Ethereum, the primary nodes we operate are Nethermind and Geth, which we augment with external RPC providers such as QuickNode. This multi-node, multi-source strategy is crucial for two reasons:

  • Redundancy: In the event of a node failure, our system can switch to alternative sources without interrupting the data stream.
  • Verification: Data from multiple sources can be cross-verified to ensure accuracy and consistency.
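To make the verification idea concrete, here is a minimal, self-contained sketch (not our production code) of cross-checking a block's hash across several JSON-RPC endpoints before accepting it for ingestion. The endpoint URLs and helper names are illustrative.

// Illustrative sketch: cross-verify one block across multiple JSON-RPC endpoints.
const fetchBlockByNumber = async (rpcUrl, blockNumber) => {
  const response = await fetch(rpcUrl, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      jsonrpc: '2.0',
      id: 1,
      method: 'eth_getBlockByNumber',
      params: ['0x' + blockNumber.toString(16), false],
    }),
  });
  const { result } = await response.json();
  return result; // contains hash, parentHash, transactions, etc.
};

const crossVerifyBlock = async (rpcUrls, blockNumber) => {
  const blocks = await Promise.all(rpcUrls.map((url) => fetchBlockByNumber(url, blockNumber)));
  const referenceHash = blocks[0] && blocks[0].hash;
  if (!blocks.every((block) => block && block.hash === referenceHash)) {
    // Sources disagree: flag the block for re-fetching rather than ingesting bad data.
    return { err: new Error(`hash mismatch at block ${blockNumber}`), returnData: null };
  }
  return { err: null, returnData: blocks[0] };
};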

We establish node connections using a combination of polling and subscription-based methods. Polling allows us to actively query nodes for new blocks at regular intervals, while subscriptions use WebSocket connections to receive new data as soon as it's broadcasted by the node.
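As an illustration of these two connection styles, the sketch below shows a bare-bones WebSocket newHeads subscription alongside a simple eth_blockNumber poll over raw JSON-RPC. The endpoint URLs are placeholders, and our production inputs are considerably more involved.

// Illustrative sketch of the two connection styles; endpoint URLs are placeholders.
const WebSocket = require('ws');

const WS_ENDPOINT = 'wss://example-node/ws';      // placeholder
const HTTP_ENDPOINT = 'https://example-node/rpc'; // placeholder

// Subscription: ask the node to push new block headers over WebSocket.
const subscribeToNewHeads = (onNewHead) => {
  const ws = new WebSocket(WS_ENDPOINT);
  ws.on('open', () => {
    ws.send(JSON.stringify({ jsonrpc: '2.0', id: 1, method: 'eth_subscribe', params: ['newHeads'] }));
  });
  ws.on('message', (raw) => {
    const message = JSON.parse(raw);
    if (message.method === 'eth_subscription') {
      onNewHead(message.params.result); // block header pushed by the node
    }
  });
  return ws;
};

// Polling: query the latest block number at a fixed interval as a fallback.
const pollLatestBlockNumber = async () => {
  const response = await fetch(HTTP_ENDPOINT, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ jsonrpc: '2.0', id: 1, method: 'eth_blockNumber', params: [] }),
  });
  const { result } = await response.json();
  return parseInt(result, 16);
};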

Ensuring a Steady Stream: Real-Time Data Ingestion

To handle the real-time nature of blockchain data, we've implemented a streamlined ingestion process that captures blocks as they are propagated on the network.

Our input system is tasked with the initial reception of data. It's built to process high-volume requests efficiently, ensuring minimal latency between block creation and data capture.

We run one input per data source: for Ethereum, that means one input for our Nethermind node, one for our Geth node, and one for the QuickNode RPC endpoint.

Upon receipt, the data undergoes preliminary validation to ensure structural integrity. This includes checks for data completeness and format correctness before it's passed onto the queuing system.
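The snippet below sketches the kind of structural checks this involves; the field list and rules are illustrative rather than an exact description of our validation layer.

// Hedged sketch of preliminary structural validation for an incoming block payload.
const validateBlockPayload = (block) => {
  if (!block || typeof block !== 'object') return { valid: false, reason: 'empty payload' };
  // Completeness: the fields downstream processing depends on must be present.
  const requiredFields = ['number', 'hash', 'parentHash', 'timestamp', 'transactions'];
  for (const field of requiredFields) {
    if (block[field] === undefined || block[field] === null) {
      return { valid: false, reason: `missing field: ${field}` };
    }
  }
  // Format correctness: hashes are 32-byte hex strings, quantities are hex encoded.
  if (!/^0x[0-9a-fA-F]{64}$/.test(block.hash)) return { valid: false, reason: 'malformed hash' };
  if (!/^0x[0-9a-fA-F]+$/.test(block.number)) return { valid: false, reason: 'malformed number' };
  if (!Array.isArray(block.transactions)) return { valid: false, reason: 'transactions not an array' };
  return { valid: true };
};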

We use Redis for inter-process communication and for state storage due to its exceptional performance characteristics as an in-memory data store, which are ideal for handling the velocity and volume of incoming blockchain data.

We employ Redis lists, utilising LPUSH and BRPOPLPUSH commands for managing our data queues. This allows us to maintain a FIFO (First-In-First-Out) structure, which is essential for preserving the chronological order of the blockchain data.
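A simplified sketch of this queue pattern using the ioredis client is shown below; the key names and payload format are illustrative.

// Simplified sketch of the LPUSH / BRPOPLPUSH queue pattern (ioredis); keys are illustrative.
const Redis = require('ioredis');

const producer = new Redis();
const consumer = new Redis(); // blocking reads get their own connection

// Producer side: push the newly ingested block onto the head of the list.
const enqueueBlock = (providerKey, block) =>
  producer.lpush(`blocks:pending:${providerKey}`, JSON.stringify(block));

// Consumer side: atomically pop from the tail (preserving FIFO order) and park
// the item on a processing list so it is not lost if the worker crashes mid-processing.
const dequeueBlock = async (providerKey) => {
  const raw = await consumer.brpoplpush(
    `blocks:pending:${providerKey}`,
    `blocks:processing:${providerKey}`,
    0 // block indefinitely until an item arrives
  );
  return JSON.parse(raw);
};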

The integrity of data within the queues is paramount. To ensure this, we implement a combination of Redis transactions and hash sets. Transactions are used to execute a sequence of commands atomically, while hash sets allow us to efficiently manage block metadata and track the last processed block number.
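The following sketch shows how such an atomic update might look with ioredis; the key layout and field names are illustrative, not our exact production schema.

// Illustrative sketch of an atomic metadata update via a Redis transaction (MULTI/EXEC).
const Redis = require('ioredis');
const redis = new Redis();

const commitBlockState = async (providerKey, block) => {
  const results = await redis
    .multi()
    // Hash set holding per-block metadata, keyed by block number.
    .hset(`block:meta:${providerKey}:${block.number}`,
      'hash', block.hash,
      'parentHash', block.parentHash,
      'timestamp', block.timestamp)
    // Track the last processed block number for this provider.
    .hset(`provider:state:${providerKey}`, 'lastProcessedBlock', block.number)
    .exec();
  // exec() returns [err, result] pairs; surface the first error, if any.
  const firstError = results.find(([err]) => err);
  return firstError ? firstError[0] : null;
};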

Given the performance-critical nature of the system, we continuously monitor and tune the Redis instance. This includes optimising memory usage and managing data persistence to balance speed with reliability.

The Complexity of Reorganisations

Blockchain reorganisations, or reorgs, occur when the chain temporarily diverges into two competing paths because different block producers add conflicting blocks at the same height. Handling reorgs is crucial, as they can lead to inconsistent data if not managed properly. The excerpt below shows the core of our reorg handling: a recursive re-processing routine and the point in our block ingestion flow where it is invoked.

// Re-processes a block replaced by a reorg, then walks backwards through its
// ancestors until the stored parent hash matches the chain again.
const processBlockReorgRecursively = async (redisClient, blockingRedisClient, providerKey, blockNumberToReorg, maxBlockchainReorg) => {
  // Re-fetch and re-process the block, overwriting the stale data we hold for it.
  const { err: processBlockErr, returnData: blockData } = await processAndUpdateBlockData(blockingRedisClient, providerKey, blockNumberToReorg, true);
  if (processBlockErr) {
    statsAdvanced.incr('processreorg:error');
    return processBlockErr;
  }
  statsAdvanced.incr('processreorg:success');
  // Compare the re-processed block's parent hash against the hash we have stored
  // for its parent, to decide whether the reorg reaches further back.
  const parentBlockHashStatusResponse = await blockchainCommonModule.checkParentBlockHashStatus(redisClient, providerKey, blockData.NUMBER, blockData.METADATA.hash, blockData.METADATA.parentHash, maxBlockchainReorg);
  if (parentBlockHashStatusResponse.err) {
    statsAdvanced.incr('checkparenthash:redis:multi:error');
    return parentBlockHashStatusResponse.err;
  }
  statsAdvanced.incr('checkparenthash:redis:multi:success');
  // If the parent also changed, recurse one block further back (bounded by maxBlockchainReorg).
  if (parentBlockHashStatusResponse.hasParentHashChanged) {
    return processBlockReorgRecursively(redisClient, blockingRedisClient, providerKey, parentBlockHashStatusResponse.parentBlockNumber, maxBlockchainReorg);
  }
  // Parent hash matches our records again: the reorg has been fully resolved.
  return null;
};

const getBlockData = async () => {
  // other code
  // ...
  // Inside the per-block processing loop: after ingesting a new block, check
  // whether its parent hash still matches the block we previously stored.
  const parentBlockHashStatusResponse = await blockchainCommonModule.checkParentBlockHashStatus(localRedisClient, providerKey, blockData.NUMBER, blockData.METADATA.hash, blockData.METADATA.parentHash, scriptParams.MAX_BLOCKCHAIN_REORG);
  if (parentBlockHashStatusResponse.err) {
    statsAdvanced.incr('checkparenthash:redis:multi:error');
    return { err: parentBlockHashStatusResponse.err };
  }
  statsAdvanced.incr('checkparenthash:redis:multi:success');
  if (!parentBlockHashStatusResponse.hasParentHashChanged) {
    // No reorg detected: move on to the next block in the loop.
    continue;
  }
  // The parent hash changed, so a reorg occurred: reprocess the affected
  // ancestors recursively before carrying on.
  const processBlockReorgErr = await processBlockReorgRecursively(localRedisClient, localBlockingRedisClient, providerKey, parentBlockHashStatusResponse.parentBlockNumber, scriptParams.MAX_BLOCKCHAIN_REORG);
  if (processBlockReorgErr) {
    return { err: processBlockReorgErr };
  }
  // ...
}

Our processBlockReorgRecursively function stands at the heart of our reorg handling strategy. This recursive method ensures that any changes in the blockchain due to reorgs are tracked and managed efficiently. In the case of a reorg, the affected block and all subsequent blocks up to the latest correct block are reprocessed to ensure data integrity.

To manage reorgs effectively, we employ the following steps:

  • Detect a Reorg: By monitoring the parent hash of the latest block against our records.
  • Reprocess Affected Blocks: Trigger the processBlockReorgRecursively function to handle the reorg.
  • Validate Data Consistency: Compare transaction hashes between metadata and block receipts to ensure the integrity of the data.

This approach to handling reorgs is a crucial part of our blockchain data ingestion process, allowing us to have accurate and reliable data down the line.
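To illustrate the consistency check in the last step above, here is a hedged sketch that compares the transaction hashes listed in a block's metadata against its receipts; the field names are assumptions rather than a mirror of our internal block structure.

// Illustrative sketch: verify that every transaction in the block metadata has a matching receipt.
const validateTransactionConsistency = (blockMetadata, receipts) => {
  const metadataTxHashes = blockMetadata.transactions.map((tx) => tx.hash);
  const receiptTxHashes = receipts.map((receipt) => receipt.transactionHash);
  if (metadataTxHashes.length !== receiptTxHashes.length) {
    return { consistent: false, reason: 'transaction count mismatch' };
  }
  const receiptSet = new Set(receiptTxHashes);
  const missing = metadataTxHashes.filter((hash) => !receiptSet.has(hash));
  if (missing.length > 0) {
    return { consistent: false, reason: `missing receipts for ${missing.length} transactions` };
  }
  return { consistent: true };
};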

Looking Forward

The process of ingesting blockchain data is an intricate yet understated element of the data lifecycle. It’s the foundation upon which all further data processing, analysis, and integration are built. By ensuring that this first step is executed flawlessly, we lay the groundwork for the advanced on-chain analytics and DeFi integrations that empower our clients to make informed decisions.

In this blog post, we've taken a deep dive into the technical intricacies of blockchain data ingestion, emphasising the importance of handling reorgs with precision. As we continue to evolve our processes and technologies, our commitment to data accuracy and integrity remains unwavering, providing our clients with the most reliable and actionable on-chain data available.
