Mainnet Phase-2 Engineering Update

An Update on Progress

As we approach Phase-2 launch, we want to provide our community with greater visibility into the technical challenges our engineering teams are actively addressing and the progress being made. This update offers a detailed look at both the protocol development work and the infrastructure scaling efforts underway, including the issues we’ve encountered during recent stress testing and how we’re systematically resolving them. Our commitment to transparency means sharing not just our successes, but also the technical hurdles we’re working through as we prepare for mainnet launch.

TLDR

Protocol Progress: The team is in the final stretch toward Phase 2 launch, with significant progress on critical requirements. The “Crossing the Narrow Sea” contest successfully stress-tested our XDM (cross-domain messaging) system, revealing and helping us fix a serious double-minting bug along with other performance issues. We’ve completed essential benchmarking work and implemented numerous stability improvements. The audit process is advancing well, with XDM review nearly complete and only two remaining categories: sudo on domains and domain snap sync functionality.

Astral Challenges & Solutions: Our block explorer is experiencing growing pains as network activity increases, with 32 million extrinsics and 145 million events now on Taurus. We’ve identified and are systematically addressing performance bottlenecks, achieving 5x indexing speed improvements through database optimizations, Redis queue architecture, and eliminating unnecessary API calls. Additional work includes resolving SubQuery incompatibilities and fixing staking indexer accuracy issues.

Bottom Line: Both teams are making progress toward Phase 2 readiness while proactively addressing scalability challenges that come with increased network usage.

Protocol

The protocol team has been focused on finishing Phase 2 requirements. These currently fall into two areas:

  • benchmarking extrinsic weights
  • addressing XDM (cross-domain messaging) issues that were surfaced during the recent “Crossing the Narrow Sea” contest

The XDM testing was very successful and surfaced many issues that became most apparent under heavy load. These varied in criticality, ranging from documentation gaps to a potentially very serious double-minting bug.

Many in our community are closely following our P1 audit issues on GitHub and have noted that at times this list has been somewhat dynamic. I want to explain how our audit is being conducted and how we are approaching launch readiness. Additionally, I want our community to understand that our team is extremely focused on fully launching the protocol. We are all stakeholders and want to see a successful phase two launch as soon as possible.

In order to minimize the time between code completion and launching with an audited codebase, SR Labs has been conducting an ongoing audit as we merge code that requires auditing. When a pull request that requires auditing is merged, it is tagged with an audit priority tag (P1-P3). This helps SR Labs prioritize what code they should audit next. These tags may change over time as certain features or bug fixes become higher priority, or as features are de-prioritized, such as the permissioned EVM or, potentially, domain snap sync. When we conduct testing, as we did recently with XDM, any serious issue found must be addressed and audited prior to network launch. This is why several new items have been added to the queue recently.

Recent pull requests

The recent activity in our GitHub repository reflects the focused efforts described above, with pull requests clustered around key Phase 2 priorities.

XDM Cross-Domain Messaging Improvements

Following the “Crossing the Narrow Sea” contest, we’ve implemented several critical XDM fixes. The aforementioned double-minting bug was addressed in a series of PRs: PR #3514 ensures proper fund-minting sequencing, PR #3522 captures fallible routes to properly mark failed transfers, and PR #3534 ensures no partial state is left behind during XDM processing. We’ve also optimized XDM processing: PR #3544 addresses the WASM runtime memory allocation issue many operators were experiencing, and PR #3545 adds minimum transfer amounts for cross-chain transactions.

Benchmarking and Weight Calibration

The benchmarking work marked as a Phase 2 requirement is progressing, with PR #3543 fixing broken pallet-domains benchmarks, PR #3542 adding benchmarks for the pallet-domains extension, and PR #3541 implementing XDM extension benchmarks. These are essential for accurate extrinsic weight calculations before the Phase 2 launch. Now that the benchmarks are written, we will run them on a reference machine and update extrinsic weights accordingly.
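
To make the weight calibration step concrete, here is a minimal, hypothetical sketch of the underlying idea: time an operation repeatedly on the reference machine, take a robust statistic, and convert it into a weight figure (Substrate denominates weights in picoseconds of reference execution time). The real benchmarks use Substrate’s frame_benchmarking framework in Rust; this TypeScript snippet is illustrative only.

```typescript
// Illustrative sketch of weight calibration; not the actual pallet-domains
// benchmarks, which use Substrate's frame_benchmarking framework in Rust.
function benchmark(op: () => void, runs = 100): bigint {
  const samples: bigint[] = [];
  for (let i = 0; i < runs; i++) {
    const start = process.hrtime.bigint(); // nanoseconds
    op();
    samples.push(process.hrtime.bigint() - start);
  }
  samples.sort((a, b) => (a < b ? -1 : a > b ? 1 : 0));
  return samples[Math.floor(samples.length / 2)]; // median, in ns
}

// Substrate weights are denominated in picoseconds of reference execution
// time, so a median nanosecond timing maps to weight by multiplying by 1000.
const medianNs = benchmark(() => {
  let acc = 0; // stand-in for an extrinsic's execution path
  for (let i = 0; i < 1_000_000; i++) acc += i;
});
console.log(`calibrated ref_time (ps): ${medianNs * 1000n}`);
```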

Network Stability and DSN Improvements

Several fixes have been made to improve DSN performance. PR #3525 fixes multi-piece object retrieval bugs, PR #3523 improves segment downloading performance during DSN sync, and PR #3519 corrects an off-by-one error in snap sync reconstruction.

Current Audit Status

The auditors have been heavily focused on XDM and expect to complete their review of the recent PRs imminently. This leaves two categories of audit that we have marked as P1: sudo on domains and domain snap sync.

Sudo on Domain

On Substrate-based chains, Sudo is used to execute critical operations related to consensus, such as runtime upgrades and balance changes. This is currently how Sudo works on our consensus chain.

This changes when it comes to domains, since they derive their security from consensus. We cannot use Sudo directly on domains due to security concerns: any critical executions, such as runtime upgrades, should originate from consensus. Therefore, we have developed a custom Sudo pallet for domains. With this pallet, any Sudo calls on domains are sent from consensus and executed on domains. Since this is a critical piece of code for domains, we want the audit to provide clear approval before we deploy it as part of Phase 2.
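
As a rough model of that flow (the real implementation is a Substrate pallet in Rust; every name below is hypothetical), the domain side simply refuses any sudo call whose origin is not the consensus chain:

```typescript
// Hypothetical model of domain sudo: the domain never honors a local sudo
// origin; it only executes calls delivered in a message from consensus.
type Origin = { kind: "consensus" } | { kind: "domain"; account: string };

interface SudoMessage {
  origin: Origin;
  call: string; // encoded runtime call, e.g. a runtime upgrade
}

class DomainSudo {
  execute(msg: SudoMessage): void {
    // Critical invariant: reject anything not originating from consensus.
    if (msg.origin.kind !== "consensus") {
      throw new Error("BadOrigin: domain sudo must originate from consensus");
    }
    this.dispatch(msg.call);
  }

  private dispatch(call: string): void {
    console.log(`dispatching sudo call on domain: ${call}`);
  }
}

// Consensus queues the call; the domain executes it on receipt.
new DomainSudo().execute({
  origin: { kind: "consensus" },
  call: "set_code(new_runtime_wasm)",
});
```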

Domain Snap Sync

Currently, operators must sync their nodes from consensus chain genesis to operate a domain, which creates significant friction, as this sync can take many hours to days. We have implemented snap sync for domains, similar to our existing consensus chain snap sync functionality. This allows operators to start fresh nodes and sync to the tip of the domain chain in hours rather than potentially multiple days. While testing this feature, we found several issues that needed to be addressed, which added to the audit queue. While not mandatory for the Phase 2 launch, including this feature would significantly benefit the domain operator experience. However, if the audit takes too long, we will launch Phase 2 without this feature.

Issues Under Investigation

We are investigating reports of unreliable piece downloads on the network. This has been seen during plotting/re-plotting, syncing, and object retrieval from the Auto Drive file gateway on the Taurus network. We are looking into whether this is a recent regression or an existing issue surfacing under current usage patterns.

Astral

Astral is currently facing several critical issues that have been exacerbated by ongoing XDM testing. As the protocol team fixed bottlenecks, the increased load surfaced new issues within Astral. Additionally, recent fixes to the staking indexer have yet to make it to production due to the bottleneck of slow re-indexing. We are systematically addressing these issues and re-architecting our indexing strategy. This will include removing some features that appear to have little usage but put excessive strain on indexing and/or queries. Specific issues we have been facing and solutions to these issues are detailed below.

Astral Issues

Slow Indexing

During periods of high activity, indexing has been extremely slow, at times barely able to keep up with block production. This creates a frustrating cycle: fixes need to be applied (see Staking Indexer Showing Incorrect Values below), but the indexer can take many days (or weeks) to catch up. Over the last couple of weeks we have looked at the indexing process from the ground up and found many ways to improve indexing speed significantly (5x on average). Improvements include:

  • Eliminating unnecessary API calls during indexing, including account history fetching and space pledge calculations that were major bottlenecks. PR #1594 in Astral addresses this issue.
  • Implementing a Redis queue architecture to move account history processing to asynchronous workers, reducing main indexer load (see the sketch after this list).
  • Optimizing the database schema by removing string-based sorting and converting to appropriate numeric types.
  • Reducing database overhead by eliminating unnecessary indexes that each added more than 20% overhead.
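
As an illustration of the queue-based design, here is a minimal sketch assuming BullMQ on top of Redis; whether Astral uses BullMQ specifically is our assumption, and the queue and job names are hypothetical:

```typescript
// Sketch of moving account-history work off the indexer's hot path.
import { Queue, Worker } from "bullmq";

const connection = { host: "localhost", port: 6379 };
const accountHistoryQueue = new Queue("account-history", { connection });

// Indexer hot path: enqueue and return immediately instead of fetching
// account history inline.
export async function onAccountTouched(address: string, blockHeight: number) {
  await accountHistoryQueue.add("snapshot", { address, blockHeight });
}

// A separate worker process drains the queue off the critical path.
new Worker(
  "account-history",
  async (job) => {
    const { address, blockHeight } = job.data;
    // ...fetch history for `address` as of `blockHeight` and persist it
  },
  { connection }
);
```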

Autonomys “finalization” Incompatible with SubQuery

Autonomys’ usage of finalization within the polkadot-sdk differs from most chains, as we are a longest-chain protocol with probabilistic finality. Our finalization flag is based on archived segments and can sit tens of thousands of blocks behind the chain head. SubQuery stores every intermediate header in a single entry of the _metadata table, so when the distance between the last finalized block and the chain head becomes excessively long, the bloated metadata blob inflates memory consumption and lengthens cold-start times, because the indexer must deserialize and verify the entire set before resuming work. The fix is to apply a custom “finalization” threshold rather than relying on the block finalization flag, which required forking SubQuery to handle the custom logic. PR #1 in our forked SubQuery and PR #1594 in Astral resolve this issue.
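
The thresholding idea can be sketched in a few lines (the actual logic lives in the forked SubQuery; the threshold value below is illustrative, not the real one): treat any block deeper than a fixed depth as final, so the header set SubQuery tracks stays bounded regardless of how far the chain’s own finalization flag lags.

```typescript
// Illustrative sketch of a depth-based finality threshold; the real fix is
// in our SubQuery fork (PR #1) and the value below is not the actual one.
const FINALIZATION_THRESHOLD = 100; // blocks

function effectiveFinalizedHeight(
  bestHeight: number,
  reportedFinalizedHeight: number
): number {
  const byDepth = Math.max(0, bestHeight - FINALIZATION_THRESHOLD);
  // Never regress below what the chain itself reports as final.
  return Math.max(reportedFinalizedHeight, byDepth);
}

// e.g. head at 3_200_000 while the archived-segment flag sits far behind:
console.log(effectiveFinalizedHeight(3_200_000, 3_140_000)); // 3_199_900
```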

Domain Segment Events Incompatible with SubQuery

A second area where we have incompatibility with SubQuery is in how system events are handled on our domains. To address a significant performance bottleneck in our domain execution, we introduced the concept of event segments. While this resolved our performance issue, it also created incompatibility with some polkadot-sdk ecosystem tooling such as the polkadot-js explorer and SubQuery.

We recently added support in our SubQuery fork to properly handle EventSegments, which now allows us to correctly index XDM transactions on the Taurus Auto EVM domain. The lack of this capability had been blocking our ability to calculate a proper XDM tally for Crossing the Narrow Sea. With this issue resolved, we can now implement Auto EVM indexing of XDM transactions.
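
Conceptually, handling EventSegments amounts to stitching the per-segment event lists back into the single ordered stream that standard tooling expects. The types and names below are hypothetical, purely to illustrate that flattening step:

```typescript
// Hypothetical shapes: domain system events arrive in segments rather than
// one monolithic System::Events vector.
interface DomainEvent {
  section: string;
  method: string;
  data: unknown[];
}

interface EventSegment {
  index: number; // segment order within the block
  events: DomainEvent[];
}

// Stitch segments back into a single ordered event list so downstream
// handlers (e.g. XDM transfer indexing) can consume them as usual.
function flattenEventSegments(segments: EventSegment[]): DomainEvent[] {
  return segments
    .slice()
    .sort((a, b) => a.index - b.index) // preserve in-block ordering
    .flatMap((segment) => segment.events);
}
```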

Slow Queries as Extrinsics/Event Counts Grow

As data on the networks has grown significantly, with 32 million extrinsics and over 145 million events on Taurus, maintaining filters, sorting, search functionality, and aggregate counts for pagination has become increasingly difficult and slow. While we are working toward more permanent solutions, several temporary measures are being applied to stabilize performance. We’ve implemented a 500k record limit and split queries to improve response times and prevent timeouts. Certain features, such as full page download, are still slow but will succeed if given time. Additionally, we’ve added indexes for aggregate count calculations to speed up pagination and split the ExtrinsicsByAccountId query to improve performance for account-specific data retrieval.
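
One common way to implement such a record cap, shown here as a sketch only (we are assuming a Postgres-style bounded subquery; the table and column names are not Astral’s actual schema), is to count at most the cap plus one row, so pagination never pays for a full-table scan:

```typescript
// Capped count: exact up to the limit, "500k+" beyond it.
import { Pool } from "pg";

const pool = new Pool(); // connection settings from PG* env vars
const RECORD_LIMIT = 500_000;

export async function cappedExtrinsicCount(accountId: string): Promise<number> {
  const { rows } = await pool.query(
    `SELECT count(*)::int AS n
       FROM (SELECT 1
               FROM extrinsics
              WHERE signer = $1
              LIMIT $2) bounded`,
    [accountId, RECORD_LIMIT + 1]
  );
  return rows[0].n; // n === RECORD_LIMIT + 1 means "more than the cap"
}
```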

Staking Indexer Showing Incorrect Values

The staking indexer has been displaying incorrect values due to improper handling of storage fees during deposit processing. The issue occurred when processing new deposits through the OperatorNominated event flow, where we were incorrectly applying the same storage fee deduction logic used for OperatorRegistered events. This resulted in double-deducting storage fees (removing fees from amounts that already had fees deducted), leading to lower estimated_shares values and inaccurate staking data across the platform. The core issue has been resolved in PR #1592, though deployment is pending the completion of re-indexing to ensure data accuracy.
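
To illustrate the bug (with hypothetical names and an illustrative fee ratio, not Astral’s actual code): the registration path deducts the storage fee from a gross amount, while a nomination amount is already net, so deducting again understates shares.

```typescript
// Illustrative only: the fee ratio and names are hypothetical.
const STORAGE_FEE_RATIO = 0.2;

type DepositEvent = "OperatorRegistered" | "OperatorNominated";

function stakeFromDeposit(amount: number, event: DepositEvent): number {
  switch (event) {
    case "OperatorRegistered":
      return amount * (1 - STORAGE_FEE_RATIO); // gross amount: deduct fee
    case "OperatorNominated":
      return amount; // already net of the fee: deducting again was the bug
  }
}

// The buggy code applied the registration path to nominations, so 100 net
// tokens were recorded as 80, understating estimated_shares by ~20%.
console.log(stakeFromDeposit(100, "OperatorNominated")); // 100 (fixed)
```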

Summary

The protocol team continues focused development toward the Phase 2 launch, addressing critical XDM issues discovered during the “Crossing the Narrow Sea” contest and completing benchmarking requirements. Recent fixes include resolving a serious double-minting bug and implementing performance optimizations. The audit process is progressing, with the XDM review nearing completion, leaving sudo on domains and domain snap sync as the remaining audit categories. Meanwhile, Astral faces performance challenges exacerbated by increased network activity, with systematic work underway, including 5x indexing speed gains and database optimizations, to handle the growing data volume of 32 million extrinsics and 145 million events on Taurus.

