Bluzelle Bi-Weekly Sprint Update: Strengthening Infrastructure and Network Resilience — February 4, 2025
As we conclude our latest two-week sprint, our focus remains on enhancing Bluzelle’s infrastructure and ensuring robust network operations. This sprint brought significant improvements to our core systems, particularly in storage management and network connectivity. These enhancements reflect our commitment to building a resilient and efficient blockchain network.
Progress Made
Infrastructure Improvements
- Fixed NVMe Mounting Issue on Test Networks: Test networks were running out of disk space because logs grew unexpectedly large once the chain passed 10 million blocks. We resolved this by correctly mounting the logs on NVMe storage.
- Implemented a New Process for Address Book Updates: New nodes now receive the most up-to-date addrbook.json file, so they can peer quickly with existing network nodes, improving connectivity and stability.
- Snapshot Server Fixed: The server was not running correctly because of issues with the address book file. It now reliably provides updated address books, genesis exports (for internal backup), and full-chain block tarballs for quick sync.
- Mainnet Stability Restored: Brought the Big Dipper nodes on mainnet back online, restoring consistent block explorer functionality.
- Cosmos SDK Upgrade Review: Conducted a thorough code review of the planned upgrade to Cosmos SDK 0.47.15.
- Log File Management Overhaul: Updated and refactored our infrastructure code so that log files are consistently mounted on NVMe storage, preventing future disk-space issues.
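To illustrate the address book item above, here is a minimal Python sketch of how a new node might install a freshly downloaded addrbook.json before starting. It assumes the file follows the CometBFT address book layout (a top-level "key" plus an "addrs" list of known peers) and that it is fetched from a snapshot-server URL; the URL, the function names, and the destination path are hypothetical, and this is not the process Bluzelle actually runs. The file is validated first and then swapped in atomically so the node never reads a half-written address book.

```python
import json
import os
import tempfile
import urllib.request
from pathlib import Path


def install_addrbook(data: bytes, dest: Path) -> int:
    """Validate a downloaded addrbook.json and atomically replace the node's copy.

    Assumes the CometBFT address book layout: {"key": ..., "addrs": [...]}.
    Returns the number of peer addresses in the installed file.
    """
    book = json.loads(data)
    addrs = book.get("addrs") or []
    if not addrs:
        # An empty book would leave a new node with no peers to dial.
        raise ValueError("address book has no peer addresses; refusing to install")
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file in the same directory, then rename it over the
    # target so the swap is atomic on the same filesystem.
    fd, tmp = tempfile.mkstemp(dir=dest.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp, dest)
    except BaseException:
        os.unlink(tmp)
        raise
    return len(addrs)


def fetch_addrbook(url: str, timeout: float = 10.0) -> bytes:
    """Download the address book from a (hypothetical) snapshot-server URL."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()
```

A provisioning script could call `install_addrbook(fetch_addrbook(url), node_home / "config" / "addrbook.json")` just before starting the node; refusing an empty book keeps a bad download from overwriting a working one.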
Current Work in Progress
- Repairing Mainnet Nodes: Several mainnet nodes went down during AWS maintenance, and we are actively restoring them.
- NGINX Container Issue Resolution: Addressing issues in NGINX containers that are affecting network builds.
- Restoring Downed PingPub Nodes: Working on bringing the PingPub nodes on mainnet back online.
- Documentation for Network Setup: Creating a step-by-step guide to streamline the process of setting up new networks.
- Health Checks for Nodes: Implementing automated health checks in our infrastructure code to monitor node performance and uptime.
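The automated health checks are still in progress, but one common approach for Cosmos SDK chains is to poll each node's CometBFT RPC `/status` endpoint. The minimal Python sketch below flags a node that is still catching up or whose latest block is older than a threshold. The endpoint URL (the default RPC port 26657), the 60-second threshold, and the function names are illustrative assumptions, not Bluzelle's implementation.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

# Assumption: the node exposes the CometBFT RPC on its default port.
STATUS_URL = "http://localhost:26657/status"
# Assumption: illustrative limit on how stale the latest block may be.
MAX_BLOCK_AGE = timedelta(seconds=60)


def parse_block_time(ts: str) -> datetime:
    """Parse a UTC timestamp such as '2025-02-04T12:00:00.123456789Z'.

    Fractional seconds (often nanosecond precision) are dropped, which is
    fine for a staleness check measured in seconds.
    """
    base = ts.rstrip("Z").split(".")[0]
    return datetime.strptime(base, "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)


def check_health(status: dict, now: datetime, max_age: timedelta = MAX_BLOCK_AGE) -> list[str]:
    """Return the problems found in a /status response (empty list means healthy)."""
    sync = status["result"]["sync_info"]
    problems = []
    if sync["catching_up"]:
        problems.append("node is still catching up")
    age = now - parse_block_time(sync["latest_block_time"])
    if age > max_age:
        height = int(sync["latest_block_height"])  # heights arrive as JSON strings
        problems.append(f"latest block {height} is {int(age.total_seconds())}s old")
    return problems


def fetch_status(url: str = STATUS_URL, timeout: float = 5.0) -> dict:
    """Fetch and decode a node's /status JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

A scheduler could call `check_health(fetch_status(url), datetime.now(timezone.utc))` for each node and raise an alert whenever the returned list is non-empty. Separating the fetch from the check keeps the decision logic easy to test without a live node.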
Next Steps
- Slack Notifications for AWS EC2 Maintenance: Setting up automated Slack alerts for AWS EC2 retirement notices to prevent unexpected mainnet node outages, such as those we experienced last December.
- Continue Work on Cosmos SDK Upgrade: Progressing towards implementing the software upgrade to Cosmos SDK 0.47.15.
- Enhancing Node Event Tracking: Improving infrastructure to track node events and statuses, so we can intervene proactively before nodes stop.
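The planned EC2 alerts could work roughly as in the minimal Python sketch below. It assumes the scheduled-event data has already been retrieved as the JSON output of `aws ec2 describe-instance-status --include-all-instances` (equivalently, the `DescribeInstanceStatus` API), turns each pending event such as `instance-retirement` into a message, and posts it to a Slack incoming webhook. The webhook URL, function names, and message format are hypothetical; treat this as a sketch, not the alerting we are building.

```python
import json
import urllib.request

# Scheduled-event codes that EC2 can report for an instance.
WATCHED_CODES = {
    "instance-retirement",
    "instance-stop",
    "instance-reboot",
    "system-reboot",
    "system-maintenance",
}


def build_alerts(describe_output: dict) -> list[str]:
    """Turn describe-instance-status JSON into one Slack message per pending event."""
    alerts = []
    for status in describe_output.get("InstanceStatuses", []):
        for event in status.get("Events", []):
            desc = event.get("Description", "")
            # Per the EC2 API docs, completed events are prefixed "[Completed]".
            if desc.startswith("[Completed]"):
                continue
            if event.get("Code") in WATCHED_CODES:
                alerts.append(
                    f":warning: {status['InstanceId']}: {event['Code']} "
                    f"not before {event.get('NotBefore', 'unknown')} ({desc})"
                )
    return alerts


def post_to_slack(webhook_url: str, text: str, timeout: float = 10.0) -> None:
    """Send a plain-text message to a Slack incoming webhook (URL is a placeholder)."""
    body = json.dumps({"text": text}).encode("utf-8")
    req = urllib.request.Request(
        webhook_url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout):
        pass
```

Run on a schedule, calling `post_to_slack` for each returned alert would surface retirement notices like the ones behind last December's outages before the maintenance window arrives.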
Looking Ahead
This sprint marks a significant step forward in our ongoing efforts to enhance Bluzelle’s infrastructure resilience. The improvements in storage management, node connectivity, and monitoring capabilities lay a strong foundation for our network’s continued growth. As we move forward with the Cosmos SDK upgrade and implement more automated monitoring solutions, we remain committed to providing a stable and efficient platform for our growing ecosystem.