The problem Red Trollys solves
Decentralized Computing and AI Model Training Platform
Overview
The platform uses decentralized computing to make AI model training efficient and cost-effective. By pooling underutilized computing resources from around the world, it addresses the high costs and security risks associated with centralized providers.
Use Cases
1. AI Model Training
- Description: Train large AI models on distributed GPU resources, reducing training time and cost (a minimal training sketch follows this list).
- Benefits: Enables parallel processing for faster training and optimized deep learning tasks.
2. Cost-Effective Computing
- Description: Resource owners rent out idle computing capacity in a shared-economy model.
- Benefits: Lowers operational costs, making high-performance computing accessible for small businesses.
3. Enhanced Security
- Description: Distributes data across multiple nodes to mitigate breach risks.
- Benefits: Improves data security and provides redundancy.
4. Lower Latency
- Description: Geographically distributed resources minimize latency for time-sensitive applications.
- Benefits: Enhances user experience with faster response times.
5. No-Code Development
- Description: A no-code interface for managing AI workflows.
- Benefits: Accessible for non-technical users, fostering collaboration.
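To make the first use case concrete, the sketch below shows a minimal data-parallel training loop written with stock PyTorch DistributedDataParallel rather than the platform's own API. The model, the data, and the rendezvous environment variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE) are placeholders standing in for whatever the platform would provision on each rented GPU node.

```python
# A minimal, generic PyTorch DistributedDataParallel loop. The model and data
# are placeholders, and the script assumes each node receives the usual
# rendezvous environment variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train():
    dist.init_process_group(backend="nccl")              # one process per GPU node
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(512, 10).cuda(local_rank)    # placeholder model
    model = DDP(model, device_ids=[local_rank])          # gradients sync across nodes
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()

    for _ in range(100):                                  # placeholder data loop
        x = torch.randn(32, 512, device=local_rank)
        y = torch.randint(0, 10, (32,), device=local_rank)
        opt.zero_grad()
        loss_fn(model(x), y).backward()                   # all-reduce happens here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    train()
```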
Challenges with Centralized Computing
- High costs, downtime, security risks, and scalability issues.
Advantages of Decentralized Computing
- Resilience against failures, improved scalability, cost efficiency, and enhanced privacy.
Quantitative Analysis
| Metric | Centralized | Decentralized |
| --- | --- | --- |
| Monthly Cost (USD) | $21.00 | $5.00 |
| Energy Consumption (kWh) | 600 | 1,200 |
| Data Breach Cost (avg) | $4.35M | Reduced by ~60% |
Challenges we ran into
Challenges and Solutions in Building Our Decentralized Computing Project
Introduction
Building our decentralized computing platform involved several significant challenges, particularly with initial architecture and integration.
Initial Challenge: Using Torch RPC
We started with Torch RPC (torch.distributed.rpc) to manage our distributed compute network; a rough sketch of that setup follows the list below. However, we faced issues with:
- Scaling Limitations: It struggled to scale effectively across a heterogeneous pool of resources.
- State Management Issues: Maintaining a consistent state across services proved difficult.
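For context, here is roughly the shape of the Torch RPC setup we started from. It is a sketch under illustrative assumptions, not our production code: the worker names, the remote function, and the localhost rendezvous are placeholders.

```python
# Illustrative Torch RPC layout, not production code: node names, the remote
# function, and the localhost rendezvous are placeholders. Note that init_rpc
# takes a fixed world_size, which is one reason a changing pool of
# heterogeneous nodes is awkward to accommodate.
import os
import torch
import torch.distributed.rpc as rpc

def run_training_step(batch):
    # Placeholder for work executed on a remote worker.
    return (batch * 2).sum()

def coordinator(world_size):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    rpc.init_rpc("coordinator", rank=0, world_size=world_size)

    # Fan work out to named workers and block on their results.
    futures = [
        rpc.rpc_async(f"worker{r}", run_training_step, args=(torch.ones(4),))
        for r in range(1, world_size)
    ]
    print("partial results:", [f.wait() for f in futures])

    rpc.shutdown()  # every node must reach shutdown; a dropped node stalls the group

def worker(rank, world_size):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size)
    rpc.shutdown()  # workers serve RPCs until shutdown is called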
Solution
To overcome these hurdles, we built a custom ML engine tailored to our needs, which improved both performance and reliability.
Inter-Service Communication (SFU)
Alongside the ML engine, we created a Selective Forwarding Unit (SFU) for efficient communication among services. However, state synchronization became a challenge.
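Before getting into the synchronization issues, here is a toy illustration of what selective forwarding buys us: each message is delivered only to the services subscribed to its topic instead of being broadcast to every node. The topics and in-process queues below are hypothetical stand-ins for the real service-to-service streams the SFU carries.

```python
# Toy, in-process illustration of selective forwarding: a message is delivered
# only to services subscribed to its topic, not broadcast to everyone. The
# topic names and asyncio queues are hypothetical stand-ins for the real
# service-to-service streams the SFU handles.
import asyncio
from collections import defaultdict

class ToySFU:
    def __init__(self):
        self.subscribers = defaultdict(set)   # topic -> set of subscriber queues

    def subscribe(self, topic):
        queue = asyncio.Queue()
        self.subscribers[topic].add(queue)
        return queue

    async def publish(self, topic, message):
        # Selective forwarding: only queues subscribed to this topic receive it.
        for queue in self.subscribers[topic]:
            await queue.put(message)

async def main():
    sfu = ToySFU()
    state_updates = sfu.subscribe("state-updates")

    await sfu.publish("state-updates", {"node": "gpu-worker-3", "status": "busy"})
    await sfu.publish("metrics", {"gpu_util": 0.87})   # no subscribers, so dropped

    print(await state_updates.get())

asyncio.run(main())
```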
Overcoming State Synchronization
- Centralized State Management: We implemented a single source of truth so every service reads and writes the same authoritative state (sketched after this list).
- Regular State Updates: Mechanisms were established for frequent state updates to ensure all nodes operated with the latest information.
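A minimal sketch of that pattern, with hypothetical class and field names: every write goes through one authoritative store that bumps a version counter, and each node refreshes its local copy on a regular interval, approximating the update mechanism described above.

```python
# Hypothetical names throughout: one authoritative StateStore versions every
# write, and each NodeView refreshes its local copy on a fixed interval.
import threading
import time

class StateStore:
    """Single source of truth: all writes go through here and bump a version."""

    def __init__(self):
        self._lock = threading.Lock()
        self._version = 0
        self._state = {}

    def update(self, key, value):
        with self._lock:
            self._state[key] = value
            self._version += 1

    def snapshot(self):
        with self._lock:
            return self._version, dict(self._state)

class NodeView:
    """A service-local copy of the state, refreshed on a regular schedule."""

    def __init__(self, store, interval=1.0):
        self.store = store
        self.interval = interval
        self.version, self.state = store.snapshot()

    def sync_loop(self):
        while True:
            version, state = self.store.snapshot()
            if version != self.version:        # only copy when something changed
                self.version, self.state = version, state
            time.sleep(self.interval)

store = StateStore()
node = NodeView(store)
threading.Thread(target=node.sync_loop, daemon=True).start()
store.update("job-42", {"stage": "training", "progress": 0.4})
time.sleep(2)
print(node.state)   # the node has picked up the latest job state
```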
Client-Facing Interfaces and Web Integration
Integrating WebAssembly (WASM) code into the client-facing interfaces posed challenges because of hardware-level access requirements that the WASM sandbox restricts.
Bugs and Solutions
- Integration Issues: We faced numerous bugs during WASM integration.
- Performance Constraints: Ensuring efficient WASM execution without degrading user experience was crucial.
Solutions for WASM Integration
- Thorough Testing: Extensive testing helped identify and resolve bugs before deployment.
- Fallback Mechanisms: We implemented a JavaScript fallback path to preserve the user experience whenever WASM execution ran into problems.