Philosophy
ModelCub is built on a set of core principles that guide every design decision.
Local-First
Your data never leaves your machine.
This isn't just a feature - it's our fundamental architecture. ModelCub works 100% offline because:
Security
No network requests means:
- No data exfiltration risk
- No man-in-the-middle attacks
- No cloud breaches
- No third-party access
Privacy
Perfect for sensitive data:
- Medical imaging (HIPAA compliant)
- Pharmaceutical research
- Defense applications
- Proprietary datasets
Performance
Local processing is faster:
- No network latency
- No upload/download time
- Full GPU utilization
- Works on slow connections
Cost
Zero recurring fees:
- No monthly subscriptions
- No per-API-call charges
- No surprise bills
- No vendor lock-in
Stateless Backend
The backend is a view layer. All state lives in files.
State Storage:

```text
├── .modelcub/config.yaml    # Configuration
├── .modelcub/datasets.yaml  # Dataset registry
├── .modelcub/runs.yaml      # Training runs
└── data/datasets/           # Actual data
```

Benefits
Multiple instances: Run several UI servers simultaneously. They all see the same state.
Easy backup: Copy the directory. That's it.
Version control: Git can track everything.
No synchronization: No database to keep in sync.
Transparent: All state is human-readable YAML/JSON.
Implications
- Kill the server → restart → nothing lost
- No "database migrations"
- No connection pooling
- No ORM complexity
- No cache invalidation
API-First
Everything is accessible through clean APIs.
```python
# Python SDK
from modelcub import Project

project = Project.init("my-project")
```

```bash
# CLI
modelcub project init my-project
```

```javascript
// Web API
const project = await api.createProject({path: "my-project"});
```

All three interfaces use the same underlying core API.
Benefits
Composable: Mix and match tools as needed.
Automation: Script any workflow.
Testing: Easy to test business logic.
Integration: Works with existing tools.
Future-proof: New interfaces can be added without changing core.
Format-Agnostic
YOLO internally, import/export anything.
```text
Import                  Internal               Export
─────────────────────────────────────────────────────
YOLO     ─────┐                    ┌───→ YOLO
Roboflow ─────┤                    ├───→ COCO
COCO     ─────┼────→ YOLO ─────────┼───→ VOC
Images   ─────┘                    └───→ TFRecord
```

Why YOLO Internally?
Simple: Text-based format, easy to parse.
Universal: Every CV library supports it.
Git-friendly: Human-readable diffs.
Fast: No complex parsing required.
Standard: Industry-wide adoption.
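Part of what makes the format easy to parse: each label file holds one line per box, `class_id x_center y_center width height`, with coordinates normalized to [0, 1]. A minimal parser sketch (the function name is ours for illustration, not a ModelCub API):

```python
def parse_yolo_line(line: str) -> dict:
    """Parse one YOLO label line: 'class_id cx cy w h', normalized to [0, 1]."""
    class_id, cx, cy, w, h = line.split()
    return {
        "class_id": int(class_id),
        "cx": float(cx), "cy": float(cy),  # box center
        "w": float(w), "h": float(h),      # box size
    }
```

Plain `str.split()` is the whole parser; that is the "no complex parsing required" point above.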
Format Conversion
Transparent conversion on import/export:
```python
# Import COCO
dataset = Dataset.from_coco("./coco", name="v1")

# Export to TFRecord
dataset.export("./output", format="tfrecord")
```

The user never needs to think about the internal format.
Git-Friendly
Version datasets like code.
```bash
# Commit changes
modelcub commit "Added 100 new samples"

# View history
modelcub history

# Compare versions
modelcub diff v1 v2

# Rollback
modelcub checkout v1
```

Why Version Control?
Reproducibility: Exact state of the data for every experiment.
Collaboration: Multiple people can work on the same dataset.
Experimentation: Safe to try changes, easy to roll back.
Audit trail: Know exactly what changed and when.
Debugging: Bisect to find when an issue was introduced.
Implementation
File-based: All state in text files Git can track.
Diff-friendly: Changes show up clearly in diffs.
Commit metadata: Full provenance for every change.
Branch support: Experiment in branches, merge when ready.
Developer-Friendly
Built by engineers who felt the pain.
Clear Error Messages
Bad:

```text
Error: Invalid input
```

Good:

```text
❌ Dataset not found: "production-v1"

Available datasets:
  • production-v2 (847 images)
  • test-v1 (120 images)

Use: modelcub dataset list
```

Sensible Defaults
Auto-detect:
- GPU (CUDA, MPS, or CPU)
- Optimal batch size
- Image size
- Number of workers
User only specifies what they care about.
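A sketch of what device auto-detection can look like (a hypothetical helper, not ModelCub's actual implementation), degrading gracefully when PyTorch or an accelerator is unavailable:

```python
def pick_device() -> str:
    """Choose the best available compute device: CUDA, Apple MPS, or CPU."""
    try:
        import torch  # optional dependency; fall back to CPU without it
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps only exists on newer PyTorch builds; guard the lookup.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"
```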
Good Documentation
Every API has:
- Clear description
- Parameter documentation
- Return value documentation
- Code examples
- Common use cases
Type Safety
Full type hints throughout:
```python
from pathlib import Path
from typing import List, Optional

def import_dataset(
    source: Path,
    name: str,
    classes: Optional[List[str]] = None,
) -> Dataset:
    ...
```

Transparent
No black boxes. No hidden state.
Configuration
All config in `.modelcub/config.yaml`:

```yaml
project:
  name: my-project

defaults:
  device: cuda
  batch_size: 16
```

No hidden registry files. No system-wide configuration.
State
All state in human-readable files:
```yaml
# .modelcub/datasets.yaml
datasets:
  v1:
    name: v1
    classes: [cat, dog]
    images: 1000
```

No binary databases. No opaque blobs.
Logs
Clear, structured logs:
```text
[2025-01-26 10:30:15] INFO: Importing dataset from ./data
[2025-01-26 10:30:16] INFO: Found 1000 images
[2025-01-26 10:30:17] INFO: Detected 2 classes: cat, dog
[2025-01-26 10:30:18] SUCCESS: Import complete
```

Errors
Full stack traces in debug mode. Clear messages in normal mode.
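One way to implement that split, as a sketch rather than ModelCub's actual code: keep the full traceback behind a debug flag and show only the short message otherwise.

```python
import traceback

def format_error(exc: BaseException, debug: bool = False) -> str:
    """Full traceback in debug mode; a short, clear message in normal mode."""
    if debug:
        # Three-argument form works on all supported Python versions.
        return "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        )
    return f"❌ {exc}"
```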
Composable
Use what you need. Ignore what you don't.
Standalone Components
Each piece works independently:
```python
# Just dataset management
from modelcub import Dataset
dataset = Dataset.load("v1")

# Just annotation
from modelcub import Annotator
annotator = Annotator(dataset)

# Just training
from modelcub import Trainer
trainer = Trainer(dataset, model)
```

No Forced Workflows
Use ModelCub how you want:
- CLI only
- SDK only
- UI only
- Mix and match
Easy Integration
Works with existing tools:
```python
# Use with your own training loop
dataset = Dataset.load("v1")
train_loader = dataset.to_pytorch_dataloader()

# Your code here
for batch in train_loader:
    ...
```

Performance
Fast enough to not be annoying.
Benchmarks
- Import 10k images: <30 seconds
- Validate dataset: <10 seconds
- Load dataset metadata: <100ms
- Render UI: 60fps
Optimization
Lazy loading: Only load what's needed.
Caching: Cache expensive computations.
Pagination: Don't load all images at once.
Async: Use async I/O where beneficial.
Parallelism: Multi-threading for CPU-bound tasks.
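Caching and pagination are the simplest of these to show; a sketch of both (the names are illustrative, not ModelCub APIs):

```python
from functools import lru_cache

@lru_cache(maxsize=128)
def dataset_summary(name: str) -> str:
    """Stand-in for an expensive scan; repeat calls are served from the cache."""
    return f"summary of {name}"

def page(items: list, page_num: int, page_size: int = 50) -> list:
    """Return one page of items instead of loading everything at once."""
    start = page_num * page_size
    return items[start:start + page_size]
```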
Security
Privacy by design. Security by default.
No Remote Code
No `eval()`, no `exec()`, no unpickling of untrusted data.
Input Validation
All paths validated:
- No directory traversal
- No symlink attacks
- File extension checking
Safe Parsing
YAML/JSON parsing with safe loaders only.
SQL Safety
Parameterized queries only (for the optional SQLite cache).
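With Python's built-in `sqlite3`, that means placeholders with bound parameters, never string formatting; even a hostile value is stored inertly (illustrative table, not ModelCub's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images (name TEXT, width INTEGER)")

# A hostile value: harmless when bound as a parameter,
# catastrophic if it were spliced into the SQL string itself.
name = "cat'; DROP TABLE images; --"
conn.execute("INSERT INTO images VALUES (?, ?)", (name, 640))

# The same binding rule applies on reads.
rows = conn.execute("SELECT width FROM images WHERE name = ?", (name,)).fetchall()
```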
No Network
Zero outbound connections:
- No telemetry
- No update checks
- No analytics
- No crash reporting
Extensibility
Designed for future growth.
Plugin System (Future)
```python
# plugins/my_augmentation.py
from modelcub import Plugin

class MyAugmentation(Plugin):
    def augment(self, image):
        ...

# Register
modelcub.register_plugin(MyAugmentation)
```

Hook Points
Events for extension:
```python
from modelcub import bus

@bus.subscribe(DatasetImported)
def on_import(event):
    print(f"Dataset {event.name} imported")
```

Custom Formats
Add new import/export formats:
```python
from modelcub import register_format

@register_format("custom")
class CustomFormat:
    def parse(self, path): ...
    def export(self, dataset, path): ...
```

Summary
ModelCub is:
- Local-First: Your data, your machine
- Stateless: No hidden databases
- API-First: Everything composable
- Format-Agnostic: Use any format
- Git-Friendly: Version like code
- Developer-Friendly: Clear, simple APIs
- Transparent: No black boxes
- Composable: Use what you need
- Performant: Fast enough
- Secure: Privacy by design
- Extensible: Ready for growth
These principles guide every decision we make.