Philosophy

ModelCub is built on a set of core principles that guide every design decision.

Local-First

Your data never leaves your machine.

This isn't just a feature; it's our fundamental architecture. ModelCub works 100% offline, which pays off in four ways:

Security

No network requests means:

  • No data exfiltration risk
  • No man-in-the-middle attacks
  • No cloud breaches
  • No third-party access

Privacy

Perfect for sensitive data:

  • Medical imaging (supports HIPAA compliance)
  • Pharmaceutical research
  • Defense applications
  • Proprietary datasets

Performance

Local processing is faster:

  • No network latency
  • No upload/download time
  • Full GPU utilization
  • Works on slow connections

Cost

Zero recurring fees:

  • No monthly subscriptions
  • No per-API-call charges
  • No surprise bills
  • No vendor lock-in

Stateless Backend

The backend is a view layer. All state lives in files.

```text
State Storage:
├── .modelcub/config.yaml      # Configuration
├── .modelcub/datasets.yaml    # Dataset registry
├── .modelcub/runs.yaml        # Training runs
└── data/datasets/             # Actual data
```

Benefits

Multiple instances: Run several UI servers simultaneously. They all see the same state.

Easy backup: Copy the directory. That's it.

Version control: Git can track everything.

No synchronization: No database to keep in sync.

Transparent: All state is human-readable YAML/JSON.

Implications

  • Kill the server → restart → nothing lost
  • No "database migrations"
  • No connection pooling
  • No ORM complexity
  • No cache invalidation
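
The stateless design is easy to demonstrate: because all state is plain files on disk, any number of processes can load it independently and agree. A minimal sketch using JSON and an invented state shape (the real files are YAML under .modelcub/):

```python
import json
import tempfile
from pathlib import Path

# Hypothetical state file; ModelCub keeps the real equivalents as YAML.
state_file = Path(tempfile.mkdtemp()) / "runs.json"
state_file.write_text(json.dumps({"runs": [{"id": "run-1", "status": "done"}]}))

def load_state(path: Path) -> dict:
    """Any process can read the full state with nothing but the filesystem."""
    return json.loads(path.read_text())

# Two independent "servers" see identical state: no database, no sync protocol.
server_a = load_state(state_file)
server_b = load_state(state_file)
assert server_a == server_b
```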

API-First

Everything is accessible through clean APIs.

```python
# Python SDK
from modelcub import Project
project = Project.init("my-project")
```

```bash
# CLI
modelcub project init my-project
```

```typescript
// Web API
const project = await api.createProject({ path: "my-project" });
```

All three interfaces use the same underlying core API.
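
One way to picture the layering (names here are illustrative, not ModelCub's actual internals): a single core function, with the SDK and CLI as thin adapters over it.

```python
# Hypothetical core layer: the one place the real work happens.
def core_create_project(path: str) -> dict:
    return {"path": path, "created": True}

# SDK-style adapter.
class Project:
    @staticmethod
    def init(path: str) -> dict:
        return core_create_project(path)

# CLI-style adapter, e.g. for `modelcub project init my-project`.
def cli_main(argv: list) -> dict:
    assert argv[:2] == ["project", "init"]
    return core_create_project(argv[2])

# Both interfaces produce identical results because they share one core.
assert Project.init("my-project") == cli_main(["project", "init", "my-project"])
```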

Benefits

Composable: Mix and match tools as needed.

Automation: Script any workflow.

Testing: Easy to test business logic.

Integration: Works with existing tools.

Future-proof: New interfaces can be added without changing core.

Format-Agnostic

YOLO internally, import/export anything.

```text
Import              Internal            Export
─────────────────────────────────────────────────
YOLO       ─────┐                  ┌───→  YOLO
Roboflow   ─────┤                  ├───→  COCO
COCO       ─────┼────→  YOLO  ────┼───→  VOC
Images     ─────┘                  └───→  TFRecord
```

Why YOLO Internally?

Simple: Text-based format, easy to parse.

Universal: Every CV library supports it.

Git-friendly: Human-readable diffs.

Fast: No complex parsing required.

Standard: Industry-wide adoption.
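
Concretely, a YOLO label file holds one line per box: a class index followed by normalized center-x, center-y, width, and height. A small parser sketch shows how little machinery the format needs:

```python
def parse_yolo_line(line: str) -> dict:
    """Parse one YOLO annotation line: 'class cx cy w h', all coords in [0, 1]."""
    cls, cx, cy, w, h = line.split()
    return {
        "class": int(cls),
        "cx": float(cx),
        "cy": float(cy),
        "w": float(w),
        "h": float(h),
    }

box = parse_yolo_line("0 0.50 0.50 0.20 0.30")
assert box["class"] == 0 and box["w"] == 0.2
```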

Format Conversion

Transparent conversion on import/export:

```python
# Import COCO
dataset = Dataset.from_coco("./coco", name="v1")

# Export to TFRecord
dataset.export("./output", format="tfrecord")
```

Users never need to think about the internal format.

Git-Friendly

Version datasets like code.

```bash
# Commit changes
modelcub commit "Added 100 new samples"

# View history
modelcub history

# Compare versions
modelcub diff v1 v2

# Rollback
modelcub checkout v1
```

Why Version Control?

Reproducibility: Exact state of data for every experiment.

Collaboration: Multiple people can work on the same dataset.

Experimentation: Safe to try changes, easy to rollback.

Audit trail: Know exactly what changed and when.

Debugging: Bisect to find when an issue was introduced.

Implementation

File-based: All state in text files Git can track.

Diff-friendly: Changes show up clearly in diffs.

Commit metadata: Full provenance for every change.

Branch support: Experiment in branches, merge when ready.

Developer-Friendly

Built by engineers who felt the pain.

Clear Error Messages

Bad:

```text
Error: Invalid input
```

Good:

```text
❌ Dataset not found: "production-v1"

Available datasets:
  • production-v2 (847 images)
  • test-v1 (120 images)

Use: modelcub dataset list
```

Sensible Defaults

Auto-detect:

  • GPU (CUDA, MPS, or CPU)
  • Optimal batch size
  • Image size
  • Number of workers

User only specifies what they care about.
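
GPU detection typically follows a simple preference order: CUDA, then Apple's MPS, then CPU. A sketch of that logic (not necessarily ModelCub's exact implementation), using PyTorch when it is available:

```python
def detect_device() -> str:
    """Return the best available compute device, falling back to CPU."""
    try:
        import torch
        if torch.cuda.is_available():
            return "cuda"
        mps = getattr(torch.backends, "mps", None)
        if mps is not None and mps.is_available():
            return "mps"
    except ImportError:
        pass  # torch not installed: CPU is always a safe default
    return "cpu"

device = detect_device()
assert device in {"cuda", "mps", "cpu"}
```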

Good Documentation

Every API has:

  • Clear description
  • Parameter documentation
  • Return value documentation
  • Code examples
  • Common use cases

Type Safety

Full type hints throughout:

```python
from pathlib import Path
from typing import List, Optional

def import_dataset(
    source: Path,
    name: str,
    classes: Optional[List[str]] = None,
) -> Dataset:
    ...
```

Transparent

No black boxes. No hidden state.

Configuration

All config in .modelcub/config.yaml:

```yaml
project:
  name: my-project
defaults:
  device: cuda
  batch_size: 16
```

No hidden registry files. No system-wide configuration.

State

All state in human-readable files:

```yaml
# .modelcub/datasets.yaml
datasets:
  v1:
    name: v1
    classes: [cat, dog]
    images: 1000
```

No binary databases. No opaque blobs.

Logs

Clear, structured logs:

```text
[2025-01-26 10:30:15] INFO: Importing dataset from ./data
[2025-01-26 10:30:16] INFO: Found 1000 images
[2025-01-26 10:30:17] INFO: Detected 2 classes: cat, dog
[2025-01-26 10:30:18] SUCCESS: Import complete
```

Errors

Full stack traces in debug mode. Clear messages in normal mode.
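
The two modes can be sketched with the standard traceback module (the function name below is hypothetical):

```python
import traceback

def format_error(exc: Exception, debug: bool = False) -> str:
    """Full stack trace in debug mode, a short human-readable message otherwise."""
    if debug:
        return "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        )
    return f"❌ {exc}"

try:
    raise FileNotFoundError('Dataset not found: "production-v1"')
except FileNotFoundError as err:
    short = format_error(err)
    full = format_error(err, debug=True)

assert short.startswith("❌")
assert "FileNotFoundError" in full
```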

Composable

Use what you need. Ignore what you don't.

Standalone Components

Each piece works independently:

```python
# Just dataset management
from modelcub import Dataset
dataset = Dataset.load("v1")

# Just annotation
from modelcub import Annotator
annotator = Annotator(dataset)

# Just training
from modelcub import Trainer
trainer = Trainer(dataset, model)
```

No Forced Workflows

Use ModelCub how you want:

  • CLI only
  • SDK only
  • UI only
  • Mix and match

Easy Integration

Works with existing tools:

```python
# Use with your own training loop
dataset = Dataset.load("v1")
train_loader = dataset.to_pytorch_dataloader()

# Your code here
for batch in train_loader:
    ...
```

Performance

Fast enough to not be annoying.

Benchmarks

  • Import 10k images: <30 seconds
  • Validate dataset: <10 seconds
  • Load dataset metadata: <100ms
  • Render UI: 60fps

Optimization

Lazy loading: Only load what's needed.

Caching: Cache expensive computations.

Pagination: Don't load all images at once.

Async: Use async I/O where beneficial.

Parallelism: Worker processes for CPU-bound tasks (Python threads are limited by the GIL).
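
Caching, for example, can be as simple as functools.lru_cache around an expensive computation (the function below is invented for illustration):

```python
from functools import lru_cache

@lru_cache(maxsize=128)
def dataset_stats(name: str) -> dict:
    """Pretend-expensive computation; memoized after the first call per name."""
    dataset_stats.calls += 1
    return {"name": name, "images": 1000}

dataset_stats.calls = 0
dataset_stats("v1")
dataset_stats("v1")  # second call is served from the cache
assert dataset_stats.calls == 1
```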

Security

Privacy by design. Security by default.

No Remote Code

No eval(), no exec(), no pickle of untrusted data.

Input Validation

All paths validated:

  • No directory traversal
  • No symlink attacks
  • File extension checking
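
A typical traversal guard resolves the candidate path and checks that it stays inside the project root; this sketch (helper name invented) uses pathlib:

```python
import tempfile
from pathlib import Path

def safe_join(root: Path, user_path: str) -> Path:
    """Resolve user_path under root; reject anything that escapes it."""
    candidate = (root / user_path).resolve()
    if not candidate.is_relative_to(root):
        raise ValueError(f"path escapes project root: {user_path}")
    return candidate

root = Path(tempfile.mkdtemp()).resolve()
assert safe_join(root, "data/images") == root / "data" / "images"

try:
    safe_join(root, "../../etc/passwd")
    raise AssertionError("traversal was not blocked")
except ValueError:
    pass  # expected: the resolved path left the root
```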

Safe Parsing

YAML/JSON parsing with safe loaders only.

SQL Safety

Parameterized queries only (for optional SQLite cache).
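
With Python's built-in sqlite3, parameter binding means hostile input is stored as data, never executed as SQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE datasets (name TEXT)")

# A hostile value: harmless when bound with '?', never string-interpolated.
hostile = "v1'; DROP TABLE datasets; --"
conn.execute("INSERT INTO datasets (name) VALUES (?)", (hostile,))

row = conn.execute(
    "SELECT name FROM datasets WHERE name = ?", (hostile,)
).fetchone()
assert row == (hostile,)  # stored verbatim, and the table still exists
```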

No Network

Zero outbound connections:

  • No telemetry
  • No update checks
  • No analytics
  • No crash reporting

Extensibility

Designed for future growth.

Plugin System (Future)

```python
# plugins/my_augmentation.py
import modelcub
from modelcub import Plugin

class MyAugmentation(Plugin):
    def augment(self, image):
        ...

# Register
modelcub.register_plugin(MyAugmentation)
```

Hook Points

Events for extension:

```python
from modelcub import DatasetImported, bus

@bus.subscribe(DatasetImported)
def on_import(event):
    print(f"Dataset {event.name} imported")
```

Custom Formats

Add new import/export formats:

```python
from modelcub import register_format

@register_format("custom")
class CustomFormat:
    def parse(self, path): ...
    def export(self, dataset, path): ...
```
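
Under the hood, a decorator-based registry like register_format usually amounts to a dictionary of handler classes; an illustrative sketch (not the real API):

```python
FORMATS = {}

def register_format(name: str):
    """Register a format handler class under a string key."""
    def decorator(cls):
        FORMATS[name] = cls
        return cls
    return decorator

@register_format("custom")
class CustomFormat:
    def parse(self, path):
        return {"source": path}

    def export(self, dataset, path):
        return f"wrote {dataset} to {path}"

handler = FORMATS["custom"]()
assert handler.parse("labels.txt") == {"source": "labels.txt"}
```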

Summary

ModelCub is:

  • Local-First: Your data, your machine
  • Stateless: No hidden databases
  • API-First: Everything composable
  • Format-Agnostic: Use any format
  • Git-Friendly: Version like code
  • Developer-Friendly: Clear, simple APIs
  • Transparent: No black boxes
  • Composable: Use what you need
  • Performant: Fast enough
  • Secure: Privacy by design
  • Extensible: Ready for growth

These principles guide every decision we make.

Released under the MIT License.