OpenTSLM Foundation Model Evaluation

Evaluation Date: October 2, 2025
Evaluator: Research Team
Status: ❌ Not Recommended for Cloud Anomaly Detection


Executive Summary

OpenTSLM is a Stanford-developed time series foundation model that processes multivariate time series through language model reasoning. While architecturally innovative, it is not suitable for cloud resource anomaly detection due to:

  1. ❌ No Pre-trained Weights - Requires training from scratch (5-stage curriculum, days/weeks with GPU)
  2. ❌ Medical Domain Focus - Optimized for ECG, EEG, and human activity recognition, not cloud metrics
  3. ❌ High Training Overhead - ~6GB dataset downloads, CUDA GPU required, HuggingFace authentication needed
  4. ❌ Not Designed for Anomaly Detection - Focused on Q&A, captioning, and chain-of-thought reasoning

Recommendation: Explore purpose-built anomaly detection models or cloud-metric-trained foundation models instead.


Research Context

Motivation

While reviewing Hacker News on October 2, 2025, we discovered OpenTSLM - a newly published time series foundation model from Stanford. Given our cloud-resource-simulator project's need for anomaly detection capabilities, we investigated whether OpenTSLM could serve as a foundation model for:

  1. Timeseries Anomaly Detection in cloud resource utilization
  2. Pattern Recognition across multivariate cloud metrics (CPU, memory, network, disk)
  3. Natural Language Explanations for detected anomalies

Research Question

Can OpenTSLM be adapted or fine-tuned for cloud resource anomaly detection in the cloud-resource-simulator project?


Investigation Methodology

Phase 1: Repository Analysis

  1. Forked Repository to personal GitHub account: nehalecky/OpenTSLM
  2. Cloned Locally to /Users/nehalecky/Projects/cloudzero/OpenTSLM
  3. Initialized Submodules (open_flamingo dependency)
  4. Examined Documentation - README, code structure, training pipeline

Phase 2: Model Weights Investigation

  • Searched for pre-trained checkpoints in repository
  • Checked HuggingFace for published models
  • Reviewed training scripts for weight download mechanisms
  • Analyzed curriculum_learning.py for checkpoint handling

Phase 3: Architecture & Requirements Analysis

  • Reviewed model implementations (OpenTSLMFlamingo, OpenTSLMSP)
  • Examined encoder architecture (TransformerCNN)
  • Analyzed training datasets and their domains
  • Assessed infrastructure requirements

Key Findings

Critical Limitation: No Pre-trained Weights Available

OpenTSLM does NOT provide pre-trained model weights. Users must train models from scratch using the full 5-stage curriculum.

What's Available:

  • Base LLMs from HuggingFace (Llama 3.2-1B, Gemma-3-270m) - generic base models, not OpenTSLM-trained weights
  • No shortcuts or intermediate checkpoints provided

What's Required:

# 1. Obtain base LLM (requires HuggingFace authentication)
huggingface-cli login

# 2. Run full 5-stage curriculum training
python curriculum_learning.py --model OpenTSLMFlamingo

# Stages:
# - Stage 1: Multiple Choice Q&A (TSQA dataset)
# - Stage 2: Time Series Captioning (M4 dataset)
# - Stage 3: HAR Chain-of-Thought (manual download required)
# - Stage 4: Sleep Staging CoT (EEG data)
# - Stage 5: ECG Q&A CoT (~6GB download)

# Training time: Days to weeks depending on GPU

Checkpoints Storage:

results/
└── Llama3_2_1B/
    └── OpenTSLMFlamingo/
        ├── stage1_mcq/checkpoints/best_model.pt
        ├── stage2_captioning/checkpoints/best_model.pt
        ├── stage3_cot/checkpoints/best_model.pt
        ├── stage4_sleep_cot/checkpoints/best_model.pt
        └── stage5_ecg_cot/checkpoints/best_model.pt
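To sanity-check what a stage actually produced, a checkpoint can be inspected directly. A minimal sketch, assuming the files are standard PyTorch serializations (we did not verify the exact format):

import torch

# Hypothetical inspection of a stage checkpoint; path taken from the tree above.
# Assumes a plain torch-serialized object (state dict or similar) - unverified.
ckpt_path = "results/Llama3_2_1B/OpenTSLMFlamingo/stage2_captioning/checkpoints/best_model.pt"
state = torch.load(ckpt_path, map_location="cpu")

# Check what the checkpoint contains before attempting to load it into a model.
print(type(state))
if isinstance(state, dict):
    print(list(state.keys())[:10])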

Domain Mismatch: Medical Focus

Primary Use Cases:

  • ECG Analysis - 12-lead electrocardiogram interpretation
  • Sleep Staging - EEG-based sleep classification
  • Human Activity Recognition - accelerometer/gyroscope data
  • Medical Time Series Q&A - clinical reasoning tasks

Training Datasets:

| Stage | Dataset         | Domain              | Size            |
|-------|-----------------|---------------------|-----------------|
| 1     | TSQA            | Time series Q&A     | Auto-download   |
| 2     | M4              | General forecasting | Auto-download   |
| 3     | HAR             | Human activity      | Manual download |
| 4     | SleepEDF        | EEG sleep staging   | Auto-download   |
| 5     | ECG-QA + PTB-XL | 12-lead ECG         | ~6GB            |

Domain Characteristics:

  • High sampling rates (100-500 Hz for medical signals)
  • Strong physiological constraints (QRS complexes, sleep stages)
  • Clinical terminology and reasoning patterns
  • Diagnostic question-answering focus

Cloud Metrics Characteristics:

  • Low sampling rates (1-5 minute intervals typical)
  • Different correlation patterns (resource contention, not physiology)
  • Infrastructure terminology (pods, nodes, services)
  • Anomaly detection focus (not diagnostic Q&A)

Conclusion: Significant domain gap between medical time series and cloud infrastructure metrics.
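The sampling-rate gap alone makes the mismatch concrete. A quick back-of-envelope comparison (the rates below are typical values, not measurements from either domain):

# Sequence lengths per observation window, medical vs cloud.
ecg_hz = 250                                 # typical ECG rate (100-500 Hz range)
ecg_window_s = 10                            # a common ECG snippet length
ecg_points = ecg_hz * ecg_window_s           # 2,500 points in 10 seconds

cloud_interval_s = 60                        # 1-minute scrape (1-5 min typical)
cloud_window_s = 24 * 3600                   # a full day of monitoring
cloud_points = cloud_window_s // cloud_interval_s  # 1,440 points in 24 hours

# A whole day of 1-minute cloud metrics is shorter than 10 seconds of ECG.
print(ecg_points, cloud_points)              # 2500 1440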


Architecture Analysis

Model Components

1. OpenTSLMFlamingo Architecture

Time Series Input → TransformerCNN Encoder → MLP Projector → Frozen LLM
                                                              ↓
                                                    Natural Language Output

Components:

  • Encoder: TransformerCNN - processes multivariate time series of any length
  • Projector: MLP layers that align time series embeddings with the LLM embedding space
  • LLM: pre-trained language model (Llama 3.2-1B or Gemma variants)
  • Training: LoRA fine-tuning for parameter-efficient adaptation

2. Alternative: OpenTSLMSP

  • Uses special tokens instead of the Flamingo architecture
  • Same encoder/projector concept
  • Different integration with the base LLM

Key Innovation:

  • Combines time series understanding with natural language reasoning
  • Enables chain-of-thought explanations for predictions
  • Processes multivariate time series with variable lengths
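To make the data flow concrete, here is a minimal PyTorch sketch of the encoder → projector path. The stand-in encoder and all dimensions are illustrative assumptions; the actual implementation is the repository's TransformerCNN and MLP projector under src/model/:

import torch
import torch.nn as nn

class TimeSeriesProjector(nn.Module):
    """Sketch of the time series branch; dimensions are hypothetical."""

    def __init__(self, n_channels: int, d_encoder: int = 256, d_llm: int = 2048):
        super().__init__()
        # Stand-in for TransformerCNN: anything mapping
        # (batch, time, channels) -> (batch, tokens, d_encoder)
        self.encoder = nn.Sequential(
            nn.Linear(n_channels, d_encoder),
            nn.TransformerEncoderLayer(d_encoder, nhead=4, batch_first=True),
        )
        # MLP projector aligning encoder output with the LLM embedding space
        self.projector = nn.Sequential(
            nn.Linear(d_encoder, d_llm), nn.GELU(), nn.Linear(d_llm, d_llm)
        )

    def forward(self, ts: torch.Tensor) -> torch.Tensor:
        # Output tokens would be spliced into the frozen LLM's input sequence.
        return self.projector(self.encoder(ts))

# Example: 4 cloud metrics (CPU, memory, network, disk) over 288 timesteps.
tokens = TimeSeriesProjector(n_channels=4)(torch.randn(1, 288, 4))
print(tokens.shape)  # torch.Size([1, 288, 2048])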


Training Requirements

Hardware Requirements

  • Preferred: CUDA-enabled NVIDIA GPU
  • Alternative: Apple Silicon MPS (with compatibility warnings)
  • Warning: Models trained on CUDA may not transfer to MPS
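In practice this translates to a standard PyTorch device-selection guard (plain PyTorch, nothing OpenTSLM-specific):

import torch

# Prefer CUDA, fall back to Apple Silicon MPS, then CPU. Note the warning
# above: CUDA-trained checkpoints may not behave identically on MPS.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")  # usable, but with compatibility caveats
else:
    device = torch.device("cpu")  # fine for inspection, far too slow to train
print(device)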

Software Dependencies

# Core ML/DL (from requirements.txt)
torch
transformers
peft  # LoRA fine-tuning
huggingface-hub

# Time Series
chronos-forecasting
wfdb  # ECG signal processing

# Vision/Multimodal
open-clip-torch
einops

# Data Processing
numpy
pandas
scikit-learn
matplotlib

Training Pipeline (5 Stages)

Stage 1: Multiple Choice Questions (~hours)

python curriculum_learning.py --model OpenTSLMFlamingo --stages stage1_mcq
  • Dataset: TSQA (Time Series Question Answering)
  • Task: Answer multiple choice questions about time series patterns
  • Auto-downloads from HuggingFace

Stage 2: Captioning (~hours)

python curriculum_learning.py --model OpenTSLMFlamingo --stages stage2_captioning
  • Dataset: M4 competition data
  • Task: Generate natural language descriptions of time series
  • Focuses on pattern recognition and verbalization

Stage 3: HAR Chain-of-Thought (~hours-days)

python curriculum_learning.py --model OpenTSLMFlamingo --stages stage3_cot
  • Dataset: Human Activity Recognition (HAR)
  • Download: https://polybox.ethz.ch/index.php/s/kD74GnMYxn3HBEM/download
  • Task: Classify activities with reasoning steps

Stage 4: Sleep Staging CoT (~hours-days)

python curriculum_learning.py --model OpenTSLMFlamingo --stages stage4_sleep_cot
  • Dataset: SleepEDF (EEG data)
  • Task: Sleep stage classification with chain-of-thought
  • Medical domain specialization begins

Stage 5: ECG Q&A CoT (~days)

python curriculum_learning.py --model OpenTSLMFlamingo --stages stage5_ecg_cot
  • Datasets: ECG-QA + PTB-XL (~6GB combined)
  • Download: https://polybox.ethz.ch/index.php/s/D5QaJSEw4dXkzXm/download
  • Task: 12-lead ECG clinical reasoning
  • Most medically specialized stage

Full Curriculum:

python curriculum_learning.py --model OpenTSLMFlamingo
# Estimated time: Days to weeks depending on GPU

Applicability Assessment for Cloud Resource Simulator

Alignment Analysis

| Requirement          | OpenTSLM Support           | Assessment                                     |
|----------------------|----------------------------|------------------------------------------------|
| Anomaly Detection    | ❌ Not primary focus       | Q&A/captioning oriented, not outlier detection |
| Cloud Metrics        | ❌ Medical training data   | Domain mismatch (ECG/EEG vs CPU/memory)        |
| Pre-trained Model    | ❌ Must train from scratch | Prohibitive for exploration phase              |
| Fast Inference       | ⚠️ Depends on LLM size     | Llama 3.2-1B moderate, Gemma-270m faster       |
| Multivariate Support | ✅ Native support          | Handles multiple metrics simultaneously        |
| Variable Length      | ✅ Any length              | Good for different time windows                |
| Explainability       | ✅ Chain-of-thought        | Natural language reasoning available           |

Strengths for Cloud Use Case

✅ Multivariate Time Series - can process CPU, memory, network, and disk together
✅ Variable Length Sequences - handles different monitoring windows
✅ Natural Language Output - could explain anomalies in plain English
✅ Modular Architecture - encoder/projector/LLM separation allows adaptation

Critical Limitations for Cloud Use Case

❌ No Pre-trained Weights - cannot evaluate without weeks of training
❌ Medical Domain Bias - training data fundamentally different from cloud metrics
❌ Not Anomaly-Focused - designed for Q&A, not outlier/anomaly detection
❌ Training Overhead - requires substantial GPU resources and time
❌ Dataset Mismatch - no cloud infrastructure training data included

Recommended Alternatives

For Anomaly Detection:

  1. Traditional ML Models
     • Isolation Forest (scikit-learn) - see the sketch after this list
     • LSTM Autoencoders (reconstruction error)
     • Prophet (Facebook) for seasonal decomposition
  2. Purpose-Built Time Series Models
     • Amazon Chronos - already integrated in our project
     • Google TimesFM - zero-shot forecasting
     • Both have pre-trained weights and better domain fit
  3. Cloud-Specific Models
     • AWS DeepAR (if using AWS data)
     • Azure Anomaly Detector (if using Azure data)
     • GCP Time Series Insights (if using GCP data)
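As a concrete starting point for item 1, an Isolation Forest baseline is a few lines of scikit-learn. The column names, synthetic data, and contamination rate below are illustrative assumptions, not values from our simulator:

import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical day of 1-minute cloud metrics; column names are illustrative.
rng = np.random.default_rng(0)
metrics = pd.DataFrame({
    "cpu_pct": rng.uniform(10, 60, 1440),
    "mem_pct": rng.uniform(30, 70, 1440),
    "net_mbps": rng.uniform(0, 100, 1440),
    "disk_iops": rng.uniform(50, 500, 1440),
})
metrics.iloc[700] = metrics.iloc[700] * 5  # inject a synthetic spike

# contamination = expected anomaly fraction; tune against labeled simulations.
model = IsolationForest(n_estimators=200, contamination=0.01, random_state=0)
labels = model.fit_predict(metrics)           # -1 = anomaly, 1 = normal
print(metrics.index[labels == -1].tolist())   # minutes flagged as anomalous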

For Explainability:

  • SHAP values on anomaly detection models
  • Attention weights from transformer-based detectors
  • Rule-based explanations from traditional methods (see the sketch below)
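For the rule-based option, a minimal sketch of turning a statistical flag into a plain-English explanation (the 3-sigma threshold and the metric name are hypothetical choices):

import numpy as np

def explain_anomaly(name: str, window: np.ndarray, value: float,
                    z_thresh: float = 3.0) -> str:
    """Compare a flagged value against its trailing window and explain it."""
    mu, sigma = window.mean(), window.std()
    z = (value - mu) / sigma if sigma > 0 else 0.0
    if abs(z) < z_thresh:
        return f"{name}: within {z_thresh:.0f} sigma of the recent mean."
    direction = "above" if z > 0 else "below"
    return (f"{name} anomaly: {value:.1f} is {abs(z):.1f} sigma {direction} "
            f"the trailing mean of {mu:.1f}.")

print(explain_anomaly("cpu_pct", np.random.normal(40, 5, 288), 92.0))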


Repository Information

Forked Repository

  • GitHub: https://github.com/nehalecky/OpenTSLM
  • Upstream: https://github.com/StanfordBDHG/OpenTSLM
  • Local Path: /Users/nehalecky/Projects/cloudzero/OpenTSLM
  • Stars: 73 (as of Oct 2, 2025)
  • Created: May 2025
  • Last Updated: October 1, 2025

Repository Structure

OpenTSLM/
├── curriculum_learning.py          # Main training script
├── requirements.txt                # 21 dependencies
├── src/
│   ├── model/
│   │   ├── encoder/               # TransformerCNN
│   │   ├── llm/                   # OpenTSLMFlamingo, OpenTSLMSP
│   │   └── projector/             # MLP alignment
│   ├── time_series_datasets/      # Dataset loaders (TSQA, M4, HAR, Sleep, ECG)
│   ├── prompt/                    # Prompt templates
│   └── open_flamingo/            # Submodule
├── evaluation/                    # Evaluation scripts
├── test/                         # Unit tests
└── data/                         # Auto-downloaded datasets

Additional Resources

  • Paper: https://doi.org/10.13140/RG.2.2.14827.60963
  • Website: https://www.opentslm.com
  • Related Papers: ECG-QA and PTB-XL (links under References)

Decision & Next Steps

Decision: Not Pursuing OpenTSLM

Primary Reasons:

  1. Training Barrier - no pre-trained weights; requires weeks of GPU time
  2. Domain Mismatch - medical focus doesn't transfer well to cloud infrastructure
  3. Wrong Task Focus - designed for Q&A/captioning, not anomaly detection
  4. Better Alternatives Exist - purpose-built models with a better fit for cloud data

Rationale

While OpenTSLM demonstrates impressive multimodal capabilities for medical time series, the combination of lacking pre-trained weights and medical domain specialization makes it impractical for our cloud anomaly detection needs. The opportunity cost of training from scratch (GPU time, dataset engineering, validation) outweighs potential benefits when superior alternatives exist.

Key Insight: Foundation models are only valuable if:

  • Pre-trained weights are available (transfer learning), OR
  • Training data closely matches your domain

OpenTSLM fails both criteria for cloud metrics.

Immediate Actions:

  1. ✅ Archive Fork - keep for reference, but don't actively develop
  2. ✅ Document Evaluation - this report serves as institutional knowledge

Alternative Exploration Priority:

High Priority (Immediate):

  - [ ] Enhance Chronos Integration - already in our codebase, has pre-trained weights
  - [ ] Explore TimesFM - Google's zero-shot forecasting model
  - [ ] Traditional Anomaly Detection - Isolation Forest baseline

Medium Priority (Next Quarter):

  - [ ] Investigate Cloud-Specific Models - AWS DeepAR, Azure Anomaly Detector
  - [ ] Custom LSTM Autoencoder - train on our synthetic cloud data
  - [ ] Hybrid Approach - Chronos forecasting + statistical anomaly detection (see the sketch after these lists)

Low Priority (Future Research):

  - [ ] Foundation Model Fine-tuning - if a cloud-trained foundation model emerges
  - [ ] LLM-Based Explainability - use GPT-4/Claude for anomaly explanations
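For the hybrid approach flagged above, a hedged sketch using the chronos-forecasting package already in our stack: forecast with a pre-trained Chronos model, then flag observations outside a sampled quantile band. The model size and the 1%/99% thresholds are our choices for illustration:

import torch
from chronos import ChronosPipeline  # provided by chronos-forecasting

# Zero-shot forecast with a small pre-trained Chronos model.
pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-small", device_map="cpu", torch_dtype=torch.float32
)

context = torch.randn(288).cumsum(0)  # stand-in for a day of 5-minute CPU data
samples = pipeline.predict(context, prediction_length=12)  # (1, n_samples, 12)

# Statistical anomaly flag: observations outside the sampled quantile band.
lo = samples.quantile(0.01, dim=1).squeeze(0)
hi = samples.quantile(0.99, dim=1).squeeze(0)
observed = context[-1] + torch.randn(12).cumsum(0)  # the actual next readings
print((observed < lo) | (observed > hi))            # True = outside the band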


Lessons Learned

For Future Model Evaluations

Pre-Evaluation Checklist:

  1. ✅ Check for Pre-trained Weights - first question, not last
  2. ✅ Verify Domain Match - medical ≠ cloud infrastructure
  3. ✅ Assess Task Alignment - Q&A ≠ anomaly detection
  4. ✅ Estimate Training Cost - GPU hours, dataset size, time to validation

Red Flags Identified:

  • 🚩 "Train from scratch" with no pre-trained option
  • 🚩 All training examples from an unrelated domain
  • 🚩 No mention of your use case in the documentation
  • 🚩 Base models require special access (Llama 3.2 gating)

Green Flags for Future Models:

  • ✅ Pre-trained weights on HuggingFace
  • ✅ Training data includes infrastructure/system metrics
  • ✅ Explicit anomaly detection capabilities
  • ✅ Active community with cloud use cases

Research Methodology Success

What Worked Well:

  • Using the repository-manager agent for systematic analysis
  • Forking before deep evaluation (preserves exploration)
  • Checking for weights availability early
  • Documenting findings immediately

Process Improvements:

  • Consider creating a "Model Evaluation Template" for future assessments
  • Build a checklist of domain-fit questions
  • Maintain a "Models Under Consideration" tracking document


References

OpenTSLM Resources

  • GitHub Repository: https://github.com/StanfordBDHG/OpenTSLM
  • Project Website: https://www.opentslm.com
  • Research Paper: https://doi.org/10.13140/RG.2.2.14827.60963
  • Our Fork: https://github.com/nehalecky/OpenTSLM

Alternative Models

  • Amazon Chronos: https://github.com/amazon-science/chronos-forecasting
  • Google TimesFM: https://github.com/google-research/timesfm
  • Hugging Face Time Series: https://huggingface.co/models?pipeline_tag=time-series-forecasting

Medical Time Series Datasets (Context)

  • ECG-QA Paper: https://arxiv.org/abs/2306.15681
  • PTB-XL Dataset: https://www.nature.com/articles/s41597-020-0495-6
  • SleepEDF: https://physionet.org/content/sleep-edfx/1.0.0/

Cloud Anomaly Detection Resources

  • FinOps Foundation: https://www.finops.org/
  • AWS CloudWatch Anomaly Detection: https://aws.amazon.com/cloudwatch/
  • Azure Monitor Anomaly Detector: https://azure.microsoft.com/en-us/products/ai-services/ai-anomaly-detector

Document Status: Final
Last Updated: October 2, 2025
Next Review: When new cloud-focused foundation models emerge