Initial JEPA Training

Date: 2026-06-11

Short Summary

The model was trained as a schedule representation model, not as an automatic scheduler. Future iterations will begin training automatic scheduling. In this run the model did not learn from human labels such as "good schedule" or "bad schedule." Instead, it learned by looking at structured, schedule-like windows, hiding part of each window, and trying to predict the hidden part.

The final training run used 11 public datasets.

The final combined model was trained on about 6.7 million training windows and validated on about 744 thousand held-out windows. Six randomized confirmation passes were run. Each pass used a different random validation holdout and a randomized dataset order.

The strongest validation check compared the trained final checkpoint against a fresh untrained model on validation data:

Model | Held-out validation loss
Trained final model | 0.01780038
Fresh untrained model | 1.04929552

The trained model's validation loss was about 98.30 percent lower than the untrained baseline on the same validation procedure. This shows that the model learned the JEPA prediction objective. It does not, by itself, prove that the model improves real scheduling decisions. That requires a downstream scheduler evaluation.
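The percentage can be reproduced from the two reported losses. The following is a minimal sketch of that arithmetic; the function name is illustrative and is not part of the project code.

```python
def percent_reduction(baseline_loss: float, trained_loss: float) -> float:
    """Return how much lower trained_loss is than baseline_loss, in percent."""
    return (1.0 - trained_loss / baseline_loss) * 100.0


# Losses reported in the comparison above.
reduction = percent_reduction(baseline_loss=1.04929552, trained_loss=0.01780038)
print(f"{reduction:.2f} percent lower")
```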

What The Model Learns

The model is a JEPA (Joint Embedding Predictive Architecture) style model for schedule and workflow data.

In this project, the model learns representations of schedule-like activity windows. A window is one day, or one task-like time context, represented as a fixed numeric grid.

The training task works like this:

  1. Take one prepared schedule window.
  2. Hide a span of time slots from the model.
  3. Let the context encoder read only the visible slots.
  4. Let the target encoder read the full original window.
  5. Train the predictor so that its output, computed from the context representation, matches the target encoder's representation of the hidden slots.
  6. Update the target encoder slowly using an exponential moving average of the context encoder.

In simpler terms: the model repeatedly sees partial schedule patterns and learns to predict the missing part in representation space.
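Steps 2 and 6 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: it assumes a single contiguous hidden span, the function names are hypothetical, and the 0.40 mask ratio and 0.99 EMA decay are the values from the configuration reported in this document. The real encoders are neural networks; short lists of floats stand in for their weights here.

```python
import random


def choose_hidden_span(num_slots: int, mask_ratio: float, rng: random.Random) -> range:
    """Pick one contiguous span of slots to hide (assumption: a single span)."""
    span_len = max(1, round(num_slots * mask_ratio))
    start = rng.randint(0, num_slots - span_len)
    return range(start, start + span_len)


def ema_update(target: list[float], context: list[float], decay: float) -> list[float]:
    """Move each target weight a small step toward the matching context weight."""
    return [decay * t + (1.0 - decay) * c for t, c in zip(target, context)]


rng = random.Random(0)
hidden = choose_hidden_span(num_slots=96, mask_ratio=0.40, rng=rng)
visible = [slot for slot in range(96) if slot not in hidden]

# Toy "weights": the target encoder drifts slowly toward the context encoder.
target_weights = ema_update(target=[0.0, 1.0], context=[1.0, 1.0], decay=0.99)
```

With a decay of 0.99, each update moves the target weights only 1 percent of the way toward the context weights, which is what keeps the prediction target stable while the context encoder changes quickly.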

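This is useful because the model can learn patterns such as all-day calendar items occupying broad portions of a window, issue-tracker tasks tied to a specific time or lifecycle event, and email activity that looks different from calendar or issue-tracker activity.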

The current model is not yet a full scheduling assistant by itself. It is a representation model that can later be connected to the scheduler and scorer.

Model Configuration

The final combined checkpoint used this configuration:

Setting | Value
Input features per slot | 16
Slots per window | 96
Approximate time represented by each slot | 15 minutes in a 24-hour day
Model dimension | 64
Transformer depth | 2 layers
Attention heads | 4
MLP dimension | 128
Mask ratio | 0.40
EMA decay for target encoder | 0.99
Learning rate | 0.001
Batch size | 32
Steps per dataset training run | 1000
Load mode | streaming
Device | CUDA
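For readers who prefer code, the same settings can be written as a small configuration object, which also makes a few derived quantities explicit. This is a sketch: the class and field names are hypothetical and may not match the project's actual configuration code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JepaConfig:
    # Field names are hypothetical; values are the ones reported above.
    features_per_slot: int = 16
    slots_per_window: int = 96
    model_dim: int = 64
    depth: int = 2
    heads: int = 4
    mlp_dim: int = 128
    mask_ratio: float = 0.40
    ema_decay: float = 0.99
    learning_rate: float = 1e-3
    batch_size: int = 32
    steps_per_dataset: int = 1000


cfg = JepaConfig()
head_dim = cfg.model_dim // cfg.heads                      # width of each attention head
windows_per_run = cfg.batch_size * cfg.steps_per_dataset   # windows drawn in one dataset run
values_per_window = cfg.slots_per_window * cfg.features_per_slot
```

Assuming one batch per step, a single dataset run draws 32,000 windows, so for the largest datasets one run sees only a fraction of the available training windows.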

Each window has 96 time slots. A 24 hour day has 96 fifteen minute slots, so slot 0 is around midnight, slot 48 is around noon, and slot 95 is near the end of the day.
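The slot arithmetic can be made concrete with a short sketch. It assumes that a timestamp is mapped to the slot containing it by rounding down, which is consistent with the worked examples in this report (14:01 maps to slot 56, 06:48 to slot 27, 20:51 to slot 83). The function names are illustrative.

```python
SLOTS_PER_DAY = 96
MINUTES_PER_SLOT = 24 * 60 // SLOTS_PER_DAY  # 15


def slot_for_time(hour: int, minute: int) -> int:
    """Map a clock time to its 15-minute slot index (0..95)."""
    return (hour * 60 + minute) // MINUTES_PER_SLOT


def slot_start(slot: int) -> str:
    """Return the HH:MM at which a slot begins."""
    minutes = slot * MINUTES_PER_SLOT
    return f"{minutes // 60:02d}:{minutes % 60:02d}"
```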

Each slot has 16 numeric features:

Index | Feature | Plain-English meaning
0 | slot_position | Where this slot is inside the window.
1 | day_position | Where this slot is inside the day.
2 | calendar_event | Calendar event or calendar-like occupancy.
3 | task | Task, issue, to-do, or action item signal.
4 | email_activity | Email or communication activity.
5 | project_activity | Project, issue tracker, or planning system activity.
6 | code_activity | GitHub or code hosting activity.
7 | holiday | Public holiday or all-day holiday signal.
8 | deadline | Due date or deadline signal.
9 | duration | Normalized duration signal.
10 | priority | Normalized priority estimate.
11 | participant_count | Normalized count of people, recipients, actors, or locations.
12 | text_length | Normalized length of title, summary, body, or description text.
13 | completion | Closed, completed, or percentage done signal.
14 | source_confidence | Adapter confidence based on source fidelity.
15 | reserved | Reserved for future use.

All feature values are floating point numbers clamped between 0.0 and 1.0.
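As an illustration of the feature layout, the sketch below builds one 16-value slot vector from named signals and clamps every value into the stated range. The helper name make_slot is hypothetical; only the feature names, their order, and the 0.0 to 1.0 range come from this report.

```python
FEATURE_NAMES = [
    "slot_position", "day_position", "calendar_event", "task",
    "email_activity", "project_activity", "code_activity", "holiday",
    "deadline", "duration", "priority", "participant_count",
    "text_length", "completion", "source_confidence", "reserved",
]
FEATURE_INDEX = {name: i for i, name in enumerate(FEATURE_NAMES)}


def make_slot(**signals: float) -> list[float]:
    """Build one 16-value slot vector; unset features stay at 0.0."""
    slot = [0.0] * len(FEATURE_NAMES)
    for name, value in signals.items():
        # Clamp into [0.0, 1.0], matching the stated value range.
        slot[FEATURE_INDEX[name]] = min(1.0, max(0.0, float(value)))
    return slot


# Slot 1 of the calendar holiday example from this report.
holiday_slot = make_slot(slot_position=0.011, day_position=0.011,
                         calendar_event=1.0, holiday=1.0,
                         duration=1.0, source_confidence=1.0)
```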

Datasets Used

These datasets were used for training:

Dataset | What it represents | Raw windows | Final train windows | Final validation windows
calendar_ics_thunderbird | Public calendar and holiday ICS files | 4,881 | 4,383 | 498
project_openproject | Public OpenProject work packages | 1,000 | 897 | 103
project_taiga | Public Taiga issues, tasks, and user stories | 1,179 | 1,085 | 94
gharchive_2026_06_09 | Public GitHub event activity | 3,800,471 | 3,420,346 | 380,125
ms_latte | Task timing preference examples | 10,101 | 9,126 | 975
epa | Email-derived task assignment examples | 6,734 | 6,084 | 650
enron_maildir | Public Enron email corpus | 517,371 | 465,821 | 51,550
enron_cornell | Public Enron graph/email derivative | 21,768 | 19,570 | 2,198
enron_snap | Public Enron communication graph derivative | 367,662 | 331,019 | 36,643
public_jira | Public Jira issues exported from the restored archive | 2,686,282 | 2,417,273 | 269,009
smart_todo_coded | Public coded SmartToDo task data | 24,070 | 21,605 | 2,465

Final train and validation counts above are from the last randomized split, split_seed=2026061106.

Two planned sources were not trained:

Dataset | Reason
smart_todo_decoded_gated | Blocked because decoded data requires gated Avocado access.
blocked_sources | Placeholder for sources that require credentials, approval, or manual agreements.

Blocked datasets wrote empty processed files so the pipeline had a complete record of what was skipped, but they had zero train windows and were not included in all-public training.

What The Sample Data Looked Like

The model did not train directly on raw email text, raw issue descriptions, or raw calendar files. Each raw record was converted into a normalized numeric window.

A prepared record has this overall shape:

{
  "dataset": "public_jira",
  "source_id": "public_jira:<redacted_issue_id>",
  "window_kind": "project_task",
  "window_start": "2021-12-30T06:48:04",
  "window_end": null,
  "tokens": [
    [0.0, 0.0, 0.0, "... 16 numbers total ..."],
    [0.011, 0.011, 0.0, "... 16 numbers total ..."],
    "... 96 slots total ..."
  ],
  "metadata": {
    "...": "adapter-specific metadata"
  }
}

The important field is tokens. That is the matrix the model actually sees:

96 time slots x 16 features per slot

Below are simplified examples based on the actual prepared data. Raw text bodies are not shown.

Example 1: Calendar Holiday

Source type: public Thunderbird ICS holiday calendar.

Plain-English raw event:

An all-day public holiday appears on 2025-01-01.

Prepared signal:

slot 0:
  calendar_event = 1.0
  holiday = 1.0
  duration = 1.0
  source_confidence = 1.0

slot 1:
  slot_position = 0.011
  day_position = 0.011
  calendar_event = 1.0
  holiday = 1.0
  duration = 1.0
  source_confidence = 1.0

What this teaches the model:

The model sees that all-day calendar items occupy broad portions of the window and look different from short task or email activity.

Example 2: OpenProject Task

Source type: public OpenProject work package.

Plain-English raw record:

A project task was created on 2019-08-28 around 14:01.
The record has task/project metadata, priority-like structure, and text description length.

Prepared signal:

slot 56:
  task = 1.0
  project_activity = 1.0
  priority = 0.5
  text_length = 0.114
  source_confidence = 0.85

Why slot 56:

There are 96 slots in a day, so each slot covers 15 minutes. Slot 56 starts 56 × 15 = 840 minutes after midnight, which is 14:00, so a record created at 14:01 falls in slot 56.

What this teaches the model:

The model learns that issue tracker tasks are task-like and project-like, often tied to a specific time or lifecycle event.

Example 3: Public Jira Issue

Source type: normalized Public Jira issue.

Plain-English raw record:

A Jira issue was created on 2021-12-30 around 06:48.
The issue has a summary and description, and may have due-date fields.

Prepared signal:

slot 27:
  task = 1.0
  project_activity = 1.0
  text_length = 0.019
  source_confidence = 0.8

What this teaches the model:

The model sees public issue tracker records as work items with project context and text size signals.

Example 4: Enron Email

Source type: public Enron maildir message.

Plain-English raw record:

An email was sent on 2001-05-04 around 20:51.
The email had recipients and body text.

Prepared signal:

slot 83:
  email_activity = 1.0
  participant_count = 0.08
  text_length = 0.393
  source_confidence = 0.8

What this teaches the model:

The model learns communication patterns, recipient-count signals, and how email-derived activity differs from calendar or issue tracker activity.

Example 5: GitHub Event

Source type: GH Archive public GitHub event.

Plain-English raw record:

A public GitHub event happened on 2026-06-09.
The event is code/project activity.

Prepared signal:

slot 0:
  project_activity = 1.0
  code_activity = 1.0
  text_length = 0.004
  source_confidence = 0.9

What this teaches the model:

The model learns that GitHub events are both project activity and code activity.

Example 6: MS-LaTTE Task Preference

Source type: MS-LaTTE task timing preference example.

Plain-English raw record:

A task preference example indicates when a task may be appropriate.
It includes timing judgments and participant-like or location-like judgments.

Prepared signal:

slot 36:
  task = 1.0
  priority = 0.5
  participant_count = 0.6
  text_length = 0.008
  source_confidence = 0.9

What this teaches the model:

The model gets examples of task timing preference, not just activity logs.

Step-by-Step Training Process

Step 1: Review The Existing Training Plan

The existing training docs described three main requirements:

  1. Keep raw public data separate from processed model inputs.
  2. Convert every dataset into the same fixed window schema.
  3. Train from prepared train.jsonl.gz and validate on prepared val.jsonl.gz.

Step 2: Confirm Source Data Was Available

The source cache contained the expected public data groups:

Source group | Local size
public_jira | 6.3G
enron | 3.0G
dialogue_semantics | 2.9G
gharchive | 531M
smart_todo | 77M
epa | 21M
project_apis | 12M
ms_latte | 11M
calendar_ics | 2.2M

Not every cached source group was part of the final trainable set. The training registry controlled which prepared datasets were included.

Step 3: Restore And Normalize Public Jira

The Public Jira archive was restored and exported into normalized JSON Lines.

The normalized export contained 2,686,282 Jira issue records.

Each line had fields like:

{
  "id": "Apache:<issue_id>",
  "key": "<issue_key>",
  "created": "2021-12-30T06:48:04",
  "dueDate": null,
  "summary": "<issue summary>",
  "description": "<issue description>"
}

The model did not train on this raw JSON directly. The data-prep adapter converted it into the numeric 96-by-16 token format.

Step 4: Prepare All Trainable Datasets

The preparation command converted raw datasets into common train and validation files:

/home/lcrh/miniforge3/envs/jepa-scheduler/bin/python -m jepa_scheduler.data_prep prepare \
  --dataset all \
  --data-root /home/lcrh/train_data \
  --processed-dir data/processed \
  --reports-dir /srv/jepa/outputs/reports/jepa-training \
  --run-id full-20260611T035946Z

Step 5: Use Train And Validation Splits

The prepared records were split into a training file (train.jsonl.gz) and a validation file (val.jsonl.gz) for each dataset.

For confirmation training, seeded random splits were implemented. That allowed the system to reshuffle which records were held out for validation without reparsing all raw data.

The final split metadata was:

split_seed = 2026061106
split_mode = random_resplit
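A seeded re-split of this kind can be sketched as follows. This is not the project's implementation: the function name is hypothetical, the 0.1 validation fraction is an assumption inferred from the reported counts (about 744 thousand of 7.44 million windows), and holding out each record independently is also an assumption.

```python
import random


def resplit(source_ids: list[str], split_seed: int,
            val_fraction: float = 0.1) -> tuple[list[str], list[str]]:
    """Deterministically reassign prepared records to train or validation."""
    rng = random.Random(split_seed)
    train, val = [], []
    for source_id in source_ids:
        # Each record is held out independently with probability val_fraction.
        (val if rng.random() < val_fraction else train).append(source_id)
    return train, val


ids = [f"record-{i}" for i in range(1000)]
train_a, val_a = resplit(ids, split_seed=2026061106)
train_b, val_b = resplit(ids, split_seed=2026061106)  # same seed, same split
train_c, val_c = resplit(ids, split_seed=2026061105)  # new seed, new holdout
```

Because the split depends only on the seed and the order of the prepared records, a new holdout can be produced without reparsing any raw data.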

Step 6: Run A CUDA Smoke Training Test

Before running long training, a short smoke test verified that the data and GPU training path worked.

Smoke training used CUDA and produced a combined validation loss of:

0.7946688532829285

This was not meant to be a good model. It only proved that the prepared data could be read and that the GPU training path ran.

Step 7: Add Random Confirmation Training Support

The training code was extended so repeated confirmation runs could:

  1. randomly choose a new validation holdout;
  2. train datasets in random order;
  3. repeat the full dataset sequence multiple times;
  4. write separate checkpoints and reports for each pass.

This matters because a single validation split can be misleading. Repeating the run with different held-out data gives better evidence that the model generalizes, rather than just doing well on one lucky split.

Step 8: Run Six Randomized Confirmation Passes

The final training sequence ran six full confirmation passes.

Each pass did this:

  1. Re-split every completed prepared dataset with a new seed.
  2. Hold out different validation examples.
  3. Randomize the dataset order.
  4. Train one checkpoint per dataset.
  5. Train one combined checkpoint over all trainable datasets.

The pass seeds were:

Pass | Split seed
1 | 2026061101
2 | 2026061102
3 | 2026061103
4 | 2026061104
5 | 2026061105
6 | 2026061106
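The structure of one confirmation pass can be sketched as a small planning function. The names plan_pass and BASE_SEED are hypothetical, and reusing the split seed to shuffle the dataset order is an assumption; the report only states that the order was randomized.

```python
import random

DATASETS = [
    "calendar_ics_thunderbird", "project_openproject", "project_taiga",
    "gharchive_2026_06_09", "ms_latte", "epa", "enron_maildir",
    "enron_cornell", "enron_snap", "public_jira", "smart_todo_coded",
]
BASE_SEED = 2026061100  # pass N uses BASE_SEED + N, matching the seed table


def plan_pass(pass_number: int) -> dict:
    """Return the split seed, dataset order, and checkpoint names for one pass."""
    split_seed = BASE_SEED + pass_number
    order = DATASETS[:]
    # Assumption: the dataset order is shuffled from the same seed as the split.
    random.Random(split_seed).shuffle(order)
    return {
        "split_seed": split_seed,
        "dataset_order": order,
        "checkpoints": order + ["combined"],  # one per dataset, then combined
    }


plans = [plan_pass(n) for n in range(1, 7)]
```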

Each pass trained 12 checkpoints: one for each of the 11 trainable datasets, plus one combined checkpoint.

Training Results

The combined model result from each confirmation pass was:

Pass | Train windows | Validation windows | Final train loss | Final validation loss
1 | 6,697,302 | 744,217 | 0.0518671572 | 0.0170253972
2 | 6,696,556 | 744,963 | 0.0355612300 | 0.0085086927
3 | 6,697,133 | 744,386 | 0.0421437211 | 0.0122576728
4 | 6,697,362 | 744,157 | 0.0373337157 | 0.0088611947
5 | 6,699,353 | 742,166 | 0.0469359979 | 0.0140897143
6 | 6,697,209 | 744,310 | 0.0479964577 | 0.0182671631

All six combined validation losses were finite and low. They were measured on validation data, not the same batches used for gradient updates.

The validation loss varied by split, which is expected. Different held-out records make the validation set a slightly different test each time.
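The spread across passes can be summarized with a few standard statistics. The sketch below uses only the six final combined validation losses reported for the confirmation passes and Python's standard library.

```python
import statistics

# Final combined validation loss for passes 1 to 6, as reported above.
val_losses = [
    0.0170253972, 0.0085086927, 0.0122576728,
    0.0088611947, 0.0140897143, 0.0182671631,
]

summary = {
    "min": min(val_losses),
    "max": max(val_losses),
    "mean": statistics.mean(val_losses),
    "stdev": statistics.stdev(val_losses),  # sample standard deviation
}
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

The mean is about 0.0132, and the largest loss (pass 6) is a little over twice the smallest (pass 2), so the passes agree in order of magnitude even though individual splits differ.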

How Validation Was Done

Validation happened at several levels.

1. Data Preparation Validation

Data prep verified that each prepared window followed the expected schema:

96 slots
16 features per slot
feature values between 0.0 and 1.0
gzip JSONL train and validation files
manifest and stats files for each dataset

Prepared output was checked by counting train and validation records. Public Jira's normalized gzip export was also checked with gzip integrity validation.

This prevented the model from training on malformed records.
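A schema check of this kind might look like the sketch below. It is not the project's validator: the function name and error messages are hypothetical, and only the 96-slot, 16-feature, 0.0 to 1.0 constraints are taken from this report.

```python
def validate_window(record: dict, slots: int = 96, features: int = 16) -> list[str]:
    """Return a list of schema problems; an empty list means the window is valid."""
    tokens = record.get("tokens")
    if not isinstance(tokens, list) or len(tokens) != slots:
        return [f"expected {slots} slots"]
    problems = []
    for i, slot in enumerate(tokens):
        if not isinstance(slot, list) or len(slot) != features:
            problems.append(f"slot {i}: expected {features} features")
        elif not all(isinstance(v, (int, float)) and 0.0 <= v <= 1.0 for v in slot):
            problems.append(f"slot {i}: value outside [0.0, 1.0]")
    return problems


good = {"tokens": [[0.0] * 16 for _ in range(96)]}
bad = {"tokens": [[0.0] * 16 for _ in range(95)] + [[1.5] + [0.0] * 15]}
```

Returning a list of problems instead of raising on the first one makes it easier to report every malformed slot in a record at once.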

2. Held-Out Validation Loss During Training

During training, the model was periodically evaluated on val.jsonl.gz records.

Those validation records were not used for optimizer updates. The model only used them to report loss.

The loss is mean squared error between the predictor's output and the target encoder's representation of the hidden slots.

Lower validation loss means the model is better at predicting hidden schedule representations from visible context.
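The loss computation can be illustrated with a toy example in plain Python. This sketch assumes the error is averaged over the embedding values of the hidden slots only; the report does not state exactly which positions the project averages over, and the function name is hypothetical.

```python
def masked_mse(predicted: list[list[float]], target: list[list[float]],
               hidden_slots: list[int]) -> float:
    """Mean squared error over the embedding values of the hidden slots only."""
    total, count = 0.0, 0
    for slot in hidden_slots:
        for p, t in zip(predicted[slot], target[slot]):
            total += (p - t) ** 2
            count += 1
    return total / count


# Toy example: 3 slots, 2-dimensional embeddings, slots 1 and 2 hidden.
predicted = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
target = [[9.0, 9.0], [1.0, 0.0], [0.0, 0.0]]
loss = masked_mse(predicted, target, hidden_slots=[1, 2])
```

Slot 0 is visible in this toy example, so its large mismatch does not contribute to the loss.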

3. Repeated Random Confirmation Splits

The model was not validated on only one fixed holdout split.

After the first full prep, the prepared records were split six times using different seeds:

2026061101
2026061102
2026061103
2026061104
2026061105
2026061106

That means each confirmation pass held out a different set of records for validation.

This is a stronger check than one static validation set because it makes it less likely that the model looked good only because of one favorable split.

4. Randomized Dataset Order

The dataset order was randomized for the confirmation passes.

This reduces order bias. For example, if the model always trained on GitHub before Enron before Jira, the final model might be sensitive to that order. Randomizing the sequence gives a more realistic view of whether training is stable.

5. Fresh Untrained Model Baseline

After training, the final combined checkpoint was compared with a fresh untrained model on held-out validation data.

The evaluation used:

checkpoint = /srv/jepa/outputs/checkpoints/random-confirm-20260611T051025Z-pass06/combined/schedule_jepa.pt
device = cuda
datasets = 11
validation batches = 100
batch size = 32

Result:

Model | Validation loss
Trained final model | 0.01780038
Fresh untrained model | 1.04929552

The trained model's loss was about 98.30 percent lower than the fresh model's on the same validation process. This is the clearest sanity check that training succeeded.

6. Software Tests And Checkpoint Checks

The test suite passed:

24 passed

Checkpoint verification also confirmed that each random confirmation pass wrote 12 checkpoint files.

Findings

Finding 1: The End-to-End Training Pipeline Works

The data can be downloaded or cached, normalized, converted into prepared windows, streamed into the trainer, trained on CUDA, and saved as checkpoints.

This matters because the large datasets are too big to load all at once in ordinary memory. Streaming mode lets training proceed with bounded memory.
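Bounded-memory streaming from a gzip JSONL file can be sketched with a generator. This is an illustration, not the project's loader: the function name is hypothetical, and it assumes one JSON record per line with a tokens field, as in the prepared record shape shown in this report.

```python
import gzip
import json
from itertools import islice
from typing import Iterator


def stream_batches(path: str, batch_size: int = 32) -> Iterator[list[list[list[float]]]]:
    """Yield batches of token grids without loading the whole file into memory."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        # Lazily decode one JSON record per line.
        windows = (json.loads(line)["tokens"] for line in handle if line.strip())
        while True:
            batch = list(islice(windows, batch_size))
            if not batch:
                return
            yield batch
```

Only one batch of windows is held in memory at a time, so memory use depends on the batch size and not on the size of the dataset file.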

Finding 2: The Model Learned The JEPA Objective

The final trained model achieved a much lower validation loss than a fresh untrained model:

trained = 0.01780038
untrained = 1.04929552

This means the model learned useful structure for the masked-window prediction task.

Finding 3: Validation Was Not Dependent On One Single Holdout

The six confirmation passes each used different validation records.

The combined validation losses remained low across all six passes:

0.0170
0.0085
0.0123
0.0089
0.0141
0.0183

This supports the conclusion that training was stable across different validation splits.

Finding 4: The Model Is Not Yet Proven As A Scheduling Decision Model

The validation shows that the model learned the self-supervised representation task.

It does not yet prove that the model improves real scheduling decisions.

That requires downstream evaluation after connecting JEPA embeddings into the scheduler or scorer.

Limitations

The current validation is appropriate for pretraining, but it has limits: it measures only the masked-prediction loss, not the quality of scheduling decisions.

The next useful testing layer should evaluate whether JEPA embeddings improve scheduling behavior.

Suggested tests:

  1. Add an evaluation CLI that loads a checkpoint and reports validation loss without retraining.
  2. Freeze the final combined JEPA checkpoint and generate embeddings for candidate schedule windows.
  3. Add those embeddings as features to the scheduler/scorer.
  4. Compare scheduler outcomes with and without JEPA features.
  5. Track practical metrics:
    • accepted suggestion rate;
    • moved block rate;
    • deleted or skipped block rate;
    • deadline misses;
    • meeting conflicts;
    • focus block fragmentation;
    • user correction distance in minutes.
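Some of these metrics can be computed from a simple outcome log. The sketch below is purely hypothetical: no such log format exists in the project yet, the record fields are invented for illustration, and "accepted" is assumed to mean that the user kept the suggested start time unchanged.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SuggestionOutcome:
    # Hypothetical log record: start times in minutes since midnight.
    suggested_start: int
    final_start: Optional[int]  # None if the block was deleted or skipped


def summarize(outcomes: list[SuggestionOutcome]) -> dict:
    """Compute accepted, moved, and deleted rates plus mean correction distance."""
    n = len(outcomes)
    kept = [o for o in outcomes if o.final_start is not None]
    moved = [o for o in kept if o.final_start != o.suggested_start]
    distances = [abs(o.final_start - o.suggested_start) for o in moved]
    return {
        "accepted_rate": (len(kept) - len(moved)) / n,
        "moved_rate": len(moved) / n,
        "deleted_rate": (n - len(kept)) / n,
        "mean_correction_minutes": sum(distances) / len(distances) if distances else 0.0,
    }


baseline = summarize([SuggestionOutcome(540, 540), SuggestionOutcome(600, 660),
                      SuggestionOutcome(780, None), SuggestionOutcome(840, 870)])
```

Running the same summary for a scheduler with JEPA features and one without, on the same set of scheduling requests, would give the with-and-without comparison described in step 4.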