Initial JEPA Training

Date: 2026-06-11

Short Summary

The model was trained as a schedule representation model, not as an automatic scheduler. Training for automatic scheduling is planned for future iterations. The model therefore did not learn from human labels such as "good schedule" or "bad schedule." Instead, it learned by looking at structured, schedule-like windows, hiding part of each window, and trying to predict the hidden part.

The final training run used 11 public datasets.

The final combined model was trained on about 6.7 million training windows and validated on about 744 thousand held out windows. Six randomized confirmation passes were run. Each pass used a different random validation holdout and a randomized dataset order.

The strongest validation check compared the trained final checkpoint against a fresh untrained model on validation data:

Model	Held out validation loss
Trained final model	0.01780038
Fresh untrained model	1.04929552

The trained model's validation loss was about 98.30 percent lower than the untrained baseline on the same validation procedure. This shows that the model learned the JEPA prediction objective. It does not, by itself, prove that the model improves real scheduling decisions. That requires a downstream scheduler evaluation.
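
The percentage above follows directly from the two losses in the table. A minimal check of that arithmetic, using only the reported values (the variable names are illustrative):

```python
# Losses copied from the table above.
trained_loss = 0.01780038
untrained_loss = 1.04929552

# Relative reduction of the trained loss against the untrained baseline.
reduction_pct = (1.0 - trained_loss / untrained_loss) * 100.0

# How many times larger the baseline loss is than the trained loss.
ratio = untrained_loss / trained_loss

print(f"reduction: {reduction_pct:.2f} percent")
print(f"baseline is {ratio:.1f}x the trained loss")
```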

What The Model Learns

The model is a JEPA (Joint Embedding Predictive Architecture) style model for schedule and workflow data.

In this project, the model learns representations of schedule-like activity windows. A window is one day, or one task-like time context, represented as a fixed numeric grid.

The training task works like this:

Take one prepared schedule window.
Hide a span of time slots from the model.
Let the context encoder read only the visible slots.
Let the target encoder read the full original window.
Train the predictor to make the context representation match the hidden target representation.
Update the target encoder slowly using an exponential moving average of the context encoder.

In simpler terms: the model repeatedly sees partial schedule patterns and learns to predict the missing part in representation space.

This is useful because the model can learn patterns such as:

which times tend to contain tasks;
how email-derived work differs from project tracker work;
how code activity, meetings, holidays, deadlines, and priorities appear in time;
how different workflow sources map into a common schedule-like structure.

The current model is not yet a full scheduling assistant by itself. It is a representation model that can later be connected to the scheduler and scorer.
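
Two pieces of the training task described in this section, hiding a span of slots and slowly updating the target encoder, can be sketched in plain Python. This is an illustration rather than the project's training code. The function names are hypothetical, and the choice of one contiguous span of round(96 x 0.40) = 38 slots is an assumption based on the stated mask ratio.

```python
import random

NUM_SLOTS = 96      # slots per window (from the model configuration)
MASK_RATIO = 0.40   # fraction of slots hidden (from the model configuration)
EMA_DECAY = 0.99    # target encoder decay (from the model configuration)


def sample_span_mask(rng, num_slots=NUM_SLOTS, mask_ratio=MASK_RATIO):
    """Return one boolean per slot; True marks a hidden slot.

    Assumption: a single contiguous span of round(num_slots * mask_ratio) slots.
    """
    span = max(1, round(num_slots * mask_ratio))
    start = rng.randrange(num_slots - span + 1)
    return [start <= i < start + span for i in range(num_slots)]


def ema_update(target_weights, context_weights, decay=EMA_DECAY):
    """Move each target weight a small step toward the context weight."""
    return [decay * t + (1.0 - decay) * c
            for t, c in zip(target_weights, context_weights)]


rng = random.Random(0)
mask = sample_span_mask(rng)
visible_slots = [i for i, hidden in enumerate(mask) if not hidden]
hidden_slots = [i for i, hidden in enumerate(mask) if hidden]

# The context encoder would read only visible_slots; the predictor is
# trained to match the target encoder's representation at hidden_slots.
print(len(visible_slots), len(hidden_slots))  # 58 38

# Toy numbers standing in for encoder parameters.
target = ema_update([1.0, 0.0], [0.0, 1.0])
```

With a decay of 0.99, each update keeps 99 percent of the old target weight, which is why the target encoder changes slowly compared with the context encoder.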

Model Configuration

The final combined checkpoint used this configuration:

Setting	Value
Input features per slot	16
Slots per window	96
Approximate time represented by each slot	15 minutes in a 24 hour day
Model dimension	64
Transformer depth	2 layers
Attention heads	4
MLP dimension	128
Mask ratio	0.40
EMA decay for target encoder	0.99
Learning rate	0.001
Batch size	32
Steps per dataset training run	1000
Load mode	streaming
Device	CUDA

Each window has 96 time slots. A 24 hour day has 96 fifteen minute slots, so slot 0 is around midnight, slot 48 is around noon, and slot 95 is near the end of the day.
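
The slot arithmetic can be checked with a small helper. The function names are hypothetical. The slot_position normalization (index divided by 95) is an assumption: it reproduces the 0.011 shown for slot 1 in the examples later in this document, but the adapter's actual formula is not stated.

```python
SLOTS_PER_DAY = 96
MINUTES_PER_SLOT = 24 * 60 // SLOTS_PER_DAY  # 15


def time_to_slot(hour, minute):
    """Map a clock time to its slot index (0..95)."""
    return (hour * 60 + minute) // MINUTES_PER_SLOT


def slot_position(slot):
    """Hypothetical normalization of a slot index to the range 0.0..1.0."""
    return slot / (SLOTS_PER_DAY - 1)


print(time_to_slot(0, 0), time_to_slot(12, 0), time_to_slot(23, 59))   # 0 48 95
print(time_to_slot(14, 1), time_to_slot(6, 48), time_to_slot(20, 51))  # 56 27 83
print(round(slot_position(1), 3))                                      # 0.011
```

The second line of output matches the slots used in the OpenProject, Jira, and Enron examples (14:01, 06:48, and 20:51).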

Each slot has 16 numeric features:

Index	Feature	Plain-English meaning
0	`slot_position`	Where this slot is inside the window.
1	`day_position`	Where this slot is inside the day.
2	`calendar_event`	Calendar event or calendar like occupancy.
3	`task`	Task, issue, to do, or action item signal.
4	`email_activity`	Email or communication activity.
5	`project_activity`	Project, issue tracker, or planning system activity.
6	`code_activity`	GitHub or code hosting activity.
7	`holiday`	Public holiday or all day holiday signal.
8	`deadline`	Due date or deadline signal.
9	`duration`	Normalized duration signal.
10	`priority`	Normalized priority estimate.
11	`participant_count`	Normalized count of people, recipients, actors, or locations.
12	`text_length`	Normalized length of title, summary, body, or description text.
13	`completion`	Closed, completed, or percentage done signal.
14	`source_confidence`	Adapter confidence based on source fidelity.
15	`reserved`	Reserved for future use.

All feature values are floating point numbers clamped between 0.0 and 1.0.
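
For illustration, the feature table can be turned into a name-to-index lookup that builds one 16-value slot vector. The helper name make_slot is hypothetical, and the adapters in this project may construct slots differently. The feature names and order are taken from the table above.

```python
FEATURES = [
    "slot_position", "day_position", "calendar_event", "task",
    "email_activity", "project_activity", "code_activity", "holiday",
    "deadline", "duration", "priority", "participant_count",
    "text_length", "completion", "source_confidence", "reserved",
]
FEATURE_INDEX = {name: i for i, name in enumerate(FEATURES)}


def make_slot(**signals):
    """Build one 16-value slot vector; unspecified features stay 0.0."""
    slot = [0.0] * len(FEATURES)
    for name, value in signals.items():
        # Clamp to the stated 0.0..1.0 range.
        slot[FEATURE_INDEX[name]] = min(1.0, max(0.0, float(value)))
    return slot


# Signals from the simplified OpenProject example in this document.
slot = make_slot(task=1.0, project_activity=1.0, priority=0.5,
                 text_length=0.114, source_confidence=0.85)
print(slot)
```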

Datasets Used

These datasets were used for training:

Dataset	What it represents	Raw windows	Final train windows	Final validation windows
`calendar_ics_thunderbird`	Public calendar and holiday ICS files	4,881	4,383	498
`project_openproject`	Public OpenProject work packages	1,000	897	103
`project_taiga`	Public Taiga issues, tasks, and user stories	1,179	1,085	94
`gharchive_2026_06_09`	Public GitHub event activity	3,800,471	3,420,346	380,125
`ms_latte`	Task timing preference examples	10,101	9,126	975
`epa`	Email derived task assignment examples	6,734	6,084	650
`enron_maildir`	Public Enron email corpus	517,371	465,821	51,550
`enron_cornell`	Public Enron graph/email derivative	21,768	19,570	2,198
`enron_snap`	Public Enron communication graph derivative	367,662	331,019	36,643
`public_jira`	Public Jira issues exported from the restored archive	2,686,282	2,417,273	269,009
`smart_todo_coded`	Public coded SmartToDo task data	24,070	21,605	2,465

Final train and validation counts above are from the last randomized split, split_seed=2026061106.
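
As a consistency check, the per-dataset counts in the table can be summed and compared with the combined totals reported for pass 6 in the training results. The snippet below only re-adds numbers that already appear in this document.

```python
# (raw, train, validation) counts copied from the table above.
COUNTS = {
    "calendar_ics_thunderbird": (4_881, 4_383, 498),
    "project_openproject": (1_000, 897, 103),
    "project_taiga": (1_179, 1_085, 94),
    "gharchive_2026_06_09": (3_800_471, 3_420_346, 380_125),
    "ms_latte": (10_101, 9_126, 975),
    "epa": (6_734, 6_084, 650),
    "enron_maildir": (517_371, 465_821, 51_550),
    "enron_cornell": (21_768, 19_570, 2_198),
    "enron_snap": (367_662, 331_019, 36_643),
    "public_jira": (2_686_282, 2_417_273, 269_009),
    "smart_todo_coded": (24_070, 21_605, 2_465),
}

# Every raw window should land in exactly one of the two splits.
for name, (raw, train, val) in COUNTS.items():
    assert train + val == raw, name

total_train = sum(train for _, train, _ in COUNTS.values())
total_val = sum(val for _, _, val in COUNTS.values())
val_fraction = total_val / (total_train + total_val)

print(total_train, total_val)  # 6697209 744310
print(f"{val_fraction:.1%}")   # 10.0%
```

The totals match the pass 6 row of the results table, and the validation share is close to 10 percent of all windows.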

Two planned sources were not trained:

Dataset	Reason
`smart_todo_decoded_gated`	Blocked because decoded data requires gated Avocado access.
`blocked_sources`	Placeholder for sources that require credentials, approval, or manual agreements.

Blocked datasets wrote empty processed files so the pipeline had a complete record of what was skipped. They had zero train windows and were not included in the all-public training run.

What The Sample Data Looked Like

The model did not train directly on raw email text, raw issue descriptions, or raw calendar files. Each raw record was converted into a normalized numeric window.

A prepared record has this overall shape:

{
  "dataset": "public_jira",
  "source_id": "public_jira:<redacted_issue_id>",
  "window_kind": "project_task",
  "window_start": "2021-12-30T06:48:04",
  "window_end": null,
  "tokens": [
    [0.0, 0.0, 0.0, "... 16 numbers total ..."],
    [0.011, 0.011, 0.0, "... 16 numbers total ..."],
    "... 96 slots total ..."
  ],
  "metadata": {
    "...": "adapter-specific metadata"
  }
}

The important field is tokens. That is the matrix the model actually sees:

96 time slots x 16 features per slot

Below are simplified examples based on the actual prepared data. Raw text bodies are not shown.

Example 1: Calendar Holiday

Source type: public Thunderbird ICS holiday calendar.

Plain English raw event:

An all-day public holiday appears on 2025-01-01.

Prepared signal:

slot 0:
  calendar_event = 1.0
  holiday = 1.0
  duration = 1.0
  source_confidence = 1.0

slot 1:
  slot_position = 0.011
  day_position = 0.011
  calendar_event = 1.0
  holiday = 1.0
  duration = 1.0
  source_confidence = 1.0

What this teaches the model:

The model sees that all-day calendar items occupy broad portions of the window and look different from short task or email activity.

Example 2: OpenProject Task

Source type: public OpenProject work package.

Plain English raw record:

A project task was created on 2019-08-28 around 14:01.
The record has task/project metadata, a priority-like structure, and a text description length.

Prepared signal:

slot 56:
  task = 1.0
  project_activity = 1.0
  priority = 0.5
  text_length = 0.114
  source_confidence = 0.85

Why slot 56:

There are 96 slots in a day. Slot 56 is around 14:00 because each slot is about 15 minutes.

What this teaches the model:

The model learns that issue tracker tasks are task-like and project-like, often tied to a specific time or lifecycle event.

Example 3: Public Jira Issue

Source type: normalized Public Jira issue.

Plain English raw record:

A Jira issue was created on 2021-12-30 around 06:48.
The issue has a summary and description, and may have due-date fields.

Prepared signal:

slot 27:
  task = 1.0
  project_activity = 1.0
  text_length = 0.019
  source_confidence = 0.8

What this teaches the model:

The model sees public issue tracker records as work items with project context and text size signals.

Example 4: Enron Email

Source type: public Enron maildir message.

Plain English raw record:

An email was sent on 2001-05-04 around 20:51.
The email had recipients and body text.

Prepared signal:

slot 83:
  email_activity = 1.0
  participant_count = 0.08
  text_length = 0.393
  source_confidence = 0.8

What this teaches the model:

The model learns communication patterns, recipient-count signals, and how email-derived activity differs from calendar or issue tracker activity.

Example 5: GitHub Event

Source type: GH Archive public GitHub event.

Plain English raw record:

A public GitHub event happened on 2026-06-09.
The event is code/project activity.

Prepared signal:

slot 0:
  project_activity = 1.0
  code_activity = 1.0
  text_length = 0.004
  source_confidence = 0.9

What this teaches the model:

The model learns that GitHub events are both project activity and code activity.

Example 6: MS-LaTTE Task Preference

Source type: MS-LaTTE task timing preference example.

Plain English raw record:

A task preference example indicates when a task may be appropriate.
It includes timing and participant-like or location-like judgments.

Prepared signal:

slot 36:
  task = 1.0
  priority = 0.5
  participant_count = 0.6
  text_length = 0.008
  source_confidence = 0.9

What this teaches the model:

The model gets examples of task timing preference, not just activity logs.

Step-by-Step Training Process

Step 1: Review The Existing Training Plan

The existing training docs described three main requirements:

Keep raw public data separate from processed model inputs.
Convert every dataset into the same fixed window schema.
Train from prepared train.jsonl.gz and validate on prepared val.jsonl.gz.

Step 2: Confirm Source Data Was Available

The source cache contained the expected public data groups:

Source group	Local size
`public_jira`	6.3G
`enron`	3.0G
`dialogue_semantics`	2.9G
`gharchive`	531M
`smart_todo`	77M
`epa`	21M
`project_apis`	12M
`ms_latte`	11M
`calendar_ics`	2.2M

Not every cached source group was part of the final trainable set. The training registry controlled which prepared datasets were included.

Step 3: Restore And Normalize Public Jira

The Public Jira archive was restored and exported into normalized JSON Lines. The normalized export contained:

2,686,282 Jira issue records

Each line had fields like:

{
  "id": "Apache:<issue_id>",
  "key": "<issue_key>",
  "created": "2021-12-30T06:48:04",
  "dueDate": null,
  "summary": "<issue summary>",
  "description": "<issue description>"
}

The actual model did not train on this raw JSON directly. The data prep adapter converted it into the numeric 96 by 16 token format.

Step 4: Prepare All Trainable Datasets

The preparation command converted raw datasets into common train and validation files:

/home/lcrh/miniforge3/envs/jepa-scheduler/bin/python -m jepa_scheduler.data_prep prepare \
  --dataset all \
  --data-root /home/lcrh/train_data \
  --processed-dir data/processed \
  --reports-dir /srv/jepa/outputs/reports/jepa-training \
  --run-id full-20260611T035946Z

Step 5: Use Train And Validation Splits

The prepared records were split into:

training windows, used to update model weights;
validation windows, held out from weight updates and used only to measure loss.

For confirmation training, seeded random splits were implemented. That allowed the system to reshuffle which records were held out for validation without reparsing all raw data.

The final split metadata was:

split_seed = 2026061106
split_mode = random_resplit
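
One way to re-split prepared records by seed without reparsing raw data is to derive each record's assignment from a hash of the seed and the record's source_id. The sketch below illustrates that idea and is not the project's split code. The function name, the hashing scheme, and the 10 percent validation fraction are assumptions (the fraction is inferred from the roughly 90/10 counts reported in this document).

```python
import hashlib

VAL_FRACTION = 0.10  # assumption: inferred from the roughly 90/10 counts


def is_validation(source_id, split_seed, val_fraction=VAL_FRACTION):
    """Deterministically assign a record to validation for a given seed."""
    digest = hashlib.sha256(f"{split_seed}:{source_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < val_fraction


ids = [f"example:{i}" for i in range(10_000)]  # made-up source ids
val_a = {i for i in ids if is_validation(i, 2026061105)}
val_b = {i for i in ids if is_validation(i, 2026061106)}
print(len(val_a), len(val_b), len(val_a & val_b))
```

An independent per-record draw like this gives validation sets whose sizes differ slightly from seed to seed. That is consistent with the per-pass validation counts reported in the training results, although the source does not say how the split is actually implemented.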

Step 6: Run A CUDA Smoke Training Test

Before running long training, a short smoke test verified that the data and GPU training path worked. The smoke run used CUDA and produced a combined validation loss of:

0.7946688532829285

This was not meant to be a good model. It only proved that:

prepared data could be read;
batches could be streamed;
the model could train on CUDA;
checkpoints and reports could be written.

Step 7: Add Random Confirmation Training Support

The training code was extended so repeated confirmation runs could:

randomly choose a new validation holdout;
train datasets in random order;
repeat the full dataset sequence multiple times;
write separate checkpoints and reports for each pass.

This matters because a single validation split can be misleading. Repeating the run with different held out data gives a better signal that the model is learning generally, not just doing well on one lucky split.

Step 8: Run Six Randomized Confirmation Passes

The final training sequence ran six full confirmation passes.

Each pass did this:

Re-split every completed prepared dataset with a new seed.
Hold out different validation examples.
Randomize the dataset order.
Train one checkpoint per dataset.
Train one combined checkpoint over all trainable datasets.

The pass seeds were:

Pass	Split seed
1	2026061101
2	2026061102
3	2026061103
4	2026061104
5	2026061105
6	2026061106

Each pass trained these 12 checkpoints:

one checkpoint for calendar_ics_thunderbird;
one checkpoint for project_openproject;
one checkpoint for project_taiga;
one checkpoint for gharchive_2026_06_09;
one checkpoint for ms_latte;
one checkpoint for epa;
one checkpoint for enron_maildir;
one checkpoint for enron_cornell;
one checkpoint for enron_snap;
one checkpoint for public_jira;
one checkpoint for smart_todo_coded;
one combined checkpoint trained across the public trainable datasets.
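
The pass structure described in this step can be sketched as a small planning loop. This is a hypothetical outline, not the project's trainer. The seed formula (a base of 2026061100 plus the pass number) is inferred from the listed seeds, and seeding the order shuffle from the split seed is an assumption.

```python
import random

DATASETS = [
    "calendar_ics_thunderbird", "project_openproject", "project_taiga",
    "gharchive_2026_06_09", "ms_latte", "epa", "enron_maildir",
    "enron_cornell", "enron_snap", "public_jira", "smart_todo_coded",
]
BASE_SEED = 2026061100  # assumption: listed seeds look like date + pass number


def plan_passes(num_passes=6):
    """Return one plan per pass: split seed, dataset order, checkpoint names."""
    plans = []
    for pass_number in range(1, num_passes + 1):
        split_seed = BASE_SEED + pass_number
        order = list(DATASETS)
        # Assumption: the dataset order is shuffled with the split seed.
        random.Random(split_seed).shuffle(order)
        plans.append({
            "pass": pass_number,
            "split_seed": split_seed,
            "order": order,
            # One checkpoint per dataset plus one combined checkpoint.
            "checkpoints": order + ["combined"],
        })
    return plans


plans = plan_passes()
for plan in plans:
    print(plan["pass"], plan["split_seed"], len(plan["checkpoints"]))
```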

Training Results

The combined model result from each confirmation pass was:

Pass	Train windows	Validation windows	Final train loss	Final validation loss
1	6,697,302	744,217	0.0518671572	0.0170253972
2	6,696,556	744,963	0.0355612300	0.0085086927
3	6,697,133	744,386	0.0421437211	0.0122576728
4	6,697,362	744,157	0.0373337157	0.0088611947
5	6,699,353	742,166	0.0469359979	0.0140897143
6	6,697,209	744,310	0.0479964577	0.0182671631

All six combined validation losses were finite and low. They were measured on validation data, not the same batches used for gradient updates.

The validation loss varied by split, which is expected. Different held out records make the validation set a slightly different test each time.
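
To put a number on that spread, the six combined validation losses can be summarized with the standard library. The values are copied from the table above.

```python
import statistics

# Final combined validation loss for passes 1 through 6.
val_losses = [0.0170253972, 0.0085086927, 0.0122576728,
              0.0088611947, 0.0140897143, 0.0182671631]

lowest = min(val_losses)
highest = max(val_losses)
mean = statistics.mean(val_losses)
spread = statistics.stdev(val_losses)  # sample standard deviation

print(f"min   {lowest:.4f}")   # 0.0085
print(f"max   {highest:.4f}")  # 0.0183
print(f"mean  {mean:.4f}")     # 0.0132
print(f"stdev {spread:.4f}")
```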

How Validation Was Done

Validation happened at several levels.

1. Data Preparation Validation

Data prep verified that each prepared window followed the expected schema:

96 slots
16 features per slot
feature values between 0.0 and 1.0
gzip JSONL train and validation files
manifest and stats files for each dataset

Prepared output was checked by counting train and validation records. Public Jira's normalized gzip export was also checked with gzip integrity validation.

This prevented the model from training on malformed records.
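
A schema check of this kind might look like the following sketch. The function name and error messages are hypothetical. The expected shape (96 slots, 16 features, values between 0.0 and 1.0) comes from the list above.

```python
def validate_window(record, num_slots=96, num_features=16):
    """Return a list of problems found in one prepared record (empty if OK)."""
    tokens = record.get("tokens")
    if not isinstance(tokens, list) or len(tokens) != num_slots:
        return [f"expected {num_slots} slots"]
    problems = []
    for i, slot in enumerate(tokens):
        if not isinstance(slot, list) or len(slot) != num_features:
            problems.append(f"slot {i}: expected {num_features} features")
            continue
        for j, value in enumerate(slot):
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            # The range test also rejects NaN, because NaN comparisons are False.
            if not is_number or not 0.0 <= value <= 1.0:
                problems.append(f"slot {i}, feature {j}: {value!r} outside 0.0..1.0")
    return problems


good = {"tokens": [[0.0] * 16 for _ in range(96)]}
bad = {"tokens": [[0.0] * 16 for _ in range(95)] + [[1.5] + [0.0] * 15]}
print(validate_window(good))  # []
print(validate_window(bad))
```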

2. Held-Out Validation Loss During Training

During training, the model was periodically evaluated on val.jsonl.gz records.

Those validation records were not used for optimizer updates. The model only used them to report loss.

The loss is mean squared error between:

the predicted latent representation for hidden slots;
the target encoder representation for the original full window.

Lower validation loss means the model is better at predicting hidden schedule representations from visible context.
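
As a plain-Python illustration of that loss, the sketch below averages squared differences over the latent values of hidden slots only. The function name is hypothetical, and restricting the average to hidden slots is an assumption based on the description above. The real trainer presumably computes this on tensors.

```python
def masked_mse(predicted, target, hidden_mask):
    """Mean squared error over the latent vectors of hidden slots only.

    predicted, target: one latent vector (list of floats) per slot.
    hidden_mask: one boolean per slot, True where the slot was hidden.
    """
    total, count = 0.0, 0
    for pred_vec, tgt_vec, hidden in zip(predicted, target, hidden_mask):
        if not hidden:
            continue  # visible slots do not contribute to the loss
        for p, t in zip(pred_vec, tgt_vec):
            total += (p - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("no hidden slots to score")
    return total / count


# Three toy slots with 2-dimensional latents; slot 0 is visible and ignored.
predicted = [[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]]
target = [[9.0, 9.0], [1.0, 0.0], [0.5, 0.5]]
hidden = [False, True, True]
print(masked_mse(predicted, target, hidden))  # 0.25
```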

3. Repeated Random Confirmation Splits

The model was not validated on only one fixed holdout split.

After the first full prep, the prepared records were split six times using different seeds (2026061101 through 2026061106), so each confirmation pass held out a different set of records for validation.

This is a stronger check than one static validation set because it reduces the chance that the model only looked good due to chance.

4. Randomized Dataset Order

The dataset order was randomized for the confirmation passes.

This reduces order bias. For example, if the model always trained on GitHub before Enron before Jira, the final model might be sensitive to that order. Randomizing the sequence gives a more realistic view of whether training is stable.

5. Fresh Untrained Model Baseline

After training, the final combined checkpoint was compared with a fresh untrained model on held out validation data.

The evaluation used:

checkpoint = /srv/jepa/outputs/checkpoints/random-confirm-20260611T051025Z-pass06/combined/schedule_jepa.pt
device = cuda
datasets = 11
validation batches = 100
batch size = 32

Result:

Model	Validation loss
Trained final model	0.01780038
Fresh untrained model	1.04929552

The trained model's loss was about 59 times lower than the fresh model's on the same validation process. This is the clearest sanity check that training succeeded.

6. Software Tests And Checkpoint Checks

The test suite passed:

24 passed

Checkpoint verification also confirmed that each random confirmation pass wrote 12 checkpoint files.

Findings

Finding 1: The End-to-End Training Pipeline Works

The data can be downloaded or cached, normalized, converted into prepared windows, streamed into the trainer, trained on CUDA, and saved as checkpoints.

This matters because the large datasets are too big to load all at once in ordinary memory. Streaming mode lets training proceed with bounded memory.
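
A minimal sketch of streaming batches from a prepared gzip JSONL file is shown below. It illustrates the bounded-memory idea and is not the project's loader. The function names are hypothetical, and the default batch size of 32 is taken from the model configuration.

```python
import gzip
import json
from itertools import islice


def iter_windows(path):
    """Yield one prepared record at a time from a gzip JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def iter_batches(path, batch_size=32):
    """Group streamed records into lists of at most batch_size token grids."""
    records = iter_windows(path)
    while True:
        batch = list(islice(records, batch_size))
        if not batch:
            return
        # Each item is one 96 x 16 token grid.
        yield [record["tokens"] for record in batch]
```

Because both functions are generators, only one batch of windows is held in memory at a time, regardless of how large the file is.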

Finding 2: The Model Learned The JEPA Objective

The final trained model achieved a much lower validation loss than a fresh untrained model:

trained = 0.01780038
untrained = 1.04929552

This means the model learned useful structure for the masked window prediction task.

Finding 3: Validation Did Not Depend On A Single Holdout

The six confirmation passes each used different validation records.

The combined validation losses remained low across all six passes, ranging from about 0.0085 (pass 2) to about 0.0183 (pass 6). This supports the conclusion that training was stable across different validation splits.

Finding 4: The Model Is Not Yet Proven As A Scheduling Decision Model

The validation shows that the model learned the self-supervised representation task.

It does not yet prove that:

users will accept more schedule suggestions;
deadline misses will decrease;
conflict rates will improve;
the model improves ranking of candidate time blocks.

Those require downstream evaluation after connecting JEPA embeddings into the scheduler or scorer.

Limitations

The current validation is appropriate for pretraining, but it has limits:

The model was trained on public and proxy datasets, not private personal calendar/task history.
Some public datasets are noisy and may contain incomplete timestamps or sparse metadata.
Validation loss measures representation prediction, not direct product usefulness.
Gated sources were skipped and remain excluded until permissions are available.
The model still needs a downstream evaluation against actual scheduling outcomes.

Recommended Next Tests

The next useful testing layer should evaluate whether JEPA embeddings improve scheduling behavior.

Suggested tests:

Add an evaluation CLI that loads a checkpoint and reports validation loss without retraining.
Freeze the final combined JEPA checkpoint and generate embeddings for candidate schedule windows.
Add those embeddings as features to the scheduler/scorer.
Compare scheduler outcomes with and without JEPA features.
Track practical metrics:
- accepted suggestion rate;
- moved block rate;
- deleted or skipped block rate;
- deadline misses;
- meeting conflicts;
- focus block fragmentation;
- user correction distance in minutes.
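
Two of these metrics can be sketched as follows. The record layout (accepted, suggested_start, final_start) is hypothetical and the toy log is made up, since no scheduler outcome data exists yet. The same functions would be run once on outcomes with JEPA features and once without.

```python
from datetime import datetime


def accepted_rate(suggestions):
    """Fraction of suggestions the user accepted."""
    return sum(1 for s in suggestions if s["accepted"]) / len(suggestions)


def mean_correction_minutes(suggestions):
    """Average absolute shift between suggested and final start times."""
    shifts = [
        abs((s["final_start"] - s["suggested_start"]).total_seconds()) / 60.0
        for s in suggestions
        if s["final_start"] is not None  # deleted or skipped blocks have no final time
    ]
    return sum(shifts) / len(shifts) if shifts else 0.0


# Made-up toy log, only to show the assumed record shape.
toy_log = [
    {"accepted": True,
     "suggested_start": datetime(2026, 6, 12, 9, 0),
     "final_start": datetime(2026, 6, 12, 9, 0)},
    {"accepted": False,
     "suggested_start": datetime(2026, 6, 12, 13, 0),
     "final_start": datetime(2026, 6, 12, 14, 30)},
]
print(accepted_rate(toy_log), mean_correction_minutes(toy_log))  # 0.5 45.0
```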