Initial JEPA Training
Date: 2026-06-11
Short Summary
The model was trained as a schedule representation model, not as an automatic scheduler. Training for automatic scheduling is planned for future iterations. As a result, the model did not learn from human labels such as "good schedule" or "bad schedule." Instead, it learned by looking at structured, schedule-like windows, hiding part of each window, and trying to predict the hidden part.
The final training run used 11 public datasets.
The final combined model was trained on about 6.7 million training windows and validated on about 744 thousand held-out windows. Six randomized confirmation passes were run. Each pass used a different random validation holdout and a randomized dataset order.
The strongest validation check compared the trained final checkpoint against a fresh untrained model on validation data:
| Model | Held-out validation loss |
|---|---|
| Trained final model | 0.01780038 |
| Fresh untrained model | 1.04929552 |
The trained model's validation loss was about 98.30 percent lower than the untrained baseline on the same validation procedure. This shows that the model learned the JEPA prediction objective. It does not, by itself, prove that the model improves real scheduling decisions. That requires a downstream scheduler evaluation.
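The percentage follows directly from the two losses in the table. A minimal check of the arithmetic is shown here; the variable names are illustrative and not taken from the project code:

```python
# Validation losses reported in the table above.
trained_loss = 0.01780038
untrained_loss = 1.04929552

# Relative reduction of the trained model versus the untrained baseline.
relative_reduction = 1.0 - trained_loss / untrained_loss

print(f"{relative_reduction * 100:.2f} percent lower")  # prints: 98.30 percent lower
```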
What The Model Learns
The model is a JEPA (Joint Embedding Predictive Architecture) style model for schedule and workflow data.
In this project, the model learns representations of schedule-like activity windows. A window is one day, or one task-like time context, represented as a fixed numeric grid.
The training task works like this:
- Take one prepared schedule window.
- Hide a span of time slots from the model.
- Let the context encoder read only the visible slots.
- Let the target encoder read the full original window.
- Train the predictor to make the context representation match the hidden target representation.
- Update the target encoder slowly using an exponential moving average of the context encoder.
In simpler terms: the model repeatedly sees partial schedule patterns and learns to predict the missing part in representation space.
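Two of the steps above, hiding a span and slowly updating the target encoder, can be sketched in plain Python. This is a sketch under stated assumptions, not the project's trainer: it assumes one contiguous hidden span per window, and the function names `sample_span_mask` and `ema_update` are hypothetical. The constants come from the configuration table in this report.

```python
import random

SLOTS_PER_WINDOW = 96   # from the configuration table
MASK_RATIO = 0.40       # from the configuration table
EMA_DECAY = 0.99        # from the configuration table


def sample_span_mask(rng: random.Random) -> list[bool]:
    """Return a mask with one contiguous hidden span (True = hidden)."""
    span = round(SLOTS_PER_WINDOW * MASK_RATIO)          # 38 of 96 slots
    start = rng.randrange(SLOTS_PER_WINDOW - span + 1)   # keep the span in range
    return [start <= i < start + span for i in range(SLOTS_PER_WINDOW)]


def ema_update(target: list[float], context: list[float]) -> list[float]:
    """Move each target-encoder weight a small step toward the context encoder."""
    return [EMA_DECAY * t + (1.0 - EMA_DECAY) * c for t, c in zip(target, context)]


mask = sample_span_mask(random.Random(0))
visible_slots = [i for i, hidden in enumerate(mask) if not hidden]

# Toy weights: the target moves 1 percent of the way toward the context value.
target_weights = ema_update([0.0, 1.0], [1.0, 1.0])
```

With a decay of 0.99, the target encoder changes by only 1 percent of the gap per update, which is what keeps the prediction target stable while the context encoder trains.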
This is useful because the model can learn patterns such as:
- which times tend to contain tasks;
- how email-derived work differs from project-tracker work;
- how code activity, meetings, holidays, deadlines, and priorities appear in time;
- how different workflow sources map into a common schedule-like structure.
The current model is not yet a full scheduling assistant by itself. It is a representation model that can later be connected to the scheduler and scorer.
Model Configuration
The final combined checkpoint used this configuration:
| Setting | Value |
|---|---|
| Input features per slot | 16 |
| Slots per window | 96 |
| Approximate time represented by each slot | 15 minutes in a 24-hour day |
| Model dimension | 64 |
| Transformer depth | 2 layers |
| Attention heads | 4 |
| MLP dimension | 128 |
| Mask ratio | 0.40 |
| EMA decay for target encoder | 0.99 |
| Learning rate | 0.001 |
| Batch size | 32 |
| Steps per dataset training run | 1000 |
| Load mode | streaming |
| Device | CUDA |
Each window has 96 time slots. A 24-hour day has 96 fifteen-minute slots, so slot 0 is around midnight, slot 48 is around noon, and slot 95 is near the end of the day.
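The slot arithmetic can be written as a small helper. This is a minimal sketch that assumes times are floored to the start of their 15-minute slot, which is consistent with the worked examples in this report; `time_to_slot` is a hypothetical name, and the real adapters may handle time zones or rounding differently.

```python
MINUTES_PER_SLOT = 15
SLOTS_PER_WINDOW = 96


def time_to_slot(hour: int, minute: int) -> int:
    """Map a time of day to its 15-minute slot index (0 to 95)."""
    if not (0 <= hour < 24 and 0 <= minute < 60):
        raise ValueError("expected hour in 0..23 and minute in 0..59")
    return (hour * 60 + minute) // MINUTES_PER_SLOT


print(time_to_slot(0, 0))    # 0  (midnight)
print(time_to_slot(12, 0))   # 48 (noon)
print(time_to_slot(23, 59))  # 95 (last slot of the day)
```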
Each slot has 16 numeric features:
| Index | Feature | Plain-English meaning |
|---|---|---|
| 0 | slot_position | Where this slot is inside the window. |
| 1 | day_position | Where this slot is inside the day. |
| 2 | calendar_event | Calendar event or calendar-like occupancy. |
| 3 | task | Task, issue, to-do, or action item signal. |
| 4 | email_activity | Email or communication activity. |
| 5 | project_activity | Project, issue tracker, or planning system activity. |
| 6 | code_activity | GitHub or code hosting activity. |
| 7 | holiday | Public holiday or all-day holiday signal. |
| 8 | deadline | Due date or deadline signal. |
| 9 | duration | Normalized duration signal. |
| 10 | priority | Normalized priority estimate. |
| 11 | participant_count | Normalized count of people, recipients, actors, or locations. |
| 12 | text_length | Normalized length of title, summary, body, or description text. |
| 13 | completion | Closed, completed, or percentage done signal. |
| 14 | source_confidence | Adapter confidence based on source fidelity. |
| 15 | reserved | Reserved for future use. |
All feature values are floating point numbers clamped between 0.0 and 1.0.
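One way to work with the 16 features by name is an index lookup plus a clamp. The sketch below is illustrative only: the feature names and order come from the table above, while `FEATURE_INDEX` and `make_slot` are hypothetical helpers rather than part of the project code.

```python
FEATURE_NAMES = [
    "slot_position", "day_position", "calendar_event", "task",
    "email_activity", "project_activity", "code_activity", "holiday",
    "deadline", "duration", "priority", "participant_count",
    "text_length", "completion", "source_confidence", "reserved",
]
FEATURE_INDEX = {name: i for i, name in enumerate(FEATURE_NAMES)}


def make_slot(**signals: float) -> list[float]:
    """Build one 16-value slot vector; unspecified features stay at 0.0."""
    slot = [0.0] * len(FEATURE_NAMES)
    for name, value in signals.items():
        # Clamp every value into the documented 0.0 to 1.0 range.
        slot[FEATURE_INDEX[name]] = min(1.0, max(0.0, float(value)))
    return slot


# Values mirror the OpenProject example in this report (slot 56).
slot = make_slot(task=1.0, project_activity=1.0, priority=0.5,
                 text_length=0.114, source_confidence=0.85)
```

An unknown feature name raises `KeyError`, which is a deliberate choice in this sketch so that typos do not silently produce empty signals.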
Datasets Used
These datasets were used for training:
| Dataset | What it represents | Raw windows | Final train windows | Final validation windows |
|---|---|---|---|---|
| calendar_ics_thunderbird | Public calendar and holiday ICS files | 4,881 | 4,383 | 498 |
| project_openproject | Public OpenProject work packages | 1,000 | 897 | 103 |
| project_taiga | Public Taiga issues, tasks, and user stories | 1,179 | 1,085 | 94 |
| gharchive_2026_06_09 | Public GitHub event activity | 3,800,471 | 3,420,346 | 380,125 |
| ms_latte | Task timing preference examples | 10,101 | 9,126 | 975 |
| epa | Email-derived task assignment examples | 6,734 | 6,084 | 650 |
| enron_maildir | Public Enron email corpus | 517,371 | 465,821 | 51,550 |
| enron_cornell | Public Enron graph/email derivative | 21,768 | 19,570 | 2,198 |
| enron_snap | Public Enron communication graph derivative | 367,662 | 331,019 | 36,643 |
| public_jira | Public Jira issues exported from the restored archive | 2,686,282 | 2,417,273 | 269,009 |
| smart_todo_coded | Public coded SmartToDo task data | 24,070 | 21,605 | 2,465 |
Final train and validation counts above are from the last randomized split, split_seed=2026061106.
Two planned sources were not trained:
| Dataset | Reason |
|---|---|
| smart_todo_decoded_gated | Blocked because decoded data requires gated Avocado access. |
| blocked_sources | Placeholder for sources that require credentials, approval, or manual agreements. |
Blocked datasets wrote empty processed files so the pipeline had a complete record of what was skipped. They contributed zero train windows and were not included in the all-public training run.
What The Sample Data Looked Like
The model did not train directly on raw email text, raw issue descriptions, or raw calendar files. Each raw record was converted into a normalized numeric window.
A prepared record has this overall shape:
{
  "dataset": "public_jira",
  "source_id": "public_jira:<redacted_issue_id>",
  "window_kind": "project_task",
  "window_start": "2021-12-30T06:48:04",
  "window_end": null,
  "tokens": [
    [0.0, 0.0, 0.0, "... 16 numbers total ..."],
    [0.011, 0.011, 0.0, "... 16 numbers total ..."],
    "... 96 slots total ..."
  ],
  "metadata": {
    "...": "adapter-specific metadata"
  }
}
The important field is tokens. That is the matrix the model actually sees:
96 time slots x 16 features per slot
Below are simplified examples based on the actual prepared data. Raw text bodies are not shown.
Example 1: Calendar Holiday
Source type: public Thunderbird ICS holiday calendar.
Plain-English raw event:
An all-day public holiday appears on 2025-01-01.
Prepared signal:
slot 0:
  calendar_event = 1.0
  holiday = 1.0
  duration = 1.0
  source_confidence = 1.0
slot 1:
  slot_position = 0.011
  day_position = 0.011
  calendar_event = 1.0
  holiday = 1.0
  duration = 1.0
  source_confidence = 1.0
What this teaches the model:
The model sees that all-day calendar items occupy broad portions of the window and look different from short task or email activity.
Example 2: OpenProject Task
Source type: public OpenProject work package.
Plain-English raw record:
A project task was created on 2019-08-28 around 14:01.
The record has task/project metadata, priority-like structure, and text description length.
Prepared signal:
slot 56:
  task = 1.0
  project_activity = 1.0
  priority = 0.5
  text_length = 0.114
  source_confidence = 0.85
Why slot 56:
There are 96 slots in a day. Slot 56 starts around 14:00 because each slot is about 15 minutes (56 x 15 = 840 minutes, or 14 hours, after midnight).
What this teaches the model:
The model learns that issue tracker tasks are task-like and project-like, often tied to a specific time or lifecycle event.
Example 3: Public Jira Issue
Source type: normalized Public Jira issue.
Plain-English raw record:
A Jira issue was created on 2021-12-30 around 06:48.
The issue has a summary and description, and may have due-date fields.
Prepared signal:
slot 27:
  task = 1.0
  project_activity = 1.0
  text_length = 0.019
  source_confidence = 0.8
What this teaches the model:
The model sees public issue tracker records as work items with project context and text-size signals.
Example 4: Enron Email
Source type: public Enron maildir message.
Plain-English raw record:
An email was sent on 2001-05-04 around 20:51.
The email had recipients and body text.
Prepared signal:
slot 83:
  email_activity = 1.0
  participant_count = 0.08
  text_length = 0.393
  source_confidence = 0.8
What this teaches the model:
The model learns communication patterns, recipient-count signals, and how email-derived activity differs from calendar or issue tracker activity.
Example 5: GitHub Event
Source type: GH Archive public GitHub event.
Plain-English raw record:
A public GitHub event happened on 2026-06-09.
The event is code/project activity.
Prepared signal:
slot 0:
  project_activity = 1.0
  code_activity = 1.0
  text_length = 0.004
  source_confidence = 0.9
What this teaches the model:
The model learns that GitHub events are both project activity and code activity.
Example 6: MS-LaTTE Task Preference
Source type: MS-LaTTE task timing preference example.
Plain-English raw record:
A task preference example indicates when a task may be appropriate.
It includes timing and participant-like or location-like judgments.
Prepared signal:
slot 36:
  task = 1.0
  priority = 0.5
  participant_count = 0.6
  text_length = 0.008
  source_confidence = 0.9
What this teaches the model:
The model gets examples of task timing preference, not just activity logs.
Step-by-Step Training Process
Step 1: Review The Existing Training Plan
The existing training docs described three main requirements:
- Keep raw public data separate from processed model inputs.
- Convert every dataset into the same fixed window schema.
- Train from prepared train.jsonl.gz and validate on prepared val.jsonl.gz.
Step 2: Confirm Source Data Was Available
The source cache contained the expected public data groups:
| Source group | Local size |
|---|---|
| public_jira | 6.3G |
| enron | 3.0G |
| dialogue_semantics | 2.9G |
| gharchive | 531M |
| smart_todo | 77M |
| epa | 21M |
| project_apis | 12M |
| ms_latte | 11M |
| calendar_ics | 2.2M |
Not every cached source group was part of the final trainable set. The training registry controlled which prepared datasets were included.
Step 3: Restore And Normalize Public Jira
The Public Jira archive was restored and exported into normalized JSON Lines.
The normalized export contained:
2,686,282 Jira issue records
Each line had fields like:
{
  "id": "Apache:<issue_id>",
  "key": "<issue_key>",
  "created": "2021-12-30T06:48:04",
  "dueDate": null,
  "summary": "<issue summary>",
  "description": "<issue description>"
}
The actual model did not train on this raw JSON directly. The data prep adapter converted it into the numeric 96 by 16 token format.
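To make the conversion concrete, here is a minimal sketch of what such an adapter could look like. It is not the project's adapter. The function name `jira_issue_to_tokens`, the `MAX_TEXT_CHARS` normalization constant, and the position formula `i / 95` are assumptions chosen to be consistent with the examples in this report (slot 27 for 06:48, a position of about 0.011 for slot 1, and a source confidence of 0.8).

```python
from datetime import datetime

SLOTS, FEATURES = 96, 16
# Feature indices from the feature table in this report.
SLOT_POSITION, DAY_POSITION, TASK, PROJECT_ACTIVITY = 0, 1, 3, 5
TEXT_LENGTH, SOURCE_CONFIDENCE = 12, 14
MAX_TEXT_CHARS = 10_000  # hypothetical normalization constant


def jira_issue_to_tokens(issue: dict) -> list[list[float]]:
    """Convert one normalized Jira record into a 96 x 16 grid of floats."""
    tokens = [[0.0] * FEATURES for _ in range(SLOTS)]
    for i, slot in enumerate(tokens):
        # Assumed position encoding: fraction of the window (slot 1 is about 0.011).
        slot[SLOT_POSITION] = slot[DAY_POSITION] = i / (SLOTS - 1)

    created = datetime.fromisoformat(issue["created"])
    event_slot = (created.hour * 60 + created.minute) // 15  # floor to 15 minutes
    text = (issue.get("summary") or "") + (issue.get("description") or "")

    event = tokens[event_slot]
    event[TASK] = 1.0
    event[PROJECT_ACTIVITY] = 1.0
    event[TEXT_LENGTH] = min(len(text) / MAX_TEXT_CHARS, 1.0)
    event[SOURCE_CONFIDENCE] = 0.8  # value shown in the Jira example
    return tokens


tokens = jira_issue_to_tokens({
    "created": "2021-12-30T06:48:04",
    "dueDate": None,
    "summary": "<issue summary>",
    "description": "<issue description>",
})
```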
Step 4: Prepare All Trainable Datasets
The preparation command converted raw datasets into common train and validation files:
/home/lcrh/miniforge3/envs/jepa-scheduler/bin/python -m jepa_scheduler.data_prep prepare \
--dataset all \
--data-root /home/lcrh/train_data \
--processed-dir data/processed \
--reports-dir /srv/jepa/outputs/reports/jepa-training \
--run-id full-20260611T035946Z
Step 5: Use Train And Validation Splits
The prepared records were split into:
- training windows, used to update model weights;
- validation windows, held out from weight updates and used only to measure loss.
For confirmation training, seeded random splits were implemented. That allowed the system to reshuffle which records were held out for validation without reparsing all raw data.
The final split metadata was:
split_seed = 2026061106
split_mode = random_resplit
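A seeded re-split can be sketched as follows. This is an illustration under assumptions, not the project's implementation: it assumes each record is independently held out with a probability of about 10 percent, which roughly matches the train and validation counts reported above. `random_resplit` and `VALIDATION_FRACTION` are hypothetical names.

```python
import random

VALIDATION_FRACTION = 0.10  # assumption: roughly matches the reported counts


def random_resplit(records: list, split_seed: int) -> tuple[list, list]:
    """Assign each prepared record to train or validation using a seeded RNG."""
    rng = random.Random(split_seed)
    train, validation = [], []
    for record in records:
        bucket = validation if rng.random() < VALIDATION_FRACTION else train
        bucket.append(record)
    return train, validation


# Toy records stand in for prepared windows.
records = [f"window-{i}" for i in range(1000)]
train, validation = random_resplit(records, split_seed=2026061106)
```

Because the split depends only on the seed and the record order, the same seed reproduces the same holdout, and a new seed changes it without touching the raw data.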
Step 6: Run A CUDA Smoke Training Test
Before running long training, a short smoke test verified that the data and GPU training path worked.
Smoke training used CUDA and produced a combined validation loss of:
0.7946688532829285
This was not meant to be a good model. It only proved that:
- prepared data could be read;
- batches could be streamed;
- the model could train on CUDA;
- checkpoints and reports could be written.
Step 7: Add Random Confirmation Training Support
The training code was extended so repeated confirmation runs could:
- randomly choose a new validation holdout;
- train datasets in random order;
- repeat the full dataset sequence multiple times;
- write separate checkpoints and reports for each pass.
This matters because a single validation split can be misleading. Repeating the run with different held-out data gives a better signal that the model is learning generally, not just doing well on one lucky split.
Step 8: Run Six Randomized Confirmation Passes
The final training sequence ran six full confirmation passes.
Each pass did this:
- Re-split every completed prepared dataset with a new seed.
- Hold out different validation examples.
- Randomize the dataset order.
- Train one checkpoint per dataset.
- Train one combined checkpoint over all trainable datasets.
The pass seeds were:
| Pass | Split seed |
|---|---|
| 1 | 2026061101 |
| 2 | 2026061102 |
| 3 | 2026061103 |
| 4 | 2026061104 |
| 5 | 2026061105 |
| 6 | 2026061106 |
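The per-pass plan can be sketched as a seed plus a shuffled dataset order. The dataset names and seeds come from this report. The helper `plan_pass` is hypothetical, and using the split seed to shuffle the order is an assumption; the real trainer may seed the ordering separately.

```python
import random

DATASETS = [
    "calendar_ics_thunderbird", "project_openproject", "project_taiga",
    "gharchive_2026_06_09", "ms_latte", "epa", "enron_maildir",
    "enron_cornell", "enron_snap", "public_jira", "smart_todo_coded",
]
BASE_SEED = 2026061100  # pass N uses BASE_SEED + N, matching the table


def plan_pass(pass_number: int) -> dict:
    """Return the split seed and a shuffled dataset order for one pass."""
    split_seed = BASE_SEED + pass_number
    order = DATASETS[:]
    # Assumption: the dataset order is shuffled with the split seed.
    random.Random(split_seed).shuffle(order)
    return {"pass": pass_number, "split_seed": split_seed, "order": order}


plans = [plan_pass(n) for n in range(1, 7)]
```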
Each pass trained these 12 checkpoints:
- one checkpoint for calendar_ics_thunderbird;
- one checkpoint for project_openproject;
- one checkpoint for project_taiga;
- one checkpoint for gharchive_2026_06_09;
- one checkpoint for ms_latte;
- one checkpoint for epa;
- one checkpoint for enron_maildir;
- one checkpoint for enron_cornell;
- one checkpoint for enron_snap;
- one checkpoint for public_jira;
- one checkpoint for smart_todo_coded;
- one combined checkpoint trained across the public trainable datasets.
Training Results
The combined model result from each confirmation pass was:
| Pass | Train windows | Validation windows | Final train loss | Final validation loss |
|---|---|---|---|---|
| 1 | 6,697,302 | 744,217 | 0.0518671572 | 0.0170253972 |
| 2 | 6,696,556 | 744,963 | 0.0355612300 | 0.0085086927 |
| 3 | 6,697,133 | 744,386 | 0.0421437211 | 0.0122576728 |
| 4 | 6,697,362 | 744,157 | 0.0373337157 | 0.0088611947 |
| 5 | 6,699,353 | 742,166 | 0.0469359979 | 0.0140897143 |
| 6 | 6,697,209 | 744,310 | 0.0479964577 | 0.0182671631 |
All six combined validation losses were finite and low. They were measured on validation data, not the same batches used for gradient updates.
The validation loss varied by split, which is expected. Different held-out records make the validation set a slightly different test each time.
How Validation Was Done
Validation happened at several levels.
1. Data Preparation Validation
Data prep verified that each prepared window followed the expected schema:
96 slots
16 features per slot
feature values between 0.0 and 1.0
gzip JSONL train and validation files
manifest and stats files for each dataset
Prepared output was checked by counting train and validation records. Public Jira's normalized gzip export was also checked with gzip integrity validation.
This prevented the model from training on malformed records.
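The schema checks listed above can be sketched as a single validation function. This is an illustrative version, not the project's data prep code; `validate_tokens` is a hypothetical name.

```python
def validate_tokens(tokens: list) -> None:
    """Raise ValueError if a window is not 96 x 16 with values in [0.0, 1.0]."""
    if len(tokens) != 96:
        raise ValueError(f"expected 96 slots, got {len(tokens)}")
    for i, slot in enumerate(tokens):
        if len(slot) != 16:
            raise ValueError(f"slot {i}: expected 16 features, got {len(slot)}")
        for j, value in enumerate(slot):
            # Reject booleans and non-numeric values such as strings.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                raise ValueError(f"slot {i} feature {j}: not a number")
            # NaN fails this comparison too, so it is rejected as out of range.
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"slot {i} feature {j}: {value} out of range")


validate_tokens([[0.0] * 16 for _ in range(96)])  # passes silently
```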
2. Held-Out Validation Loss During Training
During training, the model was periodically evaluated on val.jsonl.gz records.
Those validation records were not used for optimizer updates. The model only used them to report loss.
The loss is mean squared error between:
- the predicted latent representation for hidden slots;
- the target encoder representation for the original full window.
Lower validation loss means the model is better at predicting hidden schedule representations from visible context.
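The loss can be illustrated with plain Python lists. The sketch assumes the error is averaged only over hidden slots and over every latent dimension of those slots; the trainer's exact reduction is not stated in this report, and `masked_mse` is a hypothetical name.

```python
def masked_mse(predicted, target, hidden):
    """Mean squared error over the latent vectors of hidden slots only."""
    total, count = 0.0, 0
    for pred_vec, target_vec, is_hidden in zip(predicted, target, hidden):
        if not is_hidden:
            continue  # visible slots do not contribute to the loss
        for p, t in zip(pred_vec, target_vec):
            total += (p - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("no hidden slots to score")
    return total / count


# Three slots with 2-dimensional latents; only the last two are hidden.
predicted = [[9.0, 9.0], [1.0, 2.0], [0.0, 0.0]]
target = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
loss = masked_mse(predicted, target, hidden=[False, True, True])  # 2.0
```

The first slot has a large error but is visible, so it is ignored. The two hidden slots contribute squared errors of 0, 4, 0, and 4, giving a mean of 2.0.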
3. Repeated Random Confirmation Splits
The model was not validated on only one fixed holdout split.
After the first full prep, the prepared records were split six times using different seeds:
2026061101
2026061102
2026061103
2026061104
2026061105
2026061106
That means each confirmation pass held out a different set of records for validation.
This is a stronger check than one static validation set because it reduces the risk that the model only looked good because of one favorable split.
4. Randomized Dataset Order
The dataset order was randomized for the confirmation passes.
This reduces order bias. For example, if the model always trained on GitHub before Enron before Jira, the final model might be sensitive to that order. Randomizing the sequence gives a more realistic view of whether training is stable.
5. Fresh Untrained Model Baseline
After training, the final combined checkpoint was compared with a fresh untrained model on held-out validation data.
The evaluation used:
checkpoint = /srv/jepa/outputs/checkpoints/random-confirm-20260611T051025Z-pass06/combined/schedule_jepa.pt
device = cuda
datasets = 11
validation batches = 100
batch size = 32
Result:
| Model | Validation loss |
|---|---|
| Trained final model | 0.01780038 |
| Fresh untrained model | 1.04929552 |
The trained model was much better than the fresh model on the same validation process. This is the clearest sanity check that training succeeded.
6. Software Tests And Checkpoint Checks
The test suite passed:
24 passed
Checkpoint verification also confirmed that each random confirmation pass wrote 12 checkpoint files.
Findings
Finding 1: The End-to-End Training Pipeline Works
The data can be downloaded or cached, normalized, converted into prepared windows, streamed into the trainer, trained on CUDA, and saved as checkpoints.
This matters because the large datasets are too big to load all at once in ordinary memory. Streaming mode lets training proceed with bounded memory.
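Bounded-memory streaming over a gzip JSONL file can be sketched with the standard library. This is an illustration of the idea, not the project's loader; `stream_batches` is a hypothetical name, and the default batch size of 32 is taken from the configuration table.

```python
import gzip
import json
from typing import Iterator


def stream_batches(path: str, batch_size: int = 32) -> Iterator[list]:
    """Yield lists of token matrices from a gzip JSONL file, one batch at a time."""
    batch = []
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:              # decompresses and reads line by line
            if not line.strip():
                continue                 # skip blank lines
            batch.append(json.loads(line)["tokens"])
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:                            # final partial batch
        yield batch
```

Only one batch is held in memory at a time, so memory use depends on the batch size rather than on the size of the dataset. A caller would iterate with `for batch in stream_batches(path)`, where `path` points at a prepared train.jsonl.gz file.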
Finding 2: The Model Learned The JEPA Objective
The final trained model achieved a much lower validation loss than a fresh untrained model:
trained = 0.01780038
untrained = 1.04929552
This means the model learned useful structure for the masked window prediction task.
Finding 3: Validation Was Not Dependent On One Single Holdout
The six confirmation passes each used different validation records.
The combined validation losses remained low across all six passes:
0.0170
0.0085
0.0123
0.0089
0.0141
0.0183
This supports the conclusion that training was stable across different validation splits.
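The spread across passes can be summarized with the standard library. This is a small sketch using the unrounded losses from the results table; the variable names are illustrative.

```python
import statistics

# Combined validation losses from the six confirmation passes.
validation_losses = [
    0.0170253972, 0.0085086927, 0.0122576728,
    0.0088611947, 0.0140897143, 0.0182671631,
]

mean_loss = statistics.mean(validation_losses)
spread = statistics.stdev(validation_losses)   # sample standard deviation
worst, best = max(validation_losses), min(validation_losses)

print(f"mean={mean_loss:.4f} stdev={spread:.4f} best={best:.4f} worst={worst:.4f}")
```

Even the worst pass stays far below the untrained baseline of 1.04929552, which is the comparison that matters for this finding.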
Finding 4: The Model Is Not Yet Proven As A Scheduling Decision Model
The validation shows that the model learned the self-supervised representation task.
It does not yet prove that:
- users will accept more schedule suggestions;
- deadline misses will decrease;
- conflict rates will improve;
- the model improves ranking of candidate time blocks.
Those require downstream evaluation after connecting JEPA embeddings into the scheduler or scorer.
Limitations
The current validation is appropriate for pretraining, but it has limits:
- The model was trained on public and proxy datasets, not private personal calendar/task history.
- Some public datasets are noisy and may contain incomplete timestamps or sparse metadata.
- Validation loss measures representation prediction, not direct product usefulness.
- Gated sources were skipped until permissions are available.
- The model still needs a downstream evaluation against actual scheduling outcomes.
Recommended Next Tests
The next useful testing layer should evaluate whether JEPA embeddings improve scheduling behavior.
Suggested tests:
- Add an evaluation CLI that loads a checkpoint and reports validation loss without retraining.
- Freeze the final combined JEPA checkpoint and generate embeddings for candidate schedule windows.
- Add those embeddings as features to the scheduler/scorer.
- Compare scheduler outcomes with and without JEPA features.
- Track practical metrics:
- accepted suggestion rate;
- moved block rate;
- deleted or skipped block rate;
- deadline misses;
- meeting conflicts;
- focus block fragmentation;
- user correction distance in minutes.
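A few of these metrics could be computed from logged suggestion outcomes along the following lines. Everything here is hypothetical: the `SuggestionOutcome` record, its fields, and the toy values are placeholders for whatever the scheduler eventually logs, and they are not results.

```python
from dataclasses import dataclass


@dataclass
class SuggestionOutcome:
    """Hypothetical log record for one suggested time block."""
    accepted: bool
    moved_minutes: int = 0   # how far the user moved the block, 0 if untouched


def summarize(outcomes: list[SuggestionOutcome]) -> dict:
    """Compute accepted-suggestion rate, moved-block rate, and mean correction."""
    if not outcomes:
        raise ValueError("no outcomes to summarize")
    moved = [o.moved_minutes for o in outcomes if o.moved_minutes > 0]
    return {
        "accepted_rate": sum(o.accepted for o in outcomes) / len(outcomes),
        "moved_rate": len(moved) / len(outcomes),
        "mean_correction_minutes": sum(moved) / len(moved) if moved else 0.0,
    }


# Toy data only: four made-up outcomes to show the shape of the summary.
baseline = summarize([SuggestionOutcome(True), SuggestionOutcome(False),
                      SuggestionOutcome(True, 30), SuggestionOutcome(False, 60)])
```

Running the same summary on scheduler output with and without JEPA features would give the side-by-side comparison suggested above.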