Project Overview

This research project aims to develop a commercial scheduling assistant called Dainvo Planner.

Dainvo

Dainvo is an app for planning, task management, meeting planning, and scheduling. It is designed to help users organize work, manage schedules, and plan meetings in one place.

This project adds a calendar and scheduling assistant to support training information inside Dainvo. The goal is to make training-related dates, schedule details, and planning tasks easier to view, organize, and manage through the app.

General Information on the JEPA Model

What JEPA means

JEPA stands for Joint Embedding Predictive Architecture. In plain English, it is a way to train AI systems to understand the world by predicting meaning, not by copying every small detail.

A normal generative AI model may try to predict the next word, draw missing pixels, or recreate missing parts of an image. A JEPA model works differently. It looks at part of an image, video, sound, or other data, then tries to predict the hidden part in a compact meaning space. This compact meaning space is often called an embedding space.

Meta describes V-JEPA as a non-generative model that predicts missing or masked parts of video in an abstract representation space rather than comparing raw pixels.

Simple example

Imagine watching a video where someone reaches for a cup on a table. A person can guess what may happen next because they understand objects, motion, gravity, and intent.

A JEPA model tries to learn that kind of useful pattern. It does not need to draw every pixel of the cup, the table, or the background. Instead, it learns the important meaning: there is a cup, a hand is moving toward it, the cup may be picked up, and the next scene should match that kind of physical change.

Why researchers care about JEPA

Researchers are interested in JEPA because it may help AI learn more like people do. Humans do not learn only from labels or instructions. We learn a lot by watching the world and predicting what will happen next.

In 2022, Yann LeCun proposed JEPA as part of a larger plan for autonomous machine intelligence. That plan focused on systems that can learn world models, reason, and plan over time.

Basic history

In 2022, Yann LeCun described JEPA as part of a path toward AI systems that can learn internal models of the world and use those models for planning.

In 2023, I-JEPA applied the idea to images. The model learned by using a visible part of an image to predict the representations of hidden target blocks in the same image. The researchers described it as a non-generative method for self-supervised image learning.

In 2024, Meta introduced V-JEPA for video. This moved the idea from still images to video, where the model learns from motion and change over time. Meta described V-JEPA as predicting masked parts of video in an abstract representation space.

In 2025, Meta introduced V-JEPA 2, a larger world model trained on video. Meta said V-JEPA 2 was designed to understand, predict, and plan, including zero-shot robot control in new environments.

In 2026, researchers introduced V-JEPA 2.1, which improves dense visual understanding for images and video. The paper reports stronger results for action anticipation, robot grasping, navigation, depth estimation, and visual recognition.

Current state

JEPA is still mainly a research direction, not a standard business software feature. It is being studied for world models, robotics, video understanding, audio understanding, 3D perception, autonomous driving, and vision-language systems.

The current direction is clear: researchers are testing whether AI can learn better by predicting useful hidden meaning instead of generating every visible detail. V-JEPA 2 and V-JEPA 2.1 are important examples because they show JEPA being used for video, physical reasoning, and robot planning. Meta says V-JEPA 2 uses a two-phase approach: first learning from natural videos, then using a smaller amount of robot data for planning.

How JEPA Works

A JEPA system usually has three main parts.

A context encoder

This part looks at the information the model is allowed to see. For example, it may see part of an image or the first few seconds of a video.

A target encoder

This part looks at the hidden or future information during training. It turns that hidden part into a compact meaning representation.

A predictor

This part tries to predict the target representation from the context representation.

The model is trained by checking whether the predicted meaning is close to the real target meaning. If the prediction is wrong, the model adjusts itself. Over time, it learns useful patterns about objects, movement, sound, space, and actions.
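The three parts and the training check described above can be sketched in a few lines of plain Python. This is a toy illustration only, not code from any JEPA paper: the encoders and predictor are tiny linear maps, and every name (matvec, train_step, the lr and ema settings, the sample vectors) is hypothetical. One further assumption that this document does not state: the target encoder is not trained directly but slowly tracks the context encoder through a moving average, which is how published variants such as I-JEPA handle it.

```python
import random

def matvec(weights, x):
    """Multiply a weight matrix (a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def mse(a, b):
    """Mean squared distance between two embeddings."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def train_step(ctx_enc, tgt_enc, predictor, visible, hidden, lr=0.05, ema=0.99):
    """One update. Returns the embedding-space loss measured before the update."""
    z_ctx = matvec(ctx_enc, visible)     # context encoder sees the visible part
    z_tgt = matvec(tgt_enc, hidden)      # target encoder sees the hidden part
    z_pred = matvec(predictor, z_ctx)    # predictor guesses the target embedding
    loss = mse(z_pred, z_tgt)            # compared in embedding space, not pixels

    dim = len(z_pred)
    grad_pred = [2.0 * (p - t) / dim for p, t in zip(z_pred, z_tgt)]
    # Backpropagate through the predictor into the context embedding.
    grad_ctx = [sum(predictor[i][j] * grad_pred[i] for i in range(dim))
                for j in range(len(z_ctx))]
    for i in range(dim):
        for j in range(len(z_ctx)):
            predictor[i][j] -= lr * grad_pred[i] * z_ctx[j]
    for i in range(len(z_ctx)):
        for j in range(len(visible)):
            ctx_enc[i][j] -= lr * grad_ctx[i] * visible[j]
    # Assumption: the target encoder is a slow moving average of the context encoder.
    for i in range(len(tgt_enc)):
        for j in range(len(tgt_enc[i])):
            tgt_enc[i][j] = ema * tgt_enc[i][j] + (1.0 - ema) * ctx_enc[i][j]
    return loss

rng = random.Random(0)
ctx_enc = random_matrix(3, 4, rng)
tgt_enc = [row[:] for row in ctx_enc]    # target encoder starts as a copy
predictor = random_matrix(3, 3, rng)
visible = [1.0, 0.5, 0.0, 0.2]           # made-up "visible" input
hidden = [0.9, 0.4, 0.1, 0.3]            # made-up "hidden" input
losses = [train_step(ctx_enc, tgt_enc, predictor, visible, hidden) for _ in range(200)]
print(f"loss at first step: {losses[0]:.4f}, at last step: {losses[-1]:.4f}")
```

The detail worth noticing is that the loss never touches the raw hidden input. It only compares the predicted embedding with the target embedding. A toy this small can also drive the loss down by shrinking every embedding toward zero, so it shows the mechanics of one training step and says nothing about what real models learn.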

The important point is that JEPA does not need to predict every tiny detail. For example, in a video of a tree, it may not need to predict the exact position of every leaf. It can focus on the parts that matter for understanding and planning. The V-JEPA 2 paper explains this difference by saying JEPA focuses on predictable parts of a scene, while pixel generation methods often spend effort on unpredictable visual details.
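The tree example can be made concrete with a small numeric sketch. Two made-up "frames" share the same coarse structure but differ in random fine detail, standing in for fluttering leaves. The coarse_embedding function is a hypothetical stand-in for a learned encoder (it simply averages blocks of values), and all numbers are invented for illustration.

```python
import random

def mean_squared_error(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

def coarse_embedding(signal, block=8):
    """Hypothetical stand-in encoder: average each block of values."""
    return [sum(signal[i:i + block]) / block for i in range(0, len(signal), block)]

rng = random.Random(1)
structure = [1.0] * 16 + [0.0] * 16                     # tree region, then sky region
frame_a = [s + rng.gauss(0, 0.3) for s in structure]    # leaf flutter, sample A
frame_b = [s + rng.gauss(0, 0.3) for s in structure]    # leaf flutter, sample B

pixel_error = mean_squared_error(frame_a, frame_b)
embedding_error = mean_squared_error(coarse_embedding(frame_a), coarse_embedding(frame_b))
print("pixel-space error:    ", round(pixel_error, 4))
print("embedding-space error:", round(embedding_error, 4))
```

Averaging smooths out the unpredictable flutter, so the two frames look much more alike in the coarse space than they do value by value. A real JEPA encoder is learned rather than hand-written, but the reason for comparing in a compact space is the same.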

Current Research Examples

1. V-JEPA 2 for video and robot planning

V-JEPA 2 is one of the most important current examples. It was trained on more than one million hours of internet video and then combined with a much smaller amount of robot data. The goal was to help AI understand the physical world, predict what may happen, and plan actions. The paper reports that V-JEPA 2-AC was used on robot arms for pick-and-place tasks in new environments without task-specific training in those environments.

In plain English, the model learns from watching video. Then, when a robot needs to act, the model can compare possible next steps and choose actions that are likely to move the world closer to the goal.
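The idea of comparing possible next steps can be sketched as a small search. This is not the planner from the V-JEPA 2 paper. It is a minimal greedy illustration in which predict_next is a hypothetical stand-in for a learned predictor, the states are two-number embeddings, and the action names and values are invented.

```python
def l1_distance(a, b):
    return sum(abs(p - q) for p, q in zip(a, b))

def predict_next(state, action):
    """Hypothetical stand-in for a learned predictor: the action shifts the state embedding."""
    return [s + a for s, a in zip(state, action)]

def choose_action(state, goal, candidates):
    """Pick the candidate whose predicted next embedding is closest to the goal embedding."""
    return min(candidates,
               key=lambda name: l1_distance(predict_next(state, candidates[name]), goal))

def plan(state, goal, candidates, max_steps=5):
    """Greedily repeat the one-step choice until the goal embedding is reached."""
    steps = []
    for _ in range(max_steps):
        if l1_distance(state, goal) == 0:
            break
        name = choose_action(state, goal, candidates)
        state = predict_next(state, candidates[name])
        steps.append(name)
    return steps

state = [0.0, 0.0]    # made-up embedding of the current scene
goal = [1.0, 0.0]     # made-up embedding of the goal scene
candidates = {
    "move_left": [-0.5, 0.0],
    "move_right": [0.5, 0.0],
    "move_up": [0.0, 0.5],
}
best = choose_action(state, goal, candidates)
print(best)                             # move_right
print(plan(state, goal, candidates))    # ['move_right', 'move_right']
```

The planner never generates an image of the future. It only asks which predicted embedding lands closest to the goal embedding, then repeats.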

2. V-JEPA 2.1 for stronger visual understanding

V-JEPA 2.1 improves the training approach for images and videos. It focuses on learning dense visual features, which means the model learns not only the overall meaning of a scene, but also more detailed information about objects, space, and time. The paper reports gains in action anticipation, robot grasping, navigation, depth estimation, and recognition.

In plain English, this helps the model understand where things are, how they relate to each other, and how they may move. That matters for robotics, navigation, and any system that must act in the physical world.

3. Audio-JEPA for sound

Audio-JEPA applies the JEPA idea to sound. Instead of reconstructing raw audio, it predicts hidden representations of masked spectrogram patches. A spectrogram is a visual way to represent sound over time. The researchers tested it on speech, music, and environmental sound tasks.
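Masking spectrogram patches can be pictured with a short sketch. It shows only the data preparation step: cutting a spectrogram into patches and hiding some of them. The patch size, the mask ratio, and the function names are assumptions made for illustration and are not taken from the paper. The "spectrogram" is a small grid of made-up numbers.

```python
import random

def split_into_patches(spectrogram, patch_h, patch_w):
    """Cut a 2D grid (frequency rows x time columns) into non-overlapping patches."""
    rows, cols = len(spectrogram), len(spectrogram[0])
    patches = {}
    for top in range(0, rows, patch_h):
        for left in range(0, cols, patch_w):
            patches[(top // patch_h, left // patch_w)] = [
                row[left:left + patch_w] for row in spectrogram[top:top + patch_h]
            ]
    return patches

def mask_patches(patches, mask_ratio, rng):
    """Randomly hide a fraction of the patches. Returns (context, targets)."""
    keys = sorted(patches)
    hidden = set(rng.sample(keys, int(len(keys) * mask_ratio)))
    context = {k: v for k, v in patches.items() if k not in hidden}
    targets = {k: v for k, v in patches.items() if k in hidden}
    return context, targets

# Toy "spectrogram": 4 frequency bins x 8 time frames of made-up energies.
spectrogram = [[float(f * 8 + t) for t in range(8)] for f in range(4)]
patches = split_into_patches(spectrogram, patch_h=2, patch_w=2)
context, targets = mask_patches(patches, mask_ratio=0.5, rng=random.Random(0))
print(len(patches), "patches:", len(context), "visible,", len(targets), "masked")
```

In a full system, the visible patches would go to the context encoder, the masked patches would go to the target encoder, and the predictor would be trained to match the two in embedding space.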

In plain English, the model listens to part of a sound and learns to predict the meaning of missing parts. This could help AI understand speech, music, alarms, background sounds, or other audio signals.

4. 3D-JEPA for point clouds

3D-JEPA applies JEPA to 3D data, such as point clouds. Point clouds are often used in robotics, mapping, and autonomous systems. The model learns by using one part of a 3D object or scene to predict the representation of other target parts. The paper says this helps the model learn useful 3D representations without focusing too much on irrelevant details.

In plain English, the model learns the shape and meaning of 3D objects without needing someone to label every point.

5. ACT-JEPA for learning actions

ACT-JEPA is a research example focused on action and policy learning. The model is trained to predict both action sequences and hidden observation sequences. The researchers report improvements in world model understanding and task success rate compared with their strongest baseline.

In plain English, this is about teaching an AI system not only what it sees, but also what action should come next.

6. VL-JEPA for vision and language

VL-JEPA applies JEPA to vision and language tasks. Instead of generating text tokens in the usual way, it predicts continuous text embeddings. The paper reports that this approach can support classification, video retrieval, and visual question answering.
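One way a predicted text embedding could support classification is to compare it with the embeddings of candidate labels and pick the closest one. The sketch below assumes that approach for illustration and is not confirmed as the paper's method. The three-number embeddings, the labels, and the function names are all invented.

```python
import math

def cosine_similarity(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return dot / norm

def classify(predicted_embedding, label_embeddings):
    """Return the label whose embedding points in the most similar direction."""
    return max(label_embeddings,
               key=lambda label: cosine_similarity(predicted_embedding, label_embeddings[label]))

# Made-up embeddings. A real system would get these from a text encoder.
label_embeddings = {
    "a cup on a table": [0.9, 0.1, 0.0],
    "a dog running": [0.0, 0.2, 0.9],
    "a person writing": [0.1, 0.9, 0.1],
}
predicted = [0.8, 0.3, 0.1]    # hypothetical output of the predictor for one image
print(classify(predicted, label_embeddings))    # a cup on a table
```

No text is generated at any point. The answer is chosen by distance in embedding space, which is why text only needs to be written out when it is actually needed.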

In plain English, the model connects what it sees with language meaning, but it does not always have to write out text unless text is needed.

7. Drive-JEPA for autonomous driving

Drive-JEPA applies V-JEPA to driving. The model uses video pretraining to learn planning representations, then combines this with trajectory planning. The paper reports strong results on the NAVSIM driving benchmark.

In plain English, this research asks whether video-based world models can help a driving system understand a road scene and choose safer, more stable future paths.

Why JEPA Matters

JEPA matters because it is trying to solve a major AI problem: how can a machine understand the world well enough to predict and plan?

Text models are powerful, but the real world is not only text. The real world includes movement, space, objects, sound, people, time, and cause and effect. JEPA research is focused on helping AI learn these patterns from observation.

For a training site or app, JEPA may not be something you build into the first version of the calendar system. However, it is useful background for future AI features. A future Dainvo assistant could use similar ideas to better understand user behavior, schedules, training patterns, and recommended next actions. For now, the practical focus should be the calendar, scheduling logic, Dainvo integration, notifications, and clean training content management.