NYU Researchers Use Language-Vision AI to Turn Traffic Camera Footage Into a Powerful Road Safety Tool
Cities like New York generate an overwhelming amount of traffic video every single day. Thousands of cameras record vehicles, cyclists, and pedestrians around the clock, creating an ocean of footage that mostly sits unused. The problem is not a lack of data, but a lack of practical ways to analyze it at scale. Manually reviewing video to find safety issues is slow, expensive, and simply unrealistic for most transportation agencies.
A new research project from NYU Tandon School of Engineering offers a promising solution. Researchers there have developed an artificial intelligence system that can automatically analyze long traffic videos, detect collisions and near-misses, and explain what went wrong in clear, human-readable language. The system, called SeeUnsafe, combines visual perception with language reasoning and could significantly change how cities approach road safety.
The study was published in the journal Accident Analysis & Prevention and earned New York City’s Vision Zero Research Award, which recognizes work that supports the city’s goal of eliminating traffic deaths and serious injuries. The research was presented at the city’s Research on the Road symposium, highlighting its relevance to real-world transportation planning.
Why Traffic Video Analysis Has Been So Hard Until Now
Traffic cameras are everywhere, but using them effectively is another story. Reviewing video footage to identify dangerous intersections or risky driving behaviors usually requires trained staff, specialized tools, and large budgets. Most agencies can only investigate video after a serious crash has already occurred.
This reactive approach misses an important opportunity. Many serious accidents are preceded by repeated near-misses — vehicles passing too close to pedestrians, unsafe turns, or sudden braking at busy intersections. These events rarely show up in official crash statistics, yet they provide valuable early warning signs.
SeeUnsafe was designed to bridge this gap by making it possible to analyze massive amounts of existing video without hiring teams of video analysts or building custom AI systems from scratch.
How SeeUnsafe Works
At its core, SeeUnsafe relies on multimodal large language models, a class of AI systems that can process both images and text. Instead of treating traffic video as raw pixels alone, the system understands scenes in a more human-like way — recognizing road users, interpreting their movements, and reasoning about safety risks.
The system analyzes long-form traffic videos and classifies them into three main categories: collisions, near-misses, or normal traffic conditions. It also identifies which road users were involved, such as vehicles, cyclists, or pedestrians.
What makes SeeUnsafe especially notable is that it does not require agencies to collect and label new datasets. It leverages pre-trained AI models, meaning cities can use the system with their existing camera infrastructure and video archives.
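To make that workflow concrete, here is a minimal sketch of how an agency might prompt a pre-trained vision-language model to classify a clip into the same three categories. The prompt wording, the `query_multimodal_model` helper, and the JSON response schema are illustrative assumptions for this example, not SeeUnsafe's published pipeline or interface.

```python
# Illustrative sketch: sample frames from a traffic clip and ask a pre-trained
# vision-language model to classify the scene. NOT the published SeeUnsafe code.
import json
from dataclasses import dataclass

import cv2  # OpenCV, used here only for frame extraction


@dataclass
class SafetyAssessment:
    category: str          # "collision", "near-miss", or "normal"
    road_users: list[str]  # e.g. ["vehicle", "pedestrian"]


def sample_frames(video_path: str, every_n: int = 30) -> list:
    """Keep every n-th frame so long videos fit within a model's input budget."""
    frames, cap, idx = [], cv2.VideoCapture(video_path), 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames


PROMPT = (
    "You are reviewing consecutive frames from a traffic camera. "
    "Classify the clip as 'collision', 'near-miss', or 'normal', and list the "
    "road users involved (vehicle, cyclist, pedestrian). "
    'Reply as JSON: {"category": ..., "road_users": [...]}'
)


def query_multimodal_model(frames: list, prompt: str) -> str:
    """Placeholder for whatever vision-language model an agency has access to
    (hosted API or local checkpoint). Returns a canned reply so the sketch runs."""
    return '{"category": "near-miss", "road_users": ["vehicle", "pedestrian"]}'


def classify_clip(video_path: str) -> SafetyAssessment:
    frames = sample_frames(video_path)
    raw = query_multimodal_model(frames, PROMPT)
    data = json.loads(raw)
    return SafetyAssessment(category=data["category"], road_users=data["road_users"])
```

The key point the sketch tries to capture is that no new labeled dataset or custom training loop appears anywhere: the only moving parts are frame sampling, a prompt, and a call to an existing pre-trained model.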
Strong Performance in Real-World Testing
To evaluate its effectiveness, the researchers tested SeeUnsafe on the Toyota Woven Traffic Safety dataset, a well-known benchmark used in traffic safety research.
The results were encouraging. SeeUnsafe correctly classified traffic videos as collisions, near-misses, or normal situations 76.71% of the time, outperforming several existing models. When it came to identifying which specific road users were involved in dangerous events, accuracy reached as high as 87.5%.
These numbers matter because traffic footage is complex and unpredictable. Lighting changes, camera angles vary, and scenes can be crowded. Achieving this level of performance without custom training for each city is a significant step forward.
Turning Video Into Clear Road Safety Reports
One of the most practical features of SeeUnsafe is its ability to generate natural-language road safety reports. Instead of producing technical outputs that require expert interpretation, the system explains its findings in plain language.
These reports describe factors such as weather conditions, traffic volume, road user behavior, and movement patterns that contributed to a collision or near-miss. For transportation planners, this means less time deciphering data and more time acting on insights.
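As a rough illustration of what such a summary could contain, the sketch below assembles a short plain-language report from structured outputs like those in the earlier example. The field names (`category`, `road_users`, `weather`) and the template wording are assumptions made for this example; in the published system the explanation comes from the language model itself rather than a fixed template.

```python
# Minimal sketch: turn structured detections into a readable safety summary.
from collections import Counter


def draft_report(location: str, events: list[dict]) -> str:
    """Summarize flagged events at one camera location in plain language."""
    by_type = Counter(e["category"] for e in events)
    users = Counter(u for e in events for u in e["road_users"])
    lines = [
        f"Safety summary for {location}:",
        f"- {by_type.get('collision', 0)} collision(s) and "
        f"{by_type.get('near-miss', 0)} near-miss event(s) were flagged.",
        "- Road users most often involved: "
        + ", ".join(f"{u} ({n})" for u, n in users.most_common(3)) + ".",
    ]
    weather = {e.get("weather") for e in events if e.get("weather")}
    if weather:
        lines.append(f"- Conditions noted: {', '.join(sorted(weather))}.")
    return "\n".join(lines)


# Example usage with made-up events at a hypothetical intersection:
print(draft_report("Atlantic Ave & 4th Ave", [
    {"category": "near-miss", "road_users": ["vehicle", "pedestrian"], "weather": "rain"},
    {"category": "near-miss", "road_users": ["vehicle", "cyclist"]},
]))
```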
Clear explanations also improve trust. When decision-makers understand why the system flagged a location as dangerous, they are more likely to use its recommendations to guide policy and infrastructure changes.
A Shift From Reactive to Proactive Road Safety
Traditionally, cities redesign roads or adjust traffic controls only after accidents occur. SeeUnsafe supports a proactive safety strategy by identifying patterns of risky behavior before severe crashes happen.
By analyzing near-misses at scale, agencies can spot intersections where drivers regularly fail to yield, pedestrians frequently face close calls, or cyclists encounter unsafe passing distances. This information can guide preventive measures such as better signage, optimized signal timing, protected bike lanes, or redesigned crossings.
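One simple way to operationalize this, sketched below, is to count recurring near-misses per camera location and rank the worst sites. This is an illustrative downstream analysis an agency might run on classified clips, not a component of SeeUnsafe itself, and the `location` and `category` field names are assumptions for the example.

```python
# Illustrative downstream analysis: rank locations by recurring near-misses.
from collections import Counter


def rank_locations(classified_clips: list[dict], top_k: int = 10) -> list[tuple[str, int]]:
    """Return the top_k locations with the most flagged near-miss events."""
    near_miss_counts = Counter(
        c["location"] for c in classified_clips if c["category"] == "near-miss"
    )
    return near_miss_counts.most_common(top_k)
```

A ranking like this could then feed directly into decisions about signage, signal timing, protected bike lanes, or crossing redesigns at the highest-risk sites.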
The system effectively helps cities make better use of investments they have already made in traffic cameras, without requiring costly new hardware or long AI development cycles.
Collaboration Across Disciplines at NYU
The project reflects close collaboration between computer vision researchers at NYU’s Center for Robotics and Embodied Intelligence and transportation safety experts at NYU Tandon’s C2SMART center. This combination of AI expertise and transportation knowledge was essential in designing a system that works in real urban environments.
C2SMART has been involved in multiple data-driven transportation projects, including studies on electric truck impacts on infrastructure, analyses of speed camera effectiveness across neighborhoods, development of digital twins for emergency response routing, and monitoring of overweight vehicles on major highways.
SeeUnsafe builds on this broader effort to modernize transportation systems using advanced analytics and AI.
Current Limitations and Challenges
Despite its strengths, the system is not perfect. SeeUnsafe’s performance depends on the quality of object tracking, and it can struggle in low-light conditions or visually cluttered scenes. These challenges are common in real-world traffic footage and remain active areas of research.
The researchers view this work as a foundation rather than a final solution. Improvements in camera quality, tracking algorithms, and multimodal AI models are expected to enhance performance over time.
Looking Ahead to Future Applications
Beyond fixed traffic cameras, the researchers see potential for applying this approach to in-vehicle dash cameras, where it could support real-time risk assessment from a driver’s perspective. In the long term, similar systems could contribute to connected vehicle networks and smarter urban mobility platforms.
As cities continue to search for scalable ways to improve safety without massive new spending, tools like SeeUnsafe highlight how language-vision AI can turn existing data into actionable knowledge.
Why Multimodal AI Matters for Transportation
Multimodal large language models represent a shift in how machines understand complex environments. Instead of analyzing visual data in isolation, these systems combine perception with reasoning and explanation. In transportation, this means AI can move beyond detecting objects to understanding context, cause, and consequence.
For road safety, that difference is critical. Knowing that a car and pedestrian were close is useful. Knowing why it happened and how often it occurs at a specific location is transformative.
Research Paper Reference:
https://doi.org/10.1016/j.aap.2025.108077