Applied AI Systems

While my Ph.D. trained me to dissect complex problems with academic rigor, my drive for human connection, rooted in my community leadership, inspires me to build tangible solutions.

Since graduating, I have bridged these worlds by leveraging Large Language Models (LLMs) and modern AI stacks to turn ideas into deployed systems quickly. The following projects reflect my agile approach to AI engineering: identifying critical everyday needs and building practical, high-utility solutions that make life safer and easier.

Agentic Career Orchestrator - Multi-Agent LLM System for Strategic Job Triage & Advisory


Agentic Career Orchestrator: Multi-Agent LLM System for Strategic Job Triage & Advisory

An ROI-driven, local-first AI ecosystem designed to augment human judgment through semantic filtering and strategic advisory in high-noise job markets, while preserving data sovereignty.

Current Status: v2 Stable Release (Jan 2026) – full 5-phase multi-agent pipeline operational; ongoing prompt refinement, with no major architecture changes planned.

Motivation

Job hunting as a Ph.D. graduate is fundamentally a semantic search problem disguised as a volume game. The core friction lies in the extremely low signal-to-noise ratio of modern recruitment platforms. After defending my doctoral thesis, I found myself navigating a market where a single “Machine Learning Engineer” posting could mean anything from data pipeline maintenance to cutting-edge research implementation. Traditional keyword-based searches fail spectacularly: they cannot capture nuances like visa sponsorship policies, distinguish between “nice-to-have” and “required” qualifications, or recognize that my background in audio-visual surveillance transfers directly to anomaly detection in fintech.

The real challenge, therefore, is not just finding jobs: it is verification and strategic triage. Manually reviewing hundreds of listings to validate market salaries or check research alignment is exhausting and error-prone. I needed more than a search engine; I needed an ROI-driven intelligent assistant that understands context, filters out the noise, and lets me allocate my limited cognitive bandwidth exclusively to high-leverage opportunities.

Impact & Core Philosophy

Since deployment, this system has transformed job hunting from a chaotic volume game into a disciplined strategic campaign.

- Compressed Time-to-Decision: By automating the Observe-Orient-Decide-Act loop of job filtering, it cuts complex risk assessment from days to a fraction of that time, freeing my bandwidth for high-leverage opportunities.
- Deep Semantic Qualification: Moving beyond rigid keyword matching, the system evaluates contextual fit, automatically distinguishing between generic “Machine Learning Engineering” roles and research-oriented positions that truly align with my Ph.D. expertise.
- Uncovered Asymmetric Opportunities: By aggregating independent assessments from a Multi-Agent Council, the system synthesizes non-obvious connections, such as bridging my audio-visual surveillance background to fintech anomaly detection, that rigid keyword filters typically discard.

The core philosophy is unyielding: build a system that empowers executive decision-making rather than replacing it. I develop this ecosystem not to mass-generate resumes or cover letters, but as a specialized analyst team that removes repetitive, low-value work. By defining the strategic criteria and delegating the ground-level research to the agents, I let the analyst team handle the preliminary debates and data crunching. This structure allows me to step back from the “grunt work” and focus entirely on the final strategic call, ensuring that every application is tailored with authentic human intuition, backed by machine-speed intelligence.

Research Context: The Dual Purpose

Beyond its immediate utility, this project serves as a proof of concept for integrating system-level engineering into the academic research workflow.
Although my core research lies in deep learning rather than systems engineering, my past work pushes me to view my research from a bigger picture. Importing robust engineering tooling into the research loop is an effective strategy for accelerating the proof-of-concept cycle. This project therefore doubles as an architectural pilot for physically-aware synthetic surveillance data generation: it validates the Mixture-of-Advisors orchestration pattern and explores how “system thinking” can orchestrate complex models (such as video generators and task-specific LoRAs) as distinct agents, before the pattern is applied to them directly.

System Evolution: From Linear Script to Dynamic Orchestration

While the core objective remains unchanged (identifying high-fit opportunities from a sea of noise), the architectural approach has fundamentally shifted from a rigid linear protocol to dynamic resource orchestration.

v1: Linear Execution (Legacy)

v1 operated as a single monolithic script that enforced a “One-Size-Fits-All” protocol. Every Job Description (JD), regardless of its domain (Academic, Startup, Big Tech), was forced through the exact same processing sequence (Step A → B → C).

- Rigid Protocol: The system lacked the autonomy to deviate. It applied the same generic analysis prompts to a “Research Scientist” role as it did to a “Backend Engineer” role.
- Resource Inefficiency: It could not dynamically allocate resources, wasting tokens on irrelevant checks while missing domain-specific nuances that required deeper investigation.

v1 was a static procedure: it executed steps blindly, focusing on process completion rather than strategic adaptability.

v2: Adaptive Resource Orchestration

v2 refactors the system into a flexible multi-agent ecosystem. Instead of a fixed linear path, it employs a Router Agent to analyze the context of each JD and dynamically “spin up” the necessary agents, allocating computational resources only where they yield the highest ROI.

Phase 1: Observation (Tool-Augmented Grounding). Before analysis begins, the system proactively invokes external tools (salary APIs, arXiv retrieval) to ground the JD in reality, ensuring subsequent agents operate on verified data rather than assumptions.

Phase 2: Orientation (Gatekeeping). A lightweight Triage Agent performs a rapid “Survival Check.” It instantly discards non-viable roles (visa/language constraints) before expensive reasoning agents are instantiated, optimizing the computational budget.

Phase 3: Decision (Context-Aware Council). This is the core architectural shift. The system dynamically assembles a Council of Advisors specific to the role.

- For a Research Scientist role: it spins up the 🔬 Academic Reviewer, assessing the domain gap between my research and the target field.
- For an Early-Stage Founding Engineer: it activates the 🚀 Startup Veteran (analyzing equity potential and risk) and the 🏗️ System Architect (assessing scalability requirements).

The Technical Advantage: By creating specialized agents, we ensure Context Isolation. This prevents “Context Pollution,” where an LLM gets confused by irrelevant information (e.g., applying corporate HR standards to a scrappy startup role). Each agent sees only what it needs to see, ensuring the signal remains pure.
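To make the council pattern concrete, here is a minimal, illustrative sketch of the routing and context-isolation idea. The advisor personas, the keyword heuristic, and the `call_llm` callable are placeholders of mine (in the real system the routing itself is LLM-driven); this is not the project's actual code.

```python
# Illustrative sketch of the Phase-3 router/council pattern (placeholders only).
from dataclasses import dataclass

@dataclass
class Advisor:
    name: str
    persona: str  # system prompt defining the advisor's viewpoint

ADVISOR_POOL = {
    "academic_reviewer": Advisor("Academic Reviewer",
        "Assess the domain gap between the candidate's research and the role."),
    "startup_veteran": Advisor("Startup Veteran",
        "Evaluate equity upside, runway risk, and role ambiguity."),
    "system_architect": Advisor("System Architect",
        "Judge the scalability and infrastructure demands of the role."),
}

def route_advisors(jd_text: str) -> list[Advisor]:
    """Pick only the advisors relevant to this JD (a keyword stand-in for the Router Agent)."""
    lowered = jd_text.lower()
    selected = []
    if "research" in lowered or "phd" in lowered:
        selected.append(ADVISOR_POOL["academic_reviewer"])
    if "founding" in lowered or "early-stage" in lowered:
        selected += [ADVISOR_POOL["startup_veteran"], ADVISOR_POOL["system_architect"]]
    return selected or [ADVISOR_POOL["system_architect"]]

def run_council(jd_text: str, call_llm) -> dict[str, str]:
    """Each advisor sees only its own persona plus the JD: no shared context between agents."""
    return {a.name: call_llm(system=a.persona, user=jd_text) for a in route_advisors(jd_text)}
```

The point of the sketch is the isolation: each advisor receives only its persona and the JD, so irrelevant context from other advisors never pollutes its judgment.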
Phase 4: Strategy War Room (Clustering & ROI). The system steps back from single JDs and enters a war-room view across all dossiers. It clusters opportunities by similarity patterns (domain, seniority, tech stack, narrative angle) and estimates the ROI of working on each cluster, producing a ranked “battle plan” that prioritizes where limited effort should be invested first.

Phase 5: Briefing & War Room Editor. Finally, a briefing agent synthesizes per-cluster guidance into concise strategy briefs, while the war-room editor generates structured tables and bullet-level action items instead of full ghostwritten documents. It suggests narrative angles, surfaces reusable experience blocks, and prepares checklists of edits so that I stay in control of the final wording while the system handles the pattern-matching and retrieval work.

The Architectural Shift

The evolution mirrors the transition from a hard-coded script to an intelligent orchestrator:

- v1 approach: process-centric. “Run every file through these 3 steps.”
- v2 approach: resource-centric. “Analyze the target, deploy the right agents, and synthesize the strategy.”

By moving from linear execution to dynamic orchestration, v2 ensures that every decision is backed by the right experts, adapting the system’s behavior to the chaos of the real-world market.

Tech Stack

- LLM & Orchestration: Python 3.11, Google Generative AI SDK (Gemini API), semantic routing via a Smart Model Gateway.
- Models: Gemma-3-27b-it for logic & extraction; Gemini-2.5-Flash for long-context strategic synthesis.
- Memory & RAG: ChromaDB (all-MiniLM-L6-v2), recursive character splitting, JSON/Markdown serialization.
- Infrastructure: Docker Compose, .env-based local path binding, local-first storage of CVs and history.

Technical Architecture & Implementation

System Architecture Overview

The v2 system implements a multi-phase, multi-agent orchestration pipeline with clear separation of concerns. Unlike monolithic single-script approaches, each phase is encapsulated as an independent module with dedicated agents, enabling modular development and targeted optimization.

Core Design Principles:

- Phase-Based Pipeline: Five sequential phases (Intel → Triage → Council → War Room → Briefing) with explicit data contracts between stages.
- Hybrid Model Gateway: A cost-aware routing layer that switches between quota-friendly models (Gemma) for extraction and filtering, and high-capacity models (Gemini Flash) for deep reasoning.
- Structured Knowledge Layer: A local vector database designed for semantic retrieval of reusable CV sections per JD.

Implementation Stack

- Core Infrastructure: Python 3.11, Google Generative AI SDK, Docker Compose.
- Structured Output: Enforced JSON schemas ensure reliable agent communication and metadata parsing.
- Hybrid Inference Layer:
  - The Workhorse (Gemma 3-27B-it): Handles ~90% of the workload, including JD parsing, keyword extraction, triage filtering, and initial gap analysis. Chosen for its generous daily rate limit (RPD) and efficiency.
  - The Heavy Lifter (Gemini 2.5 Flash): Activated only for tasks requiring massive context windows or deep reasoning, such as deep RAG retrieval and strategic synthesis.
- Smart Gateway: A router that selects the appropriate model based on task complexity and real-time daily quota availability (a simplified sketch follows below).
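A minimal sketch of what such a quota-aware gateway could look like. The model names follow the stack above, but the daily request budgets and the complexity flag are placeholder assumptions, not the actual API limits or project code.

```python
# Hedged sketch of a quota-aware model gateway (illustrative, not the project's code).
import datetime

class SmartGateway:
    def __init__(self, daily_quota=None):
        # Assumed per-day request budgets; real limits depend on the API tier.
        self.daily_quota = daily_quota or {"gemma-3-27b-it": 14000, "gemini-2.5-flash": 250}
        self.used = {}
        self.day = datetime.date.today()

    def _reset_if_new_day(self):
        if datetime.date.today() != self.day:
            self.day, self.used = datetime.date.today(), {}

    def pick_model(self, task_complexity: str) -> str:
        """Route heavy reasoning to Gemini Flash and everything else to Gemma,
        falling back when a model's daily budget is exhausted."""
        self._reset_if_new_day()
        preferred = "gemini-2.5-flash" if task_complexity == "deep" else "gemma-3-27b-it"
        fallback = "gemma-3-27b-it" if preferred == "gemini-2.5-flash" else "gemini-2.5-flash"
        for model in (preferred, fallback):
            if self.used.get(model, 0) < self.daily_quota[model]:
                self.used[model] = self.used.get(model, 0) + 1
                return model
        raise RuntimeError("Daily quota exhausted for all models")
```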
Memory & Knowledge:

- Vector Store: ChromaDB with MiniLM embeddings.
- Indexing: Personal CV, academic papers, and historical applications.
- Purpose: Enables the system to retrieve relevant experience blocks and reusable resume snippets for each specific JD.

Multi-Agent Architecture

The system executes a structured workflow to ensure comprehensive coverage:

Phase 1: Intelligence Gathering (The Scout). Parses raw JDs (with text extraction caching) and invokes external tools (e.g., the arXiv API for research group validation, a mock salary validator). Output: an Enriched Dossier containing external context (team credibility, market salary band).

Phase 2: Triage & Gatekeeping (The Gatekeeper). Enforces hard constraints (visa, Ph.D. relevance, salary floor). Output: a structured triage decision. Only “playable” JDs proceed to the Council, saving compute resources.

Phase 3: Mixture-of-Advisors (The Council). A Router Agent dynamically selects multiple advisors from the relevant fields for each JD (e.g., Academic Analyst and Leadership advisors for senior research roles). Context Isolation: each advisor has a dedicated persona definition and memory state, preventing context pollution. Output: per-advisor scores and rationales stored in dossier metadata for downstream aggregation.

Phase 4: Strategy War Room (Clustering & ROI). In Phase 4, the system steps back from single JDs and takes a War Room view across all dossiers. Instead of treating each role as an isolated decision, it clusters opportunities by similarity patterns (domain, seniority, tech stack, narrative angle) and estimates the ROI of working on each cluster. For each cluster, the War Room agent considers:

- Rewrite effort: how much real editing is needed beyond a 5-minute tweak.
- Reusability: whether a single narrative rewrite can unlock multiple similar roles.
- Strategic leverage: whether the cluster advances my long-term trajectory (e.g., research scientist track vs. generic MLE).

The output of Phase 4 is a ranked “battle plan”: a prioritized queue of clusters with concrete reasons why they deserve attention now versus later.

Phase 5: Briefing & War Room Editor. Phase 5 turns strategy into execution support rather than ghostwriting. The Briefing agent synthesizes per-cluster guidance into a concise strategy brief: which project angles to foreground, which gaps to acknowledge, and which phrasing patterns are reusable across roles. Instead of generating full resumes or cover letters, the War Room Editor produces structured tables and bullet-level action items that I can copy into my own documents as needed. It surfaces:

- Suggested narrative angles per cluster and per JD (e.g., for Job A, frame my audio-visual surveillance work as a privacy-preserving mechanism; for Job B, emphasize high-throughput anomaly detection).
- Phrases and experience blocks retrieved from my CV and history that are safe to reuse.
- A short checklist of edits for each application (what to add, what to cut, what to reframe).

This keeps the human firmly in control of the final wording while the system handles the tedious pattern-matching and retrieval work.

Cost & Efficiency Optimization

The architecture is strictly ROI-driven, prioritizing resource allocation to minimize waste:

- Aggressive Caching: JD text extraction and intermediate reasoning steps are cached locally to avoid redundant API calls.
- History Reusability: The RAG module retrieves reusable sentences from past successful applications, reducing the manual effort of rewriting the same sentence over and over (see the retrieval sketch below).
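For the memory layer, here is a minimal ChromaDB sketch of how reusable experience blocks could be indexed locally and retrieved per JD. The collection name, ids, snippet texts, and storage path are illustrative; ChromaDB's default embedder is the all-MiniLM-L6-v2 sentence transformer, which matches the stack above.

```python
# Minimal sketch of the local RAG layer: index CV/history snippets and retrieve per JD.
import chromadb

client = chromadb.PersistentClient(path="./career_memory")        # local-first storage
cv_blocks = client.get_or_create_collection(name="cv_blocks")     # default all-MiniLM-L6-v2 embedder

# One-off indexing of reusable experience snippets (normally parsed from CV/history files).
cv_blocks.add(
    ids=["surveillance-av", "privacy-optin"],
    documents=[
        "Built audio-visual representation learning pipelines for urban surveillance.",
        "Designed an opt-in, adversarial privacy framework compatible with off-the-shelf ASR models.",
    ],
)

def reusable_blocks_for(jd_text: str, k: int = 3) -> list[str]:
    """Return the k most semantically relevant experience blocks for a given JD."""
    n = min(k, cv_blocks.count())
    result = cv_blocks.query(query_texts=[jd_text], n_results=n)
    return result["documents"][0]
```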
Resources

- System Architecture Diagram
- GitHub Repository
- Analysis Sample - Example of an analysis report for a fake JD.

Taiwanese in Ghent, The Survivor Kit - A Serverless LLM-Agent Deployment


Taiwanese in Ghent, The Survivor Kit: AI-Powered Community Platform

A serverless, AI-driven information hub designed to automate community management and solve information fragmentation for international students.

Motivation & Product Philosophy

Taiwanese in Ghent, The Survivor Kit is a comprehensive survival guide platform. Originally engineered for students, I collaborated with the current president of the UGent Taiwanese Student Association (TSA) to redefine the product roadmap, expanding its scope to serve the entire Taiwanese expatriate community. This ensured the system aligned with actual operational needs rather than mere technical novelty.

“I built this not just as a developer, but as the former President who identified the root cause of platform failure.”

I recognized that previous platforms failed due to high operational friction. To solve this, I set a strict constraint: the system must be low-maintenance and operable by non-technical staff. This drove the decision to adopt a serverless architecture combined with autonomous AI agents, allowing for rapid iteration and a “set-and-forget” operational model.

Role: Product Owner & Full-Stack Engineer
Scope: Requirement Analysis → System Architecture → AI Agent Development → CI/CD

Tech Stack

- AI & NLP: Python, Gemma 3 4B (LLM), Feedparser (RSS), Prompt Engineering
- Backend / CMS: Google Sheets API (NoSQL/CMS), Google Apps Script, Event-Driven ETL
- Frontend: Next.js 14 (App Router), TypeScript, Tailwind CSS, ISR
- Infrastructure: Vercel (Serverless), GitHub Actions (CI/CD), Docker

Technical Architecture & Implementation

1. AI-Driven Intelligence Pipeline (Event-Driven ETL)

The core innovation is an automated pipeline that monitors, analyzes, and translates local news without human intervention, effectively functioning as a domain-specific AI agent (a minimal sketch of this loop follows at the end of this section):

- Data Ingestion: A Python-based agent continuously monitors municipal RSS feeds (stad.gent) and emergency alerts.
- LLM Integration (Gemma 3 4B): Deployed Gemma 3 4B to perform semantic analysis on raw Dutch texts.
- Structured Prompt Engineering: Designed rigorous prompt templates to enforce valid JSON output from the LLM. Tasks include importance grading (Level 1-3), audience classification (student vs. resident), Traditional Chinese translation, and summarization.
- Robustness: Implemented retry logic with exponential backoff to handle API rate limits and ensure pipeline reliability.
- ETL Execution: Structured data is automatically validated and written back to the Google Sheets CMS, triggering frontend updates.

2. Serverless Full-Stack Architecture

Designed a cost-efficient architecture suitable for long-term operation:

- Headless CMS (Google Sheets): Abstracted Google Sheets into a JSON API. This lets non-technical staff manage content via a familiar spreadsheet interface, eliminating database costs ($0/month) and lowering the maintenance barrier.
- Frontend (Next.js 14): Implemented incremental static regeneration (60 s revalidation) to ensure high performance and SEO while keeping data fresh.

3. CI/CD & DevOps

- GitHub Actions: Orchestrated daily cron jobs (06:00 UTC) to execute the news crawling and AI analysis agents.
- Security & Reproducibility: Managed API secrets via GitHub Secrets and used Docker to ensure environment consistency for the AI agents.
- Automated Deployment: Configured Vercel for automatic deployments on git push, establishing a production-ready lifecycle.
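As a minimal illustration of the pipeline described above, the sketch below polls an RSS feed with feedparser, asks the model for structured JSON, and retries with exponential backoff. The prompt text, the feed URL, and the `call_gemma` callable are stand-ins, not the deployed agent's code.

```python
# Hedged sketch of the news-analysis agent: RSS polling, JSON-enforced LLM analysis, retries.
import json
import time
import feedparser

PROMPT = """You are a news triage assistant for Taiwanese residents in Ghent.
Return ONLY valid JSON with keys:
  "importance": 1-3, "audience": "student" or "resident",
  "summary_zh_tw": a Traditional Chinese summary of the item.
News item:
{item}"""

def analyze_entry(entry, call_gemma, max_retries: int = 4) -> dict:
    text = f"{entry.title}\n{entry.summary}"
    for attempt in range(max_retries):
        try:
            raw = call_gemma(PROMPT.format(item=text))
            return json.loads(raw)           # enforce structured output
        except (json.JSONDecodeError, RuntimeError):
            time.sleep(2 ** attempt)         # exponential backoff on bad JSON / rate limits
    raise RuntimeError("LLM analysis failed after retries")

def run_pipeline(call_gemma):
    feed = feedparser.parse("https://stad.gent/...")  # placeholder feed URL
    return [analyze_entry(entry, call_gemma) for entry in feed.entries]
```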
Key Results & Impact

- 100% Automation: Achieved a fully automated loop for news gathering, translation, classification, and publishing.
- Zero Operational Cost: Leveraged serverless free tiers to keep the platform cost-free, ensuring the project’s financial sustainability for the student association.
- Solved “Technical Debt”: Created a system that requires no coding skills to maintain, addressing the high turnover rate inherent in student organizations.

Resources

- Live Website
- GitHub Repository
- AI Agent Source Code - Python agent for scraping and LLM processing.
- Prompt Engineering Templates - Structured prompts for Gemma 3 4B.

Research Projects

While my applied work prioritizes user-centric utility, it is built upon a foundation of rigorous academic inquiry established during my doctoral and master's studies.
My academic journey at Ghent University and NCKU centered on analyzing real-world data within the fields of surveillance and driver monitoring, with a specific emphasis on audio-visual modalities. I investigated the critical gap between controlled lab environments and unpredictable real-world deployments, proposing novel mechanisms and unsupervised frameworks to bridge this divide. My research spans computer vision, audio processing, and multimodal representation learning, extending into privacy preservation and transferability assessment. This body of work represents my dedication to pushing the boundaries of what AI can perceive without compromising the rights of the people it protects.

From Lab to Street: Transferable and Privacy-friendly Deep Learning for Urban Surveillance


“AI is here, they say.” We have witnessed how artificial intelligence has reshaped our perception of the world in recent years. AlphaFold 2 compressed decades of work on protein structure prediction into a few weeks, accelerating medical breakthroughs. Meanwhile, GNoME discovered 380k new structures within a few months, equivalent to roughly 800 years of work done the traditional way. However, we have also seen some domains struggle to translate state-of-the-art research into real-world applications. Surveillance, set in the chaotic, unscripted environment of our streets and constrained by privacy, data scarcity, and unpredictability, is one of them.

This part of my Ph.D. research tackles the critical bottlenecks and bridges the gap between academic research and real-world deployment across three aspects: privacy risks, data scarcity, and environmental domain shifts. This project is not merely a collection of frameworks designed to extract marginal accuracy gains on scripted datasets; it is about how we can adapt advanced models to handle the chaotic, constrained, and unscripted disorder of the real world. The dissertation is built upon three core technical pillars and culminated in a Ph.D. degree from Ghent University:

- Privacy-Friendly Sensing Framework
- Audio-Visual Representation Learning
- Source-Free Unsupervised Transferability Assessment

Privacy-Friendly Sensing Framework

The “Opt-in” Mechanism: From Audio to Visual

Traditional privacy protection is fundamentally reactive (opt-out). Users are forced to play an infinite game of whack-a-mole, trying to list every sensitive attribute they want to hide. However, raw data is inherently “bundled”: a simple voice command carries not just the semantic content, but also the speaker’s gender, emotion, and identity. In the real world, it is impossible to exhaustively list and block every potential leakage. This “collect first, sanitize later” approach violates the core principle of GDPR, leaving users vulnerable to future, unforeseen extraction techniques.

To address the trade-off between utility and privacy, I proposed a fundamental inversion of this paradigm. Instead of asking “what should we hide?”, my framework asks “what is strictly necessary?”. Using adversarial learning, I trained an on-edge obfuscator adapted from a generative architecture (CycleGAN-VC2), designed to protect attributes like speaker identity, emotion, and gender while maintaining compatibility with downstream models (e.g., DeepSpeech2). This model acts as a digital sieve that actively strips away the “bundled” sensitive attributes (like identity) at the signal level, while selectively preserving only the features required for the authorized task (e.g., speech recognition). It transforms privacy from a passive policy into an active, mathematical constraint.

Crucially, this framework solves the deployment bottleneck. The obfuscated data remains mathematically compatible with off-the-shelf models, so service providers can “plug in” this privacy module at the edge without retraining their massive backend models. It offers a source-free, scalable path to GDPR compliance that protects users without dismantling the existing AI infrastructure.

Experiments on four speech datasets demonstrate that the framework suppresses unauthorized attribute recognition to near-random chance levels, while incurring a minimal performance drop (only 2-6%) on authorized tasks.

- Audio Domain: Published in IEEE Pervasive Computing (1st author).
- Visual Domain: The core “opt-in” logic proved robust enough to be adapted to the visual domain, validating its cross-modal universality. Published in Applied Intelligence (2nd author).
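At a high level, the opt-in idea can be summarized as an adversarial objective. The formulation below is my simplified sketch (notation mine, not the exact loss from the publications):

$$ \min_{G}\;\max_{D}\;\; \mathcal{L}_{\text{task}}\big(f_{\text{task}}(G(x)),\,y\big)\;-\;\lambda\,\mathcal{L}_{\text{priv}}\big(D(G(x)),\,s\big) $$

where $G$ is the on-edge obfuscator, $f_{\text{task}}$ a frozen off-the-shelf downstream model (e.g., a speech recognizer), $D$ an adversary trying to recover the sensitive attribute $s$ (identity, emotion, gender), $y$ the authorized task label, and $\lambda$ the utility-privacy trade-off. The obfuscator keeps the authorized task working while pushing the adversary toward chance-level performance.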
Audio-Visual Representation Learning

The Paradox of Misalignment: Turning False Negatives into Semantic Anchors

Contrastive learning is an effective way to learn representations without labels. Yet conventional contrastive learning on multimodal data, such as surveillance, suffers from false negatives. When two ambulances pass at different times during the night, the temporal coherence constraint used by traditional contrastive learning treats them as unrelated events. These false negatives lead to inefficient and ineffective representation learning.

In urban surveillance, spatiotemporal discontinuity is the norm, not the exception. A siren is often heard before the ambulance appears; a crash sound precedes the visual collision. By rigidly enforcing temporal alignment, traditional models discard these meaningful but asynchronous correlations as false negatives, actively unlearning the causal structure of reality. Furthermore, traditional methods suffer from an information bottleneck: by relying on a single positive pair (the exact timestamp), the model learns only the minimal sufficient features needed to match that pair, discarding rich semantic details essential for generalization.

Instead of treating asynchronous signals as errors to be filtered, I utilized them as semantic anchors. I developed the Embedding-based Pair Generation (EPG) mechanism, which operates on a simple premise: if two signals share high similarity in the latent space, they belong to the same event regardless of their timestamp.

- Dynamic Pair Re-evaluation: EPG actively retrieves these “misaligned” samples from the memory bank and re-labels them as positive pairs.
- Multi-Positive Contrastive Loss: By forcing the model to recognize multiple, time-scattered instances of the same event, we break the information bottleneck. This compels the encoder to capture richer, more robust features rather than just the minimal cues needed for temporal alignment.

This approach successfully transformed the chaotic characteristics of surveillance data from a performance bottleneck into a source of data augmentation.

- Performance: Achieved a 10% improvement over state-of-the-art baselines (TACMA, MAViL) in audio-visual event localization.
- Rich in Information: The proposed EPG and the multi-positive loss force the model to capture dense, semantic features. The learnt representation is general-purpose, successfully powering multiple downstream tasks including event localization, anomaly detection, and query-guided event search without retraining.
- Scalability: Such versatility dramatically improves the scalability of edge deployments. Instead of installing separate, heavy models for each function, a single lightweight encoder can now serve multiple analytical tasks simultaneously.

The results have been published in Frontiers in Robotics and AI (1st author).
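To illustrate the EPG idea, here is a hedged PyTorch sketch (not the published implementation): memory-bank entries whose embeddings are highly similar to the anchor are re-labeled as extra positives, and a multi-positive contrastive loss averages over all of them. The similarity threshold and temperature are placeholder values.

```python
# Illustrative sketch of embedding-based pair generation with a multi-positive loss.
import torch
import torch.nn.functional as F

def epg_multi_positive_loss(anchor, aligned_pos, memory_bank, sim_thresh=0.8, temperature=0.07):
    """anchor: (D,), aligned_pos: (D,), memory_bank: (N, D) embeddings."""
    anchor = F.normalize(anchor, dim=0)
    aligned_pos = F.normalize(aligned_pos, dim=0)
    bank = F.normalize(memory_bank, dim=1)

    # Dynamic pair re-evaluation: near-duplicates in latent space become positives.
    sims = bank @ anchor                                            # cosine similarity per entry
    extra_pos_mask = sims > sim_thresh

    candidates = torch.cat([aligned_pos.unsqueeze(0), bank], dim=0)  # (1+N, D)
    pos_mask = torch.cat([torch.tensor([True]), extra_pos_mask])     # first slot = aligned pair

    logits = (candidates @ anchor) / temperature
    log_prob = logits - torch.logsumexp(logits, dim=0)
    return -(log_prob[pos_mask]).mean()                              # average over all positives
```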
Transferability Assessment

Navigating Without a Map: The “Source-Free” Compass

One of the greatest hurdles in large-scale deployment is domain shift: a model trained on a sunny day in Ghent often fails miserably on a rainy night in Taipei. In an ideal world, we would access the original training data (source data) to bridge this gap, and well-annotated target data during evaluation. But in the real world, strict privacy regulations (like GDPR) often lock this data away. Engineers are forced to deploy models into new, unseen environments while effectively flying blind, unable to predict which model will survive the shift.

To solve this, I leveraged the underexplored potential of an unlikely guide: randomness. I proposed a novel assessment framework using Randomly Initialized Neural Networks (RINNs). My research revealed that while random networks contain no knowledge, their statistical structure provides a consistent, unbiased “universal ruler.” By measuring the Centered Kernel Alignment (CKA) between a pre-trained model and a set of random networks, I derived a “fingerprint” of the model’s structural adaptability. This allows us to assess a model’s compatibility with a new environment without ever touching the restricted source data or requiring ground-truth labels. It turns model selection from a guessing game into a precise science.

- Task-Agnostic Validation: I validated this metric across a spectrum of real-world surveillance tasks, ranging from object tagging and event classification to the more abstract anomaly detection.
- High Correlation: Evaluated on diverse real-world datasets, my metric achieved a Kendall’s $\tau$ correlation of 0.95 with actual model performance.
- Operational Efficiency: It acts as a “Source-Free Compass,” allowing engineers to instantly identify the best-suited model for a specific camera feed before deployment, ensuring reliability while strictly respecting data sovereignty.

The results have been published in Sensors (1st author).
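A minimal sketch of the fingerprinting step, assuming features have already been extracted on unlabeled target-domain samples; the linear-CKA form and the aggregation into a ranking are illustrative simplifications of the published method.

```python
# Hedged sketch of the "source-free compass": compare a candidate model's target-domain
# features against features from randomly initialized reference networks via linear CKA.
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """X: (n, d1), Y: (n, d2) feature matrices computed on the same n target samples."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

def rinn_fingerprint(model_feats: np.ndarray, random_net_feats: list[np.ndarray]) -> np.ndarray:
    """CKA of the pre-trained model against each randomly initialized reference network."""
    return np.array([linear_cka(model_feats, R) for R in random_net_feats])

# Ranking candidate models by (an aggregation of) their fingerprints lets us pick the most
# adaptable model for a new camera feed without labels and without the source data.
```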

SensCity - Acoustic Surveillance in the Real World


SensCity x AsaSense: Critical Analysis of Urban Acoustic Surveillance

A strategic research collaboration with the SensCity project (AsaSense), utilizing city-scale raw acoustic data to expose the failure modes of standard surveillance models and propose context-aware architectural solutions.

The Research Gap & Motivation

Why “Off-the-Shelf” Fails in the Wild: Most acoustic surveillance systems are validated on clean, curated datasets. However, their performance on raw, unprocessed urban audio remains largely unverified.

Our Mission: In collaboration with AsaSense, we accessed a unique stream of continuous, uncurated audio from Ghent and Rotterdam. Instead of simply deploying a standard model, our goal was to stress-test the two dominant paradigms (anomaly detection and sound tagging), identify why they fail in dynamic environments (e.g., temporal drift, open-set events), and propose robust alternatives.

Operational Context (The SensCity Testbed)

This project leveraged a real-world infrastructure to diagnose algorithmic limitations:

- Raw Data Ingestion: Unlike academic datasets, the SensCity sensor network captures the “messy” reality of cities across two years: wind noise, overlapping soundscapes, and non-stationary backgrounds. Most importantly, it comes without any annotations.
- System Audit: We applied state-of-the-art anomaly detection and sound tagging models to this raw stream. The analysis revealed that global models generate unmanageable false alarms due to contextual blindness (e.g., treating a weekend market as an anomaly because the model only knew weekday traffic), causing operator fatigue and ultimately system failure.
- Core Conclusion: Our experiments conclusively showed that a single global model is insufficient for city-scale deployment. Instead, Context-Specific Modeling (sensor-specific baselines) is a prerequisite for operational reliability.
- Proposed Resolution: Based on these findings, we formulated a Context-Aware Design Framework, advocating sensor-specific baselines and adaptive thresholding to handle the inherent variance of city life.

Core Methodologies

- Data Source: High-fidelity, long-term raw acoustic logs from the AsaSense deployment (Ghent & Rotterdam).
- Diagnosis Method: Cross-context evaluation (spatial & temporal domain shift).
- Algorithmic Focus: Unsupervised deep autoregressive modeling (WaveNet) vs. pre-trained tagging models.
- Architecture Design: Feasibility analysis of hybrid edge-cloud pipelines to mitigate bandwidth bottlenecks.

Technical Analysis & Innovations

1. Diagnosing the “Generalization Fallacy”

The Problem: We demonstrated that state-of-the-art anomaly detectors suffer from severe concept drift. A model trained on “winter data” failed catastrophically during summer evenings due to changed human activity patterns.

The Solution: Proposed a Context-Specific Modeling approach, proving that training lightweight, dedicated models for each sensor location significantly outperforms a massive, generic global model in anomaly retrieval.

2. The Limits of Semantic Tagging

The Finding: Standard sound taggers (trained on AudioSet) struggle with the open-set nature of cities. They force novel urban sounds into rigid, pre-defined categories, leading to semantic misalignment.

The Proposal: Suggested moving from “rigid classification” to “unsupervised deviation detection” at the edge, using tagging only as a secondary enrichment layer in the cloud rather than as a primary filter (a simplified sketch of such edge screening follows below).
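The sketch below illustrates the kind of lightweight, sensor-specific screening this proposal implies: a "Filter-then-Forward" edge node that keeps its own rolling baseline and forwards only deviating clips. The window size, warm-up length, and deviation multiplier are placeholder choices, not values from the study.

```python
# Illustrative sketch of sensor-specific, adaptive-threshold screening at the edge.
from collections import deque
import numpy as np

class EdgeScreener:
    def __init__(self, window: int = 1000, k: float = 3.0):
        self.scores = deque(maxlen=window)   # recent novelty / reconstruction scores for THIS sensor
        self.k = k

    def is_anomalous(self, score: float) -> bool:
        """Adaptive, context-specific threshold: mean + k * std of this sensor's own history."""
        if len(self.scores) < 50:            # warm-up: just collect a local baseline
            self.scores.append(score)
            return False
        mu, sigma = np.mean(self.scores), np.std(self.scores)
        self.scores.append(score)
        return score > mu + self.k * sigma   # forward the clip to the cloud only if True
```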
3. Architectural Scalability (Edge vs. Cloud)

Analysis: Analyzed the trade-off between transmission cost and detection latency.

Recommendation: Proposed a “Filter-then-Forward” architecture where edge nodes perform lightweight unsupervised screening, transmitting only potential anomalies to the cloud. This reduces bandwidth consumption by orders of magnitude while preserving privacy.

Outcomes & Impact

- Empirical Evidence: Provided one of the first comprehensive studies of the limitations of transfer learning in acoustic surveillance using real-world, longitudinal data.
- Design Guidelines: The findings established the foundation for privacy-preserved and adaptive surveillance, directly influencing the design of subsequent research on privacy in surveillance.
- Strategic Value: Delivered critical insights to the industrial partner (AsaSense) on avoiding “technical debt” by pivoting from global models to adaptive, edge-based learning.

Resources

- Chapter 2: The AsaSense Project - Detailed analysis of deployment constraints and algorithmic failures.

Multimodal Driver Monitoring & Temporal Face Analysis


Multimodal Driver Safety System & Robust Face Analysis

A holistic driver monitoring framework developed with ARTC, fusing visual temporal dynamics and ECG signals to enable early anomaly detection and proactive safety intervention.

The Research Gap & Motivation

From Passive Recording to Proactive Intervention: Standard recognition models often fail in real-world cockpits due to inter-personal variability. A generic model struggles to distinguish between a driver’s natural features (e.g., droopy eyelids) and genuine fatigue.

Our Goal: To build a safety-critical system capable of detecting compromised driver states early by combining non-intrusive visual monitoring with physiological signals (ECG), reducing false alarms and ensuring timely intervention.

Operational User Scenario (How it Works)

To address the variability mentioned above, the system operates in a three-stage safety loop:

- Initialization (The “Handshake”): When the driver starts the car, the system silently records a short “calibration sequence” to learn their current appearance (e.g., wearing sunglasses, heavy makeup, or visible fatigue). This establishes a Personalized Normal Driving Model (PNDM) for the specific trip.
- Dynamic Monitoring: As the vehicle moves through changing environments (e.g., entering a dark tunnel or facing high-beam glare), the alignment-free visual descriptor maintains robust tracking without being confused by lighting shifts.
- Proactive Intervention: If the driver shows signs of drowsiness (e.g., prolonged eye closure) AND the ECG sensor detects physiological fatigue, the system triggers a multi-stage alert: first warning the driver, and in critical cases notifying fleet management or emergency services (a simplified sketch of this trigger logic follows below).

Core Methodologies

- Visual Algorithms: Temporal Coherent Face Descriptor (alignment-free, robust to lighting).
- System Integration: Multimodal sensor fusion (Vision + ECG).
- Modeling Strategy: Sparse representation-based classification with online dictionary learning.
- Validation: Co-developed and tested with the Automotive Research & Testing Center (ARTC).

Technical Architecture & Innovations

1. Personalized Calibration (User-Centric Design)

The Problem: Drivers look different every day. Pre-trained generic models fail when users change their appearance.

The Solution: Implemented a rapid initialization phase that builds a dynamic baseline for each trip. The algorithm detects anomalies based on relative deviation from this baseline, effectively filtering out noise from accessories or facial structure.

2. Robust Temporal Modeling (Visual Subsystem)

Alignment-Free: By leveraging temporal consistency across continuous frames, we eliminated the need for fragile face alignment steps, ensuring stability even under rapid head movements.

Lighting Invariance: Utilized intensity contrast descriptors to maintain accuracy in challenging lighting conditions (e.g., nighttime driving, validated on the NCKU-driver database).

3. Proactive Safety Trigger (System Level)

Multimodal Logic: Designed the visual module to work in tandem with ECG sensors. While ECG detects physiological drops in alertness, the visual module confirms behavioral lapses (e.g., nodding off).

Impact: This cross-verification significantly reduces false positives, ensuring that alerts are only triggered for genuine safety risks.

Outcomes & Validation

- Industry Collaboration: Co-developed with ARTC.
- Award-Winning: Secured second place at the International ICT Innovative Services Awards.
- Performance: Achieved real-time performance and superior accuracy over state-of-the-art baselines in nighttime scenarios.
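As a simplified illustration of the per-trip calibration and the vision-plus-ECG trigger: the published system uses sparse representation-based classification, so the distance-based baseline below is only a stand-in for the relative-deviation idea, and the deviation factor is a placeholder.

```python
# Simplified sketch of per-trip calibration and the multimodal alert trigger.
import numpy as np

class TripMonitor:
    def __init__(self, deviation_factor: float = 2.5):
        self.baseline = None            # personalized normal-driving descriptor (PNDM)
        self.spread = None
        self.factor = deviation_factor

    def calibrate(self, descriptors: np.ndarray):
        """Learn this trip's appearance from the short calibration sequence, shape (n, d)."""
        self.baseline = descriptors.mean(axis=0)
        self.spread = descriptors.std(axis=0).mean() + 1e-6

    def visual_lapse(self, descriptor: np.ndarray) -> bool:
        """Flag frames that deviate strongly from the trip-specific baseline."""
        return np.linalg.norm(descriptor - self.baseline) / self.spread > self.factor

    def should_alert(self, descriptor: np.ndarray, ecg_fatigue: bool) -> bool:
        """Cross-verification: alert only when vision AND physiology agree."""
        return bool(self.visual_lapse(descriptor)) and ecg_fatigue
```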
Resources

Publications:

- Wei-Cheng Wang, Ru-Yun Hsu, Chun-Rong Huang, Li-You Syu (2015). Video gender recognition using temporal coherent face descriptor. IEEE/ACIS SNPD 2015.
- Chien-Yu Chiou, Wei-Cheng Wang, Shueh-Chou Lu, Chun-Rong Huang, Pau-Choo Chung, Yun-Yang Lai (2019). Driver Monitoring Using Sparse Representation With Part-Based Temporal Face Descriptors. IEEE Transactions on Intelligent Transportation Systems.

Research Engineering

"Where theoretical rigor meets production constraints."

This section showcases my work in translating complex research algorithms into robust, deployable systems. Here, the focus is on performance, reliability, and architectural precision.