China’s 4-Stage Approach to End-to-end Autonomous Driving

Shuai Chen

Autonomous driving system architecture evolution, source: Cherish Capital

Every year, the China autonomous driving community refreshes its buzzwords. Phrases such as “city NOA”, “mapless”, and “BEV + Transformer” dominated articles and speeches in 2023, while “end-to-end” has been the phrase widely discussed this year.

In fact, these buzzwords are closely related. NOA (Navigation on Autopilot) is a widely adopted benchmark in China that replaces the sometimes-confusing “level 2” and “level 2+” labels, and it is split into highway NOA and city NOA by road application. For easier understanding, highway NOA is similar to Tesla Enhanced Autopilot and city NOA is similar to Tesla FSD. As highway NOA has been largely solved and deployed at scale, city NOA has become the arena for auto OEMs, including Tesla.

Highway NOA is commonly built on HD maps; keeping such maps fresh is still feasible for highways but very challenging to do at a sustainable cost for city roads. City NOA therefore has to be a “mapless” solution to be deployed in mass-production vehicles. The community first turned to Transformer-based BEV (Bird's Eye View) in the perception layer, which has proven highly effective at aggregating the context of different sensor inputs in a unified space.

However, seeing the world better doesn't necessarily lead to planning a better trajectory for the vehicle. The prediction & planning layer traditionally relies on sophisticated rule-based designs that become highly inefficient as the NOA feature rolls out to more cities in China. Hence the growing trend toward a learning-based approach: the end-to-end paradigm.

Although the concept of end-to-end autonomous driving originated in academia with ALVINN back in 1988, Nvidia made it popular in 2016. Nvidia designed and trained a prototype end-to-end CNN system with a front camera as input and steering commands as output. The team demonstrated that with minimal training data from humans, the system learned to steer, with or without lane markings, on both local roads and highways.
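For intuition, here is a minimal PyTorch sketch of that kind of behavior-cloning CNN: a front-camera frame in, a single steering command out. The layer sizes are illustrative, loosely following the published PilotNet layout rather than Nvidia's exact architecture.

```python
# Minimal sketch of a PilotNet-style behavior-cloning network (illustrative
# layer sizes): one front-camera frame in, one steering command out.
import torch
import torch.nn as nn

class SteeringCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 24, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(24, 36, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(36, 48, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(48, 64, kernel_size=3), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(100), nn.ReLU(),
            nn.Linear(100, 50), nn.ReLU(),
            nn.Linear(50, 1),              # single steering angle
        )

    def forward(self, image):              # image: (B, 3, 66, 200) as in PilotNet
        return self.head(self.features(image))

# Trained by regressing the human driver's steering angle (behavior cloning):
# loss = F.mse_loss(model(camera_frame), recorded_steering)
```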

Modular vs. End-to-end Architecture

Inspired by robotics architectures, a conventional autonomous driving system usually consists of several interconnected modules that are developed independently, each corresponding to a specific task. For example, Baidu Apollo 3.0 is designed with four main blocks: perception, prediction, planning, and control (exhibit 1).

Exhibit 1: Apollo software architecture diagram, source: Apollo Auto

Each company varies and evolves its own modular structure. There may be more or fewer blocks, in a different order or under different names, but two modules are always present: perception and planning.

The conventional architecture feeds the output of one block directly into the next, which keeps the R&D sequence structured and makes debugging easier. However, since each block has its own tasks and objectives, the entire system may not be optimized toward a unified target such as the ultimate driving experience. Errors can compound from module to module, and information is lost at each interface.

In contrast, the idea of end-to-end is to combine all modules into a single neural network with raw sensor input and planning/control output (exhibit 2). The benefits of this approach include, but are not limited to, joint feature optimization and computational efficiency.

Exhibit 2: Wayve end-to-end autonomous driving system, source: Wayve.ai
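The contrast can be summarized in a few lines of Python; the module names and interfaces below are hypothetical, not any particular company's stack.

```python
# Illustrative contrast between the two paradigms; all module names and
# interfaces are assumptions for the sake of the example.

def modular_drive(sensor_data, perception, prediction, planning, control):
    """Conventional pipeline: each block hands human-defined outputs to the next."""
    objects = perception(sensor_data)        # e.g. boxes, lanes, traffic lights
    futures = prediction(objects)            # each block has its own objective
    trajectory = planning(objects, futures)  # errors compound across interfaces
    return control(trajectory)

def end_to_end_drive(sensor_data, driving_policy):
    """End-to-end: one neural network, raw sensors in, plan/control out."""
    return driving_policy(sensor_data)       # trained against the final driving objective
```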

As auto OEMs and intelligent-driving Tier 1s in China race to publicize their paradigm shift, there is ongoing debate about the fundamental principles of an end-to-end system.

End-to-end Definition and 4 Stages

Cherish Capital, a fund management company focusing on EV value chains, recently led the drafting of a report on end-to-end autonomous driving based on interviews with over 30 experts. The report identifies four stages in the evolution of autonomous driving architecture from modular to end-to-end (exhibit 3). The final stage is a fully end-to-end algorithm framework, while the intermediate stages, in which modules and/or rules still exist in some form, are necessary to make the transition.

Exhibit 3: Autonomous driving system architecture evolution, source: Cherish Capital

The report defines the core of end-to-end autonomous driving as global optimization of the entire system and lossless transmission of perception information. Systems in stages 3 and 4 meet this core definition and are therefore counted as end-to-end autonomous driving.

Stage 1 is end-to-end perception, where the perception module has achieved multi-sensor fusion in the BEV space. By adopting the Transformer and its cross-attention mechanism, the accuracy and stability of detection results are greatly improved. The module for prediction, decision, and planning is still rule-based. This is the phase where most of the Chinese companies currently marketing end-to-end autonomous driving actually are.
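As a rough illustration of stage 1, the sketch below shows learnable BEV queries cross-attending to flattened multi-camera features; all shapes, sizes, and names are assumptions, and the human-readable detections it produces would still feed a rule-based planner.

```python
# A minimal PyTorch sketch of stage-1, Transformer-based BEV perception:
# learnable BEV queries cross-attend to flattened multi-camera image tokens,
# and a detection head emits per-cell outputs for a rule-based planner.
import torch
import torch.nn as nn

class BEVPerception(nn.Module):
    def __init__(self, bev_cells=50 * 50, dim=256, num_heads=8, out_dim=10):
        super().__init__()
        self.bev_queries = nn.Parameter(torch.randn(bev_cells, dim))   # one query per BEV cell
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.det_head = nn.Linear(dim, out_dim)                        # e.g. box/occupancy per cell

    def forward(self, camera_features):
        # camera_features: (B, N_cams * H * W, dim) image tokens from all cameras
        b = camera_features.shape[0]
        queries = self.bev_queries.unsqueeze(0).expand(b, -1, -1)
        bev, _ = self.cross_attn(queries, camera_features, camera_features)
        return self.det_head(bev)   # human-defined outputs feed the rule-based planner
```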

Stage 2 is model-based planning, in which the functionalities of prediction, decision, and planning are integrated into one neural network while the perception module remains unchanged. In this phase both perception and planning are based on deep learning, but the interface between the two is still a human-defined interpretation of the world, and each block has to be trained independently.

Compared to stage 1, a model-based planner makes data-driven optimization possible simply by scaling training resources, theoretically improving the system's ability to handle complex driving scenarios. However, the issues of modular design still exist, such as error propagation and unaligned optimization objectives across modules.
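A stage-2 planner might look roughly like the following sketch, in which every input is a human-defined perception output; the names, dimensions, and trajectory format are purely illustrative.

```python
# A minimal sketch of a stage-2 learned planner (illustrative names and sizes):
# it consumes the perception module's human-defined outputs (detected agents,
# an encoded lane/map context, and the ego state) and regresses a trajectory.
# Perception and planning are still trained independently.
import torch
import torch.nn as nn

class LearnedPlanner(nn.Module):
    def __init__(self, num_agents=32, agent_dim=8, map_dim=64, ego_dim=4, horizon=30):
        super().__init__()
        self.horizon = horizon
        self.encoder = nn.Sequential(
            nn.Linear(num_agents * agent_dim + map_dim + ego_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        self.traj_head = nn.Linear(256, horizon * 2)   # one (x, y) waypoint per step

    def forward(self, agents, map_context, ego_state):
        # agents:      (B, num_agents, agent_dim) boxes/velocities from perception
        # map_context: (B, map_dim)               encoded lanes from perception
        # ego_state:   (B, ego_dim)               speed, yaw rate, etc.
        x = torch.cat([agents.flatten(1), map_context, ego_state], dim=-1)
        return self.traj_head(self.encoder(x)).view(-1, self.horizon, 2)
```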

Stage 3 is modular end-to-end. Perception and planning remain two communicating modules, but they are jointly trained in an end-to-end manner toward the ultimate driving task, ensuring minimal information loss. To achieve that, the perception block outputs learned feature representations instead of human-defined representations of the world, and the interface supports back-propagation across the module boundary.
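Conceptually, a stage-3 training step could look like the sketch below, with both modules (placeholders here) updated by one backward pass through a latent, differentiable interface.

```python
# An illustrative sketch of stage-3 modular end-to-end training: perception
# still hands something to the planner, but it is a latent feature tensor
# rather than human-defined boxes, and a single backward pass updates both
# modules against the planning loss. The two networks are assumed placeholders.
import torch
import torch.nn.functional as F

def joint_train_step(perception_net, planning_net, optimizer, sensors, expert_traj):
    # `optimizer` holds the parameters of BOTH modules
    bev_features = perception_net(sensors)       # latent features, not boxes
    planned_traj = planning_net(bev_features)    # differentiable interface
    loss = F.smooth_l1_loss(planned_traj, expert_traj)
    optimizer.zero_grad()
    loss.backward()                              # gradients flow back through perception
    optimizer.step()                             # joint optimization, minimal information loss
    return loss.item()
```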

Stage 4 is the ultimate “one model”. At this final stage of the evolution, perception and planning are combined into a single deep learning network that is optimized globally to achieve superior performance. Like the concept established by ALVINN and later Nvidia, the one model takes raw sensor input and generates ego-motion plans, but it is far more sophisticated than those research demos. The model can be trained via reinforcement learning or imitation learning, the two existing methodologies, while world-model learning is a promising direction that still needs to be explored and validated.


Shuai Chen

Bridging the West and China Innovations in ADAS & Autonomous Driving | B2B Business Development | Go-To-Market Strategies & Execution (schen583@gmail.com)