Binbing Liao, Jingqing Zhang, Chao Wu, Douglas McIlwraith, Tong Chen, Shengwen Yang, Yike Guo, Fei Wu
The dataset has been updated and is now available at BaiduNetDisk (code: umqd). Backup link.
If you downloaded the old dataset, we strongly suggest that you re-download the updated one. The old dataset at Baidu Research Open-Access Dataset (BROAD) contains some duplicated hashed_link_id values caused by the hash function. The hashed_link_id has therefore been removed from the updated dataset, which now uses the link_id directly, consistent with the intermediate_files.
The intermediate data files (after pre-processing) are available at intermediate_files, so you can directly train the model now.
Please feel free to raise an issue if you have any questions.
This sub-dataset was collected in Beijing, China, between April 1, 2017 and May 31, 2017, from Baidu Map. The detailed pre-processing of this sub-dataset is described in the paper. The query sub-dataset contains about 114 million user queries, each of which records the starting time-stamp, the coordinates of the starting location, the coordinates of the destination, and the estimated travel time (minutes). Some query samples are shown below:
2017-04-01 19:42:23, 116.88 37.88, 116.88 37.88, 33
2017-04-01 18:00:05, 116.88 37.88, 116.88 37.88, 33
2017-04-01 01:14:08, 116.88 37.88, 116.88 37.88, 33
..., ..., ..., ...
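The following is a minimal sketch of how one query record could be parsed, assuming the comma-separated layout of the samples above. The `Query` type, the function names, and the (longitude, latitude) ordering of each coordinate pair are assumptions made for illustration; they are not taken from the released code.

```python
from datetime import datetime
from typing import NamedTuple, Tuple


class Query(NamedTuple):
    """One map query (hypothetical container, not part of the dataset tools)."""
    start_time: datetime
    origin: Tuple[float, float]       # assumed order: (lon, lat)
    destination: Tuple[float, float]  # assumed order: (lon, lat)
    travel_time_min: float            # estimated travel time in minutes


def parse_point(text: str) -> Tuple[float, float]:
    """Parse a space-separated coordinate pair such as '116.88 37.88'."""
    first, second = text.split()
    return float(first), float(second)


def parse_query(line: str) -> Query:
    """Parse one line with four comma-separated fields."""
    time_text, origin_text, dest_text, eta_text = [
        field.strip() for field in line.split(",")
    ]
    return Query(
        start_time=datetime.strptime(time_text, "%Y-%m-%d %H:%M:%S"),
        origin=parse_point(origin_text),
        destination=parse_point(dest_text),
        travel_time_min=float(eta_text),
    )


sample = "2017-04-01 19:42:23, 116.88 37.88, 116.88 37.88, 33"
query = parse_query(sample)
print(query.start_time, query.origin, query.travel_time_min)
```

With about 114 million records, a streaming or chunked reader is likely a better fit than loading every parsed record into memory at once.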
We also collected traffic speed data for the same area and the same time period as the query sub-dataset. This sub-dataset contains 15,073 road segments covering approximately 738.91 km. Figure 1 shows the spatial distribution of these road segments.
Figure 1. Spatial distribution of the road segments in Beijing
All of them lie within the 6th ring road (bounded by the lon/lat box <116.10, 39.69, 116.71, 40.18>), which is the most crowded area of Beijing. The traffic speed of each road segment is recorded every minute. To make the traffic speed more predictable, we smooth the speed series of each road segment with a simple moving average over a 15-minute window and then sample the smoothed speed every 15 minutes.
Thus, there are 5856 time steps in total (61 days × 24 hours × 4 samples per hour), and each record consists of road_segment_id, time_stamp ([0, 5856)) and traffic_speed (km/h).
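The smoothing step can be sketched as follows. This is an illustration under stated assumptions, not the authors' pre-processing code: the window alignment (each average covers 15 consecutive one-minute readings) and the function names are hypothetical, and the paper should be consulted for the exact procedure.

```python
def moving_average(values, window):
    """Simple moving average: output[i] is the mean of values[i : i + window]."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def smooth_and_sample(per_minute_speeds, window=15, step=15):
    """Smooth per-minute speeds, then keep one smoothed value every `step` minutes."""
    smoothed = moving_average(per_minute_speeds, window)
    return smoothed[::step]


# Toy series: one hour of per-minute speeds 0.0, 1.0, ..., 59.0 (km/h).
per_minute = [float(minute) for minute in range(60)]
print(smooth_and_sample(per_minute))  # [7.0, 22.0, 37.0, 52.0]
```

Under this alignment, 61 days of per-minute readings (61 × 24 × 60 values) yield exactly 5856 samples per road segment, which matches the time_stamp range [0, 5856).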
Some traffic speed samples are shown below:
15257588940, 0, 42.1175
..., ..., ...
15257588940, 5855, 33.6599
1525758913, 0, 41.2719
..., ..., ...
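A record in this format can be read and its time_stamp index converted back to a wall-clock time as sketched below. The sketch assumes that index 0 corresponds to 2017-04-01 00:00 local time and that consecutive indices are 15 minutes apart; both are inferred from the collection period and the sampling rate, and the helper names are hypothetical.

```python
from datetime import datetime, timedelta

START = datetime(2017, 4, 1)   # assumed time of index 0
STEP = timedelta(minutes=15)   # sampling interval after smoothing
NUM_STEPS = 5856               # 61 days * 24 hours * 4 samples per hour


def parse_speed_record(line):
    """Parse 'road_segment_id, time_stamp, traffic_speed' into typed values."""
    segment_text, index_text, speed_text = [
        field.strip() for field in line.split(",")
    ]
    return int(segment_text), int(index_text), float(speed_text)


def index_to_datetime(index):
    """Map a time_stamp index in [0, NUM_STEPS) to the start of its interval."""
    if not 0 <= index < NUM_STEPS:
        raise ValueError("time_stamp index out of range")
    return START + index * STEP


segment_id, time_index, speed_kmh = parse_speed_record("15257588940, 0, 42.1175")
print(segment_id, index_to_datetime(time_index), speed_kmh)
print(index_to_datetime(5855))  # 2017-05-31 23:45:00
```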
Due to the spatio-temporal dependencies of traffic data, the topology of the road network would help to predict traffic. Table 1 shows the fields of the road network sub-dataset.
Table 1. Examples of geographical attributes of each road segment.
For each road segment in the traffic speed sub-dataset, the road network sub-dataset provides the starting node (snode) and ending node (enode) of the road segment, based on which the topology of the road network can be built. In addition, the sub-dataset also provides various geographical attributes of each road segment, such as width, length, speed limit and the number of lanes. Furthermore, we also provide the social attributes such as weekdays, weekends, public holidays, peak hours and off-peak hours.
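One way to build the topology from snode and enode is to treat segment B as a downstream neighbour of segment A whenever A's ending node equals B's starting node. The sketch below assumes each road segment is available as a `(segment_id, snode, enode)` tuple; the tuple layout, the function name, and the toy IDs are illustrative and do not reflect the actual file format.

```python
from collections import defaultdict


def build_successors(segments):
    """Return {segment_id: [ids of segments that start where this one ends]}."""
    segments = list(segments)

    # Index segments by their starting node for constant-time lookup.
    starting_at = defaultdict(list)
    for segment_id, snode, _enode in segments:
        starting_at[snode].append(segment_id)

    successors = {}
    for segment_id, _snode, enode in segments:
        successors[segment_id] = [
            other for other in starting_at.get(enode, []) if other != segment_id
        ]
    return successors


# Toy network with made-up IDs: a -> {b, c}, b -> d, d -> a.
toy_segments = [("a", 1, 2), ("b", 2, 3), ("c", 2, 4), ("d", 3, 1)]
graph = build_successors(toy_segments)
print(graph)  # {'a': ['b', 'c'], 'b': ['d'], 'c': [], 'd': ['a']}
```

Upstream neighbours can be obtained in the same way by indexing segments on enode instead of snode.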
Table 2 compares different datasets for traffic speed prediction. In the past few years, researchers have performed experiments with small and/or private datasets. The release of Q-Traffic, a large-scale publicly available dataset with offline (geographical and social attributes, road network) and online (crowd map queries) information, should help to advance research on traffic prediction.
Table 2. Comparison of different datasets for traffic speed prediction.
The source code has been tested with:
- Python 3.5
- TensorFlow 1.3.0
- TensorLayer 1.7.3
- numpy 1.14.0
- pandas 0.21.0
- scikit-learn 0.19.1
The structure of code:
- model.py: Implementation of deep learning models
- train.py: Implementation of controllers for training and testing
- baselines.py: Implementation of baseline models including RF and SVR
- dataloader.py: Data processing and loading, subject to change due to data format if necessary
- preprocessing: Data preprocessing and cleaning
- others: utilities, playground, logging, data preprocessing
If you use our dataset, please cite the following publication:
@inproceedings{bbliaojqZhangKDD18deep,
title = {Deep Sequence Learning with Auxiliary Information for Traffic Prediction},
author = {Binbing Liao and Jingqing Zhang and Chao Wu and Douglas McIlwraith and Tong Chen and Shengwen Yang and Yike Guo and Fei Wu},
booktitle = {Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
pages = {537--546},
year = {2018},
organization = {ACM}
}