Air pollution, driven by the global rise in industrialization and urbanization, poses a serious hazard to human health and the environment. Fine-grained monitoring is crucial for understanding the formation and control of air pollution and its effects on human health. However, existing macro-regional and ground-level methods infer air pollution at a single spatial scale and fail to capture the spatiotemporal correlations between coarse-grained and fine-grained air pollution distributions. In this paper, we propose a 3D spatiotemporal attention super-resolution model (AirSTFM) for fine-grained air pollution inference at the level of a large-scale region. First, we design a 3D-patch-wise self-attention convolutional module to extract the spatiotemporal features of air pollution; it aggregates both the spatial and the temporal information of coarse-grained air pollution and employs a sliding window to add local spatial features. Then, we propose a bidirectional optical flow feed-forward layer to extract short-term air pollution diffusion characteristics by learning the temporal correlation of contaminant diffusion between neighboring time intervals. Finally, we construct a spatiotemporal super-resolution upsampling pretext task to model the higher-level mapping of dispersion features between the coarse-grained and fine-grained air pollution distributions. The proposed method is tested on a PM2.5 pollution dataset of the Yangtze River Delta region. In the 100% data division, our model outperforms the second-best model in RMSE, MAE, and MAPE by 2.6%, 3.05%, and 6.36%, respectively; in the 40% data division, the corresponding margins are 3.86%, 3.76%, and 12.18%, which demonstrates the applicability of our model to different data sizes. Overall, comprehensive experimental results show that the proposed AirSTFM outperforms state-of-the-art models.
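The error metrics reported above (RMSE, MAE, and MAPE) can be illustrated with a minimal sketch. It assumes the standard textbook definitions and assumes that each improvement is the relative error reduction with respect to the second-best model; the abstract states neither explicitly. The function names and the toy values are hypothetical and are not taken from the experiments.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error over paired observations and predictions."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


def mae(y_true, y_pred):
    """Mean absolute error over paired observations and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent.

    Assumes every observed value is non-zero.
    """
    return 100.0 * sum(
        abs((t - p) / t) for t, p in zip(y_true, y_pred)
    ) / len(y_true)


def relative_improvement(baseline_error, model_error):
    """Error reduction relative to a baseline, in percent.

    Assumption: this is how an 'outperforms by x%' figure is computed.
    """
    return 100.0 * (baseline_error - model_error) / baseline_error


# Toy PM2.5 values (hypothetical, not from the paper).
observed = [10.0, 20.0, 40.0]
predicted = [12.0, 18.0, 44.0]

print("RMSE:", rmse(observed, predicted))
print("MAE:", mae(observed, predicted))
print("MAPE (%):", mape(observed, predicted))
```

Under this assumed convention, a baseline RMSE of 4.0 and a model RMSE of 3.0 would correspond to `relative_improvement(4.0, 3.0)`, a 25% reduction.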