Solar forecasting has emerged as a cost-effective technology for mitigating the negative impacts of intermittent solar power on the power grid. Despite the multitude of deep learning methods available for forecasting solar irradiance, there is a notable research gap concerning the automated selection and holistic use of multi-modal features for ultra-short-term regional irradiance forecasting. Our study introduces SolarFusionNet, a novel deep learning architecture that integrates automatic multi-modal feature selection with cross-modal data fusion. SolarFusionNet employs two distinct types of automatic variable selection units to extract relevant features from multi-channel satellite images and multivariate meteorological data, respectively. Long-term dependencies are then captured by three types of recurrent layers, each tailored to its corresponding data modality. In particular, a novel Gaussian kernel-injected convolutional long short-term memory (ConvLSTM) network is designed to isolate the sparse features of the cloud motion field derived from optical flow. A hierarchical multi-head cross-modal self-attention mechanism, built on the physical-logical dependencies among the three modalities, then captures their coupling correlations. Experimental results show that SolarFusionNet delivers robust regional solar irradiance predictions, achieving higher accuracy than other state-of-the-art models and a forecast skill of 37.4% to 47.6% over the smart persistence model for 4-hour-ahead forecasts.
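
The abstract does not specify how the Gaussian kernel is injected into the ConvLSTM. The sketch below is one plausible reading in PyTorch, in which a fixed (non-trainable) depthwise Gaussian filter smooths the sparse optical-flow cloud-motion input before the gate convolutions. The class and function names (GaussianConvLSTMCell, gaussian_kernel2d) and the injection point are illustrative assumptions, not the authors' published code.

```python
# Hypothetical sketch of a Gaussian kernel-injected ConvLSTM cell.
# Assumption: "kernel injection" means a fixed depthwise Gaussian
# smoothing of the cloud-motion input ahead of the recurrent gates.
import torch
import torch.nn as nn
import torch.nn.functional as F


def gaussian_kernel2d(size: int = 5, sigma: float = 1.0) -> torch.Tensor:
    """Return a normalized 2-D Gaussian kernel of shape (size, size)."""
    coords = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    g = torch.exp(-coords**2 / (2 * sigma**2))
    kernel = torch.outer(g, g)
    return kernel / kernel.sum()


class GaussianConvLSTMCell(nn.Module):
    def __init__(self, in_ch: int, hid_ch: int, ksize: int = 3, sigma: float = 1.0):
        super().__init__()
        self.hid_ch = hid_ch
        # Fixed (non-trainable) depthwise Gaussian filter for the input.
        k = gaussian_kernel2d(5, sigma).repeat(in_ch, 1, 1).unsqueeze(1)
        self.register_buffer("gauss", k)  # shape: (in_ch, 1, 5, 5)
        # One convolution produces all four ConvLSTM gates at once.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, ksize, padding=ksize // 2)

    def forward(self, x, state):
        h, c = state
        # Inject the Gaussian prior: depthwise smoothing of the sparse
        # cloud-motion field before the gate computation.
        x = F.conv2d(x, self.gauss, padding=2, groups=x.size(1))
        i, f, g, o = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c


# Usage: roll the cell over a sequence of 2-channel optical-flow frames.
cell = GaussianConvLSTMCell(in_ch=2, hid_ch=16)
h = torch.zeros(1, 16, 64, 64)
c = torch.zeros(1, 16, 64, 64)
for t in range(8):
    frame = torch.randn(1, 2, 64, 64)  # (u, v) cloud-motion components
    h, c = cell(frame, (h, c))
```

Keeping the Gaussian filter as a frozen buffer rather than a learnable weight is what makes it an injected prior here: it biases the cell toward spatially coherent motion features without adding trainable parameters.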