To expose the risk of privacy leakage and to motivate more sophisticated protection methods for air-typing with XR devices, this paper proposes AirtypeLogger, a new approach towards practical video-based attacks on the air-typing activities of XR users in virtual space. Unlike existing approaches, AirtypeLogger considers a scenario in which users occasionally type short, semantically meaningful text fragments while being observed by video cameras. It detects and localizes air-typing events in video streams and introduces a spatial-temporal representation that encodes the keystrokes' relative positions and temporal order. High-precision inference is then achieved by applying a Transformer-based network to the spatial and temporal encodings of the keystroke sequences. In our extensive real-world experiments, AirtypeLogger achieves a Character Error Rate (CER) of less than 0.1 as long as 7 air-typing events are observed, which is impossible for previous approaches that require long-term observation of the typing activities online before launching inference attacks. The implementation details and source code can be found at https://github.com/ztysdu/AirtypeLogger.
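For readers unfamiliar with the metric, the following is a minimal sketch of how a CER is commonly computed, assuming the standard definition (Levenshtein edit distance between the inferred and the ground-truth text, divided by the ground-truth length). The text above does not spell out the exact definition used in the experiments, and the function names here are illustrative rather than taken from the AirtypeLogger repository.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum number of character insertions,
    deletions, and substitutions turning `hypothesis` into `reference`."""
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,                      # delete a reference character
                curr[j - 1] + 1,                  # insert a hypothesis character
                prev[j - 1] + substitution_cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER under the assumed standard definition: edit distance
    normalized by the length of the ground-truth (reference) text."""
    if not reference:
        raise ValueError("reference text must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Under this definition, inferring "passw0rd" for the typed text "password" gives a CER of 1/8 = 0.125, so a CER below 0.1 corresponds to fewer than one wrong character per ten typed characters on average.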