Parallel corpora are a crucial resource for many Natural Language Processing (NLP) systems, such as statistical machine translation and cross-language information retrieval. Obtaining such corpora manually is very costly, yet large amounts of parallel text are available on the Web in various forms, such as the pages of bilingual web sites. Automatically extracting parallel texts from the Web has therefore become an important task in NLP research. In this paper, the authors develop a new approach based on extending the definition of parallel texts to match translation segments, which helps to extract proper translation units from bilingual web pages. The authors also formulate the task as a classification problem and use two kinds of knowledge resources: the structural information of web pages and the translation information between the two languages. Experiments conducted on the English-Vietnamese language pair show significant results.