Extracting Mathematical Expressions From Postscript Documents

M. Yang, R. Fateman


Full-text indexing of documents containing mathematics cannot be considered a complete success unless the mathematical symbolism is extracted and represented in a standardized form that permits both searching for formulas and further use of this information in, for example, computer algebra systems. Most documents produced in the past and subsequently digitally encoded, and even most of those potentially ``born digital'' in current journal production, are at best encoded in a printer format such as Adobe Postscript \cite{Postscript}, in which mathematics is not explicitly marked or easily identifiable. While one might look forward to other document encodings such as MathML, the current journal or textbook product is essentially without semantic content: a jumble of odd characters. We demonstrate an approach to decoding a Postscript document in order to recognize and extract the mathematical expressions it contains.

