Documents are well suited for information exchange via the Internet. To ensure that there are no misunderstandings, the information embedded in a document needs to be precise and unambiguous. Having a (de facto) standard data model and conceptual information model ensures that the parties involved agree on what the information means. XML (eXtensible Markup Language) has become the de facto standard format for representing information in documents for document exchange. Many techniques have been proposed for creating, validating, and transforming XML documents. However, little has been written about extracting information from non-XML documents and engineering that information into XML documents. The extraction process can be highly labor-intensive if it is done manually, and automated tools would make it more efficient. In this chapter, the author briefly surveys document engineering techniques for XML documents and then presents two techniques for extracting data from Windows documents into XML documents. These two techniques have been successfully applied in two industrial projects. The author believes that techniques that automate the extraction of data from non-XML documents into XML formats will enhance the use of XML documents.
Glushko and McGrath (2005) define documents in general terms as follows:
“Document in a technology-neutral way as a purposeful and self-contained collection of information.”
Organizations should think of documents in an abstract and technology-neutral way. Documents should be exchanged flexibly via the Internet without concern for how they are transmitted. A document should also be considered a self-contained package of related information that can effectively organize business functions for use by other organizations. In addition, the interfaces that organizations use to process documents should be kept as minimal and simple as possible. More importantly, software tools should make document creation quick and efficient. One way to create such documents on the Internet is through the use of XML, which is a universal, text-based, and self-describing data format. Almost every organization has computers and software tools that can process XML.
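To make the "text-based and self-describing" property concrete, the following is a minimal sketch using Python's standard `xml.etree.ElementTree` module. The order document, its element names (`order`, `customer`, `items`, `item`), and the helper functions are hypothetical and chosen only for illustration; they are not taken from the chapter or from any particular XML vocabulary.

```python
import xml.etree.ElementTree as ET


def build_order_document(order_id, customer, items):
    """Build a small, self-contained XML document from plain Python values.

    The element and attribute names are illustrative assumptions.
    """
    root = ET.Element("order", attrib={"id": order_id})
    ET.SubElement(root, "customer").text = customer
    items_el = ET.SubElement(root, "items")
    for name, quantity in items:
        item_el = ET.SubElement(items_el, "item", attrib={"quantity": str(quantity)})
        item_el.text = name
    # encoding="unicode" returns a str rather than bytes
    return ET.tostring(root, encoding="unicode")


def read_order_document(xml_text):
    """Recover the same values using only the tag and attribute names in the text."""
    root = ET.fromstring(xml_text)
    return {
        "id": root.get("id"),
        "customer": root.findtext("customer"),
        "items": [(el.text, int(el.get("quantity"))) for el in root.iter("item")],
    }


xml_text = build_order_document("A-100", "Example Corp", [("widget", 3), ("gasket", 12)])
print(xml_text)
print(read_order_document(xml_text))
```

Because the tags name the data they contain, a receiving organization can read the document with any standard XML parser and needs no knowledge of the software that produced it.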