Language Independent System for Document Context Extraction
Related papers
IJERT-A Paper on Approaches for Information Extraction from Unstructured Text
International Journal of Engineering Research and Technology (IJERT), 2014
https://www.ijert.org/a-paper-on-approaches-for-information-extraction-from-unstructured-text
https://www.ijert.org/research/a-paper-on-approaches-for-information-extraction-from-unstructured-text-IJERTV3IS051383.pdf
In today's computing world, everything is moving online: e-business, e-libraries, e-malls, and so on. This shift produces a vast number of electronic documents on the Internet, consisting of text, images, and other media. Finding the data of interest within this mass of available data is a complex task, and information extraction systems are needed to perform it. This paper provides background on information extraction from unstructured data, i.e., data mining and web content mining. It surveys the available approaches for unstructured data extraction, discusses their shortcomings, and concludes with ideas for overcoming these issues.
Leveraging corporate context within knowledge-based document analysis and understanding
International Journal on Document Analysis and Recognition, 2001
Knowledge-based systems for document analysis and understanding (DAU) are quite useful whenever analysis has to deal with changing free-form document types that require different analysis components. In this case, declarative modeling is a good way to achieve flexibility. An important application domain for such systems is business letters, where high accuracy and correct assignment to the right people and the right processes is a crucial success factor. Our solution proposes a comprehensive knowledge-centered approach: we model not only comparatively static knowledge about document properties and analysis results within the same declarative formalism, but also the analysis task and the current context of the system environment. This allows easy definition of new analysis tasks, as well as efficient and accurate analysis that uses expectations about incoming documents as context information. The approach has been implemented in the VOPR (VOPR is an acronym for the Virtual Office PRototype.) system. This DAU system obtains the required context information from a commercial workflow management system (WfMS) through a constant exchange of expectations and analysis tasks. Further interaction between the two systems covers the delivery of results from DAU to the WfMS and the delivery of corrected results in the other direction.
The Internet is a great source of information. Semi-structured text documents represent a great part of that information; commercial data sheets of the Information Technology domain are among them (e.g., laptop computer data sheets). However, the capability to automatically gather and manipulate such information is still limited because those documents are designed to be read by people. Documents in domains such as Information Technology describe commercial products in data sheets with technical specifications, which people use mainly to make buying decisions. Commercial data sheets are a kind of data-rich document: they are characterized by heterogeneous format, specialized terminology, names, abbreviations, acronyms, quantities, magnitudes, units, and limited use of complex natural language structures.

This thesis presents an information extractor for data-rich documents based on a lexicalized domain ontology built mainly with meronymy and attribute relationships. Ontology concepts were manually lexicalized with words and n-word terms in English. The extraction process is mainly composed of a fuzzy string matcher module and a term disambiguator based on ideas from the word sense disambiguation (WSD) problem in the natural language processing (NLP) field. The former is used to find references to ontology concepts allowing an error margin, and the latter is used to choose the best concept and use (or sense) associated with each referenced concept in a data-rich document according to its context. A domain ontology, a lexicon, and a labeled corpus were manually constructed for the laptop computer domain using data sheets downloaded from the web sites of three of the most popular computer makers. The ontology had more than 350 concepts, and the lexicon had more than 380 entries with at least 1500 different related terms. The validation corpus consisted of five selected data-sheet documents with more than 5000 tokens in total, of which 2300 were extraction targets. The ontology and lexicon were built from a set of 30 laptop data sheets; a subset of them was manually annotated with unambiguous semantic labels in order to be used for validation.

The approximate text string-matching problem using static measures was reviewed. Static string measures are those that compare two strings algorithmically using only the information contained in the two strings. Additionally, a comparative study of different string-matching techniques at character level and at token level was conducted using well-known data sets from the literature. In particular, a new general method for combining any string measure at character level with resemblance coefficients (e.g., Jaccard, Dice, and cosine) was proposed, and encouraging results were obtained in the proposed experiments.

On the other hand, to develop the term disambiguator, the concept of the semantic path was proposed. A semantic path is a chain of concepts that connects the ontology root concept with a terminal concept (or sink node) in the ontology's directed acyclic graph. Semantic paths have three uses: (i) to define a label type for unambiguous semantic document annotation, (ii) to determine the use/sense inventory for each ontology concept, and (iii) to serve as the comparison unit for semantic relatedness measures. The proposed disambiguation technique was inspired by WSD methods based on general lexicalized ontologies such as WordNet. Additionally, a new global context-graph optimization criterion for disambiguation (i.e., shortest path) was proposed.
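To make the string-matching idea concrete, here is a minimal, self-contained sketch of one way to combine a character-level measure (normalized Levenshtein similarity) with a token-level resemblance coefficient (Jaccard). The greedy soft-matching of tokens and the 0.8 threshold are illustrative assumptions, not the thesis' actual combination method.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via dynamic programming (two-row version)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def char_sim(a: str, b: str) -> float:
    """Normalized character-level similarity in [0, 1]."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def soft_jaccard(s1: str, s2: str, threshold: float = 0.8) -> float:
    """Token-level Jaccard where two tokens count as matched when their
    character-level similarity reaches `threshold` (greedy pairing)."""
    t1, t2 = s1.lower().split(), s2.lower().split()
    remaining = list(t2)
    matches = 0
    for tok in t1:
        best = max(remaining, key=lambda u: char_sim(tok, u), default=None)
        if best is not None and char_sim(tok, best) >= threshold:
            matches += 1
            remaining.remove(best)
    union = len(t1) + len(t2) - matches  # |A| + |B| - |A intersect B|
    return matches / union if union else 1.0

# Misspelled datasheet text still scores well against the clean form.
print(soft_jaccard("Intel Core i7 processor", "Intl Core i-7 procesor"))
```

The same scheme extends to Dice or cosine by swapping only the final combination formula, which is what makes resemblance coefficients a natural plug-in point for any character-level measure.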
That criterion appeared to be well suited to the specific task, reaching an F-measure above 80%. Finally, experiments showed that the information extractor was resilient to random noise introduced into the lexicon.
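As a rough illustration of semantic paths and a shortest-path style criterion, the sketch below enumerates root-to-terminal paths in a toy ontology DAG and resolves an ambiguous mention by minimizing total pairwise path distance. The toy ontology, the shared-prefix distance, and the example mentions are all hypothetical; the thesis' actual context graph and optimization criterion may differ.

```python
from itertools import product

# Toy ontology DAG: parent -> children (assumed structure, not the thesis').
ONTOLOGY = {
    "laptop": ["memory", "display"],
    "memory": ["ram", "storage"],
    "storage": ["hdd", "ssd"],
    "display": ["screen_size", "resolution"],
}

def semantic_paths(root="laptop"):
    """Enumerate every root-to-terminal concept chain (a semantic path)."""
    paths = []
    def walk(node, path):
        children = ONTOLOGY.get(node, [])
        if not children:
            paths.append(tuple(path))
        for child in children:
            walk(child, path + [child])
    walk(root, [root])
    return paths

def path_distance(p, q):
    """Distance between two semantic paths: steps beyond the shared prefix."""
    shared = 0
    for a, b in zip(p, q):
        if a != b:
            break
        shared += 1
    return (len(p) - shared) + (len(q) - shared)

def disambiguate(candidates):
    """Pick one semantic path per mention, minimizing total pairwise distance."""
    best, best_cost = None, float("inf")
    for combo in product(*candidates):
        cost = sum(path_distance(p, q)
                   for i, p in enumerate(combo) for q in combo[i + 1:])
        if cost < best_cost:
            best, best_cost = combo, cost
    return best

# An ambiguous capacity mention could refer to RAM, HDD, or SSD; a nearby
# unambiguous "SSD" mention pulls it onto the storage branch.
all_paths = semantic_paths()
capacity_candidates = [p for p in all_paths if p[-1] in ("ram", "hdd", "ssd")]
ssd_candidates = [p for p in all_paths if p[-1] == "ssd"]
print(disambiguate([capacity_candidates, ssd_candidates]))
```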