Svetlana Symonenko - Academia.edu
Papers by Svetlana Symonenko
The semantic role labels of verb predicates can be used to define an event model for understanding text. In the system described in this paper, the events are extracted from documents that are summary reports about individual people. The system constructed for the event extraction integrates a statistical approach, using machine learning over PropBank semantic role labels, with a rule-based approach using a sublanguage grammar of the summary reports. The event model is also used to identify patterns of event/role usage that can be mapped to entity relations in the domain ontology of the application.
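The mapping from semantic role labels to an event model can be sketched as follows. This is a hypothetical illustration only: the role tuples stand in for the output of an SRL system, and the event schema, field names, and example sentence are invented, not taken from the paper.

```python
# Hypothetical sketch: mapping PropBank-style semantic role labels to a
# simple event model. The schema below is illustrative, not the paper's.

def roles_to_event(predicate, roles):
    """Map a labeled predicate-argument structure to an event record."""
    return {
        "event": predicate,
        "agent": roles.get("ARG0"),     # ARG0: proto-agent
        "theme": roles.get("ARG1"),     # ARG1: proto-patient/theme
        "time": roles.get("ARGM-TMP"),  # temporal modifier, if present
    }

# Stand-in SRL output for "Smith joined the ministry in 1998".
srl_output = ("join", {"ARG0": "Smith", "ARG1": "the ministry",
                       "ARGM-TMP": "in 1998"})

event = roles_to_event(*srl_output)
print(event)
```

Records of this shape could then be matched against patterns of event/role usage and linked to entity relations in a domain ontology.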
Research within a larger, multi-faceted risk assessment project for the Intelligence Community (IC) combines Natural Language Processing (NLP) and Machine Learning techniques to detect potentially malicious shifts in the semantic content of information accessed or produced by insiders within an organization. Our hypothesis is that fewer, more discriminative linguistic features can outperform the traditional bag-of-words (BOW) representation in classification tasks. Experiments using the standard Support Vector Machine algorithm and the LibSVM implementation compared the BOW representation with two NLP representations. Classification on NLP-based document representation vectors achieved greater precision and recall using forty-nine times fewer features than the BOW representation. The NLP-based representations improved classification performance by producing a lower-dimensional but more linearly separable feature space that modeled the problem domain more accurately.
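The contrast between the two representations can be illustrated with a toy example. Everything here is invented for illustration: the documents, the concept features, and the counting scheme are not the paper's actual feature extraction, which derives features from NLP analysis rather than hand-picked keywords.

```python
# Illustrative sketch of why fewer, more discriminative features can help:
# a BOW vector spans the whole vocabulary, while an NLP-derived vector
# keeps only a handful of semantically informative features.

docs = [
    "analyst reviewed regional trade reports",
    "analyst accessed weapons procurement files",
]

# BOW: one dimension per vocabulary term.
vocab = sorted({w for d in docs for w in d.split()})
bow = [[d.split().count(w) for w in vocab] for d in docs]

# NLP-based stand-in: a small set of concept features (hypothetical).
concepts = ["trade", "weapons"]
nlp = [[int(c in d) for c in concepts] for d in docs]

print(len(vocab), len(concepts))  # BOW dimension vs. NLP dimension
```

Even in this toy setting the concept vectors separate the two documents with a fraction of the BOW dimensions, which is the intuition behind the reported forty-nine-fold feature reduction.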
We present initial results from an international, multi-disciplinary research collaboration that aims to construct a reference corpus of web genres. The primary application scenario for which we plan to build this resource is the automatic identification of web genres. Web genres are difficult to capture and describe in their entirety, but we plan for the finished reference corpus to contain multi-level tags of the genre or genres a web document or website instantiates. As the construction of such a corpus is by no means a trivial task, we discuss several alternatives that are, for the time being, mostly based on existing collections. We also discuss a shared set of genre categories and a multi-purpose tool as two additional prerequisites for a reference corpus of web genres.
A study was conducted to explore the potential of Natural Language Processing (NLP)-based knowledge discovery approaches for representing and exploiting the vital information contained in field service (trouble) tickets for a large utility provider. Analysis of a subset of tickets, guided by sublanguage theory, identified linguistic patterns, which were translated into rule-based algorithms for automatic identification of tickets’ discourse structure. Subsequent data mining experiments showed promising results, suggesting that sublanguage is an effective framework for discovering the historical and predictive value of trouble ticket data.
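Rule-based identification of a ticket's discourse structure can be sketched as pattern matching over sublanguage cues. The section cues, tags, and sample ticket below are all invented stand-ins for patterns that would be derived from a real ticket corpus.

```python
import re

# Hypothetical sketch of rule-based discourse segmentation for trouble
# tickets. The cue patterns and the sample ticket are invented.

SECTION_RULES = [
    ("problem",    re.compile(r"^(CUST RPTS|PROBLEM:)", re.I)),
    ("action",     re.compile(r"^(DISPATCHED|TECH|ACTION:)", re.I)),
    ("resolution", re.compile(r"^(RESOLVED|CLOSED|FIXED)", re.I)),
]

def segment_ticket(lines):
    """Label each ticket line with a discourse-section tag."""
    segments, current = [], "other"
    for line in lines:
        for tag, pattern in SECTION_RULES:
            if pattern.match(line):
                current = tag
                break
        segments.append((current, line))
    return segments

ticket = ["CUST RPTS NO DIAL TONE", "DISPATCHED TECH TO SITE",
          "RESOLVED - REPLACED LINE CARD"]
for tag, line in segment_ticket(ticket):
    print(tag, "|", line)
```

Once lines carry discourse tags like these, downstream data mining can query, for example, resolutions by problem type.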
Malicious insiders’ difficult-to-detect activities pose serious threats to the intelligence community (IC). A novel approach that integrates the results of social network analysis, role-based access monitoring, and semantic analysis of insiders’ communications as evidence for evaluation by a risk assessor is being tested on an IC simulation. Semantic analysis of the insider’s text-based communications by our proven Natural Language Processing (NLP) system produces conceptual representations that are clustered and compared on the expected vs. observed scope. The determined risk level is input to a risk analysis algorithm that merges it with outputs from the system’s social network analysis and role-based monitoring modules.
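The expected-vs.-observed comparison can be sketched as a similarity check between concept vectors. This is a hedged stand-in only: the concept vectors, the cosine-based score, and the toy counts are invented, whereas the actual system clusters NLP-derived conceptual representations.

```python
import math

# Hypothetical sketch of comparing an insider's observed communication
# content against the expected scope of their assignment. Vectors are
# invented concept-frequency counts over a shared concept inventory.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

expected_scope = [5, 4, 0, 0]  # analyst's assigned topic areas
observed_docs = [4, 5, 0, 1]   # topics in documents actually accessed

# Low similarity between expected and observed scope raises the score.
risk_score = 1.0 - cosine(expected_scope, observed_docs)
print(round(risk_score, 3))
```

A score like this would be one input among several; the system merges it with social network analysis and role-based monitoring outputs before any risk determination.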
Experiments were conducted to test several hypotheses on methods for improving document classification for the malicious insider threat problem within the Intelligence Community. Bag-of-words (BOW) representations of documents were compared with Natural Language Processing (NLP)-based representations in both the typical and one-class classification problems using the Support Vector Machine algorithm. Results show that the NLP features significantly improved classifier performance over the BOW approach in terms of both precision and recall, while using far fewer features. The one-class algorithm using NLP features demonstrated robustness when tested on new domains.
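One-class classification trains on examples of a single class and flags anything unlike them. The sketch below is an illustrative stand-in only: the paper uses a one-class SVM, while this centroid-distance rule, its threshold, and the toy vectors are all invented to show the idea.

```python
import math

# Illustrative stand-in for one-class classification: train on feature
# vectors from a single (in-domain) class, then flag new vectors whose
# distance from the training centroid exceeds a threshold.

def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

in_domain = [[1.0, 0.1], [0.9, 0.2], [1.1, 0.0]]  # training class only
center = centroid(in_domain)
threshold = 0.5  # would be tuned on held-out in-domain data

def is_anomalous(vec):
    return distance(vec, center) > threshold

print(is_anomalous([1.0, 0.1]))  # near the training data
print(is_anomalous([0.1, 1.0]))  # far from the training data
```

The appeal of the one-class setting for insider threat detection is that only in-domain documents are needed for training; no examples of malicious content are required.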
A feasibility study was conducted to determine whether the sublanguage methodology of NLP could analyze and represent the vital information contained in trouble tickets’ ungrammatical text, and to explore various knowledge mining approaches to render the data contained in these documents accessible for analysis and prediction. Experiments showed that the linguistic characteristics of trouble tickets fit the sublanguage theoretical framework, enabling NLP systems to tap into the unrealized value of trouble ticket data.
Over a few decades, the Internet has developed from a research project into an indispensable part of our lives. We are fairly sure almost everything is on the Web, but can we find it there? In the real world, when navigating an unfamiliar place, we look for familiar signs to guide us. On the Web, too, we apply our experience to figure out the path. How well does it work? Browsing is still less productive and less popular than searching. Searching works when we know what we are looking for. But what if our need is not clear and we would rather look around? Has the Web reached a state where intuitive navigation is possible? The study explored emerging conventions in the content structure of academic and corporate sites. The identified trends were translated into prototype content structures. The study also found that users possess expectations of website content organization, which they relied on when interacting with the sites. The match between expectations and reality was shown to affect...
A pilot study was conducted for dissertation research on indications of conventionalization in the observable structure of website content, i.e., in the way information is displayed to and perceived by users. The pilot applied qualitative content analysis methods, guided by genre theory, to a sample of the top three structural levels of fifteen websites of three types (university, governmental, and business). Because of the small sample, the pilot results should be treated as preliminary, but they do point to certain type-dependent patterns in the organization of information on websites. In addition, analysis of page titles and link labels identified some naming conventions for particular content categories. The analysis of content structure also appears potentially informative about the major “lines of business” of the actual entity, or entity type, behind the website.
Background and Problem Area This research addresses the question of whether the AI technologies of Natural Language Processing (NLP) and Machine Learning (ML) can be used to improve security within the Intelligence Community (IC). This would be done by monitoring insiders’ workflow documents and emitting an alert to the central risk assessor, monitored by a system assurance administrator, if the documents accessed or produced by an IC analyst are not semantically appropriate to the domain of the analyst’s assigned tasks. NLP-driven information extraction and ML-based text categorization are being applied to the problem of monitoring insider activity, with the goal of detecting malicious insiders within an organization (Symonenko et al., 2004). The capability is being implemented and tested as one piece of a tripartite solution in a system prototype within the context of a larger Insider Threat project being conducted under ARDA’s Information Assurance for the Intelligence...
Experiments were conducted to test several hypotheses on methods for improving document categorization for the malicious insider threat problem within the Intelligence Community. Bag-of-words (BOW) representations of documents were compared with Natural Language Processing (NLP)-based representations in both the typical and one-class categorization problems using the Support Vector Machine algorithm. Results from our Semantic Anomaly Monitoring (SAM) system show that the NLP features significantly improved classifier performance over the BOW approach in terms of both precision and recall, while using far fewer features. The one-class algorithm using NLP features demonstrated robustness when tested on new domains.