Cem Bozsahin | Middle East Technical University
Papers by Cem Bozsahin
TheBench is a tool to study monadic structures in natural language. It is for writing monadic grammars to explore analyses, to compare diverse languages through their categories, and to train models of grammar from form-meaning pairs where syntax is a latent variable.
Monadic structures are binary combinations of elements that employ semantics of composition only. TheBench is essentially old-school categorial grammar to syntacticize the idea, with the implication that although syntax is autonomous (recall "colorless green ideas sleep furiously"), the treasure is in the baggage it carries at every step, viz. semantics, more narrowly, predicate-argument structures indicating choice of categorial reference and its consequent placeholders for decision in such structures.
There is some new thought in old school.
Unlike traditional categorial grammars, application is turned into composition in monadic analysis. Moreover, every correspondence requires specifying two command relations, one on syntactic command and the other on semantic command. A monadic grammar of TheBench contains only synthetic elements (called 'objects' in category theory of mathematics) that are shaped by this analytic invariant, viz. composition. Both ingredients (command relations) of any analytic step must therefore be functions ('arrows' in category theory). TheBench is one implementation of the idea for iterative development of such functions along with grammar of synthetic elements.
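A minimal sketch of what "application turned into composition" can mean in category-theoretic terms. This is not TheBench's notation or implementation: the names `compose`, `point`, `UNIT`, `sleeps` and `mary` are hypothetical and chosen only for illustration. The idea shown is the standard one that a value can be represented as an arrow from a one-element domain, so that applying a function to the value gives the same result as composing two arrows.

```python
# Illustrative only: hypothetical names, not TheBench's own API or notation.

def compose(f, g):
    """Return the arrow f after g."""
    return lambda x: f(g(x))


def point(value):
    """Represent a value as an arrow from a one-element domain."""
    return lambda _unit: value


UNIT = ()  # the single inhabitant of the one-element domain


def sleeps(subject):
    """A toy predicate-argument structure for an intransitive verb."""
    return ("sleeps", subject)


mary = point("mary")

# Ordinary application of the predicate to its argument.
applied = sleeps("mary")

# The same result, obtained by composition of two arrows only.
composed = compose(sleeps, mary)(UNIT)
```

Here both ingredients of the step are functions, so a single operation, composition, suffices to combine them.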
Mobile Sequencers, 2024
The article is an attempt to contribute to explorations of a common origin for language and planned-collaborative action. It gives ‘semantics of change’ the central stage in the synthesis, from its history and recordkeeping to its development, its syntax, delivery and reception, including substratal aspects.
It is suggested that to arrive at a common core, linguistic semantics must be understood as studying, through syntax, a mobile agent’s representing, tracking and coping with change and no change. Semantics of actions can be conceived the same way, but through plans instead of syntax. The key point is the following: sequencing itself, of words and action sequences, brings in more structural interpretation to the sequence than what is immediately evident from the sequents themselves. Mobile sequencers can be understood as subjects structuring, reporting, understanding and keeping track of change and no change. The idea invites rethinking of the notion of category, both in language and in planning.
The linguist’s search for explaining the gaps in possible structures, and offlineness of language, and the computer scientist’s search for possible plan landscape, and onlineness of action, are leveraged by the synthesis for open exploration. It leaves very little room for analogies and instrumental thinking, such as language being an infinite gift, or the computer being the ultimate human tool. Nothing is infinite if modern physics is right, not even the computer’s name-recursive representations, which are commonly, and misleadingly, compared with the human’s value-recursive representations. This has implications for the synthesis.
Understanding understanding change by mobile agents is suggested to be about human extended practice, not extended-human practice. That’s why linguistics is as important as computer science in the synthesis. It must rely on representational history of acts, thoughts and expressions, personal and public, crosscutting overtness and covertness of these phenomena. It has implications for anthropology in the extended practice, which are covered briefly.
The aim of this paper is to summarize the work done in recent years in the field of Categorial Grammar (Ulamsal Dilbilgisi), and to introduce the new methods used in applying this theory to Turkish.
36th National Linguistics Meeting, 2022, Kayseri, Turkey
4-son.pdf is the full paper (in Turkish)
Journal of Logic, Language and Information, 2023
Two positions of Bolinger, about synonymy and meaningfulness of words, point to the significance of controlling the referentiality of word forms, from representing them in grammar to their projection onto surface structure, i.e. configurationality. In particular, it becomes critical to control the range of surface substitution for surface syntactic categories of words to maintain referential properties of idiosyncrasy. Categorial grammars as reference systems suggest ways to keep the two aspects in grammar. The first dividend of adopting a categorial perspective is systematically distinguishing metaphorical sense extensions from idioms. The second dividend is procedural. Some tokens can be seen to be types themselves, with distinct referential import. Furthermore, some idiomatic meanings which require a unique phonological word for specific reference to events and participants can be types too. Together they can be thought of as the idiotype. The idiotype as idiosyncrasy’s foot through the door of grammar reveals a controllable range of possibilities for referentiality and configurationality of idiosyncrasy. Phrasal and idiomatic meanings can then be treated compositionally, given the proposed added role of paracompositionality arising from event versus predicate distinction at the level of predicate-argument structure, in multiword expression cum idiom and phrasal verb treatment, which we show for English, Mandarin Chinese and Turkish.
Natural Language Engineering, 2022
We propose an integrated deep learning model for morphological segmentation, morpheme tagging, part-of-speech (POS) tagging, and syntactic parsing onto dependencies, using cross-level contextual information flow for every word, from segments to dependencies, with an attention mechanism at horizontal flow. Our model extends the work of Nguyen and Verspoor (2018; Proceedings of the CoNLL Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, The Association for Computational Linguistics, pp. 81–91) on joint POS tagging and dependency parsing to also include morphological segmentation and morphological tagging. We report our results on several languages. The primary focus is agglutination in morphology, in particular Turkish morphology, for which we demonstrate improved performance compared to models trained for individual tasks. Being one of the earlier efforts in joint modeling of syntax and morphology along with dependencies, we discuss prospective guidelines for ...
We report two tools to conduct psycholinguistic experiments on Turkish words. KelimetriK allows experimenters to choose words based on desired orthographic scores of word frequency, bigram and trigram frequency, ON, OLD20, ATL and subset/superset similarity. The Turkish version of Wuggy generates pseudowords from one or more template words using an efficient method. The syllabified versions of the words are used as the input and are decomposed into their sub-syllabic components. The bigram frequency chains are constructed from the entire words’ onset, nucleus and coda patterns. Lexical statistics of stems and their syllabification are compiled by us from the BOUN corpus of 490 million words. Use of these tools in some experiments is shown.
Word Order in Turkish, 2019
Typed conception of surface-command and LF-command reveals a unique degree of freedom for specifying a verb in its combinatory capacity. It naturally brings in the question of word order in relation to its semantics. We exemplify from the Turkish verb and verbs of three other languages with different word order behavior. The differences are explainable in syntax if we assume that surface-command and LF-command are free to vary in a lexical correspondence, and that being the head of a construction also means determining its semantics. Turkish verbs are not heads of any construction; word-order variation has semantics arising from metrical grid and autonomous phonological events. Welsh verbs are heads of relativization; as such their logical forms must be different than their plain semantics. European Portuguese treats referentially dependent and independent arguments of the verb differently, exploiting word order for them but not to the extent of requiring a different category for th...
This study aims to model social dynamics of an idealized closed musical society to investigate whether a musical agreement in terms of shared musical expectations can be attained without external intervention or centralized control. Our model implements a multi-agent simulation, where identical agents, each with its own private two-dimensional transition matrix that defines its expectations on all possible bi-gram note transitions, are involved in round-based pairwise interactions. In each interaction two agents are randomly chosen from the population, one as the performer and the other as the listener. Performers compose a fixed-length melodic line by successively appending their most expected note sequences recursively, using sounds from a finite inventory. Listeners assess this melody to determine the success of the interaction by evaluating how familiar they are with the bi-gram transitions that they hear. According to success the interacting parties perform updates o...
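One round of such an interaction might be sketched as follows. This is an illustration under stated assumptions, not the study's implementation: the note inventory `NOTES`, the random initialization, the greedy choice of the next note, the mean-based `familiarity` score and the `threshold` for success are all hypothetical. The update step is deliberately left out, because the abstract is cut off before it is described.

```python
import random

NOTES = ["C", "D", "E", "G"]  # hypothetical finite inventory of sounds


def make_agent(rng):
    """Private matrix: agent[a][b] is the expectation of note b after note a."""
    return {a: {b: rng.random() for b in NOTES} for a in NOTES}


def compose_melody(performer, length, rng):
    """Build a fixed-length line by appending the most expected next note."""
    melody = [rng.choice(NOTES)]
    while len(melody) < length:
        row = performer[melody[-1]]
        melody.append(max(row, key=row.get))
    return melody


def familiarity(listener, melody):
    """Mean expectation the listener assigns to the bi-gram transitions heard."""
    pairs = list(zip(melody, melody[1:]))
    return sum(listener[a][b] for a, b in pairs) / len(pairs)


def interact(population, length, threshold, rng):
    """One round: pick performer and listener at random, compose, then assess."""
    performer, listener = rng.sample(population, 2)
    melody = compose_melody(performer, length, rng)
    score = familiarity(listener, melody)
    # Assumption: success means the score reaches a fixed threshold.
    return melody, score, score >= threshold


rng = random.Random(0)
population = [make_agent(rng) for _ in range(5)]
melody, score, success = interact(population, length=8, threshold=0.5, rng=rng)
```

Repeating `interact` over many rounds, with whatever update rule the study applies after each outcome, would give the round-based dynamics the abstract describes.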
The Turkish Discourse Bank (TDB) is a resource of approximately 400,000 words in its current release in which explicit discourse connectives and phrasal expressions are annotated along with the textual spans they relate. The corpus has been annotated by annotators using a semiautomatic annotation tool. We expect that it will enable researchers to study aspects of language beyond the sentence level. The TDB follows the Penn Discourse Tree Bank (PDTB) in adopting a connective-based annotation for discourse. The connectives are considered heads of annotated discourse relations. We have so far found only applicative structures in Turkish discourse, which, unlike syntactic heads, seem to have no need for composition. Interleaving in-text spans of arguments appears to be only apparently-crossing, and related to information structure.
Turkish Natural Language Processing, 2018
Wide-coverage parsing poses three demands: broad coverage over preferably free text, depth in semantic representation for purposes such as inference in question answering, and computational efficiency. We show for Turkish that these goals are not inherently contradictory when we assign categories to sub-lexical elements in the lexicon. The presumed computational burden of processing such lexicons does not arise when we work with automata-constrained formalisms that are trainable on word-meaning correspondences at the level of predicate-argument structures for any string, which is characteristic of radically lexicalizable grammars. This is helpful in morphologically simpler languages too, where word-based parsing has been shown to benefit from sub-lexical training.
Fundamental Issues of Artificial Intelligence, 2016
Computing and Philosophy, 2016
The paper argues that a computational constraint is one that appeals to control of computational resources in a computationalist explanation. Such constraints may arise in a theory and in its models. Instrumental use of the same concept is trivial because the constraining behavior of any function eventually reduces to its computation. Computationalism is not instrumentalism. Born-again computationalism, which is an ardent form of pancomputationalism, may need some soul searching about whether a genuinely computational explanation is necessary or needed in every domain, because the resources in a computationalist explanation are limited. Computational resources are the potential targets of computational constraints. They are representability, time, space, and, possibly, randomness, assuming that the ‘BPP = BQP?’ question remains open. The first three are epitomized by the Turing machine, and manifest themselves, for example, in complexity theories. Randomness may be a genuine resource in quantum computing. From this perspective, some purported computational constraints may be instrumental, and some supposedly noncomputational or cognitivist constraints may be computational. Examples for both cases are provided. If pancomputationalism has instrumentalism in mind, then it may be a truism, therefore not very interesting, but born-again computationalism cannot be computationalism as conceived here.
This study is a preliminary investigation of verb classes in Turkish Sign Language (TiD), and how they can be captured in a lexicalized generative grammar. TiD manifests an array of verb classes, as in other sign languages: plain verbs, single/double agreement verbs, and spatial verbs. Syntactic categorisation of these verb classes is a challenge to any linguistic theory because it involves multi-modal features (manual and nonmanual signs), a relativistic pronominal reference scheme, an unorthodox morphology for signs, and iconicity. We start our investigation with directionality (and grammatical relations) because they are considered to be basic for understanding syntactic asymmetries, as Ross (1967) and subsequent research have shown for coordination and extraction. Rather than confining ourselves to single clauses without embedding, we investigate syntactic constructions and try to determine word order and directionality. An important assumption in this approach is that directionality can be captured in the lexicon, in the lexical categories of verbs, as a systematic combinatory property of argument-taking entities such as verbs, under the guidance of an invariant Universal Grammar (Steedman 1996, 2000). The question then becomes testing the hypotheses on directionality of verbs by looking at syntactic constructions that depend on verbal categories coming from the lexicon.
Combinatory Linguistics, 2012
New Generation Computing, 1988
2024: Invited talk at the 37th Annual Meeting of the Linguistic Society of Türkiye (37. Ulusal Dilbilim Kurultayı)
UDK 36 presentation, Erciyes, Kayseri
These are the slides of the talk I gave at Ankara DTCF Linguistics department, on October 4, 2022. [Turkish text with some Turkish, English, and Chinese examples]
These are the slides (in Turkish) for the talk I gave to students of linguistics, 2021.
These are the notes I shared in the 'sohbet' (informal talk) we had at Dilbilim Ogrenci Platformu.
These are the slides for the talk I gave at Bogazici Univ. CogSci colloquium, 2020.
These are the slides for the talk I gave at Math Club in Turkish, ODTU, Feb 2019.