2

There were only three two-sense adjectives among the 30 target words.

3

A variant of the PMI where negative values are replaced by zero.
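As a minimal sketch of this variant (often called positive PMI), the snippet below computes PMI from raw co-occurrence counts and clips negative values to zero. The function name `ppmi`, the base-2 logarithm, and the toy counts are illustrative assumptions, not taken from the chapter.

```python
import math
from collections import Counter

def ppmi(pair_counts):
    """Map each (word, context) pair to max(0, PMI), given raw pair counts."""
    total = sum(pair_counts.values())
    word_counts = Counter()
    context_counts = Counter()
    for (word, context), n in pair_counts.items():
        word_counts[word] += n
        context_counts[context] += n
    scores = {}
    for (word, context), n in pair_counts.items():
        # PMI = log2( P(w, c) / (P(w) * P(c)) ), estimated from the counts
        pmi = math.log2((n * total) / (word_counts[word] * context_counts[context]))
        scores[(word, context)] = max(0.0, pmi)  # negative values become zero
    return scores

# Hypothetical toy counts, for illustration only
toy_counts = {
    ("cat", "purr"): 4,
    ("cat", "the"): 6,
    ("dog", "the"): 9,
    ("dog", "purr"): 1,
}
ppmi_scores = ppmi(toy_counts)
```

In this toy example, pairs that co-occur less often than chance would predict, such as ("cat", "the"), receive a score of zero instead of a negative value.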

4

Mostly when the neighbor is morphologically related to the target word, as in interpréter > interprétation.

5

In our case a node represents a target word in a co-occurrence network.

6

Target word senses.

7

s2 : (n) area, country (a particular geographical region of indefinite boundary, usually serving some special purpose or distinguished by its people or culture or geography) “it was a mountainous area”; “Bible country”.

8

Note that Aij is the partial payoff matrix for players i and j, computed by multiplying the similarity weight wij by the identity matrix of size c, Ic.
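The construction described in this note can be sketched as follows: the partial payoff matrix is a c × c diagonal matrix whose diagonal entries all equal the similarity weight. The helper name `partial_payoff` and the example values (a weight of 0.5, c = 3) are hypothetical and serve only as an illustration.

```python
def partial_payoff(w_ij, c):
    """Return w_ij * I_c as a list of rows (a c-by-c diagonal matrix)."""
    return [[w_ij if row == col else 0.0 for col in range(c)] for row in range(c)]

# Hypothetical values: similarity weight 0.5 between players i and j, c = 3
A_ij = partial_payoff(0.5, 3)
for row in A_ij:
    print(row)
```

Each off-diagonal entry is zero, so under this definition only the diagonal entries, all equal to wij, contribute to the payoff.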

9

Stoplist and Indicative Phrase list are available at http://sites.labic.icmc.usp.br/merleyc/ThesisData/

10

PTStemmer: a stemming toolkit for the Portuguese language – http://code.google.com/p/ptstemmer/

11

We used the implementation available at https://code.google.com/p/jatetoolkit/

13

All the term extraction results for the three corpora using each measure are available at http://sites.labic.icmc.usp.br/merleyc/ThesisData/ .

14

The demo version can be found at http://webground.su/

15

It is possible to use TF-iDF to analyze a larger context; in this case, a larger reference corpus should be used to compute iDF. This is beyond the scope of this paper, however: TF-iDF is used to study a plot context only, since we already have an “information portrait” for a larger context.
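To illustrate the role of the reference corpus, here is a minimal sketch in which term frequencies come from the text under study and document frequencies come from a separate reference collection. The smoothed iDF formula, the function name `tf_idf`, and the toy data are assumptions made for illustration; the chapter does not specify which TF-iDF variant is used.

```python
import math
from collections import Counter

def tf_idf(target_tokens, reference_docs):
    """Score terms of the target text; iDF is estimated on the reference docs."""
    n_docs = len(reference_docs)
    doc_sets = [set(doc) for doc in reference_docs]
    term_freq = Counter(target_tokens)
    scores = {}
    for term, count in term_freq.items():
        doc_freq = sum(1 for doc in doc_sets if term in doc)
        # Smoothed iDF (an assumption): terms found in every reference doc get 0
        idf = math.log((1 + n_docs) / (1 + doc_freq))
        scores[term] = (count / len(target_tokens)) * idf
    return scores

# Hypothetical toy data, for illustration only
target = ["the", "crisis", "the", "plot", "crisis", "crisis"]
reference = [["the", "plot"], ["the", "news"], ["the", "crisis"]]
tfidf_scores = tf_idf(target, reference)
```

Enlarging `reference` changes only the iDF part of the score, which is why analyzing a larger context calls for a larger reference corpus.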

16

It would be hard to translate these results into English because the segmentation depends heavily on the micro-syntactic structure. Hopefully, the visual clues – the length of the segments, the distribution of keywords – give an idea of the potential of this methodology.

20

Ponta embarrassingly extends the crisis from USL. – in English (En)

21

Literally ménager la susceptibilité.

22

Linguistic Data Consortium, University of Pennsylvania, http://catalog.ldc.upenn.edu/LDC2001T02

27

http://wit.istc.cnr.it:8080/tools/citalo

30

There are multiple ways to define this set, including a simple list, regular expressions, or context-free grammars.

31

To be fair, one must admit that great efforts have been made to improve the situation. In fact, there are quite a few onomasiological dictionaries, for example, Roget’s Thesaurus (Roget, 1852), analogical dictionaries (Boissière, 1862; Robert et al., 1993), Longman’s Language Activator (Summers, 1993), and various network-based dictionaries: WordNet (Fellbaum, 1998; Miller et al., 1990), MindNet (Richardson et al., 1998), and Pathfinder (Schvaneveldt, 1989). There are also various collocation dictionaries (BBI, OECD), reverse dictionaries (Kahn, 1989; Edmonds, 1999), and OneLook, which combines a dictionary, WordNet, and an encyclopedia, Wikipedia (http://onelook.com/reverse-dictionary.shtml). A lot of progress has been made over the last few years, yet more can be done, especially with respect to indexing (the organization of the data) and navigation. Given the possibilities modern computers offer with respect to storage and access, computational lexicography should probably jettison the distinctions between lexicon, encyclopedia, and thesaurus and unify them into a single resource.

32

Note that outputs can also be polysemous, but ambiguity is not really a problem here, as all the dictionary user wants is to find a given word form.

33

The tip-of-the-tongue phenomenon (http://en.wikipedia.org/wiki/Tip_of_the_tongue) is characterized by the fact that the author (speaker/writer) has only partial access to the word s/he is looking for. The typically lacking parts are phonological (syllables, phonemes). Since all information except this last one seems to be available, and since this is the one preceding articulation, we say: the word is stuck on the tip of the tongue (TOT, or TOT-problem).

34

While paper dictionaries store word forms (lemmas) and meanings next to each other, this type of information is distributed across various layers in the mental lexicon. This may lead to certain word access problems. Information distribution is supported by many empirical findings, such as speech errors (Fromkin, 1980), studies in aphasia (Dell et al., 1997), experiments on priming (Meyer & Schvaneveldt, 1971), and the tip-of-the-tongue phenomenon (Brown & McNeill, 1996). For computer simulations, see Levelt et al. (1999) and Dell (1986).

35

This feature of the mental lexicon (ML) is very important, as in case of failure of one method, one can always resort to another.

37

Whenever we use the term ‘word’ we imply not only single terms but also ‘collocations’ or ‘multiword expressions’, that is, a sequence of words expressing meaning.

38

A word sense should be understood as an indexed word. For a word, there are as many indexes as the word form has senses.

39

Or a sequence of lemmas, in the case of collocations. Since a node here is a word sense, its lemma should be considered a feature.

42

Digitised by Jarmasz (2003), based on the 1987 version of Roget, published by Pearson Education.

43

Please note that the distance of the various elements with respect to a common hypernym may be quite variable, hence cluster names may vary considerably in terms of abstraction.

44

The TAC is organized and sponsored by the U.S. National Institute of Standards and Technology (NIST) and the U.S. Department of Defense.

48

SFJP – a Polish dictionary developed by the Computer Linguistics Group at AGH University of Science and Technology in Kraków, in cooperation with the Department of Computational Linguistics at the Jagiellonian University (Lubaszewski et al. 2001). It contains more than 120 thousand headwords and provides a programming interface – the CLP library (Gajęcki 2009).

49

Actually, there can be multiple values for both the word ID and the categories because of language ambiguity.

50

Actually we operate on form sets and their intersections because the language is ambiguous.

51

Emphasis added.

52

Here a Millian view of proper names is assumed.

53

Tashaphyne, Arabic light stemmer/segmenter: http://tashaphyne.sourceforge.net/

54

Morphological Analyser and Disambiguator for Hebrew Language http://code972.com/hebmorph

55

Clearly, the suggestions are in Italian. We are attempting to provide English translations as semantically similar to the original Italian sentences as possible.

56

For privacy reasons we omitted both context and authors.

57

Note that the sentences in question [inline image in the original] are not grammatically correct in English, while they are perfectly well-formed in Italian, since Italian is a null-subject language.

58

Several variants exist, most notably TER (Translation Edit or Error Rate), in which local swaps of sequences of words are allowed. However, WER and TER are known to behave very similarly, as described in Cer et al. (2010).
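For reference, a minimal sketch of WER as it is commonly defined: the word-level edit distance (substitutions, insertions, deletions) between hypothesis and reference, divided by the reference length. The function name `wer`, the whitespace tokenization, and the example sentences are illustrative assumptions; TER's block shifts are not implemented here.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the processed ref prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

# Hypothetical example: one local swap of two adjacent words
swap_score = wer("a b c d", "b a c d")
```

In this example the swap costs two edits under WER, whereas a metric that allows shifts, such as TER, would count it as a single edit.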
