SciELO - Scientific Electronic Library Online

 

Polibits

On-line version ISSN 1870-9044

Abstract

SIDOROV, Grigori. Non-continuous Syntactic N-grams. Polibits [online]. 2013, n.48, pp.69-78. ISSN 1870-9044.

In this paper, we present the concept of non-continuous syntactic n-grams. In our previous works we introduced the general concept of syntactic n-grams, i.e., n-grams that are constructed by following paths in syntactic trees. Their great advantage is that they allow purely linguistic (syntactic) information to be introduced into machine learning methods. A disadvantage is that prior parsing is required. We also showed that their application in the authorship attribution task gives better results than traditional n-grams. Still, in those works we considered only continuous syntactic n-grams, i.e., the paths in syntactic trees were not allowed to have bifurcations. In this paper, we propose to remove this limitation, so we consider all sub-trees of length n of a syntactic tree as non-continuous syntactic n-grams. Note that continuous syntactic n-grams are a particular case of non-continuous syntactic n-grams. Further research should show which n-grams are more useful and in which NLP tasks. We also propose a formal manner of writing down (representing) non-continuous syntactic n-grams using parentheses and commas, for example, "a b [c [d, e], f]". In this paper, we also present examples of the construction of non-continuous syntactic n-grams on the basis of the syntactic trees produced by the FreeLing and Stanford parsers.

Keywords: Vector space model; n-grams; continuous syntactic n-grams; non-continuous syntactic n-grams.
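
To make the notation concrete, the following is a minimal Python sketch, not the author's implementation. It assumes that the abstract's example "a b [c [d, e], f]" denotes a tree in which a governs b, b governs c and f, and c governs d and e, that a single dependent is written by juxtaposition, and that several dependents are written inside brackets separated by commas. The names `TREE`, `distribute`, `subtrees`, `render`, `is_continuous` and `all_sn_grams` are hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical toy tree: head word -> list of dependents (assumed reading
# of the abstract's example "a b [c [d, e], f]").
TREE: Dict[str, List[str]] = {
    "a": ["b"], "b": ["c", "f"], "c": ["d", "e"],
    "d": [], "e": [], "f": [],
}

Subtree = Tuple[str, tuple]  # (word, tuple of selected child subtrees)


def distribute(tree, children: List[str], k: int) -> Iterator[tuple]:
    """Yield every way to pick exactly k nodes from the subtrees under `children`."""
    if k == 0:
        yield ()
        return
    if not children:
        return
    first, rest = children[0], children[1:]
    # Option 1: skip the first child entirely.
    yield from distribute(tree, rest, k)
    # Option 2: take a subtree of `size` nodes rooted at the first child.
    for size in range(1, k + 1):
        for sub in subtrees(tree, first, size):
            for others in distribute(tree, rest, k - size):
                yield (sub,) + others


def subtrees(tree, node: str, n: int) -> Iterator[Subtree]:
    """Yield all connected subtrees with n nodes whose top node is `node`."""
    for combo in distribute(tree, tree.get(node, []), n - 1):
        yield (node, combo)


def render(sub: Subtree) -> str:
    """Write a subtree in the bracket-and-comma notation."""
    word, kids = sub
    if not kids:
        return word
    if len(kids) == 1:
        return f"{word} {render(kids[0])}"
    return f"{word} [{', '.join(render(k) for k in kids)}]"


def is_continuous(sub: Subtree) -> bool:
    """A continuous sn-gram is a path: no node has more than one selected child."""
    word, kids = sub
    return len(kids) <= 1 and all(is_continuous(k) for k in kids)


def all_sn_grams(tree, n: int) -> List[str]:
    """Collect the rendered n-node subtrees rooted at every node of the tree."""
    return [render(s) for node in tree for s in subtrees(tree, node, n)]


if __name__ == "__main__":
    for gram in all_sn_grams(TREE, 3):
        print(gram)
```

Under these assumptions, the only subtree of size 6 is the whole tree, rendered as "a b [c [d, e], f]". For n = 3 the sketch enumerates both path-shaped (continuous) n-grams such as "a b c" and bifurcating ones such as "b [c, f]", which illustrates why continuous syntactic n-grams are a particular case of non-continuous ones.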


 

Creative Commons License: All the contents of this journal, except where otherwise noted, are licensed under a Creative Commons Attribution License.