Word Embeddings Go to Italy: A Comparison of Models and Training Datasets.

Abstract

In this paper we present some preliminary results on the generation of word embeddings for the Italian language. We compare two popular word representation models, word2vec and GloVe, and train them on two datasets with different stylistic properties. We test the generated word embeddings on a word analogy test derived from the one originally proposed for word2vec, adapted to capture some of the linguistic aspects that are specific to Italian. Results show that the tested models are able to create syntactically and semantically meaningful word embeddings despite the higher morphological complexity of Italian with respect to English. Moreover, we have found that the stylistic properties of the training dataset play a relevant role in the type of information captured by the produced vectors.
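As a rough illustration of how a word analogy test can be scored, the sketch below answers "a is to b as c is to ?" with the vector-offset approach commonly used with the word2vec analogy test. The abstract does not say how the authors scored their test, so this is an assumption. The function name `solve_analogy` and the toy vectors are hypothetical and hand-made; they are not taken from the paper's models or data.

```python
import numpy as np

def solve_analogy(embeddings, a, b, c):
    """Return the word d that best completes 'a is to b as c is to d'."""
    # Scale every vector to unit length so dot products are cosine similarities.
    unit = {w: v / np.linalg.norm(v) for w, v in embeddings.items()}
    # Vector-offset target: start from c and apply the a -> b displacement.
    target = unit[b] - unit[a] + unit[c]
    target /= np.linalg.norm(target)
    # The three question words are excluded from the candidate answers.
    candidates = [w for w in unit if w not in (a, b, c)]
    return max(candidates, key=lambda w: float(unit[w] @ target))

# Hypothetical 3-dimensional toy vectors, for illustration only.
toy = {
    "re": np.array([1.0, 0.0, 1.0]),
    "regina": np.array([1.0, 1.0, 1.0]),
    "uomo": np.array([0.0, 0.0, 1.0]),
    "donna": np.array([0.0, 1.0, 1.0]),
    "pizza": np.array([0.5, 0.5, -1.0]),
}

# "uomo" is to "donna" as "re" is to ?
answer = solve_analogy(toy, "uomo", "donna", "re")
print(answer)
```

With real embeddings, the same function could be run over every question in an analogy set, and the share of correct answers reported as accuracy.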

Keywords

Word2vec, Word (group theory), Analogy, Natural language processing, Computer science, Artificial intelligence, Representation (politics), Linguistics, Embedding

Affiliated Institutions

National Research Council IT

Publication Info

Year: 2015
Type: book-chapter
Citations: 48
Access: Closed

Cite This

APA Style

                            
Giacomo Berardi, Andrea Esuli, Diego Marcheggiani (2015). Word Embeddings Go to Italy: A Comparison of Models and Training Datasets. IIR eBooks.