Short Introduction

Gromoteur is a tool for linguists that gives easy access to textual corpora. It lets you fetch pages from the Web or load local files, process them, analyze them, and output the results.

Here is a simple diagram of Gromoteur's text handling chains:

Gromoteur Diagram

A short description of the different paths


Gromoteur can download whole websites, following your rules, or start with a search engine and collect the results. Since version 2, it can use your login to access restricted websites (log in with Firefox or Chrome).

It converts the files to plain Unicode text and puts them into your database. It can handle HTML and PDF files.


Gromoteur can import a whole folder of text or PDF files from your hard drive, or import tab-separated tables that you have exported from your spreadsheet.


Gromoteur allows you to look through your data, sort it, filter it, and apply simple tools such as lemmatizers, taggers, and word segmentation for different languages. It includes a graphical selection tool that lets you select specific parts of web pages, for example the central part, thus excluding the repetitive links and ads around the content.


Gromoteur allows you to export the data into separate files or into one single file. Specific words can be highlighted, and Gromoteur can even output a concordancer view of the data.
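To make the idea of a concordancer view concrete, here is a minimal sketch of a keyword-in-context listing. It is not Gromoteur's own code: the function name concordance, its parameters, and the sample text are hypothetical, and the sketch is written for Python 3.

```python
import re

def concordance(text, keyword, width=30):
    """Return (left context, match, right context) for every hit of keyword.

    Hypothetical helper for illustration only, not part of Gromoteur.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    rows = []
    for match in pattern.finditer(text):
        # Cut a fixed number of characters on each side of the hit.
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        rows.append((left, match.group(0), right))
    return rows

sample = "Mostly harmless. Quite Harmless indeed."
for left, hit, right in concordance(sample, "harmless", width=7):
    # Right-align the left context so that the keywords line up in a column.
    print(f"{left:>7} [{hit}] {right}")
```

Aligning the keyword in a fixed column is what makes such a table easy to scan and sort by left or right context.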


Gromoteur includes the Nexico tool, a simplified version of Lexico3. It can compute the specific terms of any selection of pages and it can compute textual co-occurrences based on a fast implementation of the cumulative hypergeometric distribution. Tables and images can be exported.
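As an illustration of the cumulative hypergeometric distribution mentioned above, the following sketch computes the probability of seeing a word at least k times in a selection purely by chance. It is a direct, slow evaluation and not Nexico's fast implementation; the function name and the parameter names are assumptions made for this example, which requires Python 3.8 or later for math.comb.

```python
from math import comb

def hypergeom_upper_tail(k, corpus_size, word_total, selection_size):
    """P(X >= k) for a hypergeometric variable X.

    corpus_size    -- number of tokens in the whole corpus
    word_total     -- occurrences of the word in the whole corpus
    selection_size -- number of tokens in the selection
    k              -- occurrences of the word observed in the selection
    Hypothetical helper for illustration only.
    """
    total = comb(corpus_size, selection_size)
    upper = min(word_total, selection_size)
    favorable = sum(
        comb(word_total, i) * comb(corpus_size - word_total, selection_size - i)
        for i in range(max(k, 0), upper + 1)
    )
    return favorable / total

# A small probability means the word is surprisingly frequent in the selection.
print(hypergeom_upper_tail(3, 10, 3, 4))
```

The smaller the returned probability, the more specific the word is to the selection; a co-occurrence score can be obtained the same way by treating the contexts of one word as the selection.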

Collocation graph around "harmless" from the Hitchhiker's Guide to the Galaxy made with the Gromoteur.
Arrows point to words that appear astonishingly often in 5-grams around the source word. For example, "one" appears often in the 5-grams around "harmless", but "one" does not have "harmless" very often in its 5-grams.



English: A short manual: Quick Gro.


The same short manual in French: Gro & Rapide.

Slides of the Gromoteur presentation at JADT 2014.

Chinese: A short manual.

Reference: Please cite this paper if you use the Gromoteur for your research: Kim Gerdes, "Corpus collection and analysis for the linguistic layman: The Gromoteur", Proceedings of the JADT 2014, Paris.

The tags of the tagger.



Windows: Gromoteur 2.5 beta installer
If you need to run the program without administrator privileges:
Gromoteur 2.5 beta zipped. Unzip and double-click the executable "gromoteur" in the folder.

All versions:

Apple Mac OS X: Gromoteur 2.1 beta.
Unzip the downloaded file, for example on the Desktop. Open a terminal, type cd Desktop/gromoteur and press Enter. Then type ./gromoteur and press Enter.

All versions:
Gromoteur mac 2.1 beta

Linux: Gromoteur 2.5
Very big file (250 MB)! Unzip the downloaded file. Double-click the executable in the folder (gromoteur).


Source code

The source can be obtained from Launchpad and is open to any modification there:

bzr branch lp:grosmoteur

You can also click on the number of the latest version and then on "download tarball".

You'll need:

Python 2 (at least 2.7), Qt and PyQt (at least 4.4). easy_install (from setuptools) or pip can be used to install the following modules that Gromoteur relies on:

python should start the system.



What can Gromoteur do that I can't do with a simple web search?

Well, you can live without the Gromoteur; it's just faster to have it.

You can, for example, look for a word and get all the sentences that contain this word. You can do this by hand, which may take years, or you can use Gromoteur, which takes a few minutes or hours.

If you look for all the sentences containing two words, Bing allows you to look for these words, giving you all the pages containing these words, but these words can be anywhere on the page. Gromoteur can check these pages and only keep the pages where these two words occur in some way that you can express in a regular expression.

Gromoteur can do this while checking the language of each page, keeping only the pages in the right language. Later, you can use it to put the results nicely into a concordancer table for further analysis.
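The regular-expression filter described above can be sketched as follows. This is only an illustration in Python 3, not Gromoteur's code: the function keep_pages, the example pattern, and the example URLs are all hypothetical. The pattern keeps a page only if "corpus" is followed by "linguistics" with at most three words in between.

```python
import re

# Hypothetical pattern: "corpus", then at most three intervening words,
# then "linguistics".
PATTERN = r"\bcorpus\b(?:\W+\w+){0,3}?\W+linguistics\b"

def keep_pages(pages, pattern):
    """Keep only the pages whose text matches the regular expression.

    pages maps a URL to the plain text of the page.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    return {url: text for url, text in pages.items() if regex.search(text)}

pages = {
    "http://example.org/a": "Corpus and computational linguistics are fun.",
    "http://example.org/b": "A corpus is here. Much much later, linguistics.",
}
for url in keep_pages(pages, PATTERN):
    print(url)
```

A plain search engine would return both pages, because both contain the two words somewhere; the regular expression keeps only the page where they occur close together.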

My result page always remains empty. What to do?

This may be a bug, but more likely you have set restrictions that are a bit too strict. Check this:

Can you access the start pages in your browser without using any proxy or VPN? If not, you can configure the VPN on the last configuration page (expert mode).

Are you doing a Bing-powered search? If yes, try the "try Bing" button. Do you get a few results?

Do you have anything in "constraints", "levels", or "restrict to pages containing"? If yes, erase it all.

Does your start URL match the URL restriction? If it doesn't, Gromoteur closes down the page collection immediately (because it tries out the first URL, finds that it doesn't match, and has no other URLs to continue with).

Set the maxima to a few dozen.

For the time being, relax most constraints, and put them back into place one by one.

Run again.

Do you still have an empty file? Mail me your configuration file (you will find it in the subfolder lib/spiderConfigurations of the folder where Gromoteur is installed).
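Why a start URL that fails the URL restriction leaves the result page empty can be shown with a small simulated crawl. This is a sketch under assumptions, not Gromoteur's spider: it assumes that the restriction is a regular expression searched in the URL, and the function crawl, the link table, and the URLs are hypothetical. It is written for Python 3.

```python
import re
from collections import deque

def crawl(start_url, links, restriction):
    """Collect the URLs reachable from start_url that match the restriction.

    links is a hypothetical table mapping each URL to the URLs it links to,
    standing in for real downloads.
    """
    allowed = re.compile(restriction)
    queue = deque([start_url])
    seen = {start_url}
    collected = []
    while queue:
        url = queue.popleft()
        if not allowed.search(url):
            continue  # rejected pages are neither stored nor followed
        collected.append(url)
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return collected

links = {
    "http://example.org/": ["http://example.org/a", "http://other.net/"],
    "http://example.org/a": [],
}
print(crawl("http://example.org/", links, r"example\.org"))
print(crawl("http://example.org/", links, r"example\.com"))
```

In the second call the start URL itself is rejected, so no links are ever read and the queue is empty after the first step, which is exactly the empty-result situation described in the checklist.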

How to add a new language?

For the moment, this is not possible from the interface. Get the source package, add a small language corpus to the language folder and run "python". That should do it.

Or mail me the language example file and I'll add it for everyone in the next version.

Where does this dumb name "Gromoteur" come from and how do you pronounce it?

Many years ago, at the beginning of this millennium, I was working with someone on possible boundaries in German compound nouns. We needed real-world examples, and I programmed a little script that walked around German web pages, keeping the 100 longest words it stumbled upon.

This script was still in Java, and not a line of code is left in the recent Gromoteur, but the idea remains that this can be used to find words in specific forms, for example really long words.

The French expression gros mot means "cuss word" in English, and literally it translates as "fat word". And a moteur (de recherche) is a search engine in French. So this is a search engine for fat words and loads of text, a Gromoteur in a word...

And it just sounds great: It is pronounced as gro [gʁo] and moteur [mɔtœʁ], thus, with vowel harmony, as [gʁomotœʁ].

During my work at the Academy of Sciences in Peking, the tool also received the nice translation 胖摩托, Pàngmótuō in pinyin. To get the tones right, stamp your foot while saying pàng, think "really?" while saying mó, and sing the last syllable tuō.

Can you add the great feature X to the Gromoteur?

There are quite a few things to improve in this software.

Some of the planned future developments include:

Nag me about any of these points if you really need them.



Please find my address on my webpage.