Gromoteur is a tool for linguists that gives easy access to textual corpora. It lets you fetch pages from the Web or from local files, process them, analyze them, and output the results.
Here is a simple diagram of Gromoteur's text handling chains:
Gromoteur can download whole websites, following your rules, or start with a search engine and get the results. Since version 2, it can use your login to access restricted websites (log in with Firefox or Chrome).
It transforms the files to pure Unicode text and puts them into your database. It can handle HTML and PDF files.
Gromoteur can import a whole folder of text or PDF files from your hard drive, or import tab-separated tables that you have exported from your spreadsheet.
Gromoteur allows you to look through your data, sort it, filter it, and apply simple tools like lemmatizers, taggers, and word segmentation for different languages. It includes a graphical selection tool that lets you select specific parts of web pages, for example the central part, thus excluding the repetitive links and ads around the content.
Gromoteur allows you to export the data either into separate files or into one single file. Specific words can be highlighted, and Gromoteur can even output a concordancer view of the data.
Gromoteur includes the Nexico tool, a simplified version of Lexico3. It can compute the specific terms of any selection of pages and it can compute textual co-occurrences based on a fast implementation of the cumulative hypergeometric distribution. Tables and images can be exported.
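To give an idea of the kind of computation behind such specificity scores, here is a minimal sketch of a cumulative hypergeometric tail probability in plain Python (3.8 or later, for `math.comb`). It is not Nexico's actual fast implementation; the function name `hypergeom_tail` and the toy counts are assumptions made for illustration.

```python
from math import comb  # available from Python 3.8


def hypergeom_tail(k, K, n, N):
    """P(X >= k) when n tokens are drawn without replacement from a corpus
    of N tokens, K of which are occurrences of the word of interest."""
    if not (0 <= K <= N and 0 <= n <= N):
        raise ValueError("need 0 <= K <= N and 0 <= n <= N")
    lower = max(k, 0, n - (N - K))   # smallest count that is possible and >= k
    upper = min(K, n)                # largest possible count
    favourable = sum(comb(K, i) * comb(N - K, n - i)
                     for i in range(lower, upper + 1))
    return favourable / comb(N, n)


# Hypothetical counts: a word occurs 50 times in a 100,000-token corpus,
# 15 of them inside a 10,000-token selection of pages.
p = hypergeom_tail(15, 50, 10000, 100000)
print(p)
```

A very small probability suggests that the word is over-represented in the selection compared with the corpus as a whole. The same tail probability can serve for co-occurrences if the "selection" is taken to be the contexts of a second word.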
English: A short manual: Quick Gro.
Chinese: A short manual.
Reference: Please cite this paper if you use the Gromoteur for your research: Kim Gerdes, "Corpus collection and analysis for the linguistic layman: The Gromoteur", Proceedings of the JADT 2014, Paris.
The source can be obtained from Launchpad and is open for any modification there:
bzr branch lp:grosmoteur
You can also click on the number of the latest version and then on "download tarball".
Requirements: Python 2 (at least 2.7), and Qt and PyQt (at least 4.4). Tools such as easy_install (from setuptools) or pip can be used to install the following modules that Gromoteur relies on:
python gromoteur.py should start the system.
Well, you can live without the Gromoteur; it's just faster to have it.
You can, for example, look for a word and get all the sentences that contain this word. You can do this by hand, which may take years, or you can use Gromoteur, which takes a few minutes or hours.
If you are looking for all the sentences containing two given words, Bing lets you search for them and returns all the pages that contain both, but the words can be anywhere on the page. Gromoteur can check these pages and keep only those where the two words occur in a configuration that you can express as a regular expression.
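As an illustration of the kind of constraint a regular expression can express, the sketch below keeps only texts in which two words occur within a few words of each other. It uses only Python's standard `re` module; the helper `near_pattern`, the sample pages and the distance limit are assumptions for this example, not Gromoteur's configuration syntax.

```python
import re


def near_pattern(word_a, word_b, max_gap=5):
    """Compile a regex matching word_a and word_b, in either order,
    with at most max_gap other words between them."""
    a, b = re.escape(word_a), re.escape(word_b)
    # Up to max_gap intervening words, then the separator before the second word.
    gap = r"(?:\W+\w+){0,%d}?\W+" % max_gap
    return re.compile(r"\b(?:%s%s%s|%s%s%s)\b" % (a, gap, b, b, gap, a),
                      re.IGNORECASE | re.UNICODE)


# Made-up page texts standing in for downloaded pages.
pages = [
    ("page1", "Corpus-based linguistics needs large text collections."),
    ("page2", "Linguistics is mentioned here. " + "Filler word " * 10
              + "and the corpus comes much later."),
    ("page3", "No relevant words at all."),
]

pattern = near_pattern("corpus", "linguistics", max_gap=5)
kept = [name for name, text in pages if pattern.search(text)]
print(kept)
```

Only the first page is kept: the second contains both words, but too far apart, which is exactly the case a plain search engine query cannot exclude.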
Gromoteur can do this while checking the language of each page, keeping only the pages in the right language. Later, you can use it to put the results neatly into a concordancer table for further analysis.
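A concordancer table places each hit in a middle column, with its left and right context on either side. The following is a minimal sketch of that layout in plain Python; the function `kwic` and its parameters are hypothetical and do not reflect Gromoteur's internal code or its export format.

```python
import re


def kwic(text, keyword, width=30):
    """Return (left, hit, right) tuples for every whole-word occurrence
    of keyword, with up to `width` characters of context on each side."""
    rows = []
    for m in re.finditer(r"\b%s\b" % re.escape(keyword), text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        rows.append((left, m.group(0), right))
    return rows


sample = "The corpus was small. A larger corpus gives better counts."
for left, hit, right in kwic(sample, "corpus", width=12):
    # Right-align the left context so that all hits line up in one column.
    print("%12s | %s | %-12s" % (left, hit, right))
```

Cutting the context by characters rather than by words keeps the columns aligned, at the price of occasionally truncating a word at the edge.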
This may be a bug, but more likely you have set restrictions that are a bit too strict. Check the following:
Can you access the start pages in your browser without using a proxy or VPN? If not, you can configure the VPN on the last configuration page (expert mode).
Are you doing a Bing powered search? If yes, try the "try Bing" button. Do you get a few results?
Do you have anything in the "constraints", the "levels", "restrict to pages containing"? If yes, erase it all.
Does your start URL match the URL restriction? If it doesn't, Gromoteur stops the page collection immediately (it tries the first URL, finds that it doesn't match, and has no other URLs to continue with).
Set the maxima to a few dozen.
For the time being, relax most constraints, and put them back into place one by one:
Do you still have an empty file? Mail me your configuration file (please find it in the subfolder lib/spiderConfigurations where Gromoteur is installed).
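The start-URL check from the list above can be tried out by hand before launching a crawl. This sketch assumes that the URL restriction is written as a regular expression that must match somewhere in the URL; whether Gromoteur applies it in exactly this way is an assumption, and both the function `url_allowed` and the URLs are made-up examples.

```python
import re


def url_allowed(url, restriction):
    """True if the restriction pattern is found in the URL.
    An empty restriction allows every URL."""
    return restriction == "" or re.search(restriction, url) is not None


start_url = "http://www.example.org/blog/index.html"

# The start URL passes this restriction, so the crawl can proceed.
print(url_allowed(start_url, r"example\.org/blog/"))

# The start URL fails this one, so no page would be collected at all.
print(url_allowed(start_url, r"example\.org/news/"))
```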
For the moment, this is not possible from the interface. Get the source package, add a small language corpus to the language folder, and run "python ngram.py". That should do it.
Or mail me the language example file and I'll add it for everyone in the next version.
Many years ago, at the beginning of this millennium, I was working with someone on possible borders in German compound nouns. We needed real-world examples, and I programmed a little script that walked around German web pages, keeping the 100 longest words it stumbled upon.
This script was still in Java, and not a line of code is left in the recent Gromoteur, but the idea remains that this can be used to find words in specific forms, for example really long words.
The French expression gros mot means "cuss word" in English, and literally it translates as "fat word". And a moteur (de recherche) is a search engine in French. So this is a search engine for fat words and loads of text, a Gromoteur in a word...
And it just sounds great: It is pronounced as gro [gʁo] and moteur [mɔtœʁ], thus, with vowel harmony, as [gʁomotœʁ].
During my work at the Academy of Sciences in Peking, the tool also received the nice translation 胖摩托, Pàngmótuō in pinyin. To get the tones right, stamp your foot while saying pàng, think "really?" while saying mó, and sing the last syllable tuō.
There are quite a few things to improve in this software.
Some of the future developments that are planned include:
Nag me about any of these points if you really need them.