The origins of our project
Open data is becoming a major concern in our society, as it enables any individual to share knowledge or information with everyone. In France, public authorities are slowly starting to publish databases on geography, traffic, road works, general information about how they operate, etc.
However, when it comes to science, very little data is available other than through online services, which may be free to consult but remain tied to the websites that provide them. And this does not compare with what can be found in English.
The English-speaking world figured out long ago that restricting information and knowledge hinders progress and innovation. Free courses and raw data complement traditional teaching; they are not rivals, for there is no doubt that a real teacher can never be replaced. Open data should not scare you but inspire confidence. This is why spreading information and science through open data is the only way to maintain cultural diversity in the world and to keep a few from holding a monopoly over knowledge.
The lack of free lexicographic resources (dictionaries, language tools for translation or localization, etc.) is obvious. It is the direct result of a lack of initial data, such as word lists with grammatical information, which would give the words a consistent description and enable reliable analysis (leaving no ambiguity about any word).
Having multilingual data with large word lists available as Open Data would be an undeniable advantage for research (in the humanities, or simply when looking for a translation), for the preservation of local cultural heritage, for private companies (which could then focus on their core goals instead of losing time looking for this data or, worse, creating it), as well as for teaching, or even for anyone with a personal project.
Many countries or groups that support the preservation of language heritage with few resources could then benefit from specialist expertise they might not have on hand.
Our main goal is to promote the creation of a multilingual database of lexicographic, semantic and historical data, by providing the most complete and standardized data possible as CSV (comma-separated values) files under an open-source license: free to use, reproduce and modify. We chose this format for its flexibility (it is easy to use and to read), whatever technology you use to process it afterwards. The choice of license is pragmatic: no one should appropriate a language, or any part of it, when it belongs to everyone. Our ultimate goal is to achieve this for every written language. We will start with the best-documented languages, such as French, Spanish and English. We will also do everything in our power to integrate regional languages (Occitan, Catalan, Breton, etc.) and other foreign languages, depending on positive feedback from their respective specialists. Our work will cover any language or dialect and provide the most exhaustive inventory possible of its words and how they work together (grammatical categories, conjugation, etc.). The data will be provided as Open Data under the "Creative Commons Attribution 3.0 Unported" license.
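To give an idea of why CSV is easy to process with any technology, here is a minimal sketch in Python. The column names (form, lemma, category, gender, number) and the sample rows are assumptions made for illustration only; they are not the project's published layout.

```python
import csv
import io

# Hypothetical sample: the column names and rows are illustrative only,
# not the project's actual file layout.
SAMPLE_CSV = """form,lemma,category,gender,number
chevaux,cheval,noun,masculine,plural
cheval,cheval,noun,masculine,singular
belle,beau,adjective,feminine,singular
"""


def load_words(text):
    """Parse CSV text into a list of dicts, one per word form."""
    return list(csv.DictReader(io.StringIO(text)))


words = load_words(SAMPLE_CSV)
for w in words:
    # Each row is a plain dict keyed by the header names.
    print(w["form"], "->", w["lemma"], w["category"], w["gender"], w["number"])
```

Only the standard library is needed here, and the same file could be read just as easily from a spreadsheet or from most other programming languages.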
Our second goal is to create a tool for querying our online database, enabling searches within a particular language, or between two languages to help with translation. The search criteria could be:
- Lexicography (grammatical category, gender, number, etc.).
- Semantics (to obtain a list of synonyms for a word or an expression).
- Word usage (specialized field, such as aeronautics or the military).
- Language register.
- Where the language is spoken.
- When the word or expression was first used.
- The etymology.
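As a rough sketch of how several of the criteria above could be combined in a single query, the following Python example filters a small in-memory list. The field names, the sample entries and the search function are all hypothetical; the real tool may be organized quite differently.

```python
# Hypothetical entries: field names and values are assumptions for illustration.
ENTRIES = [
    {"word": "avion", "language": "fr", "category": "noun",
     "gender": "masculine", "domain": "aeronautics", "register": "standard"},
    {"word": "zinc", "language": "fr", "category": "noun",
     "gender": "masculine", "domain": "aeronautics", "register": "slang"},
    {"word": "voler", "language": "fr", "category": "verb",
     "gender": None, "domain": "general", "register": "standard"},
]


def search(entries, **criteria):
    """Return the entries whose fields equal every criterion given."""
    return [
        entry for entry in entries
        if all(entry.get(field) == value for field, value in criteria.items())
    ]


results = search(ENTRIES, category="noun", domain="aeronautics", register="standard")
print([entry["word"] for entry in results])
```

Passing the criteria as keyword arguments means a new criterion (a region or a date of first use, for example) only requires a new field in the data, not a change to the function.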
We will also offer a simple search from an alphabetical list of words or expressions, per language.
The interface of the website will be completely internationalized so that we can reach the largest audience possible.
The work involved in our project starts with an analysis of the French language:
- A complete listing of all the words with their attributes (grammatical category, gender, number, etc.).
- Extracting the list of lemmas (the main entries in a dictionary).
- Listing each lemma’s definition to create a semantic database, which will be the internal pivot for each language, so that we can extract synonyms or translate words between two languages.
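The last two steps above can be sketched as follows. This is only an illustration under stated assumptions: definitions are represented by made-up identifiers (D_CAR, D_HORSE), and two lemmas are treated as synonyms when they share a definition. The data and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical data: (word form, lemma) pairs and lemma -> definition ids.
FORMS = [
    ("voitures", "voiture"),
    ("voiture", "voiture"),
    ("automobiles", "automobile"),
    ("chevaux", "cheval"),
]
LEMMA_DEFINITIONS = {
    "voiture": {"D_CAR"},
    "automobile": {"D_CAR"},
    "cheval": {"D_HORSE"},
}


def extract_lemmas(forms):
    """Return the sorted list of distinct lemmas found in the word list."""
    return sorted({lemma for _, lemma in forms})


def synonyms(lemma, lemma_definitions):
    """Return the other lemmas sharing at least one definition with `lemma`."""
    by_definition = defaultdict(set)
    for lem, definitions in lemma_definitions.items():
        for definition in definitions:
            by_definition[definition].add(lem)
    found = set()
    for definition in lemma_definitions.get(lemma, ()):
        found |= by_definition[definition]
    found.discard(lemma)  # a lemma is not its own synonym
    return sorted(found)


lemmas = extract_lemmas(FORMS)
print(lemmas)
print(synonyms("voiture", LEMMA_DEFINITIONS))
```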
Then, for each new language, we will:
- Make a list of the words and qualify them.
- Extract the lemmas.
- Link them with the definitions of the semantic database.
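Once the lemmas of several languages are linked to the same definitions, translating amounts to going from a lemma to its definition and back to the lemmas of the other language. The sketch below assumes a simplified model with one hypothetical definition identifier per lemma; the tables and the translate function are illustrative, not the project's actual design.

```python
# Hypothetical per-language tables mapping each lemma to a shared definition id.
LINKS = {
    "fr": {"cheval": "D_HORSE", "voiture": "D_CAR", "automobile": "D_CAR"},
    "es": {"caballo": "D_HORSE", "coche": "D_CAR"},
    "en": {"horse": "D_HORSE", "car": "D_CAR"},
}


def translate(lemma, source, target, links):
    """Return the target-language lemmas linked to the same definition."""
    definition = links[source].get(lemma)
    if definition is None:
        return []  # unknown lemma in the source language
    return sorted(
        candidate for candidate, linked in links[target].items()
        if linked == definition
    )


print(translate("cheval", "fr", "en", LINKS))
print(translate("car", "en", "fr", LINKS))
```

With the definitions acting as a pivot, adding a new language only requires linking its lemmas to the existing definitions, rather than building a separate table for every pair of languages.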
This is a very lengthy task. To give you an idea, between six and eight months are needed to qualify all the words in French.