Category archive: English

Porter Stemmer for German in Python

Anybody need a quick and dirty Python 2.5.2 implementation of the Porter German stemming algorithm?

I wrote one for the German S.L.U.T student project today: porterde.py

The only testing I did was running it on the complete list of examples on the above page (pink background).

To use, import porterde as a module and call porterde.stem(word). The function requires and returns unicode strings.
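To give an idea of what such a stemmer does internally, here is a minimal sketch of one preliminary step of the Snowball German algorithm: marking the regions R1 and R2 in which suffixes may later be removed. This is not the code from porterde.py. The function names are made up for illustration, the vowel set is assumed from the published algorithm description, and the sketch skips the preprocessing that treats u and y between vowels as consonants. It is written for Python 3, where every str is a Unicode string.

```python
# -*- coding: utf-8 -*-
# Sketch only: region marking as described for the Snowball German stemmer.
# Names are hypothetical; this is not taken from porterde.py.

VOWELS = "aeiouyäöü"  # assumed vowel set for German


def past_first_vowel_consonant(word, start=0):
    """Return the index just after the first non-vowel that follows a vowel,
    looking only at word[start:]. Return len(word) if there is none."""
    for i in range(start + 1, len(word)):
        if word[i - 1] in VOWELS and word[i] not in VOWELS:
            return i + 1
    return len(word)


def regions(word):
    """Return (r1, r2), the start indices of the regions R1 and R2."""
    r1 = past_first_vowel_consonant(word)
    # R2 is found by applying the same rule inside the (unadjusted) R1.
    r2 = past_first_vowel_consonant(word, r1)
    # German-specific adjustment: at least three letters must precede R1.
    r1 = min(max(r1, 3), len(word))
    return r1, r2
```

For example, regions("kategorie") gives (3, 5), so R1 is "egorie" and R2 is "orie". For "ofen" the unadjusted R1 would start at index 2, and the adjustment moves it to 3.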

Edit: I should have looked harder before jumping into this. There already is PyStemmer, which does the same but for eight languages. Thanks to Mathieu for pointing that out.

On Diversity

Arguing for diversity is a tricky business. In my view, the main difficulty is to distinguish between a) why diversity is good in the first place, and b) why it would be bad to lose existing diversity.

In his book Language Death, linguist David Crystal argues for preserving the diversity of the languages of the world, exceptionally many of which are threatened by extinction nowadays. The second chapter presents five answers to the question “Why should we care?”, the first answer being “Because we need diversity.” Unfortunately, Crystal fails to draw the aforementioned distinction. Supposedly, he counters the view according to which mankind would be better off with a single universal language, or as few different languages as possible. Actually, he almost exclusively lists difficulties we would encounter on the way to a monolingual planet – such as loss of profits for companies whose employees stop learning foreign languages, or loss of cultural heritage because we wouldn’t be able to understand existing documents any longer. After a few pages, both Crystal’s argument and my patience were in shambles to the point where I had to take a break and blog my five cents about cultural and linguistic diversity.

The way I think about language, the ability to develop and use it is innate to humans and can’t be taken away from us. It is, in other words, nothing we need to worry about in the context of language death. What is endangered are the particular shapes in which our faculty of speech manifests itself, shapes that, for all we can say, will not cease to come into existence, change, and become extinct for as long as mankind exists. The human faculty of speech is like an ingenious algorithm running for eternity, generating a new unique fractal of stunning complexity and beauty every few seconds. While it is understandable that one might want to preserve all of those images, most people will certainly agree that the real value is in the algorithm and not in any one of its infinitely many outputs. Likewise, for me, no particular language has any value in itself.

This is immediately qualified by the fact that particular languages are closely tied to particular cultures and human achievements, where I do see value, a lot of value indeed. The language that people speak is part of their identity; it is inextricably linked to their lives, their backgrounds, their social relations, their thoughts, their experiences, their emotions, their achievements, all of which I will subsume under the term culture from now on. One can’t take away people’s language without seriously endangering or damaging people’s culture. Even independently of their speakers, languages gain significance by virtue of being the medium of culture. For example, any philosophy, story, or useful recipe written down or recorded on tape is lost to mankind as soon as the respective language is no longer understood.

But both of these points are in the b) line of the distinction I’m insisting upon here! They are interesting only because we already live in a world with many, many languages. They do nothing to make the “one world, one language” utopia a dystopia. Wouldn’t it be great to live in a world where everybody had spoken the same language from the very beginning? A world without God’s punishment for building the Tower of Babel? Shouldn’t we strive towards tearing down all linguistic barriers, translating all of our cultural heritage into one single language and making that language the new world language, for every new citizen of the Earth to learn as their first language?

A true universal language, used not just as a lingua franca, but by everyone in all or nearly all situations, would certainly simplify a lot of matters. Matters of international communication, matters of preserving and making available information. Sure, the loss of diversity would be deplorable from an intellectual point of view: No more fascinating foreign languages to learn, to study, no more strange counterintuitive grammatical constructions to marvel at. The professions of translation and – to an extent – linguistics would be no more. But this would only take away a certain type of intellectual stimulus from certain language geeks like me, a fetish if you will. The loss would be in the domain of self-sufficient punditry (intellektuelles Gewichse, to put it drastically in German), which shouldn’t stand in the way of progress.

However, I don’t think this would work in the long run. I believe that cultural diversity leads to linguistic diversity, thus linguistic diversity cannot be eliminated except in an inhuman dictatorship. Unless you rob people of the freedom they deserve, they will express their individuality, leading to cultural diversity, leading, since people invariably choose language as one means of expressing cultural identity, to linguistic diversity.

I have argued that linguistic diversity is a necessary consequence of cultural diversity, which in turn is a necessary consequence of people being people. I will now finally try to answer the question: Does linguistic diversity have any value of its own? Apart from the intellectual value, the pet of the self-sufficient punditry I mocked above, corresponding to the beauty of fractals, the value that in my view is not a real value at all?

Yes, I think it does. People wouldn’t strive for linguistic diversity if linguistic diversity didn’t do anything for them. The same thing is true for cultural diversity, in which case the desire for individuality and identity obviously plays its part. In linguistic as well as in cultural as well as in biological evolution, most things that persist have a function.

For one thing, linguistic as well as cultural diversity has an educational value. The existence of many different languages and cultures allows for studying them, observing their structures, commonalities and differences, and getting a better understanding of how they work. Diversity helps us to put our prejudices into perspective, to see that one culture or language isn’t “better” or “worse” than another. This is true for some anglocentric American who has almost no contact with foreign languages as well as for the African taxi driver cited in David Crystal’s book, who can communicate in all of the eleven languages of his country but doesn’t value this ability because he deems all of those languages inferior. Yet in a sense the educational value of diversity is merely the solution to a problem that wouldn’t exist without it: In a world without diversity, there wouldn’t be any need to fight prejudice.

Thus, in my view, the strongest argument that can be made for diversity a priori is its protective value. David Crystal’s book reports a point made by Peter Trudgill, namely “that languages as partial barriers to communication are actually a good thing, ecologically speaking, because they make it more difficult for dominant cultures to penetrate smaller ones.” This is a point for linguistic diversity under the premise that cultural diversity is good. Similarly, as a point for cultural diversity, there are cultural barriers, preventing memes from spreading uncontrollably. Imagine what would happen if the meme “torture is a good thing” gained momentum on a planet with just one culture! Luckily, people from one culture are not too likely to pick up memes from another.

Of course we don’t want those barriers too strong, but sometimes it is good to reinforce them a little. This is when programmes for fostering diversity are called for. A biological analogy is obvious; it involves bark beetles and the hard time they have invading mixed forests, whereas monocultures are easy prey. As another example involving linguistic barriers, consider the burning of Danish flags in the Arab world after those Muhammad cartoons were published. Imagine how much more of that shit we would see in the hypothetical and impossible many-cultures-one-language world.

All of that is relative, mind you. Of course an individual can belong to many different cultures, and even the “belong to a culture” notion isn’t black-and-white but comes in a million degrees and variations. Those partial barriers aren’t walls and constitute no fundamental obstacle to Free Flow of Information, of which I am a fan (remind me to create a Facebook page). I hope I haven’t oversimplified too much – I just wanted to get those points out of my head so I can continue reading that book, hoping it gets better.

Mort aux balises?

The projects described involved the digitisation of everything from monumental inscriptions in the classical world (one of the many new words I learnt was “epigraphy”), through a 10th century palimpsest containing the earliest known manuscript of Archimedes, through the complete correspondence of the German composer Carl Maria von Weber, all the way to 1930s comic books. In all cases the challenge is to capture the detail – for example the fact that several words in an inscription might now be illegible, but were recorded in the 18th century by the first antiquarian visitors to a site. Capturing different features of the text often leads to a need for parallel markup, with corresponding XSLT challenges – but as I say, we didn’t get much technical detail.

Michael Kay, reporting from last year’s TEI conference

I’m probably just going through an ignorant phase, but reading the above made me realize I don’t see the merits of XML anymore. For many applications, that is. After having contributed to a linguistic research project making heavy use of XML and XSLT for a year and a half, I wonder: Why bother to force data into the shape of trees when the data is clearly more complex than that? Because you can then do more operations with standard tools? But does this rather fuzzy advantage outweigh the morbidness of some graph-to-tree conversions and the resultant wrenches you have to make in writing tools to process the data? If I were to design, say, a platform for storing, browsing and querying linguistic annotations right now, I would definitely put a relational database at its core and not an XML one. Any similarities between this arbitrary example and real projects – *cough* – are purely coincidental.
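To make the point about trees concrete, here is a small sketch of how standoff annotations could sit in a relational database, using Python's built-in sqlite3 module. The schema, the layer names and the example annotations are all made up for illustration and are not taken from any real project. Two annotation layers cross each other on the same tokens, which a single XML tree cannot express directly, while a plain table of spans handles it without special treatment.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with one sentence and three annotations.
    Schema and data are hypothetical."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE token (id INTEGER PRIMARY KEY, form TEXT);
        CREATE TABLE annotation (
            id INTEGER PRIMARY KEY,
            layer TEXT,
            label TEXT,
            first_token INTEGER REFERENCES token(id),
            last_token INTEGER REFERENCES token(id)
        );
    """)
    words = "Every boxer loves a woman".split()
    conn.executemany("INSERT INTO token (id, form) VALUES (?, ?)",
                     list(enumerate(words, start=1)))
    conn.executemany(
        "INSERT INTO annotation (layer, label, first_token, last_token) "
        "VALUES (?, ?, ?, ?)",
        [("syntax", "NP", 1, 2),        # Every boxer
         ("syntax", "VP", 3, 5),        # loves a woman
         ("prosody", "phrase", 2, 3)])  # boxer loves (invented example)
    return conn


def crossing_pairs(conn):
    """Return label pairs of annotations that overlap without nesting."""
    rows = conn.execute("""
        SELECT a.label, b.label
        FROM annotation AS a JOIN annotation AS b ON a.id < b.id
        WHERE (a.first_token < b.first_token
               AND b.first_token <= a.last_token
               AND a.last_token < b.last_token)
           OR (b.first_token < a.first_token
               AND a.first_token <= b.last_token
               AND b.last_token < a.last_token)
        ORDER BY a.id, b.id
    """)
    return rows.fetchall()
```

Calling crossing_pairs(build_demo_db()) reports that the invented prosodic phrase crosses both the NP and the VP. In a tree-shaped format that overlap would have to be encoded with milestones or parallel markup.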

Every big blue boxer

Have a glimpse at what the program I’m writing for my B.A. thesis in progress can do.

?- translate.

> Every big blue boxer that kills every big blue woman that is Mia loves a robber.

1 Jede große blaue Boxerin, die jede große blaue Frau, die Mia ist, tötet, liebt eine Räuberin.
2 Jede große blaue Boxerin, die jede große blaue Frau, die Mia ist, tötet, liebt einen Räuber.
3 Jeder große blaue Boxer, der jede große blaue Frau, die Mia ist, tötet, liebt eine Räuberin.
4 Jeder große blaue Boxer, der jede große blaue Frau, die Mia ist, tötet, liebt einen Räuber.

Yay! That is quite some syntactic complexity already, isn’t it? Never mind the crazy contents, I haven’t spent much time on the lexicon yet.

The main problem here is that none of the four offered sentences is a correct translation of the exquisitely gender-unaware English sentence. I actually spent most of the afternoon and evening on gendering issues only to find out that to achieve a satisfactory general solution, I’ll probably have to invest the better part of another day. What I would like for the above example is the following. It’s how I would express the state of affairs politically correctly.

Jede/r große blaue Boxer/in, der/die jede große blaue Frau,
die Mia ist, tötet, liebt eine/n Räuber/in.

Then of course there’s the blemish that the inner relative clause is not sufficiently extraposed to sound nice. I’ll fix that too.

If you would like to stalk me at work and have access to my code repository, please let me know. Currently I can’t make it public like Aleks’s because I also store literature in there.

Umzugsnotizen (5)

My old web site soviseau.de was hacked today and by God it wasn’t my fault. That gave me a reason to move the last curious odds and ends, boxes with bricolages from ages ago, to texttheater.de, from one attic to another, so to speak. Among them:

Der Puzzleteil-Navigator, a study for an unusual two-level menu. I remember creating the image maps with Paint and Notepad.

Riddle Sport, an interactive (edible!) virtual chocolate bar. Enjoy!

Last but not least:

dunkelwind&zwillingslicht, a series of analog photos I took of London by night in 2003. They gained artistic value from what happened to them in the photo laboratory – yet instead of paying royalties, I got a full refund! Hee hee!

Translating with Semantic Representations

The beauty of the topic I’ve chosen for my B.A. thesis lies in the fact that it is about Machine Translation (MT), the prototypical application of Computational Linguistics (CL), relatively easy to explain yet incorporating many subdisciplines of CL. When I started studying CL three years ago, I told my grandparents about MT; now I can tell them I’m actually doing it.

Of course, what I’m doing compares to state-of-the-art MT systems like a lever does to a particle accelerator. My thesis will be about devising and implementing, in Prolog, a program that translates a certain class of English sentences, like (1a), to correct and elegant German counterparts, like (1b). Crucially, in generating the German sentence, no information about the English input will be used other than the semantic first-order logical formulae, as in (2), derived from the input using BB1, the software accompanying Blackburn and Bos’s textbook. (Since the English and the German sentence both carry the same ambiguity, both of them are associated with two logical formulae, (2a) and (2b).)

(1a) Every boxer loves a woman.
(1b) Jeder Boxer liebt eine Frau.

(2a) ∀x(boxer(x)→∃y(woman(y)∧love(x,y)))
(2b) ∃x(woman(x)∧∀y(boxer(y)→love(y,x)))

In other words, I use these simple logical formulae as an interlingua, leaving parsing and “understanding” to (a possibly marginally extended version of) the existing system by Blackburn and Bos and focussing my efforts on generating German sentences. Nested quantification and negation, predicates best expressed as relative clauses, nouns, verbs or adjectives, and anaphoric expressions, among other things, will make this quite interesting.
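To illustrate the interlingua idea on the smallest possible scale, here is a toy sketch in Python rather than Prolog. It is not the thesis program. The tuple encoding, the function names and the three-word lexicon are all assumptions made for this example, and the generator only handles formulae of exactly the two shapes (2a) and (2b).

```python
# Formulae as nested tuples (an encoding assumed for this sketch):
#   ("all", var, body), ("some", var, body),
#   ("imp", left, right), ("and", left, right),
#   ("pred", name, arg, ...)

F2A = ("all", "x", ("imp", ("pred", "boxer", "x"),
       ("some", "y", ("and", ("pred", "woman", "y"),
                      ("pred", "love", "x", "y")))))
F2B = ("some", "x", ("and", ("pred", "woman", "x"),
       ("all", "y", ("imp", ("pred", "boxer", "y"),
                     ("pred", "love", "y", "x")))))

# Hypothetical mini lexicon: noun -> (German noun, gender), verb -> 3sg form.
NOUNS = {"boxer": ("Boxer", "m"), "woman": ("Frau", "f")}
VERBS = {"love": "liebt"}
EVERY_NOM = {"m": "Jeder", "f": "Jede", "n": "Jedes"}  # nominative "every"
A_ACC = {"m": "einen", "f": "eine", "n": "ein"}        # accusative "a"


def show(f):
    """Render a formula in the notation used in (2a) and (2b)."""
    op = f[0]
    if op in ("all", "some"):
        symbol = "∀" if op == "all" else "∃"
        return "%s%s(%s)" % (symbol, f[1], show(f[2]))
    if op in ("imp", "and"):
        symbol = "→" if op == "imp" else "∧"
        return "%s%s%s" % (show(f[1]), symbol, show(f[2]))
    if op == "pred":
        return "%s(%s)" % (f[1], ",".join(f[2:]))
    raise ValueError("unknown operator: %r" % (op,))


def generate(f):
    """Toy generator for the shapes (2a) and (2b) only."""
    outer, inner = f, f[2][2]
    if outer[0] == "all" and inner[0] == "some":
        subj, obj = outer[2][1][1], inner[2][1][1]
    elif outer[0] == "some" and inner[0] == "all":
        obj, subj = outer[2][1][1], inner[2][1][1]
    else:
        raise ValueError("formula shape not covered by this sketch")
    verb = inner[2][2][1]
    subj_noun, subj_gender = NOUNS[subj]
    obj_noun, obj_gender = NOUNS[obj]
    return "%s %s %s %s %s." % (EVERY_NOM[subj_gender], subj_noun,
                                VERBS[verb], A_ACC[obj_gender], obj_noun)
```

Both generate(F2A) and generate(F2B) yield "Jeder Boxer liebt eine Frau.", which mirrors the observation that the German sentence carries the same scope ambiguity as the English one. Everything that makes the real task interesting, such as nesting, negation, relative clauses and anaphora, is left out here.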

Right now, I am concerned with defining my approach precisely and describing the place it has among other approaches from the literature. Today I collected dimensions along which Natural Language Generation (NLG) and MT systems can be positioned. I hope to post something about them tomorrow, before leaving for a break on the bicycle.

Here’s the title of my B.A. thesis: Problems of Generating German from Logical Formulae in Automatic Translation from English to German.