The landscape of NLP tools is vast. However, most of these tools still don’t work as well for other languages as they do for English. Even the simplest NLP tasks, such as tokenization, become a challenge for mainstream tools like spaCy, especially when the texts get more conversational and domain-specific.
With some searching, though, you might find tools that perform much better on isolated tasks. SoMaJo, a self-proclaimed state-of-the-art tokenizer for German text, is backed by academic research and worked very well on my highly domain-specific, conversational dataset.
Installation and Usage
pip install SoMaJo
The only default you have to be aware of is that paragraphs should be separated by a new line.
If you have your separated text in a file such as corpus.txt, you can write the tokens to a new file like this:
somajo-tokenizer corpus.txt > tokens.txt
For further functionality, such as splitting XML files or parallelizing the tokenization across multiple processes, refer to the tool’s documentation.
import spacy

nlp = spacy.load('de_core_news_sm')
doc = nlp(u"""
Ca. 90min mit newmotion geladen, weil ich mit Maingau/EinfachStromLaden keine Verbindung über die App bekam. Säule hat keinen RFID-Leser usw. 2. Buchse seit 2 Tagen mit Kommunalfahrzeug/EWV blockiert.
""")

for token in doc:
    print(token.text)
… yields these tokens:
Ca . 90min mit newmotion geladen , weil ich mit Maingau/EinfachStromLaden keine Verbindung über die App bekam . Säule hat keinen RFID-Leser usw. 2 . Buchse seit 2 Tagen mit Kommunalfahrzeug/EWV blockiert .
Several things go wrong here:
- “Ca.” should remain one token, but is split into “Ca” and “.”
- “90min” should be split into two tokens (“90” and “min”), but is kept as one
- The end of the first sentence is not recognized
- “2.” is an ordinal number (as in “second”) and not a sentence end, so it should remain one token, but is split into “2” and “.”
Besides the fact that sentence boundaries are detected incorrectly and the slash-joined words are not split, it gets even tougher when spaCy tries to PoS-tag the result: “90min”, for example, is tagged as a VERB…
SoMaJo handles many of these special cases, German-specific and otherwise, and splits the sentences above perfectly.
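Beyond the command line, SoMaJo also exposes a Python API. The following is a minimal sketch of running it on the example text above; “de_CMC” is the German model for computer-mediated communication, and split_camel_case additionally splits camel-cased words:

from somajo import SoMaJo

tokenizer = SoMaJo("de_CMC", split_camel_case=True)

paragraphs = [
    "Ca. 90min mit newmotion geladen, weil ich mit Maingau/EinfachStromLaden "
    "keine Verbindung über die App bekam. Säule hat keinen RFID-Leser usw. "
    "2. Buchse seit 2 Tagen mit Kommunalfahrzeug/EWV blockiert."
]

# tokenize_text() yields one list of Token objects per detected sentence.
for sentence in tokenizer.tokenize_text(paragraphs):
    print(" ".join(token.text for token in sentence))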
The tool is also easily extendable for the special cases of your dataset. For example, you can add entries like E.ON, which should be handled as one word, to the library’s abbreviation list.
Unfortunately, it doesn’t seem to be possible to load pre-tokenized text into spaCy. Instead, you would have to train your own spaCy tokenizer to get better results with it.
But from here you could feed your tokens into something like NLTK’s corpus reader to do basic statistics on the texts as well as to preprocess your corpus further. Or you could first tag them with part-of-speech tags, which I will describe for German texts in one of my next posts, again using a research-proven tool from the SoMaJo creators: SoMeWeTa.
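As a small sketch of such basic statistics, here is a frequency count over a hypothetical token list (in practice you would read the tokens back from tokens.txt, one per line, as written by somajo-tokenizer):

import nltk

# Hypothetical token list standing in for the contents of tokens.txt.
tokens = ["Säule", "hat", "keinen", "RFID-Leser", "usw.",
          "Säule", "seit", "2", "Tagen", "blockiert", "."]

# Frequency distribution over the tokens -- a typical first statistic.
fdist = nltk.FreqDist(tokens)
print(fdist.most_common(3))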