Many text mining tools, but are they really helping the investigator?

Henk Knoester (invited speaker)
Investigating Officer, Dutch Tax and Customs Administration


Large fraud investigations come with a lot of heterogeneous data from confiscated computers and networks. In all cases the investigators know that a crime has likely been committed: money has been lost by one or more parties. The task is to find evidence of the fraud and link it to suspects in order to help build the case.

If it is known how the criminals committed the crime, the investigator can look for specific digital traces. For example, if it is known that the fraud was committed by antedating or postdating documents, these documents can be found by comparing the document (file) metadata with the contents of the document. Metadata can easily be extracted, while a named entity recognizer can assist in finding dates in documents.
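The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used in the investigations: a simple regular expression stands in for the named entity recognizer, dates are assumed to be written as DD-MM-YYYY or DD/MM/YYYY, and the names `extract_dates`, `date_mismatches` and `tolerance_days` are hypothetical. The metadata date is passed in as a parameter; in practice it would be read from the file system or from the document properties.

```python
import re
from datetime import date

# Assumption: dates in the text look like 12-03-2009 or 12/03/2009.
# A real system would use a named entity recognizer instead.
DATE_PATTERN = re.compile(r"\b(\d{1,2})[-/](\d{1,2})[-/](\d{4})\b")


def extract_dates(text):
    """Return all valid calendar dates found in the text."""
    found = []
    for day, month, year in DATE_PATTERN.findall(text):
        try:
            found.append(date(int(year), int(month), int(day)))
        except ValueError:
            # Skip strings that match the pattern but are not real dates.
            continue
    return found


def date_mismatches(text, metadata_date, tolerance_days=30):
    """Return dates in the text that lie far from the metadata date."""
    return [
        d for d in extract_dates(text)
        if abs((d - metadata_date).days) > tolerance_days
    ]


# Hypothetical document whose file metadata says it was created in 2011.
document = "Signed in Amsterdam on 12-03-2009. Invoice due 01/04/2009."
suspicious = date_mismatches(document, date(2011, 6, 1))
```

A non-empty result only marks a document for closer inspection. A large gap between the dates in the text and the file date is not evidence of antedating by itself, since documents legitimately refer to past and future dates.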

On the other hand, if the modus operandi is not known, the investigator has to test a range of hypotheses about possible modi operandi. For instance, it can be assumed that criminals cover their illegal activities by using substitute words. These wordings are often out of context, as in a case where the criminals referred to money as “ginger bread nuts”. Tools can be developed for such applications. However, the wording is not always so clearly out of context. In another case, a criminal indicated that he was “going to the forest” whenever he was going to deposit money in his Luxembourg bank account. This type of ordinary sentence might be within context or only slightly out of context, and will be hard to detect automatically. Moreover, to implement this type of anomaly detection, relevant corpora need to be collected so that the normal use of language can be modeled.
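One simple way to make "out of context" concrete is to compare word frequencies in the seized material with those in a reference corpus. The sketch below is an assumption for illustration only, not a method described in the talk: it ranks words by a smoothed log-ratio of relative frequencies, and the function names `tokenize` and `unusual_words` are hypothetical. The reference text here is a toy example; in practice it would be the collected corpus of normal language use for the domain.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def unusual_words(case_text, reference_text, top_n=5):
    """Rank words that are over-represented relative to the reference."""
    case_counts = Counter(tokenize(case_text))
    ref_counts = Counter(tokenize(reference_text))
    case_total = sum(case_counts.values())
    ref_total = sum(ref_counts.values())
    vocab = len(set(case_counts) | set(ref_counts))

    scores = {}
    for word, count in case_counts.items():
        # Add-one smoothing so words unseen in the reference get a
        # finite score instead of a division by zero.
        p_case = (count + 1) / (case_total + vocab)
        p_ref = (ref_counts[word] + 1) / (ref_total + vocab)
        scores[word] = math.log(p_case / p_ref)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Toy data: a reference of ordinary business language and a message
# that uses a substitute word for money.
reference = ("please transfer the money to the account "
             "and send the invoice to the bank")
message = ("please bring the ginger bread nuts to the account "
           "and count the ginger bread nuts")
flagged = unusual_words(message, reference, top_n=3)
```

On this toy input the three highest-ranked words are those of the substitute phrase. A phrase such as “going to the forest” would not stand out in this way, because each of its words is common in ordinary language, which is exactly the difficulty noted above.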

For many fraud-related domains, however, neither the knowledge nor the reference databases needed to model the normal process are available. Also, the range of modus operandi hypotheses is unbounded. Many general-purpose text mining tools and techniques have been developed. Unfortunately, few of them have been applied successfully in special-purpose investigations in our Lab.

The question is whether text mining tools, most of which were developed for information retrieval, are indeed suitable for fraud investigators. It is time to treat fraud intelligence as a domain in its own right when developing forensic text analysis tools.

Short biography of Henk Knoester

Henk Knoester is an investigating officer at the Dutch Tax and Customs Administration, Criminal Investigations branch. The focus of his job is on IT forensic investigations. Knoester is responsible for the research into and the development of tools for (mainly) large fraud investigations.

Knoester received a degree in mathematics from The Hague College. He worked for 20 years at the Dutch Department of Defense, where he was responsible for the development of communications systems. He has held his current position since 2001.