Publication Search Form




We found publication with these paramters.

Reconstructing Textual Documents from n-grams

Matthias Gallé, Matias Tealdi
We analyze the problem of reconstructing documents when we only have access to the n-grams and their counts from the original documents and a xed n. Formally, we are in- terested in recovering the longest contiguous substrings of whose presence in the original documents we are certain. We map this problem on a de Bruijn graph, where the n- grams form the edges and where every Eulerian cycles gives a plausible reconstruction. We de ne two rules that reduce this graph in such a way of preserving all possible reconstruction, while at the same time increasing the length of the edge labels. From a theo- retical perspective we prove that the iterative application of these rules gives an irreducible graph equivalent to the orig- inal one. We then apply this on the data from the Guten- berg project to measure the number and size of the obtained longest substrings. Moreoever, we analyze how the n-gram corpus could be noised to prevent reconstruction, showing empirically that removing low frequent n-grams has little impact. Instead, we propose another method consisting in adding strategically ctitious n-grams and show that a noised corpus like that is much harder to reconstruct while increasing only little the perplexity of a language model obtained through it.
21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Sydney, Australia, 10-13 August, 2015.


2015-006.pdf (1.26 MB)