partial uniq using an LRU cache
Recently, I was faced with the challenge of removing duplicate lines from a number of large data files.
Typically, I use a combination of sort and uniq, or just sort -u, but in this circumstance many duplicate lines were close together.
I found that first partially filtering out duplicates, using an LRU cache to track and omit recently seen lines, doubled the speed. The filtering is only partial: a duplicate whose previous occurrence has already been evicted from the cache gets printed again, which is why the output below still goes through sort and uniq (a small illustration follows the script).
#!/usr/bin/env python
# file: lru-uniq.py
import fileinput

from repoze.lru import LRUCache

size = 10000
cache = LRUCache(size)

# Print a line only if it is not among the ~10,000 most recently seen
# distinct lines; always (re)insert it to refresh its recency.
for line in fileinput.input():
    if not cache.get(line):
        print line,
    cache.put(line, True)
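To make the "partial" part concrete, here is a small sketch of my own (the six-line input and the size-2 cache are invented, and it uses a strict LRU built on collections.OrderedDict rather than repoze.lru, whose eviction order may differ in detail). A duplicate that reappears while still cached is suppressed; one that reappears after eviction slips through:

# Hypothetical sketch of the partial-dedup behaviour, not part of lru-uniq.py.
from collections import OrderedDict

size = 2                      # tiny cache so evictions happen quickly
cache = OrderedDict()
for line in ["a", "b", "a", "c", "d", "a"]:
    if line in cache:
        del cache[line]       # seen recently: suppress, refresh recency below
    else:
        print line,           # first sighting, or already evicted
        if len(cache) >= size:
            cache.popitem(last=False)  # evict the least recently used entry
    cache[line] = True
# prints: a b c d a -- the final "a" slips through because it was evicted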
$ time sort ns | uniq | wc -l
936442
real 1m58.768s
$ time ./lru-uniq.py ns | sort | uniq | wc -l
936442
real 0m55.236s
I wanted to speed it up a little more, and although I had never written a Go program before, it turned out to be the fastest way to get a compiled version. So here it is, and I'm sure it is terrible Go.
package main

import (
    "bufio"
    "fmt"
    "os"

    "github.com/golang/groupcache/lru"
)

func main() {
    // Same idea as the Python version: print a line only when it is
    // not among the 10,000 most recently seen distinct lines.
    cache := lru.New(10000)
    stdin := bufio.NewReader(os.Stdin)
    for {
        line, err := stdin.ReadString('\n')
        if len(line) > 0 { // also handles a final line with no trailing newline
            if _, hit := cache.Get(line); !hit {
                fmt.Print(line)
            }
            cache.Add(line, 1)
        }
        if err != nil {
            break // io.EOF or a read error
        }
    }
}
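In case it helps (my note, not from the original): assuming the source is saved as lru-uniq.go, a pre-modules Go toolchain should build it with something like go get github.com/golang/groupcache/lru followed by go build lru-uniq.go; a modern toolchain will want a go.mod first.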
$ time ./lru-uniq < ns | sort | uniq | wc -l
936442
real 0m35.218s
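As a closing aside (my comparison, not something the original measured): an unbounded set would deduplicate exactly and make the trailing sort and uniq pass unnecessary, but its memory use grows with the number of distinct lines in the file; keeping memory fixed is presumably the point of the LRU cache here. A minimal sketch:

# Hypothetical exact-dedup alternative: an unbounded set.
# Unlike the LRU version, memory grows with the number of distinct lines.
import fileinput

seen = set()
for line in fileinput.input():
    if line not in seen:
        print line,
        seen.add(line)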