Recently, I was faced with the challenge of removing duplicate lines from a number of large data files.

Typically, I use a combination of sort and uniq, or just sort -u, but in this case many of the duplicate lines occurred close together.
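For reference, the two usual approaches are equivalent; a quick sketch (with made-up input) of both pipelines:

```shell
# Pipe through sort then uniq to drop adjacent duplicates after sorting...
printf 'b\na\nb\na\n' | sort | uniq

# ...or let sort deduplicate in one step with -u.
printf 'b\na\nb\na\n' | sort -u
```

Both print the same two lines; the trouble is that sorting the whole file first is the slow part.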

I found that first partially filtering out duplicates with an LRU cache, which tracks and omits recently seen lines, doubled the speed.

#!/usr/bin/env python
# file:
import fileinput
from repoze.lru import LRUCache

size = 10000
cache = LRUCache(size)
for line in fileinput.input():
    # Emit the line only if it was not seen within the last `size` distinct lines.
    if not cache.get(line):
        print(line, end='')
    cache.put(line, True)

$ time sort ns | uniq | wc -l
real	1m58.768s

$ time ./ ns | sort | uniq | wc -l
real	0m55.236s
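The filtering is only partial: a duplicate that recurs farther apart than the cache size slips through, which is why the output still goes through sort and uniq afterwards. A minimal sketch of the idea, using collections.OrderedDict as a stand-in for repoze.lru (the helper name and sample data here are illustrative, not from the original script):

```python
from collections import OrderedDict

def lru_filter(lines, size):
    """Yield each line unless it was seen within the last `size` distinct lines."""
    cache = OrderedDict()
    for line in lines:
        if line in cache:
            # Recently seen: drop it, and mark it as most recently used.
            cache.move_to_end(line)
        else:
            yield line
        cache[line] = True
        if len(cache) > size:
            # Evict the least recently used entry.
            cache.popitem(last=False)

# With a cache of 2, the second "a" recurs outside the window and slips through:
print(list(lru_filter(["a", "b", "c", "a"], size=2)))
# → ['a', 'b', 'c', 'a']
```

So the cache cheaply shrinks the stream when duplicates cluster, and the final sort pass mops up whatever escapes the window.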

I wanted to speed it up a little more, and although I had never written a program in Go before, it turned out to be the quickest way to produce a compiled version.

So, here it is, and I’m sure it is terrible Go.

package main

import (
	"bufio"
	"fmt"
	"os"

	// Import path assumed from the lru.New/Get/Add calls below; the
	// original post's import block was truncated.
	"github.com/golang/groupcache/lru"
)

func main() {
	cache := lru.New(10000)
	stdin := bufio.NewReader(os.Stdin)
	for {
		line, err := stdin.ReadString('\n')
		if err != nil {
			// EOF (or a read error): stop.
			break
		}
		// Print the line only if it is not in the recently-seen cache.
		if _, hit := cache.Get(line); !hit {
			fmt.Print(line)
		}
		cache.Add(line, 1)
	}
}

$ time ./lru-uniq < ns | sort | uniq | wc -l
real	0m35.218s