Widefinder in Kamaelia?
December 09, 2007 at 05:55 PM | categories: python, oldblog | View Comments
Tim Bray's widefinder project is an interesting little problem. How do you find the top 10 most popular accesses to your website in a way that gains performance when you use more CPUs? Having done log analysis in the past, there's always a tradeoff Looking at the original Ruby source
So, what would this need to look like? Well, to my mind it'd look something like this: (pseudo kamaelia)
Aside from distribution across CPUs, and collation of results, this looks relatively simple to implement. Mainly blogging about this whilst I remember. The other thing here as well to note is that the results would never need to be completely sorted in this version. In fact, the more CPUs you have, the less sorted the data would be!
counts = {}This really looks fairly Kamaelia friendly. It requires support not yet written ( :) ), but it certainly fits Kamaelia's model. (The code needing writing is relatively minimal now though, given we have basic multicore support)
counts.default = 0
ARGF.each_line do |line|
if line =~ %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }
counts[$1] += 1
end
end
keys_by_count = counts.keys.sort { |a, b| counts[b] <=> counts[a] }
keys_by_count[0 .. 9].each do |key|
puts "#{counts[key]}: #{key}"
end
So, what would this need to look like? Well, to my mind it'd look something like this: (pseudo kamaelia)
Pipeline(Even then, for this problem, this is suboptimal - we want the merge to not actually output the merged heap, but rather the values from the heaps, so that it doesn't do unnecessary work. Either way though, an interesting problem. May get round to implementing this after I've dealt with STM.
readfile,
SplitWorkers(8, -> Create 8 Analyser workers, ideally in different CPUs
Pipeline(Analyser -> Builds a heap for counts, emits heap
Serialiser -> outputs values in order from the heap
) -> Output from a worker is a tuple (id, value)
)
merge(8) -> merge heap streams (Can't start until it has values from all workers
Head(10) -> output first 10 items
ConsoleEchoer()
)
Aside from distribution across CPUs, and collation of results, this looks relatively simple to implement. Mainly blogging about this whilst I remember. The other thing here as well to note is that the results would never need to be completely sorted in this version. In fact, the more CPUs you have, the less sorted the data would be!