Concurrent software is not the problem - Intel talking about 1000s of cores
July 05, 2008 at 12:47 PM | categories: python, oldblog

Why? Like the Erlang group, the thing we've focussed on in Kamaelia is making concurrency easy to work with, primarily by making concurrent software easier to maintain (for the average developer). In practical terms this has meant putting (hopefully) friendly metaphors on top of well established principles of message passing systems, as well as adding support for other forms of constrained shared data (STM is a bit like version control for variables).
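To make the "version control for variables" analogy concrete, here's a minimal illustrative sketch - not Kamaelia's actual STM API, just the idea - of a store where you check out a copy of a value, change the copy, and commit it back, with the commit refused if someone else committed in the meantime:

# Illustrative sketch only (not Kamaelia's STM API): shared values are updated
# by checking out a copy and committing it back; stale commits are rejected.
# A real version would also take a lock around commit().

class ConcurrentUpdate(Exception):
    pass

class Store(object):
    def __init__(self):
        self.values = {}                      # name -> (version, value)

    def checkout(self, name, default=None):
        version, value = self.values.get(name, (0, default))
        return Checkout(self, name, version, value)

    def commit(self, name, version, value):
        current, _ = self.values.get(name, (0, None))
        if version != current:
            raise ConcurrentUpdate(name)      # someone else committed first
        self.values[name] = (current + 1, value)

class Checkout(object):
    def __init__(self, store, name, version, value):
        self.store, self.name = store, name
        self.version, self.value = version, value

    def commit(self):
        self.store.commit(self.name, self.version, self.value)
        self.version += 1

S = Store()
counter = S.checkout("counter", 0)
counter.value += 1
counter.commit()                              # fails if another commit intervened

The point is simply that shared data is never modified in place; updates either go through cleanly or are rejected and retried.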
We've done this by using various application domains as the starting point - such as DVB, networking and audio/video work - using Python as the language of choice (though we probably could've shouted about our application uses more and better; we're getting better at that, I think :-). However, the approach applies to more or less any non-functional language - so there are proofs of concept of our miniaxon core in C++, Ruby and Java as well (the C++ and Ruby ones are in a deliberately simple/naive coding style :).
This does mean that when we approach a problem now - such as the desire to build a tool that assists a child learning to read and write - we end up with a piece of code that internally exhibits high levels of concurrency. For example, even the simple Speak And Write application is made of 37 components, which at present all run in the same process, but which could easily be made to use 37 processes (by prepending all Pipelines & Graphlines with the word "Process").
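As a sketch of what that change amounts to (using component names that appear later in this post; treat it as illustrative rather than tested code), the components stay the same and only the chassis wrapping them changes:

# All components cooperatively scheduled inside one process:
Pipeline( SimpleFileReader(), AudioDecoder(), AudioPlayer() ).run()

# The same components, one OS process each (assuming ProcessPipeline, discussed
# below, behaves as a drop-in replacement for Pipeline):
ProcessPipeline( SimpleFileReader(), AudioDecoder(), AudioPlayer() ).run()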
Despite this, we don't normally think in terms of the number of components or concurrent things, largely because you don't normally think about the number of functions you use in a piece of code either - we just focus on the functionality we want from the system. I'm sure once upon a time people did count functions, but I don't know anyone who counts the number of functions or methods they have today. The code below, for example, builds the high level functionality of the system:
bgcolour = (255,255,180)

Backplane("SPEECH").activate()

Pipeline(
    SubscribeTo("SPEECH"),
    UnixProcess("while read word; do echo $word | espeak -w foo.wav --stdin ; aplay foo.wav ; done"),
).activate()

CANVAS = Canvas( position=(0,40), size=(800,320),
                 bgcolour = bgcolour ).activate()

CHALLENGE = TextDisplayer(size = (390, 200), position = (0,40),
                          bgcolour = bgcolour, text_height=48,
                          transparent =1).activate()

TEXT = Textbox(size = (800, 100), position = (0,260), bgcolour = (255,180,255),
               text_height=48, transparent =1 ).activate()

Graphline(
    CHALLENGER = Challenger(),
    CHALLENGE_SPLITTER = TwoWaySplitter(),
    CHALLENGE_CHECKER = Challenger_Checker(),
    SPEAKER = PublishTo("SPEECH"),
    CHALLENGE = CHALLENGE,
    TEXT = TEXT,
    CANVAS = CANVAS,
    PEN = Pen(bgcolour = bgcolour),
    STROKER = StrokeRecogniser(),
    OUTPUT = aggregator(),
    ANSWER_SPLITTER = TwoWaySplitter(),
    TEXTDISPLAY = TextDisplayer(size = (800, 100), position = (0,380),
                                bgcolour = (180,255,255), text_height=48 ),
    linkages = {
        ("CANVAS", "eventsOut") : ("PEN", "inbox"),
        ("CHALLENGER","outbox") : ("CHALLENGE_SPLITTER", "inbox"),
        ("CHALLENGE_SPLITTER","outbox") : ("CHALLENGE", "inbox"),
        ("CHALLENGE_SPLITTER","outbox2") : ("SPEAKER", "inbox"),
        ("PEN", "outbox") : ("CANVAS", "inbox"),
        ("PEN", "points") : ("STROKER", "inbox"),
        ("STROKER", "outbox") : ("OUTPUT", "inbox"),
        ("STROKER", "drawing") : ("CANVAS", "inbox"),
        ("OUTPUT","outbox") : ("TEXT", "inbox"),
        ("TEXT","outbox") : ("ANSWER_SPLITTER", "inbox"),
        ("ANSWER_SPLITTER","outbox") : ("TEXTDISPLAY", "inbox"),
        ("ANSWER_SPLITTER","outbox2") : ("CHALLENGE_CHECKER", "inbox"),
        ("CHALLENGE_CHECKER","outbox") : ("SPEAKER", "inbox"),
        ("CHALLENGE_CHECKER", "challengesignal") : ("CHALLENGER", "inbox"),
    },
).run()
However, what has this got to do with 1000s of cores? After all, even a larger application (like the Whiteboard) only really exhibits a hundred or two hundred degrees of concurrency... Now, clearly, if every application you were using was written using the approach of simpler, friendlier component metaphors that Kamaelia currently uses, then it's likely that you would start using all those CPUs. I say "approach", because I'd really like to see people taking our proofs of concept and making native versions for C++, Ruby, Perl, etc - I don't believe in the view of one language to rule them all. I'd hope the result would be easier to maintain and more bug free, because that's a core aim, but the proof of the approach is in the coding really, not the talking.
However, when you get to 1000s of cores, a completely different issue arises that you didn't have with concurrency at the level of 1, 5, 10 or 100 cores: software tolerance of hardware unreliability. That, not writing concurrent software, is the REAL problem.
It's been well noted that Google currently scale their applications across 1000s of machines using MapReduce, which fundamentally is just another metaphor for writing code in a concurrent way. However, they are also well known to work on the assumption that a number of their servers will fail every single day - and a failure fundamentally means stopping halfway through doing something. Now, with a web search, if something goes wrong you can just redo the search, or simply not aggregate that part of the results.
In a desktop application, what if the core that fails is handling audio output? Is it acceptable for the audio to just stop working? Or would you need some mechanism to back out from the error and retry? It was thinking about these issues early this morning that I realised that what you need is a way of capturing what is going to run on a core before you execute it, and only then launching it. In that scenario, if the core fails (assuming a detection mechanism), you can then restart the component on a fresh one.
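As a rough sketch of that idea - nothing Kamaelia-specific, just Python's multiprocessing module, with a hypothetical audio_output function standing in for whatever would have run on the failed core - keep the recipe for the work, launch it, watch for the process dying, and relaunch from the recipe:

# Sketch only: keep the *recipe* for the work (callable + args), not just the
# running process, so a dead worker can be relaunched. Runs until interrupted.
import multiprocessing
import time

def audio_output():
    # Hypothetical stand-in for the component running on the failed core.
    while True:
        time.sleep(1)          # pretend to do useful work

def supervise(component, *args):
    process = multiprocessing.Process(target=component, args=args)
    process.start()
    while True:
        process.join(timeout=1.0)
        if not process.is_alive():                         # detection mechanism
            process = multiprocessing.Process(target=component, args=args)
            process.start()                                # restart on a fresh core

if __name__ == "__main__":
    supervise(audio_output)

The important part is that the supervisor keeps the callable and its arguments, not just the running process, so there is always enough information to start the work again.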
The interesting thing is that ProcessPipeline can help us out here. The way ProcessPipeline works is as follows. Given the following system:
ProcessPipeline( Producer(), Transformer(), Consumer() ).run()
Such as:
ProcessPipeline( SimpleFileReader(), AudioDecoder(), AudioPlayer() ).run()
Then ProcessPipeline runs in the foreground process. For each of the components listed in the pipeline it forks and runs the component using the pprocess library, with data passing between components via the ProcessPipeline itself (on the principle of the simplest thing that could possibly work). The interesting thing about this is that ProcessPipeline therefore holds a copy of each component from before it started executing. Fundamentally, this would allow ProcessPipeline (at some later point in time) to detect the erroneous death of a component (somehow :) ), whether due to bugs or hardware failure, and to restart it - masking the error from the other components in the system.
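To make that shape concrete, here's a toy sketch - emphatically not the real ProcessPipeline or pprocess code, and with the restart-on-failure part left out - in which the parent keeps the unstarted stage callables, forks one process per stage, and relays data between the stages itself:

# Toy sketch of the shape described above, using multiprocessing rather than
# pprocess: the parent holds the stages, forks workers, and shuttles data
# between each stage's outbox and the next stage's inbox.
import multiprocessing

def stage_runner(stage, inbox, outbox):
    # Each forked process runs one stage; None is used as a shutdown marker.
    for item in iter(inbox.get, None):
        outbox.put(stage(item))
    outbox.put(None)

def double(x):
    return x * 2

def stringify(x):
    return str(x)

class ProcessPipelineSketch(object):
    def __init__(self, *stages):
        self.stages = stages       # unstarted stage callables, kept by the parent

    def run(self, items):
        inboxes  = [multiprocessing.Queue() for _ in self.stages]
        outboxes = [multiprocessing.Queue() for _ in self.stages]
        workers = [multiprocessing.Process(target=stage_runner,
                                           args=(stage, inboxes[i], outboxes[i]))
                   for i, stage in enumerate(self.stages)]
        for worker in workers:
            worker.start()
        for item in items:                     # parent feeds the first stage
            inboxes[0].put(item)
        inboxes[0].put(None)
        # Parent relays data from each stage's outbox to the next stage's inbox.
        for i in range(len(self.stages) - 1):
            for item in iter(outboxes[i].get, None):
                inboxes[i + 1].put(item)
            inboxes[i + 1].put(None)
        results = list(iter(outboxes[-1].get, None))
        for worker in workers:
            worker.join()
        return results

if __name__ == "__main__":
    print(ProcessPipelineSketch(double, stringify).run([1, 2, 3]))

Because the parent constructed and still holds each stage, it has everything it needs to fork a replacement if one of the worker processes dies.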
Now, quite how that would actually work in practice I'm not really sure - ProcessPipeline is, after all, experimental at present, with its issues being explored by a Google Summer of Code project aimed at a multi-process paint program (by a first year CS student...). However, it gives me warm fuzzy feelings about both our approach and its potential longevity, since we do have a clear, reasonable answer as to how to deal with that (hardware) reliability issue.
So, whilst Intel may have some "unwelcome advice", and people may be reacting by thinking "how on earth do I even structure my code to work that way?", the real problem is "how do I write application code that is resilient to, and works despite, hardware failure?"
That's a much harder question, and the only solution to both problems that I can see is "break your code down into restartable, non-datasharing, message passing, replaceable components". I'm sure other solutions either exist or will come along though :-) After all, Kamaelia turns out to have similarities to Hugo Simpson's MASCOT (pdf, see also wikipedia link), which is over 30 years old but barely advertised, so I'm sure other approaches exist.