Re: Optimization! Where?
- From: "Alexander Grigoriev" <alegr@xxxxxxxxxxxxx>
- Date: Tue, 9 Dec 2008 06:38:03 -0800
"Intel Core" architecture is what's used in Intel Core Duo, and other latest
processors, except the most recent generation (code-named Nehalem).
You misrepresent the concept of an instruction pipeline here. An instruction
has to pass through all of these 14 or 20 or 31 stages. Some stages can have
multiple execution units; for example, 4 units to decode an instruction, 4
units to start an instruction, 3 ALUs, 6 units to retire the instruction
results (write the results to a register or post a memory write). Those
numbers mean how many u-ops can be at any one stage. There are not 14 or 20
pipelines. There is a number of u-ops in flight at any moment, though.
"Alan Carre" <alan@xxxxxxxxxxxxxxxxx> wrote in message
news:OtGfACgWJHA.2372@xxxxxxxxxxxxxxxxxxxxxxx
"Alexander Grigoriev" <alegr@xxxxxxxxxxxxx> wrote in message
news:ONWmohbWJHA.1528@xxxxxxxxxxxxxxxxxxxxxxx
There's no need to explain all these little things to me.
These are not "little things", and anyway there's no need to tell me about
the BPU. But who's counting?
Intel Core architecture:
14-stage pipeline
Three ALU
Four instruction decoders
up to six u-ops per clock to start
up to four u-ops per clock to retire
I assume that you're referring to the Pentium Pro (from the article I
mentioned). I was not; I was referring to modern processors, namely the
Pentium 4.
Note that 14 stage pipeline is not the same as 14 pipelines.
This is true, but not the whole story. Let me requote:
[Re-quote]
Internally, Intel architecture (IA) chips are 'deeply pipelined' -- they
have many pipelines, say 20 or so.
[/Re-quote]
The words 'deeply' and 'many' are near-synonyms here. "Deeply pipelined"
just means that multiple streams can share the same pipeline in an
interleaved fashion - this has the same effect as having multiple
pipelines. I believe this is also referred to as "pipeline coloring"
(analogous to "page coloring" used in caching technology) and, as legend
might have it, probably has its roots in the now-famous "Los Alamos grad
students" who, working under Feynman, used colored punch-cards in order to
keep track of multiple computations running on an IBM "tabulator" back
around 1942 during construction of the first atomic bomb. This was
probably the first-ever use of pipelining in an electronic-computing
environment (though they had no name for it back then).
[Side note: You can read about this story in Feynman's very funny and
interesting semi-autobiography entitled "Surely You're Joking, Mr. Feynman!"
: highly recommended - I have a pdf if you want to read it.]
Anyway, each "micro-op digester" (or micro-cpu if you wish) is latched on
to a particular stage of an N-stage pipeline where N, in your example has
an upper bound of 14. So if we had 20 internal decoders we could only have
at most:
0. 14 "1-stage" pipelines [14 "pipelines"]
1. Seven "2-stage" pipelines [7 pipelines]
2. Five "2-stage" and one "4-stage" pipelines. [6 pipelines]
3. Four "3-stage" and one "2-stage" pipelines. [5 pipelines]
... and so on through all possible permutations.
...with each unit in the process of executing whatever instruction-stream
it happened to be provided. These extra instruction-streams will typically
include the many branch-predicted variants of a particular process's
"current" code-stream.
I put "current" in quotes since (due to HT [hyperthreading technology])
these pipelines may simultaneously contain instructions from yet another
thread's "current" code stream (max ~= 2). Such "hyper-threaded" micro-ops
*don't* get purged from the pipeline along with the rest in the event of a
cache miss, but that obviously doesn't help the thread in question.
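
The arithmetic behind the list of splits above can be checked with a short sketch. It takes the counting model of this post at face value (a fixed budget of 14 stages divided among sub-pipelines). The names are hypothetical, and treating the splits as unordered is an assumption, since the post says "permutations" without defining them.

```python
TOTAL_STAGES = 14

# Each entry lists the stage count of every sub-pipeline in one split.
SPLITS = {
    "fourteen 1-stage": [1] * 14,
    "seven 2-stage": [2] * 7,
    "five 2-stage and one 4-stage": [2] * 5 + [4],
    "four 3-stage and one 2-stage": [3] * 4 + [2],
}


def summarize(split):
    """Return (total stages used, number of sub-pipelines) for one split."""
    return sum(split), len(split)


def count_splits(total):
    """Count unordered ways to divide `total` stages into sub-pipelines.

    This is the classic integer-partition count, computed by adding one
    allowed part size at a time.
    """
    ways = [1] + [0] * total
    for part in range(1, total + 1):
        for t in range(part, total + 1):
            ways[t] += ways[t - part]
    return ways[total]


if __name__ == "__main__":
    for name, split in SPLITS.items():
        stages, pipelines = summarize(split)
        print(f"{name}: {stages} stages, {pipelines} pipelines")
    print("unordered splits of 14 stages:", count_splits(TOTAL_STAGES))
```

Every split listed uses exactly 14 stages, and the last one (four 3-stage plus one 2-stage) comes to five sub-pipelines under this counting.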
Concerning P4 based microprocessors - released 2000:
--------------------------------------------------------
Intel Pentium 4 processors contain a minimum 20-stage pipeline:
http://en.wikipedia.org/wiki/Instruction_pipelining
Before discontinuation (August 8, 2008) several variants of the Pentium 4
contained even longer pipelines, such as the "Prescott" and "Cedar Mill"
revisions, both of which contained a 31-stage pipeline:
http://en.wikipedia.org/wiki/Pentium_4 .
Concerning the P4's successor:
-------------------------------
The successor, code-named "Conroe", is best known to most of us as "Intel
Core 2". This is the current trend; namely abandoning "Cedar Mill" type
variants in favor of multi-core P4 based IA. BTW, Cedar Mill was slated to
become a 9GHz Pentium 4 variant. Yes, I said "9GHz single-core processor".
-----------
That paper analysed the performance hit for a simulated PPro.
Things have changed a little bit since 1995 (when that paper was written).
I'm not sure the PPro even had a u-op cache.
Firstly: I guess you didn't read the article very closely. For one, it
references a paper written in 1996 so it had to have been written after
1995. Also, it happens to be dated 10/96 (ie. ~= 1997): "OOPSLA 96 -
10/96, San Jose, CA USA". You did get the copyright year correct though
(though I suspect that was a typo).
Secondly: the article doesn't deal with the form of the instruction set
(mu-ops or x86 assembly) so that point is irrelevant to the discussion. It
was me who (parenthetically) mentioned the trace cache as a well-known
context where you might find the kind of data I was talking about.
Thirdly, quoting from the article:
"It should be noted that none of these processors is intended to exactly
model an existing processor; for example, the Intel Pentium Pro's
instruction set and microarchitecture is very different from P96-Pro, and
so the latter should not be used to predict the Pentium Pro's performance
on C++ programs. ***Instead, we use these processors to mark plausible
points in the design space, and their distance and relationship to
illustrate particular effects or trends.***" (emphasis mine)
And finally (again from the article):
On future processors, this dispatch overhead is likely to *increase
moderately* (emphasis mine).
- Alan Carre
P.S. You should note that these superscalar processors with BPBs and
BTBs were specifically designed to deal with the problems associated with
things like "call through address" and other important problems like the
use of "if" for conditionals (i.e. instead of "switch/case"). So this is
exactly where (finally!) a shining example of being able to cope with
virtual functions should've manifested itself. That it was a dismal
failure, and that their synopsis implies that the outlook for future
processors is "grim at best" should say something.
P.P.S. If you have a more recent paper that shows otherwise I'd be glad to
read it.
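
For readers unfamiliar with why a "call through address" is hard to predict, here is a toy model of a single branch target buffer entry. It is only a sketch under a loud assumption: the predictor simply remembers the last target seen at one call site. Real predictors are more elaborate, and the function name and target labels are made up for illustration.

```python
def btb_mispredictions(targets):
    """Count mispredictions at one indirect-call site.

    Assumes a one-entry, last-target predictor: the next call is predicted
    to go wherever the previous call at this site went.
    """
    predicted = None  # cold entry: nothing recorded yet
    misses = 0
    for actual in targets:
        if predicted != actual:
            misses += 1
        predicted = actual  # remember the most recent target
    return misses


if __name__ == "__main__":
    # Same virtual function called on objects of one type.
    monomorphic = ["Circle.draw"] * 8
    # Same call site, receiver type alternating on every call.
    alternating = ["Circle.draw", "Square.draw"] * 4
    print(btb_mispredictions(monomorphic))
    print(btb_mispredictions(alternating))
```

Under this assumption a call site that always reaches the same function misses once (the cold entry), while one whose target alternates misses on every call. That is the kind of cost the dispatch-overhead argument is about.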