Matrix Core Programming on AMD CDNA Architecture

rocm.blogs.amd.com

60 points by salykova 8 days ago

So from CDNA3 to 4 they doubled fp16 and fp8 performance but cut fp32 and fp64 by half?

Wonder why the regression on non-AI workloads?

adrian_b - 3 days ago

Because those who have money to invest nowadays do not invest it in the research problems whose solutions are urgently needed for the survival of humanity, e.g. developing technologies for using all substances in closed cycles (as the biosphere did before humans). Instead, they invest all their money in research toward the dream of developing AGI, which, even if successful, will benefit only a small number of humans, not all mankind.
The fp64 and fp32 performance is needed for physical simulations required by the former goal, while fp16 and fp8 performance is useful only for the latter goal.
So AMD's choice logically follows the choice of those who control the investment money.
- Archit3ch - 3 days ago
  
  > The fp64 and fp32 performance is needed for physical simulations
  In the very unlikely case where
  1) You need fp64 Matrix-Matrix products for physical simulations
  2) You bought the MI355X accelerator instead of hardware better suited for the task
  you can still emulate it with the Ozaki scheme.
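
For readers unfamiliar with the idea, here is a minimal sketch of slice-based emulation in the spirit of the Ozaki scheme. It is a simplified fixed-point variant written for illustration, not the published algorithm: each row of A and each column of B is scaled by a shared power of two and cut into small integer "digits", so every slice-by-slice product is exact, which is the kind of work a low-precision matrix unit could do. The names `split_row` and `sliced_matmul` and the defaults (7 slices of 8 bits) are assumptions.

```python
import math

def split_row(values, num_slices, bits):
    """Fixed-point split: values[j] ~ 2**exp * sum_p slices[p][j] * 2**(-bits*(p+1))."""
    total = bits * num_slices
    amax = max(abs(v) for v in values)
    slices = [[0] * len(values) for _ in range(num_slices)]
    if amax == 0.0:
        return 0, slices
    exp = math.frexp(amax)[1]                       # amax < 2**exp
    mask = (1 << bits) - 1
    for j, v in enumerate(values):
        q = round(math.ldexp(abs(v), total - exp))  # scaled integer magnitude
        q = min(q, (1 << total) - 1)                # guard the round-up corner case
        sign = -1 if v < 0 else 1
        for p in reversed(range(num_slices)):       # fill least significant digit first
            slices[p][j] = sign * (q & mask)
            q >>= bits
    return exp, slices

def sliced_matmul(A, B, num_slices=7, bits=8):
    """Multiply A (m x k) by B (k x n) using only small-integer slice products."""
    total = bits * num_slices
    rows = [split_row(list(r), num_slices, bits) for r in A]
    cols = [split_row(list(c), num_slices, bits) for c in zip(*B)]
    C = []
    for ea, sa in rows:
        out = []
        for eb, sb in cols:
            acc = 0                                  # exact integer accumulator
            for p in range(num_slices):
                for q in range(num_slices):
                    # Products of two `bits`-wide digits: exact in low precision.
                    dot = sum(x * y for x, y in zip(sa[p], sb[q]))
                    acc += dot << (bits * (2 * num_slices - 2 - p - q))
            # Undo the two power-of-two scalings in one step.
            out.append(math.ldexp(float(acc), ea + eb - 2 * total))
        C.append(out)
    return C
```

Note that the shared exponent per row or column means small entries next to large ones lose relative precision, so the accuracy depends on how well scaled the data is.
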
  - stonogo - 2 days ago
    
    What hardware is better suited for the task? In FLOPS per dollar, NVIDIA is in retreat just as much as AMD is when it comes to fp64.
    
    Archit3ch - 2 days ago
    
    ARMv9 Scalable Matrix Extension (SME). Apple has had outer-product matrix hardware (AMX) since 2019, but you cannot buy the chips by themselves.
    
    stonogo - a day ago
    
    Yeah, I saw the presentations at SC25, but I wasn't able to get anyone to commit to being able to buy them in the next year or three. Right now I have two open RFPs and nobody is bidding ARM.
- jjtheblunt - 3 days ago
  
  Expanding (I think) on your point: is it perhaps just a fork into two product lines for different uses?
  - walterbell - 3 days ago
    
    Will there be future hardware optimized for physical simulations, or should existing/faster hardware be stockpiled now?
    
    adrian_b - 2 days ago
    
    I am still using ancient AMD GPUs, bought between 2015 and 2019, because all later GPUs have much worse FP64 throughput per dollar.
    So I was never able to upgrade them, because all newer GPUs are worse.
    There was a little hope when the last generation of Intel discrete desktop Battlemage GPUs improved their FP64 throughput. While their throughput is relatively modest, i.e. half of a Zen 5 desktop Ryzen, they are extremely cheap so their performance per dollar is very good. Therefore they can be used to multiply the throughput of a desktop computer at a modest additional cost.
    Unfortunately, with the new Intel CEO the future of the Intel GPUs is very unclear, so it is unknown whether they will be followed by better GPUs or canceled. If Intel stupidly chooses to no longer compete in the GPU market, the last source of GPUs with good FP64 throughput will disappear.
    The datacenter GPUs that still have good FP64 throughput have huge prices that cannot be justified for any small business or individual. In order to recover the cost of such GPUs you must have a workload that keeps them busy continuously, day and night. Such workloads must be aggregated from a large number of users. So we have regressed to the time-shared mainframes of the early 1970s, backwards from the freedom of personal computers.
    I see no hope for the future availability of any computing devices with better FP64 throughput per dollar than the desktop CPUs. Technically, it would be trivial to make such devices, but the companies like AMD and NVIDIA do not care about small business or individual customers but only about selling to other equally huge companies, so they dimension their devices accordingly and they also set fictitious retail prices many times greater than the actual price that will be negotiated with the big companies. While the big companies will pay much less, small businesses or individuals cannot buy at other prices than the list prices, which means that they must give up on buying such devices as they are not worth such prices.
    
    walterbell - 2 days ago
    
    It took about 25 years for the cycle from Napster -> MP3 players -> flash memory -> smartphones -> big data -> big GPUs -> LLMs and generative AI -> OpenAI buying 100% of remaining memory wafer capacity from SK Hynix and Samsung = little left for the edge, with 100% price hikes for consumer DIMMs.
    https://openai.com/index/samsung-and-sk-join-stargate/
    > Samsung Electronics and SK hynix plan to scale up production of advanced memory chips, targeting 900,000 DRAM wafer starts per month at an accelerated capacity rollout, critical for powering OpenAI’s advanced AI models.
    We need a new "Napster moment" to restart supply chain investment and business models at the edge. Humanoid robotics might qualify, since robots will need low-latency responses to local sensor input.
    Another factor in edge vs. mainframe economics is the cost of energy in each location.
bigdict - 3 days ago

cuz area and power
- fancyfredbot - 3 days ago
  
  Area and power are why there was a choice to make. AI data centre demand is why they made this choice specifically.
trueismywork - 3 days ago

Non-AI workloads prefer vector units and not matrix units
- phkahler - 3 days ago
  
  >> Non-AI workloads prefer vector units and not matrix units
  FEA and other "scientific" workloads are all matrix math. This is why supercomputers have been benchmarked using BLAS and LAPACK for the past 40 years. OTOH, are those matrix * vector, where AI is matrix * matrix?
  Either way, it's a regression, which seems strange.
  - trueismywork - 3 days ago
    
    The NVIDIA B200 did the same. A lot of FEA codes go explicit (matrix free) because scaling is better.
    Also look up Ozaki algorithms.
    
    adrian_b - 2 days ago
    
    I do not see what the relationship is between Ozaki algorithms and algorithms that are supposedly "matrix free".
    The Ozaki scheme and its variants improve the precision of matrix-matrix multiplications, allowing a multiplication done with lower-precision operations to approach the precision of the same multiplication done with higher-precision operations.
    So it is an improvement for matrix-matrix operations, which are better done in matrix units. It is not any kind of "matrix free" algorithm.
    The Ozaki scheme is not good enough for emulating FP64 in a GPU with poor FP64 throughput, but good FP32 throughput. The reason is that not only the greater precision of FP64 is important, but also its much greater dynamic range in comparison with FP32. In computations with FP64, overflows and underflows are extremely rare events and easy to avoid. On the other hand, in complex physical simulations it is impossible to avoid overflows and underflows in FP32, unless one uses extremely cumbersome frequent rescalings, which eliminate all the advantages of using floating-point numbers instead of fixed-point numbers.
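
The dynamic-range point can be illustrated with a small sketch. It rounds Python floats (FP64) to FP32 through the standard `struct` module; the helper names `f32`, `norm_naive_f32` and `norm_rescaled_f32` and the input values are assumptions chosen for illustration.

```python
import math
import struct

def f32(x):
    """Round an FP64 value to the nearest FP32 value (inf on overflow)."""
    try:
        return struct.unpack("f", struct.pack("f", x))[0]
    except OverflowError:
        # struct refuses finite values beyond the FP32 range; model that as +/-inf.
        return math.copysign(math.inf, x)

def norm_naive_f32(x, y):
    """sqrt(x^2 + y^2) with every intermediate rounded to FP32."""
    xx = f32(x * x)
    yy = f32(y * y)
    return f32(math.sqrt(f32(xx + yy)))

def norm_rescaled_f32(x, y):
    """Same norm in FP32, but rescaled by the larger magnitude first."""
    s = max(abs(x), abs(y))
    if s == 0.0:
        return 0.0
    xs, ys = f32(x / s), f32(y / s)
    r = f32(math.sqrt(f32(f32(xs * xs) + f32(ys * ys))))
    return f32(s * r)

x, y = f32(3e20), f32(4e20)             # both comfortably inside the FP32 range
overflowed = norm_naive_f32(x, y)       # x*x is about 9e40, beyond FP32's ~3.4e38
rescaled = norm_rescaled_f32(x, y)      # survives, at the cost of extra divisions
reference = math.hypot(x, y)            # plain FP64 has no trouble here

h = f32(6.626e-34)                      # a small constant, still a normal FP32 value
underflowed = f32(h * h)                # about 4.4e-67, flushes to zero in FP32
```

The rescaled version is the kind of bookkeeping the comment calls cumbersome: it has to be repeated wherever an intermediate value might leave the FP32 range.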
    I do not know which kind of "matrix free" algorithms for FEA you are referring to.
    Nevertheless, the problem of any "matrix free" algorithm is exactly its poor scaling, because any "matrix free" algorithm must do similar amounts of computational operations and memory transfers. This limits the performance to that of the memory, which prevents scaling.
    The advantage of the algorithms based on matrices is exactly the better scaling, because only such algorithms can do more computational operations than memory transfers, so their scaling is no longer limited by the memory interface.
    For implementing matrix-matrix operations, the matrix units introduced initially by NVIDIA and then by AMD, Apple, Intel, and from next year also by Arm, are preferable, because they reduce even further the number of memory transfers that prevent scaling, in comparison with implementing the same matrix-matrix operations in vector units.
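
The scaling argument above can be put into numbers with a back-of-the-envelope roofline estimate. This is a sketch under idealized assumptions: each operand crosses the memory interface exactly once, elements are 8-byte FP64, and the peak and bandwidth figures are made up for illustration, not the specifications of any real device.

```python
def gemv_intensity(n, bytes_per_elem=8):
    """FLOPs per byte for y = A @ x with an n x n matrix."""
    flops = 2 * n * n                           # one multiply and one add per entry of A
    traffic = bytes_per_elem * (n * n + 2 * n)  # read A and x, write y
    return flops / traffic

def gemm_intensity(n, bytes_per_elem=8):
    """FLOPs per byte for C = A @ B with n x n matrices (ideal blocking)."""
    flops = 2 * n ** 3
    traffic = bytes_per_elem * 3 * n * n        # read A and B, write C, once each
    return flops / traffic

def attainable_flops(intensity, peak_flops, mem_bandwidth):
    """Roofline model: limited by compute or by memory traffic, whichever is lower."""
    return min(peak_flops, intensity * mem_bandwidth)

PEAK = 1e13    # hypothetical 10 TFLOP/s FP64 peak
BW = 1e12      # hypothetical 1 TB/s memory bandwidth

gemv_rate = attainable_flops(gemv_intensity(1000), PEAK, BW)   # memory-bound
gemm_rate = attainable_flops(gemm_intensity(1200), PEAK, BW)   # compute-bound
```

Matrix-vector work stays near 0.25 FLOP per byte regardless of size, while the matrix-matrix intensity grows linearly with n (n/12 under these assumptions), which is why only the latter can keep a fast arithmetic unit busy.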
    
    imtringued - 2 days ago
    
    Matrix free generally refers to using "X-vector product" operators, where X is something like the Jacobian or Hessian, but you do not materialize the final Jacobian or Hessian matrix. A big X operator is split into smaller X operators and you operate on the X operator by obtaining the X-vector products sequentially. This doesn't necessarily mean there are no matrices in the individual X-vector products. The smaller X operators could still be matrix vector products.
    In fact, one of the big benefits of splitting your big matrix into a series of small matrix vector products is that some of the matrix vector products are parameterized and some are not or at least they share the same parameters over multiple matrix vector products. This means you can perform matrix-matrix multiplication against some of the operators. This is particularly evident in batched training of neural networks.
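
A minimal sketch of such an operator, assuming a toy function g(x) = W2 · tanh(W1 · x) chosen only for illustration (the names `matvec`, `g` and `jvp` are hypothetical). The Jacobian of g is the product W2 · diag(1 − tanh²(W1·x)) · W1, but the code never forms it; it applies the three factors to the vector one after another.

```python
import math

def matvec(M, v):
    """Dense matrix-vector product for small list-of-lists matrices."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def g(W1, W2, x):
    """The toy function whose Jacobian we want to apply."""
    return matvec(W2, [math.tanh(z) for z in matvec(W1, x)])

def jvp(W1, W2, x, v):
    """Jacobian-vector product J(x) @ v without materializing J."""
    z = matvec(W1, x)
    dz = matvec(W1, v)                        # Jacobian of the first linear map, applied to v
    dh = [(1.0 - math.tanh(zi) ** 2) * dzi    # diagonal Jacobian of tanh
          for zi, dzi in zip(z, dz)]
    return matvec(W2, dh)                     # Jacobian of the second linear map

W1 = [[0.5, -1.0, 0.25], [1.5, 0.0, -0.75], [-0.5, 2.0, 1.0]]
W2 = [[1.0, -2.0, 0.5], [0.0, 1.0, 3.0]]
x = [0.1, -0.2, 0.3]
v = [1.0, 0.5, -1.0]
Jv = jvp(W1, W2, x, v)
```

If many vectors v are processed at once, `matvec(W1, v)` and `matvec(W2, dh)` become matrix-matrix products against the shared W1 and W2, which is the batched case the comment describes.
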
- adrian_b - 2 days ago
  
  False.
  While there are indeed parts of the workloads that must be executed in vector units, those parts are limited by the memory interface throughput, not by the computational throughput.
  Only the matrix-matrix operations are limited by the computational throughput, not by the memory throughput, and all matrix-matrix operations (this includes the solving of dense systems of equations, which is the most frequent kind of non-AI workload) are better done with dedicated matrix units, because the matrix units reduce the number of memory transfers that are required for performing matrix operations.
  - Archit3ch - 2 days ago
    
    > this includes the solving of dense systems of equations
    Is there even dedicated hardware for LU?
    
    adrian_b - 2 days ago
    
    There is no need for dedicated hardware for LU, because for big matrices LU can be reduced to matrix-matrix multiplications of smaller submatrices.
    LU for small matrices and most other operations with small matrices are normally better done in the vector units.
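
The reduction to matrix-matrix products can be sketched with a right-looking blocked LU. This is an illustrative version without pivoting, so it assumes a matrix for which that is safe (e.g. diagonally dominant); production libraries pivot. The names `matmul` and `blocked_lu` and the block size parameter `bs` are hypothetical.

```python
def matmul(A, B):
    """Plain dense product; stands in for the work a matrix unit would do."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def blocked_lu(A, bs):
    """Return LU packed in one matrix: unit-lower L below the diagonal, U on and above it."""
    n = len(A)
    LU = [row[:] for row in A]
    for k0 in range(0, n, bs):
        k1 = min(k0 + bs, n)
        # 1) Unblocked LU of the panel (columns k0..k1-1): small, vector-style work.
        for k in range(k0, k1):
            for i in range(k + 1, n):
                LU[i][k] /= LU[k][k]
                for j in range(k + 1, k1):
                    LU[i][j] -= LU[i][k] * LU[k][j]
        # 2) Triangular solve for the block row to the right: U12 = inv(L11) @ A12.
        for k in range(k0, k1):
            for i in range(k + 1, k1):
                for j in range(k1, n):
                    LU[i][j] -= LU[i][k] * LU[k][j]
        # 3) Trailing update A22 -= L21 @ U12: a single matrix-matrix product.
        L21 = [LU[i][k0:k1] for i in range(k1, n)]
        U12 = [LU[k][k1:n] for k in range(k0, k1)]
        P = matmul(L21, U12)
        for i in range(k1, n):
            for j in range(k1, n):
                LU[i][j] -= P[i - k1][j - k1]
    return LU
```

For an n x n matrix with block size b, step 3 accounts for the bulk of the arithmetic once n is much larger than b, while steps 1 and 2 touch only narrow panels.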
    
    imtringued - 2 days ago
    
    There is a mild lack of context here. If you have a single vector and want to solve LUx=b, you actually have matrix-vector multiplication. It's the batched LUX=B case, where X and B are matrices, where you need matrix-matrix multiplication.
    For those who don't know: one of the most useful properties of triangular matrices is that the diagonal blocks of a triangular matrix are themselves triangular. This means you can solve for a subset of x using the first triangular block. Since that sub-x vector is now known, you can do a forward multiplication of the off-diagonal (non-triangular) blocks below it with the sub-x vector and subtract the result from the b vector. This is the same as if you had removed the corresponding columns and rows from the triangular matrix. The remaining matrix stays triangular, which means you can just keep repeating this until the entire system is solved.
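
The block procedure described above can be sketched for a lower-triangular system L x = b. The function names and the block size parameter `bs` are hypothetical, and the code assumes a nonzero diagonal.

```python
def forward_sub(L, b):
    """Unblocked forward substitution for a small lower-triangular block."""
    x = []
    for i, row in enumerate(L):
        s = b[i] - sum(row[j] * x[j] for j in range(i))
        x.append(s / row[i])
    return x

def blocked_forward_sub(L, b, bs):
    """Solve L x = b block by block, as in the description above."""
    n = len(L)
    rhs = b[:]                       # working copy of b, updated as x becomes known
    x = [0.0] * n
    for k0 in range(0, n, bs):
        k1 = min(k0 + bs, n)
        # The diagonal block is itself lower triangular: solve for this slice of x.
        diag = [L[i][k0:k1] for i in range(k0, k1)]
        x[k0:k1] = forward_sub(diag, rhs[k0:k1])
        # Multiply the rectangular block below by the known sub-x and subtract from b.
        for i in range(k1, n):
            rhs[i] -= sum(L[i][j] * x[j] for j in range(k0, k1))
    return x
```

With a matrix of right-hand sides B instead of a single vector b, the subtraction step becomes a rectangular block times a block of X, i.e. the matrix-matrix product mentioned in the first paragraph.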

If AMD were serious they would show a fully-worked out GEMM, not just "here is our theoretical performance, this is the instruction to use".
