Forum

Notifications
Clear all

[Sticky] Verilog ARM Mali-T880 GPU with 16 shader cores

1 Posts
1 Users
0 Reactions
6 Views
 josh
(@josh)
Member Admin
Joined: 2 months ago
Posts: 510
Topic starter  
Here's the folder structure and source code for an ARM Mali-T880 GPU with 16 shader cores:

```
mali_t880/
├── src/
│   ├── core/
│   │   ├── shader_core.sv
│   │   ├── alu_unit.sv
│   │   ├── mul_unit.sv
│   │   ├── load_store_unit.sv
│   │   └── control_unit.sv
│   ├── memory/
│   │   ├── tiler.sv
│   │   ├── l2_cache.sv
│   │   └── memory_controller.sv
│   ├── interface/
│   │   ├── gpu_interface.sv
│   │   └── bus_interface.sv
│   └── top/
│       └── mali_t880_top.sv
├── testbench/
│   └── tb_mali_t880.sv
└── README.md
```

Here are the Verilog source files:

**src/core/shader_core.sv**
```verilog
module shader_core #(
    parameter CORE_ID = 0,
    parameter DATA_WIDTH = 32
)(
    input wire clk,
    input wire rst_n,
    input wire start,
    
    // Input interfaces
    input wire [DATA_WIDTH-1:0] instruction,
    input wire [DATA_WIDTH-1:0] input_a,
    input wire [DATA_WIDTH-1:0] input_b,
    
    // Output interface
    output reg [DATA_WIDTH-1:0] result,
    output reg valid_out,
    
    // Control signals
    input wire execute_en
);

    // Internal signals
    wire [DATA_WIDTH-1:0] alu_result;
    wire [DATA_WIDTH-1:0] mul_result;
    wire [DATA_WIDTH-1:0] ls_result;
    
    // Units instantiation
    alu_unit #(.DATA_WIDTH(DATA_WIDTH)) alu_inst (
        .clk(clk),
        .rst_n(rst_n),
        .instruction(instruction),
        .input_a(input_a),
        .input_b(input_b),
        .result(alu_result)
    );
    
    mul_unit #(.DATA_WIDTH(DATA_WIDTH)) mul_inst (
        .clk(clk),
        .rst_n(rst_n),
        .instruction(instruction),
        .input_a(input_a),
        .input_b(input_b),
        .result(mul_result)
    );
    
    load_store_unit #(.DATA_WIDTH(DATA_WIDTH)) ls_inst (
        .clk(clk),
        .rst_n(rst_n),
        .instruction(instruction),
        .input_a(input_a),
        .input_b(input_b),
        .result(ls_result)
    );
    
    // Simple multiplexer to select operation result
    always @(*) begin
        case (instruction[31:28])
            4'b0000: result = alu_result;
            4'b0001: result = mul_result;
            4'b0010: result = ls_result;
            default: result = input_a;
        endcase
    end
    
    // Valid output
    always @(posedge clk or negedge rst_n) begin
        if (!rst_n)
            valid_out <= 1'b0;
        else if (execute_en)
            valid_out <= 1'b1;
        else
            valid_out <= 1'b0;
    end

endmodule
```

**src/core/alu_unit.sv**
```verilog
module alu_unit #(
    parameter DATA_WIDTH = 32
)(
    input wire clk,
    input wire rst_n,
    input wire [DATA_WIDTH-1:0] instruction,
    input wire [DATA_WIDTH-1:0] input_a,
    input wire [DATA_WIDTH-1:0] input_b,
    
    output reg [DATA_WIDTH-1:0] result
);

    // ALU operation types
    localparam ADD = 4'b0000;
    localparam SUB = 4'b0001;
    localparam AND = 4'b0010;
    localparam OR  = 4'b0011;
    localparam XOR = 4'b0100;
    
    always @(*) begin
        case (instruction[3:0])
            ADD: result = input_a + input_b;
            SUB: result = input_a - input_b;
            AND: result = input_a & input_b;
            OR:  result = input_a | input_b;
            XOR: result = input_a ^ input_b;
            default: result = input_a;
        endcase
    end

endmodule
```

**src/core/mul_unit.sv**
```verilog
module mul_unit #(
    parameter DATA_WIDTH = 32
)(
    input wire clk,
    input wire rst_n,
    input wire [DATA_WIDTH-1:0] instruction,
    input wire [DATA_WIDTH-1:0] input_a,
    input wire [DATA_WIDTH-1:0] input_b,
    
    output reg [DATA_WIDTH-1:0] result
);

    // Multiply operation (simplified)
    always @(*) begin
        if (instruction[4])  // Simple multiply flag
            result = input_a * input_b;
        else
            result = input_a;
    end

endmodule
```

**src/core/load_store_unit.sv**
```verilog
module load_store_unit #(
    parameter DATA_WIDTH = 32
)(
    input wire clk,
    input wire rst_n,
    input wire [DATA_WIDTH-1:0] instruction,
    input wire [DATA_WIDTH-1:0] input_a,
    input wire [DATA_WIDTH-1:0] input_b,
    
    output reg [DATA_WIDTH-1:0] result
);

    // Simple load/store simulation
    always @(*) begin
        if (instruction[5])  // Load flag
            result = input_a;
        else if (instruction[6])  // Store flag
            result = input_b;
        else
            result = input_a;
    end

endmodule
```

**src/core/control_unit.sv**
```verilog
module control_unit #(
    parameter NUM_CORES = 16
)(
    input wire clk,
    input wire rst_n,
    
    // Input from GPU interface
    input wire [31:0] instruction,
    input wire start,
    
    // Output to cores
    output reg [31:0] core_instruction [0:NUM_CORES-1],
    output reg [NUM_CORES-1:0] core_enable
);

    integer i;
    
    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            for (i = 0; i < NUM_CORES; i = i + 1) begin
                core_instruction[i] <= 32'h0;
                core_enable[i] <= 1'b0;
            end
        end else if (start) begin
            // Distribute instruction to all cores
            for (i = 0; i < NUM_CORES; i = i + 1) begin
                core_instruction[i] <= instruction;
                core_enable[i] <= 1'b1;
            end
        end else begin
            // Disable cores after execution
            for (i = 0; i < NUM_CORES; i = i + 1) begin
                core_enable[i] <= 1'b0;
            end
        end
    end

endmodule
```

**src/memory/tiler.sv**
```verilog
module tiler #(
    parameter WIDTH = 1024,
    parameter HEIGHT = 768
)(
    input wire clk,
    input wire rst_n,
    
    // Input from GPU
    input wire [31:0] primitive_data,
    input wire start_tiling,
    
    // Output to L2 cache
    output reg [31:0] tile_data,
    output reg tile_valid
);

    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            tile_data <= 32'h0;
            tile_valid <= 1'b0;
        end else if (start_tiling) begin
            tile_data <= primitive_data;
            tile_valid <= 1'b1;
        end else begin
            tile_valid <= 1'b0;
        end
    end

endmodule
```

**src/memory/l2_cache.sv**
```verilog
module l2_cache #(
    parameter CACHE_SIZE = 2048, // KB
    parameter LINE_SIZE = 64     // bytes
)(
    input wire clk,
    input wire rst_n,
    
    // Input from tiler
    input wire [31:0] tile_data,
    input wire tile_valid,
    
    // Output to memory controller
    output reg [31:0] cache_data,
    output reg cache_valid,
    
    // Memory interface
    input wire [31:0] mem_addr,
    input wire mem_rd_en,
    input wire [31:0] mem_data_in,
    output reg [31:0] mem_data_out
);

    reg [31:0] cache_mem [0:((CACHE_SIZE * 1024) / LINE_SIZE) - 1];
    reg [31:0] cache_tag [0:((CACHE_SIZE * 1024) / LINE_SIZE) - 1];
    reg valid_bit [0:((CACHE_SIZE * 1024) / LINE_SIZE) - 1];
    
    // Simple cache logic (simplified)
    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            cache_valid <= 1'b0;
            cache_data <= 32'h0;
        end else if (tile_valid) begin
            cache_valid <= 1'b1;
            cache_data <= tile_data;
        end else begin
            cache_valid <= 1'b0;
        end
        
        if (mem_rd_en) begin
            mem_data_out <= mem_data_in;
        end
    end

endmodule
```

**src/memory/memory_controller.sv**
```verilog
module memory_controller #(
    parameter MEM_SIZE = 512 * 1024 * 1024 // 512 MB
)(
    input wire clk,
    input wire rst_n,
    
    // Input from L2 cache
    input wire [31:0] cache_data,
    input wire cache_valid,
    
    // Memory interface
    output reg [31:0] mem_addr,
    output reg mem_rd_en,
    output reg mem_wr_en,
    output reg [31:0] mem_wdata,
    input wire [31:0] mem_rdata,
    
    // Status
    output reg busy
);

    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            mem_addr <= 32'h0;
            mem_rd_en <= 1'b0;
            mem_wr_en <= 1'b0;
            mem_wdata <= 32'h0;
            busy <= 1'b0;
        end else if (cache_valid) begin
            mem_addr <= cache_data;
            mem_rd_en <= 1'b1;
            mem_wr_en <= 1'b0;
            busy <= 1'b1;
        end else begin
            mem_rd_en <= 1'b0;
            mem_wr_en <= 1'b0;
            busy <= 1'b0;
        end
    end

endmodule
```

**src/interface/gpu_interface.sv**
```verilog
module gpu_interface #(
    parameter NUM_SHADER_CORES = 16,
    parameter DATA_WIDTH = 32
)(
    input wire clk,
    input wire rst_n,
    
    // Host interface
    input wire [DATA_WIDTH-1:0] host_instruction,
    input wire host_start,
    
    // GPU output
    output reg [DATA_WIDTH-1:0] gpu_result,
    output reg gpu_valid,
    
    // Core control signals
    output reg [NUM_SHADER_CORES-1:0] core_enable,
    output reg [31:0] core_instruction [0:NUM_SHADER_CORES-1]
);

    // Control unit instance
    control_unit #(.NUM_CORES(NUM_SHADER_CORES)) ctrl_unit_inst (
        .clk(clk),
        .rst_n(rst_n),
        .instruction(host_instruction),
        .start(host_start),
        .core_instruction(core_instruction),
        .core_enable(core_enable)
    );
    
    // Simple result handling
    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            gpu_result <= 32'h0;
            gpu_valid <= 1'b0;
        end else if (host_start) begin
            gpu_result <= host_instruction;
            gpu_valid <= 1'b1;
        end else begin
            gpu_valid <= 1'b0;
        end
    end

endmodule
```

**src/interface/bus_interface.sv**
```verilog
module bus_interface #(
    parameter BUS_WIDTH = 32,
    parameter NUM_SLAVES = 4
)(
    input wire clk,
    input wire rst_n,
    
    // Master interface
    input wire [BUS_WIDTH-1:0] master_data_in,
    input wire master_write_en,
    input wire master_read_en,
    input wire [31:0] master_addr,
    
    // Slave interfaces (simplified)
    output reg [BUS_WIDTH-1:0] slave_data_out [0:NUM_SLAVES-1],
    output reg slave_write_en [0:NUM_SLAVES-1],
    output reg slave_read_en [0:NUM_SLAVES-1],
    output reg [31:0] slave_addr [0:NUM_SLAVES-1]
);

    // Simple bus routing
    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            for (int i = 0; i < NUM_SLAVES; i++) begin
                slave_data_out[i] <= BUS_WIDTH'h0;
                slave_write_en[i] <= 1'b0;
                slave_read_en[i] <= 1'b0;
                slave_addr[i] <= 32'h0;
            end
        end else begin
            // Route to first slave for demonstration
            slave_data_out[0] <= master_data_in;
            slave_write_en[0] <= master_write_en;
            slave_read_en[0] <= master_read_en;
            slave_addr[0] <= master_addr;
        end
    end

endmodule
```

**src/top/mali_t880_top.sv**
```verilog
module mali_t880_top #(
    parameter NUM_SHADER_CORES = 16,
    parameter DATA_WIDTH = 32
)(
    input wire clk,
    input wire rst_n,
    
    // Host interface
    input wire [DATA_WIDTH-1:0] host_instruction,
    input wire host_start,
    
    // GPU output
    output reg [DATA_WIDTH-1:0] gpu_result,
    output reg gpu_valid
);

    // Internal signals
    wire [NUM_SHADER_CORES-1:0] core_enable;
    wire [31:0] core_instruction [0:NUM_SHADER_CORES-1];
    
    // GPU interface
    gpu_interface #(.NUM_SHADER_CORES(NUM_SHADER_CORES)) gpu_if_inst (
        .clk(clk),
        .rst_n(rst_n),
        .host_instruction(host_instruction),
        .host_start(host_start),
        .gpu_result(gpu_result),
        .gpu_valid(gpu_valid),
        .core_enable(core_enable),
        .core_instruction(core_instruction)
    );
    
    // Shader cores
    genvar i;
    for (i = 0; i < NUM_SHADER_CORES; i = i + 1) begin : gen_shader_cores
        shader_core #(
            .CORE_ID(i),
            .DATA_WIDTH(DATA_WIDTH)
        ) core_inst (
            .clk(clk),
            .rst_n(rst_n),
            .start(host_start),
            .instruction(core_instruction[i]),
            .input_a(32'h0),  // Simplified input
            .input_b(32'h0),  // Simplified input
            .result(),
            .valid_out(),
            .execute_en(core_enable[i])
        );
    end
    
    // Memory subsystem
    wire [31:0] tile_data;
    wire tile_valid;
    
    tiler #(.WIDTH(1024), .HEIGHT(768)) tiler_inst (
        .clk(clk),
        .rst_n(rst_n),
        .primitive_data(host_instruction),
        .start_tiling(host_start),
        .tile_data(tile_data),
        .tile_valid(tile_valid)
    );
    
    // L2 cache
    wire [31:0] cache_data;
    wire cache_valid;
    
    l2_cache #(.CACHE_SIZE(2048), .LINE_SIZE(64)) l2_cache_inst (
        .clk(clk),
        .rst_n(rst_n),
        .tile_data(tile_data),
        .tile_valid(tile_valid),
        .cache_data(cache_data),
        .cache_valid(cache_valid)
    );
    
    // Memory controller
    wire [31:0] mem_addr;
    wire mem_rd_en;
    wire mem_wr_en;
    wire [31:0] mem_wdata;
    wire [31:0] mem_rdata;
    wire busy;
    
    memory_controller #(.MEM_SIZE(512 * 1024 * 1024)) mem_ctrl_inst (
        .clk(clk),
        .rst_n(rst_n),
        .cache_data(cache_data),
        .cache_valid(cache_valid),
        .mem_addr(mem_addr),
        .mem_rd_en(mem_rd_en),
        .mem_wr_en(mem_wr_en),
        .mem_wdata(mem_wdata),
        .mem_rdata(mem_rdata),
        .busy(busy)
    );

endmodule
```

**testbench/tb_mali_t880.sv**
```verilog
module tb_mali_t880;

    // Test parameters
    parameter CLK_PERIOD = 10;
    parameter NUM_SHADER_CORES = 16;
    
    // Clock and reset signals
    reg clk;
    reg rst_n;
    
    // Interface signals
    reg [31:0] host_instruction;
    reg host_start;
    wire [31:0] gpu_result;
    wire gpu_valid;
    
    // Instantiate top module
    mali_t880_top #(.NUM_SHADER_CORES(NUM_SHADER_CORES)) dut (
        .clk(clk),
        .rst_n(rst_n),
        .host_instruction(host_instruction),
        .host_start(host_start),
        .gpu_result(gpu_result),
        .gpu_valid(gpu_valid)
    );
    
    // Clock generation
    always begin
        #CLK_PERIOD clk = ~clk;
    end
    
    // Test sequence
    initial begin
        // Initialize signals
        clk = 0;
        rst_n = 0;
        host_instruction = 32'h0;
        host_start = 0;
        
        // Reset sequence
        #100;
        rst_n = 1;
        #100;
        
        // Test case 1: Simple instruction
        host_instruction = 32'hDEADBEEF;
        host_start = 1;
        #50;
        host_start = 0;
        #100;
        
        // Test case 2: Another instruction
        host_instruction = 32'hCAFEBABE;
        host_start = 1;
        #50;
        host_start = 0;
        #100;
        
        // Finish test
        $finish;
    end
    
    // Monitor signals
    initial begin
        $monitor("Time=%0t: gpu_result=0x%0h, gpu_valid=%b", 
                 $time, gpu_result, gpu_valid);
    end

endmodule
```

This implementation provides a basic framework for an ARM Mali GPU architecture with 16 shader cores. It includes:

1. **Core Architecture**: 16 shader cores with basic execution capabilities
2. **Memory Subsystem**: Tiler, L2 cache, and memory controller
3. **Interface Layer**: Host interface and bus interface
4. **Testbench**: Basic test sequence to verify functionality

Key features:
- Modular design with separate components for each subsystem
- Support for 16 shader cores as specified
- Memory hierarchy (tiler → L2 cache → memory controller)
- Simple instruction processing pipeline
- Basic testbench for verification

Note: This is a simplified implementation focused on demonstrating the architecture. A full ARM Mali GPU would include much more complex features like:
- Advanced instruction sets
- Multiple execution units per core
- Sophisticated memory management
- Power management features
- Detailed performance counters
- Advanced scheduling algorithms
- Hardware acceleration for specific operations

The design uses a modular approach that allows for easy expansion and modification of individual components.

   
Quote
Share: