
Graphics API

API stands for Application Programming Interface: a set of conventions / standards that computer engineers have come up with for software to be written against. We need to pick sides here.

Choosing a graphics API to base our software upon is one of the most fundamental design decisions we are going to make. For all practical purposes (read: sunk man-month reasons), once we choose an API we will be “stuck” with it forever. This is one of the topics where I intentionally choose performance over development velocity. We could speed up software development by choosing ready-built engines or toolkits such as the open source Dear ImGui, Godot, Qt etc. Though such “engines” isolate the software from the underlying APIs, we may get constrained by the engine itself at some point in the future. We rule out proprietary engines such as Unity and Unreal Engine for political reasons! Fun fact: this attitude is sometimes called NIH Syndrome, i.e. Not-Invented-Here Syndrome. ;) So, coming back to lower level APIs, we have a limited set of APIs on each operating system.

On Windows, we have DirectX 9 / 10 / 11 / 12, OpenGL and Vulkan. OpenGL is no longer being developed (its last release, 4.6, dates from 2017) and newer graphics features such as ray tracing aren’t supported by it. Vulkan is generally a 2nd class citizen on Windows compared to DirectX. Hence we choose the most modern flavor, DirectX12. Remember, DirectX12 itself was announced in 2014 and shipped with Windows 10 in 2015, so setting it as a baseline requirement for our software is a reasonable decision. DirectX12 is therefore our ONLY graphics API for the Windows operating system. We support both Windows 10 and 11 for now (2025). This covers perhaps 90% of our target users worldwide. We also presume support for Resource Heap Tier 2 (D3D12_RESOURCE_HEAP_TIER_2) in DirectX12. Note: Heap Tier 2 hardware started appearing around 2015/2016. Which Shader Model level? To be figured out. If you are feeling hyped to get deep down, read the 1st (of 4) tutorial on DirectX12 here. It is ~100 pages!

The operating system with the next largest market share is macOS on Apple devices. In the Apple world, the Metal APIs are the only recommended (non-deprecated) APIs, hence we go with Metal. Vulkan also works through a translation layer such as MoltenVK. Still, for performance and 1st party support, we choose the Metal API. The Mac graphics / Metal code shall also be partially reusable on iPhone / iPad devices, since they also have Metal as the preferred API.

Next up is the Linux (Ubuntu) operating system. This being an open source operating system, the open standard Vulkan is preferred here. We want our software to be available even on free operating systems. Hence we must have a Vulkan based UI as well. Another reason for keeping this Vulkan interface is the overlap with the Android mobile operating system. For Android phones, we have only 2 options: the ageing OpenGL ES or the modern Vulkan. Hence we choose Vulkan, in a version released within the last 10 years, i.e. Vulkan 1.1.

The above 3 APIs are for desktop applications. Next up is the browser based engine. Here the upcoming (as of 2025) API named WebGPU is the chosen one. It is supported by all major web browser vendors, i.e. Google Chrome, Apple Safari and Mozilla Firefox.

Having made the above decisions, we have to be realistic about our core-engineering-degree-holder software developers. We can’t expect developers from a chemical / civil / electrical / instrumentation / mechanical background to be familiar with such deep computer science concepts. Hence we structure our code as a sort of mini-engine (NIH?), where adding a new UI element doesn’t involve fiddling deep down in the graphics APIs. This will be sorted out progressively as our software matures.

Our software installer will verify that all the relevant APIs are present on the system before installation. This way, inside the application, we don’t check every time whether a particular feature is supported by the available hardware, unless the initially installed hardware itself changes. By default this check shouldn’t take more than a few microseconds during application startup.

More graphics design decisions are specified in our source code!

// Copyright (c) 2025-Present : Ram Shanker: All rights reserved.

/*
Windows Desktop C++ DirectX12 application for CAD / CAM use.
This file is our Architecture. Primitive data structures common to all platforms may be added here.

At startup, pick the GPU with the highest VRAM. All rendering happens there only. Only 1 device is supported for rendering.
However, the OS may send the display frame to a monitor connected to another / integrated GPU.

VERTEX DETAILS:
VertexLayout common to all geometry:
3x4 Bytes for Position, 4 Bytes for Normal, 4 Bytes for Color RGBA / 8 Bytes if HDR Monitor present.
    = 20 / 24 Bytes per vertex.
Anyway, go with the 24 Bytes format ONLY. Tone mapping (HDR -> SDR) should happen in the Pixel Shader.
Initial development will be on R8G8B8A8; later, when we implement HDR, we will upgrade this.
Some hardware may not support HDR, so keep both versions of the shaders.
Further, whether to load HDR or SDR shaders is decided at application startup.
If the graphics card supports HDR and there is at least 1 monitor present with HDR capability, switch to HDR.
Once HDR is ON, the application keeps the HDR shaders even if the HDR monitor disconnects, until the app closes.

Initially Hemispheric Ambient Lighting:
Factor = (Normal.z * 0.5) + 0.5
AmbientLight = Lerp(GroundColor, SkyColor, Factor)
Screen Space Ambient Occlusion (SSAO) to darken creases and corners in a future revision.

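The hemispheric ambient formula above can be sketched on the CPU as follows. This is only an illustration in plain C++ (the real computation would live in the HLSL shader); the `Rgb` type and the function names are hypothetical and not part of the engine.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical RGB colour type, used only for this sketch.
struct Rgb { float r, g, b; };

// Same definition as HLSL lerp(a, b, t): a + (b - a) * t.
inline float Lerp(float a, float b, float t) { return a + (b - a) * t; }

// normalZ is the Z component of a unit normal in a Z-up world.
inline Rgb HemisphericAmbient(const Rgb& ground, const Rgb& sky, float normalZ) {
    assert(normalZ >= -1.0f && normalZ <= 1.0f);
    const float factor = normalZ * 0.5f + 0.5f;  // maps -1..+1 to 0..1
    return { Lerp(ground.r, sky.r, factor),
             Lerp(ground.g, sky.g, factor),
             Lerp(ground.b, sky.b, factor) };
}
```

A face pointing straight down receives the pure ground colour, a face pointing straight up receives the pure sky colour, and a vertical wall receives the midpoint of the two.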
All vertices are positioned in object-local space. The world matrix is applied in the vertex shader.
This enables moving even a 1000-vertex object with just a 48-byte world matrix update per object.
We use a packed 48-byte world matrix instead of 64 bytes to save bandwidth.
Since the last row is always 0,0,0,1, we can omit it. In the shader, we reconstruct the last row.

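A minimal sketch of the 48-byte packing, assuming a row-major matrix in the column-vector convention (translation in the last column), which is the layout where the last row is 0,0,0,1. With the row-vector convention that DirectXMath uses by default, the matrix would need a transpose first. The type and function names are hypothetical.

```cpp
#include <array>
#include <cassert>

// Hypothetical CPU-side 4x4 matrix, indexed as m[row][column].
using Matrix4x4 = std::array<std::array<float, 4>, 4>;

// Only the first 3 rows are stored: 3 rows x 4 floats = 48 bytes.
struct PackedWorldMatrix {
    float rows[3][4];
};
static_assert(sizeof(PackedWorldMatrix) == 48, "3 rows x 4 floats must be 48 bytes");

inline PackedWorldMatrix PackWorldMatrix(const Matrix4x4& m) {
    // Packing is only lossless for affine matrices whose last row is 0,0,0,1.
    assert(m[3][0] == 0.0f && m[3][1] == 0.0f && m[3][2] == 0.0f && m[3][3] == 1.0f);
    PackedWorldMatrix p{};
    for (int r = 0; r < 3; ++r)
        for (int c = 0; c < 4; ++c)
            p.rows[r][c] = m[r][c];
    return p;
}

// Mirrors what the shader would do: copy 3 rows, then append 0,0,0,1.
inline Matrix4x4 UnpackWorldMatrix(const PackedWorldMatrix& p) {
    Matrix4x4 m{};
    for (int r = 0; r < 3; ++r)
        for (int c = 0; c < 4; ++c)
            m[r][c] = p.rows[r][c];
    m[3] = { 0.0f, 0.0f, 0.0f, 1.0f };
    return m;
}
```

The round trip is exact because no arithmetic is performed on the stored values; the saving is 16 bytes per object per update.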
Separate render threads (1 per monitor) and a single Copy thread. The Copy thread is the ringmaster of VRAM!
Separate render threads per monitor are in VSync with each monitor's unique refresh rate.
Hence there is a separate render queue per monitor.

We use the ExecuteIndirect command with a start vertex location instead of DrawIndexedInstanced per object.
I want per-tab VRAM isolation; each tab will be completely separate,
except for the uncloseable tab 0, which stores common textures and UI elements.

To support 100s of simultaneous tabs, we start with a small heap, say 4MB per tab, and grow the heap size only when necessary,
instead of allocating 1 giant 256MB buffer.
Each page could be a mixture of various geometry types, say Cylinders, Cubes, I beams etc.
Don't manually destroy heaps on tab switch. Use Evict.
It allows the OS to handle the caching.
If the user clicks back to a heavy tab, MakeResident is faster than re-creating heaps. Tab 0 is always resident.
Eviction happens with a time lag of a few seconds.
An advanced eviction strategy based on the system memory budget will follow after the rest of the spec is implemented.

Each page will be accompanied by a corresponding ExecuteIndirect argument buffer.
Each tab will also have its dedicated World Matrix buffer.
When we defragment a page, we must simultaneously rebuild its corresponding Argument Buffer.

LOCK FREE VRAM MANAGEMENT:
We will now be using the ExecuteIndirect command + versioned geometry pages. Page max size 4 MB (initially).
On geometry modify (Add / Modify / Delete):
If the amount of new geometry (+ filled up last active page) is more than the 4 MB page threshold, create new pages.
Do not touch existing ones. And then publish. Otherwise:
Allocate a new page via the Copy Queue. The Copy Queue performs a READ-ONLY operation (allowed) on the existing page to create a newPage.
This newPage is not yet published for rendering.
DirectQueue0, DirectQueue1, DirectQueue2 and so on can keep rendering as usual. Leave oldPage in the COMMON state permanently.
Never explicitly transition it to VERTEX.
Render / CopySource are both allowed on their respective queues by implicit promotion from the COMMON state.
CopyQueue: Finalize this VRAM-copied newPage as required. Upload the delta. For additions, just add;
for modify / delete, if page free space < threshold → rebuild / defragment the page.
Publish by swapping the pointer atomically. Once all render threads have passed a fence, retire the old page by releasing its buffers.

Geometry will NOT be stored in CPU RAM once it is uploaded to VRAM, for memory efficiency reasons,
keeping in mind the memory scarcity on iGPU systems!
It is generated on demand by the engineering thread and simply handed over to the copy queue.
However, to be able to defragment, the copy queue stores the Byte/Index ranges of all objects loaded into a Page.

The copy queue prepares the newPage (VertexBuffer, IndexBuffer, ExecuteIndirect buffers) and uploads it to VRAM.
This (PCIe transfer) happens in parallel while the other render threads are already running.
So, iteration over all objects has been removed altogether from the engine.
Further, there are 2 levels of batching. The engineering thread will batch the changes together to some extent,
and the copy thread will also batch the changes by emptying the submission list.

There will be multiple render threads running at different VSync refresh rates (say 1 at 60 Hz, 1 at 144 Hz, 1 at 30 Hz).
Each monitor has its own render thread and command queue / allocator / command list.

Geometry pages:
• Created in COMMON
• Never explicitly transitioned
• Only used in read-only states
• Copied from (COPY_SOURCE)
• Copied into (COPY_DEST) only before publishing. Once published, there is no write operation on it.
• Drawn from (VERTEX / INDEX / INDIRECT)

Our strict invariants:
• Geometry pages are immutable after publish.
• No explicit state transitions for page buffers.
• All page swaps are atomic.
• Old pages are destroyed only after all queues retire.

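The publish / retire cycle described above can be sketched with standard C++ atomics. This is a simplified, single-writer illustration: `PageSketch` stands in for a real geometry page, and `completedFence` is assumed to be the lowest fence value that every render queue has already passed. None of these names are engine types.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

struct PageSketch { uint32_t version = 0; };  // hypothetical stand-in for a geometry page

class PagePublisher {
public:
    ~PagePublisher() { delete active_.load(); }

    // Copy thread only: atomically swap in a fully built page.
    // The previous page is kept alive until fenceToWaitFor has been passed.
    void Publish(std::unique_ptr<PageSketch> newPage, uint64_t fenceToWaitFor) {
        PageSketch* old = active_.exchange(newPage.release(), std::memory_order_acq_rel);
        if (old) retired_.push_back({ std::unique_ptr<PageSketch>(old), fenceToWaitFor });
    }

    // Render threads: observe the current page; published pages are never written.
    const PageSketch* Acquire() const { return active_.load(std::memory_order_acquire); }

    // Copy thread only: destroy retired pages whose fence has been passed.
    // Returns how many pages were destroyed.
    std::size_t Retire(uint64_t completedFence) {
        const std::size_t before = retired_.size();
        retired_.erase(std::remove_if(retired_.begin(), retired_.end(),
                           [completedFence](const Retired& r) { return r.fence <= completedFence; }),
                       retired_.end());
        return before - retired_.size();
    }

private:
    struct Retired { std::unique_ptr<PageSketch> page; uint64_t fence; };
    std::atomic<PageSketch*> active_{ nullptr };
    std::vector<Retired> retired_;  // touched by the copy thread only, so no lock
};
```

Render threads never block here: they either see the old page or the new one, and the old page stays valid until the fence check in `Retire` allows its destruction.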
There will be multiple views per tab.
Each View will maintain a pair (double buffered) of ExecuteIndirect command buffers.
When an object is deleted, the copy thread receives a command from the engineering thread.
The copy thread then updates the next double buffer and records the hole in the Vertex/Index buffer.
Except for currently filling head buffer,

Maintain a Free-List Allocator (e.g., a Segregated Free List) on the CPU. Per Tab.
The Allocator knows: "I have a 12KB middle gap in Page 3, and a 40KB middle gap in Page 8."
When a 10KB request comes in, the Allocator immediately returns "Page 3". No iterating through Page objects.
If the free list says none of the existing pages can accommodate the new geometry, then create a new heap / placed resource buffer.
The free list does not track internal holes created by deleting objects,
only the middle empty space. Aggregate holes are tracked per page and defragmented occasionally.

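One possible shape for such a lookup is sketched below. Note that it swaps the segregated free list for an ordered multimap keyed by gap size, which gives a best-fit answer in O(log n) without visiting page objects; the class and method names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>

// Tracks only the middle gap of each page, as described above.
class MiddleGapIndex {
public:
    // Record (or update) the size of the middle gap of a page.
    void SetGap(uint32_t pageIndex, uint32_t gapBytes) {
        Remove(pageIndex);
        byPage_[pageIndex] = byGap_.emplace(gapBytes, pageIndex);
    }

    // Smallest gap that can still hold the request, or nullopt if a new page is needed.
    std::optional<uint32_t> FindPage(uint32_t requestBytes) const {
        const auto it = byGap_.lower_bound(requestBytes);
        if (it == byGap_.end()) return std::nullopt;
        return it->second;
    }

private:
    using GapMap = std::multimap<uint32_t, uint32_t>;  // gap size in bytes -> page index

    void Remove(uint32_t pageIndex) {
        const auto found = byPage_.find(pageIndex);
        if (found == byPage_.end()) return;
        byGap_.erase(found->second);
        byPage_.erase(found);
    }

    GapMap byGap_;
    std::map<uint32_t, GapMap::iterator> byPage_;  // page index -> its entry in byGap_
};
```

With the numbers from the text (12KB gap in Page 3, 40KB gap in Page 8), a 10KB request resolves to Page 3 and a 20KB request to Page 8; a request larger than every gap returns nothing, which is the signal to create a new page.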
When a buffer gets >25% holes, we create a new defragmented buffer and, once it is complete, switch over to the new buffer
for new geometry additions. A maximum of 1 buffer is defragmented at a time (between 2 frames). Since the max page size is 64MB,
this will not produce a high latency stall while running asynchronously on the copy thread.

The Root Signature puts the "Constants" (View/Proj matrix) in root constants or a very fast descriptor table,
as these don't change between pages. Only the VBV/IBV and the EI Argument Buffer change per batch/page.

OBJECT REPRESENTATION:
Here is the realistic "Worst Case" Hierarchy for a CAD Frame:
• Index Depth: 16-bit vs 32-bit (Hardware Requirement). Examples: Nuts/Bolts (16) vs Engine Blocks (32).
• Transparency: Opaque vs Transparent (Sorting Requirement). Transparent objects must be drawn last for alpha blending.
• Topology: Triangles (Solid) vs Lines (Wireframe) (PSO Requirement).
    We cannot draw lines and triangles in the same call.
• Culling: Single-Sided vs Double-Sided (PSO Requirement). Sheet metal vs Solids.
    Since sectioning is a common use case, perhaps we could have all geometry double sided. To be ascertained later.
• Buffer Pages (N): How many 256MB blocks you are using.
Total Unique Batches = 2 x 2 x 2 x 2 x N = 16 x N
This will ensure no pipeline state reset while rendering a single Page. One ExecuteIndirect call for every Page.

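The 2 x 2 x 2 x 2 x N count can be made concrete with a 4-bit pipeline key, one bit per binary choice listed above. This is a hypothetical sketch; `BatchTraits` and the bit assignment are illustrative and not taken from the engine.

```cpp
#include <cassert>
#include <cstdint>

// One flag per binary choice in the worst-case hierarchy.
struct BatchTraits {
    bool index32;      // 32-bit indices instead of 16-bit
    bool transparent;  // needs alpha blending, drawn last
    bool lines;        // line topology instead of triangles
    bool doubleSided;  // back-face culling disabled
};

constexpr uint32_t kPipelineVariants = 2 * 2 * 2 * 2;  // 16

// Packs the four flags into a key in the range 0..15.
constexpr uint32_t PipelineKey(const BatchTraits& t) {
    return (t.index32 ? 1u : 0u) | (t.transparent ? 2u : 0u) |
           (t.lines ? 4u : 0u) | (t.doubleSided ? 8u : 0u);
}

// Upper bound on ExecuteIndirect batches when every variant appears on every page.
constexpr uint32_t WorstCaseBatches(uint32_t pageCount) {
    return kPipelineVariants * pageCount;
}
```

Grouping pages by this key means the pipeline state is set once per key value and every page with that key is drawn without a state change in between.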
To be clarified later: How do we handle repeated geometry? Say bolts.
They need only one set of vertex/index buffers. We can draw them with different world matrices.

NORMALS:
The industry standard solution for Normals is not 16-bit floats, but Packed 10-bit Integers.
We use the format: DXGI_FORMAT_R10G10B10A2_UNORM.
X: 10 bits (0 to 1023), Y: 10 bits (0 to 1023), Z: 10 bits (0 to 1023), Padding: 2 bits (unused)
Total: 32 bits (4 Bytes). Why this is perfect for Normals:
Size: It is 3x smaller than a 12-byte normal (4 bytes vs 12 bytes). Precision: 10 bits gives us 2^10 = 1024 steps.
Since normals are always between -1.0 and 1.0, this gives a precision of roughly 0.002.
This is visually indistinguishable from 32-bit floats for lighting, even in high-end CAD.
Vertex Shader Normalization: Normal = Input.Normal * 2.0 - 1.0.

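A CPU-side sketch of the encode / decode pair for this format. It assumes the usual DXGI bit layout with R (here X) in the lowest 10 bits; the function names are hypothetical. The decode mirrors the vertex shader line above.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Maps one component from -1..+1 to an unsigned 10-bit value 0..1023.
inline uint32_t ToUnorm10(float v) {
    const float clamped = std::min(1.0f, std::max(-1.0f, v));
    return static_cast<uint32_t>(std::lround((clamped * 0.5f + 0.5f) * 1023.0f));
}

// X in bits 0-9, Y in bits 10-19, Z in bits 20-29, top 2 bits left at zero.
inline uint32_t PackNormal(float x, float y, float z) {
    return ToUnorm10(x) | (ToUnorm10(y) << 10) | (ToUnorm10(z) << 20);
}

// Same arithmetic as the shader: UNORM value in 0..1, then * 2.0 - 1.0.
inline float FromUnorm10(uint32_t bits) {
    return static_cast<float>(bits & 0x3FFu) / 1023.0f * 2.0f - 1.0f;
}

inline void UnpackNormal(uint32_t packed, float& x, float& y, float& z) {
    x = FromUnorm10(packed);
    y = FromUnorm10(packed >> 10);
    z = FromUnorm10(packed >> 20);
}
```

One step is 2 / 1023, roughly 0.002, which matches the precision quoted above. Exactly 0.0 is not representable in this encoding, so the shader would normally re-normalize the decoded vector before lighting.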
PAGE STRUCTURE:
Vertex and Index buffer in the same Page: a superior architectural choice for three reasons:
Halves the Allocation Overhead: We only manage 1 heap/resource per 4MB page instead of 2.
Cache Locality: When the GPU fetches a mesh, the vertices and indices are physically close in VRAM (same memory page).
This can slightly improve cache hit rates.
Vertices start at Offset 0 and grow UP. Indices start at Offset Max (4MB) and grow DOWN.
Free Space is always the gap in the middle. The Page is Full when Vertex_Head_Ptr meets or crosses Index_Tail_Ptr.
64 Bytes mandatory gap in the middle to address alignment concerns.

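The double-ended layout can be sketched as a small allocator. This is a simplified illustration that ignores per-allocation alignment and keeps only the 64-byte middle gap; the struct and member names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

struct DoubleEndedPageSketch {
    static constexpr uint32_t kMiddleGap = 64;  // mandatory gap between the two regions

    uint32_t pageSize;
    uint32_t vertexHead = 0;  // grows up from offset 0
    uint32_t indexTail;       // grows down from pageSize

    explicit DoubleEndedPageSketch(uint32_t size) : pageSize(size), indexTail(size) {}

    struct Placement { uint32_t vertexOffset; uint32_t indexOffset; };

    // Places one object, or returns nullopt if vertices and indices do not BOTH fit.
    std::optional<Placement> Allocate(uint32_t vertexBytes, uint32_t indexBytes) {
        if (indexBytes > indexTail) return std::nullopt;  // would wrap below zero
        const uint32_t newTail = indexTail - indexBytes;
        const uint64_t newHead = static_cast<uint64_t>(vertexHead) + vertexBytes;
        if (newHead + kMiddleGap > newTail) return std::nullopt;  // regions would meet
        const Placement placement{ vertexHead, newTail };
        vertexHead = static_cast<uint32_t>(newHead);
        indexTail = newTail;
        return placement;
    }
};
```

Because both regions draw from the same middle gap, a page can hold vertex-heavy or index-heavy geometry without a fixed split between the two.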
Lazy Creation.
When a user creates a new Tab, allocated memory = 0 MB.
User draws a Bolt (Solid): Allocate Solid_Page_0 (4MB).
User draws a Glass Window: Allocate Transparent_Page_0 (4MB).
User never draws a Wireframe: Wireframe_Page remains null.

Resource state is combined, i.e. D3D12_RESOURCE_STATE_VERTEX_AND_CONSTANT_BUFFER | D3D12_RESOURCE_STATE_INDEX_BUFFER
Feature          Decision             Benefit
Page Content     Single Type Only     Zero PSO switching during Draw.
Growth Logic     Chained Doubling     4->8->16->32->64. No moving old data.
Max Page Size    64 MB                Prevents fragmentation failure on low-VRAM GPUs.
Allocation       Lazy (On Demand)     Keeps "Hello World" tabs lightweight.
Sub-Allocation   Double-Ended Stack   Maximizes usage for varying ratio of Vertex/Index Buffers.

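One reading of the "Chained Doubling" row is that each new page appended to a chain is twice the size of the previous one, capped at the 64 MB maximum, so existing pages never move. That interpretation is an assumption, and the names below are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kFirstPageMB = 4;
constexpr uint32_t kMaxPageMB = 64;

// Size in MB of the next page, given how many pages the chain already holds.
// Chain sizes run 4, 8, 16, 32, 64, 64, 64, ...
constexpr uint32_t NextPageSizeMB(uint32_t pagesAlreadyInChain) {
    return pagesAlreadyInChain >= 4 ? kMaxPageMB
                                    : (kFirstPageMB << pagesAlreadyInChain);
}
```

Under this reading, a tab that grows to five pages holds 4 + 8 + 16 + 32 + 64 = 124 MB, while a "Hello World" tab never pays for more than its first 4 MB page.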
New geometry is appended (in the middle) only if both the new vertex and index data fit inside.
Otherwise allocate a new buffer. The copy thread also does batching.
It aggregates all objects (that fit in the current buffer) coming from the engineering thread into a single GPU upload.
The Copy Thread should consume batches of updates,
coalescing them into single ExecuteCommandList calls where possible to reduce API overhead.

"Big Buffer" fallback. If Allocation_Size > Max_Page_Size,
allocate a dedicated Committed Resource just for that object, bypassing the paging system.
This handles a large STL or terrain map. Treat "Big Buffers" as a special Page Type. Add a "Large Object List" to your loop.
Do not try to jam them into the standard EI logic if they require unique resource bindings per object.
1 separate draw command for such Jumbo objects.

Create a separate std::vector<BigObject> in the Tab structure. Rendering:
Loop through Pages (ExecuteIndirect).
Loop through BigObjects (Standard DrawIndexedInstanced or EI with count 1).

Defragmentation Logic:
The copy queue marks the page for defragmentation. All frames of that tab freeze. Keep presenting the previous render output.
Any 1 of the rendering threads/queues reads the mark, transitions the resource to Common and signals a fence.
The copy queue picks it up and, once defragmented, returns the new resource.
I am willing to accept a freeze of a few frames on screen.
This is a recognised engineering tradeoff, acceptable to CAD users.

EI Argument Buffers are tightly coupled to the Memory Pages.
When we defragment a Page, we must simultaneously rebuild its corresponding Argument Buffer.
Do not try to "patch" the Argument buffer; regenerate it for that Page.

Growth Logic: Similar to the above defragmentation. How does my copy queue handle the async (without blocking the render thread?)
addition of 1 small geometry, say 10KB, to an already existing 64MB heap out of which 50MB is filled up?
All Views/frames of that particular tab freeze. However, other tabs being handled by the render thread keep processing.
No thread stall. Transition that page to copy destination. Copy the new data.
Transition back to the render state for the render thread to pick up.

FREEZE LOGIC:
RenderToTexture to implement the frame freeze, since the swap chain is FLIP_DISCARD.
Side benefits? HDR handling. UI composition. Multi-monitor flexibility. Eviction safety. Clean defrag freezes.

Known Issues / Limitations (to be resolved in a later revision):
Transparency sorting: accepting imperfect sorting for "Glass" pages during rotation,
    and doing a CPU Sort + Args Rebuild only when the camera stops.
Hot page for object drag / active mutation.
Evict logic.
Compute shader frustum culling.
Telemetry. Per-tab VRAM usage graphs. Page fragmentation heatmap. Eviction frequency counters.
    Copy queue stall tracking.
Selection Highlighter methodology.
Mesh Shader on supported hardware (RTX2000 onwards, RX6000 onwards).
Instance based LOD optimization. Optionally using a compute shader.

Miscellaneous Specification:
There will be a uniform object ID (64 bit), unique across all objects across the entire process memory.
Each object can have up to 16? different simultaneous variations of vertex geometry / graphics representation.
We are expecting 1000 to 5000 draw calls per frame?
How should I handle multiple partially overlapping windows?
Each window can be independently resized or maximized / minimized.
The lowest distance between the object and ALL the different view camera positions shall be used by the logic threads
    to decide the Level of Detail.
It will have some mechanism to manage memory over-pressure,
to signal the logic threads to reduce the level of detail within some distance.
Our GPU Memory manager will be a singleton. There will be only 1 instance of that class managing the entire GPU memory.

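The "lowest distance across all view cameras" rule for Level of Detail can be sketched as below. The `Vec3` type, the function names and the two distance thresholds are all hypothetical placeholders; the spec does not fix any thresholds yet.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

struct Vec3 { float x, y, z; };  // hypothetical minimal vector type

inline float Distance(const Vec3& a, const Vec3& b) {
    const float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}

// Smallest distance from the object to any camera of any view.
// Returns +infinity when there are no cameras.
inline float NearestCameraDistance(const Vec3& objectCenter, const std::vector<Vec3>& cameras) {
    float nearest = std::numeric_limits<float>::infinity();
    for (const Vec3& camera : cameras)
        nearest = std::min(nearest, Distance(objectCenter, camera));
    return nearest;
}

// Placeholder thresholds: 0 = full detail, 2 = coarsest.
inline int SelectLod(float nearestDistance) {
    if (nearestDistance < 10.0f) return 0;
    if (nearestDistance < 100.0f) return 1;
    return 2;
}
```

Using the minimum over all views means an object stays at full detail as long as at least one view is close to it, so a single geometry representation can serve every view of the tab.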
Consider a Desktop PC. It has 2 discrete graphics cards and 1 integrated graphics card.
1 Monitor is connected and active on each of these 3 devices.
We can use exactly 1 device for rendering for all monitors!
Windows 10/11 WDDM supports heterogeneous multi-adapter. When a window moves: DWM composites surfaces.
Frames are copied across adapters if needed. This works but is slow since all frames need to traverse the PCIe bus.

TO-DO LIST : As things get completed,
    they will be removed from this pending list and get incorporated appropriately in the design document.

Phase 1: The Visual Baseline (Get these out of the way)
[Done] Update Vertex format to include Normals. (Required for lighting).
[Done] Hemispherical Lighting in shader. (Verify normals are correct).
[Done] Mouse Zoom/Pan/Rotate (Basic).

Phase 2: The "Freeze" Infrastructure
Before you break the memory model, build the mechanism that hides the breakage.
[Done] Render To Texture (RTT) & Full-Screen Quad. Goal: Detach the "Drawing" from the "Presenting."

Phase 3: The API Pivot (The Hardest Part)
Switching to ExecuteIndirect changes how you pass data. Do this BEFORE implementing custom heaps to isolate variables.
[Done] Implement Structured Buffer for World Matrix. StructuredBuffer<float4x4> and a root constant index.
We cannot do ExecuteIndirect for multiple objects without a way to tell the shader which object is being drawn.
[Done] DrawIndexedInstanced → ExecuteIndirect (EI).
Advice: Implement this using your current committed resources first. Just get the API call working.

Phase 4: The Memory Manager (The "Vishwakarma" Core)
Now that EI is working, replace the backing memory.
[ ] [MISSING] Global Upload Ring Buffer.
Critical: The copy thread needs a staging area. If we don't build this,
our "VRAM Pages" step will stall waiting on CreateCommittedResource for uploads.
[Done] VRAM Pages per Tab (The Stack Allocator). Advice: Implement the "Double-Ended Stack" (Vertex Up, Index Down) here.
[ ] CPU-Side Free List Allocator. (The logic that tracks the holes).
[ ] Tab Management / View Management. (Integrating the heaps into the UI).
[ ] Basic Ribbon UI.

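For the pending Global Upload Ring Buffer item, the offset bookkeeping could look roughly like the sketch below. It only hands out byte offsets into one staging buffer that is assumed to be created once up front, and it assumes allocations are retired in submission order with non-decreasing fence values. All names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <optional>

class UploadRingSketch {
public:
    explicit UploadRingSketch(uint64_t capacityBytes) : capacity_(capacityBytes) {}

    // Returns a byte offset for a contiguous staging region, or nullopt when the
    // GPU has not yet released enough space (caller retries after Retire).
    std::optional<uint64_t> Allocate(uint64_t bytes, uint64_t fenceValue) {
        if (bytes == 0 || bytes > capacity_) return std::nullopt;
        if (inFlight_.empty()) head_ = tail_ = 0;  // idle ring: restart at offset 0
        uint64_t offset = head_;
        if (head_ >= tail_) {
            // Free space is [head, capacity) plus [0, tail).
            if (head_ + bytes > capacity_) {
                if (bytes >= tail_) return std::nullopt;  // no room at the start either
                offset = 0;                               // wrap around
            }
        } else if (head_ + bytes >= tail_) {
            return std::nullopt;  // free space is only [head, tail)
        }
        head_ = offset + bytes;
        inFlight_.push_back({ head_, fenceValue });
        return offset;
    }

    // Releases every region whose copy has completed on the GPU.
    void Retire(uint64_t completedFence) {
        while (!inFlight_.empty() && inFlight_.front().fence <= completedFence) {
            tail_ = inFlight_.front().end;
            inFlight_.pop_front();
        }
    }

private:
    struct Record { uint64_t end; uint64_t fence; };
    uint64_t capacity_;
    uint64_t head_ = 0;  // next write offset
    uint64_t tail_ = 0;  // end of the most recently released region
    std::deque<Record> inFlight_;
};
```

Since the staging buffer is reused, the copy thread would not need a new resource per upload, which is the stall the Phase 4 note warns about.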
Phase 5: Advanced Features & Polish
[ ] Migrate to Shader Model 6.
[ ] VRAM Defragmentation. (Now safe to implement because RTT exists).
[ ] Click Selection / Window Selection. (Requires Raycasting against your CPU Free List/Data structures).
[ ] Instanced optimization for Pipes.
[ ] SSAO.
[ ] Upgrade Vertices to HDR + Tonemapping.
[ ] Transparency Sorting. (CPU Sort + Args Rebuild when camera stops moving).

Phase 6: Performance & Telemetry
[ ] Per-Tab VRAM Usage Graphs. (Helps identify memory leaks or inefficient usage).
[ ] Page Fragmentation Heatmap. (Visualize which pages are most fragmented).
[ ] Eviction Frequency Counters. (Track how often eviction occurs and its impact on performance).
[ ] Copy Queue Stall Tracking. (Identify bottlenecks in the copy thread).

Phase 7: Extreme performance optimizations (Only after all above is done and stable)
[ ] LOD Optimization. (Using instancing or compute shaders to manage levels of detail based on camera distance).
[ ] Compute Shader Frustum Culling. (To reduce the number of objects sent to the GPU).
[ ] Mesh Shader Implementation. (For supported hardware, to further reduce draw call overhead). (Only for pipes)
[ ] GPU-Based Defragmentation. (Offload defragmentation to the GPU to minimize CPU stalls).
[ ] Asynchronous Resource Creation. (Use D3D12's async resource creation to further reduce stalls
  during heap growth or defragmentation).
[ ] Page Level optimization : Static pages → single draw, Semi-dynamic pages → EI,
  Highly dynamic pages → EI + GPU compaction

Not to do list:
Multi-GPU Rendering. (Too complex for initial implementation, and Windows' multi-adapter support is limited).
Face-wise Geometry colors. (Implementation detail). Maybe necessary for future mechanical parts.

*/

#pragma once
#include <cmath> // std::hypot, std::cos, std::sin
#include <DirectXMath.h>

struct CameraState { // Each view gets its own camera state.
    // This is part of the "View" data structure, not the "Tab" data structure. Each tab can have multiple views.
    DirectX::XMFLOAT3 position;
    DirectX::XMFLOAT3 target;
    DirectX::XMFLOAT3 up;
    float fov;
    float aspect;
    float nearZ;
    float farZ;

    CameraState() { Initialize(); }
    void Initialize() {
        position = { 0.0f, -10.0f, 2.0f };
        target = { 0.0f, 0.0f, 0.0f };
        up = { 0.0f, 0.0f, 1.0f }; // Z-Up is perfect for an XY orbit.

        fov = DirectX::XMConvertToRadians(60.0f);
        aspect = 1.0f; // SAFE DEFAULT
        nearZ = 0.1f;
        farZ = 1000.0f;
    }
};

inline void UpdateCameraOrbit(CameraState& cam)
{
    static float rotationAngle = 0.0f; // Shared by all callers. Remove static when implementing tab UI.
    rotationAngle += 0.002f;           // Per-frame speed, in radians.

    // Calculate the 2D radius from the target on the XY plane. We ignore Z here to prevent the "spiral away" bug.
    float dx = cam.position.x - cam.target.x;
    float dy = cam.position.y - cam.target.y;
    float radius = std::hypot(dx, dy);
    if (radius < 0.001f) radius = 10.0f; // Safety check to prevent the radius becoming 0 (which locks the camera).

    float x = cam.target.x + std::cos(rotationAngle) * radius; // Orbit in the XY plane.
    float y = cam.target.y + std::sin(rotationAngle) * radius;
    float z = cam.position.z; // Z remains static (height).
    cam.position = { x, y, z };
}

Actual Code of our graphics engine.

// Copyright (c) 2025-Present : Ram Shanker: All rights reserved.
#pragma once

// DirectX 12 headers. The best place to learn DirectX12 is the original Microsoft documentation.
// https://learn.microsoft.com/en-us/windows/win32/direct3d12/direct3d-12-graphics
// You need a good dose of prior C++ knowledge and Computer Fundamentals before learning DirectX12.
// Expect to read it at least 2 times before you start grasping it!

// Tell the HLSL compiler to include debug information in the shader blob.
#define D3DCOMPILE_DEBUG 1 //TODO: Remove from production build.
#define WIN32_LEAN_AND_MEAN
#include <windows.h>   // MUST be before d3d12.h
#include <d3d12.h> // Main DirectX12 API. Included from %WindowsSdkDir\Include%WindowsSDKVersion%\\um
// Helper structures library. MIT Licensed. Added to the project as a git submodule.
// https://github.com/microsoft/DirectX-Headers/blob/main/include/directx/d3dx12.h
#include <d3dx12.h>
#include <dxgi1_6.h>
#include <dxgidebug.h>
#include <wrl.h>
#include <d3dcompiler.h>
#include <DirectXMath.h> //Where from? https://github.com/Microsoft/DirectXMath ?
#include <atomic>   // std::atomic, used by the page structures below
#include <cstdint>  // uint32_t, uint64_t
#include <memory>   // std::unique_ptr
#include <vector>
#include <string>
#include <unordered_map>
#include <random>
#include <ctime>
#include <iostream>
#include <thread>
#include <chrono>
#include <map>
#include <list>

#include "ConstantsApplication.h"
#include "MemoryManagerGPU.h"
#include "UserInterface-DirectX12.h"
#include "डेटा.h" // File name is Hindi for "data".

using namespace Microsoft::WRL;

// DirectX12 Libraries.
#pragma comment(lib, "d3d12.lib") //%WindowsSdkDir\Lib%WindowsSDKVersion%\\um\arch
#pragma comment(lib, "dxgi.lib")
#pragma comment(lib, "d3dcompiler.lib")
#pragma comment(lib, "dxguid.lib")

/* Double buffering is preferred for a CAD application due to its low input lag. Caveat: If the rendering time
exceeds the frame refresh interval, then stuttering will appear. However,
for us low input latency outweighs the slightly smoother frames of triple buffering.
Double buffering (2x) also needs one third less render-target memory than triple buffering (3x). */
const UINT FRAMES_PER_RENDERTARGETS = 2; // Initially we are going with double buffering.

// Constants
constexpr UINT64 MaxVertexBufferSize = 1024 * 1024 * 64; // 64 MB
constexpr UINT64 MaxIndexBufferSize = 1024 * 1024 * 16; // 16 MB

// Represents the complete vertex and index data associated with 1 engineering object.
// This structure holds information about a resource allocated in GPU memory (VRAM).
struct GpuResourceVertexIndexInfo {
    ComPtr<ID3D12Resource> vertexBuffer;
    D3D12_VERTEX_BUFFER_VIEW vertexBufferView;
    ComPtr<ID3D12Resource> indexBuffer;
    D3D12_INDEX_BUFFER_VIEW indexBufferView;
    UINT indexCount;
    uint32_t matrixIndex = 0;

    //TODO: Later on we will generalize this structure to hold textures, materials, shaders etc.
    // Currently we are letting the driver manage the GPU memory fragmentation. Later we will manage it ourselves.
    //uint64_t vramOffset; // Simulated VRAM address
    //uint64_t size;
    // In a real DX12 app, this would hold ID3D12Resource*, D3D12_VERTEX_BUFFER_VIEW, etc.
};

struct IndirectCommand { // OPTIMIZED Indirect Command
    uint32_t matrixIndex; // 4 Bytes (Root Constant b1)
    // Since the Jumbo buffer (or pages in future) remains the same, we bind it once.
    // REMOVED: D3D12_VERTEX_BUFFER_VIEW vbv (Saved 16 Bytes)
    // REMOVED: D3D12_INDEX_BUFFER_VIEW  ibv (Saved 16 Bytes)
    D3D12_DRAW_INDEXED_ARGUMENTS drawArguments; // 20 Bytes
}; // Total size: 24 Bytes (down from 56 Bytes!)
static_assert(sizeof(IndirectCommand) == 24, "IndirectCommand must be exactly 24 bytes.");

/* Page Metadata: GeometryPlacementRecordInPage (CPU-side only).
One entry per geometry object inside a GeometryPage. Used by the Copy Thread for defragmentation,
rebuilds, and future features (frustum culling, ray-cast selection, LOD, etc.).
Fields = 56 bytes plus a 1-byte deletion flag; the compiler pads the struct to 64 bytes (one cache line). */
struct GeometryPlacementRecordInPage {
    uint64_t objectID;           // Unique 64-bit ID across entire process (unchanged)

    // Byte offsets into this page's vertex/index buffers (page max = 4 MB → uint32_t is safe)
    // Vertex region (grows upward)
    uint32_t vertexByteOffset; // Start of this object's vertices in the page (bytes)
    uint32_t vertexSize;       // In bytes

    // Index region (grows downward)
    uint32_t indexByteOffset;    // Start of this object's indices in the page (bytes)
    uint32_t indexSize;          // In bytes

    uint32_t indexCount;         // Number of indices (not bytes). For ExecuteIndirect.
    uint32_t matrixIndex;        // Index into the per-tab WorldMatrix structured buffer

    // Axis-Aligned Bounding Box (AABB), stored as float32 only (24 bytes total).
    // Always present for future use (frustum culling, selection, etc.).
    // Set to {0,0,0} / {0,0,0} if we don't need it yet. It costs nothing extra.
    float minX, minY, minZ, maxX, maxY, maxZ; // Minimum corner (X,Y,Z), Maximum corner (X,Y,Z)

    // No explicit padding member is needed: the compiler pads to 8-byte alignment after this flag.
    bool isDeleted = false; // Marked for deletion (soft delete, for defragmentation)
};

static_assert(sizeof(GeometryPlacementRecordInPage) == 64,
    "GeometryPlacementRecordInPage must be exactly 64 bytes for optimal cache/line usage.");

113struct GeometryPage {
    // GPU RESOURCES. Single unified 4 MB buffer.
    Microsoft::WRL::ComPtr<ID3D12Resource> buffer; // Layout: [Vertex Region ↑][Free Space][Index Region ↓]
    Microsoft::WRL::ComPtr<ID3D12Resource> indirectBuffer; // ExecuteIndirect argument buffer for this page
    uint32_t indirectCount = 0; // Number of valid indirect draw commands

    // ALLOCATION STATE (CPU-side only)
    uint32_t vertexHead = 0; // Vertex region grows upward from 0
    // Index region grows downward from pageSize
    uint32_t indexTail = 0;  // Initialized to pageSize
    uint32_t pageSize = 0;   // Typically 4 * 1024 * 1024
    static constexpr uint32_t SAFETY_GAP = 64; // Alignment guard between the two regions

    // FRAGMENTATION TRACKING
    uint32_t liveBytes  = 0;   // Actively used bytes
    uint32_t holeBytes  = 0;   // Deleted object space
    uint32_t objectCount = 0;  // Active objects

    // VERSIONING & LIFETIME CONTROL
    uint32_t version = 0;                // Incremented on rebuild
    std::atomic<bool> published = false; // Immutable once true
    uint64_t retireFence = 0; // Fence value after which this page is safe to destroy

    std::vector<GeometryPlacementRecordInPage> objects; // CPU METADATA (NO GEOMETRY STORED)

    // UTILITY
    // Returns true when the incoming vertex and index data cannot both fit in the free space.
    bool IsFull(uint32_t incomingVertexBytes, uint32_t incomingIndexBytes) const {
        // Guard first: if incomingIndexBytes > indexTail, indexTail - incomingIndexBytes wraps to a huge value.
        if (incomingIndexBytes > indexTail) return true;
        uint32_t alignedVertexHead = AlignUp(vertexHead, 16);
        uint32_t alignedIndexTail  = AlignDown(indexTail - incomingIndexBytes, 4);
        // Sum in 64 bits so a very large incomingVertexBytes cannot wrap around uint32_t.
        uint64_t requiredTop = uint64_t(alignedVertexHead) + incomingVertexBytes + SAFETY_GAP;
        return requiredTop > alignedIndexTail;
    }

    // Both helpers assume that alignment is a power of two.
    static uint32_t AlignUp(uint32_t value, uint32_t alignment) {
        return (value + alignment - 1) & ~(alignment - 1);
    }

    static uint32_t AlignDown(uint32_t value, uint32_t alignment) {
        return value & ~(alignment - 1);
    }
};
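
The two-ended layout above (vertices growing up from 0, indices growing down from pageSize) can be exercised without any GPU. The following is a minimal CPU-only sketch, not the project's allocator: PageAllocatorSketch and its Allocate method are hypothetical names introduced for illustration, and only the IsFull / AlignUp / AlignDown logic mirrors GeometryPage.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical CPU-only mirror of the GeometryPage allocation state.
struct PageAllocatorSketch {
    static constexpr uint32_t SAFETY_GAP = 64;
    uint32_t pageSize;
    uint32_t vertexHead = 0; // grows upward from 0
    uint32_t indexTail;      // grows downward from pageSize

    explicit PageAllocatorSketch(uint32_t size) : pageSize(size), indexTail(size) {}

    static uint32_t AlignUp(uint32_t v, uint32_t a)   { return (v + a - 1) & ~(a - 1); }
    static uint32_t AlignDown(uint32_t v, uint32_t a) { return v & ~(a - 1); }

    bool IsFull(uint32_t vertexBytes, uint32_t indexBytes) const {
        if (indexBytes > indexTail) return true; // would wrap below zero
        const uint64_t top = uint64_t(AlignUp(vertexHead, 16)) + vertexBytes + SAFETY_GAP;
        return top > AlignDown(indexTail - indexBytes, 4);
    }

    // Reserves both regions. Returns false and changes nothing if the page is full.
    bool Allocate(uint32_t vertexBytes, uint32_t indexBytes,
                  uint32_t& vertexOffset, uint32_t& indexOffset) {
        if (IsFull(vertexBytes, indexBytes)) return false;
        vertexOffset = AlignUp(vertexHead, 16);
        indexOffset  = AlignDown(indexTail - indexBytes, 4);
        vertexHead   = vertexOffset + vertexBytes;
        indexTail    = indexOffset;
        return true;
    }
};
```

With a 4096-byte page, a first request of 100 vertex bytes and 30 index bytes lands at vertex offset 0 and index offset 4064; a second request of 50 and 10 bytes lands at 112 and 4052, because the vertex head is rounded up to 16 and the index tail is rounded down to 4.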

struct BigGeometryObject {
    Microsoft::WRL::ComPtr<ID3D12Resource> buffer;
    Microsoft::WRL::ComPtr<ID3D12Resource> indirectBuffer;
    uint32_t indexCount = 0;
    uint32_t matrixIndex = 0;
    uint64_t retireFence = 0;
    std::atomic<bool> published = false;
};

struct GeometryPageSnapshot { // A lightweight, immutable snapshot of the current pages.
    // We use raw pointers here because the Render thread only needs to observe them.
    // Iterating over a contiguous array of pointers is extremely cache-friendly.
    std::vector<GeometryPage*> pages;
};

struct TabGeometryStorage {
    // THE RCU POINTER: Render threads read this, Copy thread writes to it.
    std::atomic<GeometryPageSnapshot*> activeSnapshot{ nullptr };
    // WRITER-ONLY STATE: Only the Copy thread touches these, so they need no locks/atomics.
    std::vector<std::unique_ptr<GeometryPage>> activePages; // Actually owns the memory

    // Cleanup queues for the Copy thread
    struct RetiredSnapshot { GeometryPageSnapshot* snapshot; uint64_t retireFence; };
    struct RetiredPage { std::unique_ptr<GeometryPage> page; uint64_t retireFence; };
    std::vector<RetiredSnapshot> retiredSnapshots;
    std::vector<RetiredPage> retiredPages;

    /* TODO: An RCU version of each of the following vectors needs to be developed. Only the 1st is done so far.
    std::vector<std::unique_ptr<GeometryPage>> opaquePages; // Opaque geometry pages
    std::vector<std::unique_ptr<GeometryPage>> transparentPages; // Transparent geometry pages
    std::vector<std::unique_ptr<GeometryPage>> wireframePages; // Wireframe pages (if used)
    std::vector<std::unique_ptr<BigGeometryObject>> bigObjects; // Dedicated large objects
    std::atomic<uint32_t> currentVersion = 0;
    std::vector<std::unique_ptr<GeometryPage>> retiredPages;
    */
};
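
The read-copy-update (RCU) pattern described in TabGeometryStorage can be sketched in portable C++ as follows. This is an illustration under stated assumptions, not the project's implementation: PageSketch, SnapshotSketch, StorageSketch and their methods are hypothetical names, and the sketch assumes a reader holds a snapshot pointer for at most one frame, so that a completed fence value is enough to prove nobody still reads a retired snapshot.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

struct PageSketch { uint32_t id; };                         // stand-in for GeometryPage
struct SnapshotSketch { std::vector<PageSketch*> pages; };  // stand-in for GeometryPageSnapshot

struct StorageSketch {
    std::atomic<SnapshotSketch*> active{ nullptr };          // readers load, writer exchanges
    std::vector<std::unique_ptr<PageSketch>> ownedPages;     // writer-only
    struct Retired { std::unique_ptr<SnapshotSketch> snapshot; uint64_t retireFence; };
    std::vector<Retired> retired;                            // writer-only

    // Writer (Copy thread): build a fresh snapshot, publish it, queue the old one for cleanup.
    void AddPageAndPublish(uint32_t id, uint64_t currentFence) {
        ownedPages.push_back(std::make_unique<PageSketch>(PageSketch{ id }));
        auto next = std::make_unique<SnapshotSketch>();
        for (auto& p : ownedPages) next->pages.push_back(p.get());
        SnapshotSketch* old = active.exchange(next.release(), std::memory_order_acq_rel);
        if (old) retired.push_back({ std::unique_ptr<SnapshotSketch>(old), currentFence });
    }

    // Writer: free every retired snapshot whose fence value has completed. Returns the count freed.
    std::size_t Reclaim(uint64_t completedFence) {
        const std::size_t before = retired.size();
        retired.erase(std::remove_if(retired.begin(), retired.end(),
            [completedFence](const Retired& r) { return r.retireFence <= completedFence; }),
            retired.end());
        return before - retired.size();
    }

    // Reader (Render thread): one atomic load, then lock-free iteration over the snapshot.
    std::size_t CountVisiblePages() const {
        const SnapshotSketch* s = active.load(std::memory_order_acquire);
        return s ? s->pages.size() : 0;
    }

    ~StorageSketch() { delete active.load(); }
};
```

The design choice worth noting is that the writer never mutates a published snapshot. It builds a new vector of pointers and swaps the single atomic pointer, so readers need no mutex.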

/* DirectX 12 resources are organized at 3 levels:
1. The Data   : Per Tab (Jumbo Buffers for geometry data, materials, textures,
    Pipeline State Object, Root Signature, Command Signature etc.)
2. The Target : Per Window (Swap Chain, Render Targets, Depth Stencil Buffer etc.)
3. The Worker : Per Render Thread, 1 for each monitor (Command Queue, Command List etc.;
    resources shared across multiple windows on the same monitor) */

struct DX12ResourcesPerTab { // (The Data) Geometry Data

    // Upload Heaps (CPU -> GPU Transfer)
    // Moved here because the Copy Thread writes to these when adding objects to the TAB.
    ComPtr<ID3D12Resource> vertexBufferUpload;
    ComPtr<ID3D12Resource> indexBufferUpload;

    // Persistent Mapped Pointers (CPU Address)
    UINT8* pVertexDataBegin = nullptr; // Pointer for mapped vertex upload buffer
    UINT8* pIndexDataBegin = nullptr;  // Pointer for mapped index upload buffer

    // TODO: We will generalize this to hold materials, shaders, textures etc. unique to this project/tab
    ComPtr<ID3D12DescriptorHeap> srvHeap;

    mutable std::mutex objectsOnGPUMutex; // Mutable so that const references can lock it in rendering paths.
    // Copy thread updates the following map whenever it adds/removes/modifies an object on the GPU.
    std::map<uint64_t, GpuResourceVertexIndexInfo> objectsOnGPU;

    // Copy thread owns/writes the following variables exclusively. Render threads only read them, without a lock.
    ComPtr<ID3D12Resource> worldMatrixBuffer; // TODO: Double-buffer it per frame.
    UINT8* pWorldMatrixDataBegin = nullptr;
    uint32_t               matrixCapacity = 4096;
    uint32_t               matrixCount = 0;
    std::vector<uint32_t>  freeMatrixSlots;   // Free-list of matrix indices,
    // to enable re-use of slots when objects are removed.

    // Initially rootSignature & pipelineState were in PerWindow. They were moved here
    // when commandSignature and the indirect drawing infrastructure were added,
    // since Root Signature and Pipeline State are closely tied to the command signature.
    ComPtr<ID3D12RootSignature> rootSignature;
    ComPtr<ID3D12PipelineState> pipelineState;

    ComPtr<ID3D12CommandSignature> commandSignature; // Indirect Drawing

    CameraState camera; // Reference is updated per frame.
    // Currently per tab, but later we will have this per view, since each tab can have multiple views.
};
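
The matrixCapacity / matrixCount / freeMatrixSlots trio above describes a free-list: removed objects return their matrix index for re-use, and fresh indices are handed out only when the list is empty. A minimal sketch of that idea follows. MatrixSlotAllocatorSketch, Acquire and Release are hypothetical names, and growing the GPU buffer when capacity is exhausted is deliberately left out.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical free-list allocator for world-matrix slots.
struct MatrixSlotAllocatorSketch {
    uint32_t capacity;                 // mirrors matrixCapacity
    uint32_t count = 0;                // mirrors matrixCount (high-water mark)
    std::vector<uint32_t> freeSlots;   // mirrors freeMatrixSlots

    explicit MatrixSlotAllocatorSketch(uint32_t cap) : capacity(cap) {}

    // Prefer a recycled slot; otherwise take the next unused index; fail when exhausted.
    std::optional<uint32_t> Acquire() {
        if (!freeSlots.empty()) {
            const uint32_t slot = freeSlots.back();
            freeSlots.pop_back();
            return slot;
        }
        if (count < capacity) return count++;
        return std::nullopt;
    }

    // Called when an object is removed; the slot becomes available again.
    void Release(uint32_t slot) { freeSlots.push_back(slot); }
};
```

Recycling keeps matrixCount from growing without bound in a long editing session where objects are added and removed repeatedly.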

struct DX12ResourcesPerWindow { // Presentation Logic
    int WindowWidth = 800; // Current viewport (rendering area) size, excluding task-bar etc.
    int WindowHeight = 600;
    ID3D12CommandQueue* creatorQueue = nullptr; // Track which queue this window was created with,
    // to assist with migrations.

    ComPtr<IDXGISwapChain3>         swapChain; // The link to the OS Window
    //ComPtr<ID3D12CommandQueue>    commandQueue; // Moved to OneMonitorController
    ComPtr<ID3D12DescriptorHeap>    rtvHeap;
    ComPtr<ID3D12Resource>          renderTargets[FRAMES_PER_RENDERTARGETS];

    // Render To Texture Infrastructure
    ComPtr<ID3D12Resource>          renderTextures[FRAMES_PER_RENDERTARGETS];
    ComPtr<ID3D12DescriptorHeap>    rttRtvHeap;
    ComPtr<ID3D12DescriptorHeap>    rttSrvHeap;

    // TODO: When we implement HDR support, we will have to change the above format to the following.
    //DXGI_FORMAT                     rttFormat = DXGI_FORMAT_R16G16B16A16_FLOAT; // HDR ready

    ComPtr<ID3D12Resource> depthStencilBuffer; // Depth Buffer (sized to the window dimensions)
    ComPtr<ID3D12DescriptorHeap> dsvHeap;

    D3D12_VIEWPORT viewport; // Viewport & Scissor (dependent on window size).
    D3D12_RECT scissorRect;

    ComPtr<ID3D12Resource> constantBuffer;
    ComPtr<ID3D12DescriptorHeap> cbvHeap;
    UINT8* cbvDataBegin = nullptr;

    UINT frameIndex = 0; // Remember this is different from allocatorIndex in the Render Thread.
    // It can change even during a window resize.
};

struct DX12ResourcesPerRenderThread { // One of these is created for each monitor.
    // For convenience only. It simply points to OneMonitorController.commandQueue
    ComPtr<ID3D12CommandQueue> commandQueue;

    // Note that there are as many render threads as monitors attached.
    // Command Allocators MUST be unique to the thread.
    // We need one per frame-in-flight to avoid resetting while the GPU is reading.
    ComPtr<ID3D12CommandAllocator> commandAllocators[FRAMES_PER_RENDERTARGETS];
    UINT allocatorIndex = 0; // Remember this is different from the frameIndex available per Window.

    // The Command List (the recording pen). Can be reset and reused for multiple windows within the same frame.
    ComPtr<ID3D12GraphicsCommandList> commandList;

    // Synchronization (Per Window VSync)
    HANDLE fenceEvent = nullptr;
    ComPtr<ID3D12Fence> fence; // TODO: Discard this. Use the fence inside the monitor.
};

struct OneMonitorController { // Variables stored per monitor.
    // System Fetched information.
    bool isScreenInitalized = false;
    int screenPixelWidth = 800;
    int screenPixelHeight = 600;
    int screenPhysicalWidth = 0; // in mm
    int screenPhysicalHeight = 0; // in mm
    int WindowWidth = 800; // Current viewport (rendering area) size, excluding task-bar etc.
    int WindowHeight = 600;

    HMONITOR hMonitor = NULL; // Monitor handle. Remains fixed as long as monitor is not disconnected / disabled.
    std::wstring monitorName;            // Monitor device name (e.g., "\\\\.\\DISPLAY1")
    std::wstring friendlyName;           // Human readable name (e.g., "Dell U2720Q")
    RECT monitorRect;                    // Full monitor rectangle
    RECT workAreaRect;                   // Work area (excluding task bar)
    int dpiX = 96;                       // DPI X
    int dpiY = 96;                       // DPI Y
    double scaleFactor = 1.0;            // Scale factor (100% = 1.0, 125% = 1.25, etc.)
    bool isPrimary = false;              // Is this the primary monitor?
    DWORD orientation = DMDO_DEFAULT;    // Monitor orientation
    int refreshRate = 60;                // Refresh rate in Hz
    int colorDepth = 32;                 // Color depth in bits per pixel

    bool isVirtualMonitor = false;       // To support headless mode.

    // DirectX12 Resources.
    // TODO: Move these to per render thread structure.
    ComPtr<ID3D12CommandQueue> commandQueue;    // Persistent. Survives thread restarts.
    bool hasActiveThread = false; // We need to know if this specific monitor is currently being serviced by a thread
    ComPtr<ID3D12Fence> renderFence; // Signalled each frame by GpuRenderThread
    uint64_t renderFenceValue = 0; // Last value signalled (written by render thread)
    // Above is intentionally NOT std::atomic since gpu.renderFenceValue is the std::atomic serving all monitors.
    HANDLE renderFenceEvent = nullptr;
};

// Commands sent from Generator thread(s) to the Copy thread
enum class CommandToCopyThreadType { NONE = 0, ADD, MODIFY, REMOVE };
struct CommandToCopyThread
{
    CommandToCopyThreadType type;
    std::optional<GeometryData> geometry; // Present for ADD and MODIFY
    uint64_t id = 0; // Always present
    uint64_t tabID = 0; // NEW: We must know which tab this object belongs to!
};

extern std::atomic<bool> pauseRenderThreads; // Defined in Main.cpp

// Packet of work for a Render Thread for one frame
struct RenderPacket {
    uint64_t frameNumber;
    std::vector<uint64_t> visibleObjectIds;
};

class HrException : public std::runtime_error // Simple exception helper for HRESULT checks
{
public:
    HrException(HRESULT hr) : std::runtime_error("HRESULT Exception"), hr(hr) {}
    HRESULT Error() const { return hr; }
private:
    const HRESULT hr;
};

inline void ThrowIfFailed(HRESULT hr) {
    if (FAILED(hr)) { throw HrException(hr); }
}


class ThreadSafeQueueGPU {
public:
    void push(CommandToCopyThread value) {
        std::lock_guard<std::mutex> lock(mutex);
        fifoQueue.push(std::move(value));
        cond.notify_one();
    }

    // Non-blocking pop
    bool try_pop(CommandToCopyThread& value) {
        std::lock_guard<std::mutex> lock(mutex);
        if (fifoQueue.empty()) { return false; }
        value = std::move(fifoQueue.front());
        fifoQueue.pop();
        return true;
    }

    // Shuts down the queue, waking up any waiting threads
    void shutdownQueue() {
        std::lock_guard<std::mutex> lock(mutex);
        shutdown = true;
        cond.notify_all();
    }

private:
    std::queue<CommandToCopyThread> fifoQueue; // fifo = First-In First-Out
    std::mutex mutex;
    std::condition_variable cond;
    bool shutdown = false;
};

inline ThreadSafeQueueGPU g_gpuCommandQueue;
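
ThreadSafeQueueGPU already carries a condition variable and a shutdown flag, but the listing only shows a non-blocking try_pop. A blocking counterpart could look like the sketch below. BlockingQueueSketch and wait_and_pop are hypothetical names that do not appear in the original header, and the template parameter stands in for CommandToCopyThread so that the example compiles on its own.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <utility>

template <typename T>
class BlockingQueueSketch {
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(value));
        }
        cond_.notify_one(); // wake one waiting consumer
    }

    // Blocks until an item is available or the queue is shut down.
    // Returns false only when the queue is shut down AND already drained.
    bool wait_and_pop(T& value) {
        std::unique_lock<std::mutex> lock(mutex_);
        cond_.wait(lock, [this] { return shutdown_ || !queue_.empty(); });
        if (queue_.empty()) return false;
        value = std::move(queue_.front());
        queue_.pop();
        return true;
    }

    // Wakes every waiting thread so that each can observe the shutdown flag.
    void shutdownQueue() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            shutdown_ = true;
        }
        cond_.notify_all();
    }

private:
    std::queue<T> queue_;
    std::mutex mutex_;
    std::condition_variable cond_;
    bool shutdown_ = false;
};
```

The predicate form of wait protects against spurious wake-ups, and items pushed before shutdown are still delivered, so a consumer loop of the form while (q.wait_and_pop(cmd)) { ... } drains the queue and then exits.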

// VRAM Manager: This class handles GPU memory dynamically.
// There will be exactly 1 object of this class in the entire application. Hence the special name.
// May Lord Shankar's blessings remain with us. The corresponding object is named "gpu".
class शंकर {
public:
    OneMonitorController screens[MV_MAX_MONITORS];
    int currentMonitorCount = 0; // Global monitor count. It can be 0 when no monitors are found (headless mode)

    // IDXGIFactory6 / IDXGIAdapter4 prerequisite: Windows 10 1803+ / Windows 11
    ComPtr<IDXGIFactory6> factory6; // The OS-level display system manager. Can iterate over GPUs.
    ComPtr<IDXGIAdapter4> hardwareAdapter; // Represents a physical GPU device.

    // Represents 1 logical GPU device on the above adapter. Creates all DirectX12 memory / resources / commands etc.
    ComPtr<ID3D12Device> device; // Very Important: We support EXACTLY 1 GPU device only in this version.
    bool isGPUEngineInitialized = false; // TODO: To be implemented.
    DXGI_FORMAT rttFormat = DXGI_FORMAT_R8G8B8A8_UNORM;

    DX12ResourcesUI uiResources;

    // Following to be added later.
    //ID3D12DescriptorHeapMgr    ← Global descriptor allocator
    //Shader& PSO Cache         ← Shared by all threads
    //AdapterInfo                ← For device selection / VRAM stats

    /* We will have 1 render queue per monitor, local to that monitor's Render Thread.
    IMPORTANT: Most GPUs expose a single 3D (graphics) engine. Even if 4 command lists are submitted
    to 4 independent queues, the graphics driver / WDDM typically serializes them on that engine.
    We still need 4 separate queues to handle different refresh rates properly.

    Ex: If we put all 4 windows on the same queue: Window A (60 Hz) submits a Present command. The queue STALLS
    waiting for Monitor A's VSync interval. Window B (144 Hz) submits a draw command.
    Window B cannot be processed because the queue is blocked by Window A's VSync wait.
    By using 4 queues, Queue A can sit blocked waiting for VSync,
    while Queue B immediately pushes work to the GPU for the faster monitor. */

    std::atomic<uint64_t> renderFenceValue = 0; // Global. This is in addition to the per-monitor render fence value.

    ComPtr<ID3D12CommandQueue> copyCommandQueue; // There is only 1 across the application.
    ComPtr<ID3D12Fence> copyFence; // Synchronization for Copy Queue
    std::atomic<uint64_t> copyFenceValue = 1; // Thread safe.
    // Starts from 1 to avoid confusion with the default fence value of 0.
    HANDLE copyFenceEvent = nullptr;

public:
    // Maps our CPU ObjectID to its resource info in VRAM
    std::unordered_map<uint64_t, GpuResourceVertexIndexInfo> resourceMap;

    // Simulates a simple heap allocator with 16 MB chunks
    uint64_t m_nextFreeOffset = 0;
    const uint64_t CHUNK_SIZE = 16 * 1024 * 1024;
    uint64_t m_vram_capacity = 4 * CHUNK_SIZE; // Simulate 64 MB VRAM

    // When an object is updated, the old VRAM is put here to be freed later.
    struct DeferredFree {
        uint64_t frameNumber; // The frame it became obsolete
        GpuResourceVertexIndexInfo resource;
    };
    std::list<DeferredFree> deferredFreeQueue;

    // Allocate space in VRAM. Returns the handle. What is this used for?
    // std::optional<GpuResourceVertexIndexInfo> Allocate(size_t size);

    // Descriptor sizes for RTV and CBV/SRV/UAV. We need these to calculate offsets in descriptor heaps.
    // These are initialized during device creation and remain constant, i.e. they are hardware properties of the GPU.
    // We store them here for easy access across threads.
    UINT rtvDescriptorSize = 0, cbvSrvUavDescriptorSize = 0; // Initialized during device creation.

    void ProcessDeferredFrees(uint64_t lastCompletedRenderFrame);

    //शंकर() {}; // Our Main function initializes DirectX12 global resources by calling InitD3DDeviceOnly().
    void InitD3DDeviceOnly();
    void InitD3DPerTab(DX12ResourcesPerTab& tabRes); // Call this when a new Tab is created
    void InitD3DPerWindow(DX12ResourcesPerWindow& dx, HWND hwnd, ID3D12CommandQueue* commandQueue);
    void PopulateCommandList(ID3D12GraphicsCommandList* cmdList, // Called by the per-monitor render thread.
        DX12ResourcesPerWindow& winRes, const DX12ResourcesPerTab& tabRes, TabGeometryStorage& storage);
    void WaitForPreviousFrame(const DX12ResourcesPerRenderThread& dx);
    void ResizeD3DWindow(DX12ResourcesPerWindow& dx, UINT newWidth, UINT newHeight);

    // Called when a monitor is unplugged or a window is destroyed. Destroys SwapChain/RTVs but KEEPS Geometry.
    void CleanupWindowResources(DX12ResourcesPerWindow& winRes);
    // Called when a TAB is closed by the user. Destroys the Jumbo Vertex/Index Buffers.
    void CleanupTabResources(DX12ResourcesPerTab& tabRes);
    // Called ONLY at application exit (wWinMain end). Destroys the Device, Factory, and Global Copy Queue.
    // Thread resources are cleaned up by the Render Thread itself before exit.
    void CleanupD3DGlobal();
};
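
The header declares ProcessDeferredFrees but does not show its body. One plausible shape, given the DeferredFree record and the lastCompletedRenderFrame parameter, is sketched below. This is an assumption-laden illustration: DeferredFreeSketch, resourceId and ProcessDeferredFreesSketch are hypothetical names, the resource is reduced to a plain integer, and the rule "free once the GPU has completed the frame in which the resource became obsolete" is an assumed policy, not one confirmed by the source.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>

// Hypothetical stand-in for शंकर::DeferredFree, with the GPU resource reduced to an id.
struct DeferredFreeSketch {
    uint64_t frameNumber; // the frame in which the resource became obsolete
    uint64_t resourceId;
};

// Erases every entry whose frame the GPU has finished and returns how many were released.
// In the real class, erasing the entry would drop the ComPtr references held by the record.
inline std::size_t ProcessDeferredFreesSketch(std::list<DeferredFreeSketch>& queue,
                                              uint64_t lastCompletedRenderFrame) {
    std::size_t freed = 0;
    for (auto it = queue.begin(); it != queue.end(); ) {
        if (it->frameNumber <= lastCompletedRenderFrame) {
            it = queue.erase(it); // erase returns the next valid iterator
            ++freed;
        } else {
            ++it;
        }
    }
    return freed;
}
```

A std::list suits this pattern because erasing in the middle of a traversal does not invalidate the other iterators.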

void FetchAllMonitorDetails();
BOOL CALLBACK MonitorEnumProc(HMONITOR hMonitor, HDC hdcMonitor, LPRECT lprcMonitor, LPARAM dwData);

/*
IID_PPV_ARGS is a MACRO used in DirectX (and COM programming in general) to retrieve interface pointers
safely and correctly during object creation or querying. It reduces repetitive typing.
COM interfaces are identified by unique GUIDs. The macro supplies the GUID matching the pointer's
interface type, then passes the pointer's address converted to void**.

Ex: IID_PPV_ARGS(&device) is roughly equivalent to passing the following two arguments:
IID iid = __uuidof(ID3D12Device);
void** ppv = reinterpret_cast<void**>(&device);
*/

// Structure to hold transformation matrices
struct ConstantBuffer {
    DirectX::XMFLOAT4X4 viewProj;   // 64 bytes
};

// Externs for communication
extern std::atomic<bool> shutdownSignal;

// Logic Thread "Fence"
extern std::mutex g_logicFenceMutex;
extern std::condition_variable g_logicFenceCV;
extern uint64_t g_logicFrameCount;

// Copy Thread "Fence"
extern std::mutex g_copyFenceMutex;
extern std::condition_variable g_copyFenceCV;
extern uint64_t g_copyFrameCount;

// TODO: Implement this. In a real allocator, we would manage free lists and possibly defragment memory.
/*
std::optional<GpuResourceVertexIndexInfo> शंकर::Allocate(size_t size) {

    if (m_nextFreeOffset + size > m_vram_capacity) {
        std::cerr << "VRAM MANAGER: Out of memory!" << std::endl;
        // Here, the Main Logic thread would be signaled to reduce LOD.
        return std::nullopt;
    }
    GpuResourceVertexIndexInfo info{ m_nextFreeOffset, size };
    m_nextFreeOffset += size; // Simple bump allocator
    return info;
}*/

// Utility Functions

// Waits for the previous frame to complete rendering.
// Takes the struct by const reference to avoid copying its ComPtr members; the body is currently disabled.
inline void WaitForGpu(const DX12ResourcesPerWindow& dx)
{   // Where are we using this function?
    /*
    dx.commandQueue->Signal(dx.fence.Get(), dx.fenceValue);
    dx.fence->SetEventOnCompletion(dx.fenceValue, dx.fenceEvent);
    WaitForSingleObjectEx(dx.fenceEvent, INFINITE, FALSE);
    dx.fenceValue++;*/
}

// Waits for a specific fence value to be reached
inline void WaitForFenceValue(const DX12ResourcesPerWindow& dx, UINT64 fenceValue)
{ // Where are we using this?
    /*
    if (dx.fence->GetCompletedValue() < fenceValue)
    {
        ThrowIfFailed(dx.fence->SetEventOnCompletion(fenceValue, dx.fenceEvent));
        WaitForSingleObjectEx(dx.fenceEvent, INFINITE, FALSE);
    }*/
}

// Thread Functions
// Thread synchronization between Main Logic thread and Copy thread
inline std::mutex toCopyThreadMutex;
inline std::condition_variable toCopyThreadCV;
inline std::queue<CommandToCopyThread> commandToCopyThreadQueue;

// Thread Functions - Just Declaration!
void GpuCopyThread();
void GpuRenderThread(int monitorId, int refreshRate);
// Copyright (c) 2025-Present : Ram Shanker: All rights reserved.
#pragma once

// DirectX 12 headers. The best place to learn DirectX12 is the original Microsoft documentation.
// https://learn.microsoft.com/en-us/windows/win32/direct3d12/direct3d-12-graphics
// You need a good dose of prior C++ knowledge and Computer Fundamentals before learning DirectX12.
// Expect to read it at least 2 times before you start grasping it!

// Tell the HLSL compiler to include debug information into the shader blob.
#define D3DCOMPILE_DEBUG 1 //TODO: Remove from production build.
#define WIN32_LEAN_AND_MEAN
#include <windows.h>   // MUST be before d3d12.h
#include <d3d12.h> //Main DirectX12 API. Included from %WindowsSdkDir\Include%WindowsSDKVersion%\\um
// Helper structures library. MIT Licensed. Added to the project as a git submodule.
//https://github.com/microsoft/DirectX-Headers/blob/main/include/directx/d3dx12.h
#include <d3dx12.h>
#include <dxgi1_6.h>
#include <dxgidebug.h>
#include <wrl.h>
#include <d3dcompiler.h>
#include <DirectXMath.h> //Where from? https://github.com/Microsoft/DirectXMath ?
#include <vector>
#include <string>
#include <unordered_map>
#include <random>
#include <ctime>
#include <iostream>
#include <thread>
#include <chrono>
#include <map>
#include <list>
// Standard headers for types used directly by the declarations in this file.
#include <atomic>
#include <condition_variable>
#include <cstdint>
#include <memory>
#include <mutex>
#include <optional>
#include <queue>
#include <stdexcept>

#include "ConstantsApplication.h"
#include "MemoryManagerGPU.h"
#include "UserInterface-DirectX12.h"
#include "डेटा.h"

using namespace Microsoft::WRL;

// DirectX12 Libraries.
#pragma comment(lib, "d3d12.lib") //%WindowsSdkDir\Lib%WindowsSDKVersion%\\um\arch
#pragma comment(lib, "dxgi.lib")
#pragma comment(lib, "d3dcompiler.lib")
#pragma comment(lib, "dxguid.lib")

/* Double buffering is preferred for CAD applications due to its low input lag. Caveat: if rendering time
exceeds the frame refresh interval, then stuttering will appear. However,
low input latency outweighs the slightly smoother frames of triple buffering.
Double buffering (2x) also needs one third less buffer memory than triple buffering (3x). */
const UINT FRAMES_PER_RENDERTARGETS = 2; // Initially we are going with double buffering.

// Constants
constexpr UINT64 MaxVertexBufferSize = 1024 * 1024 * 64; // 64 MB
constexpr UINT64 MaxIndexBufferSize = 1024 * 1024 * 16; // 16 MB

// Represents the complete geometry and index data associated with 1 engineering object.
// This structure holds information about a resource allocated in GPU memory (VRAM)
struct GpuResourceVertexIndexInfo {
    ComPtr<ID3D12Resource> vertexBuffer;
    D3D12_VERTEX_BUFFER_VIEW vertexBufferView;
    ComPtr<ID3D12Resource> indexBuffer;
    D3D12_INDEX_BUFFER_VIEW indexBufferView;
    UINT indexCount;
    uint32_t matrixIndex = 0;

    // TODO: Later on we will generalize this structure to hold textures, materials, shaders etc.
    // Currently we are letting the driver manage GPU memory fragmentation. Later we will manage it ourselves.
    //uint64_t vramOffset; // Simulated VRAM address
    //uint64_t size;
    // In a real DX12 app, this would hold ID3D12Resource*, D3D12_VERTEX_BUFFER_VIEW, etc.
};

struct IndirectCommand { // OPTIMIZED Indirect Command
    uint32_t matrixIndex; // 4 Bytes (Root Constant b1)
    // Since the Jumbo buffer (or pages in future) remains the same, we bind it once.
    // REMOVED: D3D12_VERTEX_BUFFER_VIEW vbv (Saved 16 Bytes)
    // REMOVED: D3D12_INDEX_BUFFER_VIEW  ibv (Saved 16 Bytes)
    D3D12_DRAW_INDEXED_ARGUMENTS drawArguments; // 20 Bytes
}; // Total size: 24 Bytes (down from 56 Bytes!)
static_assert(sizeof(IndirectCommand) == 24, "IndirectCommand must be exactly 24 bytes.");
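
The 24-byte figure can be checked without the Windows SDK by mirroring D3D12_DRAW_INDEXED_ARGUMENTS as five 32-bit fields. The sketch below also shows how one object's byte offsets inside a shared page could be turned into draw arguments. DrawIndexedArgsSketch, IndirectCommandSketch and MakeCommandSketch are hypothetical names, and the sketch assumes that the vertex and index buffer views both start at the base of the page and that one instance is drawn per object.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Portable stand-in for D3D12_DRAW_INDEXED_ARGUMENTS (five 32-bit fields = 20 bytes).
struct DrawIndexedArgsSketch {
    uint32_t IndexCountPerInstance;
    uint32_t InstanceCount;
    uint32_t StartIndexLocation;
    int32_t  BaseVertexLocation;
    uint32_t StartInstanceLocation;
};

struct IndirectCommandSketch {
    uint32_t matrixIndex;                 // 4 bytes, consumed as a root constant
    DrawIndexedArgsSketch drawArguments;  // 20 bytes
};
static_assert(sizeof(DrawIndexedArgsSketch) == 20, "five 32-bit fields");
static_assert(sizeof(IndirectCommandSketch) == 24, "4 + 20 bytes, no padding");

// Converts byte offsets within the page into element offsets for the draw call.
// Assumption: offsets are exact multiples of the respective strides.
inline IndirectCommandSketch MakeCommandSketch(uint32_t matrixIndex, uint32_t indexCount,
                                               uint32_t indexByteOffset, uint32_t vertexByteOffset,
                                               uint32_t indexStride, uint32_t vertexStride) {
    IndirectCommandSketch cmd{};
    cmd.matrixIndex = matrixIndex;
    cmd.drawArguments.IndexCountPerInstance = indexCount;
    cmd.drawArguments.InstanceCount = 1;
    cmd.drawArguments.StartIndexLocation = indexByteOffset / indexStride;
    cmd.drawArguments.BaseVertexLocation = static_cast<int32_t>(vertexByteOffset / vertexStride);
    cmd.drawArguments.StartInstanceLocation = 0;
    return cmd;
}
```

Because the buffers are bound once per page, each command only has to carry the per-object matrix index and offsets, which is where the saving from 56 to 24 bytes comes from.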

/* Page Metadata: GeometryPlacementRecordInPage (CPU-side only).
One entry per geometry object inside a GeometryPage. Used by the Copy Thread for defragmentation,
rebuilds, and future features (frustum culling, ray-cast selection, LOD, etc.).
Total size = 64 bytes: 56 bytes of IDs, offsets and AABB, plus the isDeleted flag padded to 8-byte alignment. */
struct GeometryPlacementRecordInPage {
    uint64_t objectID;           // Unique 64-bit ID across entire process (unchanged)

    // Byte offsets into this page's vertex/index buffers (page max = 4 MB → uint32_t is safe)
    // Vertex region (grows upward)
    uint32_t vertexByteOffset; // Start of this object's vertices in the page (bytes)
    uint32_t vertexSize;       // In bytes

    // Index region (grows downward)
    uint32_t indexByteOffset;    // Start of this object's indices in the page (bytes)
    uint32_t indexSize;          // In bytes

    uint32_t indexCount;         // Number of indices (not bytes), for ExecuteIndirect
    uint32_t matrixIndex;        // Index into the per-tab WorldMatrix structured buffer

    // Axis-Aligned Bounding Box (AABB), stored as float32 only (24 bytes total).
    // Always present for future use (frustum culling, selection, etc.).
    // Set to {0,0,0} / {0,0,0} if we don't need it yet.
    float minX, minY, minZ, maxX, maxY, maxZ; // Minimum corner (X,Y,Z), maximum corner (X,Y,Z)

    // 1-byte flag at offset 56. The compiler pads the struct from 57 to 64 bytes (8-byte alignment).
    bool isDeleted = false; // Marked for deletion (soft delete, for defragmentation)
};

static_assert(sizeof(GeometryPlacementRecordInPage) == 64,
    "GeometryPlacementRecordInPage must be exactly 64 bytes for optimal cache/line usage.");
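
The size arithmetic behind that static_assert can be reproduced with a plain mirror of the record. This is a sketch only: PlacementRecordSketch and kFieldBytesBeforeFlag are hypothetical names, and the result assumes a typical 64-bit target where uint64_t is 8-byte aligned.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical mirror of GeometryPlacementRecordInPage, with the same field order.
struct PlacementRecordSketch {
    uint64_t objectID;                                        //  8 bytes
    uint32_t vertexByteOffset, vertexSize;                    //  8 bytes
    uint32_t indexByteOffset, indexSize;                      //  8 bytes
    uint32_t indexCount, matrixIndex;                         //  8 bytes
    float minX, minY, minZ, maxX, maxY, maxZ;                 // 24 bytes
    bool isDeleted = false;                                   //  1 byte + 7 bytes tail padding
};

// 8 (ID) + 6 * 4 (offsets, sizes, counts) + 6 * 4 (AABB floats) = 56 bytes before the flag.
constexpr std::size_t kFieldBytesBeforeFlag = 8 + 6 * 4 + 6 * 4;

static_assert(sizeof(PlacementRecordSketch) == 64, "57 bytes rounded up to a multiple of 8");
```

One record therefore fills exactly one 64-byte cache line on common x86-64 hardware, and seven spare bytes remain after the flag for future fields.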

struct GeometryPage {
    // GPU RESOURCES. Single unified 4 MB buffer.
    Microsoft::WRL::ComPtr<ID3D12Resource> buffer; // Layout: [Vertex Region ↑][Free Space][Index Region ↓]
    Microsoft::WRL::ComPtr<ID3D12Resource> indirectBuffer; // ExecuteIndirect argument buffer for this page
    uint32_t indirectCount = 0; // Number of valid indirect draw commands

    // ALLOCATION STATE (CPU-side only)
    uint32_t vertexHead = 0; // Vertex region grows upward from 0
    // Index region grows downward from pageSize
    uint32_t indexTail = 0;  // Initialized to pageSize
    uint32_t pageSize = 0;   // Typically 4 * 1024 * 1024
    static constexpr uint32_t SAFETY_GAP = 64; // Alignment guard between the two regions

    // FRAGMENTATION TRACKING
    uint32_t liveBytes  = 0;   // Actively used bytes
    uint32_t holeBytes  = 0;   // Deleted object space
    uint32_t objectCount = 0;  // Active objects

    // VERSIONING & LIFETIME CONTROL
    uint32_t version = 0;                // Incremented on rebuild
    std::atomic<bool> published = false; // Immutable once true
    uint64_t retireFence = 0; // Fence value after which this page is safe to destroy

    std::vector<GeometryPlacementRecordInPage> objects; // CPU METADATA (NO GEOMETRY STORED)

    // UTILITY
    // Returns true when the incoming vertex and index data cannot both fit in the free space.
    bool IsFull(uint32_t incomingVertexBytes, uint32_t incomingIndexBytes) const {
        // Guard first: if incomingIndexBytes > indexTail, indexTail - incomingIndexBytes wraps to a huge value.
        if (incomingIndexBytes > indexTail) return true;
        uint32_t alignedVertexHead = AlignUp(vertexHead, 16);
        uint32_t alignedIndexTail  = AlignDown(indexTail - incomingIndexBytes, 4);
        // Sum in 64 bits so a very large incomingVertexBytes cannot wrap around uint32_t.
        uint64_t requiredTop = uint64_t(alignedVertexHead) + incomingVertexBytes + SAFETY_GAP;
        return requiredTop > alignedIndexTail;
    }

    // Both helpers assume that alignment is a power of two.
    static uint32_t AlignUp(uint32_t value, uint32_t alignment) {
        return (value + alignment - 1) & ~(alignment - 1);
    }

    static uint32_t AlignDown(uint32_t value, uint32_t alignment) {
        return value & ~(alignment - 1);
    }
};

struct BigGeometryObject {
    Microsoft::WRL::ComPtr<ID3D12Resource> buffer;
    Microsoft::WRL::ComPtr<ID3D12Resource> indirectBuffer;
    uint32_t indexCount = 0;
    uint32_t matrixIndex = 0;
    uint64_t retireFence = 0;
    std::atomic<bool> published = false;
};

struct GeometryPageSnapshot { // A lightweight, immutable snapshot of the current pages.
    // We use raw pointers here because the Render thread only needs to observe them.
    // Iterating over a contiguous array of pointers is extremely cache-friendly.
    std::vector<GeometryPage*> pages;
};

struct TabGeometryStorage {
    // THE RCU POINTER: Render threads read this, Copy thread writes to it.
    std::atomic<GeometryPageSnapshot*> activeSnapshot{ nullptr };
    // WRITER-ONLY STATE: Only the Copy thread touches these, so they need no locks/atomics.
    std::vector<std::unique_ptr<GeometryPage>> activePages; // Actually owns the memory

    // Cleanup queues for the Copy thread
    struct RetiredSnapshot { GeometryPageSnapshot* snapshot; uint64_t retireFence; };
    struct RetiredPage { std::unique_ptr<GeometryPage> page; uint64_t retireFence; };
    std::vector<RetiredSnapshot> retiredSnapshots;
    std::vector<RetiredPage> retiredPages;

    /* TODO: An RCU version of each of the following vectors needs to be developed. Only the 1st is done so far.
    std::vector<std::unique_ptr<GeometryPage>> opaquePages; // Opaque geometry pages
    std::vector<std::unique_ptr<GeometryPage>> transparentPages; // Transparent geometry pages
    std::vector<std::unique_ptr<GeometryPage>> wireframePages; // Wireframe pages (if used)
    std::vector<std::unique_ptr<BigGeometryObject>> bigObjects; // Dedicated large objects
    std::atomic<uint32_t> currentVersion = 0;
    std::vector<std::unique_ptr<GeometryPage>> retiredPages;
    */
};

/* DirectX 12 resources are organized at 3 levels:
1. The Data   : Per Tab (Jumbo Buffers for geometry data, materials, textures,
    Pipeline State Object, Root Signature, Command Signature etc.)
2. The Target : Per Window (Swap Chain, Render Targets, Depth Stencil Buffer etc.)
3. The Worker : Per Render Thread, 1 for each monitor (Command Queue, Command List etc.;
    resources shared across multiple windows on the same monitor) */

struct DX12ResourcesPerTab { // (The Data) Geometry Data

    // Upload Heaps (CPU -> GPU Transfer)
    // Moved here because the Copy Thread writes to these when adding objects to the TAB.
    ComPtr<ID3D12Resource> vertexBufferUpload;
    ComPtr<ID3D12Resource> indexBufferUpload;

    // Persistent Mapped Pointers (CPU Address)
    UINT8* pVertexDataBegin = nullptr; // Pointer for mapped vertex upload buffer
    UINT8* pIndexDataBegin = nullptr;  // Pointer for mapped index upload buffer

    // TODO: We will generalize this to hold materials, shaders, textures etc. unique to this project/tab
    ComPtr<ID3D12DescriptorHeap> srvHeap;

    mutable std::mutex objectsOnGPUMutex; // Mutable so that const references can lock it in rendering paths.
    // Copy thread updates the following map whenever it adds/removes/modifies an object on the GPU.
    std::map<uint64_t, GpuResourceVertexIndexInfo> objectsOnGPU;

    // Copy thread owns/writes the following variables exclusively. Render threads only read them, without a lock.
    ComPtr<ID3D12Resource> worldMatrixBuffer; // TODO: Double-buffer it per frame.
    UINT8* pWorldMatrixDataBegin = nullptr;
    uint32_t               matrixCapacity = 4096;
    uint32_t               matrixCount = 0;
    std::vector<uint32_t>  freeMatrixSlots;   // Free-list of matrix indices,
    // to enable re-use of slots when objects are removed.

    // Initially rootSignature & pipelineState were in PerWindow. They were moved here
    // when commandSignature and the indirect drawing infrastructure were added,
    // since Root Signature and Pipeline State are closely tied to the command signature.
    ComPtr<ID3D12RootSignature> rootSignature;
    ComPtr<ID3D12PipelineState> pipelineState;

    ComPtr<ID3D12CommandSignature> commandSignature; // Indirect Drawing

    CameraState camera; // Reference is updated per frame.
    // Currently per tab, but later we will have this per view, since each tab can have multiple views.
};

struct DX12ResourcesPerWindow { // Presentation Logic
    int WindowWidth = 800; // Current viewport (rendering area) size, excluding task-bar etc.
    int WindowHeight = 600;
    ID3D12CommandQueue* creatorQueue = nullptr; // Track which queue this window was created with,
    // to assist with migrations.

    ComPtr<IDXGISwapChain3>         swapChain; // The link to the OS Window
    //ComPtr<ID3D12CommandQueue>    commandQueue; // Moved to OneMonitorController
    ComPtr<ID3D12DescriptorHeap>    rtvHeap;
    ComPtr<ID3D12Resource>          renderTargets[FRAMES_PER_RENDERTARGETS];

    // Render To Texture Infrastructure
    ComPtr<ID3D12Resource>          renderTextures[FRAMES_PER_RENDERTARGETS];
    ComPtr<ID3D12DescriptorHeap>    rttRtvHeap;
    ComPtr<ID3D12DescriptorHeap>    rttSrvHeap;

    // TODO: When we implement HDR support, we will have to change the above format to the following.
    //DXGI_FORMAT                     rttFormat = DXGI_FORMAT_R16G16B16A16_FLOAT; // HDR ready

    ComPtr<ID3D12Resource> depthStencilBuffer; // Depth Buffer (sized to the window dimensions)
    ComPtr<ID3D12DescriptorHeap> dsvHeap;

    D3D12_VIEWPORT viewport; // Viewport & Scissor (dependent on window size).
    D3D12_RECT scissorRect;

    ComPtr<ID3D12Resource> constantBuffer;
    ComPtr<ID3D12DescriptorHeap> cbvHeap;
    UINT8* cbvDataBegin = nullptr;

    UINT frameIndex = 0; // Remember this is different from allocatorIndex in the Render Thread.
    // It can change even during a window resize.
};

struct DX12ResourcesPerRenderThread { // One instance is created per monitor.
    // For convenience only. It simply points to OneMonitorController.commandQueue
    ComPtr<ID3D12CommandQueue> commandQueue;

    // Note that there are as many render threads as there are monitors attached.
    // Command Allocators MUST be unique to the thread.
    // We need one per frame-in-flight to avoid resetting an allocator while the GPU is still reading from it.
    ComPtr<ID3D12CommandAllocator> commandAllocators[FRAMES_PER_RENDERTARGETS];
    UINT allocatorIndex = 0; // Remember this is different from the frameIndex available per Window.

    // The Command List (the recording pen). Can be reset and reused for multiple windows within the same frame.
    ComPtr<ID3D12GraphicsCommandList> commandList;

    // Synchronization (Per Window VSync)
    HANDLE fenceEvent = nullptr;
    ComPtr<ID3D12Fence> fence; // TODO: Discard this. Use the fence inside the monitor controller.
};
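
The "one command allocator per frame-in-flight" rule above can be modeled on the CPU alone. The sketch below is a simplified simulation, not the author's render loop. `kFramesInFlight`, `AllocatorRing`, and the plain integer standing in for the fence's completed value are all assumptions made for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed value for illustration; the real code uses FRAMES_PER_RENDERTARGETS.
constexpr std::size_t kFramesInFlight = 3;

// Tracks, for each allocator slot, the fence value that must be reached
// before the slot may be reset and recorded into again.
class AllocatorRing {
public:
    // True when the GPU has finished the frame that last used the current slot.
    bool currentSlotIsIdle(std::uint64_t completedFenceValue) const {
        return completedFenceValue >= pendingFence[allocatorIndex];
    }

    // Called after submitting a frame: remember which fence value guards the
    // slot just used, then advance to the next slot.
    void submitFrame(std::uint64_t signalledFenceValue) {
        pendingFence[allocatorIndex] = signalledFenceValue;
        allocatorIndex = (allocatorIndex + 1) % kFramesInFlight;
    }

    std::size_t index() const { return allocatorIndex; }

private:
    std::array<std::uint64_t, kFramesInFlight> pendingFence{}; // 0 = never used
    std::size_t allocatorIndex = 0;
};
```

In a real render thread, a false result from `currentSlotIsIdle` would mean waiting on the fence event before resetting that allocator.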

struct OneMonitorController { // Variables stored per monitor.
    // System Fetched information.
    bool isScreenInitalized = false;
    int screenPixelWidth = 800;
    int screenPixelHeight = 600;
    int screenPhysicalWidth = 0; // in mm
    int screenPhysicalHeight = 0; // in mm
    int WindowWidth = 800; // Current viewport (rendering area) size, excluding task-bar etc.
    int WindowHeight = 600;

    HMONITOR hMonitor = NULL; // Monitor handle. Remains fixed as long as monitor is not disconnected / disabled.
    std::wstring monitorName;            // Monitor device name (e.g., "\\\\.\\DISPLAY1")
    std::wstring friendlyName;           // Human readable name (e.g., "Dell U2720Q")
    RECT monitorRect;                    // Full monitor rectangle
    RECT workAreaRect;                   // Work area (excluding task bar)
    int dpiX = 96;                       // DPI X
    int dpiY = 96;                       // DPI Y
    double scaleFactor = 1.0;            // Scale factor (100% = 1.0, 125% = 1.25, etc.)
    bool isPrimary = false;              // Is this the primary monitor?
    DWORD orientation = DMDO_DEFAULT;    // Monitor orientation
    int refreshRate = 60;                // Refresh rate in Hz
    int colorDepth = 32;                 // Color depth in bits per pixel

    bool isVirtualMonitor = false;       // To support headless mode.

    // DirectX12 Resources.
    // TODO: Move these to per render thread structure.
    ComPtr<ID3D12CommandQueue> commandQueue;    // Persistent. Survives thread restarts.
    bool hasActiveThread = false; // We need to know if this specific monitor is currently being serviced by a thread
    ComPtr<ID3D12Fence> renderFence; // Signalled each frame by GpuRenderThread
    uint64_t renderFenceValue = 0; // Last value signalled (written by render thread)
    // Above is intentionally NOT std::atomic since gpu.renderFenceValue is the std::atomic serving all monitors.
    HANDLE renderFenceEvent = nullptr;
};
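
For the dpiX / dpiY / scaleFactor fields above: Windows treats 96 DPI as 100% scaling, so the scale factor is conventionally the ratio dpi / 96. The sketch below shows that arithmetic. The helper names are hypothetical and do not appear in the original code.

```cpp
#include <cassert>
#include <cmath>

// 96 DPI corresponds to 100% scaling on Windows.
constexpr double kBaselineDpi = 96.0;

// Hypothetical helper: converts a monitor DPI into a UI scale factor.
inline double ScaleFactorFromDpi(int dpi) {
    return static_cast<double>(dpi) / kBaselineDpi;
}

// Hypothetical helper: converts a length in device-independent pixels
// into physical pixels for a monitor with the given DPI, rounding to nearest.
inline int DipToPixels(int dip, int dpi) {
    return static_cast<int>(std::lround(dip * ScaleFactorFromDpi(dpi)));
}
```

For example, a 120 DPI monitor gives a scale factor of 1.25, matching the "125% = 1.25" note in the struct.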

// Commands sent from Generator thread(s) to the Copy thread
enum class CommandToCopyThreadType { NONE = 0, ADD, MODIFY, REMOVE };
struct CommandToCopyThread
{
    CommandToCopyThreadType type;
    std::optional<GeometryData> geometry; // Present for ADD and MODIFY
    uint64_t id = 0; // Always present
    uint64_t tabID = 0; // NEW: We must know which tab this object belongs to!
};

extern std::atomic<bool> pauseRenderThreads; // Defined in Main.cpp

// Packet of work for a Render Thread for one frame
struct RenderPacket {
    uint64_t frameNumber;
    std::vector<uint64_t> visibleObjectIds;
};

class HrException : public std::runtime_error // Simple exception helper for HRESULT checks
{
public:
    HrException(HRESULT hr) : std::runtime_error("HRESULT Exception"), hr(hr) {}
    HRESULT Error() const { return hr; }
private:
    const HRESULT hr;
};

inline void ThrowIfFailed(HRESULT hr) {
    if (FAILED(hr)) { throw HrException(hr); }
}
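
HrException above always reports the fixed text "HRESULT Exception". One common refinement is to embed the failing code, in hexadecimal, in the message. The sketch below assumes a stand-in type `HResultCode` (a 32-bit signed integer, like HRESULT) so that it compiles without the Windows headers. `HrToString`, `HrExceptionVerbose`, and `ThrowIfFailedVerbose` are illustrative names only.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <stdexcept>
#include <string>

using HResultCode = std::int32_t; // Stand-in for HRESULT (assumption, for portability).

// Formats the code as an 8-digit hexadecimal string, e.g. "HRESULT 0x80004005".
inline std::string HrToString(HResultCode hr) {
    char buffer[32] = {};
    std::snprintf(buffer, sizeof(buffer), "HRESULT 0x%08lX",
                  static_cast<unsigned long>(static_cast<std::uint32_t>(hr)));
    return std::string(buffer);
}

class HrExceptionVerbose : public std::runtime_error {
public:
    explicit HrExceptionVerbose(HResultCode hr)
        : std::runtime_error(HrToString(hr)), code(hr) {}
    HResultCode Error() const { return code; }
private:
    const HResultCode code;
};

// Negative values indicate failure, mirroring what the FAILED() macro checks.
inline void ThrowIfFailedVerbose(HResultCode hr) {
    if (hr < 0) { throw HrExceptionVerbose(hr); }
}
```

A log line containing 0x80004005 (E_FAIL) or a device-removed code is usually far quicker to diagnose than a generic message.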

class ThreadSafeQueueGPU {
public:
    void push(CommandToCopyThread value) {
        std::lock_guard<std::mutex> lock(mutex);
        fifoQueue.push(std::move(value));
        cond.notify_one();
    }

    // Non-blocking pop
    bool try_pop(CommandToCopyThread& value) {
        std::lock_guard<std::mutex> lock(mutex);
        if (fifoQueue.empty()) { return false; }
        value = std::move(fifoQueue.front());
        fifoQueue.pop();
        return true;
    }

    // Shuts down the queue, waking up any waiting threads
    void shutdownQueue() {
        std::lock_guard<std::mutex> lock(mutex);
        shutdown = true;
        cond.notify_all();
    }

private:
    std::queue<CommandToCopyThread> fifoQueue; // fifo = First-In First-Out
    std::mutex mutex;
    std::condition_variable cond;
    bool shutdown = false;
};

inline ThreadSafeQueueGPU g_gpuCommandQueue;
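
ThreadSafeQueueGPU declares `cond` and `shutdown`, but the listing only shows a non-blocking `try_pop`. A blocking pop that uses both could look like the following. This is a sketch on a generic element type. `BlockingQueue` and `wait_and_pop` are hypothetical names, not functions from the original class.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <utility>

template <typename T>
class BlockingQueue {
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(mutex);
            fifoQueue.push(std::move(value));
        }
        cond.notify_one(); // Notify after unlocking so the woken thread can take the lock at once.
    }

    // Blocks until an item is available or the queue is shut down.
    // Returns std::nullopt only when shut down AND empty, so queued work is drained first.
    std::optional<T> wait_and_pop() {
        std::unique_lock<std::mutex> lock(mutex);
        cond.wait(lock, [this] { return shutdown || !fifoQueue.empty(); });
        if (fifoQueue.empty()) { return std::nullopt; }
        T value = std::move(fifoQueue.front());
        fifoQueue.pop();
        return value;
    }

    // Wakes every waiting thread; later pops return std::nullopt once the queue is empty.
    void shutdownQueue() {
        {
            std::lock_guard<std::mutex> lock(mutex);
            shutdown = true;
        }
        cond.notify_all();
    }

private:
    std::queue<T> fifoQueue;
    std::mutex mutex;
    std::condition_variable cond;
    bool shutdown = false;
};
```

The predicate form of `wait` guards against spurious wake-ups, and draining before returning `std::nullopt` means a REMOVE command queued just before shutdown is still processed.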

// VRAM Manager : This class handles the GPU memory dynamically.
// There will be exactly 1 object of this class in the entire application. Hence the special name.
// May Lord Shankar's grace remain with us. The corresponding object is named "gpu".
class शंकर {
public:
    OneMonitorController screens[MV_MAX_MONITORS];
    int currentMonitorCount = 0; // Global monitor count. It can be 0 when no monitors are found (headless mode)

    // IDXGIFactory6 / IDXGIAdapter4 Prerequisite : Windows 10 1803+ / Windows 11
    ComPtr<IDXGIFactory6> factory6; // The OS-level display system manager. Can iterate over GPUs.
    ComPtr<IDXGIAdapter4> hardwareAdapter; // Represents a physical GPU device.

    // Represents 1 logical GPU device on the above GPU adapter.
    // Helps create all DirectX12 memory / resources / commands etc.
    ComPtr<ID3D12Device> device; // Very Important: We support EXACTLY 1 GPU device only in this version.
    bool isGPUEngineInitialized = false; // TODO: To be implemented.
    DXGI_FORMAT rttFormat = DXGI_FORMAT_R8G8B8A8_UNORM;

    DX12ResourcesUI uiResources;

    // The following are to be added later.
    //ID3D12DescriptorHeapMgr    ← Global descriptor allocator
    //Shader & PSO Cache         ← Shared by all threads
    //AdapterInfo                ← For device selection / VRAM stats

    /* We will have 1 Render Queue per monitor, which is local to the Render Thread.
    IMPORTANT: A GPU typically exposes only 1 graphics (3D) hardware engine, so it executes graphics work
    from one command list at a time. Even if 4 command lists are submitted to 4 independent queues,
    the graphics driver / WDDM serializes them.
    Still, we need 4 separate queues to properly handle different refresh rates.

    Ex: If we put all 4 windows on the same queue: Window A (60Hz) submits a Present command. The queue STALLS
    waiting for Monitor A's VSync interval. Window B (144Hz) submits a draw command.
    Window B cannot be processed because the queue is blocked by Window A's VSync wait.
    By using 4 queues, Queue A can sit blocked waiting for VSync,
    while Queue B immediately pushes work to the GPU for the faster monitor. */

    std::atomic<uint64_t> renderFenceValue = 0; // Global. This is in addition to the per-monitor render fence value.

    ComPtr<ID3D12CommandQueue> copyCommandQueue; // There is only 1 across the application.
    ComPtr<ID3D12Fence> copyFence; // Synchronization for Copy Queue
    std::atomic<uint64_t> copyFenceValue = 1; // Thread safe.
    // Starts from 1 to avoid confusion with the default fence value of 0.
    HANDLE copyFenceEvent = nullptr;

public:
    // Maps our CPU ObjectID to its resource info in VRAM
    std::unordered_map<uint64_t, GpuResourceVertexIndexInfo> resourceMap;

    // Simulates a simple heap allocator with 16MB chunks
    uint64_t m_nextFreeOffset = 0;
    const uint64_t CHUNK_SIZE = 16 * 1024 * 1024;
    uint64_t m_vram_capacity = 4 * CHUNK_SIZE; // Simulate 64MB VRAM

    // When an object is updated, the old VRAM is put here to be freed later.
    struct DeferredFree {
        uint64_t frameNumber; // The frame it became obsolete
        GpuResourceVertexIndexInfo resource;
    };
    std::list<DeferredFree> deferredFreeQueue;

    // Allocate space in VRAM. Returns the handle. What is this used for?
    // std::optional<GpuResourceVertexIndexInfo> Allocate(size_t size);

    // Descriptor sizes for RTV and CBV/SRV/UAV. We need these to calculate offsets in descriptor heaps.
    // They are initialized during device creation and remain constant, i.e. they are hardware properties of the GPU.
    // We store them here for easy access across threads.
    UINT rtvDescriptorSize = 0, cbvSrvUavDescriptorSize = 0; // Initialized during device creation.

    void ProcessDeferredFrees(uint64_t lastCompletedRenderFrame);

    //शंकर() {}; // Our Main function initializes DirectX12 global resources by calling InitD3DDeviceOnly().
    void InitD3DDeviceOnly();
    void InitD3DPerTab(DX12ResourcesPerTab& tabRes); // Call this when a new Tab is created
    void InitD3DPerWindow(DX12ResourcesPerWindow& dx, HWND hwnd, ID3D12CommandQueue* commandQueue);
    void PopulateCommandList(ID3D12GraphicsCommandList* cmdList, // Called by the per-monitor render thread.
        DX12ResourcesPerWindow& winRes, const DX12ResourcesPerTab& tabRes, TabGeometryStorage& storage);
    void WaitForPreviousFrame(const DX12ResourcesPerRenderThread& dx);
    void ResizeD3DWindow(DX12ResourcesPerWindow& dx, UINT newWidth, UINT newHeight);

    // Called when a monitor is unplugged or a window is destroyed. Destroys SwapChain/RTVs but KEEPS Geometry.
    void CleanupWindowResources(DX12ResourcesPerWindow& winRes);
    // Called when a TAB is closed by the user. Destroys the Jumbo Vertex/Index Buffers.
    void CleanupTabResources(DX12ResourcesPerTab& tabRes);
    // Called ONLY at application exit (end of wWinMain). Destroys the Device, Factory, and Global Copy Queue.
    // Thread resources are cleaned up by the Render Thread itself before exit.
    void CleanupD3DGlobal();
};
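
ProcessDeferredFrees is only declared above. One plausible shape for it is sketched below. It assumes that entries are appended in frame order and that a resource is safe to release once the GPU has completed the frame in which it became obsolete. `Resource` and `DeferredFreeList` are simplified stand-ins introduced here, and the actual release logic is not shown in the source.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>

// Stand-in for GpuResourceVertexIndexInfo (assumption: an offset/size pair).
struct Resource { std::uint64_t offset; std::uint64_t size; };

struct DeferredFree {
    std::uint64_t frameNumber; // The frame it became obsolete
    Resource resource;
};

class DeferredFreeList {
public:
    // Queue a resource that the GPU may still be reading.
    void retire(std::uint64_t frameNumber, Resource r) {
        queue.push_back({ frameNumber, r });
    }

    // Releases every entry whose frame the GPU has finished.
    // Returns the number of bytes reclaimed.
    std::uint64_t process(std::uint64_t lastCompletedRenderFrame) {
        std::uint64_t reclaimed = 0;
        // Entries are in frame order, so stop at the first one still in flight.
        while (!queue.empty() && queue.front().frameNumber <= lastCompletedRenderFrame) {
            reclaimed += queue.front().resource.size; // Real code would return the range to the allocator here.
            queue.pop_front();
        }
        return reclaimed;
    }

    std::size_t pending() const { return queue.size(); }

private:
    std::list<DeferredFree> queue;
};
```

The key point is that the comparison uses the last frame the GPU has completed, not the last frame the CPU has submitted.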

void FetchAllMonitorDetails();
BOOL CALLBACK MonitorEnumProc(HMONITOR hMonitor, HDC hdcMonitor, LPRECT lprcMonitor, LPARAM dwData);

/*
IID_PPV_ARGS is a MACRO used in DirectX (and COM programming in general) to help safely and correctly
retrieve interface pointers during object creation or querying. It reduces repetitive typing.
COM interfaces are identified by unique GUIDs (IIDs). The macro derives the IID from the pointer's type
and converts the address of the pointer to the void** that COM functions expect.

Ex: IID_PPV_ARGS(&device) supplies two arguments, roughly equivalent to:
IID iid = __uuidof(ID3D12Device);
void** ppv = reinterpret_cast<void**>(&device);
*/

// Structure to hold transformation matrices
struct ConstantBuffer {
    DirectX::XMFLOAT4X4 viewProj;   // 64 bytes
};

// Externs for communication
extern std::atomic<bool> shutdownSignal;

// Logic Thread "Fence"
extern std::mutex g_logicFenceMutex;
extern std::condition_variable g_logicFenceCV;
extern uint64_t g_logicFrameCount;

// Copy Thread "Fence"
extern std::mutex g_copyFenceMutex;
extern std::condition_variable g_copyFenceCV;
extern uint64_t g_copyFrameCount;
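
The Logic and Copy thread "fences" above are each a mutex, a condition variable, and a monotonically increasing frame counter. Below is a sketch of how such a CPU-side fence is typically signalled and waited on. `CpuFence`, `signal`, and `waitUntil` are hypothetical names, and the timeout is an addition so that a waiter cannot hang forever during shutdown.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <mutex>

class CpuFence {
public:
    // Producer side: publish that one more frame is complete.
    // Returns the new counter value.
    std::uint64_t signal() {
        std::uint64_t value;
        {
            std::lock_guard<std::mutex> lock(fenceMutex);
            value = ++frameCount;
        }
        fenceCV.notify_all();
        return value;
    }

    // Consumer side: wait until frameCount >= target or the timeout expires.
    // Returns true if the target was reached.
    bool waitUntil(std::uint64_t target, std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(fenceMutex);
        return fenceCV.wait_for(lock, timeout, [&] { return frameCount >= target; });
    }

private:
    std::mutex fenceMutex;
    std::condition_variable fenceCV;
    std::uint64_t frameCount = 0; // Guarded by fenceMutex, so it need not be atomic.
};
```

Because the counter is only read and written under the mutex, a plain `uint64_t` is sufficient, which is consistent with the non-atomic `g_logicFrameCount` and `g_copyFrameCount` declarations.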

// TODO: Implement this. In a real allocator, we would manage free lists and possibly defragment memory.
/*
std::optional<GpuResourceVertexIndexInfo> शंकर::Allocate(size_t size) {

    if (m_nextFreeOffset + size > m_vram_capacity) {
        std::cerr << "VRAM MANAGER: Out of memory!" << std::endl;
        // Here, the Main Logic thread would be signaled to reduce LOD.
        return std::nullopt;
    }
    GpuResourceVertexIndexInfo info{ m_nextFreeOffset, size };
    m_nextFreeOffset += size; // Simple bump allocator
    return info;
}*/

// Utility Functions

// Waits for the previous frame to complete rendering.
inline void WaitForGpu(DX12ResourcesPerWindow dx)
{   // Where are we using this function?
    /*
    dx.commandQueue->Signal(dx.fence.Get(), dx.fenceValue);
    dx.fence->SetEventOnCompletion(dx.fenceValue, dx.fenceEvent);
    WaitForSingleObjectEx(dx.fenceEvent, INFINITE, FALSE);
    dx.fenceValue++;*/
}

// Waits for a specific fence value to be reached
inline void WaitForFenceValue(DX12ResourcesPerWindow dx, UINT64 fenceValue)
{ // Where are we using this?
    /*
    if (dx.fence->GetCompletedValue() < fenceValue)
    {
        ThrowIfFailed(dx.fence->SetEventOnCompletion(fenceValue, dx.fenceEvent));
        WaitForSingleObjectEx(dx.fenceEvent, INFINITE, FALSE);
    }*/
}

// Thread Functions
// Thread synchronization between Main Logic thread and Copy thread
inline std::mutex toCopyThreadMutex;
inline std::condition_variable toCopyThreadCV;
inline std::queue<CommandToCopyThread> commandToCopyThreadQueue;

// Thread Functions - Just Declaration!
void GpuCopyThread();
void GpuRenderThread(int monitorId, int refreshRate);