White Paper: Hardware Raster Acceleration Methods for Video Compression
Foreword
Video compression is used in many products and has arguably consumed more processor and memory cycles than any other data process over the last 20 years, with no downtrend in sight.
This paper is an overview of lossy video compression for applications such as streaming on-demand movies and TV, as well as Low-Loss compression for applications such as remote desktop. It then reviews the particular MiMax Inc raster-oriented Hardware (HW) Acceleration methods that apply to particular types of video data sets, types of compression, and sub-tasks of compression.
This paper quotes from other papers that cover the basic concepts of AV compression, then adds new material where MiMax splices into those known methods of compression.
Audio compression is noted only for context; many video streams have an accompanying audio track. Audio sample rates are much lower than video data rates, and keeping the two in sync through compression and playback is an additional challenge. Audio compression also uses trigonometric functions similar to those video uses for compression. In some cases, compressed video is synced with uncompressed audio.
Thanks to the many international authors of existing Audio and Video (AV) compression white papers and websites that are quoted here; this paper offers links for readers to review those entire published documents. This paper is dedicated to the memory of Atari-Amiga expert Electronics Engineer Jay Miner, for his work on display rasters, display lists and general-purpose CPU operations combined as synchronous clocked machines; and to Electrical Engineering Professor Claude Shannon, considered the father of information theory, who showed that all higher concepts of math equations and information tasks can be reduced to Boolean gate logic expressions.
The most important section of this paper is “Raster Hardware Acceleration Methods: Macroblock Compression Steps”. This paper graphically describes details of these raster hardware-oriented processes that are not often published.
Abstract
Lossy video compression is generally based on Discrete Cosine Transforms of macroblocks (macrocells, frame partitioning) of various block sizes of pixels. A popular size is 16×16 pixels. Other sizes, such as 8×8, are also used, and some systems of compression and decompression can use a mix of multiple sizes.
Lossy and Low-Loss video compression generally combines an outer open-loop system and an inner closed-loop system. The outer open loop handles video transmission control; when an Ethernet UDP-packetized network is used, a percentage of packets may be dropped, which is acceptable for live streaming video data.
Likewise, USB (as used by many USB cameras) is an isochronous-packetized network; it may also drop a percentage of packets, which is acceptable overall for live streaming video data transfers.
This paper is oriented around the local raster circuits of the inner, semi-closed-loop process of video compression, through to the transmission-ready buffers. It concentrates on technology areas where Hardware (HW) Acceleration finds the changed pixels and creates the macroblock data units; the Forward-DCT function is then applied to those macroblock groups of pixels. Many variations of this process are adjusted to improve quality or frame rate, reduce latency, or reduce wattage or cost for the compression use case.
Herein are hardware raster methods that can be applied to Commercial-Off-the-Shelf (COTS) Graphics Processor Units (GPUs) to compress video streams such as security cameras, video phones or remote desktop.
Core to these processes are particular methods to make the most use of the raster hardware registers for Start Address (SA) and Vertical Total (VT), with pixel clock gating and two hardware phase-locked rasters, to produce a Pixel Change Map (PCM) that directs only changed macroblocks to compression. Additionally, the hardware process is covered for using a single Graphics Processor Unit (GPU-IC), of the common dual-head type, to compress up to 16 virtual machine desktops at a server.
These methods reduce wattage and cost, as many of these raster hardware components are already in current GPU-ICs or require only a few added gates.
1. Hardware Applied to Lossy & Low-Loss Video Compression Latency
As noted in this WP’s Foreword, this section is brief; it provides context for the main subject of Hardware Acceleration methods for compression. Some of the drawings of both software and hardware functional steps of compression apply to either Low-Loss, low-latency compression or moderately lossy, high-latency (sometimes 10 or more frames) compression.
“Data-volume” (DA) per a given original set of frames is also titled “compression ratio” (CR). In video stream compression, CR and latency tend to trade off against each other. Latency is the time, or number of frames passing, needed to create the compressed data sets, plus the number of frames needed to decompress them for display of the video stream at a remote location.
Almost all video compression is lossy in some manner. This paper starts by organizing the standard compression methods around the variations of (1) low-latency compression and (2) high-latency compression.
The subject of video compression can easily fill a 300-page technology book, and the detailed manuals and standards needed for computer programmers and hardware engineers to practice the trade would be a set of volumes of 10,000 pages at a minimum. This paper only highlights these processes, to provide context for why hardware acceleration circuits are so important.
The Ivan-Assen Ivanov gamasutra webpage (link below) raises the subject of Moore’s law for transistor sizes versus software progression in the electronic-digital products marketplace.
https://www.gamasutra.com/view/feature/3090/image_compression_with_vector_.php?print=1
“”The famous Moore’s law, which states in rough terms that every 18 months the speed of computers doubles, has an evil twin: every 18 months software becomes twice as slow. A similar relationship can be formulated for RAM and game data: no matter how big the memory budget of your next-generation game may seem, your art team can probably fill it up faster than you can say “disk thrashing.” The appetite for art megabytes grows faster than the publisher’s willingness to raise the minimum platform requirements.””
Fortunately, Hardware Acceleration (HW-Accel), the arrangement of gates, flip-flops and FIFOs applied to video compression and playback, has beaten this evil-twin problem to a significant degree. Video movie playback and real-time compression and playback are now possible in consumer- and commercial-priced products, and power consumption is low enough for battery operation; roughly 90% of that gain comes from HW-Accel advancements.
The most common example of off-line video compression is a movie studio producing (i.e., transferring) its film master to DVD or Blu-Ray data disk. If a compression algorithm set and its processing, taking days of time, produce better Data-Volume (DA) [aka Compression Ratio (CR) or Bandwidth (BW)] and quality, that latency is an acceptable trade-off for the quality of a commercial disk.
The opposite of Off-Line video compression would be real-time with the most common examples being security cameras, video-telephone and remote-desktop.
1.1 Low Latency compression, Typically Low Compression Ratio
Lossless compression will not be discussed much herein; the terms Low-Loss, or very Low-Loss, are used instead. If perfect restoration of the video is needed later, some still frames or video streams may contain particular frame data sets that are simply not compressible.
Also, a particularly complex original video stream, with high contrast, wide-ranging rotational effects and many colors, may require extraordinarily long frame buffer memories or off-line video compression if it needs perfect restoration. However, common remote-desktop content is well suited to compression that can be near lossless. An example image is shown below.
Drawing 1: RLE/Huffman & Macroblock Encoding (low latency) RGB Desktop applicable
Where “lossless” is quoted here from published papers, it is used in this paper’s more general sense of “Low-Loss”, which also covers the term “lossless”.
Low-Loss compression for most real-time video streams is stated herein as a reduction of the data to be Transmitted-or-Stored (T-or-S) for a series of video frames. The end product is near-zero-loss, but still lossy, when the full video is reconstructed. Remote desktop is in theory an exception: given enough time, once user display activity quiets down, the rebuilt remote desktop image settles to a perfect or near-perfect reproduction of the original.
1.1.1 RLE, Delta & Huffman Coding, Dithering, Quantization
RLE, Delta and Huffman coding are data-reduction methods typically combined with quantization, as described in the image-compression handout linked below.
https://www.cse.unr.edu/~bebis/CS474/Handouts/ImageCompression.pdf
“..the “redundant” information the human eye considers perceptually insignificant is discarded. This is done using quantization. The new representation has desirable properties. The quantized data has much less variance than the original. Entropy coding is then applied to achieve further compression.”
Major performance considerations of a lossy compression scheme are: a) the compression ratio (CR), b) the signal-to-noise ratio (SNR) of the reconstructed image with respect to the original, and c) the speed of encoding and decoding. The compression ratio is given by CR = (original data size) / (compressed data size).
Drawing: Compression Ratio (CR) and RMSE Formulas
The PSNR is given by: PSNR = 20 log10(peak data value / RMSE)
where RMSE is the root mean square error, RMSE = sqrt( (1/(M*N)) * Σ [f(x,y) − f'(x,y)]² ), summed over all M*N pixels of the original image f and the reconstruction f'.
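As a concrete illustration of these metrics, below is a minimal Python sketch, assuming the original and reconstructed frames are available as 8-bit NumPy arrays; the function names are illustrative, not taken from any codec library.

```python
# Minimal sketch (illustrative names): compression ratio, RMSE and PSNR for
# 8-bit frames held as NumPy arrays of identical shape.
import numpy as np

def compression_ratio(original_bytes, compressed_bytes):
    # CR = size of the original data / size of the compressed data
    return original_bytes / compressed_bytes

def rmse(original, reconstructed):
    # Root mean square error over all pixels
    diff = original.astype(np.float64) - reconstructed.astype(np.float64)
    return float(np.sqrt(np.mean(diff ** 2)))

def psnr(original, reconstructed, peak=255.0):
    # PSNR = 20 * log10(peak / RMSE); identical images give infinite PSNR
    e = rmse(original, reconstructed)
    return float('inf') if e == 0 else 20.0 * np.log10(peak / e)
```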
https://en.wikipedia.org/wiki/Quantization_(image_processing)
Color quantization:
Color quantization reduces the number of colors used in an image. This is important for displaying images on devices that support a limited number of colors and for efficiently compressing certain kinds of images. Most bitmap editors and many operating systems have built-in support for color quantization. Popular modern color quantization algorithms include the nearest color algorithm (for fixed palettes), the median cut algorithm, and an algorithm based on octrees.
It is common to combine color quantization with dithering to create an impression of a larger number of colors and eliminate banding artifacts.
Frequency quantization for image compression: The human eye is fairly good at seeing small differences in brightness over a relatively large area, but not so good at distinguishing the exact strength of a high frequency (rapidly varying) brightness variation.
Quantization matrices: A typical video codec works by breaking the picture into discrete blocks (8×8 pixels in the case of MPEG[1]). These blocks can then be subjected to discrete cosine transform (Forward-DCT) to calculate the frequency components, both horizontally and vertically.
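To illustrate the quantization-matrix idea quoted above, here is a minimal Python sketch of a forward DCT on an 8×8 block followed by division by a quantization matrix; the matrix values below are illustrative placeholders, not those of MPEG or JPEG.

```python
# Illustrative sketch: 2-D forward DCT of one 8x8 block, then quantization.
# The matrix Q below is a placeholder (coarser steps at higher frequencies),
# not the quantization matrix of MPEG or JPEG.
import numpy as np
from scipy.fft import dctn, idctn

Q = np.fromfunction(lambda u, v: 8.0 + 4.0 * (u + v), (8, 8))

def quantize_block(block, q_matrix=Q):
    # Level-shift, 2-D DCT-II, divide by the matrix, round: rounding is the lossy step.
    coeffs = dctn(block.astype(np.float64) - 128.0, type=2, norm='ortho')
    return np.round(coeffs / q_matrix).astype(np.int32)

def dequantize_block(q_coeffs, q_matrix=Q):
    # Multiply back and inverse-transform to reconstruct the (approximate) block.
    return idctn(q_coeffs.astype(np.float64) * q_matrix, type=2, norm='ortho') + 128.0
```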
Run Length Encoding (RLE) is perhaps the simplest form of data reduction, but it works well only on frames of imagery with the qualities typical of human-engineered images, such as technical drawings, flow charts, cartoons (including those that have been anti-aliased), and computer desktops of text and icons.
Huffman encoding (which needs a histogram to find the most-used deep-color pixel values), combined with run-length encoding (RLE) and Delta encoding (changes from sample to sample), would be the most common baseline combination of three methods.
The Delta method is typically used more on scientific data, which tends to have a high sample rate with small sample-to-sample value changes compared to the max and min values. Combinations of RLE and Huffman are used more on remote desktop displays. A minimal sketch of the RLE and Delta building blocks follows.
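A minimal Python sketch of RLE and Delta encoding is below; real codecs would feed the resulting runs and deltas into a Huffman (entropy) coder.

```python
# Minimal sketch: run-length and delta encoding of one scanline of pixel values.
# Real use would follow these with Huffman coding of the resulting symbols.
def rle_encode(line):
    # Emit (value, run_length) pairs for consecutive identical pixels.
    runs = []
    for value in line:
        if runs and runs[-1][0] == value:
            runs[-1] = (value, runs[-1][1] + 1)
        else:
            runs.append((value, 1))
    return runs

def delta_encode(samples):
    # Keep the first sample, then store only the change from sample to sample.
    return samples[:1] + [b - a for a, b in zip(samples, samples[1:])]

print(rle_encode([0, 0, 0, 0, 255, 255, 0, 0]))  # [(0, 4), (255, 2), (0, 2)]
print(delta_encode([10, 11, 11, 12, 15]))        # [10, 1, 0, 1, 3]
```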
Some PC display graphics cards and drivers can set up the display to use different memory planes for the desktop background and the foreground icons. Often the mouse-pointer function is carried out more like a video game system, as a hardware sprite (see other MM white papers on graphics and gaming sprites).
Compression that works with knowledge of how the video driver (the code that controls the display hardware for the OS) operates can greatly improve the speed and quality of compression and decompression. However, issues can arise if this cooperative relationship between the compression/decompression code, the video driver and the video hardware is not thoroughly debugged.
Potentially a complex system could use the faster, more complex methods, and then also at regular intervals, when the desktop changes calm down, use a slower method to check for a perfect match between the local and remote PC.
This is often done by capturing a full frame of the video stream to produce a high-quality checksum and transmitting it to the system where decompression occurs. The local and remote PCs typically assign tracking numbers to the display frames they are compressing and decompressing.
RLE and Delta encoding algorithms, with sample images comparing degree of loss to the size of the data records to be stored, and with sample “C” code:
http://paulbourke.net/dataformats/compress/
Many image geometric-data and file-format standards, with “C” code:
http://paulbourke.net/dataformats/
https://en.wikipedia.org/wiki/Lossless_compression
“The primary encoding algorithms used to produce bit sequences are Huffman coding (also used by the deflate algorithm) and arithmetic coding. Arithmetic coding achieves compression rates close to the best possible for a particular statistical model, which is given by the information entropy, whereas Huffman compression is simpler and faster but produces poor results for models that deal with symbol probabilities close to 1.”….
… and other techniques that take advantage of the specific characteristics of images (such as the common phenomenon of contiguous 2-D areas of similar tones, …
…It is sometimes beneficial to compress only the differences between two versions of a file (or, in video compression, of successive images within a sequence). This is called delta encoding (from the Greek letter Δ, which in mathematics, denotes a difference).
Design of Image Compression Algorithm Using MATLAB (PDF)
http://dvd-hq.info/data_compression_1.php
Other image file formats that can use RLE compression include BMP, PCX, PackBits (a TIFF sub-format) and ILBM (an old format, used mainly in Amiga computers). Fax machines also use RLE (combined with another technique called entropy coding) to compress data before transmitting it.
Once the pixel data values, which may be 16-bit or 24-bit, are collected, a histogram is formed over a frame of pixel values. Next, a smaller value, such as an 8-bit number (0-255), can be used to represent full 24-bit 8-8-8 R-G-B colored pixels with near 99% accuracy for many real-world pictures.
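Below is a minimal Python sketch of that histogram-driven palette reduction, keeping the 256 most frequent colors and mapping the rest to the nearest palette entry; the helper names are illustrative only.

```python
# Illustrative sketch: histogram-based palette of the 256 most frequent 24-bit
# colors, with remaining pixels mapped to the nearest palette entry.
import numpy as np
from collections import Counter

def build_palette(pixels, size=256):
    # pixels: iterable of (r, g, b) tuples
    return [color for color, _ in Counter(pixels).most_common(size)]

def quantize_to_palette(pixels, palette):
    pal = np.array(palette, dtype=np.int32)
    indices = []
    for p in pixels:
        # Smallest squared RGB distance picks the palette index (one byte per pixel).
        d = np.sum((pal - np.array(p, dtype=np.int32)) ** 2, axis=1)
        indices.append(int(np.argmin(d)))
    return indices
```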
Apparently VNC now uses a version of Low-Loss compression; however, in the past it apparently transmitted uncompressed bitmaps, as per the links below. The T-120 standard is a baseline for remote desktop coding.
https://helpful.knobs-dials.com/index.php/VNC_notes#VNC
“VNC captures your screen and sends bitmap versions of it. This is not very fast, but works and free servers and clients are available for Windows, Linux, and OSX, so can be an easy choice.”
https://blog.tedd.no/2011/04/28/optimizing-rdp-for-casual-use-windows-7-sp1-remotefx/
Second is a new feature in Windows 7 SP1 and Windows 2008 R2 SP1 called RemoteFX. This even works on virtualized guest os running on HyperV. In short it allows RDP sessions to use hardware acceleration for rendering. It also changes the sampling method from “every update to the screen” to “interval update”.
NSCodec bitmap compression is used when the RDP session color depth is 32 bpp and the bitmap of interest is either 24 bpp (RGB with no alpha channel) or 32 bpp (RGB with an alpha channel).
[MS-RDPEGDI]: Remote Desktop Protocol: Graphics Device Interface (GDI) Acceleration Extensions
Microsoft’s “RDP” desktop lossless (or lossy) compression algorithm appears to use the NSCodec bitmap method. It also can apparently skip frames to reduce CPU load and network load, as noted below.
“RDP is based on, and is an extension of, the T-120 family of protocol standards.”
Below, a white paper from Indonesia describes a logical method of examining a text-file bitmap, which is much like a desktop with a single-color background and icons.
The paper then recommends applying Huffman encoding, which is also logical, for further compression.
Drawing 2: Text w/plain Backgrounds, Row/Column Scan to Create Re-usable Macroblocks
A common Low-Loss compression application example is a remote desktop computer display, where icons and text are reconstructed at the receiving system with zero image degradation. This fits the use case: a viewer’s eyes have ample time to look at a static, slowly changing desktop and expect a clear, perfect image. Another application example is machine vision, where an industrial-grade camera captures pixel data of objects on a factory line for extremely detailed examination, and the remote computer needs perfect image data to examine.
Making the subject of Low-Loss quite a bit more complex is whether compression is applied to just a still image or to a series of images, such as a changing PC desktop, a movie, or a long sequence from a security camera.
Low-Loss processing of a set of movie frames is perhaps simplest when Transmitted-or-Stored (T-or-S) with the Delta method, Huffman encoding, or the very fastest RLE (which has the poorest Compression Ratio). With Delta, Huffman and RLE, only the pixels changed from the previous frame are buffered for transmission. Huffman compression can also be applied to a single frame, with no regard to the temporal association between the current and old frame; in that case the T-or-S buffer holds only neighboring pixel values of a frame, or of macroblocks of a frame, as the frame is spatially reviewed.
Spatial-Temporal De-Noising: The links below, from other public webpages and publications with differing descriptions of these compression concepts (such as old-frame to new-frame comparison), are recommended reading:
Spatio-Temporal Video Denoising by Block-Based Motion detection
Seema Mishra, Preety D Swami
https://www.ijettjournal.org/volume-4/issue-8/IJETT-V4I8P124.pdf
- Temporal filtering is an approach of exploiting temporal correlation to reduce noise in a video sequence. A video sequence contains not only spatial correlation but also temporal correlation between consecutive frames. Temporal video denoising methods can remove the artifacts caused by spatial methods by tracking object motions through frames and thus make certain temporal consistency.
Drawing 3: “One frame delay reference” (ijettjournal vol-4 issue-8)
The key item in the figure above, noted from Seema Mishra and Preety D Swami, is the required process of real-time comparison of the old frame (one frame delay) to the current (new) frame.
This One-Frame-Delay comparison of old frame to new frame plays a major role in processing; it is a core process for achieving high compression ratios with low error rates.
1.1.2 YUV and RGB Color Formats
A Low-Loss compression is often of an RGB personal computer desktop, and user applications (both local and remote) working in color often need to manipulate pixels while keeping this easy-to-work-with RGB pixel format, which holds all three color data values in, in theory, their most original and pure form. When R, G and B are kept at the same data bit lengths, any math done to a pixel’s RGB sub-values can be nearly Low-Loss, except for the limitations of integer math, which rounds or truncates any value smaller than the least significant bit.
Thus, most image-alteration software applications work in RGB rather than YUV, which is a compressed version of the pixel data.
For video cameras, a popular sub-pixel arrangement is four sub-pixels per whole pixel (the equivalent of one RGB or YUV whole pixel) in a “2×2” quad: first row G and B, next row R and G. G appears twice in the quad because green carries most of the monochrome brightness component, and brightness is of greater importance to the human eye. The four data pieces need to be processed together to create one color pixel (a minimal sketch follows the drawing caption below).
Drawing 4: Camera Color Image Sensor Bayer Filter Arrangement, translates to RGB or YUV
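A minimal sketch of turning one such 2×2 Bayer quad into a whole RGB pixel, assuming the G/B over R/G layout described above and simple averaging of the two green samples (real demosaicing filters are considerably more sophisticated):

```python
# Minimal sketch (assumed G B / R G quad layout, per the description above):
# one whole RGB pixel from a 2x2 Bayer quad, averaging the two green samples.
def bayer_quad_to_rgb(quad):
    (g1, b), (r, g2) = quad          # first row: G, B; second row: R, G
    g = (g1 + g2) // 2               # green carries most of the brightness
    return (r, g, b)

print(bayer_quad_to_rgb([[120, 40], [200, 124]]))  # -> (200, 122, 40)
```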
The technical marketplace still pursues the challenge of a color camera sensor that can capture high-quality RGB or YUV pixel data in a single pixel sensor zone (no sub-zones; a single full-color integration of light to electrons) and that is producible and cost-efficient.
In TV stations of yesteryear using NTSC, brightness and color were combined in real time, at the instantaneous temporal and spatial pixel location, from the RGB camera sensor or YUV video tape, and transmitted as an analog version of the YUV 4:2:0 packet (with higher data-quality emphasis on “Y” brightness) for reconstruction later at the RGB display. All of this NTSC analog data, flowing in real time toward the displays, was arranged in a single row: YUV (pixel 1), YUV (pixel 2), YUV (pixel 3), and so on across a whole line.
1.1.3 Special Case of Remote Desktop
Video compression of a remote desktop has a number of extreme differences from video camera streams, video phone streams and off-line compressed entertainment movies. Foremost, almost all of the compression processing stops (or should stop) when the screen becomes static, that is, when neither the user’s input nor the user’s applications change anything on a desktop PC running only programs like a word processor or spreadsheet. Often the only pixels that change are the few text digits of the time-of-day display, about once per minute.
The effective compression frame rate (FPS) bounces around quite a bit for average remote desktop PC users. The color palette is almost always based on RGB, and even application windows showing YUV video on that display have already been converted from YUV to RGB.
See the section “Hardware Mouse Sprite Transparent Overlay”: the hardware mouse-sprite function both complicates desktop compression and makes it faster, less computationally intensive and less network-bandwidth intensive.
Next, an “RDP” technical discussion at etutorials-dot-org covers the methods and issues, carried out in a mix of open-standard and proprietary software and hardware, of compressing and transmitting the local desktop to a remote computer. Note: a “buffer” can also be called a “cache”. The more buffered video data stored at the remote location, the lower the network bandwidth needed for good-quality compression; it is a trade-off.
“RDP Architecture … The screen is transmitted as a raster graphic (bitmap) from the server to the client or terminal. The client transmits the keyboard and mouse interactions to the server….
…Color palettes can further optimize bitmap use. A table containing color values is created and transmitted to the client. If individual pixels in a bitmap need coloration, the position coordinates in the table are transmitted, not the color value. The amount of data is always smaller when the color depth is high but the number of simultaneously used colors is relatively low. Thus, an individual color value requires up to three bytes, whereas the color position in a table with a maximum of 256 entries needs only one byte.
Even more problematic are animated images, that is, animated bitmaps. They result in significantly higher transfer rates on the network, even if they are very small (for example, animated or blinking mouse cursor).
In this instance, we need another mechanism to limit the data volume: caching….
…RDP supports the following buffers (aka caches):
* Bitmap buffer for different kinds of bitmaps. Size and number of these caches is determined upon connection.
* Font buffer for glyphs (character bitmaps). The cache size must be sufficient for storing all characters of a defined character set.
* Desktop buffer for a desktop screenshot. A character command stores or outputs this special bitmap.
* Cursor buffer for mouse cursors that need special handling. Special graphics algorithms ensure that the correct mouse cursor is displayed on the desktop without generating a lot of network traffic and overloading the local processor resources. Drag-and-drop operations with the mouse therefore do not require the transmission of new graphical elements, but only the transmission of the new coordinates.
* Text buffer for frequently used character strings and the corresponding formatting information. Glyphs from the font cache are used to generate a complete character string”
Additional links on RDP and VNC (another popular remote desktop product) may be helpful; they describe the use of hardware acceleration for the hardware-sprite mouse cursor (aka pointer).
https://www.helpwire.app/blog/remote-desktop-protocol/
https://discover.realvnc.com/what-is-vnc-remote-access-technology
https://superuser.com/questions/1583138/improve-mouse-input-lag-over-remote-desktop-connection
“Improve mouse input lag over Remote Desktop connection …paraphrase of the post:
- On the host PC, Open “Edit Group Policy” (search for gpedit in the Start search bar)
- Browse to: Local Computer Policy\Computer Configuration\Administrative Templates\Windows Components\Remote Desktop Services\Remote Desktop Session Host\Remote Session Environment
- Enable “Use the hardware default graphics adapter for all Remote Desktop Services sessions” (right-click > edit > enabled)
- Enable “Configure H.264/AVC hardware encoding for Remote Desktop Connections” (forum post says this is optional, but I think any acceleration is welcomed if you have the hardware to support it) “
Unfortunately, legal disputes do happen (May 2021), and the excerpts below show how thorny the discussion of macroblock functions for Remote Desktop Protocol (Microsoft’s remote desktop function, named “RDP”) can become.
“Microsoft Remote Desktop protocol may use the H.264 codec. H.264 is an industry standard, macroblock-based compression technology. The compression assistance data would be details of the macroblocks to be compressed.”…..
“RemoteFX vGPU in managing multiple, concurrent virtual machines can be replaced by a GPU”
Indeed, it is doable, and cost-reducing, for a multi-raster GPU-IC to compress the desktops of multiple virtual machines on a server that share one GPU board. That GPU board contains all of the multiple desktops and jumps from Desktop-1 to Desktop-2 to Desktop-X, providing a compression solution for each in a round-robin repeating fashion, with the Start Address (SA) and Vertical Total (VT) register values updated at vertical blank time.
The dual-head GPU would start the process on each virtual machine desktop by checking for changed pixels. See this paper’s section “Raster HW Accel Only-Changed Macroblock Data Push, P, B-Frames toward DCT-function”. A heavily hedged sketch of this round-robin idea follows.
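Below is a heavily hedged Python sketch of that round-robin scheduling idea only, not MiMax's implementation; the register-write helpers are hypothetical stand-ins for GPU programming that would actually occur at vertical-blank time.

```python
# Heavily hedged sketch of the round-robin idea only, not MiMax's implementation.
# write_start_address(), write_vertical_total() and run_change_detect_pass() are
# hypothetical stand-ins for register writes that would occur at vertical blank.
DESKTOPS = [
    # (start address in video memory, vertical total in lines) per virtual machine
    (0x00000000, 1125),
    (0x00800000, 1125),
    # ... up to 16 desktops sharing one dual-head GPU board
]

def round_robin_pass(write_start_address, write_vertical_total, run_change_detect_pass):
    for start_address, vertical_total in DESKTOPS:
        # Re-point the raster at the next VM's desktop, then let the phase-locked
        # rasters build that desktop's Pixel Change Map before moving on.
        write_start_address(start_address)
        write_vertical_total(vertical_total)
        run_change_detect_pass()
```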
1.1.3.1 Hardware Mouse Sprite Transparent Overlay
The compression process is also in communication with the Operating System (OS), which informs the compression code of the X, Y location of the mouse. Usually, in modern GPU cards, the main bitmap display is kept in one part of video memory and the mouse pointer in another part, or even in a small FIFO inside the GPU IC, so that the small mouse sprite avoids competing for main video buffer memory access bandwidth.
The video driver has innate knowledge of the video hardware and how it is being used; it knows, for instance, whether off-screen virtual windows are located in other areas of the display card’s video memory, and whether those hardware virtual windows contain low-rate-of-change display data, such as a word processor, or a high rate of change, such as video movie or video camera playback.
The video driver controls a hardware mouse pointer that is almost exactly like a video sprite of the early Atari and later Amiga computers and game systems.
Below is a graphic (from re-designer-dot-com/cursor-set/fedora) showing how the mouse pointer lives in a memory array that represents a rectangular mini-raster, and that most of its area is usually the transparent color, i.e., the hardware color-key pixel value for transparency.
Drawing 5: Sprite Mouse-Cursor, ColorKey Transparency, & X-Y Data for Desktop Compression
Due to the importance of how mouse-cursor pointer sprites interact with desktop compression, below are more reference links to hardware mouse patents and discussion.
https://retrogamecoders.com/amos-basic-bobs-and-sprites/
“Amiga Hardware Sprites: The Amiga has built-in hardware that enables fast and easy sprite operations. In fact, the mouse pointer is a Sprite. You can set the mouse to use your own sprite as a pointer.”
https://patentcenter.uspto.gov/#!/applications/08176128
Inventors: Darwin P. Rackley, R. Michael P. West; Current Assignee: International Business Machines Corp
Hardware XOR sprite for computer display systems
Sprites, or cursors, are widely used in display systems as pointers to data displayed on a video display unit (VDU) of the system. Typically, a user controls the position of the sprite by means of an input device, such as a keyboard, mouse, or joystick.
The image that the sprite displays is defined by a sprite character which is stored in an area of bit-mapped memory referred to as a sprite RAM. During typical operation, the sprite character overlays a portion of an image that would normally be displayed at the pixel position occupied by the sprite.
A sprite character is stored in a sprite random access memory (RAM) and comprises a number of sprite data bits which are active (=1) when the sprite is to be displayed at the particular pixel location and inactive (=0) when the underlying image is to be displayed.
Sprites of fixed size are usually implemented in hardware such that their position on screen is controlled by X and Y position data generated by the input device and stored in X and Y position registers.”
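In the spirit of the sprite hardware described above, here is a minimal software sketch of color-key sprite composition; the key value and array layout are illustrative assumptions, not taken from the patent.

```python
# Minimal sketch of color-key sprite composition: any sprite pixel equal to the
# key value lets the underlying desktop pixel show through. The key value and
# 2-D list layout are illustrative assumptions.
TRANSPARENT_KEY = 0xFF00FF

def composite_sprite(frame, sprite, x, y):
    # frame, sprite: 2-D lists of packed RGB values; x, y come from the sprite
    # position registers driven by the mouse input device.
    for row, line in enumerate(sprite):
        for col, pixel in enumerate(line):
            if pixel != TRANSPARENT_KEY:
                frame[y + row][x + col] = pixel
    return frame
```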
https://www.freepatentsonline.com/10276131.pdf
Inventors: Lee E. Ballard, Gregory H. Aicklen; Current Assignee: Dell Products LP
Systems and methods for remote mouse pointer management
“Abstract: to synchronize a remote mouse pointer with a local mouse pointer that is manipulated by a local user of a local information handling system or to implement a single cursor mode for the remote mouse pointer.
Background: Agents, such as Microsoft Remote Desktop and some forms of VNC server, may be implemented as agents in the form of modified device drivers that provide mouse position feedback. Using this approach, any process running in the operating system (OS) can simply ask the OS where the mouse is, and the OS will respond with the mouse position. However, such agents depend on the OS and are not compatible with pre-OS operations like changing BIOS settings or installing an OS.”
1.2 High Latency Compression, Typically High Compression Ratio
Lossy compression for a video stream is stated as a reduction of the data to be Transmitted-or-Stored (T-or-S) for a series of video frames, with an expected loss of frame image accuracy when the series of frames is reconstructed later. The amount of data reduction is referred to as the “compression ratio”, that is, how much data reduction, by percent, has been achieved by the multiple methods of compression processed on the original image or motion video data; higher ratios generally come with greater loss of accuracy.
Lossy JPG versus lossless PNG still images: this paper mentions the two most popular still-image standards for context, because some compression methods for video streams are shared with still images.
JPG is a still-image standard for lossy compression that uses the DCT, while PNG is Low-Loss in most use cases. This paper concentrates on video stream compression; however, a brief discussion of still images makes the subject more complete. Lossy JPG uses YUV and DCT macro-blocking in its compression process, while lossless PNG makes significant use of Delta encoding and Huffman encoding.
http://www.libpng.org/pub/png/book/chapter08.html
https://pi.math.cornell.edu/~web6140/TopTenAlgorithms/JPEG.html
Drawing 6: YUV Frames applicable for DCT Macroblocks & Motion Vectors (high latency compression)
Below, the same ijettjournal vol-4, issue-8 paper discusses the spatial (the single 2D frame images of a video) and temporal (changes over time) processes.
Spatio-Temporal Video Denoising by Block-Based Motion detection
Seema Mishra, Preety D Swami
https://www.ijettjournal.org/volume-4/issue-8/IJETT-V4I8P124.pdf
“Proposed video de-noising”
(in figure below) Consider both compression in each frame, and next, frame-to-frame changes
Drawing 7: Diagram of Proposed Video Denoising Algorithm ijettjournal.org/vol-4
The de-noising process for a frame of video is described below, from a ScienceDirect web page:
Numerical Analysis of Wavelet Methods
https://www.sciencedirect.com/topics/computer-science/signal-denoising
“wavelet-based parameter reduction are statistical estimation and signal denoising: a natural restoration strategy consists of thresholding the coefficients of the noisy signal at a level that will remove most of the noise, but preserve the few significant coefficients in the signal. This is in contrast to classical linear low pass filtering, which tends to blur the edges while removing the noise. Such nonlinear procedures have been the object of important theoretical work and have motivated recent progresses in nonlinear approximation”
Huffman encoding is added after the Discrete Cosine Transform stage in this drawing of a published Raspberry-Pi project.
Drawing 8: DCT Discrete Cosine Transform Used in a Lossy Compression
An FPGA college project at Cornell details efficient DCT computation and quantization for implementation in gates and flip-flops, along with several more processes. Although it targets a JPG still image, much of the same processing applies to motion video.
Descriptive functional and text flow-charting of the popular MPEG motion-frame video compression process is well explained in many papers and web pages that describe original-frame compression and the change-only parts of subsequent frames (called I, P and B frames), along with “motion compensation” (finding matching macroblocks that have retained the same spatial pixel data but have re-positioned to a new location on the display).
1.2.1 Popular, But Lossy MPEG Video Compression Standard (motion streaming video)
Peak signal-to-noise ratio (PSNR) is a numeric value used to compare the image accuracy and quality of compression video codecs. However, multi-frame error-correction processes, other special hardware (HW) and software techniques layered on top of the main DCT and FFT compression processes, latency (from a real-time transmitting camera phone to a real-time playback phone), and the nature of the particular raw video data to be compressed all have a large effect on quality, which makes the calculated PSNR less absolute in value. It is noted here:
https://www.x265.org/compare-video-encoders/
peak signal to noise ratio (PSNR)
You must use your eyes. Comparing video encoders without visually comparing the video quality is like comparing wines without tasting them. While it’s tempting to use mathematical quality metrics, like peak signal to noise ratio (PSNR), or Structural Similarity (SSIM), these metrics don’t accurately measure what you are really trying to judge; subjective visual quality. Only real people can judge whether test sample A looks better than test sample B,
https://ottverse.com/i-p-b-frames-idr-keyframes-differences-usecases/
I, P, and B-frames – Differences and Use Cases Made Easy
Encoders search for matching macroblocks to reduce the size of the data that needs to be transmitted. This is done via a process of motion estimation and compensation. This allows the encoder to find the horizontal and vertical displacement of a macroblock in another frame.
An encoder can search for a matching block within the same frame (Intra Prediction) and adjacent (Inter Prediction) frames. It compares the Inter and Intra prediction results for each macroblock and chooses the “best” one. This process is dubbed “Mode Decision,” and in my opinion, it’s the heart of a video codec.
….. It’s a vast topic… However, B-frames are resource-heavy – both at the encoder and decoder. Let’s see why! To understand the impact of B-frames, let’s understand the concepts of Presentation/Display Order and Decoding Order.
https://web.stanford.edu/class/ee398a/handouts/lectures/EE398a_MotionEstimation_2012.pdf
EE398a Motion Estimation Summary:
* Video coding as a hybrid of motion compensation and prediction residual coding
* Motion models can represent various kinds of motions
* Lagrangian bit-allocation rules specify constant slope allocation to motion coefficients and prediction error
* In practice: affine or 8-parameter model for camera motion, translational model for small blocks
* Differential methods calculate displacement from spatial and temporal differences in the image signal
* Block matching computes error measure for candidate displacements and finds best match
* Speed up block matching by fast search methods, approximations, early terminations and clever application of triangle inequality
* Hybrid video coding has been drastically improved by enhanced motion compensation capabilities
1.2.1.1 YUV is a Compression of RGB Pixel Data
YUV was in effect invented by TV engineers to compress RGB data. Vast sums of money were spent on TV development, so the decision to use the YUV format was a highly considered one. YUV survives strongly to this day in HD TV for the same reason: it is a natural compression that takes advantage of the human eye placing more emphasis on monochrome brightness.
YUV takes less data to transmit or store (T-or-S) a pixel and is thus a form of compression in and of itself, as noted above.
As a generalization, in modern times more video stream data is transmitted and recorded in YUV format than RGB, while more still image data is manipulated manually in graphics editing software in RGB format than YUV. YUV-to-RGB conversion, and vice versa, is so common in real time that it has hardware acceleration math circuits in almost all video cards (aka graphics cards), video players and cell phones.
As the analog YUV data arrived at the receiver, it was reconstructed into RGB in real time, and the electron gun painted the RGB pixel group at the same temporal (when in time) and spatial (where on the X-Y 2D face of the TV display) location.
YUV, when created from RGB via conversion, is almost always lossy and irreversible. In theory YUV could be Low-Loss, but that could require as many or more data bits than the original RGB data. When RGB is created from the original YUV, the step itself is not very lossy, though a data-bits-per-pixel enlargement will occur. If an experiment were run converting back and forth between the two color formats, RGB and YUV, within fewer than 10 conversions and reversals the data would become a nearly useless, blurry result. See the links below; a conversion sketch follows them.
https://www.fourcc.org/yuv.php
https://forum.videohelp.com/threads/359657-If-I-convert-from-YUV-to-RGB-I-loss-quality
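A minimal sketch of the RGB-to-YCbCr round trip, using the commonly published BT.601 full-range coefficients and rounding to 8-bit integers at each step, which is where the small irreversible loss comes from:

```python
# Minimal sketch: RGB <-> YCbCr using commonly published BT.601 full-range
# coefficients, rounded and clamped to 8 bits at each step (the lossy part).
def clamp8(v):
    return int(round(max(0, min(255, v))))

def rgb_to_ycbcr(r, g, b):
    y  =  0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.168736 * r - 0.331264 * g + 0.5 * b + 128
    cr =  0.5 * r - 0.418688 * g - 0.081312 * b + 128
    return clamp8(y), clamp8(cb), clamp8(cr)

def ycbcr_to_rgb(y, cb, cr):
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    return clamp8(r), clamp8(g), clamp8(b)

# Repeated round trips accumulate small rounding errors:
print(ycbcr_to_rgb(*rgb_to_ycbcr(200, 30, 77)))
```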
1.2.1.2 RGB (and Lossless Compression) is Here to Stay, for Many Reasons
Even though YUV, at first inspection, uses fewer bits, RGB pixel bit depth means a lot for personal computer use cases. The higher the bit depth, the nicer the desktop image, but in general greater depth slows all graphics operations and burns more wattage, a straightforward conceptual trade-off. RGB is here to stay, as it is conceptually Low-Loss each time pixels are altered, such as when adding or subtracting brightness of each of the three colors red, green and blue, which map directly to the three display transistors or LEDs that make up a display pixel.
Not just PC desktops, but the entire subject area of “machine vision” relies on the RGB format; see the links below. Virtually every egg, bottle of water, loaf of bread, toothpaste tube, and literally hundreds of thousands of common products are inspected by machine vision. One could reasonably estimate that without machine-vision industrial cameras and processing products, common products would increase in cost by some 10% and quality control would drop by a significant percentage.
https://forums.ni.com/t5/Machine-Vision/yuv-histogram-compared-with-rgb/td-p/1003576
https://www.jai.com/products/line-scan-cameras/3-sensor-r-g-b-prism/
https://docs.adaptive-vision.com/current/avl/functions/ImageColorSpaces/index.html
Regarding remote desktop, having a pixel bit-depth setting deeper (more bits) than needed can result in a significant slowdown in compression, and some desktop compression processes will not even allow a 24-bit setting.
Using YUV in desktop compression can be considered for a fast (almost zero latency) next-frame update at a remote location, if the compression process then reverts to RGB correction over the next few frames once the desktop changes settle down.
Usually it is recommended that personal computer workstations whose desktops are to be compressed for internet transmission be set to 16-bit RGB (5,6,5) color rather than the default 24-bit color that many OS installs use. The human eye-brain can recognize only about 39 color hues in most people. Thus the 12-bit RGB (4,4,4) digital pixel palette that was very well received by Commodore-Amiga computer users was ample for almost all business and art applications. This is not to be confused with “deep color” in Intel white papers regarding HDMI 36-bit RGB (12,12,12). A small bit-packing sketch follows.
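For reference, a minimal sketch of how a 24-bit (8,8,8) pixel packs into the 16-bit (5,6,5) and 12-bit (4,4,4) formats discussed above, simply by dropping low-order bits:

```python
# Minimal sketch: packing a 24-bit (8,8,8) pixel into 16-bit (5,6,5) and
# 12-bit (4,4,4) formats by dropping low-order bits.
def pack_rgb565(r, g, b):
    return ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)

def pack_rgb444(r, g, b):
    return ((r >> 4) << 8) | ((g >> 4) << 4) | (b >> 4)

print(hex(pack_rgb565(255, 128, 64)))  # 0xfc08
print(hex(pack_rgb444(255, 128, 64)))  # 0xf84
```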
Other writings argue for 1 million colors (that humans can see), as derived from the mathematical combination of all the color hues and brightness levels that they can see.
This paper recommends that Linux, Apple and Windows OS distributions, along with the graphics card makers’ hardware and software drivers, add a desktop color setting of 12-bit RGB (4,4,4) alongside the currently popular settings of 16-bit RGB (5,6,5) and 24-bit (8,8,8) color: 12-bit (4,4,4) color plus at least one hardware raster rectangular window that could run at a higher color depth. Intel (in the December 2015 white paper linked below) indicates otherwise, namely that all computer application display data windows are to be conformed to the general desktop depth (perhaps call this the “absolute” final display raster bit depth).
https://www.elitedaily.com/news/world/quarter-population-can-see-all-colors-chart/953812
https://petapixel.com/2018/09/19/8-12-14-vs-16-bit-depth-what-do-you-really-need/
https://forums.windowscentral.com/windows-10/370852-windows-10-no-16-bit-color-depth-options.html
https://blog.dhampir.no/content/remote-desktop-does-not-support-colour-depth-24-falling-back-to-16
Drawing 9: (Intel) Deep Color Support, Applications Convert Display to Desktop Color Depth
Keep in mind the context Intel may be implying: many Intel motherboards build the graphics display functions into the motherboard’s “companion chip”, where almost all of the buses, ports and the memory controller reside. Intel has used interleaved memory maps in lower-cost PCs, where the video controller (aka GPU, aka raster engine) may lack significant hardware-window capability for applications.
Unfortunately, when a lower-color-depth application window is converted to a deeper color depth using interpolation, this mostly causes excess compression work on display zones with manufactured, essentially virtual, increases in color and brightness variation. The trade-offs of remote viewing and compression would likely favor letting small desktop areas representing practical applications remain in their native color depth. It is easy to recommend that any YUV movie playback on a local RGB display be sent to the remote display in its native format for that desktop zone.
Drawing 10: Mixed Color Depth Mode Remote Desktop Compression and Transmission.
When considering which pixel bit-depth settings to use on a workstation, AMD published a web page reminding readers that the standard pixel format on HDMI panels is RGB 4:4:4.
https://www.amd.com/en/support/kb/faq/dh-007#faq-Pixel-Format-Overview
Some published articles recommend nothing less than lifelike, high-frame-rate, deep-pixel-bit-depth, high-resolution quality for local and remote displays, such as 30-bit (10,10,10) TFT or LED panels. The next link reviews this debate. Many displays, although able to accept digital data at very high bit depths, are limited by the D/A circuits that control the analog brightness steps of the transistors for each color of a pixel. Most average displays in affordable price ranges do not actually show contrast resolution at those levels, and may in fact add dithering at the display to imitate high contrast-range resolution. This avsforum web page appears worth the time to read on the subject of panel hardware.
https://www.avsforum.com/threads/determining-display-panel-bit-depth.2424330/
1.2.1.3 DCT, DWT, FFT have Major Roles in Video & Audio Streams Lossy Compression
This paper covers DCT, DWT and FFT, even though MM hardware acceleration techniques for DCT, DWT and FFT are not yet published. All three are variations on using trigonometric functions to represent a set of data points, where the trigonometric description takes fewer bits of data than the original data points, albeit in a lossy fashion.
When clicking around the internet, the user sees many pictures and video streams. A common technical behavior is that when a picture is first displayed remotely, for the very first second it can be rather blurry. This is a trigonometric transform (one of the three variations DCT, DWT, FFT) applied across a whole image, left to right and top to bottom.
For some of the sophisticated Low-Loss desktop compressions, a new screen refresh can start as a lossy, JPG-like compression using DCT, DWT or FFT in the first passes of display updates, and then, over milliseconds to seconds, convert to a Low-Loss desktop display once the desktop quiets down to no changes, or only small changes, per video frame.
Technically, the DCT is a lossless process. But in most cases of actual use in video compression it is applied in a lossy manner, as the programmer or hardware engineer averages some pixels, or limits the number of possible pixel values, to reduce the byte count of the compressed result. The next web page provides good detail on this.
Is DCT lossy or lossless? Mostly lossless, but it depends on the context.
DCT-II is one of the many forms of Discrete Cosine Transforms, and probably the most widely used one, as it is (somehow) present in JPEG or MP3 formats. “Lossy” often refer to the compression standard which uses it, because the main loss results from quantization (and generally not the transform by itself). By itself, the transform is invertible, even more orthogonal or orthonormal, so in theory you have no loss (it is bijective).
As (commeter-xxxx) points out, its coefficients sometimes involve cosines, hence irrational numbers (except at specific values) that are not easily represented in finite arithmetic (float or int), and can induce round-off errors.
But: the DCT-II of size 1 is [1], and the DCT-II of size 2 (un-normalized) is the matrix [[1, 1], [1, −1]], so DCT-II can be lossless (in special cases though),
For integer data (like images), it seems possible, with enough bit-depth computations (related to the initial integer range and the DCT size), to remain practically lossless, as you know that the transform/inverse transform result should be integer,
Very accurate integer, or dyadic approximations exist, meant for integer hardware: for instance the binDCT. My final answer is thus: DCT-II is mostly lossless, if you take care of it.
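A small numerical check of the quoted point, using SciPy's orthonormal DCT-II on an illustrative 8-sample row: the plain round trip is exact to floating-point precision, and error appears only once the coefficients are quantized.

```python
# Small numerical check of the quoted point, on an illustrative 8-sample row.
import numpy as np
from scipy.fft import dct, idct

x = np.array([52, 55, 61, 66, 70, 61, 64, 73], dtype=np.float64)
coeffs = dct(x, type=2, norm='ortho')

# Plain round trip: error at floating-point precision only (effectively lossless).
print(np.max(np.abs(idct(coeffs, type=2, norm='ortho') - x)))

# Quantize the coefficients (round to steps of 10): now the round trip shows loss.
quantized = np.round(coeffs / 10.0) * 10.0
print(np.max(np.abs(idct(quantized, type=2, norm='ortho') - x)))
```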
JPEG, which uses DCT combined with other methods, compared with MPEG-1, MPEG-2, MPEG-4 and H.264, is briefly reviewed at this link.
https://www.sciencedirect.com/topics/computer-science/1-compression-ratio
Since JPEG is such a general-purpose standard, it has many features and capabilities. By adjusting the various parameters, compressed image size can be traded against reconstructed image quality over a wide range. Image quality ranges from “browsing” (100:1 compression ratio) to “indistinguishable from the source” (about 3:1 compression ratio).
How It Works: JPEG does not use a single algorithm, but rather a family of four, each designed for a certain application. The most familiar lossy algorithm is sequential DCT. Either Huffman encoding (baseline JPEG) or arithmetic encoding may be used. When the image is decoded, it is decoded left-to-right, top-to-bottom. Progressive DCT is another lossy algorithm, requiring multiple scans of the image. When the image is decoded, a coarse approximation of the full image is available right away, with the quality progressively improving until complete. This makes it ideal for applications such as image database browsing.
Either spectral selection, successive approximation, or both may be used. The spectral selection option encodes the lower-frequency DCT coefficients first (to obtain an image quickly), followed by the higher-frequency ones (to add more detail). The successive approximation option encodes the more significant bits of the DCT coefficients first, followed by the less significant bits.
1.2.1.4 Interframe Prediction, Macroblocks, Motion Estimation
“Inter-Frame Prediction” (aka interframe) is perhaps the shortest phrase that can describe the most-used compression algorithms for video stream compression for movies and phone calls.
Error-correction techniques are employed: while the compression process is underway, the lossy compression stages are processed on older frames and compared for accuracy against the newer frames they predict. These tests measure how accurate the compression result is; if some predictive macroblocks test poorly when the reverse DCT is later applied, the programmer or HW engineer may correct some of the buffered (ready for transmission to the remote location) portions. Below are quotes from Alexander Fox’s “How Modern Video Compression Algorithms Actually Work” and Ottverse’s article on I, P and B frame differences.
Drawing 11: Three Types of Frames used in Inter-Frame Prediction (A.Fox, how-comp-works)
I-frames are fully encoded images. Every I-frame contains all the data it needs to represent an image. P-frames are predicted based on how the image changes from the last I-frame. B-frames are bi-directionally predicted, using data from both the last P-frame and the next I-frame. P frames need only store the visual information that is unique to the P-frame. In the above example, it needs to track how the dots move across the frame, but Pac-Man can stay where he is.
The B-frame looks at the P-frame and the next I-frame and “averages” the motion across those frames. The algorithm has an idea of where the image “starts” (the first I-frame) and where the image “ends” (the second I-frame), and it uses partial data to encode a good guess, leaving out all the redundant static pixels that aren’t necessary to create the image.
Interframe methods can just as well be applied to Low-Loss compression of a changing computer video desktop display. Below is quoted Alexander Fox’s “how modern video compression algorithms actually work”.
Drawing 12: I-Frames are Full Frame of Macroblocks, P & B Frames Re-Use Some MB’s
https://ottverse.com/i-p-b-frames-idr-keyframes-differences-usecases/
From Krishna Rao Vijayanagar, at ottverse :
The concept of I-frames, P-frames, and B-frames is fundamental to the field of video compression. ….I-frames are generally inserted to designate the end of a GOP (Group of Pictures) or a video segment (refer to our article on ABR streaming fundamentals). Because I-frame compression is not dependent on previously-encoded pictures, it can “refresh” the video quality.
Encoders are typically tuned to favor I-frames in terms of size and quality because they play a critical role in maintaining video quality. After encoding an I-frame with high video quality, the encoder can then use it as a reference picture to compress P and B-frames.
https://www.maketecheasier.com/how-video-compression-works/ (by Alex Fox)
“Video encoders attempt to “predict” change from one frame to the next. The closer their predictions, the more effective the compression algorithm. This is what creates the P-frames and B-frames. The exact amount, frequency, and order of predictive frames, as well as the specific algorithm used to encode and reproduce them, is determined by the specific algorithm you use.
Drawing 13: B-Frames Built from MB Copies of Frames both Earlier & Later in Time
https://www.maketecheasier.com/how-video-compression-works/ (by Alex Fox)
… Data Compression, once the data is sorted into its frames, then it’s encoded into a mathematical expression with the transform encoder. H.264 employs a DCT (discrete-cosine transform) to change visual data into mathematical expression (specifically, the sum of cosine functions oscillating at various frequencies.) “
The DCT is then applied to the macroblocks that reside in the old and new full frames, and the DCTs are then forward-time tested against the frames they would have predicted.
Motion Estimation, and the creation of motion vectors and motion compensation circuits, is itself a broad subject. The hardware acceleration in this paper does some of the most repetitive selective macroblock data (memory access) processes. The links below review more details on Interframe motion as applied to video compression techniques.
https://en.wikipedia.org/wiki/Motion_compensation
https://www.cmlab.csie.ntu.edu.tw/cml/dsp/training/coding/motion/me1.html
https://web.stanford.edu/class/ee398b/handouts/lectures/02-Motion_Compensation.pdf
https://www.sciencedirect.com/topics/engineering/motion-compensation
Further, motion compensation is at the core of modern compression, where much of the lossiness centers on finding the best-“match” DCT macroblocks.
https://web.stanford.edu/class/ee398b/handouts/lectures/02-Motion_Compensation.pdf
(Stanford.EDU Bernd Girod: EE398B Image Communication)
The presentation titled “Overview: motion-compensated coding” states the following; a minimal block-matching sketch appears after the quoted list:
“”Block-matching algorithm
- Subdivide current frame into blocks.
- Find one displacement vector for each block.
- Within a search range, find a best “match“ that minimizes an error measure””
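A minimal Python sketch of that block-matching loop, using an exhaustive search over a small range and the sum of absolute differences (SAD) as the error measure; block and search sizes are illustrative.

```python
# Minimal sketch: exhaustive block matching with a sum-of-absolute-differences
# (SAD) error measure. Block size and search range are illustrative.
import numpy as np

def best_motion_vector(ref_frame, cur_frame, top, left, block=16, search=8):
    target = cur_frame[top:top + block, left:left + block].astype(np.int32)
    best, best_sad = (0, 0), None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + block > ref_frame.shape[0] or x + block > ref_frame.shape[1]:
                continue  # candidate block would fall outside the reference frame
            candidate = ref_frame[y:y + block, x:x + block].astype(np.int32)
            sad = int(np.abs(candidate - target).sum())
            if best_sad is None or sad < best_sad:
                best_sad, best = sad, (dy, dx)
    return best, best_sad  # displacement vector and its error measure
```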
The same Stanford-EDU presentation has an illustrative “History of motion-compensated coding” page, which adds perspective to the overall subject of DCT, macroblocks and motion-vector applications over a progression of years. It succinctly points out increasing complexity: the standards have become very complex as they progress, in order to improve quality and compression ratio (CR) and to achieve extraordinarily high video resolutions such as 4K.
https://web.stanford.edu/class/ee398b/handouts/lectures/02-Motion_Compensation.pdf
Drawing 14: Stanford-EDU, History of motion-compensated coding
2. More Factors Affect both Lossy & Low-Loss Video Compression
Discussion of the various twists and turns in the art of video compression can be as endless as variations on a play’s manuscript. Different papers emphasize different methods, applied in a mix of manners, to take advantage of the use case and of what the hardware, processor and transmission networks can reasonably perform.
2.1 Compression Latency versus Complexity & Data Volumes
Some of the functional stages of compression are the DCT (discrete cosine transform) and the DWT (discrete wavelet transform). These are math formulas typically operated on macroblock-sized portions of the main raster image.
This version of the paper does not review the DCT and DWT, except to provide links to other white papers and webpages that cover those math equations. The DCT and DWT are well established as callable, pre-made software routines and pre-made multi-clocked hardware logic.
2.2 Compress Steps: Pixel Change Map, Pixel Quantize, Macroblock Data-Push & DCT
There are many variations possible, with video compression. The very best compressions need to test the computer-hardware platform upon which they are operating, as for what acceleration hardware is Available. And further to ongoing test the compression performance, for quality, vs processor and memory bandwidth there is available. In battery products, power drain limits will also limit compression quality.
A semi-intelligent software flow chart can form a compression plan that learns. It “learns” by recording how well each method performed on the given HW platform, and by taking any user input to the compression application into account.
If video compression is real-time, such as on a cell phone, it is critical not to overload the transmission (compressor) task; overloading the phone’s processor and memory either makes the overall phone experience poor or quickly drains its battery.
Abbreviations:
“Send-to-Target” compressed data ready for storage (buffer) or transmission, abbr. to “STT”
“Frame-Identification” (of a frame in a long series of frames), abbr. to “Frame-ID”
“Pixel-Change-Map” from one frame to the next, abbr. to “PCM”
“Original-Frame” Data, abbr. to “Org-Frame”
2.2.1 Common 1st Steps of Low-Loss/Low-Latency and Lossy/High-Latency Compression
Video movies of people or scenery differ in method from flat-background remote desktops, yet both still share a common first group of compression steps.
Step 1: “Pixel Change Map” (PCM) creation. Test the whole frame for all pixels changed from the previous frame (use the two-phase-locked-frames feedback-loop HW-accel).
Step 2: Quantize pixel data. Use the Pixel Change Map to prioritize which pixels to work on, then quantize the new set of pixel values.
Step 3: Count what percentage of the whole new Org-Frame has changed, relative to a selected pixel-value change threshold. Maintain a per-line count of changed pixels that exceed the pre-selected threshold.
Step 3-b: If HW resources and the latency budget allow, also run the change-map feedback loop at a 90-degree rotation to produce a Pixel Change Map (PCM).
Step 4: If the new Org-Frame shows a very low percent change (YUV or RGB color), jump to Method C.
If the new Org-Frame shows a modestly low percent change and is RGB color, jump to Method B.
If the new Org-Frame shows a high percent change and is YUV color, jump to Method A. (A minimal software sketch of Steps 1 through 4 follows.)
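The sketch below models Steps 1 through 4 in plain C, purely to show the bookkeeping. In the hardware method the per-pixel compare is done by the phase-locked raster feedback loop described later; the percent-change thresholds and the 16-bit pixel format here are illustrative assumptions, not values from the paper.

#include <stdint.h>
#include <stddef.h>

enum method { METHOD_A, METHOD_B, METHOD_C };

/* Software model of Steps 1-4: build the Pixel Change Map (one byte per pixel
 * for clarity; one bit would do), count the changed pixels, and dispatch to a
 * compression method.  Thresholds are examples only. */
enum method build_pcm_and_dispatch(const uint16_t *old_frame,
                                   const uint16_t *new_frame,
                                   uint8_t *pcm, size_t num_pixels,
                                   int frame_is_yuv)
{
    size_t changed = 0;
    for (size_t i = 0; i < num_pixels; i++) {
        pcm[i] = (old_frame[i] ^ new_frame[i]) ? 1 : 0;   /* Step 1: change map */
        changed += pcm[i];
    }
    double pct = 100.0 * (double)changed / (double)num_pixels;  /* Step 3 */

    if (pct < 1.0)                        /* very low change: Method C          */
        return METHOD_C;
    if (!frame_is_yuv && pct < 20.0)      /* modestly low change, RGB: Method B */
        return METHOD_B;
    return METHOD_A;                      /* high change, YUV: Method A         */
}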
2.2.2 Method-A: Moderate Lossy, High-Latency, High CR, YUV-color, DCT-MB moderate matches, Substantial use of Motion Vector, Constant Frame Rate
Security cameras, video phones and entertainment video streams typically have a moderate frame rate of 20 to 30 FPS (frames per second). Motion-Vector/Compensation is underlined in the heading to draw attention to this method difference.
After the first common steps (noted in the section above), the steps proceed as follows for YUV security camera, video phone and entertainment video stream compression.
Step 5: Create 16×16 macroblocks by pixel data push (use the narrow-column raster MB-push HW-accel).
Step 6: From the PCM, pursue the DCT only on the columns that need compression; generate DCTs on the pixel data pushed from the narrow-column rasters.
Step 7: Test every macroblock (MB) against the previous frame’s MBs; if changed, mark it with the Frame-ID and place it in the STT buffer.
Step 8: Test the changed MBs against previous frames’ MBs for matches at new spatial locations in the display frame, creating motion vectors for MB re-use.
Step 9: Time-forward test (error test) the MB DCTs that were placed in the STT buffer for prediction accuracy against newer Original-Frames, where the PCM of the newer Original-Frame’s MBs dictates the need.
Step 10: For real-time video compression, test the receiver’s MB-DCT buffer while a restored frame is being built.
Step 11: If the MB DCTs test good, and motion compensation (use of the motion vectors) has placed the reusable rebuilt MBs in all rebuilt frame locations, then display.
Step 12: If time and hardware availability permit before the next new frame is introduced, the macroblock creation loop can be fully revisited with smaller block sizes such as 8×8 pixels, giving higher resolution in all or portions of the frames. Quantization can also be improved, such as by using smaller groupings of quantized pixels, to better represent rare pixel values that would otherwise fall out of range and be replaced with the nearest palette values.
2.2.3 Method-B: LowLoss, Low-Latency, Moderate CR, RGB-color, DCT-MB exact matches, Limited or No use of Motion Vector, Adaptive Frame Rate.
Remote desktop often uses an RGB color scheme and typically has a lower frames-per-second (FPS) rate (an adaptive FPS that changes as needed) than camera video or entertainment video streams.
Also, over years of advancement in Graphics Processor Unit ICs (GPUs), the video drivers and the operating-system software teams have collaborated more closely on linking the graphics card’s video driver to compression process management.
Motion-Vector/Compensation is missing in this Desktop-Compression example.
The video driver, which has innate knowledge of the video hardware and how it is being used, knows for instance whether off-screen virtual hardware windows are located in other areas of the display card’s video memory, and whether those hardware virtual windows contain low-change-rate content such as a word processor or high-change-rate content such as video movie or video camera playback.
The video driver controls a hardware mouse pointer that is almost exactly like a video sprite of the early Atari video game systems. The hardware overlay-color-key (aka sprite) mouse is recreated at the remote PC in the same manner it existed at the source PC, simply as an X, Y location on the whole display plus a definition of the mouse pointer image. (See the Hardware Mouse Sprite Transparent Overlay section.)
After the first common steps (noted in the section above), Method-B proceeds as follows for RGB computer desktop compression for remote display.
Step 5: Create 16×16 macroblocks (aka macrocells) by pixel data push (use the narrow-column raster MB-push HW-accel).
Step 6: From the PCM, pursue the DCT only on the columns that need compression; generate DCTs on the pixel data pushed from the narrow-column rasters.
Step 7: Test every macroblock (MB) against the previous frame’s MBs; if changed, mark it with the Frame-ID and place it in the STT buffer.
Step 8: Loop-test (error test) the MB DCTs that were placed in the STT buffer for accuracy against the Org-Frame, where the PCM dictates the need. Use the narrow-column raster MB-push HW-accel to pump the decompressed frame buffer against the Org-Frame raster pixels for testing.
Step 9: For real-time video compression, the receiver may already possess the MB-DCT buffer and be building a frame that is not yet displayed.
Step 10: If time and hardware availability permit before the next new frame is introduced, the macroblock creation loop can be fully revisited with smaller block sizes such as 8×8 pixels, giving higher resolution in all or portions of the frames. Quantization can also be improved, such as by using smaller groupings of quantized pixels, to better represent rare pixel values that would otherwise fall out of range and be replaced with the nearest palette values.
2.2.4 Method-C: Low Data Volume, Low Latency, Few Pixel Changes Next Frame (YUV or RGB)
Scientific data, stock-market numeric values and industrial processes are typical of these kinds of screen data. Statistical review of the real-time Pixel Change Map indicates which type of video use case is in play.
After the first common steps (noted in the section above), the steps below proceed for this low-change-rate use case.
Step 5: From the PCM, act on the changed portions of the quantized Org-Frame data with Huffman encoding, and possibly delta encoding (in less common cases). (A minimal sketch of the delta/run-length idea follows this list.)
Step 6: Track pixel-change counts, both how many pixels changed and by how much the values changed within each changed pixel. This is a pseudo-statistical process.
Step 7: STT only if network bandwidth permits. If there is not enough bandwidth, fall back to Method-B, then return to Method-A when display change rates settle down.
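The sketch below illustrates only the delta idea mentioned in Step 5: it delta-encodes one quantized row and run-length packs the zero deltas that dominate when few pixels change. The byte format and function name are invented for the example, and a Huffman pass (not shown) would normally follow.

#include <stdint.h>
#include <stddef.h>

/* Delta-encode one row of quantized pixels, then run-length pack zero runs.
 * Illustrative output format: a 0x00 escape byte followed by the run length
 * for runs of unchanged values, otherwise the raw (non-zero) delta byte.
 * Returns the number of bytes written; 'out' must hold up to 2*len bytes. */
size_t delta_rle_row(const uint8_t *row, size_t len, uint8_t *out)
{
    size_t o = 0, i = 0;
    uint8_t prev = 0;
    while (i < len) {
        uint8_t delta = (uint8_t)(row[i] - prev);
        if (delta == 0) {
            /* count the run of consecutive equal values (delta == 0) */
            size_t run = 1;
            while (i + run < len && row[i + run] == row[i + run - 1] && run < 255)
                run++;
            out[o++] = 0x00;
            out[o++] = (uint8_t)run;
            prev = row[i + run - 1];
            i += run;
        } else {
            out[o++] = delta;            /* never 0x00, so no escaping needed */
            prev = row[i];
            i++;
        }
    }
    return o;
}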
2.2.4.1 Compression “Quality” User Entry Settable Value or Set of Values
Below are text and an image from James Babin at MyEasyTek, reviewing user-controlled compression settings for security cameras. The example provides for adjusting the proportion of I-Frames (recall that I-Frames contain 100% original macroblocks); raising it increases the data to be stored or transmitted but reduces errors when the video is decompressed. Below, I-frame counts are adjusted relative to counts of B and P frames.
Drawing 15: From MyEasyTek IP Camera Settings, Compression “Quality” Control
“Everyone’s instinct is to turn these settings up as high as they go, but less is sometimes more. What we want to focus on is providing a smooth, crystal clear image that uses the least amount of bandwidth possible. The image at the top of the page is a screenshot of the IP Camera settings I typically use.
As you can see, the Main Stream and Sub Stream are separated into two columns. The Main Stream is typically what’s used for recording, which makes image quality of high importance. The substream is most commonly used to view the camera feed over the internet, which makes bandwidth management of greater importance than image quality.
The aggressiveness at which image quality or bandwidth is prioritized is based on the usage and the network environment. For example, if we have an office with a single IP Camera that records to an SD card then bandwidth becomes less important. Likewise, if we have an office that has 64 cameras recording to an NVR then bandwidth management becomes critical. Most real life use cases will lie somewhere in between.
Keep in mind that bandwidth usage directly relates to storage consumption. An IP camera outputting 4096 Kb/s will require four times the storage capacity as an IP camera streaming at 1024 Kb/s over the same amount of time. Encode Mode: My default IP camera encode mode is H.265.
Always choose the highest level of compression in this order: H.265, H.264H, H.264, and MJPEG. The only reason to go down a compression level is if there’s a compatibility issue. For instance, certain viewing software may not be able to view recorded H.265 footage, in which case defaulting to H.264 may be necessary. “
Next is another example of detail, from the security camera professionals at IPVM, regarding the complex selection of factors that affect video compression algorithm adjustments.
https://ipvm.com/reports/video-quality
(by IPVM team) “The fact that two exact shots (“or set of frames”.. added note this paper) with the same resolution can look significantly different has a number of important implications. Inside, we explain why, covering:
Quantization levels
Bandwidth vs. quality loss
Image quality examples
Manufacturer differences
MBR/VBR/CBR impact
Smart codec impact (“aka smart compression algorithm that may use a combination of both software and HW acceleration circuits”.. added note this paper)
Quantization Levels …Regardless of codec used (H.264, H.265, MJPEG, etc.), all IP cameras offer quality levels, often called ‘compression’ or ‘quantization’.
H.264 and H.265 quantization is measured on a standard scale ranging from 0 to 51, with lower numbers meaning less compression, and thus higher quality. If this seems counterintuitive to you, it is understandable, but these are simply the measurements defined in H.264 and H.265 standards. “
Dithering can also optionally be employed with the quantization process, where two original image pixels side by side are replaced with two others, much as black paint and white paint are mixed to obtain gray paint. This process is used extensively in printed pictures and images such as comic books and newspapers.
https://www.howtogeek.com/745906/what-is-dithering-in-computer-graphics/
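As an illustration of the dithering idea described above, here is a minimal ordered-dithering sketch in C. The 4×4 Bayer threshold matrix and the 8-bit-to-4-bit reduction are assumptions chosen for the example, not a method taken from the linked article.

#include <stdint.h>

/* Classic 4x4 Bayer threshold matrix (values 0..15). */
static const uint8_t bayer4[4][4] = {
    {  0,  8,  2, 10 },
    { 12,  4, 14,  6 },
    {  3, 11,  1,  9 },
    { 15,  7, 13,  5 },
};

/* Reduce an 8-bit grayscale plane to 16 levels with ordered dithering:
 * neighbouring pixels are nudged up or down by the threshold pattern so the
 * local average still approximates the original value. */
void dither_to_4bit(uint8_t *plane, int width, int height)
{
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            int v = plane[y * width + x];
            int t = bayer4[y & 3][x & 3];        /* per-pixel threshold 0..15 */
            int q = (v + t) >> 4;                /* quantize to 0..16         */
            if (q > 15) q = 15;
            plane[y * width + x] = (uint8_t)(q * 17);  /* re-expand to 0..255 */
        }
    }
}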
In some cases, artists are employed in this process to manually select a limited palette that works best for the meaning and message of the artistic graphic material. This can coincidentally improve compression and allow near-perfect decompression.
https://visual.ly/community/Infographics/entertainment/comic-book-color-palettes
This next paper from Lighterra, “Video Encoding Settings for H.264 Excellence,” mentions “quality” 124 times and is a valuable resource on the many H.264 compression details and trade-offs.
https://www.lighterra.com/papers/videoencodingh264/
(Jason Robert Cary Peterson, Apr 2012)
“…learn about the various settings and tradeoffs,….
The high quality of x264 encoding is primarily due to…
aggressive motion-estimation search, which helps find as much temporal and spatial redundancy in the image as possible, using a large number of initial candidate predictors followed by a complex, uneven multi-hexagon search (with early exit for speed), followed by sub-pixel refinement using full rate-distortion optimization to account for the real, final cost-vs-benefit of each choice
excellent bitrate control/distribution, using macroblock-level analysis (“MB-tree”) to track the degree of referencing of each macroblock through the actual motion vectors from future frames, allowing the encoder to only lower the quality in the areas of each frame which are changing rapidly (not referenced much in the future), rather than lowering the quality of the whole frame as in most encoders – essentially traditional bitrate control but applied at the level of each 16×16 macroblock rather than at the whole-frame level – which helps maintain clear, stable backgrounds in the presence of moving foreground objects
intelligent, adaptive, variable use of B-frames, rather than just using a fixed pattern like IBBPBBPBBPBB as in most encoders, to make better use of the available bitrate by inserting the more expensive but higher-image-quality I- and P-frames where they’re of most benefit to serve as reference frames, which is good at all times but is particularly important during fades (one of the hardest things to compress well)
adaptive quantization, which varies the quantizer for each individual macroblock within each frame to avoid blur/blocking in flat areas containing fine/edge detail, such as calm water, sky and animation
full rate-distortion optimization used for motion-vector refinement, macroblock partitioning (subdividing each macroblock, balancing the cost of additional motion vectors against the benefit of the less complex residual image left to encode), and final quantization (the key lossy step!), which selects locally-optimal motion vectors, macroblock partitioning and quantization based on cost-vs-benefit using the real, actual cost of each possible choice when that choice is processed right through to final entropy encoding, versus the image-quality benefit as measured by the RDO metric (see below)
a “psycho-visual” rate-distortion optimization metric, which tries to match perceived visual quality better by de-emphasizing blurry “low-error but low-energy” choices, rather than using simpler metrics like sum of absolute differences (SAD), peak signal-to-noise ratio (PSNR) or structural similarity of images (SSIM), which all tend to lean towards low numeric pixel differences but too much blur”
3. Raster Hardware Accel Methods: Macroblock Compression Steps
Applying raster hardware acceleration to
- Dual Raster-Feedback-Loop Pixel Change Map (PCM) tracking real-time,
- Raster Macroblock creation for I-Frames (whole display),
- Raster Macroblock creation for only changed P and B Frames,
- Motion Vector “Interframe Prediction”,
provides a massive speed up of the process and lowering of wattage used.
The Dual Raster Feedback Loop method appears to be the lowest cost in both wattage and circuitry, since Commercial Off-the-Shelf (COTS) GPUs can be used to produce the old frames and, just as important, to control the temporal timing of old frames relative to current frames. This creates the real-time Pixel Change Map.
The example drawings of hardware raster scanning for macroblock generation show a column process. This can also be done in rows, and often is, as a second pass, especially in off-line compression, followed by testing which scan pattern, columns or rows, produces the better compression ratio (CR). The column version is commonly referred to as a 90-degree raster macroblock process, in that it is 90 degrees off the row scan most commonly seen in displays and in video camera sensor data transfers.
The real-time Pixel Change Map (PCM) process remains a row process, since that matches the video scanning of a display monitor or camera, allowing a raster scan that is already occurring to double as the PCM process. See the detail section “HW Accel: GPU Real-time, Feedback Loop: Old/New Frames = Pixel Change Map”.
I-Frame hardware acceleration is the process of pushing all macroblocks as narrow columns, using vertical blank as the trigger to load a new column start address in video memory, so that a dual-head COTS GPU IC, which in typical use would scan large display areas, instead scans a series of narrow rasters (typically 8 or 16 pixels wide). The process then repeats round-robin to cover the equivalent of a whole display screen area. See the detail section “HW Accel: Macroblock Push, I-Frame (whole Display) toward DCT-Math”.
P- and B-frame hardware acceleration is the process of pushing ONLY the macroblocks that have changed, as directed by the Pixel Change Map (PCM). The PCM process creates the set of Start Addresses (SA) and Vertical Totals (VT), the line counts defining how tall each macroblock zone is to be; a zone may be as small as one macroblock or as tall as a whole column of macroblocks. A full column in a raster may also contain changed macroblocks that are not contiguous (i.e., some macroblocks in a column of the larger frame are skipped). See the detail section “HW Accel: I, P, B Frames Macroblock Push toward DCT-Math”.
Interframe prediction is done by comparing the changed macroblocks against the set of precisely buffered changed macroblocks, to identify contiguous macroblock zones in a new frame that can then be directed at the decompression buffer, blitter-copying that set of re-used decompressed MBs into the zone. This is the core of the motion estimation process.
As such, interframe prediction combines the above process to obtain the motion vector, which is then used for motion estimation and, in reverse, at the decompression stage, where groups of macroblocks are re-used for future frame predictions. See the previous section “Interframe Prediction, Macroblocks, Motion Estimation” and the detail section “HW Accel PCM controls Only-Changed Macroblock Push, P-B-Frames toward DCT”.
Multiple Virtual Machines Desktop Compression is the application of the preceding raster HW acceleration methods, the PCM and macroblock processes, to serve many displays with a single dual-head GPU. Typically the many displays are not seen locally but are the displays of “virtual machines” (VMs) or “virtual containers” (VCs), one per user. See the detail section “Multiple Virtual Machines Desktops Video Compression w/ one GPU-IC”.
For example, a powerful server motherboard with multiple cores can handle the compute load of 10 VMs or VCs, but not the video compression work of 10 virtual screens, each needing remote desktop compression. The one dual-head GPU jumps its compression process from VM/VC user 1 to 2, 3, ... 10 in round-robin fashion, using the Vertical Total event, Vertical Blank or Vertical Interrupt (all roughly the same event per hardware raster frame).
3.1 List of Hardware IP for new IC and COTS GPU’s (some free)
Many of the processes discussed in this paper require raster-period phase control, where the pixels of multiple rasters need temporal control accurate to one pixel (often in the range of 20 nanoseconds down to 500 picoseconds). That raster phase-locking technology, described in US-6,262,695, is free (expired patent).
MiMax technology with abbreviated description and hyper-links to PDF’s
US-6262695 Method phase-locking a plurality of display devices, multi-level driver
[phaselock COTS GPU Rasters] ( free-expired )
US-8139072 Network hardware graphics adapter compression
[real-time pixel change map (PCM) phaselocked dualhead GPU]
US-8441493 Network hardware graphics adapter compression
[realtime PCM, 6 virtual machine displays phaselocked dualhead GPU]
US-10499072 MacroCell (macroblock) display compression multi-head raster GPU
[display list applied to COTS GPU for macroblock creation]
These methods repurpose Commercial Off-the-Shelf (COTS) GPUs to stream the video through a feedback loop of phase-locked rasters, providing the old and new frames with the lowest cost in hardware and wattage, and doing so in real time, i.e., with effectively zero latency.
Technically the DCT is a lossless process. But in most actual uses of the DCT in video compression it is applied in a low-loss manner, as the programmer or hardware engineer averages some pixels, or limits the number of possible pixel values, before the multi-pixel macroblock of data is DCT-processed, to reduce the byte count of the compressed result.
3.2 HW Accel: I, P, B Frames Macroblock Push toward DCT-Math
The steps listed in the “Typical Compression Steps” section were first done in software when video encoding was first pursued in lab environments, and to a limited degree on 1980s personal computers, mostly with assembly code running on the main processor. Proprietary codecs (AV compression algorithms) are typically patented or trade-secret variations of the public-domain compression flow charts.
This paper’s video compression variation is specifically tilted toward maximizing raster hardware acceleration methods that can be applied to existing COTS GPUs (video controller ICs) and to COTS video raster engines, free or sold, for use in new ICs or FPGAs. These HW-accel methods should reduce circuit cost, wattage and latency in products while improving the final imagery viewed by users on the decompression side for both use-case types:
A) Lossy movies/videophone and B) Low-Loss remote desktop.
Whether the data stream is a motion picture, a video phone call or a remote desktop, and whether the pixel data is YUV or RGB, these hardware (HW) acceleration methods reduce the latency of calculating the compressed data and improve error resolution, for a more accurate stored or transmitted series of frame images.
The following are two HW-raster-acceleration compression process examples:
A) narrow-column raster macroblock (MB) push-to-DCT-Math HW acceleration, and B) two temporally phase-locked frames used in a feedback loop for pixel-change-detection HW acceleration.
Method (A) pushes the entire display frame of an I-frame into MB columns to feed the DCT process circuits. Method (B) pushes only changed MBs to the DCT process for P-frames and B-frames.
3.2.1 HW Accel: Macroblock Push, I-Frame (whole Display) toward DCT
This next HW-accel stage forces a GPU’s raster hardware to scan the video raster in narrow columns, such as 16 pixels wide, so that the data push-flows (push-streams) in real time as raw uncompressed macroblocks toward the DCT math. Vertical Blank triggers the Start Address (SA) register to load a new SA; the process is round-robin, typically at the video frame rate, so as to push all narrow columns of macroblocks per frame.
Drawing 16: HW-Acceleration I-Frame Pixel Data Push of MB’s to DCT Process
In the above process the narrow columns, 16 pixels wide, act as narrow rasters that flow out in order, left to right. In most GPUs the rasters can also operate rotated 90 degrees spatially in this same process. The 90-degree version allows macroblocks to be built and run all the way through the DCT process to see whether compressing at 90 degrees reduces the size of the list of DCTs for a full frame.
Below is a zoom-in of one macroblock, 16 lines tall and 16 pixels wide (represented as a rectangle, since many display systems and full display rasters use non-square pixels, and because the rectangle fits these smaller drawings).
Drawing 17: Zoom-in 16 lines of 16×16 MB push to DCT
In the example above, a full “I” frame (I-frame) of MBs is pushed in narrow-column order, one small 16-pixel row, then the next small pixel scan row, and so on, toward the DCT radix process to make DCT macroblocks (DCT-MBs). The data-push processing for an I-frame, being very orderly, can be considered the simplest case.
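A sketch of the register sequencing implied above is shown below, assuming a linear 16-bit framebuffer. The helper functions (write_start_address, write_raster_width, wait_for_vblank) are hypothetical stand-ins for whatever register-access mechanism a given COTS GPU exposes; they are not a real driver API.

#include <stdint.h>
#include <stdio.h>

#define FRAME_WIDTH   1920       /* pixels per full display line              */
#define BYTES_PER_PIX 2          /* 16-bit color assumed for the example      */
#define COLUMN_WIDTH  16         /* macroblock / narrow-raster width          */

/* Hypothetical register helpers; here they only log the programmed value.
 * The line-to-line stride stays at the full FRAME_WIDTH, so each narrow
 * raster walks down one 16-pixel-wide column of the stored frame. */
static void write_start_address(uint32_t off) { printf("SA    <- 0x%08x\n", off); }
static void write_raster_width(uint32_t px)   { printf("width <- %u px\n", px);   }
static void wait_for_vblank(void)             { /* would block on the vblank IRQ */ }

/* Round-robin the raster head across all narrow columns of one I-frame.
 * Each vertical blank re-arms the Start Address so the head scans the next
 * 16-pixel-wide column, streaming raw macroblocks toward the DCT. */
void push_iframe_columns(uint32_t frame_base_offset)
{
    const int num_columns = FRAME_WIDTH / COLUMN_WIDTH;

    write_raster_width(COLUMN_WIDTH);            /* narrow raster, full height */
    for (int col = 0; col < num_columns; col++) {
        uint32_t sa = frame_base_offset +
                      (uint32_t)col * COLUMN_WIDTH * BYTES_PER_PIX;
        wait_for_vblank();                       /* vertical blank = reload point */
        write_start_address(sa);                 /* next column begins next frame */
    }
}

The same sequencing, with the start addresses stepped down the rows instead, would implement the 90-degree variant described above.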
3.2.2 HW Accel: GPU Real-time, Feedback Loop: Old/New Frames = Pixel Change Map
Below is a link to MiMax technology for phase-lock control of rasters; see the free IP from MiMax Inc covered in the old, expired patent from the year 2000. Also used herein is US-10499072, macrocell display compression with a multi-head raster GPU.
The feedback loop of two rasters requires phase control (the free Phase-Lock (PL) process of the expired patent US-6,262,695 is recommended); the process can be reviewed in other MiMax white papers and in the patent PDF.
PL process is also described at “AVR” site for users of Atmel/Microchip MCU IC products
However, phase locking (i.e., slave rasters that track a master for timing) display or camera rasters for video types such as HDMI, DisplayPort or SMPTE video streams remains a currently useful subject. Often camera systems, such as on an automobile, need graphics overlaid on top.
This method of pixel clock subtraction will also work with two similar cameras to create a stereo camera system. In theory an unlimited number of slave rasters can be used for multiple displays or overlays, to compress video in real time, or to display all changing pixels, such as on a real-time military radar display, with very little added circuitry.
Once the rasters are phase locked, a feedback loop of two rasters in a single GPU IC can be set up to detect all changed pixels in real time. The Pixel Change Map (PCM) that flows out of the feedback loop is then used to guide the MB data push (to the DCT process), which can be limited to ONLY the changed MBs. The drawing below shows this process in one unit of a dual-head Graphics Processor Unit IC.
Next is a detailed drawing showing the Pixel Change Map selecting which macroblocks are pushed to the DCT process. The dual-head GPU IC in the drawing above has two video streams exiting two LVDS ports (nicknamed “raster-heads” or “raster-engines” in the graphics IC industry).
The two raster heads (drawing below) must be in phase lock to an accuracy of a few nanoseconds (less than one pixel clock). This has been implemented with 3-nanosecond gates, which were common in ICs prior to the year 2000 on SVGA display systems. Now, in 2022, this pixel-clock-subtraction method of forcing multiple raster heads into phase lock can be applied to 4K video using present-day picosecond gates.
Drawing 18: Two PhaseLocked Rasters Feedback Loop HW-Accel Create Pixel Change Map
The second raster data (created by the feedback loop) acts to re-create the video stream, but one frame old (temporally old) in time.
3.2.2.1 Detail Connections for PhaseLocking Two Rasters
Next is a detail of phase locking the two raster engine heads in the COTS GPU. This was done in the lab and in shipped graphics card products with a dual-head COTS GPU-IC that had an external pixel clock input pin. When designing new GPU ICs it is a simple matter to offer an external pixel clock input pin and to control its optional use with a GPU register bit.
Below are a zoom-in detail and a deeper detail of the pixel clock subtraction used in a dual-head GPU IC to phase lock two rasters: the raster of the current video frame and a second raster that is temporally period-locked. The second raster re-creates the video stream, but one frame old (temporally old) in time.
Drawing 19: Detail PhaseLock Two Rasters via Pixel Clock Subtractor Connections
Drawing 20: Deep Detail PhaseLock Pixel Clock Subtractor
This example shows the two vertical syncs two pixel clocks out of sync (such as during system boot-up). Normally everything is in full phase lock and no pixel clocks need be subtracted. The system settles into full phase lock within several hundred milliseconds on most HD, SVGA and similar systems.
If the two rasters share the same timing and resolution, then only the “XOR” gate and “AND” gate are needed; no polarity, phase-lag (edges less than one pixel clock) or width correction is required. Once the rasters are phase locked, the new-frame-raster and one-frame-old-raster outputs of the GPU both start at upper-left pixel 1 at the same instant in time, making all of the following HW-accelerated video compression processing temporally coherent.
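The following toy C simulation may help make the XOR/AND idea concrete: each raster is reduced to a pixel counter with a vsync window, and the slave’s clock-enable is the AND of the pixel clock with NOT(master vsync XOR slave vsync). The frame length, sync width and two-clock boot-up offset are illustrative numbers, not real video timing.

#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

#define TOTAL_CLOCKS 1000u   /* pixel clocks per frame in this toy timing */
#define SYNC_WIDTH   10u     /* clocks for which vsync is asserted        */

static bool vsync(uint32_t counter) { return counter < SYNC_WIDTH; }

int main(void)
{
    uint32_t master = 0;
    uint32_t slave  = 2;     /* boot-up case: slave leads by two pixel clocks */

    for (uint32_t tick = 0; tick < 5 * TOTAL_CLOCKS; tick++) {
        bool phase_err = vsync(master) != vsync(slave);   /* XOR gate */
        master = (master + 1) % TOTAL_CLOCKS;             /* free-running master */
        if (!phase_err)                                   /* AND gate: clock enable */
            slave = (slave + 1) % TOTAL_CLOCKS;           /* else this clock is swallowed */
    }
    printf("phase error after settling: %d clocks\n", (int)master - (int)slave);
    return 0;
}

Running it shows the slave giving up exactly the disputed clocks and then tracking the master indefinitely, which is the settling behavior described above.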
3.2.2.2 HW-Accel Pixel Change Map Detail
Below is a detail drawing showing the Pixel Change Map selecting which macroblocks are pushed to the DCT process, with zoom-in detail of the SA and VT values in the display list that manages the values in the hardware raster control registers of the GPU (aka video controller, aka raster engine).
Drawing 21: Pixel Change Map Controls SA & VT values in HW Raster Control Registers
The macroblock data push to the DCT is detailed in the raster drawing, which shows a screen where the real-time PCM data selects addressed macroblocks for the GPU hardware raster scan. This is where the display-list stack of Start Address (SA) and Line Count, aka Vertical Total (VT), register values is used, as directed by the real-time PCM process, which typically also runs at the same frame rate as the stream being compressed.
3.2.3 HW Accel PCM controls Only-Changed Macroblock Push, P-B-Frames toward DCT
Interframe (inter-frame) prediction detail is included in this section, since P and B frames, which are constructed in time with prediction from detected motion vectors, use this process of tracking changed macroblocks as directed by the real-time pixel change map.
Drawing 22 : Real-Time Pixel Change Map (PCM), Controls ONLY changed MB’s push to DCT
Check marks designate macroblocks whose pixels will be pushed to the DCT process. The next drawing details the Start Address (SA) register and Vertical Total (VT) register values.
Drawing 23: HW-Acceleration P-Frame Pixel-Data-Push Only Changed Macroblocks
Comparing the changing macroblock groups and finding matching groups of historic macroblocks produces the motion vector, which drives the interframe predictive build even before the new frame has arrived, allowing comparison for error checking of compression accuracy.
The example drawings:
- “HW-Accel Real-Time Pixel Change Map (PCM), Controls ONLY changed MB’s push to DCT”
- “HW-Acceleration P-Frame Pixel-Data-Push Only Changed Macroblocks”,
of a “P” or “B” frame typically produce less than a full frame of raw uncompressed MBs that need to be pushed to the DCT process. In hardware acceleration terms, this can be nicknamed Random-Access-Hardware-Accel-Macroblock-Push (RAHAMP).
The Pixel Change Map (PCM) can direct the uncompressed MB pixel-data pushes not across all columns, as in an “I” frame 16×16 MB push, but ONLY to the MBs that matter because they contain pixels changed from the previous uncompressed frame.
Normally a GPU raster has but one SA value and one VT value (often 12 bits and 11 bits) that tell the raster engine where the upper-left corner of the display starts (consider it pixel “1”) and where the lower-right corner of the same display ends (consider it pixel “last”). It is a bit of an oversimplification, but the raster scan of a whole display frame begins (after each vertical blank) at SA and ends when the line count matches VT. The next frame repeats the process.
The significant difference from the I-Frame case (where the whole frame of MBs is pushed to the DCT math) is that P and B frames push to the DCT process only the MBs that have changed pixels, with the pixel change map (PCM) guiding that process.
In this hardware acceleration, SA and VT are updated many times throughout a full frame, in effect making the GPU’s raster engine scan not just narrow columns but only the macroblocks that changed; the rapidly changing values of SA and VT make the raster hardware pump only the desired (changed) macroblocks.
SA and VT are each updated into a single hardware register. Updating the register values is done while gating (stopping) the raster-scan pixel clock, so as not to waste memory bandwidth or wattage while the Start Address (SA) register is updated for whichever macroblock location is to be pumped next. This resembles a display-list process, much like how 1980s Atari and Amiga gaming hardware triggered on vertical line counts to update video control registers, producing more complex visual displays with lower-cost hardware.
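Below is a minimal C sketch of such a display-list build, assuming 16×16 macroblocks, 2 bytes per pixel and a per-macroblock change map derived from the PCM. The structure layout and function name are illustrative, since the real register formats depend on the specific GPU.

#include <stdint.h>
#include <stddef.h>

/* One display-list entry: where the raster restarts (Start Address) and how
 * many lines it runs before the next reload (Vertical Total as a line count). */
struct dl_entry {
    uint32_t start_address;   /* byte offset into video memory */
    uint16_t vertical_total;  /* lines to scan from that address */
};

#define MB_SIZE 16            /* 16x16 macroblocks assumed throughout */

/* Build the SA/VT display list from a per-macroblock change map (1 = this MB
 * contains changed pixels).  Vertically contiguous changed MBs in the same
 * column are merged into one taller entry, as described in the text above.
 * 'pitch' is the byte width of one full display line; 'frame_base' is the
 * frame's offset in video memory.  Returns the number of entries written. */
size_t build_display_list(const uint8_t *mb_changed, int mb_cols, int mb_rows,
                          uint32_t frame_base, uint32_t pitch,
                          struct dl_entry *list, size_t max_entries)
{
    size_t n = 0;
    for (int col = 0; col < mb_cols; col++) {
        int row = 0;
        while (row < mb_rows && n < max_entries) {
            if (!mb_changed[row * mb_cols + col]) { row++; continue; }
            int run = 1;                    /* count contiguous changed MBs */
            while (row + run < mb_rows && mb_changed[(row + run) * mb_cols + col])
                run++;
            list[n].start_address = frame_base
                                  + (uint32_t)row * MB_SIZE * pitch
                                  + (uint32_t)col * MB_SIZE * 2; /* 2 bytes/pixel assumed */
            list[n].vertical_total = (uint16_t)(run * MB_SIZE);
            n++;
            row += run;
        }
    }
    return n;
}

An entry list built this way would then be walked during the gated-pixel-clock pauses, one SA/VT pair per reload, in the spirit of the Atari/Amiga display lists mentioned above.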
3.3 HW Accel: Multiple Virtual Machines Desktops Video Compression w/ one GPU-IC
Multiple Virtual Machines Desktop Compression is the method by which the HW-accelerated Pixel Change Map (PCM) and macroblock processes use a single dual-head GPU for many virtual displays. Typically the many displays are not seen locally but are the displays of “virtual machines” (VMs) or “virtual containers” (VCs), one per user (aka “virtual user”).
For example, a powerful server motherboard with multiple cores can handle the compute load of 10 VMs or VCs, but not the video compression work of 10 virtual screens, each needing remote desktop compression.
The one dual-head GPU steps its compression process from VM/VC user 1 to 2, 3, ... 10 in round-robin fashion, using the Vertical Total event, Vertical Blank or Vertical Interrupt (all roughly the same event per hardware raster frame). Each VM/VC user display is located in a separate zone of the GPU-IC’s video memory. Higher-end GPU cards often have 8 or more gigabytes of memory, making this ample for each user to have an HD display at 16-bit color depth.
The next drawing shows 1 to 16 virtual machine users’ desktops and how each desktop gets a contiguous raster scan, then the next user’s desktop, and so on up to 16, using one GPU raster for the I-frame macroblock data push to the DCT process. In theory the two raster heads could push two different users’ rasters, but that would not produce an overall speed-up of compression.
The better effect on compression temporal-spatial accuracy is to use the two rasters to move two narrow columns within any one chosen user’s desktop. For simplicity, in the next drawing whole I-frames are data-pushed toward either the pixel change map process or the multi-narrow-column MB data push to the DCT process, to show the display-list method.
Drawing 24: HW-Accel 16 VM’s share One GPU-IC for Pixel Change Map Process
For 16 HD displays: 1920×1080 = 2,073,600 pixels, or 4,147,200 bytes per virtual user at 16-bit color (about 4 MB), so 16 users require a baseline minimum of 16 × 4 MB = 64 MB of shared graphics card memory. Thus the now-common 1 GB graphics (GPU) cards are quite ample for the user rasters, the hardware-sprite mouse and application hardware windows. See the links in this paper for US-8,441,493.
The next drawing shows the video memory accesses for this system of 16 VMs sharing one dual-head GPU-IC to create the 16 Pixel Change Maps (PCMs), the most memory-access-intensive portion of the first compression step for the remote desktop function.
The HW pixel change map controls the P-Frame macroblock push to the DCT so that only changed macroblocks are pushed. Memory is accessed in contiguous bursts the same size as the desktop frames (such as HD 1920×1080), through the two raster heads feeding the old-frame/new-frame per-pixel XOR change-compare hardware.
Drawing 25: HW-Accel 16 VM’s share One Dual-Head GPU-IC, Contiguous Memory Burst, Pixel Change Map Process for each Display’s P-Frame
The drawing above illustrates 16 same-resolution virtual displays, representing up to 16 virtual machines, with 16 distinct compressed data flows to 16 remote users in the P-Frame process. The vertical blank interrupt (from vertical sync) in the dual-head video controller (aka dual-head GPU-IC) directs the raster scan to step through a series of new Start Addresses (SAs), one per display. The Vertical Total (VT) register needs no updating, since the rasters all share the same line-count value. All 16 users’ virtual displays are processed by the dual-head phase-locked rasters for the Pixel Change Map process in round-robin order.
Yet more sub-functions can be combined in multiple-desktop compression for server systems that rent out virtual machines. For example, priority users’ mouse movements on their remote systems can cause the round-robin loop on the display list to grant additional contiguous video-memory time periods to their priority PCM or MB push to the DCT for I- or P-frames.
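A minimal sketch of the round robin described in this section is below. The per-VM table, the weighting used to model priority users, and the register-write stub are assumptions for illustration; the real mechanism is the vertical-blank-driven display list of the referenced patents.

#include <stdint.h>
#include <stdio.h>

#define NUM_VMS        16
#define VM_FRAME_BYTES (1920u * 1080u * 2u)   /* HD at 16-bit color, ~4 MB per VM */

/* Per-VM state: where its desktop lives in video memory and how many
 * round-robin slots it gets per cycle (priority users get more). */
struct vm_slot {
    uint32_t frame_base;   /* video-memory offset of this VM's desktop       */
    uint8_t  weight;       /* 1 = normal, >1 = priority (e.g. active mouse)  */
};

static struct vm_slot vms[NUM_VMS];
static int current_vm = 0;
static int slots_left = 1;

/* Stand-in for the real register write that retargets the phase-locked
 * raster pair (PCM pass) or the narrow-column push at a VM's desktop. */
static void program_raster_base(uint32_t off) { printf("raster base <- 0x%08x\n", off); }

/* Called on each Vertical Total / vertical blank event. */
void on_vertical_blank(void)
{
    program_raster_base(vms[current_vm].frame_base);   /* work on this VM now  */
    if (--slots_left <= 0) {                           /* its turn is used up  */
        current_vm = (current_vm + 1) % NUM_VMS;       /* round robin onward   */
        slots_left = vms[current_vm].weight;           /* priority VMs get more */
    }
}

int main(void)
{
    for (int i = 0; i < NUM_VMS; i++) {
        vms[i].frame_base = (uint32_t)i * VM_FRAME_BYTES;  /* 16 x 4 MB = 64 MB */
        vms[i].weight = 1;
    }
    vms[3].weight = 3;                        /* e.g. user 3 is moving the mouse */
    slots_left = vms[current_vm].weight;
    for (int vb = 0; vb < 24; vb++)           /* simulate a couple dozen vblanks */
        on_vertical_blank();
    return 0;
}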
4. Implementing Video Compression in End Products, More factors
Ultimately, software management of the hardware-accelerated pixel and macroblock processes is always required, and the same is true for the hardware data transfer circuits that move video and audio data through networks. For this review of network behavior, radio and Ethernet are closely comparable, and both strongly affect the design of video transmission (compression side) and reception (playback side).
4.1 Open & Closed Loop Data Transfer Physical Layer Issues
When TV and radio grew rapidly in the first half of the 20th century, the public and the commercial sectors consuming audio and video quickly accepted them as imperfect technologies. Sound would not always be intelligible, and images would not always be viewable for brief moments when signal-noise events occurred. When radio signal noise arose from weather or from greater distance to the transmitter, glitches (video or audio) were simply accepted. Similarly, Ethernet and USB both have protocols that tolerate lossy signal reception in much the same way radio does.
Ethernet-TCP and USB-Control are data packet types suited to managing the larger Ethernet-UDP and USB-Isochronous packets that carry YUV or RGB compressed video data over networks. These smaller TCP and USB-Control packet types have built-in error testing and re-transmission close to the hardware level of the communication circuits.
For compressed video streams, Ethernet-TCP and USB-Control packet types maintain connection control; these signal types have built-in packet error checking so that control data arrives at the receiving system in a low-latency fashion and without errors. Ethernet-TCP and USB-Control packets are automatically re-transmitted upon errors, with the logical layers above the Ethernet and USB PHY layers performing the TCP and Control packet auto-corrections. Operating systems, hardware drivers and user application software handle the frame data, macroblock data and corrections when an Ethernet-UDP or USB-Isochronous large data packet is lost or damaged (see next section).
4.2 Ethernet-UDP & USB-Isochronous Packets (similar to lossy radio data)
For the very high data volumes of audio and video, the data, even when compressed, is handled very much like TV and popular AM/FM audio radio over unidirectional radio transmission. The most popular packet types for the large data blocks of compressed video or audio are Ethernet-UDP and USB-Isochronous. Thus, if a packet is lost due to lightning interference or brief signal blockage, the remote user simply sees or hears a glitch.
Accepting video or audio glitches is a more viable approach than restarting the stream, which would break the constant frame rate the human brain prefers. Consider viewing security camera video: if this glitch-acceptance concept were not used, the cameras would forever build up more and more latency, up to hours or days, unless some rather expensive FIFO process accommodated a variable frame rate, or frames were skipped to bring the video back into sync with real time. Likewise for audio, listeners would be far less pleased with the audio stalling to catch up after brief data losses. In recent years, with digitally compressed audio and video, even over radio, a small amount of time-forward buffering is used to remove data-loss glitches.
Putting this in the context of a full end-to-end compressed and viewed video stream transmission requires a further look at the TCP and Control packets of Ethernet and USB respectively: these outer packets are used by the transmitting and receiving software, working together, to monitor for loss of the lossy (missing or damaged) UDP and Isochronous packets of Ethernet and USB respectively.
4.3 Mix of Open, Semi Closed & Closed Loop Processes
Video as commonly experienced today is carried by inner-loop transport packets, outer-loop transport packets, and outer-outer-loop management of the compression and decompression engines; this arrangement, so widely used today, is the culmination of thousands of man-years of engineering work.
To sum up how this transport process interacts with hardware compression and decompression, the next drawing shows the arrangement of the loop processes, or, better, the loop control.
Model Predictive Control (MPC) coincidentally shares its abbreviation with Media Player Classic (MPC), and both involve loop control with inner and outer processes. The drawing below shows that two types of uncompressed video can enter (and exit) the system on two types of hardware: Ethernet-based networks or USB hardware transport.
The drawing below borrows portions of the classic control-system drawing (applicable to manufacturing plants and to video streaming) from:
https://genomics.lbl.gov/~amenezes/papers/ifac2011mpc.pdf
“Stable Hierarchical Model Predictive Control Using an Inner Loop Reference Model”
Drawing 26: Real-Time Ethernet or USB Video Stream, overall Model Predictive Control Loops
Additionally, reference source code for Media Player Classic (MPC) can be found at the mpc-hc GitHub link in the Index of Links below. This code interacts with the Ethernet packets when used over a network.
Additional information on video compression with MPC interaction is at the neurips.cc link below.
https://proceedings.neurips.cc/paper/2021/file/2eb5657d37f474e4c4cf01e4882b8962-Paper.pdf
“Data Sharing and Compression for Cooperative Networked Control……
(Cheng, Pavone, Katti, Chinchali, Tang1)
Our work is broadly related to information-theoretic compression for control as well as task-driven representation learning ….
…. also inspired by Shannon’s rate-distortion theory….
…..In this application, a mobile video client stores a buffer of video segments and must choose a video quality to download for the next segment of video. The goal is to maximize the quality of video while minimizing video stalls, which occur when the buffer under-flows while waiting for a segment to be downloaded.”
4.4 Interlace vs Progressive Scan, Old Method for Real-time Bandwidth Compression
Interlace is not much discussed in this paper, as it is a form of real-time analog radio bandwidth compression for video transmission. It can inflict lossy damage on the video, depending on the quality of the camera. Overall, interlace does not reduce the total byte count of a video frame, a still image or a set of frames.
Most modern video is stored as progressive (non-interlaced) data. Most old interlaced video from the 20th century is converted to non-interlaced form when re-sampled for re-storage or playback on the ubiquitous progressive display systems of today.
Interlacing of live-action scenes by low-cost video cameras produced considerable movement artifacts, the so-called “comb” effect, due to having two shutter events per whole frame (the odd lines, then the even lines).
High-quality cameras captured video in a single shutter event per frame and then transmitted the data interlaced. Raw video from a high-quality single-shutter-per-frame camera (as it should be) gains no compression benefit (no reduction in total data per frame) from interlaced transmission over radio or cable. Thus interlace methodology for video is rapidly becoming rare. Interlace may remain in a small percentage of radio use cases, and it is not illogical to speculate that macroblocks in a frame, or multi-frame jumps in a long video movie, could be radio-transmitted in interlaced fashion where severe radio bandwidth limits are an issue.
Typically, analog interlaced video is completely re-assembled into progressive frames before digital re-compression is performed. Interlaced video, again like YUV, was a carefully considered decision by the TV development engineers, driven by the severe radio-signal bandwidth limitations of the time.
https://en.wikipedia.org/wiki/YUV
YUV was invented when engineers needed color television in a black-and-white infrastructure.[7]
They needed a signal transmission method that was compatible with black-and-white (B&W) TV while being able to add color…. …. standard NTSC TV transmission format of YUV420
https://www.mathworks.com/matlabcentral/answers/1634390-convert-rgb-to-yuv-and-convert-yuv-to-rgb
5. Audio, Sampling Periods for Compression, Much longer than for Video
Audio sampling windows for the popular MP3 format can be quite long, since the cosine-based compression examines sample data representing frequencies as low as 20 Hz. Thus, keeping the audio voice of a call in sync with the faster video compression techniques requires a mix of methods and added functions to keep audio and video playback synchronized, often involving software timers that reproduce a mostly low-frequency audio clock.
5.1 Audio: Low Loss, Low Compression Ratio, Fast (Low Latency)
Wav files are a type of low-loss encoding of audio stream data. Philips and Sony invested considerable funds and work in the early 1980s to create the audio CD of approximately 45 minutes of audio wave files. Research into sample rate and bit depth for the very best trade-offs, along with the work on D/A circuits and the analog playback filters, opened a whole new world for digital music storage, listening and editing.
By the mid-1990s audio wav files were considered rather primitive, and MP3 compression, with FFT trig methods (summation of a group of cosines with corresponding frequency, phase and amplitude data), took over, providing roughly the same listening quality at roughly a 10X file-size reduction.
https://pythonnumericalmethods.berkeley.edu/notebooks/chapter24.02-Discrete-Fourier-Transform.html
“The Fourier Transform can be used for this purpose, which it decomposes any signal into a sum of simple sine and cosine waves that we can easily measure the frequency, amplitude and phase”
5.1.1 Synchronization of Audio and Video
Overall, in the syncing of audio to video as commonly done today over networks, and even in how TV equipment plays back a video disc, the audio decompression is largely an asynchronous process. The resulting audio streams carry markers used later to re-align them with the video. Covering video-audio sync methods fully would make this paper even longer; the link below can serve for context.
https://mimax.com/optimizing-embedded-hardware-using-interleave-memory-phase-locking/
“White Paper: Optimizing Embedded Hardware Using Interleave Memory & Phase Locking
Synchronize Simple Sample Audio to Other Bus Processes”
The MiMax paper covers a less common but dramatically simpler video-with-audio transport method, in which audio samples, such as wav-file samples, are tagged to the video horizontal blank periods; this gap in the horizontal video data occurs at a rate of typically 14 to 30 thousand times per second, depending on video resolution and frame rate. For systems like automotive backup cameras, which may also need to sample outside audio to aid safe driving, this simpler synchronous method of audio transport keeps the video and audio inherently in sync.
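As a rough numeric sketch of the ratio this method relies on (with an assumed NTSC-like line rate and a 48 kHz sample rate, neither taken from the MiMax paper):

#include <stdio.h>

/* Rough numbers only: how many audio samples fall within each horizontal
 * line period for an assumed display timing, to show why tagging samples to
 * the line rate keeps audio and video inherently in step. */
int main(void)
{
    const double line_rate_hz   = 15734.0;  /* assumed NTSC-style line rate       */
    const double sample_rate_hz = 48000.0;  /* common audio sample rate (assumed) */

    double samples_per_line = sample_rate_hz / line_rate_hz;
    printf("%.3f audio samples per video line\n", samples_per_line);
    /* About three samples land in each line period at these rates, so a simple
     * per-line counter is enough to keep playback synchronized. */
    return 0;
}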
6. Summary
The hardware video acceleration methods herein were described with enough detail to assess their general value for placement in new video IC designs, in new Graphics Processor ICs, or embedded in glue logic between existing GPU ICs. The paper also implicitly indicates how compression software would call routines to load hardware register values and/or act on vertical blank interrupts and similar signals from the video hardware.
Multiple dual-head GPUs can be further linked by phase locking all of their raster engines, so that parallelism speeds up compression even more through simultaneous compression sub-tasks. If the two or more rasters working on video compression sub-tasks are not phase locked, the sub-task completion events will be temporally incoherent; the usual fix, an outer control loop over all the sub-tasks that waits for the slowest sub-task to complete, unfortunately wastes resources.
When multiple GPUs are phase locked as a hardware raster-engine group working on a main task, a particular frame or set of temporally related frames, the results of the multiple sub-tasks will have a time-coordinated temporal relationship (and that is a good thing).
The paper reviewed hardware raster acceleration circuits for:
- Dual Raster-Feedback-Loop Pixel Change Map (PCM) tracking real-time,
- Raster Macroblock creation for I-Frames (whole display),
- Raster Macroblock creation for only changed P and B Frames,
- Motion Vector “Interframe Prediction” ,
- Multiple virtual machine or container displays on 1 server, compressed on 1 GPU IC,
- Phase Locking any two similar Rasters via Pixel Clock Subtraction (free)
7. Index of Links
https://www.gamasutra.com/view/feature/3090/image_compression_with_vector_.php?print=1
https://www.cse.unr.edu/~bebis/CS474/Handouts/ImageCompression.pdf
https://en.wikipedia.org/wiki/Quantization_(image_processing)
http://paulbourke.net/dataformats/compress/
http://paulbourke.net/dataformats/
https://en.wikipedia.org/wiki/Lossless_compression
DESIGN-OF-IMAGE-COMPRESSION-ALGORITHM-USING-MATLAB.pdf
http://dvd-hq.info/data_compression_1.php
https://helpful.knobs-dials.com/index.php/VNC_notes#VNC
https://blog.tedd.no/2011/04/28/optimizing-rdp-for-casual-use-windows-7-sp1-remotefx/
https://www.ijettjournal.org/volume-4/issue-8/IJETT-V4I8P124.pdf
https://www.helpwire.app/blog/remote-desktop-protocol/
https://discover.realvnc.com/what-is-vnc-remote-access-technology
https://superuser.com/questions/1583138/improve-mouse-input-lag-over-remote-desktop-connection
https://retrogamecoders.com/amos-basic-bobs-and-sprites/
https://patentcenter.uspto.gov/#!/applications/08176128
https://www.freepatentsonline.com/10276131.pdf
http://www.libpng.org/pub/png/book/chapter08.html
https://pi.math.cornell.edu/~web6140/TopTenAlgorithms/JPEG.html
https://www.ijettjournal.org/volume-4/issue-8/IJETT-V4I8P124.pdf
https://www.sciencedirect.com/topics/computer-science/signal-denoising
https://www.semanticscholar.org/
https://www.x265.org/compare-video-encoders/
https://ottverse.com/i-p-b-frames-idr-keyframes-differences-usecases/
https://web.stanford.edu/class/ee398a/handouts/lectures/EE398a_MotionEstimation_2012.pdf
https://www.fourcc.org/yuv.php
https://forum.videohelp.com/threads/359657-If-I-convert-from-YUV-to-RGB-I-loss-quality
https://github.com/joedrago/colorist/issues/26
https://forums.ni.com/t5/Machine-Vision/yuv-histogram-compared-with-rgb/td-p/1003576
https://www.jai.com/products/line-scan-cameras/3-sensor-r-g-b-prism/
https://docs.adaptive-vision.com/current/avl/functions/ImageColorSpaces/index.html
https://www.elitedaily.com/news/world/quarter-population-can-see-all-colors-chart/953812
https://petapixel.com/2018/09/19/8-12-14-vs-16-bit-depth-what-do-you-really-need/
https://forums.windowscentral.com/windows-10/370852-windows-10-no-16-bit-color-depth-options.html
https://blog.dhampir.no/content/remote-desktop-does-not-support-colour-depth-24-falling-back-to-16
https://www.amd.com/en/support/kb/faq/dh-007#faq-Pixel-Format-Overview
https://www.avsforum.com/threads/determining-display-panel-bit-depth.2424330/
https://www.sciencedirect.com/topics/computer-science/1-compression-ratio
https://www.maketecheasier.com/how-video-compression-works/
https://ottverse.com/i-p-b-frames-idr-keyframes-differences-usecases/
https://en.wikipedia.org/wiki/Motion_compensation
https://www.cmlab.csie.ntu.edu.tw/cml/dsp/training/coding/motion/me1.html
https://web.stanford.edu/class/ee398b/handouts/lectures/02-Motion_Compensation.pdf
https://www.sciencedirect.com/topics/engineering/motion-compensation
https://myeasytek.com/blog/ip-camera-configuration/
https://ipvm.com/reports/video-quality
https://www.howtogeek.com/745906/what-is-dithering-in-computer-graphics/
https://www.creativebloq.com/advice/how-to-colour-comics
https://visual.ly/community/Infographics/entertainment/comic-book-color-palettes
https://www.lighterra.com/papers/videoencodingh264/
US-6262695 Method phase-locking a plurality of display devices, multi-level driver
[phaselock COTS GPU Rasters] ( free-expired )
US-8139072 Network hardware graphics adapter compression
[real-time pixel change map (PCM) phaselocked dualhead GPU]
US-8441493 Network hardware graphics adapter compression
[realtime PCM, 6 virtual machine displays phaselocked dualhead GPU]
US-10499072 MacroCell (macroblock) display compression multi-head raster GPU
[display list applied to COTS GPU for macroblock creation]
https://genomics.lbl.gov/~amenezes/papers/ifac2011mpc.pdf
https://github.com/clsid2/mpc-hc
https://proceedings.neurips.cc/paper/2021/file/2eb5657d37f474e4c4cf01e4882b8962-Paper.pdf
https://en.wikipedia.org/wiki/YUV
https://www.mathworks.com/matlabcentral/answers/1634390-convert-rgb-to-yuv-and-convert-yuv-to-rgb
https://pythonnumericalmethods.berkeley.edu/notebooks/chapter24.02-Discrete-Fourier-Transform.html
https://mimax.com/optimizing-embedded-hardware-using-interleave-memory-phase-locking/
7.1 Index of drawings
Drawing 1: RLE/Huffman & Macroblock Encoding (low latency) RGB Desktop applicable
Drawing 2: Text w/plain Backgrounds, Row/Column Scan to Create Re-usable Macroblocks
Drawing 3: “One frame delay reference” (ijettjounal vol4 issue8)
Drawing 4: Camera Color Image Sensor Bayer Filter Arrangement, translates to RGB or YUV
Drawing 5: Sprite Mouse-Cursor, ColorKey Transparency, & X-Y Data for Desktop Compression
Drawing 6: YUV Frames applicable for DCT Macroblocks & Motion Vectors (high latency compression)
Drawing 7: Diagram of Proposed Video Denoising Algorithm ijettjournal.org/vol-4
Drawing 8: DCT Discrete Cosine Transform Used in a Lossy Compression
Drawing 9: (Intel) Deep Color Support, Applications Convert Display to Desktop Color Depth
Drawing 10: Mixed Color Depth Mode Remote Desktop Compression and Transmission.
Drawing 11: Three Types of Frames used in Inter-Frame Prediction (A.Fox, how-comp-works)
Drawing 12: I-Frames are Full Frame of Macroblocks, P & B Frames Re-Use Some MB’s
Drawing 13: B-Frames Built from MB Copies of Frames both Earlier & Later in Time
Drawing 14: Stanford-EDU, History of motion-compensated coding
Drawing 15: From MyEasyTek IP Camera Settings, Compression “Quality” Control
Drawing 16: HW-Acceleration I-Frame Pixel Data Push of MB’s to DCT Process
Drawing 17: Zoom-in 16 lines of 16×16 MB push to DCT
Drawing 18: Two PhaseLocked Rasters Feedback Loop HW-Accel Create Pixel Change Map
Drawing 19: Detail PhaseLock Two Rasters via Pixel Clock Subtractor Connections
Drawing 20: Deep Detail PhaseLock Pixel Clock Subtractor
Drawing 21: Pixel Change Map Controls SA & VT values in HW Raster Control Registers
Drawing 22 : Real-Time Pixel Change Map (PCM), Controls ONLY changed MB’s push to DCT
Drawing 23: HW-Acceleration P-Frame Pixel-Data-Push Only Changed Macroblocks
Drawing 24: HW-Accel 16 VM’s share One GPU-IC for I-Frame Data-push to DCT Process
Drawing 25: HW-Accel 16 VM’s share One Dual-Head GPU-IC, Contiguous Memory Burst, Pixel Change Map Process for each Display’s P-Frame
Drawing 26: Real-Time Ethernet or USB Video Stream, overall Model Predictive Control Loops