A look ahead—can virtual be faster than real?

The performance potential doesn't end with removing or slimming down unnecessary layers. If we control the operating system layer and the JVM knows how to talk to it, several previously unavailable pieces of information can be propagated to the JVM, vastly extending the power of an adaptive runtime.

Note

This section is more speculative in nature than the rest of the chapter. While not all of the techniques described herein have been proven feasible in the real world, they still form part of the basis of our belief that high performance virtualization has a bright future indeed.

Quality of hot code samples

One example of where performance potential can be found is in the increased quality of samples for hot code. Recall from Chapter 2, Adaptive Code Generation, that the more samples we get, the better the quality of the code optimizations. When we run our own scheduler and completely control all threads in the system, as is the case in the JRockit VE kernel, the sampling overhead goes down dramatically. There is no longer a need to use expensive OS calls (and ring transitions) to halt all threads in order to find out where in Java their instruction pointers are. Compare this to using green threads for the thread implementation in the JVM, as introduced in Chapter 4. Starting and stopping green threads carries very little overhead, as OS threads are not part of the equation. In this respect, the JRockit VE thread implementation has a lot in common with a green thread approach.

The quality of samples in JRockit VE is potentially comparable to that of hardware-based sampling, also discussed in the chapter on code generation. This helps the JVM make more informed optimization decisions and lets it apply the number of samples required for different levels of reoptimization with higher precision.
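To make what a sample is more concrete, the following toy profiler, written against the public Java API, periodically snapshots every thread's stack and records the topmost frame. It is only an illustration, not how the JRockit sampler actually works: in a standard JVM, taking such a snapshot is comparatively heavy because all threads must first be brought to a halt, which is exactly the cost that shrinks when the JVM runs its own scheduler.

import java.util.Map;

public class ToySampler implements Runnable {
    private final int intervalMillis;

    public ToySampler(int intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                // Heavy operation: every thread is stopped and its stack walked.
                Map<Thread, StackTraceElement[]> snapshot = Thread.getAllStackTraces();
                for (Map.Entry<Thread, StackTraceElement[]> e : snapshot.entrySet()) {
                    StackTraceElement[] stack = e.getValue();
                    if (stack.length > 0) {
                        // Record the topmost frame as one "sample" of hot code.
                        System.out.println(e.getKey().getName() + ": " + stack[0]);
                    }
                }
                Thread.sleep(intervalMillis);
            }
        } catch (InterruptedException ignored) {
            // The sampler was asked to shut down.
        }
    }
}

Such a sampler would be started in a daemon thread, for example as new Thread(new ToySampler(10)).start(), and the frames it records could then be aggregated into a flat profile.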

Adaptive heap resizing

Another example of performance potential in the virtual stack would be enabling adaptive heap resizing for the JVM.

Most hypervisors support the concept of ballooning. Ballooning provides a way for the hypervisor and the guest to communicate about memory usage, without breaking the sandboxing between the guests running on one machine. It is typically implemented with a balloon driver that shows up to the guest as a fake virtual device. The hypervisor can use it to signal to a guest that the hypervisor needs memory back, by "inflating" the balloon driver so that it takes up more of the guest's memory. Through the balloon driver, the guest can efficiently get and interpret the message "release some more memory, or I'll swap you" from the hypervisor, when memory is scarce and needs to be claimed for other guests.

Ballooning may also enable overcommitment of memory, that is, the appearance that the guests together use more memory than is physically available on the hardware. This can be a powerful mechanism, as long as it doesn't lead to actual swapping.

As the Java heap part of the total memory of the virtual machine is orders of magnitude larger than the native memory taken up by the JRockit VE kernel, it follows that the most efficient way to release or claim memory from the hypervisor is to shrink or grow the Java heap. If our hypervisor reports, via the balloon driver, that memory pressure is too high, the JVM should support shrinking its heap through an external API call ("external" meaning exported to the JRockit VE kernel from the JVM). Possibly, this needs to involve triggering a heap compaction first.

The other way around, if too much time is spent in GC, the JVM should ask the kernel whether it is possible to claim more memory from the hypervisor. These "memory hint" library calls, which are no-ops on platforms other than JRockit VE, are unique to JRockit and JRockit VE. They will be part of the platform abstraction layer that the JVM uses for the JRockit VE platform.
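As a thought experiment only, such a platform abstraction layer might expose the memory hints roughly as sketched below. The class and method names (MemoryHints, releaseHeapMemory, requestHeapMemory) are invented for illustration and are not the actual JRockit API.

public final class MemoryHints {

    private MemoryHints() {
    }

    // Balloon pressure: the hypervisor wants memory back. Try to shrink
    // the Java heap by the given number of bytes, possibly triggering a
    // heap compaction first, and return how much was actually released.
    public static long releaseHeapMemory(long bytes) {
        return 0; // no-op on platforms other than JRockit VE
    }

    // Too much time spent in GC: ask the kernel whether more physical
    // memory can be claimed from the hypervisor so that the heap can
    // grow. Returns the number of bytes that were granted.
    public static long requestHeapMemory(long bytes) {
        return 0; // no-op on platforms other than JRockit VE
    }
}

The important design point is that the calls are hints, not commands: the JVM and the kernel negotiate, and on an ordinary operating system the calls simply do nothing.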

Traditional operating systems have no way of hinting to a process that it should release memory or use more. This opens a whole new chapter in adaptive memory management. JRockit VE is thus able to make sure that the running JVM (its single process) uses exactly the right amount of memory, returns unused memory quickly so other guests on the same hardware can claim it, and avoids swapping by dramatically reducing heap size if resources start to be scarce. This makes JRockit VE ideal for Java in a virtualized environment—it quickly adapts to changing situations and maximizes memory utilization even between different guests.

Inter-thread page protection

Removing a traditional operating system from the layer between the JVM and the hardware can also bring other, perhaps rather surprising, benefits.

Consider the standard OS concepts of threads versus processes. By definition, the threads in a process share the same virtual memory; there is no inherent memory protection between threads in the same process. Different processes, however, cannot readily access each other's memory. Now, assume that instead, each thread could reserve memory that would be protected from other threads in the same process as well as from other processes. With such a mechanism in place, a thread trying to access another thread's protected memory in the same process would generate a page fault, similar to what happens when trying to access protected memory in a standard OS. This more fine-grained level of page protection is not available in any normal operating system. However, JRockit VE can easily implement it by changing the concept of what a thread is.

Implementing a quick and transparent process-local page protection scheme in the JVM is impossible in a standard operating system, but quite simple when the JVM is tightly integrated with an OS layer like JRockit VE kernel. Oracle has filed several patents on this technology.

To illustrate why this would be useful, we can come up with at least two use cases where inter-thread (intra-process) page protection would be a very powerful feature for a Java platform.

Improved garbage collection

As we have already discussed in Chapter 3, there are plenty of benefits to thread local object allocation in Java, partly because we avoid repeatedly flushing out new Java objects to the heap, which requires synchronization.

Also recall that many objects in Java die young and can be kept in a nursery for added garbage collection throughput. However, it turns out that several Java applications also tend to exhibit a behavior where many objects are allocated locally in one thread, and then garbage collected before they are seen by (or made available to) other threads in the executing Java program.

It follows that if we had a low-overhead way of extending the thread local allocation areas into smaller, self-contained thread local heaps for objects that have not yet been seen by other threads, immense performance benefits might theoretically be gained. Trivially, a thread local heap could be garbage collected in a lock-free manner—the problem is maintaining the contract that only thread local objects may exist inside it. If all objects were thread local, a completely latency-free and pauseless GC would be possible. Obviously, this is not the case. Thread local heaps could also be garbage collected independently of each other, which would further decrease latency.

Along with the thread local heaps, a global heap (as usual, taking up the largest part of the system memory) would exist for objects that can be seen by more than one thread at the same time. The global heap would be subject to standard garbage collection. Objects in the global heap would be allowed to point to objects in the thread local heaps, as long as the GC is extended to keep track of any global-to-local references.

The main problem with this approach would be detecting when an object changes visibility. Any field store or field load involving a thread local object can make it visible to another thread, and consequently to the rest of the system. This would disqualify it from its thread local heap, and it would have to be promoted to the global heap. For the sake of simplicity, no two objects on different thread local heaps can be allowed to refer to each other.

In order to maintain this contract in a standard JVM running on a standard OS, we would need some kind of expensive read and write barrier code each time we try to access a field in an object. The barrier code would check whether the accessing thread is different from the thread that created the object. If it is, and if the object has not been seen by other threads before, the object would have to be promoted to the common global heap. If the object is still thread local, that is, accessed only by its creating thread, it can remain in its thread local heap.

The pseudocode for the barriers might look something like this:

// Someone reads an object from "x.field".
void checkReadAccess(Object x) {
    int myTid = getThreadId();
    // If this object is thread local and belongs to another
    // thread, evacuate it to the global heap.
    if (!x.isOnGlobalHeap() && !x.internalTo(myTid)) {
        x.evacuateToGlobalHeap();
    }
}

// Someone writes object "y" to "x.field".
void checkWriteAccess(Object x, Object y) {
    // A global object now refers to a thread local one, so the GC
    // must keep track of this global-to-local reference.
    if (x.isOnGlobalHeap() && !y.isOnGlobalHeap()) {
        GC.registerGlobalToLocalReference(x, y);
    }
}

At least on a 64-bit machine, where address space is vast and readily available, a simple way to identify objects belonging to a particular thread local heap would be to use a few bits in the virtual address of an object to tag it with a thread ID. Read and write barriers would then be only a few short assembly instructions for the fast case—check that the object is still thread local. However, even if all accesses were thread local, the check would still incur a code overhead, and make use of precious registers. Each read and write barrier, i.e. each Java field access, would require the execution of extra native instructions. Naturally, the overhead for the slow case would be even more significant.
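As a rough sketch, assuming purely for illustration that the top 16 bits of a 64-bit virtual address are free for tagging and that tag 0 denotes the global heap, the tagging scheme and the fast-path check might look like this. The class and its names are made up and not part of JRockit:

final class ThreadLocalAddressing {

    static final int  TID_BITS  = 16;                // assumed number of tag bits
    static final int  TID_SHIFT = 64 - TID_BITS;
    static final long ADDR_MASK = (1L << TID_SHIFT) - 1;
    static final long GLOBAL_HEAP_TID = 0;           // tag 0 means "global heap"

    // Tag an untagged address with the allocating thread's ID.
    static long tag(long address, long threadId) {
        return (threadId << TID_SHIFT) | (address & ADDR_MASK);
    }

    // Fast-path barrier check: is the object still local to the accessing
    // thread? Just a shift and a compare, but still extra instructions
    // executed on every field access.
    static boolean isLocalTo(long taggedAddress, long threadId) {
        return (taggedAddress >>> TID_SHIFT) == threadId;
    }

    // Does the object live on the global heap?
    static boolean isOnGlobalHeap(long taggedAddress) {
        return (taggedAddress >>> TID_SHIFT) == GLOBAL_HEAP_TID;
    }

    // Strip the tag to recover the real virtual address.
    static long untag(long taggedAddress) {
        return taggedAddress & ADDR_MASK;
    }
}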

Research by Österdahl and others has shown that the barrier overhead makes it impractical to implement thread local garbage collection in a JVM running on a general purpose OS. However, if we had access to a page protection mechanism on a thread level instead of a process level, at least the read barrier would become extremely lightweight. Accessing an object on a thread local heap from a different thread could be made to trigger a fault that the system can trap. This would require no explicit barrier code.

Note

Naturally, even with much more efficient read and write barriers, thread local GC would still add performance overhead in applications where objects frequently need to be promoted to the global heap. The classic producer/consumer pattern, where objects created by one thread are continuously exported to another, is the simplest example of an application completely inappropriate for thread local GC.
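A minimal illustration of that pattern, using an ordinary blocking queue, might look as follows. Every array the producer allocates is handed straight to the consumer, so under a thread local GC scheme each one would hit the slow path and be evacuated to the global heap:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ProducerConsumer {

    public static void main(String[] args) {
        final BlockingQueue<int[]> queue = new ArrayBlockingQueue<int[]>(1024);

        Thread producer = new Thread(new Runnable() {
            public void run() {
                try {
                    for (int i = 0; i < 1000000; i++) {
                        // Allocated locally, but exported immediately.
                        queue.put(new int[64]);
                    }
                } catch (InterruptedException ignored) {
                }
            }
        });

        Thread consumer = new Thread(new Runnable() {
            public void run() {
                try {
                    for (int i = 0; i < 1000000; i++) {
                        // The array is now visible to a second thread.
                        queue.take();
                    }
                } catch (InterruptedException ignored) {
                }
            }
        });

        producer.start();
        consumer.start();
    }
}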

The hope, however, is that in the same way that many applications lend themselves well to generational GC, many applications contain large numbers of thread local objects that are never exposed to the rest of the system before being garbage collected.

The approach described in this section is appealing in that it fits well with the "gambling" approach used in many areas of an adaptive runtime—assume thread locality, which is cheap, and take the penalty if proven wrong. Although this all sounds well and good, an industrial strength implementation of thread local garbage collection would be fairly complex, and not enough research has been done to determine whether it would be of practical use.

Concurrent compaction

Another application of inter-thread memory protection would be for jobs that are hard to parallelize without massive amounts of synchronization. One example would be heap compaction in the garbage collector. Recall that compaction is an expensive operation as it involves working with objects whose references potentially span the entire heap. Compaction using several threads also requires synchronization to do an object trace, and thus is hard to parallelize properly. Even if we split the heap up into several parts and assign compaction responsibilities to different threads, continuous checks are needed when tracing references to see that one compacting thread doesn't interfere with the work of another.

A concurrent compaction operation would potentially be a lot easier and faster if the interference check were handled implicitly by inter-thread page protection. In the event that one compacting thread tried to interfere with the work of another, this could be communicated by a page protection fault rather than by an explicit check compiled into the GC code. The compaction algorithm would then potentially require less synchronization.
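As a very loose sketch, the explicit check that each compacting thread would otherwise need might look something like the following. The region size, the regionOwner table, and the helper methods are all hypothetical. With inter-thread page protection the check disappears: touching a region owned by another compactor simply faults, and the fault handler can hand the work over instead.

final class CompactionSketch {

    static final int REGION_SHIFT = 20;                      // assume 1 MB regions
    static final int REGION_COUNT = 1 << 12;                 // assume a 4 GB heap
    static final int[] regionOwner = new int[REGION_COUNT];  // compactor ID per region

    static int regionOf(long address) {
        return (int) (address >>> REGION_SHIFT) & (REGION_COUNT - 1);
    }

    // Called for every reference a compacting thread needs to update.
    static void updateReference(long refAddress, int myCompactorId) {
        // Explicit interference check, compiled into the GC code on a
        // standard OS. With inter-thread page protection, this branch is
        // unnecessary: the access would fault and be trapped instead.
        if (regionOwner[regionOf(refAddress)] != myCompactorId) {
            handOverToOwner(refAddress);
            return;
        }
        forwardInPlace(refAddress); // fast path: our own region
    }

    static void handOverToOwner(long refAddress) {
        // Enqueue the update for the compactor that owns the region.
    }

    static void forwardInPlace(long refAddress) {
        // Slide the object within the region and patch the reference.
    }
}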
