Discussion
whalesalad: I feel like this could unlock some huge performance gains with Python. If you want to truly "defeat the GIL" you must use processes instead of threads. This could be a middle ground.
fwsgonzo: I actually just published a paper about something like this, which I implemented in both libriscv and TinyKVM, called "Inter-Process Remote Execution (IPRE): Low-latency IPC/RPC using merged address spaces". Here is the abstract:

> This paper introduces Inter-Process Remote Execution (IPRE), whose primary function is enabling gated persistence for per-request isolation architectures with microsecond-latency access to persistent services. IPRE eliminates scheduler dependency for descheduled processes by allowing a virtual machine to directly and safely call and execute functions in a remote virtual machine's address space. Unlike prior approaches requiring hardware modifications (dIPC) or kernel changes (XPC), IPRE works with standard virtualization primitives, making it immediately deployable on commodity systems. We present two implementations: libriscv (12-14ns overhead, emulated execution) and TinyKVM (2-4us overhead, native execution). Both eliminate data serialization through address-space merging. Under realistic scheduler contention from schbench workloads (50-100% CPU utilization), IPRE maintains stable tail latency (p99 < 5us), while a state-of-the-art lock-free IPC framework shows 1,463× p99 degradation (4.1us to 6ms) when all CPU cores are saturated. IPRE thus enables architectural patterns (per-request isolation, fine-grained microservices) that incur millisecond-scale tail latency in busy multi-tenant systems using traditional IPC.

Bottom line: if you're doing synchronous calls to a remote party, IPRE wouldn't require any scheduler mediation. The same applies to your repo. Passing allocator-less structures to the remote is probably a landmine waiting to happen. If you structure both parties to use custom allocators, at least for the remote calls, you can track and even steal allocations (using a shared memory area). With IPRE there is extra risk of stale pointers, because the remote part is removed from the caller's memory after it completes.
The paper will explain all the details, but for example, since we control the VMM, we can close the remote session if anything bad happens. (The paper is not out yet, but it should be very soon.)

The best part about this kind of architecture, which you immediately mention, is the ability to completely avoid serialization. Passing a complex struct by reference and being able to use the data as-is is a big benefit. It breaks down when you try to do this with something like Deno, unfortunately. But you could do Deno <-> C++, for example.

For libriscv the implementation is simpler: just loan remote-looking pages temporarily so that read/write/execute works, and then let exception handling deal with abnormal disconnection.
jeffbee: Looks like you forgot the URL. Interested.
short_sells_poo: How would this really help Python, though? It doesn't solve the difficult problem, which is that Python objects don't support parallel access by multiple threads/processes, no? Concurrent threads, yes, but only one thread can be operating on a Python object at a time (I'm simplifying here for brevity).

There are already ways to pass around bulk data with zero-copy characteristics in Python, but there's a lot of bureaucracy around them. A true solution must work with the GIL (or remove it altogether), no?
tombert: Interesting. I gotta admit that my smell test tells me this is a step backwards; at least naively (I haven't looked through the code thoroughly yet), it kind of feels like we're going back to pre-protected-memory operating systems like AmigaOS; there are reasons we have the process boundary and the overhead associated with it.

If data is being shared zero-copy across threadprocs, what's to stop threadproc B from screwing up threadproc A's memory? And if the answer to this is "nothing", then what does this buy over a vanilla pthread?

I'm not trying to come off as negative; I'm actually curious about the answer to this.
jer-irl: Not negative at all, thanks for commenting. You're right that the answer is "nothing," and that this is a major trade-off inherent in the model. From a safety perspective, you'd need to be extremely confident that all threadprocs are well-behaved, though a memory-safe language would help some. The benefit is that you get process-like composition (separate binaries launched at separate times) with thread-like single-address-space IPC.

After building this, I don't think it's necessarily a desirable trade-off, and decades of OS development certainly suggest process-isolated memory is desirable. I see this more as an experiment in how bending those boundaries works in modern environments than as a practical way forward.
tombert: It's certainly an interesting idea, and I'm not wholly opposed to it, though I certainly wouldn't use it as a default process scheduler for an OS (not that you were suggesting that). I would be very concerned about security: if there's no process boundary saying threadproc A can't grab threadproc B's memory, there could pretty easily be unintended leakage of potentially sensitive data.

Still, I do think there could be an advantage to this if you know you can trust the executables and the executables don't share any memory; if you know you're not sharing memory, and you're only grabbing memory with malloc and the like, then there's an argument to be made that there's no need for the extra process overhead.
fwsgonzo: Best I can do is reply to an e-mail if someone asks for the paper, since it's not out yet. The e-mail ends with hotmail.
philipwhiuk: This is basically 'De-ASLR' is it not?
jer-irl: Could you clarify what you mean by that? This does heavily rely on loaded code being position-independent, because the memory used will go into whatever regions `mmap(..., ~MAP_FIXED)` returns.
PaulDavisThe1st: Are you familiar with the Opal OS from UW CS&E during the 90s? All tasks in a single system-wide address space, with h/w access control for memory regions.

> The Opal project is exploring a new operating system structure, tuned to the needs of complex applications, such as CAD/CAM, where a number of cooperating programs manipulate a large shared persistent database of objects. In Opal, all code and data exists within a single, huge, shared address space. The single address space enhances sharing and cooperation, because addresses have a unique (for all time) interpretation. Thus, pointer-based data structures can be directly communicated and shared between programs at any time, and can be stored directly on secondary storage without the need for translation. This structure is simplified by the availability of a large address space, such as those provided by the DEC Alpha, MIPS, HP/PA-RISC, IBM RS6000, and future Intel processors.

> Protection in Opal is independent of the single address space; each Opal thread executes within a protection domain that defines which virtual pages it has the right to access. The rights to access a page can be easily transmitted from one process to another. The result is a much more flexible protection structure, permitting different (and dynamically changing) protection options depending on the trust relationship between cooperating parties. We believe that this organization can improve both the structure and performance of complex, cooperating applications.

> An Opal prototype has been built for the DEC Alpha platform on top of the Mach operating system.

https://homes.cs.washington.edu/~levy/opal/opal.html
kgeist: From what I remember, in the Linux kernel there's already barely any distinction between processes and threads: a thread is just a process that shares virtual memory with another process, and you specify whether memory should be shared when calling clone().

So don't we already have threads that do exactly what you're trying to do? Isn't it somewhat easier and less risky to just compile several programs into one binary? If you have no control over the programs you're trying to "fuse" (no source), then you probably don't want to fuse them, because it's very unsafe.

Maybe I don't understand something. I think it can work if you want processes with different lib versions or even different languages, but it sounds somewhat risky to pass data just like that (possible data corruption).
otterley: The code uses Linux's clone3() syscall under the hood: see https://github.com/jer-irl/threadprocs/blob/main/docs/02-imp...
wswin: Cool project. I think most real-life problems would be solved with shared memory objects, cf. shm_open [1]. Python has a wrapper in the standard library [2]; not sure about other languages.

1. https://www.man7.org/linux/man-pages/man3/shm_open.3.html
2. https://docs.python.org/3/library/multiprocessing.shared_mem...
kjok: > I actually just published a paper...

This gives me the impression that the paper has already been published and is available publicly for us to read.
fwsgonzo: Sorry about that, the conference was on Feb 2, and it's supposed to be out any day/week now. I don't have a date.
jer-irl: > I think it can work if you want processes with different lib versions or even different languages

This is exactly right: unrelated binaries can coexist, or different versions of the same binary, etc.

> it sounds somewhat risky to pass data just like that

This is also right! I started building an application framework that could leverage this and provide some protections on memory use: https://github.com/jer-irl/tproc-actors , but the model is inherently tricky, especially with elaborate data structures where ABI compatibility can be so brittle.
kgeist: I agree it's a cool hack. I was just trying to understand why plain threads weren't enough.An interesting evolution could be some kind of very cheap memory access instrumentation so that you could whitelist only specific memory ranges for cross-process access (say, ring buffers). Not sure if it's possible though without considerable overhead.