Understanding Swap Use
(aka "swap use for mere mortals, not kernel programmers")
A frequently asked set of questions: "Why doesn't my swapfile reduce in size? I've looked in /proc/<PID>/smaps with various scripts/tools and they all report 0 use. Why is there still swap allocated?"
There are lots of arms and legs to these questions. First, understand that the better way to find out whether swap is actively being used is to watch paging activity: use sar -B (paging statistics) and look at the major faults per second (majflt/s, each one a memory page loaded from disk/swap), alongside sar -r (memory utilization) and /proc/meminfo (see below). Allocation is a red herring; active use is what counts.
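If sar isn't installed, the same "is it actively being used?" question can be roughly answered from the kernel's own counters. The following is a minimal sketch, assuming a Linux host that exposes the pswpin, pswpout and pgmajfault counters in /proc/vmstat; the function names (parse_vmstat, swap_rates, sample) are made up for this illustration and are not part of any tool.

```python
import time

def parse_vmstat(text):
    """Parse /proc/vmstat-style 'name value' lines into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters

def swap_rates(before, after, seconds):
    """Per-second rates for pages swapped in, swapped out, and major faults."""
    keys = ("pswpin", "pswpout", "pgmajfault")
    return {k: (after.get(k, 0) - before.get(k, 0)) / seconds for k in keys}

def sample(interval=5.0, path="/proc/vmstat"):
    """Take two snapshots 'interval' seconds apart and return the rates."""
    with open(path) as f:
        before = parse_vmstat(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_vmstat(f.read())
    return swap_rates(before, after, interval)
```

Calling sample() on a quiet box should give rates at or near zero even when plenty of swap is allocated, which is the whole point: allocated is not the same as active.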
When we're entering a swap situation, the kernel first uses an algorithm to determine what to page out. Remember that you can't actually work with a memory page while it sits in the swapfile; it has to be paged back into RAM for operations. But it is not deleted from the swapfile at page-in time, as that is expensive for no gain, especially if you have to page the exact same data back out later. Thrashing occurs when you're paging in and out rapidly because the active processes need to operate on ALL those pages and can't throw any of them away. Thrash too much, run out of both RAM and swap, and our friend the OOM Killer kicks in and starts killing things.
When the swap page is paged back in, a copy can exist in RAM and in swap at the same time. If the page then changes in RAM (the process does something and alters its data), the swap space blocks are freed (call it a delete on write operation): the RAM copy is now unique and new, and the swap data is truly old and invalid. But if the page hasn't changed and it needs to be swapped out again, it's already in swap, thereby reducing I/O - score! This is what the kernel is counting on: it will not have to physically page out unchanged data.
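To make that bookkeeping concrete, here is a toy model of the behaviour just described. It is a sketch for illustration only: the ToySwap class and its methods are invented here, pages are just dictionary keys, and nothing in it reflects how the kernel actually implements any of this.

```python
class ToySwap:
    """Toy model of 'keep the swap copy until the page changes'. Not kernel code."""

    def __init__(self):
        self.in_ram = {}      # page id -> data currently resident in RAM
        self.in_swap = {}     # page id -> copy held in swap
        self.disk_writes = 0  # pages physically written out to swap

    def page_in(self, page):
        """Bring a page back into RAM, leaving the swap copy in place."""
        if page not in self.in_ram and page in self.in_swap:
            self.in_ram[page] = self.in_swap[page]

    def touch(self, page, data):
        """The process modifies a page, so any swap copy is now stale."""
        self.page_in(page)
        self.in_ram[page] = data
        self.in_swap.pop(page, None)  # the "delete on write"

    def page_out(self, page):
        """Evict a page from RAM; only write if swap holds no valid copy."""
        data = self.in_ram.pop(page)
        if page not in self.in_swap:
            self.in_swap[page] = data
            self.disk_writes += 1

    def swap_cached(self):
        """Pages that currently exist in both RAM and swap."""
        return sorted(set(self.in_ram) & set(self.in_swap))
```

In this model, paging out a page, paging it in, and paging it out again costs one write in total; only a touch() in between forces a second write.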
If you examine /proc/meminfo, the SwapCached field shows how much of swap is actually in this state, even when you can't find it using common tools. So why can't you find it? Because each page would have to be swapped back into RAM for the kernel to determine who is truly left using it, and that's too expensive (and not useful). The best writeup on this complicated procedure is the kernel's swapoff system call doc. By running swapoff on a swap partition/file you're performing this "page in the data. used? nope, delete." procedure over and over on every page in the swap file. This is why it's safer to temporarily add another swapfile (of the same or larger size) when running swapoff, just in case you truly don't have enough RAM to hold what's in swap and accidentally trigger the OOM Killer.
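For a quick look at those numbers without eyeballing the whole file, something like the following would do. It is a minimal sketch, assuming the usual /proc/meminfo layout of "Name:   value kB" lines; the helper names (parse_meminfo, swap_summary, read_swap_summary) are hypothetical and exist only for this example.

```python
def parse_meminfo(text):
    """Return /proc/meminfo-style fields as a dict of ints.

    Most values are in kB; a few (the HugePages_* counts, for example)
    carry no unit and are returned as-is.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields

def swap_summary(fields):
    """Swap that is allocated versus RAM pages that also have a copy in swap."""
    total = fields.get("SwapTotal", 0)
    free = fields.get("SwapFree", 0)
    return {
        "allocated_kb": total - free,
        "swap_cached_kb": fields.get("SwapCached", 0),
    }

def read_swap_summary(path="/proc/meminfo"):
    """Read the live file (Linux only) and summarise the swap fields."""
    with open(path) as f:
        return swap_summary(parse_meminfo(f.read()))
```

read_swap_summary() only reports system-wide totals; it deliberately makes no attempt to attribute swap to individual processes, for the reasons given above.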
This should help explain why that swap still counts as used, per se, until the process that owns the page quits/exits (thereby freeing all memory pointers/reference counts). The kernel at one point swapped the page to disk; it has since been swapped back into RAM (which is what the process now sees), but an identical copy still exists in swap. Upon exit, the in-RAM page is freed, which invalidates the copy in swap, so the delete on write is triggered (here I guess you could say technically it's not a write) and bye-bye goes that allocated swap space. UNLESS it was a shared memory structure and someone else still has a reference/map to it (another can of worms).
This is what I call "opportunistic swapping": the kernel has paged out once and is banking on the possibility that it will happen again. If the page is already in swap, there is no need to incur I/O to write it again. Only the kernel sees this mapping (pages in RAM to their copies in swap), not the actual processes themselves. The goal is to reduce unnecessary disk I/O as much as possible while remaining abstract to the user-level applications.