Picture-in-Picture support on Wayland

>[!warning]- Info
> Author: [email protected]
> Status: WIP
> Published date: 2025/04/09

## Protocol proposals

Main proposal: https://gitlab.freedesktop.org/wayland/wayland-protocols/-/merge_requests/132

There have been (long) discussions, and opinions seem divided between two different lines of thought. Quoting [@jadahl's comment](https://gitlab.freedesktop.org/wayland/wayland-protocols/-/merge_requests/132#note_2050883) in the MR linked above:

> Generally there are two opinions of what xdg_pip should be: a output-only window for specific use cases, where interaction is handled by the compositor, or roughly a toplevel that should have special window management behavior. What is in the current XML in this MR is the latter, meaning it contains most of the interaction related requests as xdg_toplevel; if it'd be the former, it'd wouldn't contain things such as 'resize' or 'move', etc.

- The main concerns are about abuse of such functionality and its potential security implications.
- Thus, the challenge is how to mitigate them while still providing the functionality required by most desktop applications, including advanced clients such as web browsers.
- The focus here is to enumerate and double-check web browser/platform requirements and security mitigations, in order to provide helpful input for the protocol discussion.

## Web Specs

### W3C's (video) Picture-in-Picture

- Explainer: https://github.com/w3c/picture-in-picture/blob/main/explainer.md
- Spec: https://www.w3.org/TR/picture-in-picture/
- Status:
  - Spec: [W3C Working draft](https://www.w3.org/standards/types/#x4-1-working-draft)
  - Impl: [Supported everywhere except on Firefox](https://developer.mozilla.org/en-US/docs/Web/API/Picture-in-Picture_API#browser_compatibility)
- Video-only.
- Input limited to a predefined set of actions, handled by the platform (not the application).

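
To make the "video-only, platform-handled input" model concrete, here is a minimal sketch of how a page drives the video Picture-in-Picture API. The member names (`pictureInPictureEnabled`, `pictureInPictureElement`, `exitPictureInPicture()`, `requestPictureInPicture()`) come from the W3C spec; the `VideoLike` / `DocumentLike` interfaces and the `toggleVideoPip` helper are hypothetical names introduced only so the sketch can run outside a browser.

```typescript
// Structural types covering only the members this sketch uses, so the helper
// accepts the real DOM objects as well as plain test doubles.
// (VideoLike, DocumentLike and toggleVideoPip are illustrative names.)
interface VideoLike {
  requestPictureInPicture(): Promise<unknown>;
}

interface DocumentLike {
  pictureInPictureEnabled?: boolean;
  pictureInPictureElement?: unknown;
  exitPictureInPicture?: () => Promise<void>;
}

type VideoPipResult = "entered" | "exited" | "unsupported";

// Enter PiP for `video`, or leave PiP if some element is already in it.
// Note what is absent: the page passes no position, no controls and no input
// handlers. The PiP window and its buttons are entirely up to the platform.
async function toggleVideoPip(
  doc: DocumentLike,
  video: VideoLike,
): Promise<VideoPipResult> {
  if (!doc.pictureInPictureEnabled) {
    return "unsupported";
  }
  if (doc.pictureInPictureElement && doc.exitPictureInPicture) {
    await doc.exitPictureInPicture();
    return "exited";
  }
  await video.requestPictureInPicture();
  return "entered";
}
```

In a browser this would typically be called from a click handler, e.g. `toggleVideoPip(document, videoElement)`. From a Wayland point of view, this is the use case closest to the "output-only window" reading of xdg_pip quoted above.
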
### Document Picture-in-Picture

- Explainer: https://github.com/steimelchrome/document-pip-explainer/blob/main/explainer.md
- Spec: https://wicg.github.io/document-picture-in-picture/
- Status:
  - Spec: [Draft Community Group Report](https://www.w3.org/standards/types/#x2-1-w3c-community-group-report-or-w3c-business-group-report)
  - Impl: [Chromium-only and Desktop-only](https://developer.mozilla.org/en-US/docs/Web/API/Document_Picture-in-Picture_API)
- Chromium design doc: https://docs.google.com/document/d/1zZwiNkLn24SvTMmnXj6AgGK88jZhkaxexUiZr-hOfsU/edit?usp=sharing
- Proposal to expand functionality such that PiP windows could render arbitrary content.

#### Support and discussions

- Chromium: Available as origin trial since Chrome 111 on Desktop
- WebKit: [Position and discussions][dpip-webkit-position]
  - Implementability on iOS is questioned [here][dpip-impl-concern-ios]
- Portability concerns were also raised in the [Chromium Intent to Ship thread][dpip-intent-to-ship], including questions about how important the feature is for mobile in general and whether it is a desktop-only API. Below is the reply from the spec editor:
  > In the future, we may be able to do something on Android that doesn't support input without any change to Android APIs. If we wanted input in the PiP window, it would require changes to Android itself. That said, we haven't seen any interest from web developers for an inputless document PiP on Android, and there isn't really anything you can do with an inputless document PiP that you can't do with a canvas-back video PiP, so we haven't pursued anything on Android.
- Conclusion: the API is currently unfeasible on the major mobile platforms, namely Android and iOS, where the only alternative available to app developers (both native and web) is the video-only input-less API[^1].

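
For reference, the page-facing side of the Document Picture-in-Picture API discussed in this section can be sketched as follows. `requestWindow()` with `width` / `height` options is taken from the WICG draft; the `PipWindowLike` / `DocumentPipLike` interfaces, the `openDocumentPip` helper and the default size are hypothetical, introduced only to keep the sketch self-contained and runnable without a browser.

```typescript
// Minimal structural types for the parts of the API this sketch touches.
// (PipWindowLike, DocumentPipLike and openDocumentPip are illustrative names.)
interface PipWindowLike {
  document: { body: { append(...nodes: unknown[]): void } };
}

interface DocumentPipLike {
  requestWindow(options?: {
    width?: number;
    height?: number;
  }): Promise<PipWindowLike>;
}

// Open a Document PiP window and move arbitrary content into it.
// Returns null where the API is missing (e.g. non-Chromium browsers).
// The page can only suggest an initial size; it passes no position.
async function openDocumentPip(
  api: DocumentPipLike | undefined,
  content: unknown,
  size: { width: number; height: number } = { width: 320, height: 180 },
): Promise<PipWindowLike | null> {
  if (!api) {
    return null;
  }
  const pipWindow = await api.requestWindow(size);
  // Unlike video PiP, the window hosts a regular document, so any node
  // (and therefore any interactive UI) can be placed inside it.
  pipWindow.document.body.append(content);
  return pipWindow;
}
```

In Chromium the first argument would be `window.documentPictureInPicture`, and the call would be made from a user-initiated event handler such as a click. The contrast with the video-only API is that the content is an arbitrary, potentially interactive document, which is what pushes the Wayland discussion towards the "toplevel with special window management behavior" reading of xdg_pip.
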
[^1]: Video-only input-less platform APIs provide a limited list of actions, each with a corresponding visual representation (icon); rendering, compositing, and input are all handled by the platform.

#### Security

- Both Chromium's [Intent to Implement mail thread][dpip-intent-to-impl] and its [design document][dpip-design-doc-sec] mention security concerns and possible mitigations. E.g.:

_From the intent-to-implement mail thread:_

> We've been chatting already, and have enumerated several risks, some of which are already present in the existing video PIP API:
> - If the website can entice the user to drag the PIP window over browser UI, the PIP content could spoof browser UI, like fake a URL in the omnibox (covering the real omnibox). This is partially mitigated by having a drop shadow border around the PIP window.
> - Allowing arbitrary user interaction with the PIP window could definitely be a spoofing/phishing risk, but it would probably be feasible to limit interactions to clicks to reduce the risk.
> - There is also a reverse-clickjacking risk: the PIP window could float over a cross-origin window and entice the user to click, and then disappear at the right moment so that the click passes through to the page underneath. We are not sure how to mitigate this. It would be difficult to pull off, like the first attack in this list (spoofing browser UI), because the attacker would have to get the user to drag the window into the right place, and the window doesn't know its position (just its size).
> - There is a general risk of confusion/annoyance about where the content in the PIP window comes from. There should be a way to show the origin and window controls (like to close the window) in some ideally non-spoofable, always-accessible way.

_From the design doc:_

> - For interactive Picture-in-Picture there are concerns around impersonating system UI. Therefore, we will ensure the UX of the Picture-in-Picture window is distinct enough by adding a border (and maybe an indicator of the origin).
> - We will disable trusted UI (e.g. permission prompts, autofill) and also remove regular keyboard events to reduce the attack surface of the Picture-in-Picture window.

>[!todo] Confirm with spec editor / devs which ones are in place:
> Filed an issue on their GitHub to confirm them:
> https://github.com/WICG/document-picture-in-picture/issues/136
> - [ ] User gestures required to open a PiP window
> - [ ] No keyboard events?
> - [ ] Only mouse events? (what about touch?)
> - [ ] Not resizable? Size constraints?
>     - Enforce at Wayland compositor side?
> - [ ] Decorations enforced by the browser?
>     - From a Wayland perspective, how to enforce this? Server-side decorations? E.g. drop shadows.

[dpip-webkit-position]: https://github.com/WebKit/standards-positions/issues/41
[dpip-impl-concern-ios]: https://github.com/WebKit/standards-positions/issues/41#issuecomment-1257253160
[dpip-intent-to-ship]: https://groups.google.com/a/chromium.org/g/blink-dev/c/JTPl7fM64Lc
[dpip-intent-to-impl]: https://groups.google.com/a/chromium.org/g/blink-dev/c/uK0hyACy_fg/m/JVXGUVylAAAJ
[dpip-design-doc-sec]: https://docs.google.com/document/d/1zZwiNkLn24SvTMmnXj6AgGK88jZhkaxexUiZr-hOfsU/edit?tab=t.0#heading=h.k389oryrnj5o

### Relevant links

- Document types (and status): https://www.w3.org/standards/types
- Document Picture-in-Picture feature on chromestatus: https://chromestatus.com/feature/5755179560337408
- Document Picture-in-Picture intent to implement thread: https://groups.google.com/a/chromium.org/g/blink-dev/c/uK0hyACy_fg/m/JVXGUVylAAAJ
- Chromium's design doc for Document Picture-in-Picture: https://docs.google.com/document/d/1zZwiNkLn24SvTMmnXj6AgGK88jZhkaxexUiZr-hOfsU/edit?usp=sharing
- Picture-in-Picture feature on chromestatus: https://chromestatus.com/feature/5729206566649856