Andrew Cooper [Thu, 15 May 2025 18:01:33 +0000 (19:01 +0100)]
x86/emul: Fix extable registration in invoke_stub()
For exception recovery in the stubs, the registered address for fixup is the
return address of the CALL entering the stub.
In invoke_stub(), the '.Lret%=:' label is the wrong side of the 'post'
parameter. The 'post' parameter is non-empty in cases where the arithmetic
flags of the operation need recovering.
Split the line to separate 'pre' and 'post', making it more obvious that the
return address label was in the wrong position.
However, in the case that an exception did occur, we want to skip 'post' as
it's logically part of the operation which had already failed. Therefore, add
a new skip label and use that for the exception recovery path.
This is XSA-470 / CVE-2025-27465
Fixes: 79903e50dba9 ("x86emul: catch exceptions occurring in stubs")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
scripts/add_maintainers.pl: set double dashes for long options
Current script shows message:
Don't forget to add the subject and message to ...
Then perform:
git send-email -to xen-devel@lists.xenproject.org ...
which has the wrong option '-to'.
This may confuse users.
Use double dashes for long options to avoid that.
Fixes: e1f912cbf717 ("scripts/add_maintainers.pl: New script")
Signed-off-by: Dmytro Prokopchuk <dmytro_prokopchuk1@epam.com>
Reviewed-by: Anthony PERARD <anthony.perard@vates.tech>
Jan Beulich [Tue, 1 Jul 2025 09:24:47 +0000 (11:24 +0200)]
x86/EFI: restrict use of --dynamicbase
At least GNU ld 2.35 takes this option to (also) mean what newer
versions have controllable by --enable-reloc-section. From there being
no relocations in check.efi (as we don't pass the option there) we infer
that we need to involve mkreloc; we'd then end up with two sets of
relocations, which clearly isn't going to work. Furthermore the
relocations ld emits in this case also aren't usable: For bsp_idt[] we
end up with PE_BASE_RELOC_LOW ones, which efi_arch_relocate_image()
(deliberately) doesn't know how to deal with. (Related to that is also
why we check the number of relocations produced: The linker simply
didn't get this right there, yet.)
We also can't add the option to what we use when linking check.efi: That
ld version then would produce relocations, but 4 of them (instead of the
expected two). That would make us pass --disable-reloc-section, which
however only ld 2.36 and newer understand.
For such older binutils versions we therefore need to accept the slight
inconsistency in DLL characteristics that the earlier commit meant to
eliminate.
Fixes: f2148773b8ac ("x86/EFI: sanitize DLL characteristics in binary")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 1 Jul 2025 09:23:59 +0000 (11:23 +0200)]
xen/build: pass -fzero-init-padding-bits=all to gcc15
See the respective bullet point in the Caveats section of
https://gcc.gnu.org/gcc-15/changes.html.
While I'm unaware of us currently relying on the pre-gcc15 behavior,
let's still play safe and retain what unknowingly we may have been
relying upon.
According to my observations, on x86 generated code changes
- somewhere deep in modify_bars(), presumably from the struct map_data
initializer in apply_map() (a new MOVQ),
- in vpci_process_pending(), apparently again from the struct map_data
initializer (and again a new MOVQ),
- near the top of find_cpio_data(), presumably from the struct cpio_data
initializer (a MOVW changing to a MOVQ).
Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Michal Orzel [Fri, 27 Jun 2025 07:06:04 +0000 (09:06 +0200)]
docs: cmdline: Update serial_tx_buffer default value
After commit 4df2e99d7314 ("console/serial: set the default transmit
buffer size in Kconfig"), the default value is set by Kconfig option
CONFIG_SERIAL_TX_BUFSIZE. Moreover it was bumped to 32KB by commit
d09e44e5d8fd ("console/serial: bump buffer from 16K to 32K").
Signed-off-by: Michal Orzel <michal.orzel@amd.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Nicola Vetrini [Mon, 30 Jun 2025 08:06:54 +0000 (10:06 +0200)]
xen: fix unspecified behavior in tr invocation
The result of the command is undefined according to the specification if
the "string2" argument in tr is shorter than "string1". GNU tr behaves
correctly by extending "string2" to repeat the last character.
Fixes: eb61a4fb14d2 ("xen: fix header guard generation for asm-generic headers")
Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Nicola Vetrini <nicola.vetrini@bugseng.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Juergen Gross [Mon, 30 Jun 2025 08:05:42 +0000 (10:05 +0200)]
tools/libxenguest: fix build in stubdom environment
With introduction of the new byteswap infrastructure the build of
libxenguest for stubdoms was broken. Fix that again.
Fixes: 60dcff871e34 ("xen/decompressors: Remove use of *_to_cpup() helpers")
Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jason Andryuk <jason.andryuk@amd.com>
Acked-by: Anthony PERARD <anthony.perard@vates.tech>
Penny Zheng [Mon, 30 Jun 2025 08:04:20 +0000 (10:04 +0200)]
xen/sysctl: make CONFIG_COVERAGE depend on CONFIG_SYSCTL
Users rely on the SYSCTL_coverage_op hypercall to interact with the coverage
data, so the corresponding operations shall be guarded by CONFIG_SYSCTL.
Right now they are compiled under CONFIG_COVERAGE, so we shall make
CONFIG_COVERAGE depend on CONFIG_SYSCTL.
Penny Zheng [Mon, 30 Jun 2025 08:02:35 +0000 (10:02 +0200)]
xen/sysctl: wrap around XEN_SYSCTL_perfc_op
perfc_control() and perfc_copy_info() are responsible for providing control
of perf counters via XEN_SYSCTL_perfc_op in DOM0, so they both shall
be wrapped.
We introduce a new Kconfig CONFIG_SYSCTL, which shall only be disabled
on some dom0less systems or PV shim on x86, to reduce Xen footprint.
Making SYSCTL without prompt is transient and it will be adjusted in the final
patch. And the consequence of introducing "CONFIG_SYSCTL=y" in .config file
generated from pvshim_defconfig is transient too, which will also be adjusted
in the final patch.
Jan Beulich [Thu, 26 Jun 2025 12:59:05 +0000 (14:59 +0200)]
x86/boot: move l<N>_bootmap
Having them in the general .init.data section is somewhat wasteful, due
to involved padding. Move them into .init.data.page_aligned, and place
that right after .init.bss.stack_aligned.
Overall .init.data* shrinks by slightly over 2 pages in the build I'm
looking at.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Frediano Ziglio [Thu, 26 Jun 2025 12:58:10 +0000 (14:58 +0200)]
xen/efi: Handle cases where file didn't come from ESP
A boot loader can load files from outside the ESP.
In these cases the device may not be provided, or the path may be
of an unsupported kind.
Allow booting anyway in these cases; all information
can be provided using UKI or other boot loader
features.
Signed-off-by: Frediano Ziglio <frediano.ziglio@cloud.com>
Acked-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Mykola Kvach [Thu, 26 Jun 2025 12:57:22 +0000 (14:57 +0200)]
xen/char: wrap suspend/resume console callbacks with CONFIG_SYSTEM_SUSPEND
This patch wraps the suspend/resume console callbacks and related code within
CONFIG_SYSTEM_SUSPEND blocks. This ensures that these functions and their
calls are only included in the build when CONFIG_SYSTEM_SUSPEND is enabled.
This addresses Misra Rule 2.1 violations.
Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Mykola Kvach <mykola_kvach@epam.com>
Roger Pau Monne [Mon, 26 May 2025 17:32:21 +0000 (19:32 +0200)]
x86/pdx: simplify calculation of domain struct allocation boundary
When not using CONFIG_BIGMEM there are some restrictions on the address
width for allocations of the domain structure, as its PDX, truncated to 32
bits, is stashed into the page_info structure for domain allocated pages.
The current logic to calculate this limit is based on the internals of the
PDX compression used, which is not strictly required. Instead simplify the
logic to rely on the existing PDX to PFN conversion helpers used elsewhere.
This has the added benefit of allowing alternative PDX compression
algorithms to be implemented without requiring to change the calculation of
the domain structure allocation boundary.
As a side effect introduce pdx_to_paddr() conversion macro and use it.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Wed, 25 Jun 2025 09:00:25 +0000 (10:00 +0100)]
x86/boot: Improve paging mode diagnostics in create_dom0()
I was presented with this:
(XEN) NX (Execute Disable) protection active
(XEN) d0 has maximum 416 PIRQs
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) Error creating d0: -95
(XEN) ****************************************
which is less than helpful. It turns out to be the -EOPNOTSUPP from
shadow_domain_init().
The real bug here is create_dom0() unconditionally assuming the presence of
SHADOW_PAGING. Rework it to panic() rather than choosing a dom0_cfg which is
guaranteed to fail. This results in:
(XEN) NX (Execute Disable) protection active
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) Neither HAP nor Shadow available for PVH domain
(XEN) ****************************************
which is rather more helpful.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Tue, 24 Jun 2025 14:20:52 +0000 (15:20 +0100)]
Revert part of "x86/mwait-idle: disable IBRS during long idle"
Most of the patch (handling of CPUIDLE_FLAG_IBRS) is fine, but the
adjustments to mwait_idle() are not; spec_ctrl_enter_idle() does more than
just alter MSR_SPEC_CTRL.IBRS.
The only reason this doesn't need an XSA is because the unconditional
spec_ctrl_{enter,exit}_idle() in mwait_idle_with_hints() were left unaltered,
and thus the MWAIT remained properly protected.
There are (would have been) two problems. In the ibrs_disable (== deep C) case:
* On entry, VERW and RSB-stuffing are architecturally skipped.
* On exit, there's a branch crossing the WRMSR which reinstates the
speculative safety for indirect branches.
All this change did was double up the expensive operations in the deep C case,
and fail to optimise the intended case.
I have an idea of how to plumb this more nicely, but it requires larger
changes to legacy IBRS handling to not make spec_ctrl_enter_idle() vulnerable
in other ways. In the short term, simply take out the perf hit.
Fixes: 08acdf9a2615 ("x86/mwait-idle: disable IBRS during long idle")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Tue, 1 Apr 2025 14:55:29 +0000 (15:55 +0100)]
x86/idle: Remove MFENCEs for CLFLUSH_MONITOR
Commit 48d32458bcd4 ("x86, idle: add barriers to CLFLUSH workaround") was
inherited from Linux and added MFENCEs around the AAI65 errata fix.
The SDM now states:
Executions of the CLFLUSH instruction are ordered with respect to each
other and with respect to writes, locked read-modify-write instructions,
and fence instructions[1].
with footnote 1 reading:
Earlier versions of this manual specified that executions of the CLFLUSH
instruction were ordered only by the MFENCE instruction. All processors
implementing the CLFLUSH instruction also order it relative to the other
operations enumerated above.
I.e. the MFENCEs came about because of an incorrect statement in the SDM.
The Spec Update (no longer available on Intel's website) simply says "issue a
CLFLUSH", with no mention of MFENCEs.
As this erratum is specific to Intel, it's fine to remove the MFENCEs; AMD
CPUs of a similar vintage do sport otherwise-unordered CLFLUSHs.
Move the feature bit into the BUG range (rather than FEATURE), and move the
workaround into monitor() itself.
The erratum check itself must use setup_force_cpu_cap(). It needs activating
if any CPU needs it, not if all of them need it.
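The any-versus-all distinction can be illustrated with a minimal sketch. The
helper names and the per-CPU array are hypothetical and do not reflect how
Xen actually tracks CPU capabilities:

```c
#include <stdbool.h>

/* True if at least one CPU reports the erratum (the behaviour wanted here). */
bool erratum_on_any_cpu(const bool *affected, unsigned int nr_cpus)
{
    for (unsigned int i = 0; i < nr_cpus; i++)
        if (affected[i])
            return true;
    return false;
}

/* True only if every CPU reports the erratum (too strict for a workaround). */
bool erratum_on_all_cpus(const bool *affected, unsigned int nr_cpus)
{
    for (unsigned int i = 0; i < nr_cpus; i++)
        if (!affected[i])
            return false;
    return true;
}
```

With a single affected CPU in a mixed system, only the "any" form would
activate the workaround.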
Andrew Cooper [Mon, 23 Jun 2025 10:41:39 +0000 (11:41 +0100)]
x86/svm: Revert 1->true conversion in svm_asid_handle_vmrun()
This is literally ASID 1 (of 2^16), not a boolean.
Fixes: 2f09f797ba43 ("x86/svm: Drop the suffix _guest from vmcb bit")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Mykola Kvach [Tue, 24 Jun 2025 12:56:02 +0000 (14:56 +0200)]
xen/common: Guard freeze/thaw_domains functions with CONFIG_SYSTEM_SUSPEND
This patch adds CONFIG_SYSTEM_SUSPEND guards around freeze_domains
and thaw_domains functions.
This ensures they are only compiled into the hypervisor when the system
suspend functionality is enabled, aligning their inclusion with their
specific use case.
This addresses two Misra Rule 2.1 violations.
Signed-off-by: Mykola Kvach <mykola_kvach@epam.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Frediano Ziglio [Tue, 24 Jun 2025 12:55:29 +0000 (14:55 +0200)]
xen/efi: Show error message for EFI_INVALID_PARAMETER error
Show a string message instead of the numeric code.
This error was hit while trying different ways to boot Xen, specifically
when loading xen.efi using the GRUB2 "linux" command.
Signed-off-by: Frediano Ziglio <frediano.ziglio@cloud.com>
Acked-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
... in alignment with the new coding style on word splitting for type
names.
This aligns its name with the largely duplicate boot_module struct
in x86. While there's no equivalent to "struct bootmodules" in x86,
changing one and not the other is just confusing. Same with various
comments and function names.
Rather than making a long subfield name even longer, remove the
_bootmodule suffix in the kernel, initrd and dtb subfields.
Not a functional change.
Signed-off-by: Alejandro Vallejo <agarciav@amd.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Acked-by: Daniel P. Smith <dpsmith@apertussolutions.com>
CODING_STYLE: Custom type names must be snake-cased by word
There's the unwritten convention of splitting type names using
underscores. Add such convention to the CODING_STYLE to make it
common and less unwritten.
Frediano Ziglio [Mon, 23 Jun 2025 08:50:13 +0000 (10:50 +0200)]
xen/efi: Do not check kernel signature if it was embedded
Using UKI it's possible to embed the Linux kernel into the xen.efi file.
In this case the signature for Secure Boot is applied to the
whole xen.efi, including the kernel.
So checking for a specific signature for the kernel is not
needed.
Signed-off-by: Frediano Ziglio <frediano.ziglio@cloud.com>
Reviewed-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Jan Beulich [Mon, 23 Jun 2025 08:49:26 +0000 (10:49 +0200)]
x86/pmstat: restore changes lost by "consolidation"
Both c6e0a5539623 ("cpufreq: use existing local var in
cpufreq_statistic_init()") and a1ce987411f6 ("cpufreq: don't leave stale
statistics pointer") were lost in the course of "moving" the code,
presumably due to overly lax re-basing.
Fixes: bf0cd071db2a ("xen/pmstat: consolidate code into pmstat.c")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
The DT spec declares only two number types for a property: u32 and u64,
as per Table 2.3 in Section 2.2.4. Remove unbounded loop and replace
with a switch statement. Default to a size of 1 cell in the nonsensical
size case, with a warning printed on the Xen console.
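The switch-based approach described above can be sketched as follows. This is
a minimal illustration, not the Xen implementation: the helper name is
hypothetical, and the cells are assumed to have already been converted to
host byte order.

```c
#include <stdint.h>

/*
 * Read a number made of `cells` 32-bit cells. Only 1 cell (u32) and
 * 2 cells (u64) are meaningful; the most significant cell comes first.
 */
uint64_t read_number_sketch(const uint32_t *cell, unsigned int cells)
{
    switch (cells) {
    case 1:
        return cell[0];
    case 2:
        return ((uint64_t)cell[0] << 32) | cell[1];
    default:
        /* Nonsensical size: fall back to 1 cell (a warning would go here). */
        return cell[0];
    }
}
```

Compared with a loop over an arbitrary cell count, the switch makes the set
of accepted sizes explicit and bounded.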
Suggested-by: Daniel P. Smith <dpsmith@apertussolutions.com>
Signed-off-by: Alejandro Vallejo <agarciav@amd.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
arm/mpu: Enable read/write to protection regions for arm32
Define prepare_selector(), read_protection_region() and
write_protection_region() for arm32. Also, define
GENERATE_{READ/WRITE}_PR_REG_OTHERS to access MPU regions from 32 to 254.
Enable pr_{get/set}_{base/limit}(), region_is_valid() for arm32.
Enable pr_of_addr() for arm32.
The maximum number of regions supported is 255 (which corresponds to the
maximum value in HMPUIR).
arm/mpu: Move the functions to arm64 specific files
prepare_selector(), read_protection_region() and write_protection_region()
differ significantly between arm32 and arm64. Thus, move these functions
to their sub-arch specific folder.
Also, the GENERATE_{WRITE/READ}_PR_REG_CASE macros are moved, in order to
keep them in the same file as their users and improve readability.
Nicola Vetrini [Fri, 6 Jun 2025 21:27:09 +0000 (14:27 -0700)]
xen/x86: add missing noreturn attributes
The marked functions never return to their caller, but lack the
`noreturn' attribute.
Functions that never return should be declared with a `noreturn'
attribute.
The lack of `noreturn' causes a violation of MISRA C Rule 17.11 (not
currently accepted in Xen), and also Rule 2.1: "A project shall not
contain unreachable code". Depending on the compiler used and the
compiler optimization used, the lack of `noreturn' might lead to the
presence of unreachable code.
The usage of the noreturn attribute together with asmlinkage is only for
the benefit of the static analysis tools.
Nicola Vetrini [Fri, 6 Jun 2025 21:27:08 +0000 (14:27 -0700)]
xen/arm: add missing noreturn attributes
The marked functions never return to their caller, but lack the
`noreturn' attribute.
Functions that never return should be declared with a `noreturn'
attribute.
The lack of `noreturn' causes a violation of MISRA C Rule 17.11 (not
currently accepted in Xen), and also Rule 2.1: "A project shall not
contain unreachable code". Depending on the compiler used and the
compiler optimization used, the lack of `noreturn' might lead to the
presence of unreachable code.
Nicola Vetrini [Fri, 6 Jun 2025 21:27:07 +0000 (14:27 -0700)]
xen/keyhandler: add missing noreturn attribute
Function `reboot_machine' does not return, but lacks the `noreturn'
attribute.
Functions that never return should be declared with a `noreturn'
attribute.
The lack of `noreturn' causes a violation of MISRA C Rule 17.11 (not
currently accepted in Xen), and also Rule 2.1: "A project shall not
contain unreachable code". Depending on the compiler used and the
compiler optimization used, the lack of `noreturn' might lead to the
presence of unreachable code.
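The effect of the attribute can be sketched in a small example, assuming a
GCC-compatible compiler. The functions fatal_sketch() and checked_div() are
hypothetical, and the raw __attribute__ spelling stands in for Xen's
`noreturn' annotation:

```c
#include <stdlib.h>

/* Never returns to its caller: the attribute tells the compiler so. */
__attribute__((noreturn)) void fatal_sketch(void)
{
    abort();
}

int checked_div(int num, int den)
{
    if (den == 0)
        fatal_sketch();   /* control never comes back from this call */

    /* Only reached when den != 0. */
    return num / den;
}
```

Without the attribute, the compiler and static analysis tools have to assume
that control may continue after the call.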
Roger Pau Monne [Mon, 26 May 2025 11:27:45 +0000 (13:27 +0200)]
x86/hvmloader: select xen platform pci MMIO BAR UC or WB MTRR cache attribute
The Xen platform PCI device (vendor ID 0x5853) exposed to x86 HVM guests
doesn't have the functionality of a traditional PCI device. The exposed
MMIO BAR is used by some guests (including Linux) as a safe place to map
foreign memory, including the grant table itself.
Traditionally BARs from devices have the uncacheable (UC) cache attribute
from the MTRR, to ensure correct functionality of such devices. hvmloader
mimics this behavior and sets the MTRR attributes of both the low and high
PCI MMIO windows (where BARs of PCI devices reside) as UC in MTRR.
This however causes performance issues for users of the Xen platform PCI
device BAR, as for the purposes of mapping remote memory there's no need to
use the UC attribute. On Intel systems this is worked around by using
iPAT, that allows the hypervisor to force the effective cache attribute of
a p2m entry regardless of the guest PAT value. AMD however doesn't have an
equivalent of iPAT, and guest PAT values are always considered.
Linux commit:
41925b105e34 xen: replace xen_remap() with memremap()
attempted to mitigate this by forcing mappings of the grant-table to use
the write-back (WB) cache attribute. However Linux memremap() takes MTRRs
into account to calculate which PAT type to use, and seeing the MTRR cache
attribute for the region being UC the PAT also ends up as UC, regardless of
the caller having requested WB.
As a workaround to allow current Linux to map the grant-table as WB using
memremap() introduce an xl.cfg option (xen_platform_pci_bar_uc=0) that can
be used to select whether the Xen platform PCI device BAR will have the UC
attribute in MTRR. Such workaround in hvmloader should also be paired with
a fix for Linux so it attempts to change the MTRR of the Xen platform PCI
device BAR to WB by itself.
Overall, the long term solution would be to provide the guest with a safe
range in the guest physical address space where mappings to foreign pages
can be created.
Some vif throughput performance figures provided by Anthoine from 8-vCPU,
4GB-RAM HVM guest(s) running on AMD hardware:
Without this patch:
vm -> dom0: 1.1Gb/s
vm -> vm: 5.0Gb/s
With the patch:
vm -> dom0: 4.5Gb/s
vm -> vm: 7.0Gb/s
Reported-by: Anthoine Bourgeois <anthoine.bourgeois@vates.tech>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Acked-by: Jan Beulich <jbeulich@suse.com> # hvmloader
Reviewed-by: Anthony PERARD <anthony.perard@vates.tech>
Jan Beulich [Wed, 18 Jun 2025 07:25:51 +0000 (09:25 +0200)]
x86/HVM: restrict use of pinned cache attributes as well as associated flushing
We don't permit use of uncachable memory types elsewhere unless a domain
meets certain criteria. Enforce this also during registration of pinned
cache attribute ranges.
Furthermore restrict cache flushing to just
- registration of uncachable ranges,
- de-registration of cachable ranges.
While there, also (mainly by calling memory_type_changed())
- take CPU self-snoop as well as IOMMU snoop into account (albeit the
latter still is a global property rather than a per-domain one),
- avoid flushes when the domain isn't running yet (which ought to be the
common case).
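The flushing restriction (flush only on registration of uncachable ranges and
on de-registration of cachable ranges) can be sketched as a predicate. The
function name and boolean parameters are hypothetical, and the sketch ignores
the additional snoop and "domain not running yet" conditions:

```c
#include <stdbool.h>

/*
 * registering: true when a pinned range is added, false when removed.
 * cacheable:   true when the range's pinned type is a cachable one.
 */
bool flush_needed_sketch(bool registering, bool cacheable)
{
    return registering ? !cacheable : cacheable;
}
```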
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
min(pmpt->perf.state_count, op->u.getpx.total) == op->u.getpx.total can
be expressed differently as pmpt->perf.state_count >= op->u.getpx.total.
Copying when the two are equal is fine; (partial) copying when the state
count is larger than the number of array elements that a buffer was
allocated to hold is what - as per the comment - we mean to avoid. Drop
the use of min() again, but retain its effect for the subsequent copying
from pxpt->u.pt.
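The equivalence stated at the start of this message can be checked with a
small sketch. The helper names and plain unsigned int parameters are
illustrative assumptions, not the Xen code:

```c
#include <stdbool.h>

/* Hypothetical stand-in for a min() helper; not Xen's macro. */
unsigned int min_uint(unsigned int a, unsigned int b)
{
    return a < b ? a : b;
}

/* Form using min(): true when min(state_count, total) == total. */
bool check_with_min(unsigned int state_count, unsigned int total)
{
    return min_uint(state_count, total) == total;
}

/* Equivalent form: true when state_count >= total. */
bool check_with_ge(unsigned int state_count, unsigned int total)
{
    return state_count >= total;
}
```

Both forms also hold when the two values are equal, which is the case the
message says is fine to copy.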
Fixes: aa70996a6896 ("x86/pmstat: Check size of PMSTAT_get_pxstat buffers")
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Penny Zheng [Wed, 18 Jun 2025 07:24:24 +0000 (09:24 +0200)]
xen/cpufreq: normalize hwp driver check with hwp_active()
Instead of relying on a parameter passed via hypercall to identify the hwp
driver, use hwp_active(). do_get_pm_info() in the same file already uses
hwp_active() for the hwp driver check, so it is better to be consistent.
Signed-off-by: Penny Zheng <Penny.Zheng@amd.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Penny Zheng [Wed, 18 Jun 2025 07:24:00 +0000 (09:24 +0200)]
x86/AMD: Expand core frequency calculation for family 1Ah CPUs
AMD Family 1Ah CPUs need a different COF (Core Operating Frequency) formula,
due to a change in the PStateDef MSR layout in AMD Family 1Ah.
In AMD Family 1Ah, Core current operating frequency in MHz is calculated as
follows:
CoreCOF = Core::X86::Msr::PStateDef[CpuFid[11:0]] * 5MHz
We introduce a helper amd_parse_freq() to parse the COF (Core Operating
Frequency) from the PStateDef register, to replace the original macro FREQ(v).
amd_parse_freq() is declared as const, as it mainly consists of mathematical
computation.
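The Family 1Ah formula above can be sketched as follows. The function name is
hypothetical and this is not the amd_parse_freq() implementation; it only
extracts CpuFid[11:0] from a raw PStateDef value and multiplies by 5 MHz:

```c
#include <stdint.h>

/* CoreCOF in MHz = CpuFid[11:0] * 5, per the formula quoted above. */
unsigned int fam1a_core_cof_mhz(uint64_t pstate_def)
{
    unsigned int cpu_fid = pstate_def & 0xfff;  /* bits 11:0 */

    return cpu_fid * 5;
}
```

For example, a CpuFid of 0x190 (400) corresponds to 2000 MHz.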
Signed-off-by: Penny Zheng <Penny.Zheng@amd.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Penny Zheng [Wed, 18 Jun 2025 07:23:33 +0000 (09:23 +0200)]
xen/cpufreq: move "init" flag into common structure
AMD cpufreq cores will be initialized in two modes, legacy P-state mode,
and CPPC mode. So "init" flag shall be extracted from px-specific
"struct xen_processor_perf", and placed in the common
"struct processor_pminfo". Otherwise, later when introducing a new
sub-hypercall to propagate CPPC data, we need to pass irrelevant px-specific
"struct xen_processor_perf" to just set init flag.
Signed-off-by: Penny Zheng <Penny.Zheng@amd.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
xen/arm: add support for R-Car Gen4 PCI host controller
Add support for Renesas R-Car Gen4 PCI host controller, specifically
targeting the S4 and V4H SoCs. The implementation includes configuration
read/write operations for both root and child buses. For accessing the
child bus, iATU is used for address translation.
The host controller needs to be initialized by Dom0 first to be properly
handled by Xen. Xen itself only handles the runtime configuration of
the iATU for accessing different child devices.
iATU programming is done similarly to Linux, where only window 0 is used
for dynamic configuration, and it is reconfigured for every config space
read/write.
Code common to all DesignWare PCI host controllers is located in a
separate file to allow for easy reuse in other DesignWare-based PCI
host controllers.
PCI host bridges often have different ways to access the root and child
bus configuration spaces. One of the examples is DesignWare's host bridge
and its multiple clones [1].
Linux kernel implements this by instantiating a child bus when device
drivers provide not only the usual pci_ops to access ECAM space (this is
the case for the generic host bridge), but also means to access the child
bus which has a dedicated configuration space and own implementation for
accessing the bus, e.g. child_ops.
For Xen it is not feasible to fully implement PCI bus infrastructure as
Linux kernel does, but still child bus can be supported.
Add support for the PCI child bus which includes the following changes:
- introduce bus mapping functions depending on SBDF
- assign bus start and end for the child bus and re-configure the same for
the parent (root) bus
- make pci_find_host_bridge be aware of multiple busses behind the same bridge
- update pci_host_bridge_mappings, so it also doesn't map to guest the memory
spaces belonging to the child bus
- make pci_host_common_probe accept one more pci_ops structure for the child bus
- install MMIO handlers for the child bus for hardware domain
- re-work vpci_mmio_{write|read} with parent and child approach in mind
xen/arm: make pci_host_common_probe return the bridge
Some of the PCI host bridges require additional processing during the
probe phase. For that they need to access struct bridge of the probed
host, so return pointer to the new bridge from pci_host_common_probe.
xen/arm: exclude xen,reg from direct-map domU extended regions
Similarly to fba1b0974dd8, when a device is passed through to a
direct-map dom0less domU, the xen,reg ranges may overlap with the
extended regions. Remove xen,reg from direct-map domU extended regions.
Take the opportunity to update the comment ahead of find_memory_holes().
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
Reviewed-by: Michal Orzel <michal.orzel@amd.com>
Victor Lira [Tue, 17 Jun 2025 16:44:49 +0000 (09:44 -0700)]
automation: disable terminal echo in xilinx test scripts
The default terminal settings in Linux will enable echo which interferes with
these tests. Set the value in the script to avoid failure caused by a settings
reset.
Signed-off-by: Victor Lira <victorm.lira@amd.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
Michal Orzel [Tue, 17 Jun 2025 07:19:40 +0000 (09:19 +0200)]
console: Do not duplicate early printk messages on conring flush
Commit f6d1bfa16052 introduced flushing conring in console_init_preirq().
However, when CONFIG_EARLY_PRINTK is enabled, the early boot messages
had already been sent to serial before main console initialization. This
results in all the early boot messages being duplicated.
Change conring_flush() to accept an argument listing the devices to which to
flush conring. We don't want to send to serial at console initialization
when using early printk, but we do want these messages to be sent on a conring
dump triggered by a keyhandler.
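The idea of selecting flush targets per call site can be sketched as below.
The flag names and helper functions are hypothetical placeholders and are not
taken from Xen's console code:

```c
#include <stdbool.h>

/* Hypothetical device flags; the names are not taken from Xen. */
#define FLUSH_SERIAL (1u << 0)
#define FLUSH_VIDEO  (1u << 1)
#define FLUSH_ALL    (FLUSH_SERIAL | FLUSH_VIDEO)

/* Devices to flush to at console initialization. */
unsigned int init_flush_targets(bool early_printk_enabled)
{
    /* Early printk already wrote these messages to serial. */
    return early_printk_enabled ? (FLUSH_ALL & ~FLUSH_SERIAL) : FLUSH_ALL;
}

/* Devices to flush to when a keyhandler requests a conring dump. */
unsigned int dump_flush_targets(void)
{
    return FLUSH_ALL;
}
```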
Michal Orzel [Mon, 16 Jun 2025 06:56:48 +0000 (08:56 +0200)]
xen/arm: Fix P2M root page tables invalidation
Fix the condition part of the for loop in p2m_invalidate_root() that
uses P2M_ROOT_LEVEL instead of P2M_ROOT_PAGES. The goal here is to
invalidate all root page tables (that can be concatenated), so the loop
must iterate through all these pages. Root level can be 0 or 1, whereas
there can be 1, 2, 8 or 16 root pages. The issue may lead to some pages
not being invalidated and therefore the guest access won't be trapped.
We use it to track pages accessed by the guest for set/way emulation, provided
there is no IOMMU, the IOMMU is not enabled for the domain, or the P2M is not
shared with the IOMMU.
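The effect of using the wrong loop bound can be sketched as follows. The
constants are illustrative stand-ins for P2M_ROOT_LEVEL and P2M_ROOT_PAGES,
and the array of flags is a hypothetical model of the root pages:

```c
#include <stdbool.h>

/* Illustrative values only; in Xen these depend on the P2M configuration. */
#define ROOT_LEVEL 1   /* stand-in for P2M_ROOT_LEVEL */
#define ROOT_PAGES 8   /* stand-in for P2M_ROOT_PAGES */

/* Invalidate the first `bound` root pages; return how many stay valid. */
unsigned int pages_left_valid(unsigned int bound)
{
    bool valid[ROOT_PAGES];
    unsigned int i, left = 0;

    for (i = 0; i < ROOT_PAGES; i++)
        valid[i] = true;

    for (i = 0; i < bound && i < ROOT_PAGES; i++)
        valid[i] = false;

    for (i = 0; i < ROOT_PAGES; i++)
        if (valid[i])
            left++;

    return left;
}
```

With the level as the bound, 7 of the 8 modelled pages stay valid; with the
page count as the bound, none do.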
Fixes: 2148a125b73b ("xen/arm: Track page accessed between batch of Set/Way operations")
Reported-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Signed-off-by: Michal Orzel <michal.orzel@amd.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
arm/mpu: Provide and populate MPU C data structures
Modify Arm32 assembly boot code to reset any unused MPU region, initialise
'max_mpu_regions' with the number of supported MPU regions and set/clear the
bitmap 'xen_mpumap_mask' used to track the enabled regions.
Introduce cache.S to hold arm32 cache related functions.
Use the macro definition for "dcache_line_size" from Linux.
Change the order of registers in prepare_xen_region() as the 'strd' instruction
is used to store {prbar, prlar} in arm32. Thus, 'prbar' has to be an even
numbered register and 'prlar' is the consecutively ordered register.
arm/mpu: Introduce MPU memory region map structure
Introduce pr_t typedef which is a structure having the prbar and prlar members,
each being structured as the registers of the AArch32 Armv8-R architecture.
Also, define MPU_REGION_RES0 to 0 as there are no reserved 0 bits beyond the
BASE or LIMIT bitfields in prbar or prlar respectively.
In pr_of_addr(), enclose the arm64 specific prbar and prlar bitfields with
appropriate macros, so that this function can later be reused for arm32 as
well.
Oleksii Kurochko [Mon, 16 Jun 2025 08:15:41 +0000 (10:15 +0200)]
xen/riscv: introduce register_intc_ops() and intc_hw_ops
Introduce the intc_hw_operations structure to encapsulate interrupt
controller-specific data and operations. This structure includes:
- A pointer to interrupt controller information (`intc_info`)
- Callbacks to initialize the controller and set IRQ type/priority
- A reference to an interrupt controller descriptor (`host_irq_type`)
- The number of interrupt controller IRQs
Add the function register_intc_ops() to register the above-mentioned structure.
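The shape of such an operations structure and its registration can be
sketched as below. All type, field and function names here are hypothetical
placeholders (the real definitions live in the patch), and the IRQ descriptor
member is omitted for brevity:

```c
#include <stddef.h>

/* Hypothetical controller information. */
struct intc_info_sketch {
    unsigned int nr_irqs;   /* number of interrupt controller IRQs */
};

/* Hypothetical operations table with data pointer and callbacks. */
struct intc_ops_sketch {
    const struct intc_info_sketch *info;
    int (*init)(void);
    void (*set_irq_type)(unsigned int irq, unsigned int type);
    void (*set_irq_priority)(unsigned int irq, unsigned int prio);
};

static const struct intc_ops_sketch *registered_ops;

/* Record the operations of the interrupt controller in use. */
void register_ops_sketch(const struct intc_ops_sketch *ops)
{
    registered_ops = ops;
}

/* Return the currently registered operations, or NULL if none. */
const struct intc_ops_sketch *current_ops_sketch(void)
{
    return registered_ops;
}
```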
Oleksii Kurochko [Mon, 16 Jun 2025 08:15:21 +0000 (10:15 +0200)]
xen/riscv: dt_processor_hartid() implementation
Implement dt_processor_hartid() to get the hart ID of the given
device tree node and check that the CPU is available and that the given device
tree node has a proper riscv,isa property.
As a helper function dt_get_hartid() is introduced to deal specifically
with the reg property of a CPU device node.
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Oleksii Kurochko [Mon, 16 Jun 2025 08:14:59 +0000 (10:14 +0200)]
xen/riscv: rework asm/mm.h and asm/page.h includes to match other architectures
To align with other architectures where <asm/page.h> is included from <asm/mm.h>
(and not the other way around), the following changes are made:
- Since <asm/mm.h> is no longer included in <asm/page.h>:
- Move the definitions of paddr_to_pte() and pte_to_paddr() to <asm/mm.h>,
as paddr_to_pfn() and pte_to_paddr() are already defined there.
- Move _vmap_to_mfn() to <asm/mm.h> because mfn_from_pte() is defined there and
open-code it inside macros vmap_to_mfn().
- Drop the inclusion of <xen/domain_page.h> from <asm/page.h> to resolve a compilation error:
./include/xen/domain_page.h:63:12: error: implicit declaration of function '__mfn_to_virt'; did you mean 'mfn_to_nid'? [-Werror=implicit-function-declaration]
63 | return __mfn_to_virt(mfn_x(mfn));
This happens because __mfn_to_virt() is defined in <asm/mm.h>, but due to
the current include chain:
<xen/domain.h>
<asm/domain.h>
<xen/mm.h>
<asm/mm.h>
<asm/page.h>
<xen/domain_page.h>
static inline void *map_domain_page_global(mfn_t mfn)
{
return __mfn_to_virt(mfn_x(mfn));
}
...
...
#define __mfn_to_virt() ...
This leads to a circular dependency and the build error above.
As a result, since <xen/domain_page.h> is no longer included in
<asm/page.h>, the flush_page_to_ram() definition cannot remain there.
It is now moved to riscv/mm.c.
Including <asm/page.h> from <asm/mm.h> does not cause issues with the
declaration/definition of clear_page() when <xen/mm.h> is included, and
also prevents build errors such as:
common/domain.c: In function 'alloc_domain_struct':
common/domain.c:797:5: error: implicit declaration of function 'clear_page'; did you mean 'steal_page'? [-Werror=implicit-function-declaration]
797 | clear_page(d);
| ^~~~~~~~~~
| steal_page
caused by using clear_page() in common/domain.c.
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
x86: Preinitialise all modules to be of kind BOOTMOD_UNKNOWN
A later patch removes boot_module and replaces its uses with bootmodule.
The equivalent field for "type" doesn't have BOOTMOD_UNKNOWN as a zero
value, so it must be explicitly set in the static xen_boot_info.
arm/gnttab: Remove xen/grant_table.h cyclic include
The way they currently include each other, with one of the includes
being conditional on CONFIG_GRANT_TABLE, makes it hard to know which
contents are included when.
Break the cycle by removing the asm/grant_table.h include.
Andrew Cooper [Wed, 4 Jun 2025 12:56:13 +0000 (13:56 +0100)]
x86/hvm: Process pending softirqs while dumping VMC[SB]s
24 guests with 8 vcpus each is sufficient to hit a 5 second watchdog.
Drop a piece of trailing whitespace while here.
Reported-by: Aidan Allen <aidan.allen1@cloud.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Tested-by: Aidan Allen <aidan.allen1@cloud.com>
Andrew Cooper [Tue, 3 Jun 2025 23:33:46 +0000 (00:33 +0100)]
x86/boot: Fix domain_cmdline_size()
The early exit from domain_cmdline_size() is buggy. Even if there's no
bootloader cmdline and no kextra, there still might be Xen parameters to
forward, and therefore a nonzero cmdline length.
Explain what the function is doing, and rewrite it to be both more legible and
more extendible.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
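The point about the early exit can be illustrated with a minimal sketch. This is not the Xen code: cmdline_size_sketch() and its array-of-parts interface are hypothetical, chosen only to show that each optional piece (bootloader cmdline, kextra, forwarded Xen parameters) has to be counted independently, so that no single missing piece short-circuits the result to zero.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical sketch, not the Xen implementation.
 *
 * Each non-empty part contributes its length plus one byte: a separating
 * space if another part follows, or the terminating NUL for the last one.
 * The result is 0 only when every part is absent or empty.
 */
size_t cmdline_size_sketch(const char *const parts[], size_t nr_parts)
{
    size_t len = 0;

    assert(parts != NULL || nr_parts == 0);

    for ( size_t i = 0; i < nr_parts; i++ )
        if ( parts[i] && parts[i][0] )
            len += strlen(parts[i]) + 1;

    return len;
}
```

With this shape, adding another source of parameters means adding one more array entry rather than another special case in an exit condition.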
Jan Beulich [Thu, 12 Jun 2025 12:46:23 +0000 (14:46 +0200)]
x86: FLUSH_CACHE -> FLUSH_CACHE_EVICT
This is to make the difference to FLUSH_CACHE_WRITEBACK more explicit.
Requested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jason Andryuk <jason.andryuk@amd.com>
In file included from ./include/xen/pci.h:72,
from drivers/pci/pci.c:8:
./arch/arm/include/asm/pci.h:131:50: error: ‘struct rangeset’ declared inside parameter list will not be visible outside of this definition or declaration [-Werror]
131 | static inline int pci_sanitize_bar_memory(struct rangeset *r)
| ^~~~~~~~
cc1: all warnings being treated as errors
Fixes: 4acab25a9300 ("x86/vpci: fix handling of BAR overlaps with non-hole regions") Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com> Reviewed-by: Michal Orzel <michal.orzel@amd.com>
Mykyta Poturai [Wed, 11 Jun 2025 09:09:00 +0000 (11:09 +0200)]
xen: Introduce system suspend config option
This option enables the system suspend support. This is the mechanism that
allows the system to be suspended to RAM and later resumed.
The patch introduces three options:
- HAS_SYSTEM_SUSPEND: indicates suspend support is available on the platform.
- SYSTEM_SUSPEND_ALWAYS_ON: used for architectures where suspend must always
be enabled.
- SYSTEM_SUSPEND: user-facing option to enable/disable suspend if supported.
Defaults to enabled if SYSTEM_SUSPEND_ALWAYS_ON is set and depends on
HAS_SYSTEM_SUSPEND.
On x86, both HAS_SYSTEM_SUSPEND and SYSTEM_SUSPEND_ALWAYS_ON are selected by
default, making suspend support always enabled. The options are designed to
be easily extensible to other architectures (e.g., PPC, RISC-V) as future
support is added.
Gang Ji [Wed, 11 Jun 2025 09:08:11 +0000 (11:08 +0200)]
xenalyze: Add 2 missed VCPUOPs in vcpu_op_str
The two missing ones are register_runstate_phys_area and
register_vcpu_time_phys_area.
Fixes: d5df44275e7a ("domain: introduce GADDR based runstate area registration alternative") Fixes: 60e544a8c58f ("x86: introduce GADDR based secondary time area registration alternative") Signed-off-by: Gang Ji <gang.ji@cloud.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Penny Zheng [Wed, 11 Jun 2025 09:07:53 +0000 (11:07 +0200)]
xen: make avail_domheap_pages() inlined into get_outstanding_claims()
Function avail_domheap_pages() is only invoked by get_outstanding_claims(),
so it could be inlined into get_outstanding_claims().
Move avail_heap_pages() up so that no forward declaration is needed before
get_outstanding_claims().
Penny Zheng [Wed, 11 Jun 2025 09:07:33 +0000 (11:07 +0200)]
xen/pmstat: consolidate code into pmstat.c
We move the following functions into drivers/acpi/pmstat.c, as they
are all designed for performance statistics:
- cpufreq_residency_update()
- cpufreq_statistic_reset()
- cpufreq_statistic_update()
- cpufreq_statistic_init()
- cpufreq_statistic_exit()
Consequently, the variables "cpufreq_statistic_data" and
"cpufreq_statistic_lock" become static.
We also move out acpi_set_pdc_bits(), as it is the handler for sub-hypercall
XEN_PM_PDC, and shall stay with the other handlers together in
drivers/cpufreq/cpufreq.c.
Various style corrections are applied while moving these functions,
including:
- place the braces of if() and for() on a separate line
- add a space before and after the brackets of if() and for()
- use array notation
- convert uint32_t into unsigned int
- convert u32 into uint32_t
Ross Lagerwall [Wed, 11 Jun 2025 09:07:00 +0000 (11:07 +0200)]
libxc/PM: Retry get_pxstat if data is incomplete
If the total returned by Xen is more than the number of elements
allocated, it means that the buffer was too small and so the data is
incomplete. Retry to get all the data.
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Reviewed-by: Anthony PERARD <anthony.perard@vates.tech>
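The retry logic described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the libxc code: query_fn, fetch_all_states() and demo_query() are hypothetical names, and the query callback merely stands in for a call that copies at most `capacity` entries and reports the total number available.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for the underlying query: fills at most 'capacity'
 * entries of 'buf' and returns the total number of entries available.
 */
typedef unsigned int (*query_fn)(uint64_t *buf, unsigned int capacity);

/* Demo query (assumption for illustration): always reports 6 entries. */
unsigned int demo_query(uint64_t *buf, unsigned int capacity)
{
    const unsigned int total = 6;

    for ( unsigned int i = 0; i < capacity && i < total; i++ )
        buf[i] = i * 10u;

    return total;
}

/*
 * Retry with a larger buffer whenever the reported total exceeds what was
 * allocated, since the data copied so far is then incomplete.
 * Returns 0 on success (caller frees *out), -1 on allocation failure.
 */
int fetch_all_states(query_fn query, uint64_t **out, unsigned int *out_count)
{
    unsigned int capacity = 4;   /* deliberately small initial guess */
    uint64_t *buf = NULL;

    assert(query && out && out_count);

    for ( ;; )
    {
        uint64_t *tmp = realloc(buf, capacity * sizeof(*buf));
        unsigned int total;

        if ( !tmp )
        {
            free(buf);
            return -1;
        }
        buf = tmp;

        total = query(buf, capacity);
        if ( total <= capacity )
        {
            *out = buf;
            *out_count = total;
            return 0;
        }

        capacity = total;        /* buffer was too small: grow and retry */
    }
}
```

The loop terminates as soon as one query fits; if the total keeps growing between calls, it simply retries with the newest total.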
Ross Lagerwall [Wed, 11 Jun 2025 09:06:45 +0000 (11:06 +0200)]
libxc/PM: Ensure pxstat buffers are correctly sized
xc_pm_get_pxstat() requires the caller to allocate the pt and trans_pt
buffers but then calls xc_pm_get_max_px() to determine how big they are
(and hence how much Xen will copy into them). This is susceptible to
races if xc_pm_get_max_px() changes so avoid the problem by requiring
the caller to also pass in the size of the buffers.
Suggested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Reviewed-by: Anthony PERARD <anthony.perard@vates.tech>
Ross Lagerwall [Wed, 11 Jun 2025 09:06:24 +0000 (11:06 +0200)]
cpufreq: Avoid potential buffer overrun and leak
If set_px_pminfo is called a second time with a larger state_count than
the first call, calls to PMSTAT_get_pxstat will read beyond the end of
the pt and trans_pt buffers allocated in cpufreq_statistic_init() since
they would have been allocated with the original state_count.
Secondly, the states array leaks on each subsequent call of
set_px_pminfo.
Fix both these issues by ignoring subsequent calls to set_px_pminfo if
it completed successfully previously. Return success rather than an
error to avoid errors in the dom0 kernel log when reloading the
xen_acpi_processor module.
At the same time, fix a leak of the states array on error.
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
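A rough sketch of the "ignore subsequent calls" approach, assuming a simplified data layout: struct perf_info_sketch and set_states_once() are hypothetical and do not mirror the Xen structures. It shows the two properties the message describes: a repeated call returns success without touching the existing allocation, and the error path does not leak.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified per-CPU performance info. */
struct perf_info_sketch {
    unsigned int state_count;
    uint32_t *states;
    bool initialised;
};

/*
 * Install the states array once. Later calls are ignored and report
 * success, so buffers sized from the first state_count stay valid and the
 * first array is never overwritten (and therefore never leaked).
 */
int set_states_once(struct perf_info_sketch *info,
                    const uint32_t *states, unsigned int count)
{
    uint32_t *copy;

    assert(info != NULL);

    if ( info->initialised )
        return 0;                /* already set up: ignore, no error */

    if ( !states || count == 0 )
        return -1;

    copy = malloc(count * sizeof(*copy));
    if ( !copy )
        return -1;               /* nothing allocated yet, nothing to free */

    memcpy(copy, states, count * sizeof(*copy));

    info->states = copy;
    info->state_count = count;
    info->initialised = true;

    return 0;
}
```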
Ross Lagerwall [Wed, 11 Jun 2025 09:05:42 +0000 (11:05 +0200)]
x86/pmstat: Check size of PMSTAT_get_pxstat buffers
Check that the total number of states passed in and hence the size of
buffers is sufficient to avoid writing more than the caller has
allocated.
The interface is not explicit about whether getpx.total is expected to
be set by the caller in this case but since it is always set in
libxenctrl it seems reasonable to check it and make it explicit.
Fixes: c06a7db0c547 ("X86 and IA64: Update cpufreq statistic logic for supporting both x86 and ia64") Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
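The check amounts to comparing the caller-declared capacity against the number of entries about to be written. A minimal sketch, with hypothetical names (copy_stats_checked() is not a Xen function, and the choice of -ENOSPC as the error value is an assumption made for illustration):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy 'count' entries into the caller's buffer only if the caller said it
 * has room for at least that many ('caller_total'). Nothing is written
 * when the buffer is too small.
 */
int copy_stats_checked(uint64_t *dst, unsigned int caller_total,
                       const uint64_t *src, unsigned int count)
{
    assert(dst != NULL && src != NULL);

    if ( caller_total < count )
        return -ENOSPC;          /* assumed error code, see lead-in */

    memcpy(dst, src, count * sizeof(*dst));

    return 0;
}
```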
Jiqian Chen [Wed, 11 Jun 2025 09:05:03 +0000 (11:05 +0200)]
vpci/header: Emulate legacy capability list for dom0
The current logic for emulating the legacy capability list applies only to
domU. Expand it to cover dom0 too, which makes it easy to hide a capability
whose initialization function fails.
Also restrict adding the PCI_STATUS register to domU only, since dom0 has no
limitation on accessing that register.
Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Michal Orzel [Wed, 4 Jun 2025 07:21:28 +0000 (09:21 +0200)]
tests/vpci: Use $(CC) instead of $(HOSTCC)
Depending on the build environment, HOSTCC can be different than CC. With
the recent `install` rule addition, this would put a binary of a wrong
format in the destdir (e.g. building tests on x86 host for Arm target).
Take the opportunity to adjust the `run` rule to only run the test if
HOSTCC is CC, else print a warning message.
Fixes: 96a587a05736 ("tools/tests: Add install target for vPCI") Signed-off-by: Michal Orzel <michal.orzel@amd.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>