1. 04 Oct, 2019 2 commits
    • memory: allow memory_region_register_iommu_notifier() to fail · 549d4005
      Eric Auger authored
      Currently, when an attempt is made to register a notifier whose
      flags (especially the MAP one) are not supported by the IOMMU MR,
      we generally exit abruptly in the IOMMU code. The failure could be
      handled more gracefully in the caller, especially in the VFIO code.
      
      So let's allow memory_region_register_iommu_notifier(), as well as
      the notify_flag_changed() callback, to fail.
      
      All sites implementing the callback are updated. This patch does
      not yet remove the exit(1) in the amd_iommu code.
      
      In SMMUv3 we turn the warning message into an error message saying
      that the assigned device would not work properly.
      Signed-off-by: Eric Auger <eric.auger@redhat.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
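The idea in the commit above, returning an error to the caller instead of exiting inside the IOMMU code, can be sketched roughly as follows. This is a minimal illustration, not QEMU's code: `ToyIommu`, `toy_register_notifier()` and the flag names are hypothetical stand-ins.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical flag bits, modelled on the MAP/UNMAP notifier types. */
enum { TOY_NOTIFIER_UNMAP = 1 << 0, TOY_NOTIFIER_MAP = 1 << 1 };

/* Hypothetical stand-in for an IOMMU memory region. */
typedef struct {
    int supported_flags;   /* notifier flags this IOMMU can deliver */
    int registered_flags;  /* union of the flags registered so far */
} ToyIommu;

/*
 * Returns 0 on success, or -EINVAL if the IOMMU cannot deliver the
 * requested notifications. Nothing exits here; the caller decides
 * how to react to the failure.
 */
int toy_register_notifier(ToyIommu *mr, int flags)
{
    assert(mr != NULL);
    if (flags & ~mr->supported_flags) {
        return -EINVAL;
    }
    mr->registered_flags |= flags;
    return 0;
}
```

A caller such as a device-assignment layer could then report the failure with its own context rather than have the process terminate.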
    • vfio: Turn the container error into an Error handle · d7d87836
      Eric Auger authored
      The container error integer field is currently used to store
      the first error potentially encountered during any
      vfio_listener_region_add() call. However this fails to propagate
      detailed error messages up to the vfio_connect_container caller.
      Instead of using an integer, let's use an Error handle.
      
      Messages are slightly reworded to accommodate the propagation.
      Signed-off-by: Eric Auger <eric.auger@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. 16 Aug, 2019 2 commits
  3. 13 Jun, 2019 1 commit
  4. 12 Mar, 2019 1 commit
  5. 22 Feb, 2019 2 commits
  6. 11 Jan, 2019 2 commits
  7. 23 Aug, 2018 1 commit
  8. 21 Aug, 2018 1 commit
    • vfio/spapr: Allow backing bigger guest IOMMU pages with smaller physical pages · c26bc185
      Alexey Kardashevskiy authored
      At the moment the PPC64/pseries guest only supports 4K/64K/16M IOMMU
      pages and the POWER8 CPU supports the exact same set of page sizes, so
      things have worked fine so far.
      
      However POWER9 supports a different set of sizes - 4K/64K/2M/1G - and
      the last two - 2M and 1G - are not even allowed in the paravirt interface
      (RTAS DDW), so we always end up using 64K IOMMU pages, although we could
      back the guest's 16MB IOMMU pages with 2MB pages on the host.
      
      This stores the supported host IOMMU page sizes in VFIOContainer and uses
      this later when creating a new DMA window. This uses the system page size
      (64k normally, 2M/16M/1G if hugepages used) as the upper limit of
      the IOMMU pagesize.
      
      This changes the type of @pagesize to uint64_t as this is what
      memory_region_iommu_get_min_page_size() returns and clz64() takes.
      
      There should be no behavioral changes on platforms other than pseries.
      The guest will keep using the IOMMU page size selected by the PHB pagesize
      property as this only changes the underlying hardware TCE table
      granularity.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
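The page-size selection described in the commit above can be sketched as follows: take the bitmask of page sizes the host IOMMU supports, cap it at the smaller of the guest IOMMU page size and the system page size, and pick the largest remaining size. This is an illustrative sketch under those assumptions; the function names are hypothetical and a portable loop stands in for clz64().

```c
#include <assert.h>
#include <stdint.h>

/* Index of the highest set bit; v must be non-zero. */
unsigned toy_highest_bit(uint64_t v)
{
    unsigned bit = 0;

    assert(v != 0);
    while (v >>= 1) {
        bit++;
    }
    return bit;
}

/*
 * host_pgsizes: bitmask with one bit set per supported host IOMMU page size.
 * guest_pagesize, system_pagesize: powers of two.
 * Returns the largest supported host page size that exceeds neither limit,
 * or 0 if no supported size fits.
 */
uint64_t toy_pick_host_pagesize(uint64_t host_pgsizes,
                                uint64_t guest_pagesize,
                                uint64_t system_pagesize)
{
    uint64_t limit = guest_pagesize < system_pagesize ? guest_pagesize
                                                      : system_pagesize;
    /* limit is a power of two, so this mask keeps every size <= limit. */
    uint64_t candidates = host_pgsizes & (limit | (limit - 1));

    return candidates ? (uint64_t)1 << toy_highest_bit(candidates) : 0;
}
```

With a POWER9-like mask of 4K/64K/2M/1G, a 16M guest IOMMU page and 16M system pages, the sketch selects 2M, which matches the case the commit message describes.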
  9. 17 Aug, 2018 2 commits
    • vfio/ccw/pci: Allow devices to opt-in for ballooning · 238e9172
      Alex Williamson authored
      If a vfio assigned device makes use of a physical IOMMU, then memory
      ballooning is necessarily inhibited due to the page pinning, lack of
      page level granularity at the IOMMU, and sufficient notifiers to both
      remove the page on balloon inflation and add it back on deflation.
      However, not all devices are backed by a physical IOMMU.  In the case
      of mediated devices, if a vendor driver is well synchronized with the
      guest driver, such that only pages actively used by the guest driver
      are pinned by the host mdev vendor driver, then there should be no
      overlap between pages available for the balloon driver and pages
      actively in use by the device.  Under these conditions, ballooning
      should be safe.
      
      vfio-ccw devices are always mediated devices and always operate under
      the constraints above.  Therefore we can consider all vfio-ccw devices
      as balloon compatible.
      
      The situation is far from straightforward with vfio-pci.  These
      devices can be physical devices with physical IOMMU backing or
      mediated devices where it is unknown whether a physical IOMMU is in
      use or whether the vendor driver is well synchronized to the working
      set of the guest driver.  The safest approach is therefore to assume
      all vfio-pci devices are incompatible with ballooning, but allow user
      opt-in should they have further insight into mediated devices.
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
    • vfio: Inhibit ballooning based on group attachment to a container · c65ee433
      Alex Williamson authored
      We use a VFIOContainer to associate an AddressSpace to one or more
      VFIOGroups.  The VFIOContainer represents the DMA context for that
      AddressSpace for those VFIOGroups and is synchronized to changes in
      that AddressSpace via a MemoryListener.  For IOMMU backed devices,
      maintaining the DMA context for a VFIOGroup generally involves
      pinning a host virtual address in order to create a stable host
      physical address and then mapping a translation from the associated
      guest physical address to that host physical address into the IOMMU.
      
      While the above maintains the VFIOContainer synchronized to the QEMU
      memory API of the VM, memory ballooning occurs outside of that API.
      Inflating the memory balloon (ie. cooperatively capturing pages from
      the guest for use by the host) simply uses MADV_DONTNEED to "zap"
      pages from QEMU's host virtual address space.  The page pinning and
      IOMMU mapping above remains in place, negating the host's ability to
      reuse the page, but the host virtual to host physical mapping of the
      page is invalidated outside of QEMU's memory API.
      
      When the balloon is later deflated, attempting to cooperatively
      return pages to the guest, the page is simply freed by the guest
      balloon driver, allowing it to be used in the guest and incurring a
      page fault when that occurs.  The page fault maps a new host physical
      page backing the existing host virtual address, meanwhile the
      VFIOContainer still maintains the translation to the original host
      physical address.  At this point the guest vCPU and any assigned
      devices will map different host physical addresses to the same guest
      physical address.  Badness.
      
      The IOMMU typically does not have page level granularity with which
      it can track this mapping without also incurring inefficiencies in
      using page size mappings throughout.  MMU notifiers in the host
      kernel also provide indicators for invalidating the mapping on
      balloon inflation, not for updating the mapping when the balloon is
      deflated.  For these reasons we assume a default behavior that the
      mapping of each VFIOGroup into the VFIOContainer is incompatible
      with memory ballooning and increment the balloon inhibitor to match
      the attached VFIOGroups.
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
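The "increment the balloon inhibitor to match the attached VFIOGroups" behaviour amounts to a reference count: each attached group adds one inhibitor, each detach removes one, and ballooning is permitted only while the count is zero. A minimal sketch, with hypothetical names that are not taken from QEMU:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical counter: one increment per group attached to a container. */
static int toy_balloon_inhibitors;

/* Call with true when a group is attached, false when it is detached. */
void toy_balloon_inhibit(bool inhibit)
{
    toy_balloon_inhibitors += inhibit ? 1 : -1;
    assert(toy_balloon_inhibitors >= 0);
}

/* Ballooning is safe only when no attached group inhibits it. */
bool toy_balloon_is_inhibited(void)
{
    return toy_balloon_inhibitors > 0;
}
```

Counting rather than using a single flag means that detaching one of several groups does not re-enable ballooning while others are still attached.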
  10. 15 Jun, 2018 1 commit
    • iommu: Add IOMMU index argument to notifier APIs · cb1efcf4
      Peter Maydell authored
      Add support for multiple IOMMU indexes to the IOMMU notifier APIs.
      When initializing a notifier with iommu_notifier_init(), the caller
      must pass the IOMMU index that it is interested in. When a change
      happens, the IOMMU implementation must pass
      memory_region_notify_iommu() the IOMMU index that has changed and
      that notifiers must be called for.
      
      IOMMUs which support only a single index don't need to change.
      Callers which only really support working with IOMMUs with a single
      index can use the result of passing MEMTXATTRS_UNSPECIFIED to
      memory_region_iommu_attrs_to_index().
      Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
      Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
      Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
      Message-id: 20180604152941.20374-3-peter.maydell@linaro.org
  11. 31 May, 2018 1 commit
  12. 05 Apr, 2018 1 commit
    • vfio: Use a trace point when a RAM section cannot be DMA mapped · 5c086005
      Eric Auger authored
      Commit 567b5b30 ("vfio/pci: Relax DMA map errors for MMIO regions")
      added an error message if a passed memory section address or size
      is not aligned to the page size and thus cannot be DMA mapped.
      
      This patch fixes the trace by printing the region name and the
      memory region section offset within the address space (instead of
      offset_within_region).
      
      We also turn the error_report into a trace event. Indeed, in some
      cases the traces can be confusing to non-expert end-users and
      lead them to think the use case does not work (whereas it works as before).
      
      This is the case where a BAR is successively mapped at different
      GPAs and its sections are not compatible with dma map. The listener
      is called several times and traces are issued for each intermediate
      mapping.  The end-user cannot easily match those GPAs against the
      final GPA output by lspci. So let's keep this information for
      informed users. In the mid term, the plan is to advise the user about
      BAR relocation relevance.
      
      Fixes: 567b5b30 ("vfio/pci: Relax DMA map errors for MMIO regions")
      Signed-off-by: Eric Auger <eric.auger@redhat.com>
      Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
      Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
  13. 13 Mar, 2018 3 commits
  14. 07 Feb, 2018 1 commit
  15. 06 Feb, 2018 2 commits
  16. 13 Dec, 2017 3 commits
  17. 17 Jul, 2017 1 commit
    • vfio-pci, ppc64/spapr: Reorder group-to-container attaching · 8c37faa4
      Alexey Kardashevskiy authored
      At the moment VFIO PCI device initialization works as follows:
      vfio_realize
      	vfio_get_group
      		vfio_connect_container
      			register memory listeners (1)
      			update QEMU groups lists
      		vfio_kvm_device_add_group
      
      Then (example for pseries) the machine reset hook triggers region_add()
      for all regions where listeners from (1) are listening:
      
      ppc_spapr_reset
      	spapr_phb_reset
      		spapr_tce_table_enable
      			memory_region_add_subregion
      				vfio_listener_region_add
      					vfio_spapr_create_window
      
      This scheme works fine until we need to handle VFIO PCI device hotplug
      with PPC64/sPAPR in-kernel TCE acceleration enabled,
      i.e. after PCI hotplug we need a place to call
      ioctl(vfio_kvm_device_fd, KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE).
      Since the ioctl needs a LIOBN fd (from sPAPRTCETable) and an IOMMU group fd
      (from VFIOGroup), vfio_listener_region_add() seems to be the only place
      for this ioctl().
      
      However this only works during boot time because the machine reset
      happens strictly after all devices are finalized. When hotplug happens,
      vfio_listener_region_add() is called when a memory listener is registered
      but when this happens:
      1. new group is not added to the container->group_list yet;
      2. VFIO KVM device is unaware of the new IOMMU group.
      
      This moves bits around to have all necessary VFIO infrastructure
      in place for both initial startup and hotplug cases.
      
      [aw: ie, register vfio groups with kvm prior to memory listener
      registration such that kvm-vfio pseudo device ioctls are available
      during the region_add callback]
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
  18. 14 Jul, 2017 1 commit
  19. 10 Jul, 2017 1 commit
    • vfio: Test realized when using VFIOGroup.device_list iterator · 7da624e2
      Alex Williamson authored
      VFIOGroup.device_list is effectively our reference tracking mechanism
      such that we can tear down a group when all of the device references
      are removed.  However, we also use this list from our machine reset
      handler for processing resets that affect multiple devices.  Generally
      device removals are fully processed (exitfn + finalize) when this
      reset handler is invoked, however if the removal is triggered via
      another reset handler (piix4_reset->acpi_pcihp_reset) then the device
      exitfn may run, but not finalize.  In this case we hit asserts when
      we start trying to access PCI helpers since much of the PCI state of
      the device is released.  To resolve this, add a pointer to the Object
      DeviceState in our common base-device and skip non-realized devices
      as we iterate.
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
  20. 25 May, 2017 1 commit
  21. 03 May, 2017 2 commits
    • vfio: enable 8-byte reads/writes to vfio · 38d49e8c
      Jose Ricardo Ziviani authored
      This patch enables 8-byte writes and reads to VFIO. The implementation
      already exists, but the 'case' to handle such accesses is missing from
      both vfio_region_write and vfio_region_read, as are the MemoryRegionOps
      fields impl.max_access_size and impl.min_access_size.
      
      After this patch, 8-byte writes such as:
      
      qemu_mutex_lock locked mutex 0x10905ad8
      vfio_region_write  (0001:03:00.0:region1+0xc0, 0x4140c, 4)
      vfio_region_write  (0001:03:00.0:region1+0xc4, 0xa0000, 4)
      qemu_mutex_unlock unlocked mutex 0x10905ad8
      
      go like this:
      
      qemu_mutex_lock locked mutex 0x10905ad8
      vfio_region_write  (0001:03:00.0:region1+0xc0, 0xbfd0008, 8)
      qemu_mutex_unlock unlocked mutex 0x10905ad8
      Signed-off-by: Jose Ricardo Ziviani <joserz@linux.vnet.ibm.com>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
    • vfio: Set MemoryRegionOps:max_access_size and min_access_size · 15126cba
      Jose Ricardo Ziviani authored
      Sets valid.max_access_size and valid.min_access_size to ensure safe
      8-byte accesses to vfio. Today, 8-byte accesses are broken into pairs
      of 4-byte calls that go unprotected:
      
      qemu_mutex_lock locked mutex 0x10905ad8
        vfio_region_write  (0001:03:00.0:region1+0xc0, 0x2020c, 4)
      qemu_mutex_unlock unlocked mutex 0x10905ad8
      qemu_mutex_lock locked mutex 0x10905ad8
        vfio_region_write  (0001:03:00.0:region1+0xc4, 0xa0000, 4)
      qemu_mutex_unlock unlocked mutex 0x10905ad8
      
      which occasionally leads to:
      
      qemu_mutex_lock locked mutex 0x10905ad8
        vfio_region_write  (0001:03:00.0:region1+0xc0, 0x2030c, 4)
      qemu_mutex_unlock unlocked mutex 0x10905ad8
      qemu_mutex_lock locked mutex 0x10905ad8
        vfio_region_write  (0001:03:00.0:region1+0xc0, 0x1000c, 4)
      qemu_mutex_unlock unlocked mutex 0x10905ad8
      qemu_mutex_lock locked mutex 0x10905ad8
        vfio_region_write  (0001:03:00.0:region1+0xc4, 0xb0000, 4)
      qemu_mutex_unlock unlocked mutex 0x10905ad8
      qemu_mutex_lock locked mutex 0x10905ad8
        vfio_region_write  (0001:03:00.0:region1+0xc4, 0xa0000, 4)
      qemu_mutex_unlock unlocked mutex 0x10905ad8
      
      causing strange errors in guest OS. With this patch, such accesses
      are protected by the same lock guard:
      
      qemu_mutex_lock locked mutex 0x10905ad8
      vfio_region_write  (0001:03:00.0:region1+0xc0, 0x2000c, 4)
      vfio_region_write  (0001:03:00.0:region1+0xc4, 0xb0000, 4)
      qemu_mutex_unlock unlocked mutex 0x10905ad8
      
      This happens because the 8-byte write should be broken into 4-byte
      writes by memory.c:access_with_adjusted_size() in order to be under
      the same lock. Today, it's done in exec.c:address_space_write_continue()
      which was able to handle only 4 bytes due to a zero'ed
      valid.max_access_size (see exec.c:memory_access_size()).
      Signed-off-by: Jose Ricardo Ziviani <joserz@linux.vnet.ibm.com>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
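The splitting discussed in the two commits above, where one 8-byte guest write becomes two 4-byte device writes issued back to back, can be sketched as below. This is an illustrative model only: `ToyLog`, `toy_region_write()` and `toy_write_split()` are hypothetical, little-endian chunk order is assumed, and the lock that the real code holds around the loop is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical record of one device write, mirroring the trace lines. */
typedef struct {
    uint64_t addr;
    uint64_t data;
    unsigned size;
} ToyWrite;

typedef struct {
    ToyWrite writes[8];
    unsigned count;
} ToyLog;

/* Stand-in for the device write callback: it only records the access. */
void toy_region_write(ToyLog *log, uint64_t addr, uint64_t data, unsigned size)
{
    assert(log->count < 8);
    log->writes[log->count++] = (ToyWrite){ addr, data, size };
}

/*
 * Splits a write of 'size' bytes into chunks of at most 'max_access' bytes.
 * Both must be powers of two no larger than 8. The caller is assumed to
 * hold one lock around the whole loop, so the chunks cannot interleave
 * with chunks of another write.
 */
void toy_write_split(ToyLog *log, uint64_t addr, uint64_t value,
                     unsigned size, unsigned max_access)
{
    unsigned chunk = size < max_access ? size : max_access;
    uint64_t mask = chunk >= 8 ? UINT64_MAX
                               : (((uint64_t)1 << (chunk * 8)) - 1);

    assert(size <= 8 && chunk > 0);
    for (unsigned i = 0; i < size; i += chunk) {
        toy_region_write(log, addr + i, (value >> (i * 8)) & mask, chunk);
    }
}
```

Raising `max_access` to 8 makes the same call produce a single 8-byte write, which corresponds to the behaviour the first of the two commits enables.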
  22. 20 Apr, 2017 1 commit
    • memory: add section range info for IOMMU notifier · 698feb5e
      Peter Xu authored
      In this patch, IOMMUNotifier.{start|end} are introduced to store section
      information for a specific notifier. When notification occurs, we not
      only check the notification type (MAP|UNMAP), but also check whether the
      notified iova range overlaps with the range of specific IOMMU notifier,
      and skip those notifiers if not in the listened range.
      
      When removing a region, we need to make sure we remove the correct
      VFIOGuestIOMMU by checking the IOMMUNotifier.start address as well.
      
      This patch solves the problem of vfio-pci devices receiving
      duplicated UNMAP notifications on x86 platforms when a vIOMMU is
      present. The issue is that the x86 IOMMU has a (0, 2^64-1) IOMMU
      region, which is split by the (0xfee00000, 0xfeefffff) IRQ region.
      AFAIK this (split IOMMU region) only happens on x86.
      
      This patch also helps vhost to leverage the new interface as well, so
      that vhost won't get duplicated cache flushes. In that sense, it's a
      slight performance improvement.
      Suggested-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Eric Auger <eric.auger@redhat.com>
      Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Message-Id: <1491562755-23867-2-git-send-email-peterx@redhat.com>
      [ehabkost: included extra vhost_iommu_region_del() change from Peter Xu]
      Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
  23. 17 Feb, 2017 3 commits
  24. 31 Oct, 2016 3 commits
    • vfio: Add support for mmapping sub-page MMIO BARs · 95251725
      Yongji Xie authored
      Now the kernel commit 05f0c03fbac1 ("vfio-pci: Allow to mmap
      sub-page MMIO BARs if the mmio page is exclusive") allows VFIO
      to mmap sub-page BARs. This is the corresponding QEMU patch.
      With those patches applied, we can pass through sub-page BARs
      to the guest, which can help to improve IO performance for some devices.
      
      In this patch, we expand MemoryRegions of these sub-page
      MMIO BARs to PAGE_SIZE in vfio_pci_write_config(), so that
      the BARs could be passed to KVM ioctl KVM_SET_USER_MEMORY_REGION
      with a valid size. The original size is restored when
      the base address of the sub-page BAR is changed and is no longer
      page aligned in the guest. We also set the priority of these BARs'
      memory regions to zero in case they overlap with BARs which share
      the same page with sub-page BARs in the guest.
      Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
    • vfio: Handle zero-length sparse mmap ranges · 24acf72b
      Alex Williamson authored
      As reported in the link below, a user has a PCI device with a 4KB BAR
      which contains the MSI-X table.  This seems to hit a corner case in
      the kernel where the region reports being mmap capable, but the sparse
      mmap information reports a zero sized range.  It's not entirely clear
      that the kernel is incorrect in doing this, but regardless, we need
      to handle it.  To do this, fill our mmap array only with non-zero
      sized sparse mmap entries and add an error return from the function
      so we can tell the difference between nr_mmaps being zero based on
      sparse mmap info vs lack of sparse mmap info.
      
      NB, this doesn't actually change the behavior of the device, it only
      removes the scary "Failed to mmap ... Performance may be slow" error
      message.  We cannot currently create an mmap over the MSI-X table.
      
      Link: http://lists.nongnu.org/archive/html/qemu-discuss/2016-10/msg00009.html
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
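The fix described above has two parts: keep only non-zero-sized sparse mmap entries, and return an error when no sparse mmap information exists so the caller can tell the two "zero mmaps" cases apart. A minimal sketch of that shape; the names and the choice of -ENODEV are assumptions made for illustration:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sparse mmap area as reported for a region. */
typedef struct {
    uint64_t offset;
    uint64_t size;
} ToyArea;

/*
 * Copies the non-zero-sized areas into out[], which must have room for
 * nr_areas entries, and stores the number kept in *nr_out.
 * Returns -ENODEV when there is no sparse mmap info at all (areas == NULL),
 * which is distinct from "info present but nothing mappable" (0, *nr_out == 0).
 */
int toy_setup_sparse_mmaps(const ToyArea *areas, size_t nr_areas,
                           ToyArea *out, size_t *nr_out)
{
    size_t kept = 0;

    assert(out != NULL && nr_out != NULL);
    if (areas == NULL) {
        return -ENODEV;
    }
    for (size_t i = 0; i < nr_areas; i++) {
        if (areas[i].size != 0) {
            out[kept++] = areas[i];
        }
    }
    *nr_out = kept;
    return 0;
}
```

A caller can fall back to mapping the whole region only on the error return, and skip mmap quietly when the call succeeds with zero entries.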
    • memory: Replace skip_dump flag with "ram_device" · 21e00fa5
      Alex Williamson authored
      Setting skip_dump on a MemoryRegion allows us to modify one specific
      code path, but the restriction we're trying to address encompasses
      more than that.  If we have a RAM MemoryRegion backed by a physical
      device, it not only restricts our ability to dump that region, but
      also affects how we should manipulate it.  Here we recognize that
      MemoryRegions do not change to sometimes allow dumps and other times
      not, so we replace setting the skip_dump flag with a new initializer
      so that we know exactly the type of region to which we're applying
      this behavior.
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      Acked-by: Paolo Bonzini <pbonzini@redhat.com>
  25. 17 Oct, 2016 1 commit