David Gibson | 1 Aug 04:48 2011

Re: kvm PCI assignment & VFIO ramblings

On Sat, Jul 30, 2011 at 09:58:53AM +1000, Benjamin Herrenschmidt wrote:
[snip]
> That current hack won't work well if two devices share an iommu. Note
> that we have an additional constraint here due to our paravirt
> interfaces (specified in PAPR) which is that PE domains must have a
> common parent. Basically, pHyp makes them look like a PCIe host bridge
> per domain in the guest. I think that's a pretty good idea and qemu
> might want to do the same.
> 
> - We hack out the currently unconditional mapping of the entire guest
> space in the iommu. Something will have to be done to "decide" whether
> to do that or not ... qemu argument -> ioctl ?

Not quite.  We already require the not-yet-upstream patches which add
guest-side (emulated) IOMMU support to qemu.  The approach we're using
for the passthrough (or at least will when I fix up my patches again)
is that we map all guest RAM into the vfio iommu if and only if
there is no guest-visible iommu advertised in the qdev.

This kind of makes sense - if there is no iommu from the guest
perspective, the guest will expect to see all its physical memory 1:1
in DMA.

The hacky bit is that when there *is* a guest-visible iommu, it's
assumed that whatever interface the guest iommu uses is somehow wired
up to vfio map/unmap calls.  For us at the moment, this means
passthrough devices must be assigned to a special (guest) pci
domain which wires up the paravirt iommu to the vfio iommu.
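
The mapping rule described above can be sketched in a few lines of C. This
is an illustration only, not the qemu or vfio code: the types and the
function initial_dma_mappings() are hypothetical names invented for this
sketch, and the real patches were not upstream at the time.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types; none of these names come from qemu or vfio. */
struct guest_ram_region {
    uint64_t guest_phys;  /* guest physical start address */
    uint64_t size;        /* length in bytes */
};

struct dma_mapping {
    uint64_t iova;        /* address the device uses for DMA */
    uint64_t guest_phys;  /* guest physical address it resolves to */
    uint64_t size;        /* length in bytes */
};

/*
 * Fill `out` with the mappings to install when the device is set up.
 * Without a guest-visible iommu, every RAM region is mapped 1:1
 * (iova == guest_phys). With one, nothing is mapped up front; the
 * guest iommu's own map/unmap requests are assumed to be forwarded
 * later. Returns the number of mappings written (at most max_out).
 */
size_t initial_dma_mappings(bool guest_has_iommu,
                            const struct guest_ram_region *ram, size_t nram,
                            struct dma_mapping *out, size_t max_out)
{
    size_t n = 0;

    if (guest_has_iommu)
        return 0;

    for (size_t i = 0; i < nram && n < max_out; i++) {
        out[n].iova = ram[i].guest_phys;
        out[n].guest_phys = ram[i].guest_phys;
        out[n].size = ram[i].size;
        n++;
    }
    return n;
}
```

With guest_has_iommu set to false and two RAM regions, the function returns
two identity mappings; with it set to true it returns none.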

In theory under some circumstances, with full emu, you could wire up
[...]

Deng-Cheng Zhu | 1 Aug 12:13 2011

Re: [PATCH 1/2] PCI: make pci_claim_resource() work with conflict resources as appropriate

It was found that PCI quirks claim resources (by calling pci_claim_resource())
*BEFORE* pcibios_fixup_bus() is called. In pcibios_fixup_bus(),
pci_bus->resource[0] for the root bus DOES point to msc_io_resource. If PCI
quirks do the resource claim after the arch-defined pcibios_fixup_bus() has
been called, then the problem with Malta goes away.

So, it looks like there are 2 possible solutions:

1) Change the call sequence. This does not seem desirable, as it affects other
arches.

2) Raise the start point of the system controller's io_resource in
mips_pcibios_init() in arch/mips/mti-malta/malta-pci.c. This will place PCI
quirks' resources at the same level as the system controller's resources.

Ralf and Bjorn, which one sounds good to you?

Deng-Cheng

2011/7/30 Bjorn Helgaas <bhelgaas <at> google.com>:
> On Fri, Jul 29, 2011 at 12:32 AM, Deng-Cheng Zhu
> <dengcheng.zhu <at> gmail.com> wrote:
>> I noticed that at 79896cf42f Linus changed the function from insert_resource()
>> to request_resource() (and later evolved into request_resource_conflict()) and
>> he explained the reason. So, in the NIC's case, the problem is that in
>> pci_claim_resource() the function pci_find_parent_resource() returns the root
>> (0x0-0xffffff) rather than the MSC PCI I/O (0x1000-0xffffff).
>
> This seems like the real problem: PCI has the wrong idea of the
> resources available on bus 00.  The pci_bus->resource[0] for bus 00
[...]

Bjorn Helgaas | 1 Aug 17:21 2011

Re: [PATCH 1/2] PCI: make pci_claim_resource() work with conflict resources as appropriate

On Mon, Aug 1, 2011 at 4:13 AM, Deng-Cheng Zhu <dengcheng.zhu <at> gmail.com> wrote:
> It was found that PCI quirks claim resources (by calling pci_claim_resource())
> *BEFORE* pcibios_fixup_bus() is called. In pcibios_fixup_bus(),
> pci_bus->resource[0] for the root bus DOES point to msc_io_resource. If PCI
> quirks do the resource claim after the arch-defined pcibios_fixup_bus() has
> been called, then the problem with Malta goes away.

Oh, I see.  pcibios_fixup_bus() copies the hose resources to the root
bus pci_bus structure.  I think that's bogus because we have the
interval between mips_pcibios_init() and pcibios_fixup_bus() where the
root bus resources are incorrect.
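
The window during which the root bus resources are wrong can be shown with
a small userspace sketch. It is not kernel code: struct res, struct bus,
create_bus(), fixup_bus() and find_parent() are hypothetical stand-ins for
struct resource, pci_bus->resource[0], bus creation, pcibios_fixup_bus()
and pci_find_parent_resource(). Only the two address ranges are taken from
the thread.

```c
#include <stddef.h>

struct res {
    const char *name;
    unsigned long start, end;
};

/* Ranges as quoted earlier in the thread. */
static struct res ioport_root = { "ioport_resource", 0x0, 0xffffff };
static struct res msc_io = { "MSC PCI I/O", 0x1000, 0xffffff };

struct bus {
    struct res *io; /* stands in for pci_bus->resource[0] */
};

/* Bus creation: the I/O resource defaults to the whole I/O space. */
void create_bus(struct bus *b)
{
    b->io = &ioport_root;
}

/* Later arch fixup: point the bus at the host controller's window. */
void fixup_bus(struct bus *b)
{
    b->io = &msc_io;
}

/* Parent a claim would be placed under, or NULL if it does not fit. */
const struct res *find_parent(const struct bus *b,
                              unsigned long start, unsigned long end)
{
    if (start >= b->io->start && end <= b->io->end)
        return b->io;
    return NULL;
}
```

A claim made between create_bus() and fixup_bus() is parented to the
default root resource; the same claim made after fixup_bus() is parented to
the controller window. Passing the correct resource at creation time would
remove the interval in which the answer is wrong.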

I think it would be better to set up the resources correctly right
away, as we do on x86.  In fact, I'm dubious about pci_create_bus()
filling in ioport_resource and iomem_resource as defaults -- that's
never what we really want there, and we have to rely on the arch
coming back later to fix it up.

I'd like to see some sort of restructuring there so we could pass in a
list of resources at the time we create the bus.

Bjorn

Alex Williamson | 1 Aug 18:40 2011

Re: kvm PCI assignment & VFIO ramblings

On Sun, 2011-07-31 at 08:21 +1000, Benjamin Herrenschmidt wrote:
> On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote:
> > Hi folks !
> > 
> > So I promised Anthony I would try to summarize some of the comments &
> > issues we have vs. VFIO after we've tried to use it for PCI pass-through
> > on POWER. It's pretty long, there are various items with more or less
> > impact, some of it is easily fixable, some are API issues, and we'll
> > probably want to discuss them separately, but for now here's a brain
> > dump.
> > 
> > David, Alexei, please make sure I haven't missed anything :-)
> 
> And I think I have :-)
> 
>   * Config space
> 
> VFIO currently handles that as a byte stream. It's quite gross to be
> honest and it's not right. You shouldn't lose access size information
> between guest and host when performing real accesses.
> 
> Some config space registers can have side effects and not respecting
> access sizes can be nasty.

It's a bug, let's fix it.

Alex
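
To make the access-size complaint concrete, here is a minimal userspace
sketch contrasting a byte-stream transport with a size-preserving one. It
is not the VFIO code; struct sim_dev and the cfg_write_* functions are
hypothetical, and the "device" is just an array that logs the width of
every access it sees.

```c
#include <stddef.h>
#include <stdint.h>

#define CFG_SIZE 256
#define MAX_LOG  16

/* Simulated device: config space plus a log of access widths. */
struct sim_dev {
    uint8_t cfg[CFG_SIZE];
    uint8_t width_log[MAX_LOG];
    size_t naccess;
};

/* The "hardware" access: width is 1, 2 or 4 bytes, little-endian value. */
static void hw_cfg_write(struct sim_dev *d, unsigned off, unsigned width,
                         uint32_t val)
{
    for (unsigned i = 0; i < width; i++)
        d->cfg[off + i] = (uint8_t)(val >> (8 * i));
    if (d->naccess < MAX_LOG)
        d->width_log[d->naccess] = (uint8_t)width;
    d->naccess++;
}

/* Byte-stream transport: the original width is lost, so one guest
 * access becomes len separate 1-byte accesses on the device. */
int cfg_write_bytes(struct sim_dev *d, unsigned off,
                    const uint8_t *buf, size_t len)
{
    if (len > CFG_SIZE || off > CFG_SIZE - len)
        return -1;
    for (size_t i = 0; i < len; i++)
        hw_cfg_write(d, off + (unsigned)i, 1, buf[i]);
    return 0;
}

/* Size-preserving transport: one guest access, one device access.
 * Returns -1 on an unsupported width, misalignment or overflow. */
int cfg_write_sized(struct sim_dev *d, unsigned off, unsigned width,
                    uint32_t val)
{
    if (width != 1 && width != 2 && width != 4)
        return -1;
    if (off % width != 0 || off > CFG_SIZE - width)
        return -1;
    hw_cfg_write(d, off, width, val);
    return 0;
}
```

Both paths leave the same bytes in the array, but the device sees four
1-byte writes in the first case and a single 4-byte write in the second. On
real hardware with side-effecting registers that difference can matter.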

Jesse Barnes | 1 Aug 20:52 2011

Re: [RFC v7] PCI: Set PCI-E Max Payload Size on fabric

On Wed, 20 Jul 2011 15:20:54 -0500
Jon Mason <mason <at> myri.com> wrote:

> On a given PCI-E fabric, each device, bridge, and root port can have a
> different PCI-E maximum payload size.  There is a sizable performance
> boost for having the largest possible maximum payload size on each PCI-E
> device.  However, if improperly configured, fatal bus errors can occur.
> Thus, it is important to ensure that PCI-E payloads sent by a device
> are never larger than the MPS setting of all devices on the way to the
> destination.

Applied to my for-linus branch, thanks for being so persistent Jon!
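
The constraint in the quoted description (no payload larger than the MPS of
any device on the path) amounts to taking a minimum over the path. The
sketch below shows that arithmetic only; it is not the code from Jon's
patch, and path_mps_code() and mps_code_to_bytes() are hypothetical helper
names. The encoding of 128 bytes shifted left by a 3-bit code is the one
used by the PCI-E Max Payload Size fields.

```c
#include <stddef.h>

/* PCI-E encodes Max Payload Size as a code: bytes = 128 << code,
 * with valid codes 0 (128 bytes) through 5 (4096 bytes). */
unsigned mps_code_to_bytes(unsigned code)
{
    return 128u << code;
}

/* Largest MPS code that every device on the path supports, i.e. the
 * minimum of the supported codes. Returns 5 for an empty path. */
unsigned path_mps_code(const unsigned *supported, size_t n)
{
    unsigned min = 5;

    for (size_t i = 0; i < n; i++)
        if (supported[i] < min)
            min = supported[i];
    return min;
}
```

For a path whose devices support codes 2, 1 and 3 (512, 256 and 1024
bytes), the safe setting is code 1, that is 256 bytes.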

-- 
Jesse Barnes, Intel Open Source Technology Center

Jesse Barnes | 1 Aug 20:52 2011

Re: [PATCH 1/5 v3] PCI: honor child buses add_size in hot plug configuration

On Mon, 25 Jul 2011 13:08:38 -0700
Ram Pai <linuxram <at> us.ibm.com> wrote:

> From: Yinghai Lu <yinghai <at> kernel.org>
> 
> git commit c8adf9a3e873eddaaec11ac410a99ef6b9656938
>     "PCI: pre-allocate additional resources to devices only after
> 	successful allocation of essential resources."
> 
> fails to take into consideration the optional-resources needed by children
> devices while calculating the optional-resource needed by the bridge.
> 
> This can be a problem on some setups. For example, if a hotplug bridge has 8
> children hotplug bridges, the bridge should have enough resources to accommodate
> the hotplug requirements for each of its children hotplug bridges.  Currently
> this is not the case.
> 
> This patch fixes the problem.

Just pulled these guys into my for-linus branch.  They should be safe
since they're behind the new =realloc param, but please test.
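
The sizing idea in the quoted description can be sketched as follows. This
is a guess at the arithmetic, not the patch itself: bridge_add_size() is a
hypothetical name, and taking the larger of the bridge's own optional size
and the children's total is an assumption of this sketch that the patch
description does not state.

```c
#include <stddef.h>

/*
 * Optional (hot plug) resource size a bridge window should reserve.
 * own_add_size is what the bridge would ask for by itself;
 * child_add_size[] holds the optional sizes of its child bridges.
 * Assumption: the window must cover the children's total, and never
 * shrinks below the bridge's own request.
 */
unsigned long bridge_add_size(unsigned long own_add_size,
                              const unsigned long *child_add_size,
                              size_t nchildren)
{
    unsigned long children_total = 0;

    for (size_t i = 0; i < nchildren; i++)
        children_total += child_add_size[i];

    return children_total > own_add_size ? children_total : own_add_size;
}
```

With eight child hotplug bridges that each want 256 units and a parent that
would ask for 256 by itself, the parent ends up reserving 2048.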

Thanks,
-- 
Jesse Barnes, Intel Open Source Technology Center

Alex Williamson | 1 Aug 20:59 2011

Re: kvm PCI assignment & VFIO ramblings

On Sun, 2011-07-31 at 09:54 +1000, Benjamin Herrenschmidt wrote:
> On Sat, 2011-07-30 at 12:20 -0600, Alex Williamson wrote:
> 
> > On x86, the USB controllers don't typically live behind a PCIe-to-PCI
> > bridge, so don't suffer the source identifier problem, but they do often
> > share an interrupt.  But even then, we can count on most modern devices
> > supporting PCI2.3, and thus the DisINTx feature, which allows us to
> > share interrupts.  In any case, yes, it's more rare but we need to know
> > how to handle devices behind PCI bridges.  However I disagree that we
> > need to assign all the devices behind such a bridge to the guest.
> 
> Well, ok so let's dig a bit more here :-) First, yes I agree they don't
> all need to appear to the guest. My point is really that we must prevent
> them from being "used" by somebody else, either host or another guest.
> 
> Now once you get there, I personally prefer having a clear "group"
> ownership rather than having devices stay in some "limbo" under vfio
> control but it's an implementation detail.
> 
> Regarding DisINTx, well, it's a bit like putting separate PCIe functions
> into separate guests, it looks good ... but you are taking a chance.
> Note that I do intend to do some of that for power ... well I think, I
> haven't completely made up my mind.
> 
> pHyp has a stricter requirement, PEs essentially are everything
> behind a bridge. If you have a slot, you have some kind of bridge above
> this slot and everything on it will be a PE.
> 
> The problem I see is that with your filtering of config space, BAR
> emulation, DisINTx etc... you essentially assume that you can reasonably
[...]

Alex Williamson | 1 Aug 22:27 2011

Re: kvm PCI assignment & VFIO ramblings

On Sun, 2011-07-31 at 17:09 +0300, Avi Kivity wrote:
> On 07/30/2011 02:58 AM, Benjamin Herrenschmidt wrote:
> > Due to our paravirt nature, we don't need to masquerade the MSI-X table
> > for example. At all. If the guest configures crap into it, too bad, it
> > can only shoot itself in the foot since the host bridge enforces
> > validation anyways as I explained earlier. Because it's all paravirt, we
> > don't need to "translate" the interrupt vectors & addresses, the guest
> > will call hypercalls to configure things anyways.
> 
> So, you have interrupt redirection?  That is, MSI-x table values encode 
> the vcpu, not pcpu?
> 
> Alex, with interrupt redirection, we can skip this as well?  Perhaps 
> only if the guest enables interrupt redirection?

It's not clear to me how we could skip it.  With VT-d, we'd have to
implement an emulated interrupt remapper and hope that the guest picks
unused indexes in the host interrupt remapping table before it could do
anything useful with direct access to the MSI-X table.  Maybe AMD IOMMU
makes this easier?  Thanks,

Alex

Benjamin Herrenschmidt | 2 Aug 03:27 2011

Re: kvm PCI assignment & VFIO ramblings

On Sun, 2011-07-31 at 17:09 +0300, Avi Kivity wrote:
> On 07/30/2011 02:58 AM, Benjamin Herrenschmidt wrote:
> > - Having a magic heuristic in libvirt to figure out those constraints is
> > WRONG. This reeks of XFree 4 PCI layer trying to duplicate the kernel
> > knowledge of PCI resource management and getting it wrong in many many
> > cases, something that took years to fix essentially by ripping it all
> > out. This is kernel knowledge and thus we need the kernel to expose in
> > one way or another what those constraints are, what those "partitionable
> > groups" are.
> 
> How about a sysfs entry partition=<partition-id>? then libvirt knows not 
> to assign devices from the same partition to different guests (and not 
> to let the host play with them, either).

That would work. On POWER I also need to expose the fact that such
partitions also imply a shared iommu domain, but that's probably doable.

It would be easy for me to implement it that way since I would just pass
down my PE#.
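
The check a management tool would make with such an attribute can be
sketched like this. Everything here is hypothetical: the "partition" sysfs
entry was only a proposal in this thread, and struct assignment,
HOST_OWNER and assignments_respect_partitions() are names invented for the
illustration.

```c
#include <stdbool.h>
#include <stddef.h>

#define HOST_OWNER (-1) /* device stays with the host */

struct assignment {
    int partition_id; /* value read from the proposed sysfs attribute */
    int owner;        /* guest id, or HOST_OWNER */
};

/*
 * True if no partition is split between owners, i.e. all devices that
 * share a partition id go to the same guest (or all stay with the host).
 */
bool assignments_respect_partitions(const struct assignment *a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (a[i].partition_id == a[j].partition_id &&
                a[i].owner != a[j].owner)
                return false;
    return true;
}
```

Two devices in partition 1 both given to guest 0 pass the check; giving one
of them to guest 1, or leaving one with the host, fails it.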

However, it seems to be a bit of the "smallest possible tweak" to get it
to work. We keep a completely orthogonal iommu domain handling for x86
and there is no link between them.

I still personally prefer a way to statically define the grouping, but
it looks like you guys don't agree... oh well.

> > The interface currently proposed for VFIO (and associated uiommu)
> > doesn't handle that problem at all. Instead, it is entirely centered
> > around a specific "feature" of the VTd iommu's for creating arbitrary
[...]

Benjamin Herrenschmidt | 2 Aug 03:29 2011

Re: kvm PCI assignment & VFIO ramblings

On Mon, 2011-08-01 at 10:40 -0600, Alex Williamson wrote:
> On Sun, 2011-07-31 at 08:21 +1000, Benjamin Herrenschmidt wrote:
> > On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote:
> > > Hi folks !
> > > 
> > > So I promised Anthony I would try to summarize some of the comments &
> > > issues we have vs. VFIO after we've tried to use it for PCI pass-through
> > > on POWER. It's pretty long, there are various items with more or less
> > > impact, some of it is easily fixable, some are API issues, and we'll
> > > probably want to discuss them separately, but for now here's a brain
> > > dump.
> > > 
> > > David, Alexei, please make sure I haven't missed anything :-)
> > 
> > And I think I have :-)
> > 
> >   * Config space
> > 
> > VFIO currently handles that as a byte stream. It's quite gross to be
> > honest and it's not right. You shouldn't lose access size information
> > between guest and host when performing real accesses.
> > 
> > Some config space registers can have side effects and not respecting
> > access sizes can be nasty.
> 
> It's a bug, let's fix it.

Right. I was just trying to be exhaustive :-) If you don't beat us to
it, we'll eventually submit patches to fix it. We haven't fixed it yet
either, it's just something I noticed (because this byte-transport also makes
[...]

