Deng-Cheng Zhu | 1 Aug 2011 12:13

Re: [PATCH 1/2] PCI: make pci_claim_resource() work with conflict resources as appropriate

It was found that PCI quirks claim resources (by calling pci_claim_resource())
*BEFORE* pcibios_fixup_bus() is called. In pcibios_fixup_bus(),
pci_bus->resource[0] for the root bus DOES point to msc_io_resource. If PCI
quirks claim their resources after the arch-defined pcibios_fixup_bus() has been
called, the problem with Malta goes away.

So it looks like there are two possible solutions:

1) Change the call sequence. This does not seem desirable, as it affects other
arches.

2) Raise the start of the system controller's io_resource in
mips_pcibios_init() in arch/mips/mti-malta/malta-pci.c. This would place the PCI
quirks' resources at the same level as the system controller's resources.

Ralf and Bjorn, which one sounds good to you?

Deng-Cheng

2011/7/30 Bjorn Helgaas <bhelgaas <at> google.com>:
> On Fri, Jul 29, 2011 at 12:32 AM, Deng-Cheng Zhu
> <dengcheng.zhu <at> gmail.com> wrote:
>> I noticed that at 79896cf42f Linus changed the function from insert_resource()
>> to request_resource() (and later evolved into request_resource_conflict()) and
>> he explained the reason. So, in the NIC's case, the problem is that in
>> pci_claim_resource() the function pci_find_parent_resource() returns the root
>> (0x0-0xffffff) rather than the MSC PCI I/O (0x1000-0xffffff).
>
> This seems like the real problem: PCI has the wrong idea of the
> resources available on bus 00.  The pci_bus->resource[0] for bus 00
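The failure mode discussed in this message can be illustrated with a minimal Python sketch. It assumes a simplified model of the request_resource()-style rule, in which a claim under a parent fails if it overlaps any existing child of that parent. The `Resource` class and `request` function are illustrative stand-ins, not kernel code. The root and MSC PCI I/O ranges are the ones quoted in the thread; the NIC range 0x1040-0x107f is a hypothetical value chosen only for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Resource:
    """Toy stand-in for the kernel's struct resource (illustrative only)."""
    name: str
    start: int
    end: int  # inclusive
    children: List["Resource"] = field(default_factory=list)


def request(parent: Resource, new: Resource) -> Optional[Resource]:
    """Try to add `new` directly under `parent`.

    Returns the conflicting resource, or None on success.
    """
    if new.start < parent.start or new.end > parent.end:
        return parent  # does not fit inside the parent at all
    for child in parent.children:
        if new.start <= child.end and child.start <= new.end:
            return child  # overlaps an existing child: conflict
    parent.children.append(new)
    return None


# Ranges quoted in the thread.
root = Resource("root", 0x0, 0xFFFFFF)
msc_io = Resource("MSC PCI I/O", 0x1000, 0xFFFFFF)
assert request(root, msc_io) is None

# Hypothetical NIC I/O range, chosen only for illustration.
nic = Resource("NIC I/O", 0x1040, 0x107F)

# Parent wrongly resolved to the root: the claim collides with MSC PCI I/O.
conflict = request(root, nic)
print("claim under root ->", conflict.name if conflict else "ok")

# Parent resolved to the MSC window: the claim succeeds.
conflict = request(msc_io, nic)
print("claim under MSC PCI I/O ->", conflict.name if conflict else "ok")
```

In this toy model the claim only succeeds when the parent lookup returns the MSC window, which mirrors the observation that the quirk runs before pci_bus->resource[0] points at msc_io_resource.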

Alex Williamson | 1 Aug 2011 18:40

Re: kvm PCI assignment & VFIO ramblings

On Sun, 2011-07-31 at 08:21 +1000, Benjamin Herrenschmidt wrote:
> On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote:
> > Hi folks !
> > 
> > So I promised Anthony I would try to summarize some of the comments &
> > issues we have vs. VFIO after we've tried to use it for PCI pass-through
> > on POWER. It's pretty long, there are various items with more or less
> > impact, some of it is easily fixable, some are API issues, and we'll
> > probably want to discuss them separately, but for now here's a brain
> > dump.
> > 
> > David, Alexei, please make sure I haven't missed anything :-)
> 
> And I think I have :-)
> 
>   * Config space
> 
> VFIO currently handles that as a byte stream. It's quite gross to be
> honest and it's not right. You shouldn't lose access size information
> between guest and host when performing real accesses.
> 
> Some config space registers can have side effects and not respecting
> access sizes can be nasty.

It's a bug, let's fix it.

Alex
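A minimal sketch of the access-size concern raised above, assuming a toy device model that only records what it sees. `RecordingDevice`, both helper functions, and the offset 0x40 are hypothetical names and values for illustration; none of this is VFIO's actual interface.

```python
from typing import List, Tuple


class RecordingDevice:
    """Toy device that logs each config access it sees (illustrative only)."""

    def __init__(self) -> None:
        self.accesses: List[Tuple[int, int, int]] = []  # (offset, size, value)

    def config_write(self, offset: int, size: int, value: int) -> None:
        assert size in (1, 2, 4), "config accesses are 1, 2 or 4 bytes wide"
        assert offset % size == 0, "access must be naturally aligned"
        self.accesses.append((offset, size, value))


def write_as_byte_stream(dev: RecordingDevice, offset: int, data: bytes) -> None:
    """Forward a guest write one byte at a time, losing the access size."""
    for i, byte in enumerate(data):
        dev.config_write(offset + i, 1, byte)


def write_preserving_size(dev: RecordingDevice, offset: int, data: bytes) -> None:
    """Forward a guest write as a single access of the original width."""
    dev.config_write(offset, len(data), int.from_bytes(data, "little"))


guest_write = (0x12345678).to_bytes(4, "little")  # one 32-bit guest write

a = RecordingDevice()
write_as_byte_stream(a, 0x40, guest_write)

b = RecordingDevice()
write_preserving_size(b, 0x40, guest_write)

print(len(a.accesses), "accesses when treated as a byte stream")
print(len(b.accesses), "access when the size is preserved")
```

A register with side effects would observe four partial writes in the first case and a single 32-bit write in the second, which is the difference the complaint is about.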

Jesse Barnes | 1 Aug 2011 20:52

Re: [RFC v7] PCI: Set PCI-E Max Payload Size on fabric

On Wed, 20 Jul 2011 15:20:54 -0500
Jon Mason <mason <at> myri.com> wrote:

> On a given PCI-E fabric, each device, bridge, and root port can have a
> different PCI-E maximum payload size.  There is a sizable performance
> boost for having the largest possible maximum payload size on each PCI-E
> device.  However, if improperly configured, fatal bus errors can occur.
> Thus, it is important to ensure that PCI-E payloads sent by a device
> are never larger than the MPS setting of all devices on the way to the
> destination.

Applied to my for-linus branch, thanks for being so persistent Jon!
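
The constraint in the quoted description can be sketched in Python as follows. This is not the patch itself, only an illustration under stated assumptions: the fabric in `supported_mps` and the device names are hypothetical, and `safe_mps` simply takes the minimum over a path. The 3-bit encoding decoded by `mps_from_field` (0 means 128 bytes, each step doubles, up to 4096) is the standard PCI-E one.

```python
from typing import Dict, List

# Hypothetical fabric: each device's supported Max Payload Size in bytes.
supported_mps: Dict[str, int] = {
    "root_port": 512,
    "switch_upstream": 256,
    "switch_downstream": 256,
    "endpoint": 1024,
}


def mps_from_field(field: int) -> int:
    """Decode the 3-bit MPS encoding: 0 -> 128 bytes, 1 -> 256, ... 5 -> 4096."""
    if not 0 <= field <= 5:
        raise ValueError("reserved MPS encoding")
    return 128 << field


def safe_mps(path: List[str]) -> int:
    """Largest payload that every device on the path can accept."""
    return min(supported_mps[name] for name in path)


path = ["endpoint", "switch_downstream", "switch_upstream", "root_port"]
print("safe MPS for this path:", safe_mps(path), "bytes")
```

With these made-up values the endpoint could send 1024-byte payloads on its own, but the switch limits the whole path to 256 bytes.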

-- 
Jesse Barnes, Intel Open Source Technology Center

Alex Williamson | 1 Aug 2011 20:59

Re: kvm PCI assignment & VFIO ramblings

On Sun, 2011-07-31 at 09:54 +1000, Benjamin Herrenschmidt wrote:
> On Sat, 2011-07-30 at 12:20 -0600, Alex Williamson wrote:
> 
> > On x86, the USB controllers don't typically live behind a PCIe-to-PCI
> > bridge, so don't suffer the source identifier problem, but they do often
> > share an interrupt.  But even then, we can count on most modern devices
> > supporting PCI2.3, and thus the DisINTx feature, which allows us to
> > share interrupts.  In any case, yes, it's more rare but we need to know
> > how to handle devices behind PCI bridges.  However I disagree that we
> > need to assign all the devices behind such a bridge to the guest.
> 
> Well, ok so let's dig a bit more here :-) First, yes I agree they don't
> all need to appear to the guest. My point is really that we must prevent
> them from being "used" by somebody else, either host or another guest.
> 
> Now once you get there, I personally prefer having a clear "group"
> ownership rather than having devices stay in some "limbo" under vfio
> control but it's an implementation detail.
> 
> Regarding DisINTx, well, it's a bit like putting separate PCIe functions
> into separate guests, it looks good ... but you are taking a chance.
> Note that I do intend to do some of that for power ... well I think, I
> haven't completely made up my mind.
> 
> pHyp has a stricter requirement, PEs essentially are everything
> behind a bridge. If you have a slot, you have some kind of bridge above
> this slot and everything on it will be a PE.
> 
> The problem I see is that with your filtering of config space, BAR
> emulation, DisINTx etc... you essentially assume that you can reasonably

Alex Williamson | 1 Aug 2011 22:27

Re: kvm PCI assignment & VFIO ramblings

On Sun, 2011-07-31 at 17:09 +0300, Avi Kivity wrote:
> On 07/30/2011 02:58 AM, Benjamin Herrenschmidt wrote:
> > Due to our paravirt nature, we don't need to masquerade the MSI-X table
> > for example. At all. If the guest configures crap into it, too bad, it
> > can only shoot itself in the foot since the host bridge enforces
> > validation anyways as I explained earlier. Because it's all paravirt, we
> > don't need to "translate" the interrupt vectors & addresses, the guest
> > will call hypercalls to configure things anyways.
> 
> So, you have interrupt redirection?  That is, MSI-x table values encode 
> the vcpu, not pcpu?
> 
> Alex, with interrupt redirection, we can skip this as well?  Perhaps 
> only if the guest enables interrupt redirection?

It's not clear to me how we could skip it.  With VT-d, we'd have to
implement an emulated interrupt remapper and hope that the guest picks
unused indexes in the host interrupt remapping table before it could do
anything useful with direct access to the MSI-X table.  Maybe AMD IOMMU
makes this easier?  Thanks,

Alex

Nikolaev, Maxim | 2 Aug 2011 05:47

BAR Conflict

Hello Folks,

I am working on passing through an nVidia PCI video card from a Linux host
to a Windows guest. I was able to do it for an FX-3800, but I ran into a
problem with a Quadro 2000: one of the Quadro 2000's BARs conflicts with
memory reported by BIOS-e820 as reserved. The two ranges overlap, and kvm
refuses to go any further. dmesg, /proc/iomem and lspci output are at the
end of this email.
The memory reserved by BIOS-e820:
[    0.000000]  BIOS-e820: 00000000fa000000 - 00000000fc000000 (reserved)
The Quadro 2000 is 05:00.0 and its conflicting BAR is:
	Region 3: Memory at f8000000 (64-bit, prefetchable) [disabled] [size=64M]
and its upstream bridge is 00:1c.00

I have a number of questions and would greatly appreciate any
clarification:
1. Is this conflict really a kernel PCI problem? (I assume that kvm may
simply not handle this case properly.)
2. If it is a kernel problem/feature, how can I avoid it? Where is BAR
allocation located in the kernel source code? I enabled pci=earlydump and
see that the PCI config space is the same at the early boot stage and
after the kernel has fully booted, which means that, at least in my case,
the kernel does nothing with the BARs allocated by the BIOS. I've tried
different kernel PCI options (nocrs, use_crs, noacpi) with the same
result: the BARs are always located at the same addresses. What I need is
to make the kernel allocate the BARs at different addresses that do not
overlap the BIOS-e820 reserved region. How can I make the kernel arrange
the BARs differently from the way the BIOS does?

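The overlap reported in this message can be checked with a few lines of Python. The BAR base, BAR size and e820 addresses are taken from the message; treating the e820 end address as exclusive is an assumption, and the `overlap` helper is a hypothetical name used only for this sketch.

```python
MIB = 1 << 20


def overlap(a_start, a_end, b_start, b_end):
    """Return the inclusive overlap of two inclusive ranges, or None."""
    start = max(a_start, b_start)
    end = min(a_end, b_end)
    return (start, end) if start <= end else None


# Region 3 of 05:00.0: base f8000000, size 64M.
bar_start = 0xF8000000
bar_end = bar_start + 64 * MIB - 1

# BIOS-e820 reserved range; the end is assumed to be exclusive.
rsv_start = 0xFA000000
rsv_end = 0xFC000000 - 1

hit = overlap(bar_start, bar_end, rsv_start, rsv_end)
if hit:
    size_mib = (hit[1] - hit[0] + 1) // MIB
    print(f"BAR {bar_start:#x}-{bar_end:#x} overlaps reserved "
          f"{hit[0]:#x}-{hit[1]:#x} ({size_mib} MiB)")
else:
    print("no overlap")
```

Under that assumption the upper half of the 64M BAR lies inside the reserved region.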

Avi Kivity | 2 Aug 2011 10:32

Re: kvm PCI assignment & VFIO ramblings

On 08/01/2011 11:27 PM, Alex Williamson wrote:
> On Sun, 2011-07-31 at 17:09 +0300, Avi Kivity wrote:
> >  On 07/30/2011 02:58 AM, Benjamin Herrenschmidt wrote:
> >  >  Due to our paravirt nature, we don't need to masquerade the MSI-X table
> >  >  for example. At all. If the guest configures crap into it, too bad, it
> >  >  can only shoot itself in the foot since the host bridge enforces
> >  >  validation anyways as I explained earlier. Because it's all paravirt, we
> >  >  don't need to "translate" the interrupt vectors & addresses, the guest
> >  >  will call hypercalls to configure things anyways.
> >
> >  So, you have interrupt redirection?  That is, MSI-x table values encode
> >  the vcpu, not pcpu?
> >
> >  Alex, with interrupt redirection, we can skip this as well?  Perhaps
> >  only if the guest enables interrupt redirection?
>
> It's not clear to me how we could skip it.  With VT-d, we'd have to
> implement an emulated interrupt remapper and hope that the guest picks
> unused indexes in the host interrupt remapping table before it could do
> anything useful with direct access to the MSI-X table.

Yeah.  We need the interrupt remapping hardware to indirect based on the 
source of the message, not just the address and data.

> Maybe AMD IOMMU
> makes this easier?  Thanks,
>

No idea.


Avi Kivity | 2 Aug 2011 11:12

Re: kvm PCI assignment & VFIO ramblings

On 08/02/2011 04:27 AM, Benjamin Herrenschmidt wrote:
> >
> >  I have a feeling you'll be getting the same capabilities sooner or
> >  later, or you won't be able to make use of S/R IOV VFs.
>
> I'm not sure what you mean. We can do SR/IOV just fine (well, with some
> limitations due to constraints with how our MMIO segmenting works and
> indeed some of those are being lifted in our future chipsets but
> overall, it works).

Don't those limitations include "all VFs must be assigned to the same 
guest"?

PCI on x86 has function granularity, SRIOV reduces this to VF 
granularity, but I thought power has partition or group granularity 
which is much coarser?

> In -theory-, one could do the grouping dynamically with some kind of API
> for us as well. However the constraints are such that it's not
> practical. Filtering on RID is based on number of bits to match in the
> bus number and whether to match the dev and fn. So it's not arbitrary
> (but works fine for SR-IOV).
>
> The MMIO segmentation is a bit special too. There is a single MMIO
> region in 32-bit space (size is configurable but that's not very
> practical so for now we stick it to 1G) which is evenly divided into N
> segments (where N is the number of PE# supported by the host bridge,
> typically 128 with the current bridges).
>
> Each segment goes through a remapping table to select the actual PE# (so
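The MMIO segmentation described in the quoted text works out to 8 MiB per segment (1G divided by 128). A minimal Python sketch of the lookup, assuming a hypothetical window base address and a made-up remapping table; only the 1G window size and the 128-segment count come from the message:

```python
MMIO_BASE = 0x80000000          # hypothetical base of the 32-bit MMIO window
MMIO_SIZE = 1 << 30             # 1G window, as stated in the message
NUM_SEGMENTS = 128              # typical PE# count, as stated in the message
SEGMENT_SIZE = MMIO_SIZE // NUM_SEGMENTS

# Hypothetical remapping table: segment index -> PE#. Identity by default,
# with one entry changed to show that the mapping need not be 1:1.
remap = list(range(NUM_SEGMENTS))
remap[3] = 42


def segment_of(addr: int) -> int:
    """Index of the segment that contains `addr`."""
    offset = addr - MMIO_BASE
    if not 0 <= offset < MMIO_SIZE:
        raise ValueError("address outside the MMIO window")
    return offset // SEGMENT_SIZE


def pe_of(addr: int) -> int:
    """PE# selected for `addr` after going through the remapping table."""
    return remap[segment_of(addr)]


addr = MMIO_BASE + 3 * SEGMENT_SIZE + 0x100
print(f"segment size: {SEGMENT_SIZE // (1 << 20)} MiB")
print(f"{addr:#x} -> segment {segment_of(addr)} -> PE# {pe_of(addr)}")
```

Because every segment has the same fixed size, a BAR larger than one segment has to span several consecutive segments that all map to the same PE#, which is one way to read the constraints mentioned in the message.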

David Gibson | 2 Aug 2011 10:28

Re: kvm PCI assignment & VFIO ramblings

On Sat, Jul 30, 2011 at 12:20:08PM -0600, Alex Williamson wrote:
> On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote:
[snip]
> On x86, the USB controllers don't typically live behind a PCIe-to-PCI
> bridge, so don't suffer the source identifier problem, but they do often
> share an interrupt.  But even then, we can count on most modern devices
> supporting PCI2.3, and thus the DisINTx feature, which allows us to
> share interrupts.  In any case, yes, it's more rare but we need to know
> how to handle devices behind PCI bridges.  However I disagree that we
> need to assign all the devices behind such a bridge to the guest.
> There's a difference between removing the device from the host and
> exposing the device to the guest.

I think you're arguing only over details of what words to use for
what, rather than anything of substance here.  The point is that an
entire partitionable group must be assigned to "host" (in which case
kernel drivers may bind to it) or to a particular guest partition (or
at least to a single UID on the host).  Which of the assigned devices
the partition actually uses is another matter of course, as is at
exactly which level they become "de-exposed" if you don't want to use
all of them.

[snip]
> > Maybe something like /sys/devgroups ? This probably warrants involving
> > more kernel people into the discussion.
> 
> I don't yet buy into passing groups to qemu since I don't buy into the
> idea of always exposing all of those devices to qemu.  Would it be
> sufficient to expose iommu nodes in sysfs that link to the devices
> behind them and describe properties and capabilities of the iommu

Benjamin Herrenschmidt | 2 Aug 2011 14:58

Re: kvm PCI assignment & VFIO ramblings

On Tue, 2011-08-02 at 12:12 +0300, Avi Kivity wrote:
> On 08/02/2011 04:27 AM, Benjamin Herrenschmidt wrote:
> > >
> > >  I have a feeling you'll be getting the same capabilities sooner or
> > >  later, or you won't be able to make use of S/R IOV VFs.
> >
> > I'm not sure what you mean. We can do SR/IOV just fine (well, with some
> > limitations due to constraints with how our MMIO segmenting works and
> > indeed some of those are being lifted in our future chipsets but
> > overall, it works).
> 
> Don't those limitations include "all VFs must be assigned to the same 
> guest"?

No, not at all. We put them in different PE#s, and because the HW is
SR-IOV we know we can trust it to the extent that it won't have nasty
hidden side effects between them. We have 64-bit windows for MMIO that
are also segmented and that we can "resize" to map over the VF BAR
region. The limitations are more about the allowed sizes, the number of
segments supported, etc., which can cause us to play interesting games
with the system page size setting to find a good match.

> PCI on x86 has function granularity, SRIOV reduces this to VF 
> granularity, but I thought power has partition or group granularity 
> which is much coarser?

The granularity of a "Group" really depends on what the HW is like. On
pure PCIe SR-IOV we can go down to function granularity.


