Lorenzo Pieralisi | 1 Feb 15:59
Favicon

Re: [PATCH 0/3] coupled cpuidle state support

On Wed, Feb 01, 2012 at 12:13:26PM +0000, Vincent Guittot wrote:

[...]

> >> In your patch, you put in safe state (WFI for most of platform) the
> >> cpus that become idle and these cpus are woken up each time a new cpu
> >> of the cluster becomes idle. Then, the cluster state is chosen and the
> >> cpus enter the selected C-state. On ux500, we are using another
> >> behavior for synchronizing  the cpus. The cpus are prepared to enter
> >> the c-state that has been chosen by the governor and the last cpu,
> >> that enters idle, chooses the final cluster state (according to cpus'
> >> C-state). The main advantage of this solution is that you don't need
> >> to wake other cpus to enter the C-state of a cluster. This can be
> >> quite worth full when tasks mainly run on one cpu. Have you also think
> >> about such behavior when developing the coupled cpuidle driver ? It
> >> could be interesting to add such behavior.
> >
> > Waking up the cpus that are in the safe state is not done just to
> > choose the target state, it's done to allow the cpus to take
> > themselves to the target low power state.  On ux500, are you saying
> > you take the cpus directly from the safe state to a lower power state
> > without ever going back to the active state?  I once implemented Tegra
> 
> yes it is

But if there is a single power rail for the entire cluster, when a CPU
is "prepared" for shutdown this means that you have to save the context and
clean L1, maybe for nothing since if other CPUs are up and running the
CPU going idle can just enter a simple standby wfi (clock-gated but power on).

(Continue reading)

Lorenzo Pieralisi | 1 Feb 19:07
Favicon

Re: [PATCH 0/3] coupled cpuidle state support

On Wed, Feb 01, 2012 at 05:30:15PM +0000, Colin Cross wrote:
> On Wed, Feb 1, 2012 at 6:59 AM, Lorenzo Pieralisi
> <lorenzo.pieralisi <at> arm.com> wrote:
> > On Wed, Feb 01, 2012 at 12:13:26PM +0000, Vincent Guittot wrote:
> >
> > [...]
> >
> >> >> In your patch, you put in safe state (WFI for most of platform) the
> >> >> cpus that become idle and these cpus are woken up each time a new cpu
> >> >> of the cluster becomes idle. Then, the cluster state is chosen and the
> >> >> cpus enter the selected C-state. On ux500, we are using another
> >> >> behavior for synchronizing  the cpus. The cpus are prepared to enter
> >> >> the c-state that has been chosen by the governor and the last cpu,
> >> >> that enters idle, chooses the final cluster state (according to cpus'
> >> >> C-state). The main advantage of this solution is that you don't need
> >> >> to wake other cpus to enter the C-state of a cluster. This can be
> >> >> quite worth full when tasks mainly run on one cpu. Have you also think
> >> >> about such behavior when developing the coupled cpuidle driver ? It
> >> >> could be interesting to add such behavior.
> >> >
> >> > Waking up the cpus that are in the safe state is not done just to
> >> > choose the target state, it's done to allow the cpus to take
> >> > themselves to the target low power state.  On ux500, are you saying
> >> > you take the cpus directly from the safe state to a lower power state
> >> > without ever going back to the active state?  I once implemented Tegra
> >>
> >> yes it is
> >
> > But if there is a single power rail for the entire cluster, when a CPU
> > is "prepared" for shutdown this means that you have to save the context and
(Continue reading)

Antti P Miettinen | 2 Feb 07:06
Favicon

Re: [PATCH v2 0/8] RFC: CPU frequency min/max as PM QoS params

"Rafael J. Wysocki" <rjw <at> sisk.pl> writes:
> On Sunday, January 22, 2012, Antti P Miettinen wrote:
[..]
>> Seems that the device specific constraints are not yet in use in
>> 3.3-rc1, or am I not looking hard enough?
>
> They are in use through generic PM domains (drivers/base/power/domain*.c
> and friends) and ARM/shmobile uses those.
>
> Thanks,
> Rafael

Sorry for the delay - got pre-empted by other stuff. I took a look at
the per device constraints. Do I understand it correctly that the idea
is that there is only one constraint per device? If we want to make
frequency and latency per CPU I guess we'd need separate constraints
associated with the CPU device. Or do I misunderstand something?

Or would global CPU frequency be more in line with global CPU latency
after all?

	--Antti

Amit Kachhap | 2 Feb 08:14
Favicon

Re: [RFC PATCH 1/2] thermal: Add a new trip type to use cooling device instance number



On 1 February 2012 20:19, Matthew Garrett <mjg <at> redhat.com> wrote:
I'm not really a fan of this as it stands - the name isn't very
intuitive and the code's pretty difficult to read. Would the following
(incomplete and obviously untested) not have the effect you want? Then
you register multiple trip points with the same cooling device but
different private values, and the state set does whatever you want it
to. Or am I misunderstanding the problem you're trying to solve?

Thanks for the detailed review of the patch. Actually i tried to merge the benefits of trip type ACTIVE and PASSIVE into one so this name. This new trip type is just like ACTIVE but instead of OFF(0)/ON(1)  all values greater then 0 is on. Anyways I looked at your implementation below but this will not solve the purpose as   thermal_cooling_device_register() need to be be called only once to register a cooling device such cpu frequency based. However the same cooling device may be binded many times to different trips.
Say,
1) thermal_zone_bind_cooling_device(tz_dev, 0, cdev);
2) thermal_zone_bind_cooling_device(tz_dev, 1, cdev);
3) thermal_zone_bind_cooling_device(tz_dev, 2, cdev);

0,1, 2 are nothing but trip points so the set_cur_state should be called like
set_cur_state(cdev, 0)
set_cur_state(cdev, 1)
set_cur_state(cdev, 2) when the trip point threshold are crossed.

Yeah I agree that implementation logic looks complex but this to prevent the lower temp trip points cooling handlers to be called. I will surely make this better in next version.


diff --git a/drivers/thermal/thermal_sys.c b/drivers/thermal/thermal_sys.c
index 220ce7e..817f2ba 100644
--- a/drivers/thermal/thermal_sys.c
+++ b/drivers/thermal/thermal_sys.c
<at> <at> -50,6 +50,7 <at> <at> struct thermal_cooling_device_instance {
       char attr_name[THERMAL_NAME_LENGTH];
       struct device_attribute attr;
       struct list_head node;
+       unsigned long private;
 };

 static DEFINE_IDR(thermal_tz_idr);
<at> <at> -909,7 +910,8 <at> <at> static struct class thermal_class = {
 * <at> ops:               standard thermal cooling devices callbacks.
 */
 struct thermal_cooling_device *thermal_cooling_device_register(
-     char *type, void *devdata, const struct thermal_cooling_device_ops *ops)
+       char *type, void *devdata, const struct thermal_cooling_device_ops *ops,
+       unsigned long private)
 {
       struct thermal_cooling_device *cdev;
       struct thermal_zone_device *pos;
<at> <at> -936,6 +938,7 <at> <at> struct thermal_cooling_device *thermal_cooling_device_register(
       cdev->ops = ops;
       cdev->device.class = &thermal_class;
       cdev->devdata = devdata;
+       cdev->private = private;
       dev_set_name(&cdev->device, "cooling_device%d", cdev->id);
       result = device_register(&cdev->device);
       if (result) {
<at> <at> -1079,11 +1082,14 <at> <at> void thermal_zone_device_update(struct thermal_zone_device *tz)
                                       continue;

                               cdev = instance->cdev;
-
-                               if (temp >= trip_temp)
-                                       cdev->ops->set_cur_state(cdev, 1);
-                               else
-                                       cdev->ops->set_cur_state(cdev, 0);
+                               if (cdev->private) {
+                                       cdev->ops->set_cur_state(cdev, cdev->private);
+                               } else {
+                                       if (temp >= trip_temp)
+                                               cdev->ops->set_cur_state(cdev, 1);
+                                       else
+                                               cdev->ops->set_cur_state(cdev, 0);
+                               }
                       }
                       break;
               case THERMAL_TRIP_PASSIVE:
diff --git a/include/linux/thermal.h b/include/linux/thermal.h
index 796f1ff..04aac09 100644
--- a/include/linux/thermal.h
+++ b/include/linux/thermal.h
<at> <at> -148,7 +148,7 <at> <at> int thermal_zone_unbind_cooling_device(struct thermal_zone_device *, int,
                                      struct thermal_cooling_device *);
 void thermal_zone_device_update(struct thermal_zone_device *);
 struct thermal_cooling_device *thermal_cooling_device_register(char *, void *,
-               const struct thermal_cooling_device_ops *);
+                    const struct thermal_cooling_device_ops *, unsigned long private);
 void thermal_cooling_device_unregister(struct thermal_cooling_device *);

 #ifdef CONFIG_NET


--
Matthew Garrett | mjg59 <at> srcf.ucam.org

Amit Kachhap | 2 Feb 08:16
Favicon

Re: [RFC PATCH 1/2] thermal: Add a new trip type to use cooling device instance number

On 1 February 2012 20:19, Matthew Garrett <mjg <at> redhat.com> wrote:
>
> I'm not really a fan of this as it stands - the name isn't very
> intuitive and the code's pretty difficult to read. Would the following
> (incomplete and obviously untested) not have the effect you want? Then
> you register multiple trip points with the same cooling device but
> different private values, and the state set does whatever you want it
> to. Or am I misunderstanding the problem you're trying to solve?

Thanks for the detailed review of the patch. Actually i tried to merge
the benefits of trip type ACTIVE and PASSIVE into one so this name.
This new trip type is just like ACTIVE but instead of OFF(0)/ON(1)
all values greater then 0 is on. Anyways I looked at your
implementation below but this will not solve the purpose as
thermal_cooling_device_register() need to be be called only once to
register a cooling device such cpu frequency based. However the same
cooling device may be binded many times to different trips.
Say,
1) thermal_zone_bind_cooling_device(tz_dev, 0, cdev);
2) thermal_zone_bind_cooling_device(tz_dev, 1, cdev);
3) thermal_zone_bind_cooling_device(tz_dev, 2, cdev);

0,1, 2 are nothing but trip points so the set_cur_state should be called like
set_cur_state(cdev, 0)
set_cur_state(cdev, 1)
set_cur_state(cdev, 2) when the trip point threshold are crossed.

Yeah I agree that implementation logic looks complex but this to
prevent the lower temp trip points cooling handlers to be called. I
will surely make this better in next version.

>
> diff --git a/drivers/thermal/thermal_sys.c b/drivers/thermal/thermal_sys.c
> index 220ce7e..817f2ba 100644
> --- a/drivers/thermal/thermal_sys.c
> +++ b/drivers/thermal/thermal_sys.c
> @@ -50,6 +50,7 @@ struct thermal_cooling_device_instance {
>        char attr_name[THERMAL_NAME_LENGTH];
>        struct device_attribute attr;
>        struct list_head node;
> +       unsigned long private;
>  };
>
>  static DEFINE_IDR(thermal_tz_idr);
> @@ -909,7 +910,8 @@ static struct class thermal_class = {
>  * @ops:               standard thermal cooling devices callbacks.
>  */
>  struct thermal_cooling_device *thermal_cooling_device_register(
> -     char *type, void *devdata, const struct thermal_cooling_device_ops *ops)
> +       char *type, void *devdata, const struct thermal_cooling_device_ops *ops,
> +       unsigned long private)
>  {
>        struct thermal_cooling_device *cdev;
>        struct thermal_zone_device *pos;
> @@ -936,6 +938,7 @@ struct thermal_cooling_device *thermal_cooling_device_register(
>        cdev->ops = ops;
>        cdev->device.class = &thermal_class;
>        cdev->devdata = devdata;
> +       cdev->private = private;
>        dev_set_name(&cdev->device, "cooling_device%d", cdev->id);
>        result = device_register(&cdev->device);
>        if (result) {
> @@ -1079,11 +1082,14 @@ void thermal_zone_device_update(struct thermal_zone_device *tz)
>                                        continue;
>
>                                cdev = instance->cdev;
> -
> -                               if (temp >= trip_temp)
> -                                       cdev->ops->set_cur_state(cdev, 1);
> -                               else
> -                                       cdev->ops->set_cur_state(cdev, 0);
> +                               if (cdev->private) {
> +                                       cdev->ops->set_cur_state(cdev, cdev->private);
> +                               } else {
> +                                       if (temp >= trip_temp)
> +                                               cdev->ops->set_cur_state(cdev, 1);
> +                                       else
> +                                               cdev->ops->set_cur_state(cdev, 0);
> +                               }
>                        }
>                        break;
>                case THERMAL_TRIP_PASSIVE:
> diff --git a/include/linux/thermal.h b/include/linux/thermal.h
> index 796f1ff..04aac09 100644
> --- a/include/linux/thermal.h
> +++ b/include/linux/thermal.h
> @@ -148,7 +148,7 @@ int thermal_zone_unbind_cooling_device(struct thermal_zone_device *, int,
>                                       struct thermal_cooling_device *);
>  void thermal_zone_device_update(struct thermal_zone_device *);
>  struct thermal_cooling_device *thermal_cooling_device_register(char *, void *,
> -               const struct thermal_cooling_device_ops *);
> +                    const struct thermal_cooling_device_ops *, unsigned long private);
>  void thermal_cooling_device_unregister(struct thermal_cooling_device *);
>
>  #ifdef CONFIG_NET
>
>
> --
> Matthew Garrett | mjg59 <at> srcf.ucam.org
Amit Kachhap | 2 Feb 10:16
Favicon

Re: [RFC PATCH 2/2] thermal: Add generic cpu cooling implementation

On 1 February 2012 20:27, Matthew Garrett <mjg <at> redhat.com> wrote:
> On Tue, Dec 13, 2011 at 08:43:16PM +0530, Amit Daniel Kachhap wrote:
>> This patch adds support for generic cpu thermal cooling low level
>> implementations using frequency scaling and cpuhotplugg currently.
>
> We've been over this kind of thing in the past. cpu hotplug is a
> relatively expensive operation, so people have previously been
> enthusiastic about using the scheduler to simply avoid running anything
> on CPUs if they're overheating. Has any general consensus been reached
> on this?
yes you are right that cpuhotplug is an expensive process and it may
further heat up the system before turning off so the ideal way would
be to reduce the capacity of the cpu gradually. Anyway these patches
are only exporting those API's and the actual use of them depends on
the user. Although my bigger focus is on cpufreq as cooling devices so
I might remove cpuhotplug in the next version.
>
> I'm also not entirely thrilled at now having two ways to manage the cpu
> through the thermal layer. ACPI already plugs in via the passive trip
> points. If we're going to do this then I'd like to see the ACPI code
> merged in with the generic cpu cooling code.
Yeah I also agree that there is a kind of repetition and not entirely
sure where to place these codes. I will try adding them inside acpi.
Thanks for the suggestion.
>
> --
> Matthew Garrett | mjg59 <at> srcf.ucam.org
Colin Cross | 3 Feb 02:19
Favicon

Re: [PATCH 0/3] coupled cpuidle state support

On Wed, Feb 1, 2012 at 10:07 AM, Lorenzo Pieralisi
<lorenzo.pieralisi <at> arm.com> wrote:
> On Wed, Feb 01, 2012 at 05:30:15PM +0000, Colin Cross wrote:
>> On Wed, Feb 1, 2012 at 6:59 AM, Lorenzo Pieralisi
>> <lorenzo.pieralisi <at> arm.com> wrote:
>> > On Wed, Feb 01, 2012 at 12:13:26PM +0000, Vincent Guittot wrote:
>> >
>> > [...]
>> >
>> >> >> In your patch, you put in safe state (WFI for most of platform) the
>> >> >> cpus that become idle and these cpus are woken up each time a new cpu
>> >> >> of the cluster becomes idle. Then, the cluster state is chosen and the
>> >> >> cpus enter the selected C-state. On ux500, we are using another
>> >> >> behavior for synchronizing  the cpus. The cpus are prepared to enter
>> >> >> the c-state that has been chosen by the governor and the last cpu,
>> >> >> that enters idle, chooses the final cluster state (according to cpus'
>> >> >> C-state). The main advantage of this solution is that you don't need
>> >> >> to wake other cpus to enter the C-state of a cluster. This can be
>> >> >> quite worth full when tasks mainly run on one cpu. Have you also think
>> >> >> about such behavior when developing the coupled cpuidle driver ? It
>> >> >> could be interesting to add such behavior.
>> >> >
>> >> > Waking up the cpus that are in the safe state is not done just to
>> >> > choose the target state, it's done to allow the cpus to take
>> >> > themselves to the target low power state.  On ux500, are you saying
>> >> > you take the cpus directly from the safe state to a lower power state
>> >> > without ever going back to the active state?  I once implemented Tegra
>> >>
>> >> yes it is
>> >
>> > But if there is a single power rail for the entire cluster, when a CPU
>> > is "prepared" for shutdown this means that you have to save the context and
>> > clean L1, maybe for nothing since if other CPUs are up and running the
>> > CPU going idle can just enter a simple standby wfi (clock-gated but power on).
>> >
>> > With Colin's approach, context is saved and L1 cleaned only when it is
>> > almost certain the cluster is powered off (so the CPUs).
>> >
>> > It is a trade-off, I am not saying one approach is better than the
>> > other; we just have to make sure that preparing the CPU for "possible" shutdown
>> > is better than sending IPIs to take CPUs out of wfi and synchronize
>> > them (this happens if and only if CPUs enter coupled C-states).
>> >
>> > As usual this will depend on use cases (and silicon implementations :) )
>> >
>> > It is definitely worth benchmarking them.
>> >
>>
>> I'm less worried about performance, and more worried about race
>> conditions.  How do you deal with the following situation:
>> CPU0 goes to WFI, and saves its state
>> CPU1 goes idle, and selects a deep idle state that powers down CPU0
>> CPU1 saves is state, and is about to trigger the power down
>> CPU0 gets an interrupt, restores its state, and modifies state (maybe
>> takes a spinlock during boot)
>> CPU1 cuts the power to CPU0
>>
>> On OMAP4, the race is handled in hardware.  When CPU1 tries to cut the
>> power to the blocks shared by CPU0 the hardware will ignore the
>> request if CPU0 is not in WFI.  On Tegra2, there is no hardware
>> support and I had to handle it with a spinlock implemented in scratch
>> registers because CPU0 is out of coherency when it starts booting and
>> ldrex/strex don't work.  I'm not convinced my implementation is
>> correct, and I'd be curious to see any other implementations.
>
> That's a problem you solved with coupled C-states (ie your example in
> the cover letter), where the primary waits for other CPUs to be reset
> before issuing the power down command, right ? At that point in time
> secondaries cannot wake up (?) and if wfi (ie power down) aborts you just
> take the secondaries out of reset and restart executing simultaneously,
> correct ? It mirrors the suspend behaviour, which is easier to deal with
> than completely random idle paths.

Yes, anything that supports hotplug and suspend should support coupled
cpuidle states fairly easily.  The only thing required that is not
already used by hotplug/suspend is the ability to save and restore
context on cpu1, but most implementations end up doing that already.

> It is true that this should be managed by the PM HW; if HW is not
> capable of managing these situations things get nasty as you highlighted.

Yes - on some platforms, the HW is not designed to handle it.  On
others, it is designed to, but due to HW bugs it cannot be used.

> And it is also true ldrex/strex on cacheable memory might not be available in
> those early warm-boot stages. I came up with a locking algorithm on
> strongly ordered memory to deal with that, but I am still not sure it is
> something we really really need.

I did the same, but with device memory.

> I will test coupled C-state code ASAP, and come back with feedback.
>
> Thanks,
> Lorenzo
>
Nigel Cunningham | 3 Feb 09:59
Favicon

[Patch] JBD and JBD2 missing set_freezable()

Hi all.

With the latest and greatest changes to the freezer, I started seeing
panics that were caused by jbd2 running post-process freezing and
hitting the canary BUG_ON for non-TuxOnIce I/O submission. I've traced
this back to a lack of set_freezable calls in both jbd and jbd2. Since
they're clearly meant to be frozen (there are tests for freezing()), I
submit the following patch to add the missing calls.

Signed-off-by: Nigel Cunningham <nigel <at> tuxonice.net>

diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
index 59c09f9..89cd985 100644
--- a/fs/jbd/journal.c
+++ b/fs/jbd/journal.c
@@ -129,6 +129,8 @@ static int kjournald(void *arg)
 	setup_timer(&journal->j_commit_timer, commit_timeout,
 			(unsigned long)current);

+	set_freezable();
+
 	/* Record that the journal thread is running */
 	journal->j_task = current;
 	wake_up(&journal->j_wait_done_commit);
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index c0a5f9f..663e47c 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -139,6 +139,8 @@ static int kjournald2(void *arg)
 	setup_timer(&journal->j_commit_timer, commit_timeout,
 			(unsigned long)current);

+	set_freezable();
+
 	/* Record that the journal thread is running */
 	journal->j_task = current;
 	wake_up(&journal->j_wait_done_commit);
--

-- 
Evolution (n): A hypothetical process whereby improbable
events occur with alarming frequency, order arises from chaos, and
no one is given credit.
Pihet-XID, Jean | 3 Feb 15:04
Picon
Favicon

Re: [PATCH] CPU C-state breakage with PM Qos change

Looping in linux-pm

On Fri, Feb 3, 2012 at 1:14 AM, Venkatesh Pallipadi <venki <at> google.com> wrote:
> Looks like change "PM QoS: Move and rename the implementation files"
> made pm_qos depend on CONFIG_PM which depends on
> PM_SLEEP || PM_RUNTIME
>
> That breaks CPU C-states with kernels not having these CONFIGs, causing CPUs
> to spend time in Polling loop idle instead of going into deep C-states,
> consuming way way more power. This is with either acpi idle or intel idle
> enabled.
>
> Either CONFIG_PM should be enabled with any pm_qos users or
> the !CONFIG_PM pm_qos_request() should return sane defaults not to break
> the existing users. Here's is the patch for the latter option.
I think the real question is whether PM QoS should be functional in
all cases (as is ACPI) or whether only if certain options are set
(CONFIG_PM).
In the current code if CONFIG_PM is not enabled, a dummy PM QoS API is
provided as function stubs in order for the build to succeed.

Rafael, Mark,
What do you think? Should PM QoS be enabled in all cases? Are there
any known dependencies with CONFIG_PM?

Regards,
Jean
Rafael J. Wysocki | 3 Feb 21:02
Picon
Gravatar

Re: [PATCH] CPU C-state breakage with PM Qos change

On Friday, February 03, 2012, Pihet-XID, Jean wrote:
> Looping in linux-pm
> 
> On Fri, Feb 3, 2012 at 1:14 AM, Venkatesh Pallipadi <venki <at> google.com> wrote:
> > Looks like change "PM QoS: Move and rename the implementation files"
> > made pm_qos depend on CONFIG_PM which depends on
> > PM_SLEEP || PM_RUNTIME
> >
> > That breaks CPU C-states with kernels not having these CONFIGs, causing CPUs
> > to spend time in Polling loop idle instead of going into deep C-states,
> > consuming way way more power. This is with either acpi idle or intel idle
> > enabled.
> >
> > Either CONFIG_PM should be enabled with any pm_qos users or
> > the !CONFIG_PM pm_qos_request() should return sane defaults not to break
> > the existing users. Here's is the patch for the latter option.
> I think the real question is whether PM QoS should be functional in
> all cases (as is ACPI) or whether only if certain options are set
> (CONFIG_PM).
> In the current code if CONFIG_PM is not enabled, a dummy PM QoS API is
> provided as function stubs in order for the build to succeed.
> 
> Rafael, Mark,
> What do you think? Should PM QoS be enabled in all cases? Are there
> any known dependencies with CONFIG_PM?

At least we should keep the current behavior to avoid breaking things
for now.  We can change that in the next cycle, however, if everyone
agrees, but more carefully.

The patch has been applied to linux-pm/linux-next and will be pushed to Linus
early next week.

Thanks,
Rafael

Gmane