Batch start CPU timeout – What influences start() post-processing?

We’re experiencing a very annoying issue with a Batch job processing “Large Data Volume” (1.5M records over a table containing 4.5M records). Our batch start method last statement is: return Database.getQueryLocator(soql); and despite the query completes successfully returning 1.5M rows, the batch doesn’t kick off and a CPU timeout error is reported.

I believe the CPU timeout is not to be attributed to any of our logic, being the query our last code statement and completing it successfully (look at the pseudo-code, debug log and screenshots of CPU timeline and Apex Jobs provided below). I have the strong suspect the issue is related to the “housekeeping” sfdcfox mentions in here – but that’s just a weak suspect as if that’d be the case I suppose I wouldn’t be the only one experiencing the issue.

We have optimized the SOQL so that only the Id field of targeted records is selected: the SOQL doesn’t contain any child/sub-query, and the records returned by the SOQL are as lightweight as they can be being only their ID field selected. This is a known factor that affects badly batch start() methods with CPU timeouts, but it’s not to be considered by us after optimization. I wonder then if there are others.

The question (and bounty) is out for a clear, documented, detailed answer to the question: what are factors that can affect/impact on the post-processing that SFDC does on the recordset which is returned by a batch start() method, other than the presence of child queries or parent records?

Pseudo code that gives an idea of my batch

global with sharing class APseudoBatch {

  private String soqlString = 'SELECT alias.Id FROM MyObject__c alias WHERE ...';

  global Database.QueryLocator start (Database.BatchableContext BC) {

    MyOtherObject__c[] prevBatchData = [SELECT Id FROM MyOtherObject__c WHERE ...];

    delete prevBatchData;

    return Database.getQueryLocator(soqlString);

  }

  (...)

}

Debug log

34.0 APEX_CODE,DEBUG;APEX_PROFILING,INFO;CALLOUT,INFO;DB,INFO;SYSTEM,DEBUG;VALIDATION,INFO;VISUALFORCE,INFO;WORKFLOW,INFO

05:12:19.152 (152753319)|SOQL_EXECUTE_BEGIN|[159]|Aggregations:0|SELECT id FROM AAA__c

05:12:19.152 (158857573)|SOQL_EXECUTE_END|[159]|Rows:0

05:12:19.152 (163217923)|SOQL_EXECUTE_BEGIN|[165]|Aggregations:0|SELECT id FROM BBB__c

05:12:19.152 (167389830)|SOQL_EXECUTE_END|[165]|Rows:1

05:12:19.152 (167994190)|DML_BEGIN|[405]|Op:Delete|Type:BBB__c|Rows:1

05:12:19.152 (401425184)|DML_END|[405]

05:12:19.152 (402249359)|SOQL_EXECUTE_BEGIN|[172]|Aggregations:0|SELECT id FROM CCC__c

05:12:19.152 (406737724)|SOQL_EXECUTE_END|[172]|Rows:4

05:12:19.152 (407218557)|DML_BEGIN|[405]|Op:Delete|Type:CCC__c|Rows:4

05:12:21.104 (2104679115)|DML_END|[405]

05:12:21.104 (2110622491)|DML_BEGIN|[409]|Op:Upsert|Type:DDD__c|Rows:4

05:12:21.104 (2183712913)|DML_END|[409]

05:12:21.104 (2192886624)|SOQL_EXECUTE_BEGIN|[188]|Aggregations:0|Select s.Id From LDV__c

05:13:25.773 (66773435973)|SOQL_EXECUTE_END|[188]|Rows:1517052

(Line 188 is the last line of my start() method – the debug log ends there, I’ve pasted it all as it is)

Screenshot of Apex Jobs page

Failed batch line from Apex Jobs page

The SOQL completes successfully at 05:13:25, the Batch Job is marked failed at 05:16 in the Apex Jobs page. Strange, isn’t?

Screenshot of Execution Timeline

timeline from dev console

LAST UPDATE

Changed the batch query removing all the WHERE criterias. The behavior of the CPU timeout changed – now it is happening 10 minutes after the SOQL failure (as the linked sfdcfox post suggests) instead of 2-3 minutes as per my previous debug logs. The table contains 3.5M records now.

Answer

Every time you ask for a data from salesforce, everytime you have to call the salesforce servers. Now each server call limit is 10,000 millisecond = 10 seconds, that means if server call is taking more than 10 seconds to return results your batch will immediately fail with status “CPU time exceeds”.Each server call can be querying data, Executing codes or others.

I feel, in your case your start method works fine as there are no failure during fetching data (that is why your SOQL completes at some time and finally after few minutes your batch is getting failed). Even If you know, you may find different debug logs for start method, each batch and finish method. Suppose you have 5 batches then you may get 7 debug logs for your batch class (1 for Start,5 for execute and 1 for finish). If you are getting your very first debug without error that means your start method is fine.

I feel, the CPU limit reach is hitting from execute method somewhere. As you specified that batch fails after few minutes of SOQL execution.

Possibility for failure:

  1. may be using heavy subquery (another query in your execute method)
  2. may be using for loop inside a for loop
  3. may be very lengthy code in execute method

Solution:

  1. You can use a subquery but make sure it is optimized and return data rapidly OR You can always make you class stateful and instead of writing a subquery, write the same query inside your start method and gather data in a map to be used further.
  2. try eliminating nested for loops as they are the main eaters of CPU time to process the whole execute code within 10 seconds. Make two different loops out of nested one, Use collections to store data results temporarily gained from first loop then feed it to another loop to get the outcome.
  3. If your logic is lengthy, go for splitting batch into multiple batches. Process data according to your half logic, store it in collections like map, place your another half logic in another batch, call the batch from the finish method with the temporary collection as input.

Attribution
Source : Link , Question Author : MLucci , Answer Author : Adrian Larson

Leave a Comment