Monday 20 November 2017

DRILL: Factors determining the number of minor fragments spawned for a MapRDB table scan


AIM:

Discuss the different factors that determine the number of minor fragments spawned for a Drill MapRDB table scan.

ENVIRONMENT DETAILS:

  
MapR Version: 5.2.1
Drill Version: 1.10

3-node cluster (VMs), each node with 20 GB memory and 8 vcores.

Created a MapRDB table ‘/tmp/mdbtb2’ and loaded some random data into it.
The total number of rows in the table is 1,000,003.

Details of MapRDB table:

[mapr@hostnameA root]$ maprcli table region list -path '/tmp/mdbtb2' -json
{
        "timestamp":1510279617490,
        "timeofday":"2017-11-09 09:06:57.490 GMT-0500",
        "status":"OK",
        "total":4,
        "data":[
                {
                        "primarymfs":"hostnameB:5660",
                        "secondarymfs":"hostnameA:5660, hostnameC:5660",
                        "startkey":"-INFINITY",
                        "endkey":"user3020912552751108539",
                        "lastheartbeat":0,
                        "fid":"2176.50.1967492",
                        "logicalsize":270729216,
                        "physicalsize":173088768,
                        "copypendingsize":0,
                        "numberofrows":243531,
                        "numberofrowswithdelete":0,
                        "numberofspills":77,
                        "numberofsegments":77
                },
                {
                        "primarymfs":"hostnameB:5660",
                        "secondarymfs":"hostnameA:5660, hostnameC:5660",
                        "startkey":"user3020912552751108539",
                        "endkey":"user5074931419976666452",
                        "lastheartbeat":0,
                        "fid":"2176.337.1968048",
                        "logicalsize":275193856,
                        "physicalsize":176119808,
                        "copypendingsize":0,
                        "numberofrows":247945,
                        "numberofrowswithdelete":0,
                        "numberofspills":76,
                        "numberofsegments":76
                },
                {
                        "primarymfs":"hostnameB:5660",
                        "secondarymfs":"hostnameA:5660, hostnameC:5660",
                        "startkey":"user5074931419976666452",
                        "endkey":"user7176543975612180375",
                        "lastheartbeat":0,
                        "fid":"2817.1600.1055958",
                        "logicalsize":280805376,
                        "physicalsize":179748864,
                        "copypendingsize":0,
                        "numberofrows":252843,
                        "numberofrowswithdelete":0,
                        "numberofspills":80,
                        "numberofsegments":80
                },
                {
                        "primarymfs":"hostnameB:5660",
                        "secondarymfs":"hostnameA:5660, hostnameC:5660",
                        "startkey":"user7176543975612180375",
                        "endkey":"INFINITY",
                        "lastheartbeat":0,
                        "fid":"2817.33.1053610",
                        "logicalsize":284286976,
                        "physicalsize":181764096,
                        "copypendingsize":0,
                        "numberofrows":255684,
                        "numberofrowswithdelete":0,
                        "numberofspills":82,
                        "numberofsegments":80
                }
        ]
}

Hence, the MapRDB table has 4 tablets.
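As a quick sanity check (a trivial sketch; the rowcounts are copied from the `numberofrows` fields in the JSON above), the per-tablet rowcounts sum to the stated table total:

```python
# numberofrows values for the 4 tablets, from the region listing above
tablet_rows = [243531, 247945, 252843, 255684]

print(sum(tablet_rows))  # 1000003 -- matches the stated total row count
```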

VARIOUS TESTS:

Query used for all the tests: select count(*) from dfs.`tmp`.`mdbtb2`


Test 1:
 
DRILL Property                 Value
planner.slice_target           100000
planner.width.max_per_query    1000
planner.width.max_per_node     100

Number of minor fragments spawned for major fragment 1 = 4.
Observation: The number is capped by the number of tablets in the MapRDB table. The rowcount divided by the slice_target would allow more fragments, but with 4 tablets at most 4 minor fragments are created.

Test 2:

 
DRILL Property                 Value
planner.slice_target           1000100
planner.width.max_per_query    1000
planner.width.max_per_node     100

Number of minor fragments spawned for major fragment 0 = 1.
Observation: The number depends on the slice_target. The slice_target is set to 1000100, which is higher than the total of 1,000,003 records in the MapRDB table. Hence, a single minor fragment reads data from all four tablets. This can cause remote reads.
 
Test 3:

 
DRILL Property                 Value
planner.slice_target           500000
planner.width.max_per_query    1000
planner.width.max_per_node     100

Number of minor fragments spawned for major fragment 1 = 2.
Observation: The number depends on the slice_target. The slice_target is set to 500000 and the MapRDB table contains 1,000,003 records, so two minor fragments read the four tablets (two tablets each). This can also cause remote reads.

SUMMARY:

  • The number of minor fragments spawned for a MapRDB scan depends on two factors: the number of tablets in the MapRDB table and `planner.slice_target`.
  • Only one minor fragment reads data from a given tablet; there is no case where multiple minor fragments scan data from the same tablet at the same time.
  • If the total estimated rowcount from MapRDB is less than the `planner.slice_target` value, then only one minor fragment is used even if the table contains multiple tablets.
  • More generally, a minor fragment can read multiple tablets as long as its estimated row count stays below the configured `planner.slice_target`.
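The observations above can be condensed into a small sketch of the planning-time width calculation. This is a reconstruction inferred from the three tests, not Drill's actual source; in particular the floor division and the order of the caps are assumptions that happen to match the observed fragment counts:

```python
def planned_scan_width(est_rows, slice_target, num_tablets,
                       max_per_query=1000, max_per_node=100, num_nodes=3):
    """Reconstructed planning-time width heuristic -- a sketch inferred
    from the three tests above, NOT Drill's actual source code."""
    # Width suggested by cost: the tests are consistent with floor
    # division (1,000,003 rows / 500,000 slice_target gave 2 fragments,
    # not 3 as a ceiling would suggest).
    width = max(1, est_rows // slice_target)
    # A tablet is read by at most one minor fragment, so the tablet
    # count is a hard upper bound.
    width = min(width, num_tablets)
    # The global and per-node width caps still apply.
    return min(width, max_per_query, max_per_node * num_nodes)

rows = 1_000_003
print(planned_scan_width(rows, 100_000, 4))    # Test 1 -> 4
print(planned_scan_width(rows, 1_000_100, 4))  # Test 2 -> 1
print(planned_scan_width(rows, 500_000, 4))    # Test 3 -> 2
```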

NOTE

In Test 3, `planner.slice_target` is 500000 and the table contains 1,000,003 records. From the query profile, the record counts read are:
01-00-xx reads 496,374 and 01-01-xx reads 503,629.
Here, 01-01-xx reads more records than the slice_target, and those records come from 2 different tablets. Why was that data read in a single fragment?

There can be a maximum of 4 minor fragments in this case, since there are 4 tablets. The slice target is not enforced at run time; it is used only at planning time to determine the degree of parallelism.
So, at run time it is possible that the total rowcount assigned to a fragment exceeds the slice target, but that is fine: once a minor fragment has been assigned to read data from the 2 tablets, no additional minor fragments are created on the fly. It is all decided up front at planning time.
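This can be illustrated with a small sketch of the tablet-to-fragment assignment. The contiguous split below is hypothetical, not Drill's actual assignment code, and it uses the per-tablet rowcounts from the region listing above, which is why the sums differ slightly from the 496,374/503,629 seen in the profile (the table may have shifted between the listing and the query):

```python
# Per-tablet rowcounts from the `maprcli table region list` output above.
tablet_rows = [243531, 247945, 252843, 255684]
width = 2  # minor fragments planned in Test 3 (slice_target = 500000)

# Hypothetical contiguous assignment: each tablet is read by exactly
# one minor fragment, and the assignment is fixed at planning time.
chunk = len(tablet_rows) // width
fragments = [tablet_rows[i * chunk:(i + 1) * chunk] for i in range(width)]

for i, rows in enumerate(fragments):
    print(f"fragment 01-0{i}: {sum(rows)} rows")
# fragment 01-00: 491476 rows
# fragment 01-01: 508527 rows
# The second fragment ends up above the 500000 slice_target, but no
# extra fragment is created at run time.
```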