小编典典

Hive-有没有办法进一步优化HiveQL查询?

sql

我写了一个查询来查找3月至4月美国10个最繁忙的机场。它产生所需的输出,但是我想尝试进一步优化它。

是否有任何适用于查询的HiveQL特定优化? GROUPING SETS适用在这里吗?我是Hive的新手,现在这是我提出的最短的查询。

SELECT airports.airport, COUNT(Flights.FlightsNum) AS Total_Flights
FROM (
SELECT Origin AS Airport, FlightsNum 
  FROM flights_stats
  WHERE (Cancelled = 0 AND Month IN (3,4))
UNION ALL
SELECT Dest AS Airport, FlightsNum 
  FROM flights_stats
  WHERE (Cancelled = 0 AND Month IN (3,4))
) Flights
INNER JOIN airports ON (Flights.Airport = airports.iata AND airports.country = 'USA')
GROUP BY airports.airport
ORDER BY Total_Flights DESC
LIMIT 10;

表列如下:

飞机场

|iata|airport|city|state|country|

Flights_stats

|originAirport|destAirport|FlightsNum|Cancelled|Month|

阅读 172

收藏
2021-04-07

共1个答案

小编典典

按机场(内部联接)过滤,并在UNION ALL之前进行聚合,以减少传递到最终聚合简化程序的数据集。具有UNION ALL的UNION
ALL子查询应该比UNION ALL之后具有更大数据集的Join并行且运行速度更快。

SELECT f.airport, SUM(cnt) AS Total_Flights
FROM (
      SELECT a.airport, COUNT(*) as cnt 
       FROM flights_stats f
            INNER JOIN airports a ON f.Origin=a.iata AND a.country='USA'
       WHERE Cancelled = 0 AND Month IN (3,4)
       GROUP BY a.airport
       UNION ALL
      SELECT a.airport, COUNT(*) as cnt
       FROM flights_stats f
            INNER JOIN airports a ON f.Dest=a.iata AND a.country='USA'
       WHERE Cancelled = 0 AND Month IN (3,4)
       GROUP BY a.airport
     ) f 
GROUP BY f.airport
ORDER BY Total_Flights DESC
LIMIT 10
;

调整mapjoin并启用并行执行:

set hive.exec.parallel=true;
set hive.auto.convert.join=true; --this enables map-join
set hive.mapjoin.smalltable.filesize=25000000; --size of table to fit in memory
2021-04-07