我有一个具有以下架构的配置单元表:
COOKIE | PRODUCT_ID | CAT_ID | QTY 1234123 [1,2,3] [r,t,null] [2,1,null]
我如何规范化数组,以便得到以下结果
COOKIE | PRODUCT_ID | CAT_ID | QTY 1234123 [1] [r] [2] 1234123 [2] [t] [1] 1234123 [3] null null
我尝试过以下方法:
select concat_ws('|',visid_high,visid_low) as cookie ,pid ,catid ,qty from table lateral view explode(productid) ptable as pid lateral view explode(catalogId) ptable2 as catid lateral view explode(qty) ptable3 as qty
然而结果是笛卡尔积.
我没有使用任何UDF就找到了解决这个问题的非常好的解决方案, posexplode是一个非常好的解决方案:
SELECT COOKIE , ePRODUCT_ID, eCAT_ID, eQTY FROM TABLE LATERAL VIEW posexplode(PRODUCT_ID) ePRODUCT_IDAS seqp, ePRODUCT_ID LATERAL VIEW posexplode(CAT_ID) eCAT_ID AS seqc, eCAT_ID LATERAL VIEW posexplode(QTY) eQTY AS seqq, eDateReported WHERE seqp = seqc AND seqc = seqq;
您可以使用Brickhouse中的numeric_range
和array_index
UDF(http://github.com/klout/brickhouse)来解决此问题.有一篇内容丰富的博客文章详细介绍了http://brickhouseconfessions.wordpress.com/2013/03/07/exploding-multiple-arrays-at-the-same-time-with-numeric_range/
使用这些UDF,查询将是这样的
select cookie, array_index( product_id_arr, n ) as product_id, array_index( catalog_id_arr, n ) as catalog_id, array_index( qty_id_arr, n ) as qty from table lateral view numeric_range( size( product_id_arr )) n1 as n;