通过Hive或Impala或Pig中的字符串匹配连接表

本文介绍了通过Hive或Impala或Pig中的字符串匹配连接表的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有两个表 A 和 B ，其中 B 是巨大的（2000万乘300）， A 的大小适中（300k乘10）。 A 包含一列是地址， B 包含3列，可以组合在一起形成适当的街道地址。例如，在 A 中，地址栏可以是：

  id |地址
 ----------- 
 233 | 123 Main St

和 B 我们可以有：

  Number | Street_name | Street_suffix |税
 --------------------------------------------- --- 
 123 |主| |街道| 320.2

我想使用类似于 LIKE ，如下所示：

 选择A.id，B.从A $ b中扣除
 $ b在A.Address上加入B 
 **包含** B.Number 
和A.Address **包含** B.Street_name;

基本上我试图匹配记录，说如果A 的地址包含 B 的号码和street_name，然后我说它们是相同的地址（实际上，我也有城市，州和邮政编码但是我选择忽略这些用于说明的目的）。

2包含的部分是我不确定如何实现的。任何想法？我在Cloudera的Hue Hadoop发行版中，我可以访问Hive（1.1.0，不幸的是因为1.2.0具有Levenshtein距离函数），Impala（您可以只使用join的等式条件，但 - 您可以交叉加入和过滤。

 选择A.id，B.Tax 
从交叉连接B 
其中concat（''，A.Address，''）像concat（' （'％'，B.Street_name，'％'）
;

 
 
 演示 
 
 
  hive>创建表A（id int，Address string）; 
确定
配置单元>创建表B（数字int，Street_name字符串，Street_suffix字符串，税小数（12,2））; 
确定
配置单元>插入A值（233，'123 Main St'）; 
查询ID = ... 
 OK 
 hive>插入B值（123，'Main'，'Street'，320.2）; 
查询ID = ... 
 OK 
 hive>选择A.id，B.Tax 
>从A交叉加入B 
> concat（''，A.Address，''）像concat（'％'，cast（B.Number as string），'％'）
>和concat（''，A.Address，''）像concat（'％'，B.Street_name，'％'）
> ; 
警告：在任务'Stage-3：MAPRED'中加入MAPJOIN [8] [bigTable = b]是一个交叉产品
查询ID = ... 
 OK 
 233 320.2 
 hive> 
  
 
I have two tables A and B, where B is huge (20 million by 300) and A is of moderate size (300k by 10). A contains one column that is address and B contains 3 columns that can be put together to form a proper street address. For example, in A, the address column could be:
id  | Address
-----------
233 | 123 Main St
and in B we could have:
Number  |  Street_name  | Street_suffix | Tax
------------------------------------------------
123     |  Main         | Street        | 320.2
I want to join them using string matching similar to LIKE with something like the following:
select A.id, B.Tax
from A
  left join B
    on A.Address **contains** B.Number
    and A.Address **contains** B.Street_name;
Basically I am trying to match the records by saying that if A's address contains B's number and street_name, then I say they are the same address (In reality, I also have city, state and zip code. But I chose to ignore those for illustration purposes).
The 2 contains part is something I am not sure how to implement. Any ideas?
I am on Cloudera's Hue Hadoop distribution, where I have access to Hive (1.1.0, unfortunately because 1.2.0 has Levenshtein distance function), Impala (v2.3.0) and Pig (0.12.0-cdh5.5.0).
 解决方案 
You can use join with equality conditions only, but -
You can cross join and filter.
select      A.id, B.Tax
from        A cross join B
where       concat(' ',A.Address,' ') like concat('% ',cast(B.Number as string),' %')
        and concat(' ',A.Address,' ') like concat('% ',B.Street_name,' %')
;
Demo
hive> create table A (id int,Address string);
OK
hive> create table B (number int,Street_name string,Street_suffix string,tax decimal(12,2));
OK
hive> insert into A values (233,'123 Main St');
Query ID = ...
OK
hive> insert into B values (123,'Main','Street',320.2);
Query ID = ...
OK
hive> select      A.id, B.Tax
    > from        A cross join B
    > where       concat(' ',A.Address,' ') like concat('% ',cast(B.Number as string),' %')
    >         and concat(' ',A.Address,' ') like concat('% ',B.Street_name,' %')
    > ;
Warning: Map Join MAPJOIN[8][bigTable=b] in task 'Stage-3:MAPRED' is a cross product
Query ID = ...
OK
233 320.2
hive>
                        
这篇关于通过Hive或Impala或Pig中的字符串匹配连接表的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持！