问题描述
我有两个表 A
和 B
,其中 B
是巨大的(2000万乘300), A
的大小适中(300k乘10)。 A
包含一列是地址, B
包含3列,可以组合在一起形成适当的街道地址。例如,在 A
中,地址栏可以是:
id |地址
-----------
233 | 123 Main St
和 B
我们可以有:
Number | Street_name | Street_suffix |税
--------------------------------------------- ---
123 |主| |街道| 320.2
我想使用类似于 LIKE
,如下所示:
选择A.id,B.从A $ b中扣除
$ b在A.Address上加入B
**包含** B.Number
和A.Address **包含** B.Street_name;
基本上我试图匹配记录,说如果 2包含的部分是我不确定如何实现的。任何想法?我在Cloudera的Hue Hadoop发行版中,我可以访问Hive(1.1.0,不幸的是因为1.2.0具有Levenshtein距离函数),Impala(您可以只使用join的等式条件, 但 - 演示 I have two tables and in I want to join them using string matching similar to Basically I am trying to match the records by saying that if The 2 contains part is something I am not sure how to implement. Any ideas? I am on Cloudera's Hue Hadoop distribution, where I have access to Hive (1.1.0, unfortunately because 1.2.0 has Levenshtein distance function), Impala (v2.3.0) and Pig (0.12.0-cdh5.5.0). You can use join with equality conditions only, but - Demo 这篇关于通过Hive或Impala或Pig中的字符串匹配连接表的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持! A
的地址包含
B
的号码和street_name,然后我说它们是相同的地址(实际上,我也有城市,州和邮政编码但是我选择忽略这些用于说明的目的)。
您可以交叉加入和过滤。
选择A.id,B.Tax
从交叉连接B
其中concat('',A.Address,'')像concat(' ('%',B.Street_name,'%')
;
hive>创建表A(id int,Address string);
确定
配置单元>创建表B(数字int,Street_name字符串,Street_suffix字符串,税小数(12,2));
确定
配置单元>插入A值(233,'123 Main St');
查询ID = ...
OK
hive>插入B值(123,'Main','Street',320.2);
查询ID = ...
OK
hive>选择A.id,B.Tax
>从A交叉加入B
> concat('',A.Address,'')像concat('%',cast(B.Number as string),'%')
>和concat('',A.Address,'')像concat('%',B.Street_name,'%')
> ;
警告:在任务'Stage-3:MAPRED'中加入MAPJOIN [8] [bigTable = b]是一个交叉产品
查询ID = ...
OK
233 320.2
hive>
A
and B
, where B
is huge (20 million by 300) and A
is of moderate size (300k by 10). A
contains one column that is address and B
contains 3 columns that can be put together to form a proper street address. For example, in A
, the address column could be:id | Address
-----------
233 | 123 Main St
B
we could have:Number | Street_name | Street_suffix | Tax
------------------------------------------------
123 | Main | Street | 320.2
LIKE
with something like the following:select A.id, B.Tax
from A
left join B
on A.Address **contains** B.Number
and A.Address **contains** B.Street_name;
A
's address contains B
's number and street_name, then I say they are the same address (In reality, I also have city, state and zip code. But I chose to ignore those for illustration purposes).
You can cross join and filter.select A.id, B.Tax
from A cross join B
where concat(' ',A.Address,' ') like concat('% ',cast(B.Number as string),' %')
and concat(' ',A.Address,' ') like concat('% ',B.Street_name,' %')
;
hive> create table A (id int,Address string);
OK
hive> create table B (number int,Street_name string,Street_suffix string,tax decimal(12,2));
OK
hive> insert into A values (233,'123 Main St');
Query ID = ...
OK
hive> insert into B values (123,'Main','Street',320.2);
Query ID = ...
OK
hive> select A.id, B.Tax
> from A cross join B
> where concat(' ',A.Address,' ') like concat('% ',cast(B.Number as string),' %')
> and concat(' ',A.Address,' ') like concat('% ',B.Street_name,' %')
> ;
Warning: Map Join MAPJOIN[8][bigTable=b] in task 'Stage-3:MAPRED' is a cross product
Query ID = ...
OK
233 320.2
hive>