I have to count the stars who work together, starting with a small IMDB sample and then scaling up.
I only need this for actors in movies, not for TV series.

-- Input: (actor, title, year, num, type, episode, billing, role)
raw = LOAD 'hdfs://cm:9000/uhadoop/shared/imdb/imdb-stars-example.tsv' USING PigStorage('\t') AS (actor, title, year, num, type, episode, billing, role);
-- Line 1: Filter raw to make sure type equals 'THEATRICAL_MOVIE'
movies = FILTER raw BY type == 'THEATRICAL_MOVIE';
-- Then I split out stars and co-stars: every row with billing equal to 1 is the movie's star, and every row with billing >= 2 is a co-star
c1 = FILTER movies BY billing == 1;
c2 = FILTER movies BY billing >= 2;
c3 = JOIN c1 BY title, c2 BY title;
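From here I need to count the pairs that appear together in movies most often, and this is where my brain gave out. I have tried many things, but I always get errors.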
actor_coactors_freq_movies = GROUP c3 BY actor;
actor_coactors_freq_movies_count = FOREACH actor_coactors_freq_movies GENERATE COUNT($1) AS count,
group AS actor_pair;
ordered_actor_pair_count = ORDER actor_movie_count BY count DESC;
Obviously I'm lost; I'm new to all this jazz.
Thanks for your help.

Best answer

-- Line 1: Filter raw to make sure type equals 'THEATRICAL_MOVIE'
movies = FILTER raw BY type == 'THEATRICAL_MOVIE';
-- Line 2: Generate new relations with the full movie name (title + '-' + year + '-' + num) and the actor
full_movies1 = FOREACH movies GENERATE CONCAT(title,'-',year,'-',num), actor;
full_movies2 = FOREACH movies GENERATE CONCAT(title,'-',year,'-',num), actor;
-- Line 3: Self-join on the full movie name so that every actor is paired with every actor in the same movie
coactor_movies = JOIN full_movies1 BY $0, full_movies2 BY $0;
DUMP coactor_movies;
coactor_movies2 = FOREACH coactor_movies GENERATE $0 AS mv, $1 AS act1, $3 AS act2;
DUMP coactor_movies2;
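-- Remove the pairs where an actor is paired with themselves
coactor_movies3 = FILTER coactor_movies2 BY act1 != act2;
DUMP coactor_movies3;
-- Normalize the symmetric pairs so that (A, B) and (B, A) become the same pair
-- (the source text is cut off after "FLATTEN((act1"; the ordering expression below is a reconstruction)
coactor_movies4 = FOREACH coactor_movies3 GENERATE mv, FLATTEN((act1 < act2 ? (act1, act2) : (act2, act1))) AS (act1, act2);
DUMP coactor_movies4;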
-- Remove duplicates
ca1 = DISTINCT coactor_movies4;
DUMP ca1;
-- Keep only the actors
ca2 = FOREACH ca1 GENERATE act1, act2;
DUMP ca2;
actores = GROUP ca2 BY (act1, act2);
results = FOREACH actores GENERATE FLATTEN(group) AS (act1, act2), COUNT($1) AS count;
or_results = ORDER results BY count DESC;
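
To sanity-check the pipeline above on a tiny input before running it on the cluster, the same pair-counting logic can be sketched in plain Python. This is only an illustration, not part of the original answer: the `count_costars` helper, the sample rows, and the `'TV_SERIES'` type value are all made up here, and the row layout is trimmed to the five fields the logic needs.

```python
from collections import Counter
from itertools import combinations


def count_costars(rows):
    """Count how many movies each unordered pair of actors shares.

    rows: iterable of (actor, title, year, num, type) tuples.
    This trimmed layout is an assumption made for the sketch.
    """
    casts = {}
    for actor, title, year, num, kind in rows:
        # Same filter as the Pig script: keep theatrical movies only.
        if kind != 'THEATRICAL_MOVIE':
            continue
        # Full movie name, built like CONCAT(title,'-',year,'-',num).
        movie = f"{title}-{year}-{num}"
        # A set drops repeated (movie, actor) rows, like DISTINCT.
        casts.setdefault(movie, set()).add(actor)

    pairs = Counter()
    for cast in casts.values():
        # Sorting first means each pair is emitted once, in one fixed
        # order, so (A, B) and (B, A) are counted as the same pair.
        for a, b in combinations(sorted(cast), 2):
            pairs[(a, b)] += 1

    # Most frequent pairs first, like ORDER ... BY count DESC.
    return pairs.most_common()


# Hypothetical sample rows, not taken from the IMDB file.
rows = [
    ("Alice", "Film A", "2000", "I", "THEATRICAL_MOVIE"),
    ("Bob", "Film A", "2000", "I", "THEATRICAL_MOVIE"),
    ("Alice", "Film B", "2001", "I", "THEATRICAL_MOVIE"),
    ("Bob", "Film B", "2001", "I", "THEATRICAL_MOVIE"),
    ("Carol", "Film B", "2001", "I", "THEATRICAL_MOVIE"),
    ("Alice", "Show C", "2002", "I", "TV_SERIES"),
    ("Bob", "Show C", "2002", "I", "TV_SERIES"),
]

for pair, count in count_costars(rows):
    print(pair, count)
```

On these rows Alice and Bob share two movies and every other pair shares one; the TV series rows are ignored. If the Pig output for an equivalent input disagrees, the symmetric-pair step is the first place to look.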

Regarding hadoop - HADOOP/PIG-LATIN: counting movie stars who frequently work together in PIG, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/63084642/
