Apache Tajo Test (Windows)


# Apache Tajo 

- Apache Tajo™: A big data warehouse system on Hadoop

http://tajo.apache.org/


# Installing Apache Tajo

- Download : http://tajo.apache.org/downloads.html

- Download the latest binary release (0.11.0) and extract it


- Set HADOOP_HOME and JAVA_HOME in conf/tajo-env.cmd

@rem Hadoop home. Required

set HADOOP_HOME=%HADOOP_HOME%


@rem The java implementation to use.  Required.

set JAVA_HOME=%JAVA_HOME%


# Running Apache Tajo

bin\start-tajo.cmd


# Running tsql and testing

- Sample data: MovieLens movie ratings - http://grouplens.org/datasets/movielens/

- http://files.grouplens.org/datasets/movielens/ml-20m.zip  (the MovieLens 20M Dataset is used here)


> hadoop fs -ls /user/cdecl/data                                                                     

Found 6 items                                                                                        

-rw-r--r--   1 cdecl supergroup       8652 2015-11-13 13:03 /user/cdecl/data/README.txt              

-rw-r--r--   1 cdecl supergroup     569517 2015-11-13 13:03 /user/cdecl/data/links.csv               

-rw-r--r--   1 cdecl supergroup    1397542 2015-11-13 13:03 /user/cdecl/data/movies.csv              

-rw-r--r--   1 cdecl supergroup        258 2015-11-13 13:03 /user/cdecl/data/movies.csv.dsn          

-rw-r--r--   1 cdecl supergroup  533444411 2015-11-13 13:03 /user/cdecl/data/ratings.csv             

-rw-r--r--   1 cdecl supergroup   16603996 2015-11-13 13:03 /user/cdecl/data/tags.csv              


- ratings.csv 

- Movie rating data, about 500MB, 20,000,264 rows

Ratings Data File Structure (ratings.csv)

-----------------------------------------

All ratings are contained in the file `ratings.csv`.

    userId,movieId,rating,timestamp


userId,movieId,rating,timestamp

138493,60816,4.5,1259865163

138493,61160,4.0,1258390537

138493,65682,4.5,1255816373

138493,66762,4.5,1255805408

138493,68319,4.5,1260209720
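
The sample rows show why column types matter when a table is declared over this file: `rating` takes half-star values such as 4.5, and `timestamp` is an integer number of seconds since the Unix epoch. A minimal Python sketch using only rows quoted above (the helper name `parse_rating` is hypothetical, chosen for illustration):

```python
import csv
import io
from datetime import datetime, timezone

# Sample rows copied from ratings.csv above.
SAMPLE = """userId,movieId,rating,timestamp
138493,60816,4.5,1259865163
138493,61160,4.0,1258390537
138493,65682,4.5,1255816373
"""

def parse_rating(row):
    """Convert one CSV record (a dict of strings) into typed values."""
    return {
        "userId": int(row["userId"]),
        "movieId": int(row["movieId"]),
        # Half-star steps, so this column cannot be read as an integer.
        "rating": float(row["rating"]),
        # Epoch seconds, interpreted here as UTC.
        "timestamp": datetime.fromtimestamp(int(row["timestamp"]), tz=timezone.utc),
    }

ratings = [parse_rating(r) for r in csv.DictReader(io.StringIO(SAMPLE))]

for r in ratings:
    print(r["movieId"], r["rating"], r["timestamp"].isoformat())
```

Calling `int("4.5")` in place of `float(...)` would raise a `ValueError`, which is the Python analogue of declaring the rating column with an integer type.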


- movies.csv

- Movie information, about 1MB, 27,279 rows

Movies Data File Structure (movies.csv)

---------------------------------------

Movie information is contained in the file `movies.csv`. Each line of this file after the header row represents one movie, and has the following format:

    movieId,title,genres


movieId,title,genres

131241,Ants in the Pants (2000),Comedy|Romance

131243,Werner - Gekotzt wird später (2003),Animation|Comedy

131248,Brother Bear 2 (2006),Adventure|Animation|Children|Comedy|Fantasy

131250,No More School (2000),Comedy

131252,Forklift Driver Klaus: The First Day on the Job (2001),Comedy|Horror
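
Two details of this format are worth keeping in mind before loading it with a plain comma delimiter: `genres` is itself a pipe-separated list, and titles that contain a comma are wrapped in double quotes in this file. The sketch below compares a naive split with quote-aware parsing. The second data row is an illustrative row of that quoted shape, not one of the sample lines above. Whether Tajo's TEXT format honors the quotes was not verified here; if it splits purely on the delimiter (an assumption), such rows would push part of the title into the `genres` column.

```python
import csv

# First data row is copied from the sample above; the second is an
# illustrative row whose quoted title contains a comma.
SAMPLE = '''movieId,title,genres
131248,Brother Bear 2 (2006),Adventure|Animation|Children|Comedy|Fantasy
11,"American President, The (1995)",Comedy|Drama|Romance
'''

lines = SAMPLE.splitlines()[1:]  # skip the header line

# Naive delimiter split: the quoted title is cut into two fields.
naive = [line.split(",") for line in lines]

# Quote-aware parsing keeps the title in one field.
parsed = list(csv.reader(lines))

# The genres column is a pipe-separated list of its own.
genres = [row[2].split("|") for row in parsed]

print(len(naive[1]), len(parsed[1]))
print(parsed[1][1])
print(genres[0])
```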



- Run tsql

D:\hadoop\tajo-0.11.0

> bin\tsql

starting cli, logging to D:\hadoop\tajo-0.11.0\logs\tajo.log


Try \? for help.

default>


CREATE EXTERNAL table movies ( mid int,  title text,  genres text )

USING TEXT WITH ('text.delimiter'=',', 'text.skip.headerlines'='1')

LOCATION 'hdfs://localhost:9000/user/cdecl/data/movies.csv';


CREATE EXTERNAL table ratings ( userid int, mid int, rate float8, timest text )

USING TEXT WITH ('text.delimiter'=',', 'text.skip.headerlines'='1')

LOCATION 'hdfs://localhost:9000/user/cdecl/data/ratings.csv';


SELECT a.mid, max(b.title), avg(a.rate) 

FROM ratings a join movies b on a.mid = b.mid 

GROUP BY a.mid 

ORDER BY avg(a.rate) DESC 

LIMIT 10;
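
To make the semantics of the query above concrete, here is a minimal pure-Python sketch of the same steps (inner join on `mid`, group by movie, average, sort descending, limit). The data is invented for illustration and is not taken from MovieLens, and `top_rated` is a hypothetical helper name.

```python
from collections import defaultdict

# Toy data, invented for illustration only (not taken from MovieLens).
movies = {1: "Movie A", 2: "Movie B", 3: "Movie C"}  # mid -> title
ratings = [  # (userid, mid, rate)
    (10, 1, 4.5), (11, 1, 3.5),
    (10, 2, 5.0),
    (12, 3, 2.0), (13, 3, 3.0),
    (14, 99, 4.0),  # no matching movie: dropped by the inner join
]

def top_rated(ratings, movies, limit=10):
    """Average rating per movie, highest first, mirroring the query above."""
    totals = defaultdict(lambda: [0.0, 0])  # mid -> [sum of rates, count]
    for _userid, mid, rate in ratings:
        if mid in movies:  # JOIN ... ON a.mid = b.mid
            totals[mid][0] += rate
            totals[mid][1] += 1
    # GROUP BY a.mid with avg(a.rate); the title is unique per mid.
    rows = [(mid, movies[mid], s / n) for mid, (s, n) in totals.items()]
    rows.sort(key=lambda r: r[2], reverse=True)  # ORDER BY avg(a.rate) DESC
    return rows[:limit]  # LIMIT

for row in top_rated(ratings, movies):
    print(row)
```

In the toy data, Movie B ranks first on the strength of a single rating. The same effect can appear on the real dataset, so a minimum rating count per movie (for example a `HAVING count(*) >= n` clause) may be worth considering if the ranking itself matters.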



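- Getting the same result took about 3 minutes with Spark (Python) and about 1 minute with Tajo, so on a simple single-node run Tajo appears to be the faster of the two

- However, both Spark and Tajo are designed to maximize performance across a cluster of many nodes rather than on a single node, so this local run should be read only as a simple test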

- Spark(Python) Test : http://cdecl.tistory.com/306