
Apache Hadoop 2.7.1 (Windows)

** Installing and testing Apache Hadoop on Windows 10


Apache Hadoop for Windows 

- Normally you have to pull the source from GitHub and build it yourself, but a convenient unofficial 64-bit Windows build exists, so download that binary instead

- karthikj1/Hadoop-2.7.1-Windows-64-binaries https://github.com/karthikj1/Hadoop-2.7.1-Windows-64-binaries

- Installed as a Single Node Cluster in Pseudo-Distributed Mode


# Installation

- Extract the downloaded archive to a path that contains no spaces

- Set the HADOOP_HOME and JAVA_HOME environment variables

- Add the HADOOP_HOME\bin directory to PATH

* Set these under System - Advanced system settings - Environment Variables

HADOOP_HOME=D:\hadoop\hadoop-2.7.1

JAVA_HOME=D:\hadoop\Java


PATH=%PATH%;D:\hadoop\hadoop-2.7.1\bin


* A default Java installation lands in "C:\Program Files\Java\jre1.8.0_66", which contains a space, so use mklink to create a symbolic link at a space-free path

mklink /j d:\hadoop\Java "C:\Program Files\Java\jre1.8.0_66"

Junction created for d:\hadoop\Java <<===>> C:\Program Files\Java\jre1.8.0_66


# Hadoop Config

- %HADOOP_HOME%\etc\hadoop\core-site.xml

- Sets the URI of the Hadoop filesystem service

- To expose the service externally, set the value to hdfs://0.0.0.0:9000

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

fs.defaultFS:

The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.
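The scheme/authority split described above can be seen with Python's standard `urlparse` (an illustration only; Hadoop does its own URI parsing in Java):

```python
from urllib.parse import urlparse

# The fs.defaultFS value from core-site.xml above
uri = urlparse("hdfs://localhost:9000")

# The scheme picks the implementation class via fs.SCHEME.impl (here fs.hdfs.impl)
print(uri.scheme)              # hdfs
# The authority supplies the host and port of the filesystem service
print(uri.hostname, uri.port)  # localhost 9000
```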


- %HADOOP_HOME%\etc\hadoop\hdfs-site.xml

- dfs.replication sets the number of block/file replicas

- Optionally sets the namenode and datanode storage paths; if omitted, they are created under /tmp

- A file:/ path refers to the root of the current drive (c:\ or d:\)

<configuration>
	<property>
		<name>dfs.replication</name>
		<value>1</value>
	</property>
	<property>
		<name>dfs.namenode.name.dir</name>
		<value>file:/hadoop/data/dfs/namenode</value>
	</property>
	<property>
		<name>dfs.datanode.data.dir</name>
		<value>file:/hadoop/data/dfs/datanode</value>
	</property>
</configuration>

dfs.replication:

Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.


dfs.namenode.name.dir:

Determines where on the local filesystem the DFS name node should store the name table(fsimage). If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy.


dfs.datanode.data.dir:

Determines where on the local filesystem an DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.
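The "comma-delimited list, non-existent directories ignored" behavior can be sketched in plain Python (the paths below are made up for the example):

```python
import os
import tempfile

# A hypothetical comma-delimited dfs.datanode.data.dir value: one real
# directory plus one that does not exist
existing = tempfile.mkdtemp()
value = existing + ",/no/such/dir"

# HDFS uses every listed directory but skips entries that do not exist;
# the same filter expressed in Python:
usable = [d for d in value.split(",") if os.path.isdir(d)]
print(usable)  # only the existing directory survives
```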


- %HADOOP_HOME%\etc\hadoop\yarn-site.xml

- YARN settings and the Hadoop application classpath

<configuration>
    <property>
       <name>yarn.nodemanager.aux-services</name>
       <value>mapreduce_shuffle</value>
    </property>
    <property>
       <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
       <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
       <name>yarn.application.classpath</name>
       <value>
            %HADOOP_HOME%\etc\hadoop,
            %HADOOP_HOME%\share\hadoop\common\*,
            %HADOOP_HOME%\share\hadoop\common\lib\*,
            %HADOOP_HOME%\share\hadoop\mapreduce\*,
            %HADOOP_HOME%\share\hadoop\mapreduce\lib\*,
            %HADOOP_HOME%\share\hadoop\hdfs\*,
            %HADOOP_HOME%\share\hadoop\hdfs\lib\*,         
            %HADOOP_HOME%\share\hadoop\yarn\*,
            %HADOOP_HOME%\share\hadoop\yarn\lib\*
       </value>
    </property>
</configuration>

yarn.nodemanager.aux-services:

The auxiliary service name. Default value is mapreduce_shuffle


yarn.nodemanager.aux-services.mapreduce.shuffle.class:

The auxiliary service class to use. Default value is org.apache.hadoop.mapred.ShuffleHandler


yarn.application.classpath:

CLASSPATH for YARN applications. A comma-separated list of CLASSPATH entries.
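Each classpath entry above embeds a Windows-style %HADOOP_HOME% placeholder. A rough sketch of how such a value splits and expands (done by hand here, since `os.path.expandvars` only understands %VAR% on Windows itself):

```python
import os

# Match the HADOOP_HOME set earlier in this post
os.environ["HADOOP_HOME"] = r"D:\hadoop\hadoop-2.7.1"

# Two entries taken from the yarn.application.classpath value above
raw = r"%HADOOP_HOME%\etc\hadoop, %HADOOP_HOME%\share\hadoop\common\*"

# Split on commas, trim whitespace, and substitute the placeholder
entries = [e.strip().replace("%HADOOP_HOME%", os.environ["HADOOP_HOME"])
           for e in raw.split(",")]
print(entries)
```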


- %HADOOP_HOME%\etc\hadoop\mapred-site.xml

- Sets the MapReduce runtime framework

<configuration>
    <property>
       <name>mapreduce.framework.name</name>
       <value>yarn</value>
    </property>
</configuration>

mapreduce.framework.name:

The runtime framework for executing MapReduce jobs. Can be one of local, classic or yarn.


# Format the Namenode

%HADOOP_HOME%\bin\hdfs namenode -format 


# Start HDFS (Namenode and Datanode) and YARN (Resource Manager and Node Manager)

%HADOOP_HOME%\sbin\start-dfs.cmd

%HADOOP_HOME%\sbin\start-yarn.cmd


* Four console windows open, one per service


# Checking the services

- Resource Manager and Node Manager : http://localhost:8042

- Namenode : http://localhost:50070


* Service URI : http://localhost:9000
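Besides opening the web UIs, a quick TCP probe of the ports above confirms the daemons are listening (a minimal sketch; the helper name is my own):

```python
import socket

def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# With all daemons running, each of these should report True
for port in (9000, 50070, 8042):
    print(port, port_open("localhost", port))
```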


# Test (Apache Spark, Python)

> hadoop fs -mkdir -p /user/cdecl/data 


> hadoop fs -ls /                                                                                    

Found 1 items                                                                                        

drwxr-xr-x   - cdecl supergroup          0 2015-11-13 13:02 /user

                                    

> hadoop fs -put D:\hadoop\data\ml-20m\* /user/cdecl/data


> hadoop fs -ls /user/cdecl/data                                                                     

Found 6 items                                                                                        

-rw-r--r--   1 cdecl supergroup       8652 2015-11-13 13:03 /user/cdecl/data/README.txt              

-rw-r--r--   1 cdecl supergroup     569517 2015-11-13 13:03 /user/cdecl/data/links.csv               

-rw-r--r--   1 cdecl supergroup    1397542 2015-11-13 13:03 /user/cdecl/data/movies.csv              

-rw-r--r--   1 cdecl supergroup        258 2015-11-13 13:03 /user/cdecl/data/movies.csv.dsn          

-rw-r--r--   1 cdecl supergroup  533444411 2015-11-13 13:03 /user/cdecl/data/ratings.csv             

-rw-r--r--   1 cdecl supergroup   16603996 2015-11-13 13:03 /user/cdecl/data/tags.csv                

 


# SparkApp
from pyspark import SparkContext


def main():
    # Run locally, using as many worker threads as there are cores
    sc = SparkContext("local[*]", "SparkApp")

    # Read the file uploaded to HDFS above
    movies = "hdfs://localhost:9000/user/cdecl/data/movies.csv"
    rdd = sc.textFile(movies)

    # Print the first five lines
    print(rdd.take(5))


if __name__ == "__main__":
    main()


['movieId,title,genres', '1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy', '2,Jumanji (1995),Adventure|Children|Fantasy', '3,Grumpier Old Men (1995),Comedy|Romance', '4,Waiting to Exhale (1995),Comedy|Drama|Romance']
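In the rows above, the genres field is |-delimited and titles may contain commas, so the `csv` module is safer than `str.split` for downstream parsing. A plain-Python sanity check on one row from the output (no Spark needed):

```python
import csv

# One data row from the take(5) output above
row = "1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy"

# csv handles quoting and embedded commas correctly
movie_id, title, genres = next(csv.reader([row]))
print(movie_id)           # 1
print(title)              # Toy Story (1995)
print(genres.split("|"))  # ['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy']
```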


References:

http://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/SingleCluster.html

http://www.srccodes.com/p/article/38/build-install-configure-run-apache-hadoop-2.2.0-microsoft-windows-os