Let's get started with sample dataset from open sources. We will use USA civil flights data since 1987 till 2015. It's hard to call this sample a Big Data (contains 166 millions rows, 63 Gb of uncompressed data) but this allows us to quickly get to work. Dataset is available for download here. Also you may download it from the original datasource as described here.
Firstly we will deploy ClickHouse to a single server. Below that we will also review the process of deployment to a cluster with support for sharding and replication.
On Ubuntu and Debian Linux ClickHouse can be installed from packages. For other Linux distributions you can compile ClickHouse from sources and then install.
clickhouse-client package contains clickhouse-client application — interactive ClickHouse client. clickhouse-server-base contains a clickhouse-server binary file. clickhouse-server-common — contains config files for the clickhouse-server.
Server config files are located in /etc/clickhouse-server/. Before getting to work please notice the path element in config. Path determines the location for data storage. It's not really handy to directly edit config.xml file considering package updates. Recommended way is to override the config elements in files of config.d directory. Also you may want to set up access rights at the start.
clickhouse-server won't be launched automatically after package installation. It won't be automatically restarted after updates either. Start the server with:
sudo service clickhouse-server startDefault location for server logs is /var/log/clickhouse-server/ Server is ready to handle client conections once "Ready for connections" message was logged.
Use clickhouse-client to connect to the server.
clickhouse-client clickhouse-client --host=... --port=... --user=... --password=...Enable multiline queries:
clickhouse-client -m clickhouse-client --multilineRun queries in batch-mode:
clickhouse-client --query='SELECT 1' echo 'SELECT 1' | clickhouse-clientInser data from file of a specified format:
clickhouse-client --query='INSERT INTO table VALUES' < data.txt clickhouse-client --query='INSERT INTO table FORMAT TabSeparated' < data.tsv
$ clickhouse-client --multiline ClickHouse client version 0.0.53720. Connecting to localhost:9000. Connected to ClickHouse server version 0.0.53720. :) CREATE TABLE ontime ( Year UInt16, Quarter UInt8, Month UInt8, DayofMonth UInt8, DayOfWeek UInt8, FlightDate Date, UniqueCarrier FixedString(7), AirlineID Int32, Carrier FixedString(2), TailNum String, FlightNum String, OriginAirportID Int32, OriginAirportSeqID Int32, OriginCityMarketID Int32, Origin FixedString(5), OriginCityName String, OriginState FixedString(2), OriginStateFips String, OriginStateName String, OriginWac Int32, DestAirportID Int32, DestAirportSeqID Int32, DestCityMarketID Int32, Dest FixedString(5), DestCityName String, DestState FixedString(2), DestStateFips String, DestStateName String, DestWac Int32, CRSDepTime Int32, DepTime Int32, DepDelay Int32, DepDelayMinutes Int32, DepDel15 Int32, DepartureDelayGroups String, DepTimeBlk String, TaxiOut Int32, WheelsOff Int32, WheelsOn Int32, TaxiIn Int32, CRSArrTime Int32, ArrTime Int32, ArrDelay Int32, ArrDelayMinutes Int32, ArrDel15 Int32, ArrivalDelayGroups Int32, ArrTimeBlk String, Cancelled UInt8, CancellationCode FixedString(1), Diverted UInt8, CRSElapsedTime Int32, ActualElapsedTime Int32, AirTime Int32, Flights Int32, Distance Int32, DistanceGroup UInt8, CarrierDelay Int32, WeatherDelay Int32, NASDelay Int32, SecurityDelay Int32, LateAircraftDelay Int32, FirstDepTime String, TotalAddGTime String, LongestAddGTime String, DivAirportLandings String, DivReachedDest String, DivActualElapsedTime String, DivArrDelay String, DivDistance String, Div1Airport String, Div1AirportID Int32, Div1AirportSeqID Int32, Div1WheelsOn String, Div1TotalGTime String, Div1LongestGTime String, Div1WheelsOff String, Div1TailNum String, Div2Airport String, Div2AirportID Int32, Div2AirportSeqID Int32, Div2WheelsOn String, Div2TotalGTime String, Div2LongestGTime String, Div2WheelsOff String, Div2TailNum String, Div3Airport String, Div3AirportID Int32, Div3AirportSeqID Int32, Div3WheelsOn String, Div3TotalGTime String, Div3LongestGTime String, Div3WheelsOff String, Div3TailNum String, Div4Airport String, Div4AirportID Int32, Div4AirportSeqID Int32, Div4WheelsOn String, Div4TotalGTime String, Div4LongestGTime String, Div4WheelsOff String, Div4TailNum String, Div5Airport String, Div5AirportID Int32, Div5AirportSeqID Int32, Div5WheelsOn String, Div5TotalGTime String, Div5LongestGTime String, Div5WheelsOff String, Div5TailNum String ) ENGINE = MergeTree(FlightDate, (Year, FlightDate), 8192);
Now we have a table of MergeTree type. MergeTree table type is recommended for usage in production. Table of this kind has a primary key used for incremental sort of table data. This allows fast execution of queries in ranges of a primary key.
Note We store ad network banners impressions logs in ClickHouse. Each table entry looks like: