[Test] [Test Hive Connector V2] Test the Hive Source Connector V2 and record the problems encountered in the test #2793

Closed · 12 tasks done
EricJoy2048 opened this issue on Sep 19, 2022 · 6 comments

EricJoy2048 (Member) commented on Sep 19, 2022:

SeaTunnel version: dev
Hadoop version: Hadoop 2.10.2
Flink version: 1.12.7
Spark version: 2.4.3, scala version 2.11.12

Problems found that need to be fixed

Hive Source Connector

1. Test text file format table

create table test_hive_source(
    test_tinyint     TINYINT,
    test_smallint    SMALLINT,
    test_int         INT,
    test_bigint      BIGINT,
    test_boolean     BOOLEAN,
    test_float       FLOAT,
    test_double      DOUBLE,
    test_string      STRING,
    test_binary      BINARY,
    test_timestamp   TIMESTAMP,
    test_decimal     DECIMAL(8,2),
    test_char        CHAR(64),
    test_varchar     VARCHAR(64),
    test_date        DATE,
    test_array       ARRAY<INT>,
    test_map         MAP<STRING, FLOAT>,
    test_struct      STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
PARTITIONED BY (test_par1 STRING, test_par2 STRING);
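
To sanity-check the DDL before loading data, standard Hive commands can be used. With no STORED AS clause, the table above should default to TEXTFILE (assuming the cluster's hive.default.fileformat has not been changed), so the InputFormat shown should be TextInputFormat:

-- Show columns, partition keys, and storage information (InputFormat/OutputFormat).
DESCRIBE FORMATTED test_hive_source;

-- List the partitions; they appear only after the inserts below have run.
SHOW PARTITIONS test_hive_source;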

-- Insert 10 rows into partition (test_par1='par1', test_par2='par2').
-- A constant SELECT like the one below inserts one row per execution,
-- so it was run 10 times to reach 10 rows.
insert into table test_hive_source partition(test_par1='par1', test_par2='par2') select 
1 as test_tinyint,
1 as test_smallint,
100 as test_int,
40000000000 as test_bigint,
true as test_boolean,
1.01 as test_float,
1.002 as test_double,
'gan' as test_string,
'DataDataData' as test_binary,
current_timestamp() as test_timestamp,
83.2 as test_decimal,
'char64' as test_char,
'varchar64' as test_varchar,
cast(substring(from_unixtime(unix_timestamp(cast('2016-01-01' as string), 'yyyy-MM-dd')),1,10) as date) as test_date,
array(1,2) as test_array,
map("name",cast('1.11' as float),"age",cast('1.11' as float)) as test_map,
NAMED_STRUCT('street', 'London', 'city','W1a9JF','state','Finished','zip', 123) as test_struct;
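
As a side note, the test_date expression above takes a long route through unix_timestamp/from_unixtime; assuming a reasonably recent Hive, a direct cast of the literal should produce the same DATE value:

-- Likely equivalent, simpler forms of the test_date expression:
SELECT cast('2016-01-01' as date);
-- to_date() also works, but note it returns STRING before Hive 2.1 and DATE from 2.1 on.
SELECT to_date('2016-01-01');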

-- Insert 10 rows into partition (test_par1='par1', test_par2='par2_1')
-- by re-selecting the 10 existing rows.
insert into table test_hive_source partition(test_par1='par1', test_par2='par2_1') select 
test_tinyint,
test_smallint,
test_int,
test_bigint,
test_boolean,
test_float,
test_double,
test_string,
test_binary,
test_timestamp,
test_decimal,
test_char,
test_varchar,
test_date,
test_array,
test_map,
test_struct from test_hive_source;

-- Insert 20 rows into partition (test_par1='par1_1', test_par2='par2_2')
-- by re-selecting the 20 existing rows.
insert into table test_hive_source partition(test_par1='par1_1', test_par2='par2_2') select 
test_tinyint,
test_smallint,
test_int,
test_bigint,
test_boolean,
test_float,
test_double,
test_string,
test_binary,
test_timestamp,
test_decimal,
test_char,
test_varchar,
test_date,
test_array,
test_map,
test_struct from test_hive_source;

Total rows: 40

rows    test_par1    test_par2
10      par1         par2
10      par1         par2_1
20      par1_1       par2_2
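
This distribution can be verified directly in Hive with a partition-wise count:

-- Should return the three partitions above with counts 10, 10, and 20.
SELECT count(*), test_par1, test_par2
FROM test_hive_source
GROUP BY test_par1, test_par2;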

1.1 Test in Flink Engine

1.1.1 Job Config File

env {
  # You can set flink configuration here
  execution.parallelism = 3
  job.name="test_hive_source_to_console"
}

source {
  # This is an example input plugin, used only to test and demonstrate the input plugin feature

  Hive {
    table_name = "test_hive.test_hive_source"
    metastore_uri = "thrift://ctyun7:9083"
  }

  # If you would like to get more information about how to configure SeaTunnel and see the full list of input plugins,
  # please go to https://seatunnel.apache.org/docs/flink/configuration/source-plugins/Fake
}

transform {

}

sink {
  # Choose the stdout output plugin to output data to the console
  Console {
  }

  # If you would like to get more information about how to configure SeaTunnel and see the full list of output plugins,
  # please go to https://seatunnel.apache.org/docs/flink/configuration/sink-plugins/Console
}

1.1.2 Submit Job Command

sh start-seatunnel-flink-connector-v2.sh --config ../config/flink_hive_to_console.conf 

1.1.3 Check Result in Flink Job log

Some fields in the data are lost; this can be fixed in the future (see #2473).
The row count is correct.


1.2 Test in Spark Engine

1.2.1 Job Config File

env {
  # You can set spark configuration here
  source.parallelism = 3
  job.name="test_hive_source_to_console"
}

source {
  # This is an example input plugin, used only to test and demonstrate the input plugin feature

  Hive {
    table_name = "test_hive.test_hive_source"
    metastore_uri = "thrift://ctyun7:9083"
  }

  # If you would like to get more information about how to configure SeaTunnel and see the full list of input plugins,
  # please go to https://seatunnel.apache.org/docs/flink/configuration/source-plugins/Fake
}

transform {

}

sink {
  # Choose the stdout output plugin to output data to the console
  Console {
  }

  # If you would like to get more information about how to configure SeaTunnel and see the full list of output plugins,
  # please go to https://seatunnel.apache.org/docs/flink/configuration/sink-plugins/Console
}

1.2.2 Submit Job Command (--deploy-mode client --master local)

sh start-seatunnel-spark-connector-v2.sh --config ../config/spark_hive_to_console.conf --deploy-mode client --master local

1.2.3 Check data result

Some fields in the data are lost; this can be fixed in the future.
The row count is correct.


1.2.4 Submit Job Command (--deploy-mode client --master yarn)

sh start-seatunnel-spark-connector-v2.sh --config ../config/spark_hive_to_console.conf --deploy-mode client --master yarn

1.2.5 Check data result


2. Test ORC file format table

create table test_hive_source_orc(
    test_tinyint     TINYINT,
    test_smallint    SMALLINT,
    test_int         INT,
    test_bigint      BIGINT,
    test_boolean     BOOLEAN,
    test_float       FLOAT,
    test_double      DOUBLE,
    test_string      STRING,
    test_binary      BINARY,
    test_timestamp   TIMESTAMP,
    test_decimal     DECIMAL(8,2),
    test_char        CHAR(64),
    test_varchar     VARCHAR(64),
    test_date        DATE,
    test_array       ARRAY<INT>,
    test_map         MAP<STRING, FLOAT>,
    test_struct      STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
PARTITIONED BY (test_par1 STRING, test_par2 STRING)
STORED AS ORCFILE;

The test data is the same as for the text file format table.
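
One way to populate it (a sketch; the issue does not say exactly how the data was loaded) is a dynamic-partition copy from the text table:

-- Allow both partition columns to be filled dynamically.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Copy all 40 rows from the text table, preserving the partition layout;
-- the dynamic partition columns must come last in the SELECT list.
INSERT INTO TABLE test_hive_source_orc PARTITION (test_par1, test_par2)
SELECT
    test_tinyint, test_smallint, test_int, test_bigint, test_boolean,
    test_float, test_double, test_string, test_binary, test_timestamp,
    test_decimal, test_char, test_varchar, test_date, test_array,
    test_map, test_struct,
    test_par1, test_par2
FROM test_hive_source;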

2.1 Test in Flink engine

2.1.1 Job Config File

env {
  # You can set flink configuration here
  execution.parallelism = 3
  job.name="test_hiveorc_source_to_console"
}

source {
  # This is an example input plugin, used only to test and demonstrate the input plugin feature

  Hive {
    table_name = "test_hive.test_hive_source_orc"
    metastore_uri = "thrift://ctyun7:9083"
  }

  # If you would like to get more information about how to configure SeaTunnel and see the full list of input plugins,
  # please go to https://seatunnel.apache.org/docs/flink/configuration/source-plugins/Fake
}

transform {

}

sink {
  # Choose the stdout output plugin to output data to the console
  Console {
  }

  # If you would like to get more information about how to configure SeaTunnel and see the full list of output plugins,
  # please go to https://seatunnel.apache.org/docs/flink/configuration/sink-plugins/Console
}

2.1.2 Submit Job Command

sh start-seatunnel-flink-connector-v2.sh --config ../config/flink_hiveorc_to_console.conf 

2.1.3 Check Data Result

The fields in the data are correct.
The row count is correct.

2.2 Test in Spark Engine

2.2.1 Job Config File

env {
  # You can set spark configuration here
  source.parallelism = 3
  job.name="test_hiveorc_source_to_console"
}

source {
  # This is an example input plugin, used only to test and demonstrate the input plugin feature

  Hive {
    table_name = "test_hive.test_hive_source_orc"
    metastore_uri = "thrift://ctyun7:9083"
  }

  # If you would like to get more information about how to configure SeaTunnel and see the full list of input plugins,
  # please go to https://seatunnel.apache.org/docs/flink/configuration/source-plugins/Fake
}

transform {

}

sink {
  # Choose the stdout output plugin to output data to the console
  Console {
  }

  # If you would like to get more information about how to configure SeaTunnel and see the full list of output plugins,
  # please go to https://seatunnel.apache.org/docs/flink/configuration/sink-plugins/Console
}

2.2.2 Submit Job Command

sh start-seatunnel-spark-connector-v2.sh --config ../config/spark_hiveorc_to_console.conf --deploy-mode client --master local

2.2.3 Check Data Result

3. Test Parquet file format table

create table test_hive_source_parquet(
    test_tinyint     TINYINT,
    test_smallint    SMALLINT,
    test_int         INT,
    test_bigint      BIGINT,
    test_boolean     BOOLEAN,
    test_float       FLOAT,
    test_double      DOUBLE,
    test_string      STRING,
    test_binary      BINARY,
    test_timestamp   TIMESTAMP,
    test_decimal     DECIMAL(8,2),
    test_char        CHAR(64),
    test_varchar     VARCHAR(64),
    test_date        DATE,
    test_array       ARRAY<INT>,
    test_map         MAP<STRING, FLOAT>,
    test_struct      STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
PARTITIONED BY (test_par1 STRING, test_par2 STRING)
STORED AS PARQUET;
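
The same dynamic-partition copy shown for the ORC table can be reused here with test_hive_source_parquet as the target. Before re-running the jobs, the declared storage format can be double-checked:

-- The printed DDL should reference the Parquet SerDe and
-- MapredParquetInputFormat as the InputFormat.
SHOW CREATE TABLE test_hive_source_parquet;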
john8628 (Contributor) commented:

Please assign it to me, I will try it.

EricJoy2048 (Member, Author) commented:

> Please assign it to me, I will try it.

Which issue do you want to work on?

I am doing the test now and will record all the problems found in the test and add a todo list here. You can take whichever issue you want to handle.

john8628 (Contributor) commented:

> Which issue do you want to work on?

I want to work on #2792, but it seems too difficult for me.

EricJoy2048 (Member, Author) commented:

> I want to work on #2792, but it seems too difficult for me.

I have already fixed #2792. You can look at the other tasks; we have many issues marked as good first issue that need help.

EricJoy2048 changed the title from "[Test] [Test Hive Connector V2] Test the Hive Connector V2 and record the problems encountered in the test" to "[Test] [Test Hive Connector V2] Test the Hive Source Connector V2 and record the problems encountered in the test" on Sep 20, 2022
EricJoy2048 self-assigned this and unassigned john8628 on Sep 22, 2022
EricJoy2048 added this to the 2.2.0 milestone on Sep 23, 2022
github-actions bot commented:

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in the next 7 days if no further activity occurs.

github-actions bot added the stale label on Nov 19, 2022

github-actions bot commented:

This issue has been closed because it has not received a response for too long. You can reopen it if you encounter similar problems in the future.
