
REST: Docker file for Rest catalog adapter image #11283

Open · wants to merge 3 commits into base: main

Conversation

@ajantha-bhat (Member):

depends on #11279

build.gradle (outdated review thread, resolved)
@@ -985,6 +985,15 @@ project(':iceberg-open-api') {
exclude group: 'org.apache.commons', module: 'commons-configuration2'
exclude group: 'org.apache.hadoop.thirdparty', module: 'hadoop-shaded-protobuf_3_7'
exclude group: 'org.eclipse.jetty'
exclude group: 'com.google.re2j', module: 're2j'
@ajantha-bhat (Member Author):

Excluded some more dependencies that are unrelated to the 'Configuration' class.

These were not included in the LICENSE either.
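For context, a hedged sketch of the dependency block these exclusions might live in; the configuration name and the hadoop-common coordinates below are assumptions inferred from the hunk header and the later discussion, not the PR's literal code:

dependencies {
  testFixturesImplementation('org.apache.hadoop:hadoop-common:3.4.1') {
    // hadoop-common is kept only for its Configuration class, so transitive
    // server-side modules can be excluded from the runtime jar (and then do
    // not need to be covered by the LICENSE/NOTICE).
    exclude group: 'org.apache.commons', module: 'commons-configuration2'
    exclude group: 'org.apache.hadoop.thirdparty', module: 'hadoop-shaded-protobuf_3_7'
    exclude group: 'org.eclipse.jetty'
    exclude group: 'com.google.re2j', module: 're2j'
  }
}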

@Fokko (Contributor) left a comment:

I tried to plug this into the PyIceberg integration tests, but it didn't work out of the box:

➜  iceberg-python git:(main) ✗ make test-integration
docker compose -f dev/docker-compose-integration.yml kill
WARN[0000] /Users/fokko.driesprong/work/iceberg-python/dev/docker-compose-integration.yml: the attribute `version` is obsolete, it will be ignored, please remove it to avoid potential confusion
[+] Killing 4/4
 ✔ Container pyiceberg-spark  Killed   0.2s
 ✔ Container pyiceberg-mc     Killed   0.2s
 ✔ Container pyiceberg-minio  Killed   0.3s
 ✔ Container hive             Killed   0.2s
docker compose -f dev/docker-compose-integration.yml rm -f
WARN[0000] /Users/fokko.driesprong/work/iceberg-python/dev/docker-compose-integration.yml: the attribute `version` is obsolete, it will be ignored, please remove it to avoid potential confusion
Going to remove pyiceberg-spark, pyiceberg-mc, hive, pyiceberg-rest, pyiceberg-minio
[+] Removing 5/0
 ✔ Container pyiceberg-minio  Removed  0.0s
 ✔ Container pyiceberg-spark  Removed  0.0s
 ✔ Container hive             Removed  0.0s
 ✔ Container pyiceberg-mc     Removed  0.0s
 ✔ Container pyiceberg-rest   Removed  0.0s
docker compose -f dev/docker-compose-integration.yml up -d
WARN[0000] /Users/fokko.driesprong/work/iceberg-python/dev/docker-compose-integration.yml: the attribute `version` is obsolete, it will be ignored, please remove it to avoid potential confusion
[+] Running 5/5
 ✔ Container pyiceberg-rest   Started  0.2s
 ✔ Container hive             Started  0.2s
 ✔ Container pyiceberg-minio  Started  0.2s
 ✔ Container pyiceberg-mc     Started  0.3s
 ✔ Container pyiceberg-spark  Started  0.3s
sleep 10
docker compose -f dev/docker-compose-integration.yml cp ./dev/provision.py spark-iceberg:/opt/spark/provision.py
WARN[0000] /Users/fokko.driesprong/work/iceberg-python/dev/docker-compose-integration.yml: the attribute `version` is obsolete, it will be ignored, please remove it to avoid potential confusion
[+] Copying 1/0
 ✔ pyiceberg-spark copy ./dev/provision.py to pyiceberg-spark:/opt/spark/provision.py Copied  0.0s
docker compose -f dev/docker-compose-integration.yml exec -T spark-iceberg ipython ./provision.py
WARN[0000] /Users/fokko.driesprong/work/iceberg-python/dev/docker-compose-integration.yml: the attribute `version` is obsolete, it will be ignored, please remove it to avoid potential confusion
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/11/06 21:48:06 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
---------------------------------------------------------------------------
gaierror                                  Traceback (most recent call last)
File /usr/local/lib/python3.9/site-packages/urllib3/connection.py:174, in HTTPConnection._new_conn(self)
    173 try:
--> 174     conn = connection.create_connection(
    175         (self._dns_host, self.port), self.timeout, **extra_kw
    176     )
    178 except SocketTimeout:

File /usr/local/lib/python3.9/site-packages/urllib3/util/connection.py:72, in create_connection(address, timeout, source_address, socket_options)
     68     return six.raise_from(
     69         LocationParseError(u"'%s', label empty or too long" % host), None
     70     )
---> 72 for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
     73     af, socktype, proto, canonname, sa = res

File /usr/local/lib/python3.9/socket.py:954, in getaddrinfo(host, port, family, type, proto, flags)
    953 addrlist = []
--> 954 for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
    955     af, socktype, proto, canonname, sa = res

gaierror: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

NewConnectionError                        Traceback (most recent call last)
File /usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py:715, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    714 # Make the request on the httplib connection object.
--> 715 httplib_response = self._make_request(
    716     conn,
    717     method,
    718     url,
    719     timeout=timeout_obj,
    720     body=body,
    721     headers=headers,
    722     chunked=chunked,
    723 )
    725 # If we're going to release the connection in ``finally:``, then
    726 # the response doesn't need to know about the connection. Otherwise
    727 # it will also try to release it and we'll have a double-release
    728 # mess.

File /usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py:416, in HTTPConnectionPool._make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    415     else:
--> 416         conn.request(method, url, **httplib_request_kw)
    418 # We are swallowing BrokenPipeError (errno.EPIPE) since the server is
    419 # legitimately able to close the connection after sending a valid response.
    420 # With this behaviour, the received response is still readable.

File /usr/local/lib/python3.9/site-packages/urllib3/connection.py:244, in HTTPConnection.request(self, method, url, body, headers)
    243     headers["User-Agent"] = _get_default_user_agent()
--> 244 super(HTTPConnection, self).request(method, url, body=body, headers=headers)

File /usr/local/lib/python3.9/http/client.py:1285, in HTTPConnection.request(self, method, url, body, headers, encode_chunked)
   1284 """Send a complete request to the server."""
-> 1285 self._send_request(method, url, body, headers, encode_chunked)

File /usr/local/lib/python3.9/http/client.py:1331, in HTTPConnection._send_request(self, method, url, body, headers, encode_chunked)
   1330     body = _encode(body, 'body')
-> 1331 self.endheaders(body, encode_chunked=encode_chunked)

File /usr/local/lib/python3.9/http/client.py:1280, in HTTPConnection.endheaders(self, message_body, encode_chunked)
   1279     raise CannotSendHeader()
-> 1280 self._send_output(message_body, encode_chunked=encode_chunked)

File /usr/local/lib/python3.9/http/client.py:1040, in HTTPConnection._send_output(self, message_body, encode_chunked)
   1039 del self._buffer[:]
-> 1040 self.send(msg)
   1042 if message_body is not None:
   1043 
   1044     # create a consistent interface to message_body

File /usr/local/lib/python3.9/http/client.py:980, in HTTPConnection.send(self, data)
    979 if self.auto_open:
--> 980     self.connect()
    981 else:

File /usr/local/lib/python3.9/site-packages/urllib3/connection.py:205, in HTTPConnection.connect(self)
    204 def connect(self):
--> 205     conn = self._new_conn()
    206     self._prepare_conn(conn)

File /usr/local/lib/python3.9/site-packages/urllib3/connection.py:186, in HTTPConnection._new_conn(self)
    185 except SocketError as e:
--> 186     raise NewConnectionError(
    187         self, "Failed to establish a new connection: %s" % e
    188     )
    190 return conn

NewConnectionError: <urllib3.connection.HTTPConnection object at 0xffff7f744d90>: Failed to establish a new connection: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

MaxRetryError                             Traceback (most recent call last)
File /usr/local/lib/python3.9/site-packages/requests/adapters.py:667, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
    666 try:
--> 667     resp = conn.urlopen(
    668         method=request.method,
    669         url=url,
    670         body=request.body,
    671         headers=request.headers,
    672         redirect=False,
    673         assert_same_host=False,
    674         preload_content=False,
    675         decode_content=False,
    676         retries=self.max_retries,
    677         timeout=timeout,
    678         chunked=chunked,
    679     )
    681 except (ProtocolError, OSError) as err:

File /usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py:801, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    799     e = ProtocolError("Connection aborted.", e)
--> 801 retries = retries.increment(
    802     method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
    803 )
    804 retries.sleep()

File /usr/local/lib/python3.9/site-packages/urllib3/util/retry.py:594, in Retry.increment(self, method, url, response, error, _pool, _stacktrace)
    593 if new_retry.is_exhausted():
--> 594     raise MaxRetryError(_pool, url, error or ResponseError(cause))
    596 log.debug("Incremented Retry for (url='%s'): %r", url, new_retry)

MaxRetryError: HTTPConnectionPool(host='rest', port=8181): Max retries exceeded with url: /v1/config (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0xffff7f744d90>: Failed to establish a new connection: [Errno -2] Name or service not known'))

During handling of the above exception, another exception occurred:

ConnectionError                           Traceback (most recent call last)
File /opt/spark/provision.py:28
     23 from pyiceberg.types import FixedType, NestedField, UUIDType
     25 spark = SparkSession.builder.getOrCreate()
     27 catalogs = {
---> 28     'rest': load_catalog(
     29         "rest",
     30         **{
     31             "type": "rest",
     32             "uri": "http://rest:8181",
     33             "s3.endpoint": "http://minio:9000",
     34             "s3.access-key-id": "admin",
     35             "s3.secret-access-key": "password",
     36         },
     37     ),
     38     'hive': load_catalog(
     39         "hive",
     40         **{
     41             "type": "hive",
     42             "uri": "http://hive:9083",
     43             "s3.endpoint": "http://minio:9000",
     44             "s3.access-key-id": "admin",
     45             "s3.secret-access-key": "password",
     46         },
     47     ),
     48 }
     50 for catalog_name, catalog in catalogs.items():
     51     spark.sql(
     52         f"""
     53       CREATE DATABASE IF NOT EXISTS {catalog_name}.default;
     54     """
     55     )

File /usr/local/lib/python3.9/site-packages/pyiceberg/catalog/__init__.py:258, in load_catalog(name, **properties)
    255     catalog_type = infer_catalog_type(name, conf)
    257 if catalog_type:
--> 258     return AVAILABLE_CATALOGS[catalog_type](name, cast(Dict[str, str], conf))
    260 raise ValueError(f"Could not initialize catalog with the following properties: {properties}")

File /usr/local/lib/python3.9/site-packages/pyiceberg/catalog/__init__.py:133, in load_rest(name, conf)
    130 def load_rest(name: str, conf: Properties) -> Catalog:
    131     from pyiceberg.catalog.rest import RestCatalog
--> 133     return RestCatalog(name, **conf)

File /usr/local/lib/python3.9/site-packages/pyiceberg/catalog/rest.py:237, in RestCatalog.__init__(self, name, **properties)
    235 super().__init__(name, **properties)
    236 self.uri = properties[URI]
--> 237 self._fetch_config()
    238 self._session = self._create_session()

File /usr/local/lib/python3.9/site-packages/pyiceberg/catalog/rest.py:334, in RestCatalog._fetch_config(self)
    331     params[WAREHOUSE_LOCATION] = warehouse_location
    333 with self._create_session() as session:
--> 334     response = session.get(self.url(Endpoints.get_config, prefixed=False), params=params)
    335 try:
    336     response.raise_for_status()

File /usr/local/lib/python3.9/site-packages/requests/sessions.py:602, in Session.get(self, url, **kwargs)
    594 r"""Sends a GET request. Returns :class:`Response` object.
    595 
    596 :param url: URL for the new :class:`Request` object.
    597 :param \*\*kwargs: Optional arguments that ``request`` takes.
    598 :rtype: requests.Response
    599 """
    601 kwargs.setdefault("allow_redirects", True)
--> 602 return self.request("GET", url, **kwargs)

File /usr/local/lib/python3.9/site-packages/requests/sessions.py:589, in Session.request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    584 send_kwargs = {
    585     "timeout": timeout,
    586     "allow_redirects": allow_redirects,
    587 }
    588 send_kwargs.update(settings)
--> 589 resp = self.send(prep, **send_kwargs)
    591 return resp

File /usr/local/lib/python3.9/site-packages/requests/sessions.py:703, in Session.send(self, request, **kwargs)
    700 start = preferred_clock()
    702 # Send the request
--> 703 r = adapter.send(request, **kwargs)
    705 # Total elapsed time of the request (approximately)
    706 elapsed = preferred_clock() - start

File /usr/local/lib/python3.9/site-packages/requests/adapters.py:700, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
    696     if isinstance(e.reason, _SSLError):
    697         # This branch is for urllib3 v1.22 and later.
    698         raise SSLError(e, request=request)
--> 700     raise ConnectionError(e, request=request)
    702 except ClosedPoolError as e:
    703     raise ConnectionError(e, request=request)

ConnectionError: HTTPConnectionPool(host='rest', port=8181): Max retries exceeded with url: /v1/config (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0xffff7f744d90>: Failed to establish a new connection: [Errno -2] Name or service not known'))
make: *** [test-integration] Error 1
➜  iceberg-python git:(main) ✗ git diff
diff --git a/dev/docker-compose-integration.yml b/dev/docker-compose-integration.yml
index fccdcdc75..9a807fca3 100644
--- a/dev/docker-compose-integration.yml
+++ b/dev/docker-compose-integration.yml
@@ -41,7 +41,7 @@ services:
       - hive:hive
       - minio:minio
   rest:
-    image: tabulario/iceberg-rest
+    image: apache/iceberg-rest-adapter
     container_name: pyiceberg-rest
     networks:
       iceberg_net:

@ajantha-bhat (Member Author) commented Nov 7, 2024:

I tried to plug this into the PyIceberg integration tests, but it didn't work out of the box:

Let me debug and check it. There is nothing special in this image compared to the Tabular one (tabulario/iceberg-rest), so I will try to figure out what's going on.

@ajantha-bhat (Member Author) commented Nov 11, 2024:

@Fokko: Thanks for the review. When I excluded some dependencies from hadoop-common (like hadoop-auth), it failed at runtime.

I have fixed it and double-checked now; everything works fine. I also enabled the logging framework by default for this runtime jar (it will be helpful for users).
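As a hedged aside: "enabling the logging framework by default" for a runtime jar typically means shipping a concrete SLF4J backend. A minimal build.gradle sketch, where the slf4j-simple choice and coordinates are assumptions rather than this PR's actual change:

dependencies {
  // Assumed approach: any concrete SLF4J binding on the runtime classpath
  // replaces the silent NOP logger, so the server logs out of the box.
  // slf4j-simple and its version are illustrative choices, not the PR's.
  testFixturesRuntimeOnly 'org.slf4j:slf4j-simple:2.0.16'
}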

Steps to verify:

1. java -jar open-api/build/libs/iceberg-open-api-test-fixtures-runtime-1.8.0-SNAPSHOT.jar

2. /Users/ajantha/Downloads/spark-3.5.0-bin-hadoop3/bin/spark-sql \
--packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.0 \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.tck=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.defaultCatalog=tck \
--conf spark.sql.catalog.tck.uri=http://localhost:8181 \
--conf spark.sql.catalog.tck.type=rest \
--conf spark.sql.catalog.tck.warehouse=/Users/ajantha/Downloads/temp/wh

3. CREATE TABLE tck.nyc.taxis
(
  vendor_id bigint,
  trip_id bigint,
  trip_distance float,
  fare_amount double,
  store_and_fwd_flag string
)
PARTITIONED BY (vendor_id);


INSERT INTO tck.nyc.taxis
VALUES (1, 1000371, 1.8, 15.32, 'N'), (2, 1000372, 2.5, 22.15, 'N'), (2, 1000373, 0.9, 9.01, 'N'), (1, 1000374, 8.4, 42.13, 'Y');

SELECT * FROM tck.nyc.taxis;

@Fokko (Contributor) commented Nov 12, 2024:

Hey @ajantha-bhat, thanks for taking a stab at this again. My main goal is to use this Docker image in PyIceberg, Iceberg-Rust, and Iceberg-Go; those repositories still rely on a third-party container that we want to get rid of (I believe you also raised this earlier). I tried this, but it failed because the image didn't come with the AWS runtime:

pyiceberg-rest   | [main] INFO org.apache.iceberg.rest.RESTCatalogServer - Creating catalog with properties: {jdbc.password=password, s3.endpoint=http://minio:9000, jdbc.user=user, io-impl=org.apache.iceberg.aws.s3.S3FileIO, catalog-impl=org.apache.iceberg.jdbc.JdbcCatalog, jdbc.schema-version=V1, warehouse=s3://warehouse/, uri=jdbc:sqlite::memory:}
pyiceberg-rest   | [main] INFO org.apache.iceberg.CatalogUtil - Loading custom FileIO implementation: org.apache.iceberg.aws.s3.S3FileIO
pyiceberg-rest   | Exception in thread "main" java.lang.IllegalArgumentException: Cannot initialize FileIO implementation org.apache.iceberg.aws.s3.S3FileIO: Cannot find constructor for interface org.apache.iceberg.io.FileIO
pyiceberg-rest   | 	Missing org.apache.iceberg.aws.s3.S3FileIO [java.lang.NoClassDefFoundError: software/amazon/awssdk/core/exception/SdkException]
pyiceberg-rest   | 	at org.apache.iceberg.CatalogUtil.loadFileIO(CatalogUtil.java:356)
pyiceberg-rest   | 	at org.apache.iceberg.jdbc.JdbcCatalog.initialize(JdbcCatalog.java:132)
pyiceberg-rest   | 	at org.apache.iceberg.CatalogUtil.loadCatalog(CatalogUtil.java:274)
pyiceberg-rest   | 	at org.apache.iceberg.CatalogUtil.buildIcebergCatalog(CatalogUtil.java:328)
pyiceberg-rest   | 	at org.apache.iceberg.rest.RESTCatalogServer.initializeBackendCatalog(RESTCatalogServer.java:88)

This is because we store the metadata in MinIO, so the metadata is easily accessible outside of the container as well. How do you feel about adding the S3 runtime?
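For illustration, a hedged sketch of one way to get the S3 classes onto the adapter's classpath; the module name iceberg-aws-bundle is the existing AWS bundle artifact in this repository, but the configuration used here is an assumption, not necessarily what this PR ends up doing:

project(':iceberg-open-api') {
  dependencies {
    // Assumed change: the iceberg-aws-bundle module packages the AWS SDK v2,
    // so S3FileIO (software.amazon.awssdk.*) resolves at runtime.
    testFixturesRuntimeOnly project(':iceberg-aws-bundle')
  }
}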

Steps to run the tests:

git clone [email protected]:apache/iceberg-python.git
cd iceberg-python

Apply patch as described in #11283 (review):

➜  iceberg-python git:(main) ✗ git diff
diff --git a/dev/docker-compose-integration.yml b/dev/docker-compose-integration.yml
index fccdcdc75..9a807fca3 100644
--- a/dev/docker-compose-integration.yml
+++ b/dev/docker-compose-integration.yml
@@ -41,7 +41,7 @@ services:
       - hive:hive
       - minio:minio
   rest:
-    image: tabulario/iceberg-rest
+    image: apache/iceberg-rest-adapter
     container_name: pyiceberg-rest
     networks:
       iceberg_net:

And run the tests:

make install
make test-integration

Tail the logs using:

docker compose -f dev/docker-compose-integration.yml logs -f

@ajantha-bhat (Member Author) commented:

@Fokko: I have also added the GCP and Azure runtime dependencies.
Most of the tests pass; there are 2 failures, which I think @sungwy is looking into.

================================================================================== short test summary info ===================================================================================
FAILED tests/integration/test_writes/test_writes.py::test_create_table_transaction[session_catalog-2] - pydantic_core._pydantic_core.ValidationError: 1 validation error for TableResponse
FAILED tests/integration/test_writes/test_writes.py::test_create_table_with_non_default_values[session_catalog-2] - pydantic_core._pydantic_core.ValidationError: 1 validation error for TableResponse
==================================================== 2 failed, 814 passed, 8 skipped, 2853 deselected, 1451 warnings in 524.78s (0:08:44) ====================================================
make: *** [test-integration] Error 1
ajantha@Ajantha-Bhat-MacBook-Pro-16-inch-2023- iceberg-python % 

@Fokko (Contributor) commented Nov 14, 2024:

@ajantha-bhat Thanks for adding the runtime dependencies! 🙌

Yes, that looks like it will be fixed in apache/iceberg-python#1321 (review). I'll do some final checks, but I think this is ready 👍

@Fokko (Contributor) commented Nov 14, 2024:

The patch fixes it indeed 👍

==================================================================================== 816 passed, 8 skipped, 2853 deselected, 1455 warnings in 182.65s (0:03:02) =====================================================================================

Properties write.object-storage.enabled true
write.object-storage.path s3://iceberg-test-data/tpc/tpc-ds/3.2.0/1000/iceberg/customer/data

@Fokko (Contributor) commented Nov 14, 2024:

For a separate PR: can we document how we generate the LICENSE and NOTICE? I think we need to do that from time to time.

cc @ajantha-bhat @bryanck

Contributor:

Sure, I was planning to add a Gradle task to regenerate it soon.

@ajantha-bhat (Member Author) commented Nov 14, 2024:

Yes, what we need is a plugin or Gradle task, added for each runtime jar module, that automatically updates the NOTICE and LICENSE on every build. Doing this manually is not efficient.

For example, after an AWS dependency version bump by dependabot, new transitive dependencies may come in, and our NOTICE and LICENSE would never get updated for them.
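As a rough illustration of what such automation could build on, a third-party plugin like com.github.jk1.dependency-license-report can emit a license inventory from the dependency graph; the plugin choice, version, and configuration below are all assumptions, not something this PR or the tracking issue settles on:

import com.github.jk1.license.render.TextReportRenderer

plugins {
  // Assumed plugin/version; generates license reports from resolved dependencies.
  id 'com.github.jk1.dependency-license-report' version '2.9'
}

licenseReport {
  // Inspect only the classpath that actually ships in the runtime jar, and
  // render a plain-text report that could be diffed against LICENSE/NOTICE.
  configurations = ['runtimeClasspath']
  renderers = [new TextReportRenderer()]
}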

@ajantha-bhat (Member Author):

Created an issue to track this: #11559.

@Fokko (Contributor) left a comment:

Did some final checks on the licenses; it all looks good 👍 Thanks @ajantha-bhat for fixing this, @kevinjqliu, @findepi, and @mrcnc for the review, and thanks @bryanck for the help with the license generation.

@jbonofre (Member):

I'm doing a new pass on the license and content (Category A deps).
