REST: Dockerfile for REST catalog adapter image #11283
base: main
Conversation
Force-pushed from ed2ba10 to 56ee5ee
Force-pushed from b497524 to e8a7a36
Force-pushed from d039972 to 65be33f
Force-pushed from 65be33f to d8b98e3
@@ -985,6 +985,15 @@ project(':iceberg-open-api') {
    exclude group: 'org.apache.commons', module: 'commons-configuration2'
    exclude group: 'org.apache.hadoop.thirdparty', module: 'hadoop-shaded-protobuf_3_7'
    exclude group: 'org.eclipse.jetty'
    exclude group: 'com.google.re2j', module: 're2j'
Excluded a few more dependencies that are unrelated to the `Configuration` class. These were not included in the LICENSE either.
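As a quick sanity check on exclusions like the one above, one could scan the LICENSE text for every coordinate that remains bundled. This is a hypothetical helper, not part of this PR; the coordinates and LICENSE snippet below are illustrative only.

```python
# Hypothetical helper: given the Maven coordinates that remain bundled in the
# runtime jar and the text of the LICENSE file, report coordinates whose
# artifact id the LICENSE never mentions.

def missing_from_license(coordinates, license_text):
    """Return coordinates whose artifact id does not appear in license_text."""
    missing = []
    for coord in coordinates:
        # A coordinate looks like 'group:artifact'; match on the artifact id.
        artifact = coord.split(":")[-1]
        if artifact not in license_text:
            missing.append(coord)
    return missing

license_text = """
This product bundles commons-codec.
This product bundles jetty-server.
"""

coords = ["commons-codec:commons-codec", "com.google.re2j:re2j"]
print(missing_from_license(coords, license_text))  # → ['com.google.re2j:re2j']
```

A dependency that is excluded from the jar should also disappear from this list once the LICENSE is regenerated.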
I tried to plug this into the PyIceberg integration tests, but it didn't work out of the box:
iceberg-python git:(main) ✗ make test-integration
docker compose -f dev/docker-compose-integration.yml kill
WARN[0000] /Users/fokko.driesprong/work/iceberg-python/dev/docker-compose-integration.yml: the attribute `version` is obsolete, it will be ignored, please remove it to avoid potential confusion
[+] Killing 4/4
✔ Container pyiceberg-spark Killed 0.2s
✔ Container pyiceberg-mc Killed 0.2s
✔ Container pyiceberg-minio Killed 0.3s
✔ Container hive Killed 0.2s
docker compose -f dev/docker-compose-integration.yml rm -f
WARN[0000] /Users/fokko.driesprong/work/iceberg-python/dev/docker-compose-integration.yml: the attribute `version` is obsolete, it will be ignored, please remove it to avoid potential confusion
Going to remove pyiceberg-spark, pyiceberg-mc, hive, pyiceberg-rest, pyiceberg-minio
[+] Removing 5/0
✔ Container pyiceberg-minio Removed 0.0s
✔ Container pyiceberg-spark Removed 0.0s
✔ Container hive Removed 0.0s
✔ Container pyiceberg-mc Removed 0.0s
✔ Container pyiceberg-rest Removed 0.0s
docker compose -f dev/docker-compose-integration.yml up -d
WARN[0000] /Users/fokko.driesprong/work/iceberg-python/dev/docker-compose-integration.yml: the attribute `version` is obsolete, it will be ignored, please remove it to avoid potential confusion
[+] Running 5/5
✔ Container pyiceberg-rest Started 0.2s
✔ Container hive Started 0.2s
✔ Container pyiceberg-minio Started 0.2s
✔ Container pyiceberg-mc Started 0.3s
✔ Container pyiceberg-spark Started 0.3s
sleep 10
docker compose -f dev/docker-compose-integration.yml cp ./dev/provision.py spark-iceberg:/opt/spark/provision.py
WARN[0000] /Users/fokko.driesprong/work/iceberg-python/dev/docker-compose-integration.yml: the attribute `version` is obsolete, it will be ignored, please remove it to avoid potential confusion
[+] Copying 1/0
✔ pyiceberg-spark copy ./dev/provision.py to pyiceberg-spark:/opt/spark/provision.py Copied 0.0s
docker compose -f dev/docker-compose-integration.yml exec -T spark-iceberg ipython ./provision.py
WARN[0000] /Users/fokko.driesprong/work/iceberg-python/dev/docker-compose-integration.yml: the attribute `version` is obsolete, it will be ignored, please remove it to avoid potential confusion
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/11/06 21:48:06 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
---------------------------------------------------------------------------
gaierror Traceback (most recent call last)
File /usr/local/lib/python3.9/site-packages/urllib3/connection.py:174, in HTTPConnection._new_conn(self)
173 try:
--> 174 conn = connection.create_connection(
175 (self._dns_host, self.port), self.timeout, **extra_kw
176 )
178 except SocketTimeout:
File /usr/local/lib/python3.9/site-packages/urllib3/util/connection.py:72, in create_connection(address, timeout, source_address, socket_options)
68 return six.raise_from(
69 LocationParseError(u"'%s', label empty or too long" % host), None
70 )
---> 72 for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
73 af, socktype, proto, canonname, sa = res
File /usr/local/lib/python3.9/socket.py:954, in getaddrinfo(host, port, family, type, proto, flags)
953 addrlist = []
--> 954 for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
955 af, socktype, proto, canonname, sa = res
gaierror: [Errno -2] Name or service not known
During handling of the above exception, another exception occurred:
NewConnectionError Traceback (most recent call last)
File /usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py:715, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
714 # Make the request on the httplib connection object.
--> 715 httplib_response = self._make_request(
716 conn,
717 method,
718 url,
719 timeout=timeout_obj,
720 body=body,
721 headers=headers,
722 chunked=chunked,
723 )
725 # If we're going to release the connection in ``finally:``, then
726 # the response doesn't need to know about the connection. Otherwise
727 # it will also try to release it and we'll have a double-release
728 # mess.
File /usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py:416, in HTTPConnectionPool._make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
415 else:
--> 416 conn.request(method, url, **httplib_request_kw)
418 # We are swallowing BrokenPipeError (errno.EPIPE) since the server is
419 # legitimately able to close the connection after sending a valid response.
420 # With this behaviour, the received response is still readable.
File /usr/local/lib/python3.9/site-packages/urllib3/connection.py:244, in HTTPConnection.request(self, method, url, body, headers)
243 headers["User-Agent"] = _get_default_user_agent()
--> 244 super(HTTPConnection, self).request(method, url, body=body, headers=headers)
File /usr/local/lib/python3.9/http/client.py:1285, in HTTPConnection.request(self, method, url, body, headers, encode_chunked)
1284 """Send a complete request to the server."""
-> 1285 self._send_request(method, url, body, headers, encode_chunked)
File /usr/local/lib/python3.9/http/client.py:1331, in HTTPConnection._send_request(self, method, url, body, headers, encode_chunked)
1330 body = _encode(body, 'body')
-> 1331 self.endheaders(body, encode_chunked=encode_chunked)
File /usr/local/lib/python3.9/http/client.py:1280, in HTTPConnection.endheaders(self, message_body, encode_chunked)
1279 raise CannotSendHeader()
-> 1280 self._send_output(message_body, encode_chunked=encode_chunked)
File /usr/local/lib/python3.9/http/client.py:1040, in HTTPConnection._send_output(self, message_body, encode_chunked)
1039 del self._buffer[:]
-> 1040 self.send(msg)
1042 if message_body is not None:
1043
1044 # create a consistent interface to message_body
File /usr/local/lib/python3.9/http/client.py:980, in HTTPConnection.send(self, data)
979 if self.auto_open:
--> 980 self.connect()
981 else:
File /usr/local/lib/python3.9/site-packages/urllib3/connection.py:205, in HTTPConnection.connect(self)
204 def connect(self):
--> 205 conn = self._new_conn()
206 self._prepare_conn(conn)
File /usr/local/lib/python3.9/site-packages/urllib3/connection.py:186, in HTTPConnection._new_conn(self)
185 except SocketError as e:
--> 186 raise NewConnectionError(
187 self, "Failed to establish a new connection: %s" % e
188 )
190 return conn
NewConnectionError: <urllib3.connection.HTTPConnection object at 0xffff7f744d90>: Failed to establish a new connection: [Errno -2] Name or service not known
During handling of the above exception, another exception occurred:
MaxRetryError Traceback (most recent call last)
File /usr/local/lib/python3.9/site-packages/requests/adapters.py:667, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
666 try:
--> 667 resp = conn.urlopen(
668 method=request.method,
669 url=url,
670 body=request.body,
671 headers=request.headers,
672 redirect=False,
673 assert_same_host=False,
674 preload_content=False,
675 decode_content=False,
676 retries=self.max_retries,
677 timeout=timeout,
678 chunked=chunked,
679 )
681 except (ProtocolError, OSError) as err:
File /usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py:801, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
799 e = ProtocolError("Connection aborted.", e)
--> 801 retries = retries.increment(
802 method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
803 )
804 retries.sleep()
File /usr/local/lib/python3.9/site-packages/urllib3/util/retry.py:594, in Retry.increment(self, method, url, response, error, _pool, _stacktrace)
593 if new_retry.is_exhausted():
--> 594 raise MaxRetryError(_pool, url, error or ResponseError(cause))
596 log.debug("Incremented Retry for (url='%s'): %r", url, new_retry)
MaxRetryError: HTTPConnectionPool(host='rest', port=8181): Max retries exceeded with url: /v1/config (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0xffff7f744d90>: Failed to establish a new connection: [Errno -2] Name or service not known'))
During handling of the above exception, another exception occurred:
ConnectionError Traceback (most recent call last)
File /opt/spark/provision.py:28
23 from pyiceberg.types import FixedType, NestedField, UUIDType
25 spark = SparkSession.builder.getOrCreate()
27 catalogs = {
---> 28 'rest': load_catalog(
29 "rest",
30 **{
31 "type": "rest",
32 "uri": "http://rest:8181",
33 "s3.endpoint": "http://minio:9000",
34 "s3.access-key-id": "admin",
35 "s3.secret-access-key": "password",
36 },
37 ),
38 'hive': load_catalog(
39 "hive",
40 **{
41 "type": "hive",
42 "uri": "http://hive:9083",
43 "s3.endpoint": "http://minio:9000",
44 "s3.access-key-id": "admin",
45 "s3.secret-access-key": "password",
46 },
47 ),
48 }
50 for catalog_name, catalog in catalogs.items():
51 spark.sql(
52 f"""
53 CREATE DATABASE IF NOT EXISTS {catalog_name}.default;
54 """
55 )
File /usr/local/lib/python3.9/site-packages/pyiceberg/catalog/__init__.py:258, in load_catalog(name, **properties)
255 catalog_type = infer_catalog_type(name, conf)
257 if catalog_type:
--> 258 return AVAILABLE_CATALOGS[catalog_type](name, cast(Dict[str, str], conf))
260 raise ValueError(f"Could not initialize catalog with the following properties: {properties}")
File /usr/local/lib/python3.9/site-packages/pyiceberg/catalog/__init__.py:133, in load_rest(name, conf)
130 def load_rest(name: str, conf: Properties) -> Catalog:
131 from pyiceberg.catalog.rest import RestCatalog
--> 133 return RestCatalog(name, **conf)
File /usr/local/lib/python3.9/site-packages/pyiceberg/catalog/rest.py:237, in RestCatalog.__init__(self, name, **properties)
235 super().__init__(name, **properties)
236 self.uri = properties[URI]
--> 237 self._fetch_config()
238 self._session = self._create_session()
File /usr/local/lib/python3.9/site-packages/pyiceberg/catalog/rest.py:334, in RestCatalog._fetch_config(self)
331 params[WAREHOUSE_LOCATION] = warehouse_location
333 with self._create_session() as session:
--> 334 response = session.get(self.url(Endpoints.get_config, prefixed=False), params=params)
335 try:
336 response.raise_for_status()
File /usr/local/lib/python3.9/site-packages/requests/sessions.py:602, in Session.get(self, url, **kwargs)
594 r"""Sends a GET request. Returns :class:`Response` object.
595
596 :param url: URL for the new :class:`Request` object.
597 :param \*\*kwargs: Optional arguments that ``request`` takes.
598 :rtype: requests.Response
599 """
601 kwargs.setdefault("allow_redirects", True)
--> 602 return self.request("GET", url, **kwargs)
File /usr/local/lib/python3.9/site-packages/requests/sessions.py:589, in Session.request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
584 send_kwargs = {
585 "timeout": timeout,
586 "allow_redirects": allow_redirects,
587 }
588 send_kwargs.update(settings)
--> 589 resp = self.send(prep, **send_kwargs)
591 return resp
File /usr/local/lib/python3.9/site-packages/requests/sessions.py:703, in Session.send(self, request, **kwargs)
700 start = preferred_clock()
702 # Send the request
--> 703 r = adapter.send(request, **kwargs)
705 # Total elapsed time of the request (approximately)
706 elapsed = preferred_clock() - start
File /usr/local/lib/python3.9/site-packages/requests/adapters.py:700, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
696 if isinstance(e.reason, _SSLError):
697 # This branch is for urllib3 v1.22 and later.
698 raise SSLError(e, request=request)
--> 700 raise ConnectionError(e, request=request)
702 except ClosedPoolError as e:
703 raise ConnectionError(e, request=request)
ConnectionError: HTTPConnectionPool(host='rest', port=8181): Max retries exceeded with url: /v1/config (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0xffff7f744d90>: Failed to establish a new connection: [Errno -2] Name or service not known'))
make: *** [test-integration] Error 1
➜ iceberg-python git:(main) ✗ git diff
diff --git a/dev/docker-compose-integration.yml b/dev/docker-compose-integration.yml
index fccdcdc75..9a807fca3 100644
--- a/dev/docker-compose-integration.yml
+++ b/dev/docker-compose-integration.yml
@@ -41,7 +41,7 @@ services:
- hive:hive
- minio:minio
rest:
- image: tabulario/iceberg-rest
+ image: apache/iceberg-rest-adapter
container_name: pyiceberg-rest
networks:
iceberg_net:
Let me debug and check it. There is nothing special in this image compared to the Tabular (tabulario/iceberg-rest) image. I will try to figure it out.
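The `gaierror: [Errno -2] Name or service not known` in the traceback above means the hostname `rest` never resolved inside the Compose network, i.e. the container was most likely not up (or had already exited) when `provision.py` ran. A small wait-for-catalog helper along these lines (hypothetical, not part of the PR; the `fetch` callable is injected so the logic is testable without a running catalog) makes such failures easier to diagnose than a raw traceback:

```python
import time

def wait_for_catalog(fetch, attempts=5, delay=0.0):
    """Call fetch() up to `attempts` times, sleeping `delay` between tries.

    Returns the first successful result, or re-raises the last error. In real
    use, fetch might be something like
    lambda: requests.get("http://rest:8181/v1/config", timeout=2).
    """
    last_error = None
    for _ in range(attempts):
        try:
            return fetch()
        except ConnectionError as exc:  # DNS/connect failures surface here
            last_error = exc
            time.sleep(delay)
    raise last_error

# Simulated catalog that only comes up on the third attempt.
calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("Name or service not known")
    return {"defaults": {}, "overrides": {}}

print(wait_for_catalog(flaky_fetch))  # → {'defaults': {}, 'overrides': {}}
```

If the catalog never answers within the deadline, the helper re-raises the last connection error, so the root cause (the unresolved `rest` hostname) still surfaces.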
Force-pushed from e0c357e to a69b02f
Force-pushed from a69b02f to 8c89087
@Fokko: Thanks for the review. When I excluded some dependencies from hadoop-common (like hadoop-auth), it failed at runtime. I have fixed it and double-checked now; everything works fine. Steps to verify.
Force-pushed from 8c89087 to 15baa13
Hey @ajantha-bhat, thanks for taking a stab at this again. My main goal is to use this Docker image in PyIceberg, Iceberg-Rust, and Iceberg-Go; these repositories still rely on a third-party container that we want to get rid of (I believe you also raised this earlier). I tried this, but it failed because the image didn't come with the AWS runtime.

This is because we store the metadata in MinIO, so the metadata is easily accessible outside of the container as well. How do you feel about adding the S3 runtime?

Steps to run the tests:

git clone [email protected]:apache/iceberg-python.git
cd iceberg-python

Apply the patch as described in #11283 (review), then run the tests:

make install
make test-integration

Tail the logs using:

docker compose -f dev/docker-compose-integration.yml logs -f
@Fokko: I have also added the GCP and Azure runtime dependencies.
@ajantha-bhat Thanks for adding the runtime dependencies! 🙌 Yes, that looks like it will be fixed in apache/iceberg-python#1321 (review). I'll do some final checks, but I think this is ready 👍
The patch fixes it indeed 👍
```
Properties  write.object-storage.enabled  true
            write.object-storage.path     s3://iceberg-test-data/tpc/tpc-ds/3.2.0/1000/iceberg/customer/data
```
For a separate PR, can we document how we generate the `LICENSE` and `NOTICE`? I think we need to do that from time to time.
Sure, I was planning to add the Gradle task to regenerate it soon.
Yes, what we need is a plugin or Gradle task for each runtime jar module that automatically updates the NOTICE and LICENSE on every build. Doing it manually is not efficient: for example, an AWS dependency version bump by Dependabot may bring in new transitive dependencies, and our NOTICE and LICENSE would never get updated for them.
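The automation suggested here boils down to diffing the resolved dependency list against the committed NOTICE on every build. A hypothetical sketch of that check (the one-coordinate-per-line NOTICE format below is an assumption for illustration; the real Iceberg NOTICE differs):

```python
# Hypothetical build check: report resolved runtime dependencies that the
# committed NOTICE has not been regenerated for.

def undocumented_dependencies(resolved, notice_lines):
    """Return resolved coordinates absent from the NOTICE, sorted."""
    documented = {line.strip() for line in notice_lines}
    return sorted(set(resolved) - documented)

notice = ["software.amazon.awssdk:s3", "org.apache.hadoop:hadoop-common"]
# Dependabot bumps the AWS SDK and it starts pulling in a new module:
resolved = [
    "software.amazon.awssdk:s3",
    "software.amazon.awssdk:crt-core",
    "org.apache.hadoop:hadoop-common",
]

print(undocumented_dependencies(resolved, notice))
# → ['software.amazon.awssdk:crt-core']
```

A Gradle task could fail the build when this list is non-empty, forcing the NOTICE and LICENSE to be regenerated alongside the dependency bump.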
Created an issue to track this: #11559.
Did some final checks on the licenses; it all looks good 👍 Thanks @ajantha-bhat for fixing this, @kevinjqliu, @findepi, @mrcnc for the review, and thanks @bryanck for the help around the license generation.

I'm doing a new pass on the license and content (Category A dependencies).
Depends on #11279.