- Upgraded `@apify/ps-tree` dependency (fixes "Error: spawn ps ENFILE"), upgraded other NPM packages
- Updated documentation and README, consolidated images.
- Added CONTRIBUTING.md
- Updated documentation and README.
- Bugfixes in `RequestQueueLocal`
- Updated documentation and README.
- Optimized autoscaled pool default configuration.
- BREAKING CHANGES IN AUTOSCALED POOL
  - It has been completely rebuilt for better performance.
  - It also now works locally.
  - See the Migration Guide for more information.
- Updated to [email protected]
- Bug fixes and documentation improvements.
- Upgraded Puppeteer to 1.8.0
- Upgraded NPM dependencies, fixed lint errors
- `Apify.main()` now sets the `APIFY_LOCAL_STORAGE_DIR` env var to a default value if neither `APIFY_LOCAL_STORAGE_DIR` nor `APIFY_TOKEN` is defined
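The defaulting logic above can be sketched roughly as follows (an illustrative sketch only, not the SDK's actual `Apify.main()` code; the default path `./apify_storage` is an assumption):

```javascript
// Sketch of the env-var defaulting described above. Illustrative only;
// the real implementation and the actual default path may differ.
function applyLocalStorageDefault(env) {
  if (!env.APIFY_LOCAL_STORAGE_DIR && !env.APIFY_TOKEN) {
    // Neither a local storage dir nor an API token is set, so fall back
    // to a local directory to make local development work out of the box.
    env.APIFY_LOCAL_STORAGE_DIR = './apify_storage';
  }
  return env;
}
```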
- Updated `DEFAULT_USER_AGENT` and `USER_AGENT_LIST`
- Added `recycleDiskCache` option to `PuppeteerPool` to enable reuse of disk cache and thus speed up browsing
- WARNING: the `APIFY_LOCAL_EMULATION_DIR` environment variable was renamed to `APIFY_LOCAL_STORAGE_DIR`.
- Environment variables `APIFY_DEFAULT_KEY_VALUE_STORE_ID`, `APIFY_DEFAULT_REQUEST_QUEUE_ID` and `APIFY_DEFAULT_DATASET_ID` now have the default value `default`, so there is no need to define them when developing locally.
- Added `compileScript()` function to `utils.puppeteer` to enable use of external scripts at runtime.
- Fixed persistent deprecation warning of `pageOpsTimeoutMillis`.
- Moved `cheerio` to dependencies.
- Fixed `keepDuplicateUrls` errors with persistent `RequestList`.
- Added `getInfo()` method to `Dataset` to get meta-information about a dataset.
- Added `CheerioCrawler`, a specialized class for crawling the web using `cheerio`.
- Added `keepDuplicateUrls` option to `RequestList` to allow duplicate URLs.
- Added `.abort()` method to all Crawler classes to enable stopping the crawl programmatically.
- Deprecated the `pageOpsTimeoutMillis` option. Use `handlePageTimeoutSecs` instead.
- Bluebird promises are being phased out of `apify` in favor of `async-await`.
- Added `log` to `Apify.utils` to improve the logging experience.
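To illustrate what the `keepDuplicateUrls` option toggles, here is a minimal de-duplication sketch (illustrative only; the real `RequestList` de-duplicates requests by their `uniqueKey`, not raw URL strings):

```javascript
// Sketch of URL de-duplication that an option like keepDuplicateUrls
// bypasses. Not the actual RequestList implementation.
function buildRequestList(urls, { keepDuplicateUrls = false } = {}) {
  if (keepDuplicateUrls) return urls.slice(); // keep everything, including duplicates
  const seen = new Set();
  const result = [];
  for (const url of urls) {
    if (seen.has(url)) continue; // drop duplicates by default
    seen.add(url);
    result.push(url);
  }
  return result;
}
```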
- Replaced git-hosted version of our fork of ps-tree with the `@apify/ps-tree` package
- Removed old unused `Apify.readyFreddy()` function
- Improved logging of URL and port in `PuppeteerLiveViewBrowser`.
- `PuppeteerCrawler`'s default page load timeout changed from 30 to 60 seconds.
- Added `Apify.utils.puppeteer.blockResources()` function
- More efficient implementation of the `getMemoryInfo` function
- Puppeteer upgraded to 1.7.0
- Upgraded NPM dependencies
- Dropped support for Node 7
- Fixed unresponsive magnifying glass and improved status tracking in LiveView frontend
- Fixed invalid URL parsing in RequestList.
- Added support for non-Latin language characters (unicode) in URLs.
- Added validation of payload size and automatic chunking to `dataset.pushData()`.
- Added support for all content types and their known extensions to `KeyValueStoreLocal`.
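The automatic chunking mentioned for `dataset.pushData()` can be modeled as splitting an array of records into batches whose serialized JSON stays under a byte limit (a simplified model; the SDK's actual limit and algorithm may differ):

```javascript
// Split records into batches whose JSON payload stays under maxBytes.
// Simplified model of payload chunking; not the SDK's actual code.
function chunkByPayloadSize(records, maxBytes) {
  const batches = [];
  let current = [];
  for (const record of records) {
    // A single record larger than the limit cannot be chunked at all.
    if (Buffer.byteLength(JSON.stringify(record)) > maxBytes) {
      throw new Error('Single record exceeds the payload limit');
    }
    const candidate = current.concat(record);
    if (Buffer.byteLength(JSON.stringify(candidate)) > maxBytes && current.length > 0) {
      batches.push(current); // current batch is full, start a new one
      current = [record];
    } else {
      current = candidate;
    }
  }
  if (current.length > 0) batches.push(current);
  return batches;
}
```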
- Puppeteer upgraded to 1.6.0.
- Removed `pageCloseTimeoutMillis` option from `PuppeteerCrawler` since it only affects debug logging.
- Fixed a bug where a failed `page.close()` in `PuppeteerPool` caused the request to be retried.
- Added `memory` parameter to `Apify.call()`.
- Added `PuppeteerPool.retire(browser)` method allowing to retire a browser before it reaches its limits. This is useful when its IP address gets blocked by anti-scraping protection.
- Added option `liveView: true` to `Apify.launchPuppeteer()` that starts a live view server providing a web page with an overview of all running Puppeteer instances and their screenshots.
- `PuppeteerPool` now kills open Chrome instances on the `SIGINT` signal.
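The retirement behavior can be modeled as per-browser request counting with a manual `retire()` escape hatch (a sketch under assumed semantics; not the real `PuppeteerPool` internals, and the class and method names here are hypothetical):

```javascript
// Model of per-browser request counting with manual retirement,
// mirroring the PuppeteerPool behavior described above. Illustrative only.
class BrowserPoolModel {
  constructor(retireAfterRequestCount) {
    this.retireAfterRequestCount = retireAfterRequestCount;
    this.counts = new Map();   // browser id -> number of requests handled
    this.retired = new Set();  // ids of browsers no longer given new requests
  }
  recordRequest(browserId) {
    const count = (this.counts.get(browserId) || 0) + 1;
    this.counts.set(browserId, count);
    if (count >= this.retireAfterRequestCount) this.retire(browserId);
  }
  retire(browserId) {
    // Triggered automatically at the limit, or called manually, e.g. when
    // the browser's IP address gets blocked by anti-scraping protection.
    this.retired.add(browserId);
  }
  isRetired(browserId) {
    return this.retired.has(browserId);
  }
}
```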
- Bugfix in `BasicCrawler`: native `Promise` doesn't have a `finally()` function
- Parameter `maxRequestsPerCrawl` added to `BasicCrawler` and `PuppeteerCrawler` classes.
- Reverted back: `Apify.getApifyProxyUrl()` again accepts `session` and `groups` options instead of `apifyProxySession` and `apifyProxyGroups`
- Parameter `memory` added to `Apify.call()`.
- `PseudoUrl` class can now contain a template for `Request` object creation, and a `PseudoUrl.createRequest()` method was added.
- Added `Apify.utils.puppeteer.enqueueLinks()` function which enqueues requests created from links matching given pseudo-URLs.
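Pseudo-URLs enclose regular-expression fragments in `[...]` while the rest of the string is matched literally. A minimal matcher can be sketched as follows (an illustration of the idea only, not the SDK's `PseudoUrl` implementation):

```javascript
// Convert a pseudo-URL such as 'http://example.com/[(\w|-)+]' into a RegExp:
// text outside [...] is escaped and matched literally, text inside is kept
// as a raw regex fragment. Illustrative sketch only.
function pseudoUrlToRegExp(purl) {
  let pattern = '^';
  let rest = purl;
  while (rest.length > 0) {
    const open = rest.indexOf('[');
    if (open === -1) {
      pattern += rest.replace(/[.*+?^${}()|\\\/]/g, '\\$&'); // escape literal tail
      break;
    }
    const close = rest.indexOf(']', open);
    pattern += rest.slice(0, open).replace(/[.*+?^${}()|\\\/]/g, '\\$&');
    pattern += rest.slice(open + 1, close); // raw regex fragment
    rest = rest.slice(close + 1);
  }
  return new RegExp(pattern + '$');
}
```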
- Added a 30s timeout to the `page.close()` operation in `PuppeteerCrawler`.
- Added `dataset.getData()`, `dataset.map()`, `dataset.forEach()` and `dataset.reduce()` functions.
- Added `delete()` method to `RequestQueue`, `Dataset` and `KeyValueStore` classes.
- Added `loggingIntervalMillis` option to `AutoscaledPool`
- Bugfix: `utils.isProduction` function was incorrect
- Added `RequestList.length()` function
- Bugfix in `RequestList`: skip invalid in-progress entries when restoring state
- Added `request.ignoreErrors` option. See documentation for more info.
- Bugfix in `Apify.utils.puppeteer.injectXxx` functions
- Puppeteer updated to v1.4.0
- Added `Apify.utils` and `Apify.utils.puppeteer` namespaces for various helper functions.
- Autoscaling feature of `AutoscaledPool`, `BasicCrawler` and `PuppeteerCrawler` is disabled on the Apify platform until all issues are resolved.
- Added `Apify.isAtHome()` function that returns `true` when code is running on the Apify platform and `false` otherwise (for example, locally).
- Added `ignoreMainProcess` parameter to `AutoscaledPool`. Check documentation for more info.
- `pageOpsTimeoutMillis` of `PuppeteerCrawler` increased to 300 seconds.
- Parameters `session` and `groups` of `getApifyProxyUrl()` renamed to `apifyProxySession` and `apifyProxyGroups` to match the naming of the same parameters in other classes.
- `RequestQueue` now caches known requests and their state to avoid unneeded API calls.
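The caching described above amounts to consulting a `uniqueKey → result` map before hitting the API (a simplified illustration; not the SDK's actual `RequestQueue` internals, and `CachingQueueClient` is a hypothetical name):

```javascript
// Sketch of caching known requests to avoid redundant API calls.
// Illustrative only; the real RequestQueue cache is more involved.
class CachingQueueClient {
  constructor(apiAddRequest) {
    this.apiAddRequest = apiAddRequest; // the underlying (expensive) API call
    this.cache = new Map();             // uniqueKey -> previous API result
    this.apiCalls = 0;
  }
  addRequest(request) {
    const key = request.uniqueKey || request.url;
    if (this.cache.has(key)) return this.cache.get(key); // cache hit: skip the API
    this.apiCalls += 1;
    const result = this.apiAddRequest(request);
    this.cache.set(key, result);
    return result;
  }
}
```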
- WARNING: `disableProxy` configuration of `PuppeteerCrawler` and `PuppeteerPool` removed. By default, no proxy is used. You must either use the new configuration `launchPuppeteerOptions.useApifyProxy = true` to use Apify Proxy or provide your own proxy via `launchPuppeteerOptions.proxyUrl`.
- WARNING: `groups` parameter of `PuppeteerCrawler` and `PuppeteerPool` removed. Use `launchPuppeteerOptions.apifyProxyGroups` instead.
- WARNING: `session` and `groups` parameters of `Apify.getApifyProxyUrl()` are now validated to contain only alphanumeric characters and underscores.
- `Apify.call()` now throws an `ApifyCallError` error if the run doesn't succeed.
- Renamed option `abortInstanceAfterRequestCount` of `PuppeteerPool` and `PuppeteerCrawler` to `retireInstanceAfterRequestCount`.
- Logs are now in plain text instead of JSON for better readability.
- WARNING: `AutoscaledPool` was completely redesigned. Check documentation for reference. It still supports previous configuration parameters for backwards compatibility, but in the future compatibility will break.
- `handleFailedRequestFunction` in both `BasicCrawler` and `PuppeteerCrawler` now also has the error object available in `ops.error`.
- Request queue storage type implemented. See documentation for more information.
- `BasicCrawler` and `PuppeteerCrawler` now support both `RequestList` and `RequestQueue`.
- `launchPuppeteer()` changes `User-Agent` only when in headless mode or if not using full Google Chrome, to reduce the chance of detection of the crawler.
- The Apify package now supports Node 7 and newer.
- `AutoscaledPool` now scales down less aggressively.
- `PuppeteerCrawler` and `BasicCrawler` now allow their underlying `AutoscaledPool` function `isFunction` to be overridden.
- New events `persistState` and `migrating` added. Check documentation of `Apify.events` for more information.
- `RequestList` has a new parameter `persistStateKey`. If this is used, then `RequestList` persists its state in the default key-value store at regular intervals.
- Improved `README.md` and the `/examples` directory.
- Added `useChrome` flag to `launchPuppeteer()` function
- Bugfixes in `RequestList`
- Removed again the `--disable-dev-shm-usage` flag when launching headless Chrome, as it might be causing issues with high IO overheads
- Upgraded Puppeteer to version 1.2.0
- Added `finishWhenEmpty` and `maybeRunPromiseIntervalMillis` options to the `AutoscaledPool` class.
- Fixed false positive errors logged by the `PuppeteerPool` class.
- Added back `--no-sandbox` to the launch of Puppeteer to avoid issues on older kernels
- If the `APIFY_XVFB` env var is set to `1`, then avoid headless mode and use Xvfb instead
- Updated `DEFAULT_USER_AGENT` to Linux Chrome
- Consolidated startup options for Chrome: use `--disable-dev-shm-usage`, skip `--no-sandbox`, use `--disable-gpu` only on Windows
- Updated docs and package description
- Puppeteer updated to `1.1.1`
- A lot of new stuff. Everything is backwards compatible. Check https://www.apify.com/docs/sdk/apify-runtime-js/latest for reference
- `Apify.setPromiseDependency()` / `Apify.getPromiseDependency()` / `Apify.getPromisePrototype()` removed
- A bunch of classes such as `AutoscaledPool` or `PuppeteerCrawler` added, check documentation
- Renamed GitHub repo
- Changed links to Travis CI
- Changed links to Apify GitHub repo
- `Apify.pushData()` added
- Upgraded puppeteer optional dependency to version `^1.0.0`
- Initial development, a lot of new stuff