Frequently Asked Questions
Your questions about data ingestion and batch processing answered
FAQs about general topics
-
How do I verify a Googlebot request is actually from Google?
Use a combination of reverse and forward DNS lookups. Run a reverse DNS lookup on the requesting IP address and check that the hostname is in the googlebot.com or google.com domain, then run a forward DNS lookup on that hostname and confirm it resolves back to the original IP. Google also publishes lists of its crawler IP ranges, which you can use for allowlisting.
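The two-step check can be sketched in Python as follows. This is a minimal illustration, not a hardened implementation: the function name and the injectable `reverse_lookup` / `forward_lookup` parameters are hypothetical conveniences so the logic can be exercised without live DNS, and the default forward lookup only handles IPv4. The suffix tuple covers googlebot.com and google.com, which Google documents for its crawlers.

```python
import socket

# Hostname suffixes accepted for Google crawlers (leading dot prevents
# look-alikes such as "evilgooglebot.com" from matching).
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if `ip` passes the reverse-then-forward DNS check."""
    # Default to real DNS; the parameters exist so the check can be tested offline.
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        # Step 1: reverse lookup must land in a Google-owned domain.
        hostname = reverse_lookup(ip).rstrip(".").lower()
        if not hostname.endswith(GOOGLE_SUFFIXES):
            return False
        # Step 2: forward lookup of that hostname must include the original IP.
        return ip in forward_lookup(hostname)
    except OSError:
        # socket.herror and socket.gaierror are OSError subclasses.
        return False
```

Checking the forward lookup matters because anyone who controls the reverse DNS zone for their own IP range can publish a hostname that ends in googlebot.com; only Google can make that hostname resolve back to the IP.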
-
Can robots.txt successfully block bad bots?
No. The robots.txt file is a governance mechanism, not a security tool. It relies on the crawler voluntarily respecting your directives. Legitimate bots (like Google or Bing) will honor it, but malicious scrapers and vulnerability scanners will ignore it, or even read it to discover the paths you are trying to hide.
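To see why compliance is voluntary, note that the check runs entirely inside the crawler. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content, the `ExampleBot` user agent, and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt that asks all crawlers to stay out of /admin/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def polite_fetch_allowed(user_agent, url):
    """What a well-behaved crawler asks itself before requesting `url`."""
    return parser.can_fetch(user_agent, url)


blocked = polite_fetch_allowed("ExampleBot", "https://example.com/admin/login")
allowed = polite_fetch_allowed("ExampleBot", "https://example.com/blog/")
```

A cooperative bot skips the request when the answer is `False`. Nothing on the server enforces that answer, so a scraper that never calls such a check is unaffected, and the `Disallow` lines themselves are readable by anyone.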
-
Why is AI traffic considered different from standard search crawlers?
AI traffic is often categorized differently because the intent and value exchange differ from traditional SEO. While search crawlers index content to drive user traffic to your site, AI training bots often ingest content to train proprietary models without guaranteeing attribution or referral traffic, changing the ROI calculation for hosting that traffic.
-
What is TLS fingerprinting in bot detection?
TLS fingerprinting (such as JA3 or JA4) analyzes the specific parameters of the TLS handshake, including cipher suites and extensions, to identify the client application. This helps identify bots that claim to be a standard web browser (like Chrome) but whose handshake matches a scripting client (like a Python HTTP library or curl).
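As a rough illustration of how a JA3-style fingerprint is built, the sketch below joins five ClientHello fields into a string and takes its MD5 digest. It assumes the fields have already been extracted from the handshake as decimal integers; parsing the ClientHello and dropping GREASE values, which real implementations do, are omitted. The sample values are illustrative, not captured from any real client.

```python
import hashlib


def ja3_string(tls_version, ciphers, extensions, curves, point_formats):
    """Build the JA3 input: five comma-separated fields, values joined by '-'."""
    list_fields = [ciphers, extensions, curves, point_formats]
    parts = [str(tls_version)]
    parts += ["-".join(str(v) for v in field) for field in list_fields]
    return ",".join(parts)


def ja3_hash(tls_version, ciphers, extensions, curves, point_formats):
    """Return the 32-character hex MD5 digest of the JA3 string."""
    raw = ja3_string(tls_version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(raw.encode("ascii")).hexdigest()


# Two hypothetical clients that send the same User-Agent header but
# offer different cipher suites in a different order.
client_a = ja3_hash(771, [4865, 4866, 4867], [0, 23, 65281], [29, 23, 24], [0])
client_b = ja3_hash(771, [4867, 4865], [0, 65281], [23, 24], [0])
```

Because the fingerprint depends on the order and contents of what the TLS library offers, the two clients hash differently even though their HTTP headers are identical.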
-
Why shouldn’t I use sampled logs for bot management?
Sampled logs frequently miss low-volume or distributed bot attacks that are designed to stay below detection thresholds. By analyzing only a fraction of your traffic, you lose the ability to correlate subtle patterns across different IP addresses and miss the outlier events that often signal a sophisticated account takeover or scraping campaign.
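A quick calculation shows the scale of the problem. The sketch assumes each request is kept independently with a fixed probability, which is a simplification of how real samplers work; the function name and the numbers are illustrative.

```python
def miss_probability(requests, sample_rate):
    """Chance that none of `requests` events survive uniform random sampling."""
    return (1.0 - sample_rate) ** requests


# A low-and-slow client that sends 50 requests, logged at a 1% sample rate.
p_invisible = miss_probability(50, 0.01)
```

Under these assumptions `p_invisible` is roughly 0.6, so the client leaves no trace at all in the sampled logs more often than not, and a campaign spread across many such clients is mostly invisible.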
-
How do I normalize log fields across multiple CDN providers?
Normalization requires mapping proprietary vendor schemas to a common format at ingest. For example, you must map Akamai’s statusCode and Fastly’s status to a single unified field. This ensures that incident-time queries return consistent results across your entire delivery stack without requiring manual translation during an active outage.
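A minimal sketch of ingest-time mapping might look like this. Only Akamai's `statusCode` and Fastly's `status` come from the answer above; the unified name `http_status`, the `cdn` tag, and the function itself are hypothetical choices for illustration.

```python
# Per-vendor mapping from source field name to unified field name.
FIELD_MAP = {
    "akamai": {"statusCode": "http_status"},
    "fastly": {"status": "http_status"},
}


def normalize(vendor, record):
    """Translate one raw log record into the unified schema."""
    unified = {"cdn": vendor}
    for source_field, unified_field in FIELD_MAP[vendor].items():
        if source_field in record:
            # Vendors may emit the status as a string or a number; store an int.
            unified[unified_field] = int(record[source_field])
    return unified


rows = [
    normalize("akamai", {"statusCode": "503"}),
    normalize("fastly", {"status": 503}),
]
```

After normalization both rows carry `http_status == 503`, so a single query filter works across providers.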
-
What is the typical retention limit for native CDN dashboards?
Most native CDN tools cap queryable log retention at 7 to 30 days. While some providers offer longer archival storage, query performance often degrades significantly outside the standard window. For capacity planning and seasonal trend analysis, teams typically require at least 15 months of high-performance, hot data retention.
-
Can I integrate unified CDN reporting with Grafana or Kibana?
Yes, most unified reporting layers provide standard SQL interfaces or APIs that connect directly to visualization tools like Grafana. This allows teams to view normalized CDN metrics alongside application-level data. By centralizing these streams, infrastructure and security teams can correlate edge performance with origin health in a single pane of glass.
-
How do data residency rules affect CDN log storage?
Regulations like GDPR often mandate that raw logs containing client IPs stay within their originating region. To maintain compliance, reporting architectures should aggregate and anonymize telemetry locally before exporting it to a central dashboard. This privacy-preserving approach allows for global visibility without transferring sensitive, regulated data across borders.
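One way to picture local aggregation and anonymization is to truncate client IPs to a network prefix and export only per-network counts. This is a sketch: the prefix lengths (/24 for IPv4, /48 for IPv6), the record field names, and the function names are assumptions, and whether truncation is sufficient for a given regulation is a legal question this example does not answer.

```python
import ipaddress
from collections import Counter


def anonymize_ip(ip):
    """Zero the host bits of an address, keeping only its network prefix."""
    addr = ipaddress.ip_address(ip)
    prefix = 24 if addr.version == 4 else 48  # assumed prefix lengths
    network = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
    return str(network.network_address)


def regional_summary(records):
    """Aggregate raw records into counts keyed by (truncated IP, status)."""
    counts = Counter((anonymize_ip(r["client_ip"]), r["status"]) for r in records)
    return [
        {"client_net": net, "status": status, "requests": n}
        for (net, status), n in sorted(counts.items())
    ]


raw = [
    {"client_ip": "203.0.113.77", "status": 200},
    {"client_ip": "203.0.113.9", "status": 200},
    {"client_ip": "198.51.100.4", "status": 404},
]
summary = regional_summary(raw)
```

Only `summary` would leave the region; the raw records with full client IPs stay in local storage.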
FAQs about collection
-
Does Hydrolix support batch processing?
Yes, Hydrolix fully supports batch processing. This collection feature allows you to efficiently load data from a storage bucket into a target table, ensuring you can work with your data at scale.
Hydrolix offers two mechanisms for batch ingestion: the Batch Job API, which handles one-off tasks for loading one or more files based on job configurations, and Batch Auto-Ingest, which continuously ingests new files arriving in a storage bucket.
Supported data formats include CSV and JSON. Please note that Hydrolix requires read permissions to access external storage buckets for batch ingestion.
-
Does Hydrolix support data ingest from AWS Kinesis?
Yes, Hydrolix supports AWS Kinesis. You can ingest data into Hydrolix with AWS Kinesis, making it easy to work with your real-time streaming data.
-
Does Hydrolix support data ingest from Apache Kafka?
Yes, Hydrolix integrates with Apache Kafka. Hydrolix Projects and Tables can continuously ingest data from one or more Kafka-based streaming sources.
-
What data collection processes does Hydrolix support?
Hydrolix supports streaming data ingestion through the Stream API and batch ingestion through the Batch API. There are also special connectors for Apache Kafka and AWS Kinesis.
Check out data collection in the Hydrolix documentation.
FAQs about query
-
Does Hydrolix use SQL, or does it have a vendor-specific query language?
Hydrolix uses an ANSI-compliant SQL interface. This interface uses the syntax and some of the SQL engine of ClickHouse. All standard features, including the interface API for querying data, are supported.
-
What are Hydrolix query pools, and how do they help businesses scale?
Because Hydrolix query infrastructure is decoupled from storage and collection, you can quickly scale query pools up or down. Small query pools give you consistent low-cost queries, while large pools give you consistent low-latency queries.
You can create separate query pools for different groups of users. For example, you might configure separate sandboxes for administrator, interactive analyst, and monitoring queries.
Query pools scale independently, so the capacity of each pool can adjust automatically to satisfy demand. You can even scale an entire pool to zero when demand is negligible, for example over the weekend when staff do not need to access data. When demand returns, you can scale the pool back up within minutes.
-
How does Hydrolix make queries more efficient than other cloud data platforms?
Hydrolix improves query efficiency compared to other cloud data platforms through its unique architecture. Our decoupled and stateless design separates ingest and query resources from storage, allowing us to focus on efficiently handling high-cardinality and high-dimensionality data.
Here’s how our architecture achieves query efficiency:
- Scalable Query Pools: Hydrolix enables you to scale query resources independently, ensuring consistently low-latency queries as your data workload grows.
- Partition Metadata: We utilize partition metadata to speed up time-based queries, which is particularly beneficial for time-series data analysis.
- Full Column Indexing: Hydrolix leverages full-column indexing, which optimizes query performance by swiftly locating the necessary data.
- Predicate Pushdown: Our platform efficiently filters datasets using predicate pushdown, further enhancing query efficiency.
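The idea behind using partition metadata for time-based queries can be illustrated generically: if each partition records the minimum and maximum timestamp it holds, a query with a time filter can skip any partition whose range does not overlap. The sketch below is a conceptual example only and does not describe Hydrolix internals; the `Partition` class and `prune` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    name: str
    min_ts: int  # earliest event timestamp in the partition (epoch seconds)
    max_ts: int  # latest event timestamp in the partition (epoch seconds)


def prune(partitions, start, end):
    """Keep only partitions whose time range overlaps [start, end]."""
    return [p for p in partitions if p.max_ts >= start and p.min_ts <= end]


catalog = [
    Partition("p0", 0, 999),
    Partition("p1", 1000, 1999),
    Partition("p2", 2000, 2999),
]
to_scan = prune(catalog, start=1500, end=2100)
```

Here only `p1` and `p2` need to be read. The filter is applied to metadata before any data is fetched, which is the same principle as pushing a predicate down to the storage layer.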
FAQs about retention
-
Does Hydrolix use tiered storage to manage data?
No, Hydrolix doesn’t distinguish between hot, warm, or cold data. Queries against all data, whether minutes or years old, deliver sub-second performance, ensuring all of your data remains accessible.
Because Hydrolix combines high-density compression technology with decoupled storage, the cost to deliver low-latency queries on your data, regardless of age, is 4x lower than other databases.
-
What does zero-egress mean?
In the context of Hydrolix, zero-egress means that when you deploy our solution on-premises, you have complete control over your data, and no additional egress costs are incurred.
This allows you to manage your data efficiently and cost-effectively within your own infrastructure.
-
What benefits does Hydrolix offer in terms of data retention and storage costs?
Hydrolix provides a cost-effective data retention solution with patented high-density compression technology. This technology lets you keep all your data online without offloading or sampling. And because of reduced storage costs, you can retain data for analysis, compliance, and security, eliminating the trade-off between data retention and cost.
You also get the additional benefit of reducing your environmental footprint by reducing the storage infrastructure required to store massive datasets.