
Questions tagged [hadoop]

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

0 votes
1 answer
106 views

I have mainly three groups of CSV files (each file is divided into several small files): the first group of CSV files totals 600+ GB (maybe 200+ GB if stored as ints, since CSV stores numbers as characters, right?), ...
heisthere • 101
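The size intuition in the excerpt above, that numeric CSV data shrinks when stored in binary, can be checked with a quick sketch (the example value is illustrative):

```python
import struct

n = 1234567890
csv_bytes = len(str(n))             # CSV stores the digits as characters: 10 bytes
bin_bytes = struct.calcsize("<i")   # the same value as a 32-bit little-endian int: 4 bytes

print(csv_bytes, bin_bytes)  # 10 4
```

So a column of large integers can shrink to roughly a third of its CSV size in a fixed-width binary encoding, which is consistent with the 600+ GB vs. 200+ GB estimate in the question.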
3 votes
2 answers
6k views

I have been exploring the microservice architecture for a batch-based system. Here is our current setup: Code: We have 5 systems that are internally connected and pass data from one system ...
SMaZ • 89
3 votes
4 answers
2k views

When developing Big Data processing pipelines and storage, you probably come across software that is more or less part of the Hadoop ecosystem. Be it Hadoop itself, Spark/Flink, HBase, Kafka, Accumulo, ...
flowit • 237
1 vote
1 answer
961 views

I want to ask for a review of my big data app plan. I don't have much experience in that field, so every single piece of advice would be appreciated. Here is a link to a diagram of the architecture: My ...
Alan Mroczek
7 votes
4 answers
2k views

When people talk about MapReduce, you think of Google and Hadoop. But what is MapReduce itself? How does it work? I came across this blog post that tries to explain just MapReduce without Hadoop, ...
Eddie Bravo
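The model the question asks about can be sketched without Hadoop at all: a map phase emitting key-value pairs, a shuffle grouping them by key, and a reduce phase folding each group. A minimal word-count sketch in plain Python (function names are illustrative, not Hadoop API):

```python
from collections import defaultdict

def map_phase(doc):
    # map: emit (key, value) pairs, here (word, 1) per word
    for word in doc.split():
        yield word, 1

def shuffle(pairs):
    # shuffle: group all emitted values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # reduce: fold one key's values into a single result
    return key, sum(values)

docs = ["the quick fox", "the lazy dog"]
pairs = (p for d in docs for p in map_phase(d))
result = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
print(result)  # {'the': 2, 'quick': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

What Hadoop adds on top of this three-step model is distribution: maps run in parallel across the cluster, and the shuffle moves pairs over the network so each reducer sees all values for its keys.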
1 vote
1 answer
287 views

I am tasked with redesigning an existing catalog processor, and the requirement goes as below: I have 5 to 10 vendors (each vendor can have multiple stores) who would provide me with 'XML' ...
Arun • 19
2 votes
2 answers
151 views

We have a codebase at work that: Ingests (low) thousands of small files. Each of these input files contains about 50k “micro-items”. These “micro-items” are then clustered together to find “macro-...
Ivan • 575
-3 votes
1 answer
205 views

I have millions of tweets currently stored in HDFS and I plan to analyze them from Spark (Data mining, text mining, Frequent Term-Based Text Clustering, Social Network Analysis); however, I do not know ...
J Doe • 9
4 votes
1 answer
2k views

I’m currently building a dashboard to view some analytics about the data generated by my company's product. We use MySQL as our database. The SQL queries to generate the analytics from the raw live ...
Julien • 141
1 vote
0 answers
96 views

I've been working for some time with some researchers, developing a tool to fetch tweets from Twitter and process them in some way. The first prototype "worked" but became a pain as we used sockets to ...
David Moreno García
1 vote
1 answer
361 views

I am not a professional coder, but rather an engineer/mathematician that uses computer to solve numerical problems. So far most of my problems are math-related, such as solving large scale linear ...
user138668
4 votes
1 answer
1k views

In Hadoop, objects passed to reducers are reused. This is extremely surprising and hard to track down if you're not expecting it. Furthermore, the original tracker for this "feature" doesn't offer any ...
Andrew White
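The reuse behavior the question describes (Hadoop hands the reducer the same Writable instance on every iteration, mutating it in place) can be illustrated with a plain-Python analogy; the generator below is hypothetical and just mimics the recycled-buffer pattern:

```python
# Analogy for Hadoop's value reuse: the iterator hands back the SAME
# mutable object on every step; only its contents change.
def reused_values(rows):
    buf = {}
    for row in rows:
        buf.clear()
        buf.update(row)
        yield buf  # the same dict object every iteration

rows = [{"n": 1}, {"n": 2}, {"n": 3}]

naive = list(reused_values(rows))               # three references to one dict
safe = [dict(v) for v in reused_values(rows)]   # copy each value before keeping it

print(naive)  # [{'n': 3}, {'n': 3}, {'n': 3}] -- every entry shows the last value
print(safe)   # [{'n': 1}, {'n': 2}, {'n': 3}]
```

This is why naively collecting values inside a Hadoop `reduce()` yields a list of identical objects: the fix is to copy (in Java, clone or re-wrap the Writable) before storing a reference past the current iteration.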
2 votes
1 answer
8k views

We have a bunch of data (several TB) in Hadoop HDFS and it's growing. We want to create a dashboard that reports on the contents in there, e.g. counts of different types of objects, trends over time, etc. ...
kellyfj • 131
3 votes
2 answers
1k views

I have a problem I was hoping I could get some advice on! I have a LOT of text as input (about 20GB worth, not MASSIVE but big enough). This is just free text, unstructured. I have a 'category list'...
Duncan • 131
6 votes
2 answers
15k views

I have around 200 million new objects coming in per day and a 90-day retention policy, so that leaves me with 18 billion records to be stored in the form of key-value pairs. Key and value both will be a ...
Chaos • 187
