
Questions tagged [hadoop]

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

0 votes
1 answer
106 views

I have mainly three groups of CSV files (each file is divided into several small files): the first group of CSV files totals 600+ GB (maybe 200+ GB if stored as ints, since CSV stores numbers as characters, right?), ...
heisthere • 101
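The size intuition in the excerpt above, that numeric CSV data shrinks when stored in binary, can be checked with a quick sketch (the example value is illustrative):

```python
import struct

n = 1234567890
csv_bytes = len(str(n))             # CSV stores the digits as characters: 10 bytes
bin_bytes = struct.calcsize("<i")   # the same value as a 32-bit little-endian int: 4 bytes

print(csv_bytes, bin_bytes)  # 10 4
```

So a column of large integers can shrink to roughly a third of its CSV size in a fixed-width binary encoding, which is consistent with the 600+ GB vs. 200+ GB estimate in the question.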
3 votes
2 answers
6k views

I have been exploring the microservice architecture for a batch-based system. Here is our current setup: Code: We have 5 systems that are internally connected and pass data from one system ...
SMaZ • 89
3 votes
4 answers
2k views

When developing Big Data processing pipelines and storage, you probably come across software that is more or less part of the Hadoop ecosystem. Be it Hadoop itself, Spark/Flink, HBase, Kafka, Accumulo, ...
flowit • 237
1 vote
1 answer
961 views

I want to ask for a review of my big data app plan. I don't have much experience in that field, so every single piece of advice would be appreciated. Here is a link to a diagram of the architecture: My ...
Alan Mroczek
7 votes
4 answers
2k views

When people talk about MapReduce, you think of Google and Hadoop. But what is MapReduce itself? How does it work? I came across this blog post that tries to explain just MapReduce without Hadoop, ...
Eddie Bravo
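The model the question asks about can be sketched without Hadoop at all: a map phase emitting key-value pairs, a shuffle grouping them by key, and a reduce phase folding each group. A minimal word-count sketch in plain Python (function names are illustrative, not Hadoop API):

```python
from collections import defaultdict

def map_phase(doc):
    # map: emit (key, value) pairs, here (word, 1) per word
    for word in doc.split():
        yield word, 1

def shuffle(pairs):
    # shuffle: group all emitted values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # reduce: fold one key's values into a single result
    return key, sum(values)

docs = ["the quick fox", "the lazy dog"]
pairs = (p for d in docs for p in map_phase(d))
result = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
print(result)  # {'the': 2, 'quick': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

What Hadoop adds on top of this three-step model is distribution: maps run in parallel across the cluster, and the shuffle moves pairs over the network so each reducer sees all values for its keys.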
1 vote
1 answer
287 views

I am tasked with redesigning an existing catalog processor, and the requirement goes as below: I have 5 to 10 vendors (each vendor can have multiple stores) who would provide me with 'XML' ...
Arun • 19
2 votes
2 answers
151 views

We have a codebase at work that: Ingests (low) thousands of small files. Each of these input files contains about 50k “micro-items”. These “micro-items” are then clustered together to find “macro-...
Ivan • 575
-3 votes
1 answer
205 views

I have millions of tweets currently stored in HDFS and I plan to analyze them from Spark (Data mining, text mining, Frequent Term-Based Text Clustering, Social Network Analysis); however, I do not know ...
J Doe • 9
4 votes
1 answer
2k views

I’m currently building a dashboard to view some analytics about the data generated by my company's product. We use MySQL as our database. The SQL queries to generate the analytics from the raw live ...
Julien • 141
1 vote
0 answers
96 views

I've been working for some time with some researchers, developing a tool to fetch tweets from Twitter and process them in some way. The first prototype "worked" but became a pain as we used sockets to ...
David Moreno García
1 vote
1 answer
361 views

I am not a professional coder, but rather an engineer/mathematician that uses computer to solve numerical problems. So far most of my problems are math-related, such as solving large scale linear ...
user138668
4 votes
1 answer
1k views

In Hadoop, objects passed to reducers are reused. This is extremely surprising and hard to track down if you're not expecting it. Furthermore, the original tracker for this "feature" doesn't offer any ...
Andrew White
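The reuse behavior the question describes (Hadoop hands the reducer the same Writable instance on every iteration, mutating it in place) can be illustrated with a plain-Python analogy; the generator below is hypothetical and just mimics the recycled-buffer pattern:

```python
# Analogy for Hadoop's value reuse: the iterator hands back the SAME
# mutable object on every step; only its contents change.
def reused_values(rows):
    buf = {}
    for row in rows:
        buf.clear()
        buf.update(row)
        yield buf  # the same dict object every iteration

rows = [{"n": 1}, {"n": 2}, {"n": 3}]

naive = list(reused_values(rows))               # three references to one dict
safe = [dict(v) for v in reused_values(rows)]   # copy each value before keeping it

print(naive)  # [{'n': 3}, {'n': 3}, {'n': 3}] -- every entry shows the last value
print(safe)   # [{'n': 1}, {'n': 2}, {'n': 3}]
```

This is why naively collecting values inside a Hadoop `reduce()` yields a list of identical objects: the fix is to copy (in Java, clone or re-wrap the Writable) before storing a reference past the current iteration.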
2 votes
1 answer
8k views

We have a bunch of data (several TB) in Hadoop HDFS and it's growing. We want to create a dashboard that reports on the contents in there, e.g. counts of different types of objects, trends over time, etc. ...
kellyfj • 131
3 votes
2 answers
1k views

I have a problem I was hoping I could get some advice on! I have a LOT of text as input (about 20GB worth, not MASSIVE but big enough). This is just free text, unstructured. I have a 'category list'...
Duncan • 131
6 votes
2 answers
15k views

I have around 200 million new objects coming in per day and a 90-day retention policy, so that leaves me with 18 billion records to be stored in the form of key-value pairs. Key and value both will be a ...
Chaos • 187
