🐟 Śledzie

🐟 Śledzie

🐟 Śledzie

Projekt z Eksploracji Masywnych Danych, analiza długości śledzi

Wstęp

Długość śledzi oceanicznych w Europie zmniejsza się wraz z upływem lat. W niniejszym raporcie przeanalizowano dane z połowów komercyjnych jednostek przez ostatnie 60 lat. W ramach połowu losowo wybierano od 50 do 100 sztuk trzyletnich śledzi. Wykonano m.in. interpretację i analizę atrybutów, analizę korelacji atrybutów, stworzenie klasyfikatorów w celu znalezienia najistotniejszej cechy.

Kod wyliczający wykorzystane biblioteki

## [1] "knitr" "kableExtra" "plotly" "dplyr" "tidyverse" ## [6] "ggplot2" "ggExtra" "cowplot" "gridExtra" "imputeTS" ## [11] "corrplot" "reshape2" "caret" "gganimate" "png" ## [16] "gifski"

Kod zapewniający powtarzalność wyników przy każdym uruchomieniu raportu na tych samych danych.

set.seed(40)

Kod pozwalający wczytać dane z pliku

headers <- read.table("sledzie.csv", nrow=1, stringsAsFactors = FALSE, sep = ",") content <- read.table("sledzie.csv", header=TRUE, stringsAsFactors = FALSE, sep=",", na.strings = c("", "?", "NA")) head(content)

## X length cfin1 cfin2 chel1 chel2 lcop1 lcop2 fbar recr ## 1 0 23.0 0.02778 0.27785 2.46875 NA 2.54787 26.35881 0.356 482831 ## 2 1 22.5 0.02778 0.27785 2.46875 21.43548 2.54787 26.35881 0.356 482831 ## 3 2 25.0 0.02778 0.27785 2.46875 21.43548 2.54787 26.35881 0.356 482831 ## 4 3 25.5 0.02778 0.27785 2.46875 21.43548 2.54787 26.35881 0.356 482831 ## 5 4 24.0 0.02778 0.27785 2.46875 21.43548 2.54787 26.35881 0.356 482831 ## 6 5 22.0 0.02778 0.27785 2.46875 21.43548 2.54787 NA 0.356 482831 ## cumf totaln sst sal xmonth nao ## 1 0.3059879 267380.8 14.30693 35.51234 7 2.8 ## 2 0.3059879 267380.8 14.30693 35.51234 7 2.8 ## 3 0.3059879 267380.8 14.30693 35.51234 7 2.8 ## 4 0.3059879 267380.8 14.30693 35.51234 7 2.8 ## 5 0.3059879 267380.8 14.30693 35.51234 7 2.8 ## 6 0.3059879 267380.8 14.30693 35.51234 7 2.8

Kod przetwarzający brakujące dane

content$cfin1 <- na_kalman(content$cfin1) content$cfin2 <- na_kalman(content$cfin2) content$chel1 <- na_kalman(content$chel1) content$chel2 <- na_kalman(content$chel2) content$lcop1 <- na_kalman(content$lcop1) content$lcop2 <- na_kalman(content$lcop2) content$sst <- na_kalman(content$sst) head(content)

## X length cfin1 cfin2 chel1 chel2 lcop1 lcop2 fbar recr ## 1 0 23.0 0.02778 0.27785 2.46875 21.43549 2.54787 26.35881 0.356 482831 ## 2 1 22.5 0.02778 0.27785 2.46875 21.43548 2.54787 26.35881 0.356 482831 ## 3 2 25.0 0.02778 0.27785 2.46875 21.43548 2.54787 26.35881 0.356 482831 ## 4 3 25.5 0.02778 0.27785 2.46875 21.43548 2.54787 26.35881 0.356 482831 ## 5 4 24.0 0.02778 0.27785 2.46875 21.43548 2.54787 26.35881 0.356 482831 ## 6 5 22.0 0.02778 0.27785 2.46875 21.43548 2.54787 26.35881 0.356 482831 ## cumf totaln sst sal xmonth nao ## 1 0.3059879 267380.8 14.30693 35.51234 7 2.8 ## 2 0.3059879 267380.8 14.30693 35.51234 7 2.8 ## 3 0.3059879 267380.8 14.30693 35.51234 7 2.8 ## 4 0.3059879 267380.8 14.30693 35.51234 7 2.8 ## 5 0.3059879 267380.8 14.30693 35.51234 7 2.8 ## 6 0.3059879 267380.8 14.30693 35.51234 7 2.8

Sekcja podsumowująca rozmiar zbioru i podstawowe statystyki

Rozmiar zbioru:

nrow(content)

## [1] 52582

Podstawowe statystyki:

content %>% summary()

## X length cfin1 cfin2 ## Min. : 0 Min. :19.0 Min. : 0.0000 Min. : 0.0000 ## 1st Qu.:13145 1st Qu.:24.0 1st Qu.: 0.0000 1st Qu.: 0.2778 ## Median :26290 Median :25.5 Median : 0.1111 Median : 0.7012 ## Mean :26290 Mean :25.3 Mean : 0.4460 Mean : 2.0255 ## 3rd Qu.:39436 3rd Qu.:26.5 3rd Qu.: 0.3333 3rd Qu.: 1.7936 ## Max. :52581 Max. :32.5 Max. :37.6667 Max. :19.3958 ## chel1 chel2 lcop1 lcop2 ## Min. : 0.000 Min. : 5.238 Min. : 0.3074 Min. : 7.849 ## 1st Qu.: 2.469 1st Qu.:13.427 1st Qu.: 2.5479 1st Qu.:17.808 ## Median : 5.750 Median :21.436 Median : 7.0000 Median :24.859 ## Mean :10.004 Mean :21.217 Mean : 12.8053 Mean :28.421 ## 3rd Qu.:11.500 3rd Qu.:27.193 3rd Qu.: 21.2315 3rd Qu.:37.232 ## Max. :75.000 Max. :57.706 Max. :115.5833 Max. :68.736 ## fbar recr cumf totaln ## Min. :0.0680 Min. : 140515 Min. :0.06833 Min. : 144137 ## 1st Qu.:0.2270 1st Qu.: 360061 1st Qu.:0.14809 1st Qu.: 306068 ## Median :0.3320 Median : 421391 Median :0.23191 Median : 539558 ## Mean :0.3304 Mean : 520366 Mean :0.22981 Mean : 514973 ## 3rd Qu.:0.4560 3rd Qu.: 724151 3rd Qu.:0.29803 3rd Qu.: 730351 ## Max. :0.8490 Max. :1565890 Max. :0.39801 Max. :1015595 ## sst sal xmonth nao ## Min. :12.77 Min. :35.40 Min. : 1.000 Min. :-4.89000 ## 1st Qu.:13.60 1st Qu.:35.51 1st Qu.: 5.000 1st Qu.:-1.89000 ## Median :13.86 Median :35.51 Median : 8.000 Median : 0.20000 ## Mean :13.88 Mean :35.51 Mean : 7.258 Mean :-0.09236 ## 3rd Qu.:14.16 3rd Qu.:35.52 3rd Qu.: 9.000 3rd Qu.: 1.63000 ## Max. :14.73 Max. :35.61 Max. :12.000 Max. : 5.08000

Szczegółowa analiza wartości atrybutów

Rozkład cech:

plots <- ( lapply(names(content)[-1], function(var_x) { ggplot(content) + aes_string(var_x) + geom_histogram() + theme_bw() }) ) plot_grid(plotlist = plots)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Sekcja sprawdzającą korelacje między zmiennymi

par(xpd=TRUE) corrplot.mixed(cor(content[-1]), tl.srt = 90, tl.pos="lt", diag = "l", tl.col="black", number.cex=.6, tl.cex=.6, mar = c(0, 0, 2, 0))

Można zauważyć dodatnią korelację pomiędzy chel1 (zagęszczenie Calanus helgolandicus gat. 1) a lcop1 (zagęszczenie widłonogów gat. 1) ~= 0.96
Można zauważyć dodatnią korelację pomiędzy chel2 (zagęszczenie Calanus helgolandicus gat. 2) a lcop2 (zagęszczenie widłonogów gat. 2) ~= 0.88
Można zauważyć dodatnią korelację pomiędzy cfin2 (zagęszczenie Calanus finmarchicus gat. 2) a lcop2 (zagęszczenie widłonogów gat. 2) ~= 0.65
Można zauważyć dodatnią korelację pomiędzy sst (temperatura przy powierzchni wody [°C]) a nao (oscylacja północnoatlantycka [mb]) ~= 0.51
[Najciekawsze] - można zauważyć ujemną korelację pomiędzy length (długość złowionego śledzia [cm]) a sst (temperatura przy powierzchni wody) ~= -0.45

ggplot(content, aes(sst, length)) + geom_point() + geom_smooth(method="lm") + ylab(sprintf("długość")) + xlab("temperatura") + ggtitle("Ze wzrostem temperatury maleje długość")

Interaktywny wykres lub animację prezentującą zmianę rozmiaru śledzi w czasie

p <- ggplot(content, aes(X, length)) + geom_point() + geom_smooth(method = "auto", color = "red") # ggplotly(p) - Plotly is not working for me - dependency issue p

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Można zauważyć wzrost długości śledzia do pewnego momentu (~row 17500) po czym następuje spadek długości.

Sekcja regresora przewidującego rozmiar śledzia, parametry modelu oraz oszacowanie jego skuteczności

Podzielmy dane na: - zbiór uczący - 0.75 zbioru początkowego - zbiór testowy - 0.25 zbioru początkowego

normalized_content <- data.frame(content[-1]) training_indexes <- createDataPartition(y = normalized_content$length, p = .75, list = FALSE) training_set <- normalized_content[ training_indexes] testing_set <- normalized_content[-training_indexes] rounded_content <- sapply(normalized_content$length, round, digits = 0) control_params <- trainControl( method = "repeatedcv", number = 5, repeats = 10)

kNN

set.seed(40) # bug # n <- names(normalized_content) # f <- as.formula(paste("length ~", paste(n[!n %in% "length"], collapse = " + "))) # f <- reformulate(setdiff(colnames(normalized_content), "length"), response="length") # model <- lm(f, data=training_set) # new <- data.frame(testing_set) # predict(model, newdata=new, interval="confidence") # model_regression <- caret::train(f, data=training_set, method = "lm", trControl = control_params) # regression_classes <- predict(model_regression, newdata = testing_set) # regression_classes <- sapply(regression_classes, round, digits = 0) # levels <- unique(c(rounded_content, regression_classes)) # postResample(pred = regression_classes, obs = rounded_content)

kNN

#model_knn <- train(length ~ ., data = training_set, method = "knn", importance=T, trControl = control_params) #knn_classes <- predict(model_knn, newdata = testing_set) #knn_classes <- sapply(knn_classes, round, digits = 0) #levels <- unique(c(rounded_content, knn_classes)) #postResample(pred = knn_classes, obs = rounded_content) #plot(model_knn)

Random Forest

#model_random_forest <- train(length ~ ., data = training_set, method = "rf", importance=T, trControl = control_params, ntree = 10)  #random_forest_classes <- predict(model_random_forest, newdata = testing_set) #random_forest_classes <- sapply(random_forest_classes, round, digits = 0) #levels <- unique(c(rounded_content, random_forest_classes)) #postResample(pred = random_forest_classes, obs = rounded_content)

Analiza ważności atrybutów najlepszego znalezionego modelu regresji

#models <- list(randomForest = model_random_forest, linearRegression = model_regression, kNN = model_knn) %>% resamples() #dotplot(models, metric = "RMSE") #dotplot(models, metric = "Rsquared")

Niestety z powodów problemu z biblioteką carot nie byłem w stanie sprawdzić wyników modeli :(

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
READMERmd_files/figure-markdown_github		READMERmd_files/figure-markdown_github
.DS_Store		.DS_Store
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
READMERmd.Rmd		READMERmd.Rmd
READMERmd.html		READMERmd.html
herrings.gif		herrings.gif
herrings.webp		herrings.webp
sledzie.csv		sledzie.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🐟 Śledzie

kNN

kNN

Random Forest

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

🐟 Śledzie

kNN

kNN

Random Forest

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages