[00:00:13] mutante: for some reason 569100 says "can't merge" but the rebase button says "already up to date". Black and white at the same time? :-
[00:00:17] :-) *
[00:02:11] PROBLEM - Check whether microcode mitigations for CPU vulnerabilities are applied on ganeti1016 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {md_clear} https://wikitech.wikimedia.org/wiki/Microcode
[00:02:11] PROBLEM - Check whether microcode mitigations for CPU vulnerabilities are applied on ganeti1009 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {md_clear} https://wikitech.wikimedia.org/wiki/Microcode
[00:06:53] PROBLEM - Check whether microcode mitigations for CPU vulnerabilities are applied on ganeti1010 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {md_clear} https://wikitech.wikimedia.org/wiki/Microcode
[00:10:52] (PS1) Marostegui: Revert "db2087: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/569102
[00:11:35] PROBLEM - Check whether microcode mitigations for CPU vulnerabilities are applied on ganeti1011 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {md_clear} https://wikitech.wikimedia.org/wiki/Microcode
[00:12:17] (CR) Marostegui: [C: +2] Revert "db2087: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/569102 (owner: Marostegui)
[00:14:23] (PS1) Dzahn: ATS: directly talk wss:// to aphlict [puppet] - https://gerrit.wikimedia.org/r/569104 (https://phabricator.wikimedia.org/T238593)
[00:16:17] PROBLEM - Check whether microcode mitigations for CPU vulnerabilities are applied on ganeti1012 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {md_clear} https://wikitech.wikimedia.org/wiki/Microcode
[00:16:24] (CR) Dzahn: ATS: directly talk wss:// to aphlict (1 comment) [puppet] - https://gerrit.wikimedia.org/r/569104 (https://phabricator.wikimedia.org/T238593) (owner: Dzahn)
[00:19:54] PROBLEM - Check whether microcode
mitigations for CPU vulnerabilities are applied on ganeti1013 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {md_clear} https://wikitech.wikimedia.org/wiki/Microcode
[00:23:44] PROBLEM - Check whether microcode mitigations for CPU vulnerabilities are applied on ganeti1014 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {md_clear} https://wikitech.wikimedia.org/wiki/Microcode
[00:31:38] !log importing jenkins 2.219 to stretch-wikimedia APT repo; releases1001: upgrading jenkins to 2.219
[00:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:36:16] !log releases2001: upgrading jenkins to 2.219; install1002: import jenkins 2.219 into jessie-wikimedia APT repo
[00:36:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:41:37] !log contint1001/contint2001 - upgrading jenkins to 2.219
[00:41:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:16] PROBLEM - Check whether microcode mitigations for CPU vulnerabilities are applied on ganeti1015 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {md_clear} https://wikitech.wikimedia.org/wiki/Microcode
[00:57:56] PROBLEM - Check whether microcode mitigations for CPU vulnerabilities are applied on ganeti1018 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {md_clear} https://wikitech.wikimedia.org/wiki/Microcode
[01:02:59] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@322ee4c]: Update mobileapps to 3eec28d
[01:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:09:52] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@322ee4c]: Update mobileapps to 3eec28d (duration: 06m 53s)
[01:09:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:38:40] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter
level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[01:40:32] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[01:42:52] PROBLEM - Host cp3063 is DOWN: PING CRITICAL - Packet loss = 100%
[03:14:20] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:16:10] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:22:45] !log powercycling crashed cp3063
[03:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:25:56] RECOVERY - Host cp3063 is UP: PING OK - Packet loss = 0%, RTA = 83.37 ms
[03:53:22] Operations, ops-eqiad, serviceops: (No Need By Date Provided) rack/setup/install mw[1385-1413].eqiad.wmnet - https://phabricator.wikimedia.org/T241849 (Jclark-ctr) @jijiki will most likely be early March
[09:18:16] !log addshore@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/rebuildPropertyTerms.php --wiki=wikidatawiki --sleep 4 --batch-size=25 # In a screen for T219301
[09:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:20] T219301: Migrate to and read from new store for property terms - https://phabricator.wikimedia.org/T219301
[09:28:16] Operations, ops-eqiad, Analytics, Analytics-Cluster: Degraded RAID on analytics1030 - https://phabricator.wikimedia.org/T243971 (Peachey88)
[09:31:14] hey addshore while you're here...
[09:31:22] o/
[09:31:25] what's the new eta on having the ful migration done?
[09:31:34] (this is my monthly check-in)
[09:31:48] I should be able to give you a better prediction at the start of next week
[09:32:14] do we think it's a month or a week or much longer than a month? just looking for a ballpark figure
[09:32:28] Started again around 12 hours ago, and nearly at Q 2 million, so maybe it is 4 million ish every day? so maybe 20 days?
[09:32:38] ballpark would be 1 month
[09:32:44] so I should note to check in again around a month from now... that works for me.
[09:32:46] thanks much!
[09:32:55] id like to see it happen sooner however, and it might, but if it does, youll know about it :)
[09:34:06] :-)
[09:34:23] this is more for my own bookkeeping on a phab task
[10:32:56] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:33:05] ^ checking
[10:34:44] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:35:19] nothing interesting
[10:50:51] (CR) Arturo Borrero Gonzalez: [C: +2] Buster vms: include python3 versions of openstack clients [puppet] - https://gerrit.wikimedia.org/r/569084 (owner: Andrew Bogott)
[11:53:59] (PS2) WMDE-leszek: Wikibase: added config variables to configure entity sources [mediawiki-config] - https://gerrit.wikimedia.org/r/569031 (https://phabricator.wikimedia.org/T242087)
[11:54:01] (PS1) WMDE-leszek: Beta wikidata: Define entity sources configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/569204 (https://phabricator.wikimedia.org/T242087)
[11:54:03] (PS1) WMDE-leszek: Beta commons: Define entity sources configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/569205 (https://phabricator.wikimedia.org/T242087)
[11:54:05] (PS1) WMDE-leszek: Beta cluster: use entity source Wikibase setting for all wikibase-enabled wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/569206 (https://phabricator.wikimedia.org/T242087)
[11:54:07] (PS1) WMDE-leszek: Beta commons: Remove custom wmgWikibaseRepoForeignRepositories setting [mediawiki-config] -
https://gerrit.wikimedia.org/r/569207 (https://phabricator.wikimedia.org/T242087)
[11:54:09] (PS1) WMDE-leszek: Beta cluster: remove custom wmgWikibaseClientRepositories settings [mediawiki-config] - https://gerrit.wikimedia.org/r/569208 (https://phabricator.wikimedia.org/T242087)
[11:54:11] (PS1) WMDE-leszek: Test wikidata: Define entity sources configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/569209 (https://phabricator.wikimedia.org/T242087)
[11:58:57] (CR) jerkins-bot: [V: -1] Beta commons: Remove custom wmgWikibaseRepoForeignRepositories setting [mediawiki-config] - https://gerrit.wikimedia.org/r/569207 (https://phabricator.wikimedia.org/T242087) (owner: WMDE-leszek)
[12:00:11] (CR) jerkins-bot: [V: -1] Beta cluster: remove custom wmgWikibaseClientRepositories settings [mediawiki-config] - https://gerrit.wikimedia.org/r/569208 (https://phabricator.wikimedia.org/T242087) (owner: WMDE-leszek)
[12:01:32] (CR) jerkins-bot: [V: -1] Test wikidata: Define entity sources configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/569209 (https://phabricator.wikimedia.org/T242087) (owner: WMDE-leszek)
[12:21:10] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[12:26:40] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[12:39:28] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 1.179e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[12:43:07] ^ this alert can be looked at later
[12:56:00] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[13:06:29] (PS2) WMDE-leszek: Beta commons: Remove custom wmgWikibaseRepoForeignRepositories setting [mediawiki-config] - https://gerrit.wikimedia.org/r/569207 (https://phabricator.wikimedia.org/T242087)
[13:06:35] (PS2) WMDE-leszek: Beta cluster: remove custom wmgWikibaseClientRepositories settings [mediawiki-config] - https://gerrit.wikimedia.org/r/569208 (https://phabricator.wikimedia.org/T242087)
[13:06:37] (PS2) WMDE-leszek: Test wikidata: Define entity sources configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/569209 (https://phabricator.wikimedia.org/T242087)
[13:27:16] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:32:46] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is
CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:34:36] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:58:30] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:00:20] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:02:50] (PS1) Arturo Borrero Gonzalez: cloud: hiera: puppetmaster: refactor hiera [puppet] - https://gerrit.wikimedia.org/r/569230 (https://phabricator.wikimedia.org/T229441)
[14:04:00] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:09:30] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:09:54] (CR) Arturo Borrero Gonzalez: "Sharing this idea with you." [puppet] - https://gerrit.wikimedia.org/r/569230 (https://phabricator.wikimedia.org/T229441) (owner: Arturo Borrero Gonzalez)
[14:22:22] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:26:02] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:29:17] ACKNOWLEDGEMENT - Check whether microcode mitigations for CPU vulnerabilities are applied on ganeti1009 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {md_clear} Effie Mouzeli Will be taken care of after all hands - The acknowledgement expires at: 2020-02-03 13:27:00. https://wikitech.wikimedia.org/wiki/Microcode
[14:29:17] ACKNOWLEDGEMENT - Check whether microcode mitigations for CPU vulnerabilities are applied on ganeti1010 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {md_clear} Effie Mouzeli Will be taken care of after all hands - The acknowledgement expires at: 2020-02-03 13:27:00.
https://wikitech.wikimedia.org/wiki/Microcode
[14:29:17] ACKNOWLEDGEMENT - Check whether microcode mitigations for CPU vulnerabilities are applied on ganeti1011 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {md_clear} Effie Mouzeli Will be taken care of after all hands - The acknowledgement expires at: 2020-02-03 13:27:00. https://wikitech.wikimedia.org/wiki/Microcode
[14:29:17] ACKNOWLEDGEMENT - Check whether microcode mitigations for CPU vulnerabilities are applied on ganeti1012 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {md_clear} Effie Mouzeli Will be taken care of after all hands - The acknowledgement expires at: 2020-02-03 13:27:00. https://wikitech.wikimedia.org/wiki/Microcode
[14:29:17] ACKNOWLEDGEMENT - Check whether microcode mitigations for CPU vulnerabilities are applied on ganeti1013 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {md_clear} Effie Mouzeli Will be taken care of after all hands - The acknowledgement expires at: 2020-02-03 13:27:00. https://wikitech.wikimedia.org/wiki/Microcode
[14:29:17] ACKNOWLEDGEMENT - Check whether microcode mitigations for CPU vulnerabilities are applied on ganeti1014 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {md_clear} Effie Mouzeli Will be taken care of after all hands - The acknowledgement expires at: 2020-02-03 13:27:00. https://wikitech.wikimedia.org/wiki/Microcode
[14:29:18] ACKNOWLEDGEMENT - Check whether microcode mitigations for CPU vulnerabilities are applied on ganeti1015 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {md_clear} Effie Mouzeli Will be taken care of after all hands - The acknowledgement expires at: 2020-02-03 13:27:00.
https://wikitech.wikimedia.org/wiki/Microcode
[14:29:18] ACKNOWLEDGEMENT - Check whether microcode mitigations for CPU vulnerabilities are applied on ganeti1016 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {md_clear} Effie Mouzeli Will be taken care of after all hands - The acknowledgement expires at: 2020-02-03 13:27:00. https://wikitech.wikimedia.org/wiki/Microcode
[14:29:18] ACKNOWLEDGEMENT - Check whether microcode mitigations for CPU vulnerabilities are applied on ganeti1018 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {md_clear} Effie Mouzeli Will be taken care of after all hands - The acknowledgement expires at: 2020-02-03 13:27:00. https://wikitech.wikimedia.org/wiki/Microcode
[14:29:19] ACKNOWLEDGEMENT - Check whether microcode mitigations for CPU vulnerabilities are applied on ganeti1019 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {md_clear} Effie Mouzeli Will be taken care of after all hands - The acknowledgement expires at: 2020-02-03 13:27:00. https://wikitech.wikimedia.org/wiki/Microcode
[14:29:20] ACKNOWLEDGEMENT - Check whether microcode mitigations for CPU vulnerabilities are applied on ganeti1020 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {md_clear} Effie Mouzeli Will be taken care of after all hands - The acknowledgement expires at: 2020-02-03 13:27:00. https://wikitech.wikimedia.org/wiki/Microcode
[14:29:20] ACKNOWLEDGEMENT - Check whether microcode mitigations for CPU vulnerabilities are applied on ganeti1021 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {md_clear} Effie Mouzeli Will be taken care of after all hands - The acknowledgement expires at: 2020-02-03 13:27:00.
https://wikitech.wikimedia.org/wiki/Microcode
[14:29:20] ACKNOWLEDGEMENT - Check whether microcode mitigations for CPU vulnerabilities are applied on ganeti1022 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {md_clear} Effie Mouzeli Will be taken care of after all hands - The acknowledgement expires at: 2020-02-03 13:27:00. https://wikitech.wikimedia.org/wiki/Microcode
[14:33:39] :D
[14:38:52] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:40:40] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:41:08] :/
[14:50:40] Operations, Analytics, Analytics-Cluster: notebook1004 - /srv is full - https://phabricator.wikimedia.org/T232068 (jijiki) Resolved→Open
[14:51:56] Operations, Analytics, Analytics-Cluster: notebook1004 - /srv is full - https://phabricator.wikimedia.org/T232068 (jijiki) Host alerted again about /srv being full, /srv/home is 119G.
[14:52:47] ACKNOWLEDGEMENT - Disk space on notebook1004 is CRITICAL: DISK CRITICAL - free space: /srv 2102 MB (1% inode=78%): Effie Mouzeli Reopened T232068 https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1004&var-datasource=eqiad+prometheus/ops
[14:55:32] PROBLEM - etherpad_up reduced availability on icinga1001 is CRITICAL: 0 le 0.8 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:57:22] RECOVERY - etherpad_up reduced availability on icinga1001 is OK: (C)0.8 le (W)0.9 le 1 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:11:56] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[15:13:46] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[15:37:34] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[15:41:12] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:03:12] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:05:02] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:10:04] Operations, Analytics, Analytics-Cluster: notebook1004 - /srv is full - https://phabricator.wikimedia.org/T232068 (elukey) @Groceryheist hello, can you check your home directory size ? :)
[16:23:24] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:25:14] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:34:24] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:34:32] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[16:34:45] ok that is different now
[16:36:22] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[16:38:04] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:52:41] * addshore reads up
[16:53:21] are these error spikes all abusefilter? O_o
[16:53:52] the app or the api pnes?
[16:54:03] ones*
[16:54:10] api
[16:54:27] it is a bot doing too many expensive requests and timing out
[16:54:34] I see
[16:54:38] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:54:44] I just saw this, many skipeyness for abusefilter https://usercontent.irccloud-cdn.com/file/OodmiHwl/image.png
[16:54:46] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=
[16:54:46] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[16:55:20] addshore: which dashoard are you looking at ?
[16:55:31] thats just on the logstash homepage
[16:55:49] oh lol
[16:56:36] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[16:57:11] Abusefilter looks normal to me
[16:57:19] there's always lots of noise in logstash about it
[16:57:30] just zhwiki then :)
[16:58:26] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[16:59:14] !log Re-enable notifications on the dbstore1005:3318 check T243871
[16:59:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:59:17] T243871: Long query running on dbstore1005:3318 - https://phabricator.wikimedia.org/T243871
[17:00:06] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[17:05:38] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[17:25:33] (PS1) Bstorm: wiki-replicas: Correct the actor subquery against revision [puppet] - https://gerrit.wikimedia.org/r/569249 (https://phabricator.wikimedia.org/T243984)
[17:32:04] (CR) Brian Wolff: [C: +1] wiki-replicas: Correct the actor subquery against revision [puppet] - https://gerrit.wikimedia.org/r/569249 (https://phabricator.wikimedia.org/T243984) (owner: Bstorm)
[17:44:12] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[17:47:54] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[17:51:32] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[17:53:24] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[17:59:16] (PS1) WMDE-leszek: Test wikibase clients: Define entity sources configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/569256 (https://phabricator.wikimedia.org/T242087)
[17:59:19] (PS1) WMDE-leszek: Test commons: Define entity sources configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/569257 (https://phabricator.wikimedia.org/T242087)
[17:59:21] (PS1) WMDE-leszek: Wikidata: Define entity sources configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/569258 (https://phabricator.wikimedia.org/T242087)
[17:59:23] (PS1) WMDE-leszek: Wikidata client wikis: Define entity sources configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/569259 (https://phabricator.wikimedia.org/T242087)
[17:59:24] (PS1) WMDE-leszek: Commons: Define entity sources configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/569260 (https://phabricator.wikimedia.org/T242087)
[17:59:27] (PS1) WMDE-leszek: Wikidata/Wikibase: use entity source Wikibase setting for all wikibase-enabled wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/569261 (https://phabricator.wikimedia.org/T242087)
[17:59:37] (PS1) WMDE-leszek: Set wmgUseEntitySourceBasedFederation to true for all wikibase-enabled wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/569262 (https://phabricator.wikimedia.org/T241971)
[17:59:39] (PS1) WMDE-leszek: Wikibase: Removed config option wmgUseEntitySourceBasedFederation [mediawiki-config] - https://gerrit.wikimedia.org/r/569263 (https://phabricator.wikimedia.org/T241975)
[18:01:58] (CR) jerkins-bot: [V: -1] Wikibase: Removed config option wmgUseEntitySourceBasedFederation [mediawiki-config] - https://gerrit.wikimedia.org/r/569263 (https://phabricator.wikimedia.org/T241975)
(owner: 10WMDE-leszek) [18:03:19] !log depool ats-tls on cp4029 [18:03:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:11] !log depool varnish-fe on cp4029 [18:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:34] !log restarted ats-tls and varnish-fe on cp4029 [18:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:53] (03PS1) 10Majavah: Add wgImportSources for hiwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569267 (https://phabricator.wikimedia.org/T244022) [18:14:20] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [18:14:25] !log repool cp4029 [18:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:37] (03CR) 10jerkins-bot: [V: 04-1] Wikidata/Wikibase: use entity source Wikibase setting for all wikibase-enabled wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569261 (https://phabricator.wikimedia.org/T242087) (owner: 10WMDE-leszek) [18:16:12] (03CR) 10Majavah: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/569267 (https://phabricator.wikimedia.org/T244022) (owner: 10Majavah) [18:16:14] PROBLEM - Webrequests Varnishkafka log producer on cp4029 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [18:16:52] PROBLEM - statsv Varnishkafka log producer on cp4029 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [18:17:23] (03CR) 10jerkins-bot: [V: 04-1] Set wmgUseEntitySourceBasedFederation to true for all wikibase-enabled wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569262 (https://phabricator.wikimedia.org/T241971) (owner: 10WMDE-leszek) [18:17:36] PROBLEM - eventlogging Varnishkafka log producer on cp4029 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [18:17:42] !log bblack@cumin1001 conftool action : set/pooled=yes; selector: name=cp4032.ulsfo.wmnet [18:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:49] !log repool cp4032 (buster) [18:17:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:04] RECOVERY - Webrequests Varnishkafka log producer on cp4029 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [18:18:42] RECOVERY - statsv Varnishkafka log producer on cp4029 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [18:19:28] RECOVERY - eventlogging Varnishkafka log producer on cp4029 is OK: PROCS OK: 1 process with args 
/usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [18:22:22] (03PS1) 10BBlack: geodns: bugfix for TR entry, use esams [dns] - 10https://gerrit.wikimedia.org/r/569269 [18:25:25] (03CR) 10Bstorm: "I confirmed in my local dev environment that this runs and produces a functional view. I'll merge and start deploying it." [puppet] - 10https://gerrit.wikimedia.org/r/569249 (https://phabricator.wikimedia.org/T243984) (owner: 10Bstorm) [18:25:36] (03CR) 10Bstorm: [C: 03+2] wiki-replicas: Correct the actor subquery against revision [puppet] - 10https://gerrit.wikimedia.org/r/569249 (https://phabricator.wikimedia.org/T243984) (owner: 10Bstorm) [18:29:08] (03CR) 10BBlack: [V: 03+2 C: 03+2] "jerkins, where are you?" [dns] - 10https://gerrit.wikimedia.org/r/569269 (owner: 10BBlack) [18:32:51] 10Operations, 10Traffic: cp4029 varnish-fe freakout - https://phabricator.wikimedia.org/T243634 (10BBlack) We also tested depooling just port 80 yesterday, which didn't affect anything (fd leak was still growing), which means this isn't driven by external->:80 traffic. cp4029 was at ~400K fds this morning, so... [18:46:12] (03CR) 10Brian Wolff: [C: 03+1] "Oh, i just noticed there is an actor_revision view that probably needs the same change applied." 
[puppet] - 10https://gerrit.wikimedia.org/r/569249 (https://phabricator.wikimedia.org/T243984) (owner: 10Bstorm) [18:52:13] (03CR) 10Bstorm: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/569249 (https://phabricator.wikimedia.org/T243984) (owner: 10Bstorm) [18:53:48] (03PS1) 10Bstorm: wiki-replicas: Correct the actor_revision subquery against revision [puppet] - 10https://gerrit.wikimedia.org/r/569270 (https://phabricator.wikimedia.org/T243984) [18:56:05] 10Operations, 10Traffic: ulsfo varinsh-fe vcache processes overflow on FDs - https://phabricator.wikimedia.org/T243634 (10CDanis) [18:59:08] (03CR) 10Brian Wolff: [C: 03+1] wiki-replicas: Correct the actor_revision subquery against revision [puppet] - 10https://gerrit.wikimedia.org/r/569270 (https://phabricator.wikimedia.org/T243984) (owner: 10Bstorm) [19:00:20] (03PS1) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/569271 [19:00:22] (03CR) 10Bstorm: [C: 03+2] wiki-replicas: Correct the actor_revision subquery against revision [puppet] - 10https://gerrit.wikimedia.org/r/569270 (https://phabricator.wikimedia.org/T243984) (owner: 10Bstorm) [19:02:39] (03CR) 10jerkins-bot: [V: 04-1] WIP [puppet] - 10https://gerrit.wikimedia.org/r/569271 (owner: 10CDanis) [19:06:58] (03PS2) 10CDanis: ulsfo cp-text: Prometheus export # of vcache fds [puppet] - 10https://gerrit.wikimedia.org/r/569271 (https://phabricator.wikimedia.org/T243634) [19:08:46] (03PS3) 10CDanis: ulsfo cp-text: Prometheus export # of vcache fds [puppet] - 10https://gerrit.wikimedia.org/r/569271 (https://phabricator.wikimedia.org/T243634) [19:14:29] (03CR) 10CDanis: "PCC looks right to me: https://puppet-compiler.wmflabs.org/compiler1001/20579/" [puppet] - 10https://gerrit.wikimedia.org/r/569271 (https://phabricator.wikimedia.org/T243634) (owner: 10CDanis) [19:48:22] (03CR) 10Alexandros Kosiaris: "> Alexandros Kosiaris, thanks for the reviews. After adding the config.app, I'm getting an error from Jenkins. 
Do I need to make configmap" [deployment-charts] - 10https://gerrit.wikimedia.org/r/565788 (https://phabricator.wikimedia.org/T241230) (owner: 10Bmansurov) [19:48:38] (03PS1) 10Jhedden: wiki replicas: update comment filtering in maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/569276 [19:52:32] (03PS2) 10Jhedden: wiki replicas: update comment filtering in maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/569276 [19:53:28] (03PS3) 10Jhedden: wiki replicas: update comment filtering in maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/569276 [19:54:08] (03PS4) 10Jhedden: wiki replicas: update comment filtering in maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/569276 [20:50:32] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [21:07:30] (03CR) 10Bstorm: [C: 03+2] wiki replicas: update comment filtering in maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/569276 (owner: 10Jhedden) [21:07:42] (03CR) 10Bstorm: [C: 03+2] "Nice work, tests out perfectly locally." 
[puppet] - 10https://gerrit.wikimedia.org/r/569276 (owner: 10Jhedden) [21:17:16] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:21:38] !log updated actor views on labsdb1012 [21:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:08] !log updated views on labsdb1009 [21:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:20] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:32:24] !log updated views on labsdb1010 [21:32:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:54] (03CR) 10CDanis: [C: 03+2] "visual irl review by godog" [puppet] - 10https://gerrit.wikimedia.org/r/569271 (https://phabricator.wikimedia.org/T243634) (owner: 10CDanis) [21:40:50] (03CR) 10BryanDavis: [C: 03+1] "Thank you for finding and fixing this logic bug I hid in my prior patch Jason." 
[puppet] - 10https://gerrit.wikimedia.org/r/569276 (owner: 10Jhedden) [21:43:33] (03PS1) 10Bstorm: wiki-replicas: Depool labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/569284 (https://phabricator.wikimedia.org/T243984) [21:45:35] (03PS2) 10Bstorm: wiki-replicas: Depool labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/569284 (https://phabricator.wikimedia.org/T243984) [21:50:28] (03CR) 10Bstorm: [C: 03+2] wiki-replicas: Depool labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/569284 (https://phabricator.wikimedia.org/T243984) (owner: 10Bstorm) [21:51:36] (03PS1) 10CDanis: prom-file-count: include symlinks & other special files [puppet] - 10https://gerrit.wikimedia.org/r/569285 (https://phabricator.wikimedia.org/T243634) [21:52:16] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:54:38] (03PS2) 10CDanis: prom-file-count: include symlinks & other special files [puppet] - 10https://gerrit.wikimedia.org/r/569285 (https://phabricator.wikimedia.org/T243634) [21:55:52] (03CR) 10CDanis: [C: 03+2] prom-file-count: include symlinks & other special files [puppet] - 10https://gerrit.wikimedia.org/r/569285 (https://phabricator.wikimedia.org/T243634) (owner: 10CDanis) [21:59:27] !log depooled labsdb1011 [21:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:22] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:05:36] 10Operations, 10Traffic, 10Patch-For-Review: ulsfo varinsh-fe vcache 
processes overflow on FDs - https://phabricator.wikimedia.org/T243634 (10CDanis) https://grafana.wikimedia.org/d/OU_pxz8Wz/cdanis-ulsfo-vcache-open-fds?orgId=1 [22:09:42] (03PS1) 10Bstorm: Revert "wiki-replicas: Depool labsdb1011" [puppet] - 10https://gerrit.wikimedia.org/r/569287 [22:10:08] (03PS2) 10Bstorm: Revert "wiki-replicas: Depool labsdb1011" [puppet] - 10https://gerrit.wikimedia.org/r/569287 [22:11:25] (03CR) 10Bstorm: [C: 03+2] Revert "wiki-replicas: Depool labsdb1011" [puppet] - 10https://gerrit.wikimedia.org/r/569287 (owner: 10Bstorm) [22:14:22] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:14:27] !log repooled labsdb1011 now that view work is done [22:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:39] 10Operations, 10Traffic: ulsfo varinsh-fe vcache processes overflow on FDs - https://phabricator.wikimedia.org/T243634 (10CDanis) Here's some `lsof` output from a faulty-looking vcache process, showing garbage-y sockets that aren't actually associated with any TCP stream: `cache-mai 16000 vcache *870u soc... [22:52:58] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:00:20] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:07:42] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:09:32] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:54:30] (03CR) 10Cwhite: "Looks good, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/569285 (https://phabricator.wikimedia.org/T243634) (owner: 10CDanis)