[00:21:39] RECOVERY - HTTPS-wmflabs on tools.wmflabs.org is OK: SSL OK - Certificate toolforge.org valid until 2020-03-19 23:00:30 +0000 (expires in 89 days) https://phabricator.wikimedia.org/tag/toolforge/
[00:31:17] (CR) Faidon Liambotis: "This is expected to become green in Q3. Why not add it and then acknowledge it in Icinga? That way it's going to stay out of our view unti" [puppet] - https://gerrit.wikimedia.org/r/550053 (owner: CRusnov)
[00:32:41] (CR) CRusnov: [C: +2] "> Patch Set 3:" [puppet] - https://gerrit.wikimedia.org/r/550053 (owner: CRusnov)
[00:33:12] (PS4) CRusnov: hieradata/netbox: Add accounting report to alerts [puppet] - https://gerrit.wikimedia.org/r/550053
[00:37:25] Operations, netbox: Sync new ganeti clusters with netbox - https://phabricator.wikimedia.org/T241166 (crusnov) Yes this is a duplicate. I'll merge it.
[00:38:21] Operations, netbox: Sync new ganeti clusters with netbox - https://phabricator.wikimedia.org/T241166 (crusnov)
[01:05:09] !log volker-e@deploy1001 Started deploy [design/style-guide@b61669a]: Deploy design/style-guide:
[01:05:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:05:16] !log volker-e@deploy1001 Finished deploy [design/style-guide@b61669a]: Deploy design/style-guide: (duration: 00m 07s)
[01:05:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:05:53] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdu_sentry4 site=ulsfo https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:06:52] Operations, SRE-Access-Requests: Requesting access to analytics-privatedata-users, researchers & wmf for Shay Nowick - https://phabricator.wikimedia.org/T240917 (SNowick_WMF) Thank you for getting this done before break, appreciate it! Shay
[01:07:23] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:15:10] Operations, ops-codfw: Degraded RAID on ganeti2002 - https://phabricator.wikimedia.org/T239009 (Papaul)
[01:27:58] (PS1) Reedy: Fix comment about wgCanonicalServer to use https [mediawiki-config] - https://gerrit.wikimedia.org/r/559987
[01:34:01] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[01:34:11] (PS1) Reedy: Add wikis to wgCanonicalServer that are already in wgServer [mediawiki-config] - https://gerrit.wikimedia.org/r/559990 (https://phabricator.wikimedia.org/T188369)
[01:35:00] (PS2) Reedy: Add wikis to wgCanonicalServer that are already in wgServer [mediawiki-config] - https://gerrit.wikimedia.org/r/559990 (https://phabricator.wikimedia.org/T188369)
[01:35:33] (CR) Reedy: [C: +2] Add wikis to wgCanonicalServer that are already in wgServer [mediawiki-config] - https://gerrit.wikimedia.org/r/559990 (https://phabricator.wikimedia.org/T188369) (owner: Reedy)
[01:35:49] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[01:36:31] (Merged) jenkins-bot: Add wikis to wgCanonicalServer that are already in wgServer [mediawiki-config] - https://gerrit.wikimedia.org/r/559990 (https://phabricator.wikimedia.org/T188369) (owner: Reedy)
[01:36:50] (PS2) Reedy: Fix comment about wgCanonicalServer to use https [mediawiki-config] - https://gerrit.wikimedia.org/r/559987
[01:36:59] (CR) Reedy: [C: +2] Fix comment about wgCanonicalServer to use https [mediawiki-config] - https://gerrit.wikimedia.org/r/559987 (owner: Reedy)
[01:37:57] (Merged) jenkins-bot: Fix comment about wgCanonicalServer to use https [mediawiki-config] - https://gerrit.wikimedia.org/r/559987 (owner: Reedy)
[01:39:09] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T188369 (duration: 00m 53s)
[01:39:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:39:14] T188369: Site matrix includes non-existent 'projectcom.wikipedia.org' - https://phabricator.wikimedia.org/T188369
[01:44:21] !log reedy@deploy1001 update-interwiki-cache aborted: Update interwiki cache (duration: 00m 07s)
[01:44:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:45:00] (PS1) Reedy: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/559993
[01:45:02] (CR) Reedy: [C: +2] Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/559993 (owner: Reedy)
[01:45:04] (PS1) Reedy: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/559994
[01:45:06] (CR) Reedy: [C: +2] Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/559994 (owner: Reedy)
[01:46:06] (CR) jerkins-bot: [V: -1] Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/559994 (owner: Reedy)
[01:50:33] (Abandoned) Reedy: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/559993 (owner: Reedy)
[01:50:47] (Abandoned) Reedy: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/559994 (owner: Reedy)
[02:22:47] (CR) Minhducsun2002: "So, should this be merged?" [mediawiki-config] - https://gerrit.wikimedia.org/r/559347 (https://phabricator.wikimedia.org/T150618) (owner: Minhducsun2002)
[02:26:11] (CR) Ebe123: "Well yes, but patience! We are taking a break from configuration change deployments with the holidays. Your task is approved and you have " [mediawiki-config] - https://gerrit.wikimedia.org/r/559347 (https://phabricator.wikimedia.org/T150618) (owner: Minhducsun2002)
[02:30:11] (PS1) Dzahn: planet: convert more http to https feed links [puppet] - https://gerrit.wikimedia.org/r/560002
[02:47:21] (CR) Paladox: [C: +1] planet: convert more http to https feed links [puppet] - https://gerrit.wikimedia.org/r/560002 (owner: Dzahn)
[02:48:07] (CR) Dzahn: "detected with the script" [puppet] - https://gerrit.wikimedia.org/r/560002 (owner: Dzahn)
[02:49:31] (CR) Dzahn: [C: +2] planet: convert more http to https feed links [puppet] - https://gerrit.wikimedia.org/r/560002 (owner: Dzahn)
[02:51:45] PROBLEM - mediawiki originals uploads -hourly- for eqiad on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005:9112 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/dashboard/file/swift?panelId=9&fullscreen&orgId=1&var-DC=eqiad
[02:51:53] PROBLEM - mediawiki originals uploads -hourly- for codfw on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005:9112 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/dashboard/file/swift?panelId=9&fullscreen&orgId=1&var-DC=codfw
[03:05:48] (PS1) Zhuyifei1999: toolforge: Move package 'fish' from shell_environ to exec_environ [puppet] - https://gerrit.wikimedia.org/r/560008 (https://phabricator.wikimedia.org/T241290)
[03:07:30] Operations, ContentSecurityPolicy, Gerrit, Phabricator, and 2 others: Add gerrit.wikimedia.org to the Phabricator CSP - https://phabricator.wikimedia.org/T218308 (Paladox) The CSP header is being added here https://github.com/phacility/phabricator/blob/241f06c9ffc63909ec04e29f67f53cb310fb1953/src...
[03:08:36] Operations, ContentSecurityPolicy, Gerrit, Phabricator, and 2 others: Add gerrit.wikimedia.org to the Phabricator CSP - https://phabricator.wikimedia.org/T218308 (Dzahn) Looks like the headers are coming from "Aphront". https://github.com/phacility/phabricator/blob/241f06c9ffc63909ec04e29f67f53c...
[03:20:27] (PS1) Ayounsi: FNM: Bump the ICMP threshold to 4000 [puppet] - https://gerrit.wikimedia.org/r/560010
[03:22:24] Operations, ContentSecurityPolicy, Gerrit, Phabricator, and 2 others: Add gerrit.wikimedia.org to the Phabricator CSP - https://phabricator.wikimedia.org/T218308 (Dzahn) So.. the value for $csp is set to the $altdom / $alt_host setting which is in Hiera as `hiera('phabricator_altdomain', 'phab.wm...
[03:23:30] Operations, netops: fastnetmon fired for routine text-lb.esams traffic - https://phabricator.wikimedia.org/T241281 (ayounsi) This is most likely a false positive triggered on the ICMP: threshold_icmp_pps = 2000 Incoming icmp pps: 2124 packets per second I created https://gerrit.wikimedia.org/r/#/c/opera...
[03:33:47] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:35:33] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:50:47] RECOVERY - mediawiki originals uploads -hourly- for eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/dashboard/file/swift?panelId=9&fullscreen&orgId=1&var-DC=eqiad
[03:50:57] RECOVERY - mediawiki originals uploads -hourly- for codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/dashboard/file/swift?panelId=9&fullscreen&orgId=1&var-DC=codfw
[03:51:31] (PS2) Dzahn: admins: add ifried to ldap_only_admins, wmf group [puppet] - https://gerrit.wikimedia.org/r/559704 (https://phabricator.wikimedia.org/T240988)
[03:57:20] (CR) Dzahn: [C: +2] admins: add ifried to ldap_only_admins, wmf group [puppet] - https://gerrit.wikimedia.org/r/559704 (https://phabricator.wikimedia.org/T240988) (owner: Dzahn)
[04:01:09] !log LDAP - added ifried to wmf (T240988) for Turnilo / Superset access
[04:01:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:01:15] T240988: Request for Superset & Turnilo Access - https://phabricator.wikimedia.org/T240988
[04:01:44] Operations, LDAP-Access-Requests, Patch-For-Review: Request for Superset & Turnilo Access - https://phabricator.wikimedia.org/T240988 (Dzahn) Open→Resolved a:Dzahn @ifried This is done. Your access should work now.
[04:18:31] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:20:19] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:50:56] Do we keep web server access log?
[04:51:16] (Well, I know holiday is already on the way but if someone is lazy enough to respond :P)
[04:53:05] (specifically for authenticated access to OTRS, ticket.wikimedia.org, by chance?)
[04:59:27] there are _some_ logs but that doesn't mean i can tell you who logged in at what time. could you please ask via Phabricator with background what you really need?
[05:03:35] about to log off.. PM me if an emergency.. otherwise please make a ticket
[06:22:23] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[06:24:11] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[06:45:07] RECOVERY - Check whether microcode mitigations for CPU vulnerabilities are applied on netflow3001 is OK: OK - All expected CPU flags found https://wikitech.wikimedia.org/wiki/Microcode
[09:08:57] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[09:10:47] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:32:27] (CR) Amire80: [C: +1] "Looks OK, although this reminds me of https://phabricator.wikimedia.org/T214139 ." [mediawiki-config] - https://gerrit.wikimedia.org/r/559965 (https://phabricator.wikimedia.org/T241288) (owner: Vogone)
[10:38:27] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:40:13] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[11:03:29] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[11:05:19] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[11:26:51] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[11:30:27] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[11:37:10] mutante: (not emergency) well WMF legal knows the detail anyway, I was just wondering if the log is preserved or not :-P
[13:34:11] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:35:59] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:39:16] !log ladsgroup@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/rebuildPropertyTerms.php --wiki=testwikidatawiki --sleep 2 --batch-size=50 (T241209)
[13:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:39:23] T241209: Full rebuild of new wikidata terms tables on test after T237984 - https://phabricator.wikimedia.org/T241209
[13:53:53] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:55:41] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:15:39] (CR) CDanis: [C: +2] FNM: Bump the ICMP threshold to 4000 [puppet] - https://gerrit.wikimedia.org/r/560010 (owner: Ayounsi)
[14:18:49] Operations, netops: fastnetmon fired for routine text-lb.esams traffic - https://phabricator.wikimedia.org/T241281 (CDanis) Thanks! I had been confused by `Attack protocol: tcp` in the reports. It fired several more times and every time it was for >2000 ICMP pps, so I merged this.
[14:40:17] Operations, netops: fastnetmon fired for routine text-lb.esams traffic - https://phabricator.wikimedia.org/T241281 (CDanis) a:ayounsi→CDanis I'll keep an eye on this and close if there's no other noise.
[15:39:11] Operations, Machine vision, Product-Infrastructure-Team-Backlog, Structured-Data-Backlog, and 5 others: Some jobs are not being processed / are processed slowly - https://phabricator.wikimedia.org/T240518 (Aklapper)
[15:42:11] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:02:47] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:04:35] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:07:11] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:13:31] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:15:17] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:25:51] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 59 probes of 510 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[16:31:37] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 38 probes of 510 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[17:03:31] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[17:05:19] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[17:18:51] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[17:32:09] (CR) ArielGlenn: [C: +1] "This setup should be fine for francium's replacement." [puppet] - https://gerrit.wikimedia.org/r/559550 (https://phabricator.wikimedia.org/T156955) (owner: Filippo Giunchedi)
[19:02:17] (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/559576 (https://phabricator.wikimedia.org/T241163) (owner: Ammarpad)
[19:08:13] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[19:08:45] Operations, ops-eqiad, DBA: Degraded RAID on db1123 - https://phabricator.wikimedia.org/T240534 (Marostegui) Can you coordinate with @jcrespo to get it replaced next week? His nick in IRC is jynus, in case via IRC is easier. Thanks!
[19:40:49] (PS4) Vogone: Add basic transwiki sources for ltwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/559965 (https://phabricator.wikimedia.org/T241288)
[19:45:07] (PS5) Urbanecm: Add basic transwiki sources for ltwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/559965 (https://phabricator.wikimedia.org/T241288) (owner: Vogone)
[19:45:19] (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/559965 (https://phabricator.wikimedia.org/T241288) (owner: Vogone)
[19:46:13] (CR) jerkins-bot: [V: -1] Add basic transwiki sources for ltwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/559965 (https://phabricator.wikimedia.org/T241288) (owner: Vogone)
[19:46:26] (CR) Urbanecm: [C: +1] "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/559965 (https://phabricator.wikimedia.org/T241288) (owner: Vogone)
[19:49:36] (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/559612 (https://phabricator.wikimedia.org/T240845) (owner: NicholasG04)
[20:00:53] PROBLEM - Host cp3051 is DOWN: PING CRITICAL - Packet loss = 100%
[20:07:05] PROBLEM - Host cp3055 is DOWN: PING CRITICAL - Packet loss = 100%
[22:23:03] (CR) BryanDavis: [C: +1] toolforge: Move package 'fish' from shell_environ to exec_environ [puppet] - https://gerrit.wikimedia.org/r/560008 (https://phabricator.wikimedia.org/T241290) (owner: Zhuyifei1999)
[22:27:48] (CR) BryanDavis: [C: +1] "PCC changes: https://puppet-compiler.wmflabs.org/compiler1003/20110/" [puppet] - https://gerrit.wikimedia.org/r/560008 (https://phabricator.wikimedia.org/T241290) (owner: Zhuyifei1999)
[22:39:57] * volans looking ^^^
[22:44:53] Operations, Traffic: cp3051 crashed - https://phabricator.wikimedia.org/T241306 (Volans)
[22:45:08] !log powercycle cp3051 - T241306
[22:45:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:45:17] T241306: cp3051 crashed - https://phabricator.wikimedia.org/T241306
[22:48:59] RECOVERY - Host cp3051 is UP: PING OK - Packet loss = 0%, RTA = 83.42 ms
[23:04:31] Operations, Traffic: cp3051 crashed - https://phabricator.wikimedia.org/T241306 (Volans) Nothing in `racadm`, checked both `getsel` and `lclog view`. Nothing in syslog & co. FYI in `dmesg` during the end of the boot process it logged a bunch of `kvm: disabled by bios`.
[23:12:09] Operations, Traffic: cp3055 crashed - https://phabricator.wikimedia.org/T240425 (Volans) The host crashed again today, nothing in `racadm`, checked both `getsel` and `lclog view`.
[23:12:13] !log powercycle cp3055 - T240425
[23:12:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:12:19] T240425: cp3055 crashed - https://phabricator.wikimedia.org/T240425
[23:15:33] RECOVERY - Host cp3055 is UP: PING OK - Packet loss = 0%, RTA = 83.43 ms
[23:24:51] Operations, Traffic: cp3055 crashed - https://phabricator.wikimedia.org/T240425 (Volans) Nothing on the host logs either. For the record it crashed 7 minutes after cp3051 (see T241306) and both are part of the upload esams cluster. Like cp3051 this one too logged a bunch of `kvm: disabled by bios` durin...
[23:27:03] Operations, Traffic: cp3051 crashed - https://phabricator.wikimedia.org/T241306 (Volans) p:Triage→Normal
[23:27:38] Operations, Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (Volans)
[23:40:11] !log Testing Mastodon logging of SAL messages (T52109)
[23:40:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:40:20] T52109: Add Mastodon support to stashbot - https://phabricator.wikimedia.org/T52109